Protocol warning: Expiring stuck OR connection to fd...

In theory this is only a protocol warning, so it shouldn't be too problematic, but I think it is worth looking at. I've been seeing many of these on a test relay of mine (capped at 200KB/s) using the KIST scheduler (relay address and port redacted):

Expiring stuck OR connection to fd 380 (IP:PORT). (3747888 bytes to flush; 3000 seconds since last write)

This is pretty big: 3.7MB stuck in the outbuf of a connection. The 3000 seconds since last write means that connection_handle_write_impl() hasn't been called in all that time, which is very surprising in the first place.
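To put those two numbers in perspective, here is a small sketch that pulls them out of the log line quoted above and converts them. The regular expression only matches the message format as quoted in this ticket, and treating "200KB/s" as 200,000 bytes per second is an assumption.

```python
import re

# The log line exactly as quoted in this ticket (address redacted).
LOG = ("Expiring stuck OR connection to fd 380 (IP:PORT). "
       "(3747888 bytes to flush; 3000 seconds since last write)")

# Matches only the trailing "(N bytes to flush; M seconds since last write)" part.
PATTERN = re.compile(r"\((\d+) bytes to flush; (\d+) seconds since last write\)")


def parse_stuck(line):
    """Return (pending_bytes, idle_seconds), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


pending_bytes, idle_seconds = parse_stuck(LOG)

megabytes = pending_bytes / 1_000_000   # decimal megabytes
idle_minutes = idle_seconds / 60

# Assumption: the 200KB/s cap means 200,000 bytes per second.
CAP_BYTES_PER_SEC = 200_000
drain_seconds = pending_bytes / CAP_BYTES_PER_SEC

print(f"{megabytes:.2f} MB pending, idle for {idle_minutes:.0f} minutes")
print(f"about {drain_seconds:.1f} s of traffic at the relay's cap")
```

In other words, the outbuf holds well under a minute's worth of traffic at the relay's cap, yet it sat untouched for 50 minutes.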

There are currently two ways for the handle write function to be called: either through the libevent write_event, which fires every time the socket is ready to write (think of this as POLLOUT from poll()), or directly from the KIST scheduler when cells are put in the outbuf.
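As a rough illustration of what "ready to write" means at the poll() level, the sketch below uses a local socket pair rather than a real OR connection (that substitution is an assumption, as is the helper name `is_writable`). It only shows the general behaviour: POLLOUT is reported while the kernel send buffer has room, and stops being reported once the buffer is full and the peer is not reading. It says nothing about what actually happened on the relay.

```python
import select
import socket


def is_writable(sock, timeout_ms=0):
    """Return True if poll() reports POLLOUT for sock within timeout_ms."""
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    events = poller.poll(timeout_ms)
    return any(flags & select.POLLOUT for _fd, flags in events)


# A connected pair of local sockets stands in for the relay-to-peer link.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Fresh socket: the send buffer is empty, so it is ready to write.
writable_when_empty = is_writable(a)

# Fill the send buffer while the peer never reads.
sent = 0
chunk = b"\0" * 65536
try:
    while True:
        sent += a.send(chunk)
except BlockingIOError:
    pass  # the kernel refuses more data for now

# No room left: poll() no longer reports POLLOUT.
writable_when_full = is_writable(a)

# The peer finally reads everything that was queued.
drained = 0
try:
    while True:
        data = b.recv(65536)
        if not data:
            break
        drained += len(data)
except BlockingIOError:
    pass  # nothing left to read

# Room again: POLLOUT is reported once more.
writable_after_drain = is_writable(a)

a.close()
b.close()
```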

This is worrying because it means that KIST did in fact put 3.7MB of cells on the outbuf, thinking the socket's TCP buffer could take that data, but somehow none of it got written to the socket.

One possibility is that KIST flushed cells onto the connection and then tried to write them to the network; that didn't work, the TCP information of the socket was still intact, and because KIST doesn't check for errors (#24449 (moved)), nothing happened. Then, somehow, after those 3.7MB were put in the outbuf, the channel was never scheduled again for a write, because KIST had no idea that anything was left in the outbuf from the previous flush to the network.
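The hypothesis above can be sketched as a toy model. This is not Tor's code: `ToyConn`, `ToyScheduler`, and every method name here are hypothetical, and the "check the result and reschedule" variant is only an assumption about what error checking might change, not a description of any actual fix.

```python
from dataclasses import dataclass


@dataclass
class ToyConn:
    """Hypothetical stand-in for a connection with an outbuf."""
    outbuf: int = 0                      # bytes waiting to be written
    socket_accepts_writes: bool = True   # whether a write would succeed
    written: int = 0                     # bytes that reached the "network"

    def try_write(self) -> bool:
        """Stand-in for the write handler; returns False if nothing was written."""
        if not self.socket_accepts_writes:
            return False
        self.written += self.outbuf
        self.outbuf = 0
        return True


@dataclass
class ToyScheduler:
    """Hypothetical scheduler that may or may not look at write results."""
    check_write_errors: bool
    pending: bool = False   # will the scheduler visit this connection again?

    def flush(self, conn: ToyConn, nbytes: int) -> None:
        conn.outbuf += nbytes
        ok = conn.try_write()
        if self.check_write_errors and not ok:
            self.pending = True
        # Otherwise the result is dropped and the scheduler assumes
        # the outbuf was drained.

    def run_pending(self, conn: ToyConn) -> None:
        if self.pending and conn.try_write():
            self.pending = False


def simulate(check_write_errors: bool) -> int:
    """Return how many bytes remain stuck in the outbuf."""
    conn = ToyConn(socket_accepts_writes=False)
    sched = ToyScheduler(check_write_errors)
    sched.flush(conn, 3_747_888)       # the write attempt fails
    conn.socket_accepts_writes = True  # socket recovers, no write event fires
    sched.run_pending(conn)            # only helps if the failure was noticed
    return conn.outbuf


print("ignoring errors:", simulate(False), "bytes stuck")
print("checking errors:", simulate(True), "bytes stuck")
```

In this model, when the write result is ignored, the only thing left that could drain the outbuf is an independent write event.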

So it then comes down to the write_event to write the cells flushed by KIST. Without a POLLOUT event on the socket, nothing will happen, so my question is: how could this event not fire for 50 minutes? I kind of feel that the TCP timeout would have kicked in by then if there was really a problem...? But also, that is a long time for an idle connection?
