Opened 11 months ago

Last modified 8 months ago

#24594 new defect

Protocol warning: Expiring stuck OR connection to fd...

Reported by: dgoulet Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-sched, libevent, tor-connection, 034-triage-20180328, 034-removed-20180328
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

So in theory, this is at protocol warning so shouldn't too problematic but I think this worth looking at it. I've been seeing many of these on a test relay I have (capped at 200KB/s) using KIST scheduler: (redacting the relay addr/port):

Expiring stuck OR connection to fd 380 (IP:PORT). (3747888 bytes to flush; 3000 seconds since last write)

This is pretty big, 3.7MB stuck in the outbuf of a connection. The 3000 seconds since last write means that connection_handle_write_impl() hasn't been called which is *very* surprising in the first place.

There are currently two ways for the handle write function to be called, either through the libevent write_event which is fired everytime the socket is *ready* to write (see this as POLLLOUT from poll()). Or, it is directly called from KIST scheduler when cells are put in the outbuf.

This is worrying because it means that KIST did in fact put 3.7MB of cells on the outbuf thinking the socket had its TCP buffer stable enough to put that data in but somehow none got written on the socket.

On possibility is that KIST flushed cells on the connection then tried to write it to the network, that didn't work, the TCP information of the socket is still intact and because KIST doesn't check for errors (#24449), nothing happened. Then, somehow, after those 3.7MB were put in the outbuf, the channel was never scheduled again for a write because KIST had no idea that anything was left in the outbuf from previous flush on the network.

So then it comes down to the write_event to write those cells flushed by KIST. Without having a POLLOUT event on the socket, nothing will happen so the question I have is how can this event was never fired up for 50 minutes? I kind of feel that the TCP timeout would have kicked in by then if there was really a problem... ? But also, that is a _long_ time for an idle connection?

Child Tickets

Change History (5)

comment:1 Changed 11 months ago by dgoulet

This is at 99% chances that it is related to #24665 which is KIST shoving too many bytes in the connection's outbuf.

Still, I think we could drastically bring this timeout down, 50 minutes is too high in my opinion.

comment:2 Changed 10 months ago by dgoulet

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

Moving a bunch of tickets from 033 to 034.

comment:3 Changed 8 months ago by nickm

Keywords: 034-triage-20180328 added

comment:4 Changed 8 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:5 Changed 8 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

Note: See TracTickets for help on using tickets.