It appears that connection_handle_write_impl exits quickly when it hears TOR_TLS_WANT_WRITE from flush_from_buf_tls. But flush_from_buf_tls can return that error code even when it has flushed some data. That means that the other code in that function -- like decrementing buckets and calling flushed_some functions -- might not get called at all.
I'm attaching the suggested patch; I think the patch could actually be even simpler.
This could be a backport candidate for 0.2.3.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
Whoa, I think this was introduced back in ef2409e4e.
I've done a shorter fix as branch "bug7708_023" ; I think it should get tested in 0.2.4 and backported to 0.2.3 if it seems okay in 0.2.4 after a while.
I'd also like to refactor the fetch_from_buf_tls and read_to_buf_tls functions at long last: Their interface is begging for this kind of breakage.
See bug7708_023_v2. Checking n_written is incorrect; n_written can be true if bytes were written on the underlying transport but nothing was flushed from the buffer.
See bug7708_023_v3. We might as well use the original patch rather than just doing something equivalent. I've thrown in a note for an antipattern we should fix in 0.2.4, but we don't do code cleanups in 0.2.3.
I need to think of some way to test this, and look for more cases where we should be writing but aren't.
Maybe there should be a connection_should_be_writing/connection_should_be_reading function pair that we use to check whether a connection should be reading/writing, and if it isn't, we log a bug. That could be hard to get right. But maybe it would be a worthwhile thing. Otherwise, I worry that we could have more cases like this.
I wonder if I can provoke this on a testing network on purpose by setting SO_SNDBUF and/or SO_RCVBUF very low.
See bug7708_023_v3 again. It now has the extra patch above, plus a comment cleanup. Andrea and I like it; she's going to test it for a little while to see if it explodes.
The branch to test is bug7708_merged_to_master, which has today's master plus a NOT-SQUASHED bug7708_023_v3 on it.
So far testing, I'm seeing messages like this in the log:
Jan 31 00:42:43.000 [notice] No circuits are opened. Relaxed timeout for a circuit with channel state open to 66869ms. However, it appears the circuit has timed out anyway. 3 guards are live. [562 similar message(s) suppressed in last 3600 seconds]
This happens about every two hours on average; I'll have a closer look at the debug log and try to figure it out tomorrow, and test master and see if this is caused by this patch or not.
Atlas only finds it if I search by fingerprint, not by name, though and it has the 'unnamed' flag. Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?
Same log messages observed with 73f85905aa9cfe6ee4f014f54d5713ab662c207a; I'd still like to know what's causing them, but they're not an impediment to merging this patch I think.
Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?