Opened 7 years ago

Closed 4 years ago

#7708 closed defect (fixed)

connection_handle_write_impl mishandles TOR_TLS_WANT_WRITE

Reported by: nickm Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay bwbug 023-backport
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Reported pseudonymously:

It appears that connection_handle_write_impl exits quickly when it hears TOR_TLS_WANT_WRITE from flush_from_buf_tls. But flush_from_buf_tls can return that error code even when it has flushed some data. That means that the other code in that function -- like decrementing buckets and calling flushed_some functions -- might not get called at all.

I'm attaching the suggested patch; I think the patch could actually be even simpler.

This could be a backport candidate for 0.2.3.

Child Tickets

Attachments (2)

handle_write.patch (1.2 KB) - added by nickm 7 years ago.
8bdcc8a3.txt (857 bytes) - added by nickm 7 years ago.

Download all attachments as: .zip

Change History (24)

Changed 7 years ago by nickm

Attachment: handle_write.patch added

comment:1 Changed 7 years ago by nickm

Status: newneeds_review

Whoa, I think this was introduced back in ef2409e4e.

I've done a shorter fix as branch "bug7708_023" ; I think it should get tested in 0.2.4 and backported to 0.2.3 if it seems okay in 0.2.4 after a while.

I'd also like to refactor the fetch_from_buf_tls and read_to_buf_tls functions at long last: Their interface is begging for this kind of breakage.

comment:2 Changed 7 years ago by nickm

See bug7708_023_v2. Checking n_written is incorrect; n_written can be true if bytes were written on the underlying transport but nothing was flushed from the buffer.

comment:3 Changed 7 years ago by nickm

See bug7708_023_v3. We might as well use the original patch rather than just doing something equivalent. I've thrown in a note for an antipattern we should fix in 0.2.4, but we don't do code cleanups in 0.2.3.

Changed 7 years ago by nickm

Attachment: 8bdcc8a3.txt added

comment:4 Changed 7 years ago by nickm

The author suggests another patch on top of this one, which I've attached as 8bdcc8a3.txt .

comment:5 Changed 7 years ago by nickm

I need to think of some way to test this, and look for more cases where we should be writing but aren't.

Maybe there should be a connection_should_be_writing/connection_should_be_reading function pair that we use to check whether a connection should be reading/writing, and if it isn't, we log a bug. That could be hard to get right. But maybe it would be a worthwhile thing. Otherwise, I worry that we could have more cases like this.

I wonder if I can provoke this on a testing network on purpose by setting SO_SNDBUF and/or SO_RCVBUF very low.

comment:6 Changed 7 years ago by nickm

Keywords: bwbug added

comment:7 Changed 6 years ago by cypherpunks

Resolution: not a bug
Status: needs_reviewclosed

comment:8 Changed 6 years ago by aagbsn

Resolution: not a bug
Status: closedreopened

comment:9 Changed 6 years ago by aagbsn

Status: reopenedneeds_review

comment:10 Changed 6 years ago by nickm

Keywords: 023-backport added

comment:11 Changed 6 years ago by nickm

See bug7708_023_v3 again. It now has the extra patch above, plus a comment cleanup. Andrea and I like it; she's going to test it for a little while to see if it explodes.

The branch to test is bug7708_merged_to_master, which has today's master plus a NOT-SQUASHED bug7708_023_v3 on it.

comment:12 Changed 6 years ago by andrea

So far testing, I'm seeing messages like this in the log:

Jan 31 00:42:43.000 [notice] No circuits are opened. Relaxed timeout for a circuit with channel state open to 66869ms. However, it appears the circuit has timed out anyway. 3 guards are live. [562 similar message(s) suppressed in last 3600 seconds]

This happens about every two hours on average; I'll have a closer look at the debug log and try to figure it out tomorrow, and test master and see if this is caused by this patch or not.

comment:13 Changed 6 years ago by andrea

Now that I look at it, not sure it's making it into the consensus properly. I'll investigate more tomorrow.

comment:14 Changed 6 years ago by andrea

Oh, okay, it's in there:

https://atlas.torproject.org/#details/022C96473552936BB628D5E3EBBC6C81031E2A5E

Atlas only finds it if I search by fingerprint, not by name, though and it has the 'unnamed' flag. Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?

comment:15 Changed 6 years ago by andrea

After 36 hours, the relay seems to function, but those log messages continue roughly once an hour. I'll test with master for comparison.

comment:16 Changed 6 years ago by andrea

(That was f58c81041741d7af176e49863b5c632f9cefffc7, btw - now testing with 73f85905aa9cfe6ee4f014f54d5713ab662c207a)

comment:17 Changed 6 years ago by andrea

Same log messages observed with 73f85905aa9cfe6ee4f014f54d5713ab662c207a; I'd still like to know what's causing them, but they're not an impediment to merging this patch I think.

comment:18 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: 0.2.3.x-final

It's now bug7708_023_v3_squashed. I've merged it into 0.2.4. We should consider it for a backport into 0.2.3.

comment:19 in reply to:  14 Changed 6 years ago by arma

Replying to andrea:

Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?

Yes.

comment:20 Changed 6 years ago by arma

Has Andrea decided that the new log messages she saw were unrelated?

So the bug here is that we end up writing correctly, but we undercount the number of bytes we spent (from our buckets)?

The patch seems pretty short.

On the other hand, this has been a bug for a long time, and 0.2.4 is as fine a place as any for the fix to be.

comment:21 Changed 6 years ago by andrea

I think they were unrelated. At any rate, it got merged into 0.2.4 and hasn't exploded yet, so I wouldn't worry.

comment:22 Changed 4 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final
Resolution: fixed
Status: needs_reviewclosed

Marking a batch of tickets that had been under consideration for 0.2.3 backport as fixed-in-0.2.4. There is no plan for further 0.2.3 releases.

Note: See TracTickets for help on using tickets.