connection_handle_write_impl mishandles TOR_TLS_WANT_WRITE

changed milestone to %Tor: 0.2.4.x-final

added 023-backport bwbug component::core tor/tor milestone::Tor: 0.2.4.x-final priority::high resolution::fixed status::closed tor-relay type::defect labels

Trac:
handle_write.patch

Whoa, I think this was introduced back in ef2409e4e.

I've done a shorter fix as branch "bug7708_023"; I think it should get tested in 0.2.4 and backported to 0.2.3 if it seems okay in 0.2.4 after a while.

I'd also like to refactor the fetch_from_buf_tls and read_to_buf_tls functions at long last: their interface is begging for this kind of breakage.

Trac:
Status: new to needs_review

See bug7708_023_v2. Checking n_written is incorrect; n_written can be nonzero if bytes were written on the underlying transport but nothing was flushed from the buffer.
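To illustrate the distinction drawn here, this is a minimal sketch, not Tor's actual code. The struct and function names are hypothetical. It assumes one flush attempt can report both how many bytes the TLS layer pushed to the transport and how long the outgoing buffer was before and after the attempt.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one TLS flush attempt (not a Tor type). */
typedef struct flush_result_t {
  size_t n_written_on_transport; /* bytes the TLS layer sent to the socket */
  size_t buf_len_before;         /* outgoing buffer length before the call */
  size_t buf_len_after;          /* outgoing buffer length after the call */
} flush_result_t;

/* Fragile check: any transport activity counts as progress, even if
 * the bytes were TLS-internal and nothing left our buffer. */
bool progress_by_n_written(const flush_result_t *r)
{
  return r->n_written_on_transport > 0;
}

/* Bytes actually removed from the outgoing buffer by this attempt. */
size_t bytes_flushed(const flush_result_t *r)
{
  if (r->buf_len_after > r->buf_len_before)
    return 0; /* buffer grew; nothing was flushed */
  return r->buf_len_before - r->buf_len_after;
}

/* Sturdier check: progress means the buffer itself got shorter. */
bool progress_by_flushed(const flush_result_t *r)
{
  return bytes_flushed(r) > 0;
}
```

With a result such as `{ 37, 512, 512 }`, the first check reports progress while the second does not, which is the mismatch described above.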

See bug7708_023_v3. We might as well use the original patch rather than just doing something equivalent. I've thrown in a note for an antipattern we should fix in 0.2.4, but we don't do code cleanups in 0.2.3.

Trac:
8bdcc8a3.txt

The author suggests another patch on top of this one, which I've attached as 8bdcc8a3.txt.

I need to think of some way to test this, and look for more cases where we should be writing but aren't.

Maybe there should be a connection_should_be_writing/connection_should_be_reading function pair that we use to check whether a connection should be reading/writing, and if it isn't, we log a bug. That could be hard to get right. But maybe it would be a worthwhile thing. Otherwise, I worry that we could have more cases like this.
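A rough sketch of what the writing half of such a pair might look like. Everything here is hypothetical: the struct is a simplified stand-in rather than Tor's connection_t, and the rule for "should be writing" is an assumption that would need much more care in practice.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified connection state (not Tor's connection_t). */
typedef struct sketch_conn_t {
  size_t outbuf_len;       /* bytes queued for sending */
  bool tls_wants_write;    /* last TLS call reported "want write" */
  bool blocked_on_bw;      /* write bucket is empty, so writing is paused */
  bool write_event_active; /* currently registered for write events */
} sketch_conn_t;

/* Assumed rule: a connection should be writing if it has something to
 * send (or TLS asked to write) and it is not blocked on bandwidth. */
bool sketch_conn_should_be_writing(const sketch_conn_t *c)
{
  if (c->blocked_on_bw)
    return false;
  return c->outbuf_len > 0 || c->tls_wants_write;
}

/* Returns false in the case where a real implementation would log a
 * bug: the connection should be writing but is not set up to write. */
bool sketch_conn_write_state_ok(const sketch_conn_t *c)
{
  return !sketch_conn_should_be_writing(c) || c->write_event_active;
}
```

A connection with an empty buffer, tls_wants_write set, and no active write event would fail this check, which is the kind of stall this ticket is about.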

I wonder if I can provoke this on a testing network on purpose by setting SO_SNDBUF and/or SO_RCVBUF very low.
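For reference, shrinking the buffers on a single socket could look roughly like this with the POSIX setsockopt call. The helper names are hypothetical, and how this would be wired into a test network is left open. Note that the kernel may round or clamp the requested value (Linux, for example, doubles it and enforces a minimum), so it is worth reading the value back.

```c
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for small send and receive
 * buffers on fd. Returns 0 on success, -1 on failure. */
int shrink_socket_buffers(int fd, int bytes)
{
  if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
    return -1;
  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
    return -1;
  return 0;
}

/* Hypothetical helper: read back the send buffer size the kernel
 * actually applied. Returns the size in bytes, or -1 on failure. */
int get_send_buffer_size(int fd)
{
  int val = 0;
  socklen_t len = sizeof(val);
  if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
    return -1;
  return val;
}
```

Calling shrink_socket_buffers(fd, 1024) on a freshly created TCP socket and then get_send_buffer_size(fd) shows what the kernel settled on.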

Trac:
Keywords: tor-relay deleted, tor-relay bwbug added

Trac:
Resolution: N/A to not a bug
Status: needs_review to closed

Trac:
Resolution: not a bug to N/A
Status: closed to reopened

Trac:
Status: reopened to needs_review

Trac:
Keywords: tor-relay bwbug deleted, tor-relay bwbug 023-backport added

See bug7708_023_v3 again. It now has the extra patch above, plus a comment cleanup. Andrea and I like it; she's going to test it for a little while to see if it explodes.

The branch to test is bug7708_merged_to_master, which has today's master plus a NOT-SQUASHED bug7708_023_v3 on it.

So far in testing, I'm seeing messages like this in the log:

Jan 31 00:42:43.000 [notice] No circuits are opened. Relaxed timeout for a circuit with channel state open to 66869ms. However, it appears the circuit has timed out anyway. 3 guards are live. [562 similar message(s) suppressed in last 3600 seconds]

This happens about every two hours on average. I'll have a closer look at the debug log and try to figure it out tomorrow, and test master to see whether this patch causes it.

Now that I look at it, I'm not sure it's making it into the consensus properly. I'll investigate more tomorrow.

Oh, okay, it's in there:

https://atlas.torproject.org/#details/022C96473552936BB628D5E3EBBC6C81031E2A5E

Atlas only finds it if I search by fingerprint, not by name, though, and it has the 'unnamed' flag. Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?

After 36 hours, the relay seems to function, but those log messages continue roughly once an hour. I'll test with master for comparison.

(That was f58c81041741d7af176e49863b5c632f9cefffc7, btw - now testing with 73f85905aa9cfe6ee4f014f54d5713ab662c207a)

Same log messages observed with 73f85905aa9cfe6ee4f014f54d5713ab662c207a; I'd still like to know what's causing them, but they're not an impediment to merging this patch, I think.

It's now bug7708_023_v3_squashed. I've merged it into 0.2.4. We should consider it for a backport into 0.2.3.

Trac:
Milestone: Tor: 0.2.4.x-final to Tor: 0.2.3.x-final

Replying to andrea:

Do the directories do something like that if I run a relay with the same name as one I ran before for testing but a different fingerprint, by any chance?

Yes.

Has Andrea decided that the new log messages she saw were unrelated?

So the bug here is that we end up writing correctly, but we undercount the number of bytes we spent (from our buckets)?

The patch seems pretty short.

On the other hand, this has been a bug for a long time, and 0.2.4 is as fine a place as any for the fix to be.

I think they were unrelated. At any rate, it got merged into 0.2.4 and hasn't exploded yet, so I wouldn't worry.

Marking a batch of tickets that had been under consideration for 0.2.3 backport as fixed-in-0.2.4. There is no plan for further 0.2.3 releases.

Trac:
Milestone: Tor: 0.2.3.x-final to Tor: 0.2.4.x-final
Resolution: N/A to fixed
Status: needs_review to closed

closed

moved to tpo/core/tor#7708 (closed)
