Opened 3 weeks ago

Closed 2 weeks ago

#30628 closed defect (fixed)

sendme closes client connections

Reported by: traumschule Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: sendme, tor-flowctrl, 041-must, 041-regression, regression, asn-merge
Cc: Actual Points: 0.6
Parent ID: Points:
Reviewer: nickm Sponsor:

Description

Since upgrading to the nightly (0.4.1.1-alpha-dev-20190523T134010Z-1~d90.stretch), connections are regularly closed after short data transfers (even irssi via ssh):

Connection to ... closed by remote host.

[info] sendme_process_stream_level(): Unexpected stream sendme cell. Closing circ (window 500).
[info] command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.

https://gitweb.torproject.org/tor.git/tree/src/core/or/sendme.c#n468

Child Tickets

Change History (18)

comment:1 Changed 3 weeks ago by arma

Milestone: Tor: 0.4.1.x-final

I suspect the new prop#289 work.

comment:2 Changed 3 weeks ago by arma

Cc: dgoulet added

comment:3 Changed 3 weeks ago by arma

traumschule: does this happen on every stream, or just very rarely?

And, are you mostly uploading, or mostly downloading, on these streams?

comment:4 in reply to:  3 Changed 3 weeks ago by traumschule

Replying to arma:

traumschule: does this happen on every stream, or just very rarely?

And, are you mostly uploading, or mostly downloading, on these streams?

It seems to be quite reproducible both ways: typing into remote shells and receiving files with rsync:

$ torsocks rsync -aP server:some.tar.xz .
some.tar.xz
      4,456,448  10%  507.68kB/s    0:01:13  Connection to ... closed by remote host.

rsync: connection unexpectedly closed (4587634 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [receiver=3.1.3]
rsync: connection unexpectedly closed (51 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [generator=3.1.3]

comment:5 Changed 3 weeks ago by dgoulet

Since prop289 was merged, I've been seeing that error but haven't been able to figure it out yet. It happens rarely in my setup. I've tried many things to reproduce it, but without success. I'm waiting for it to happen again so I can catch the debug logs.

The prop289 code was NOT supposed to change the stream-level SENDME behavior, but a refactoring of the entire code did occur, so it's very possible something got changed there... I went over the code the other day and couldn't find anything obvious that changed.

If anyone can reproduce this reliably, could you please snatch the full client debug logs?

comment:6 Changed 3 weeks ago by dgoulet

Cc: dgoulet removed
Keywords: sendme tor-flowctrl added

comment:7 Changed 3 weeks ago by nickm

Does this also happen with 0.4.1.1-alpha? I would imagine so, but it would be good to make sure.

comment:8 Changed 3 weeks ago by nickm

Keywords: 041-must 041-regression regression added

comment:9 Changed 3 weeks ago by nickm

Owner: set to dgoulet
Status: new → assigned

Tentatively assigning to dgoulet. Anything that can be used to reproduce this would be great.

comment:10 Changed 3 weeks ago by adYpc

I can confirm the issue for 0.4.1.1-alpha; connections are terminated by EOF. I'm also getting these messages:

sendme_process_stream_level(): Unexpected stream sendme cell. Closing circ (window 500).
command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.

comment:11 Changed 3 weeks ago by adYpc

Here are some more lines from the info-level logging output:

rend_service_rendezvous_has_opened(): Done building circuit 3070116516 to rendezvous with cookie 2B7FE4B3 for service my6v3xxi2vmukctc
internal (high-uptime) circ (length 4): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $BD2CAAF6CA982D8B0BF73C7EA3E243F30AE9F042(open) $49C343884F3DC2E85CB3F847902A8018124371EB(open) $2F09BD6D9A2D5A7D6D26C17651CE8ECF0B7BA257(open)
rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2617 more seconds.
circuit_predict_and_launch_new(): Have 4 clean circs (2 internal), need another internal circ for my hidden service.
origin_circuit_new(): Circuit 23 chose an idle timeout of 2729 based on 2617 seconds of predictive building remaining.
extend_info_from_node(): Including Ed25519 ID for $7524D5503DB794914A917F57B91555E62E034E87~Unnamed at 176.9.92.184
select_primary_guard_for_circuit(): Selected primary guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B) for circuit.
extend_info_from_node(): Including Ed25519 ID for $E86B22596DC9439B6B8DDE9599F23A02E206383B~dvbhorst at 195.201.138.156
extend_info_from_node(): Including Ed25519 ID for $66CA5474346F35E375C4D4514C51A540545347EE~ToolspireRelay at 138.201.83.171
circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$E86B22596DC9439B6B8DDE9599F23A02E206383B~dvbhorst at 195.201.138.156'
connection_edge_finished_connecting(): Exit connection to [scrubbed]:7805 ([scrubbed]) established.
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(closed) $7524D5503DB794914A917F57B91555E62E034E87(closed)
circuit_send_intermediate_onion_skin(): Sending extend relay cell.
entry_guard_inc_circ_attempt_count(): Got success count 180.000000/182.000000 for guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(open) $7524D5503DB794914A917F57B91555E62E034E87(closed)
circuit_send_intermediate_onion_skin(): Sending extend relay cell.
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(open) $7524D5503DB794914A917F57B91555E62E034E87(open)
circuit_build_times_get_xm(): Xm mode #0: 475 36
circuit_build_times_set_timeout(): Set buildtimeout to low value 935.337593ms. Setting to 1500ms
circuit_build_times_set_timeout(): Set circuit build timeout to 2s (1500.000000ms, 60000.000000ms, Xm: 475, a: 2.375229, r: 0.110092) based on 327 circuit times
entry_guards_note_guard_success(): Recorded success for primary confirmed guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
circuit_build_no_more_hops(): circuit built!
pathbias_count_build_success(): Got success count 181.000000/182.000000 for guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
connection_ap_handle_onion(): Got a hidden service request for ID '[scrubbed]'
rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2617 more seconds.
connection_ap_handle_onion(): Descriptor is here. Great.
connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
connection_ap_handshake_attach_circuit(): rend joined circ 4266353635 (id: 16) already here. Attaching. (stream 0 sec old)
pathbias_count_use_attempt(): Used circuit 16 is already in path state use succeeded. Circuit is a Hidden service client: Active rendezvous point currently open.
rend_client_note_connection_attempt_ended(): Connection attempt for [scrubbed] has ended; cleaning up temporary state.
link_apconn_to_circ(): Looks like completed circuit to hidden service does allow optimistic data for connection to [scrubbed]
connection_ap_handshake_send_begin(): Sending relay cell 0 on circ 4266353635 to begin stream 5625.
connection_ap_handshake_send_begin(): Address/port sent, ap socket 12, n_circ_id 4266353635
connection_edge_process_relay_cell_not_open(): 'connected' received for circid 4266353635 streamid 5625 after 0 seconds.
internal (high-uptime) circ (length 3): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $5AD4E3F36B803500AE5A724ACD4A3BCCE3BEFA14(open) $58FC2AAB3792AC37897D34331F4F4E00341DEC0C(open)
sendme_process_stream_level(): Unexpected stream sendme cell. Closing circ (window 500).
command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.
circuit_mark_for_close_(): Circuit 3070116516 (id: 20) marked for close at src/core/or/command.c:582 (orig reason: 1, new reason: 0)
connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
circuit_free_(): Circuit 0 (id: 20) has been freed.
connection_edge_reached_eof(): conn (fd 12) reached eof. Closing.
connection_edge_reached_eof(): conn (fd 10) reached eof. Closing.

comment:12 Changed 2 weeks ago by pege

I'm observing the same issue running the current nightly builds from the master branch. I've only observed the issue when using onion services, though. I'm not yet sure how to reliably reproduce it. The connection sometimes works for hours, and then it keeps breaking down 50 times in a row after just a few seconds.

Some details about my setup and some additional logs:

Setup, onion service:

  • version: nightly build, master branch
  • onion service details: v3, client authorization enabled
  • log: onion_service.log
  • disconnect time: 10:34:23

Setup, client side:

  • version: 0.4.0.4-rc
  • log: client.log
  • disconnect time: ~08:34:20 (different timezone, clock slightly skewed)

comment:13 Changed 2 weeks ago by pege

I did some tests; the issue is easiest to reproduce by receiving bulk data. Since there is an SSH server behind the onion service I used for testing, I went with ssh -vvv eve.arbitrary.onion 'cat /dev/urandom' >/dev/null.

Here is some of the SSH debug output for some of the test runs:

Transferred: sent 2532, received 478420 bytes, in 1.2 seconds
Bytes per second: sent 2077.3, received 392499.2

Transferred: sent 2472, received 150180 bytes, in 3.4 seconds
Bytes per second: sent 718.6, received 43659.6

Transferred: sent 2452, received 51708 bytes, in 1.1 seconds
Bytes per second: sent 2239.1, received 47219.2

Transferred: sent 2532, received 478420 bytes, in 1.7 seconds
Bytes per second: sent 1453.2, received 274589.0

Transferred: sent 2492, received 314300 bytes, in 0.9 seconds
Bytes per second: sent 2871.7, received 362185.6

Transferred: sent 2492, received 314300 bytes, in 2.5 seconds
Bytes per second: sent 1011.3, received 127543.9

Transferred: sent 2532, received 462008 bytes, in 1.6 seconds
Bytes per second: sent 1551.6, received 283118.3

I did about 50 more runs; "received" never exceeded 512KiB. Could be a coincidence, though.

comment:14 Changed 2 weeks ago by dgoulet

Awesome information! I'm still unable to reproduce here :(

@pege: Unfortunately, the logs you provided are missing a piece that I really need. The SIGUSR2 was sent to turn debug logging on. On the client side I have everything, but on the service side I'm missing almost 2 minutes that are very important.

Do you think you can reproduce this with debug logs enabled from startup?

Huge thanks!

comment:15 Changed 2 weeks ago by dgoulet

Actual Points: 0.6
Reviewer: nickm
Status: assigned → needs_review

Oooooooooooook, finally found the problem and figured out WHY this would never happen with debug logs on. The problem is in this function, which is responsible for decrementing the flow control package window. Can you spot it?

int
sendme_note_stream_data_packaged(edge_connection_t *conn)
{
  tor_assert(conn);
  log_debug(LD_APP, "Stream package_window now %d.", --conn->package_window);
  return conn->package_window;
}

Now imagine yourself running a tor with debug logs turned *off*, you know, like 99.9999% of the tor daemons out there ;).

The answer to the quiz is: package_window does NOT get decremented if debug logging is off, because log_debug() only evaluates its arguments when debug logging is enabled ;).......

I get the commit hat of shame for at least a full week there ;)....

Introduced in commit 3835a3acf57426f692a787e7729de929b40dc62e.

Branch: ticket30628_041_01
PR: https://github.com/torproject/tor/pull/1071

comment:16 Changed 2 weeks ago by nickm

Keywords: asn-merge added
Status: needs_review → merge_ready

LGTM!

Did you check to make sure that the other increment/decrement functions don't have this bug?

comment:17 in reply to:  16 Changed 2 weeks ago by dgoulet

Replying to nickm:

LGTM!

Did you check to make sure that the other increment/decrement functions don't have this bug?

I did! Got scared, but sendme.c is clean of this madness.

comment:18 Changed 2 weeks ago by asn

Resolution: fixed
Status: merge_ready → closed

Good find. Merged!
