Since upgrading to the nightly (0.4.1.1-alpha-dev-20190523T134010Z-1~d90.stretch), connections are closed regularly after short data transfers (even irssi via ssh):
traumschule: does this happen on every stream, or just very rarely?
And, are you mostly uploading, or mostly downloading, on these streams?
It seems to be quite reproducible both ways: typing into remote shells and receiving files with rsync:
$ torsocks rsync -aP server:some.tar.xz .
some.tar.xz
      4,456,448  10%  507.68kB/s    0:01:13
Connection to ... closed by remote host.
rsync: connection unexpectedly closed (4587634 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [receiver=3.1.3]
rsync: connection unexpectedly closed (51 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [generator=3.1.3]
Since prop289 was merged, I've started seeing that error but haven't been able to figure it out yet. It happens rarely in my setup. I've tried many things to reproduce it but haven't been able to. I'm waiting for it to happen again so I can catch the debug logs.
The prop289 code was NOT supposed to change the stream-level SENDME behavior, but a refactoring of the entire code did occur, so it's very possible something got changed there... I went over the code the other day and couldn't find anything obvious that changed.
If anyone can reproduce this reliably, please grab the full client debug logs.
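For context, here is a rough sketch of the stream-level SENDME accounting the refactor was supposed to preserve, following the flow-control description in tor-spec: each edge keeps a package window (cells it may still send) and a deliver window (cells it may still receive); a stream SENDME is sent for every 50 delivered cells, and every received stream SENDME bumps the package window by 50. The struct and function names below are illustrative, not Tor's actual ones:

/* Illustrative sketch of per-stream flow-control windows; not the actual
 * Tor data structures. */
typedef struct stream_flow_ctrl_t {
  int package_window;  /* starts at 500: DATA cells we may still send */
  int deliver_window;  /* starts at 500: DATA cells we may still receive */
} stream_flow_ctrl_t;

/* Called once for every DATA cell we package on this stream. */
static void
note_stream_data_packaged(stream_flow_ctrl_t *fc)
{
  fc->package_window--;  /* once this hits 0 we must stop sending */
}

/* Called once for every DATA cell delivered to the application.
 * Returns 1 when we owe the other edge a stream-level SENDME. */
static int
note_stream_data_delivered(stream_flow_ctrl_t *fc)
{
  fc->deliver_window--;
  if (fc->deliver_window <= 500 - 50) {  /* window dropped by one increment */
    fc->deliver_window += 50;
    return 1;  /* send a RELAY_SENDME cell for this stream */
  }
  return 0;
}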
Here are some more lines from the info-level logging output:
rend_service_rendezvous_has_opened(): Done building circuit 3070116516 to rendezvous with cookie 2B7FE4B3 for service my6v3xxi2vmukctc
internal (high-uptime) circ (length 4): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $BD2CAAF6CA982D8B0BF73C7EA3E243F30AE9F042(open) $49C343884F3DC2E85CB3F847902A8018124371EB(open) $2F09BD6D9A2D5A7D6D26C17651CE8ECF0B7BA257(open)
rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2617 more seconds.
circuit_predict_and_launch_new(): Have 4 clean circs (2 internal), need another internal circ for my hidden service.
origin_circuit_new(): Circuit 23 chose an idle timeout of 2729 based on 2617 seconds of predictive building remaining.
extend_info_from_node(): Including Ed25519 ID for $7524D5503DB794914A917F57B91555E62E034E87~Unnamed at 176.9.92.184
select_primary_guard_for_circuit(): Selected primary guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B) for circuit.
extend_info_from_node(): Including Ed25519 ID for $E86B22596DC9439B6B8DDE9599F23A02E206383B~dvbhorst at 195.201.138.156
extend_info_from_node(): Including Ed25519 ID for $66CA5474346F35E375C4D4514C51A540545347EE~ToolspireRelay at 138.201.83.171
circuit_send_first_onion_skin(): First hop: finished sending CREATE cell to '$E86B22596DC9439B6B8DDE9599F23A02E206383B~dvbhorst at 195.201.138.156'
connection_edge_finished_connecting(): Exit connection to [scrubbed]:7805 ([scrubbed]) established.
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(closed) $7524D5503DB794914A917F57B91555E62E034E87(closed)
circuit_send_intermediate_onion_skin(): Sending extend relay cell.
entry_guard_inc_circ_attempt_count(): Got success count 180.000000/182.000000 for guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(open) $7524D5503DB794914A917F57B91555E62E034E87(closed)
circuit_send_intermediate_onion_skin(): Sending extend relay cell.
circuit_finish_handshake(): Finished building circuit hop:
internal (high-uptime) circ (length 3, last hop Unnamed): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $66CA5474346F35E375C4D4514C51A540545347EE(open) $7524D5503DB794914A917F57B91555E62E034E87(open)
circuit_build_times_get_xm(): Xm mode #0: 475 36
circuit_build_times_set_timeout(): Set buildtimeout to low value 935.337593ms. Setting to 1500ms
circuit_build_times_set_timeout(): Set circuit build timeout to 2s (1500.000000ms, 60000.000000ms, Xm: 475, a: 2.375229, r: 0.110092) based on 327 circuit times
entry_guards_note_guard_success(): Recorded success for primary confirmed guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
circuit_build_no_more_hops(): circuit built!
pathbias_count_build_success(): Got success count 181.000000/182.000000 for guard dvbhorst ($E86B22596DC9439B6B8DDE9599F23A02E206383B)
connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
connection_ap_handle_onion(): Got a hidden service request for ID '[scrubbed]'
rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2617 more seconds.
connection_ap_handle_onion(): Descriptor is here. Great.
connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
connection_ap_handshake_attach_circuit(): rend joined circ 4266353635 (id: 16) already here. Attaching. (stream 0 sec old)
pathbias_count_use_attempt(): Used circuit 16 is already in path state use succeeded. Circuit is a Hidden service client: Active rendezvous point currently open.
rend_client_note_connection_attempt_ended(): Connection attempt for [scrubbed] has ended; cleaning up temporary state.
link_apconn_to_circ(): Looks like completed circuit to hidden service does allow optimistic data for connection to [scrubbed]
connection_ap_handshake_send_begin(): Sending relay cell 0 on circ 4266353635 to begin stream 5625.
connection_ap_handshake_send_begin(): Address/port sent, ap socket 12, n_circ_id 4266353635
connection_edge_process_relay_cell_not_open(): 'connected' received for circid 4266353635 streamid 5625 after 0 seconds.
internal (high-uptime) circ (length 3): $E86B22596DC9439B6B8DDE9599F23A02E206383B(open) $5AD4E3F36B803500AE5A724ACD4A3BCCE3BEFA14(open) $58FC2AAB3792AC37897D34331F4F4E00341DEC0C(open)
sendme_process_stream_level(): Unexpected stream sendme cell. Closing circ (window 500).
command_process_relay_cell(): circuit_receive_relay_cell (backward) failed. Closing.
circuit_mark_for_close_(): Circuit 3070116516 (id: 20) marked for close at src/core/or/command.c:582 (orig reason: 1, new reason: 0)
connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
circuit_free_(): Circuit 0 (id: 20) has been freed.
connection_edge_reached_eof(): conn (fd 12) reached eof. Closing.
connection_edge_reached_eof(): conn (fd 10) reached eof. Closing.
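The decisive line is the sendme_process_stream_level() failure. Here is a hedged sketch of the kind of check that produces it, assuming Tor's long-standing stream window constants (STREAMWINDOW_START = 500, STREAMWINDOW_INCREMENT = 50); apart from those constants and the log line it explains, the names below are illustrative:

#define STREAMWINDOW_START     500  /* initial stream package window, in cells */
#define STREAMWINDOW_INCREMENT  50  /* cells acknowledged by one stream SENDME */

/* A stream SENDME is only legitimate after the sender has packaged at least
 * STREAMWINDOW_INCREMENT cells, i.e. after the package window has dropped
 * below its start value.  If the window is still at 500, the SENDME is a
 * protocol violation and the circuit is torn down, which matches the
 * "Unexpected stream sendme cell. Closing circ (window 500)." line above. */
static int
process_stream_sendme(int *package_window)
{
  if (*package_window + STREAMWINDOW_INCREMENT > STREAMWINDOW_START) {
    return -1;  /* unexpected SENDME: caller marks the circuit for close */
  }
  *package_window += STREAMWINDOW_INCREMENT;
  return 0;
}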
I'm observing the same issue running the current nightly builds from the master branch. I've only observed the issue when using onion services though. I'm not yet sure how to reproduce it reliably. The connection sometimes works for hours, and then it keeps breaking down 50 times in a row after just a few seconds.
Some details about my setup and some additional logs:
Setup, onion service:
version: nightly build, master branch
onion service details: v3, client authorization enabled
I did some tests; the issue is easiest to reproduce by receiving bulk data. Since there is an SSH server behind the onion service I used for testing, I went with ssh -vvv eve.arbitrary.onion 'cat /dev/urandom' >/dev/null.
Here is some of the SSH debug output for some of the test runs:
Transferred: sent 2532, received 478420 bytes, in 1.2 seconds
Bytes per second: sent 2077.3, received 392499.2
Transferred: sent 2472, received 150180 bytes, in 3.4 seconds
Bytes per second: sent 718.6, received 43659.6
Transferred: sent 2452, received 51708 bytes, in 1.1 seconds
Bytes per second: sent 2239.1, received 47219.2
Transferred: sent 2532, received 478420 bytes, in 1.7 seconds
Bytes per second: sent 1453.2, received 274589.0
Transferred: sent 2492, received 314300 bytes, in 0.9 seconds
Bytes per second: sent 2871.7, received 362185.6
Transferred: sent 2492, received 314300 bytes, in 2.5 seconds
Bytes per second: sent 1011.3, received 127543.9
Transferred: sent 2532, received 462008 bytes, in 1.6 seconds
Bytes per second: sent 1551.6, received 283118.3
I did about 50 more runs; "received" never exceeded 512 KiB. Could be a coincidence though.
Awesome information! I'm still unable to reproduce here :(
@pege: So the logs you provided unfortunately have a missing piece that I really need. The SIGUSR2 was sent to turn debug logging on. On the client side I have everything, but on the service side I'm missing almost 2 minutes that are very important.
Do you think you can reproduce with debug logs enabled from startup?
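For reference, debug logs can be enabled from startup directly in torrc rather than by signalling a running process; the file path below is just an example:

# torrc: write debug-level logs to a file from startup
Log debug file /var/log/tor/debug.log
# optionally keep notice-level output on stdout as well
Log notice stdout

Sending SIGUSR2 to an already-running tor switches all logs to loglevel debug, which is the approach that left the gap described above.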
Oooooooooooook, finally found the problem and figured out WHY this would never happen with debug logs on. The problem is in the function responsible for decrementing the flow control package window; can you spot it?
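A minimal sketch of the kind of bug that only shows up when debug logging is off, assuming the window decrement ended up as a side effect inside a log_debug() call whose arguments are only evaluated when debug logging is enabled. This is the same packaging-side hook sketched earlier, with the function name and field still illustrative:

/* Sketch only: the sole decrement of package_window is buried in the
 * log_debug() argument.  With debug logs off the argument is never
 * evaluated, so the window stays pinned at 500, and the next stream SENDME
 * from the other edge is treated as "unexpected" and kills the circuit.
 * With debug logs on, the decrement happens and everything works, which is
 * why the bug hides from anyone collecting debug logs. */
static int
note_stream_data_packaged(edge_connection_t *conn)
{
  log_debug(LD_APP, "Stream package_window now %d.",
            --conn->package_window);  /* BUG: side effect inside a log macro */
  return conn->package_window;
}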