Opened 7 months ago

Closed 4 months ago

#23543 closed defect (invalid)

prop224: Disconnects on long-lasting HS connections (possibly because of mds)

Reported by: asn
Owned by: asn
Priority: Medium
Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-hs prop224 prop224-bugs
Cc:
Actual Points:
Parent ID:
Points: 1
Reviewer:
Sponsor:

Description

Hello,

so I've been noticing the following pattern in my prop224 IRC client:

Everything goes well for like a day or so: the IRC connection stays open and everything works. Then at some point, after 30+ hours, there is a disconnect on my IRC connection for a few secs, and then a reconnect.

That could be caused by any of the 6 rend circuit hops going down, but I've seen it happen 4-5 times now and I'm starting to suspect that's not the case.

So this happened yesterday night as well. Here are some logs from before/during/after the reconnect happened (at 01:13):

Sep 16 00:50:55.000 [info] handle_response_fetch_consensus(): Received consensus directory (body size 2082670) from server '137.205.124.35:1720'
Sep 16 00:50:55.000 [info] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Sep 16 00:50:55.000 [info] handle_response_fetch_consensus(): Successfully loaded consensus.
Sep 16 00:50:55.000 [info] connection_free_(): Freeing linked Directory connection [client finished] with 0 bytes on inbuf, 0 on outbuf.
Sep 16 00:51:53.000 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
Sep 16 00:51:57.000 [info] circuit_mark_for_close_(): Circuit 0 marked for close at src/or/command.c:599 (orig reason: 521, new reason: 0)
Sep 16 00:52:53.000 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
...
...
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.
Sep 16 01:13:31.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
Sep 16 01:13:31.000 [info] connection_ap_handle_onion(): Got a hidden service request for ID 'd6sfftbz6pkwfwwl'
Sep 16 01:13:31.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2492 more seconds.
Sep 16 01:13:31.000 [info] connection_ap_handle_onion(): Descriptor is here. Great.
Sep 16 01:13:31.000 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
...
Sep 16 01:13:31.000 [notice] Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
...
Sep 16 01:13:33.000 [notice] We now have enough directory information to build circuits.
...
Sep 16 01:13:34.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Sep 16 01:13:34.000 [notice] Received an RENDEZVOUS_ESTABLISHED. This circuit is now ready for rendezvous.
Sep 16 01:13:34.000 [notice] Re-extending circ 3572311836, this time to $AB0CC19517476E610B86FF796B703643BFE21E89~$AB0CC19517476E610B at 5.51.106.108.
Sep 16 01:13:34.000 [notice] Re-extending circ 3572311836, this time to $DB4080D411B0E62788EE38164FFDC275F20159BB~$DB4080D411B0E62788 at 163.172.152.231.
Sep 16 01:13:36.000 [notice] Got RENDEZVOUS2 cell from hidden service on circuit 2915300815.
Sep 16 06:04:51.000 [notice] Heartbeat: Tor's uptime is 1 day 17:59 hours, with 2 circuits open. I've sent 4.01 MB and received 21.77 MB.
Sep 16 06:04:51.000 [notice] Average packaged cell fullness: 23.706%. TLS write overhead: 5%

So it seems like after many many hours of operation we somehow ended up with no mds for our primary guards.

I noticed this the other day in a different way as well. Tor had been running for days, and I would get this message every once in a while:

 I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 5822/6822...

but each time, the number of mds had decreased further. Then at some point it dropped below the minimum dir info threshold, Tor stalled, and then it started fetching mds again. It's like something expires mds in memory, but there is no logic to fetch them back before we hit the minimum point.

Child Tickets

Attachments (1)

torlog_first_incident.log.gz (146.4 KB) - added by asn 7 months ago.


Change History (14)

comment:1 in reply to:  description Changed 7 months ago by arma

Replying to asn:

but each time, the number of mds had decreased further. Then at some point it dropped below the minimum dir info threshold, Tor stalled, and then it started fetching mds again. It's like something expires mds in memory, but there is no logic to fetch them back before we hit the minimum point.

There is something that expires the mds -- see microdesc_cache_clean(). And if you haven't made any new SOCKS requests lately, there won't be any predicted ports, so Tor won't be fetching new ones. Its list of descriptors *will* slowly decay, and that's ok. Or at least, that's intended by the current design.
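
A rough sketch of that expiry logic, for reference -- this is a simplification of microdesc_cache_clean(), not a verbatim quote (the smartlist container, the exact field names, and the 7-day cutoff are recalled/assumed here; the real code walks the cache's hash map):

/* Simplified sketch of the expiry in microdesc_cache_clean() (microdesc.c).
 * Assumes Tor's internal headers (or.h, microdesc.h, container.h).
 * Container type, field names, and cutoff are from memory/assumption. */
#define SKETCH_TOLERATE_MICRODESC_AGE (7*24*60*60)

static void
sketch_microdesc_cache_clean(smartlist_t *cached_mds, time_t now)
{
  time_t cutoff = now - SKETCH_TOLERATE_MICRODESC_AGE;
  SMARTLIST_FOREACH_BEGIN(cached_mds, microdesc_t *, md) {
    /* Drop mds that no node_t holds and that no recent consensus listed.
     * Note there is no "re-fetch" counterpart: a dormant client simply
     * loses descriptors over time, as described above. */
    if (!md->held_by_nodes && md->last_listed < cutoff) {
      SMARTLIST_DEL_CURRENT(cached_mds, md);
      microdesc_free(md);
    }
  } SMARTLIST_FOREACH_END(md);
}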

So maybe the weird bug here is that Tor cares about whether you have a current microdesc for your primary guard, in order to maintain the connection to it?

If that's the issue, one fix would be to make Tor not break current connections just because a microdesc has disappeared. Another fix would be to prevent Tor from going dormant while it has any open connections. I'd be tempted to explore the former fix, since the latter one will (a) reduce the effectiveness of going dormant, and (b) probably require some fiddly bits where we say "oh, but that connection was just for fetching a new consensus, so it's not really a connection".
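
A purely hypothetical sketch of that second option, just to make the fiddly bit in (b) concrete -- the function names below are invented for illustration, and no such check exists in Tor:

/* Hypothetical only -- not Tor code.  A "don't go dormant while connections
 * are open" rule would need something like this.  All three function names
 * are invented for this sketch. */
static int count_open_application_streams(void);   /* invented helper */
static int predicted_ports_still_fresh(time_t now); /* invented helper */

static int
client_should_stay_active(time_t now)
{
  /* A long-lived application stream (like the IRC connection here) would
   * keep us non-dormant... */
  if (count_open_application_streams() > 0)
    return 1;
  /* ...but connections Tor opened itself, e.g. to fetch a consensus, would
   * have to be excluded so they don't keep the client awake forever. */
  return predicted_ports_still_fresh(now);
}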

comment:2 in reply to:  description Changed 7 months ago by arma

Replying to asn:

Sep 16 00:51:57.000 [info] circuit_mark_for_close_(): Circuit 0 marked for close at src/or/command.c:599 (orig reason: 521, new reason: 0)

521 means "remote" plus "finished". So I think it is referring to the consensus directory fetch attempt, not to your irc stream.
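
For reference, these close-reason values are a base END_CIRC_REASON_* code OR'd with a flag marking that the close was triggered by the remote side; decoding the two values that show up in this ticket (numeric values as I recall them from or.h):

/* Close-reason decoding for the values seen in this ticket.
 * Numeric values as recalled from Tor's or.h. */
#define END_CIRC_REASON_CHANNEL_CLOSED   8
#define END_CIRC_REASON_FINISHED         9
#define END_CIRC_REASON_FLAG_REMOTE    512

/* 521 == 512 | 9 -> remote + finished: the directory fetch ending cleanly. */
/* 520 == 512 | 8 -> remote + channel closed: seen later in comment:5/10.   */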

comment:3 in reply to:  description Changed 7 months ago by arma

Replying to asn:

...
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.

Tell me more about this eof? It would appear that your irc client decided to hang up.

Sep 16 01:13:31.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.

And here it would appear that your irc client decided to make a new connection.

Changed 7 months ago by asn

comment:4 in reply to:  3 Changed 7 months ago by asn

Replying to arma:

Replying to asn:

...
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.

Tell me more about this eof? It would appear that your irc client decided to hang up.

Here are the logs surrounding the eof. Seems like it's from natural causes...

Sep 16 01:12:56.000 [debug] connection_bucket_refill_helper(): global_read_bucket now 1073741824.
Sep 16 01:12:56.000 [debug] connection_bucket_refill_helper(): or_conn->read_bucket now 1073741824.
Sep 16 01:13:30.000 [debug] conn_read_callback(): socket 18 wants to read.
Sep 16 01:13:30.000 [debug] read_to_chunk(): Encountered eof on fd 18
Sep 16 01:13:30.000 [debug] circuit_consider_stop_edge_reading(): considering layer_hint->package_window 906
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.
Sep 16 01:13:30.000 [debug] connection_edge_end(): Sending end on conn (fd 18).
Sep 16 01:13:30.000 [debug] relay_send_command_from_edge_(): delivering 3 cell forward.
Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Sep 16 01:13:30.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Sep 16 01:13:30.000 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Sep 16 01:13:30.000 [debug] scheduler_channel_has_waiting_cells(): Channel 6 at 0x7f8c240de460 went from waiting_for_cells to pending
Sep 16 01:13:30.000 [debug] conn_close_if_marked(): Cleaning up connection (fd 18).
Sep 16 01:13:30.000 [debug] circuit_detach_stream(): Removing stream 51231 from circ 4062943041
Sep 16 01:13:30.000 [debug] connection_remove(): removing socket 18 (type Socks), n_conns now 4
Sep 16 01:13:30.000 [debug] connection_free_(): closing fd 18.

I wonder what's going on here. My IRC client was idle at that time and shouldn't have disconnected... IRC clients are supposed to stick around for weeks, months, years.

WRT my IRC client deciding to make a new connection shortly after, I guess that's probably normal reconnect behavior from irssi when it detects a disconnect.

I attached some debug logs of the surrounding times. Couldn't find anything incriminating there...

comment:5 Changed 7 months ago by asn

Here is another such disconnect case from yesterday evening. It's a legacy HS client running latest master connecting to a legacy HS:

Sep 16 17:29:31.000 [debug] connection_edge_process_relay_cell(): Now seen 55751 relay cells here (command 9, stream 0).
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] pathbias_count_use_success(): Marked circuit 151 (7.000000/8.000000) as used successfully for guard motmot ($E248C3A604E196137A3175D4B2E4328922178B47)
Sep 16 17:29:31.000 [info] circuit_mark_for_close_(): Circuit 2915300815 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)
Sep 16 17:29:31.000 [debug] connection_or_process_cells_from_inbuf(): 16: starting, inbuf_datalen 0 (0 pending in tls object).
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): global_read_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): or_conn->read_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] circuitmux_append_destroy_cell(): Cmux at 0x7f8c23af4dc0 queued a destroy for circ 2915300815, cmux counter is now 1, global counter is now 1
Sep 16 17:29:31.000 [debug] circuitmux_append_destroy_cell(): Primed a buffer.
Sep 16 17:29:31.000 [debug] channel_write_cell_generic_(): Writing p 0x7f8c2364cee0 to channel 0x7f8c240de460 with global ID 6
Sep 16 17:29:31.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f8c23b46820 for circ_id 2915300815, channel ID 6 (0x7f8c240de460)
Sep 16 17:29:31.000 [debug] channel_tls_get_overhead_estimate_method(): Estimated overhead ratio for TLS chan 6 is 1.053891
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Increasing queue size for channel 6 by 541 from 0 to 541
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Increasing global queue size by 541 for channel 6, new size is 541
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue size adjustment by +541 for channel 6
Sep 16 17:29:31.000 [debug] scheduler_update_queue_heuristic(): Queue heuristic is now 0
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue heuristic is now 541
Sep 16 17:29:31.000 [debug] circuitmux_notify_xmit_destroy(): Cmux at 0x7f8c23af4dc0 sent a destroy, cmux counter is now 0, global counter is now 0
Sep 16 17:29:31.000 [debug] channel_send_destroy(): Sending destroy (circID 2915300815) on channel 0x7f8c240de460 (global ID 6)
Sep 16 17:29:31.000 [info] connection_edge_destroy(): CircID 0: At an edge. Marking connection for close.
Sep 16 17:29:31.000 [info] rend_client_note_connection_attempt_ended(): Connection attempt for d6sfftbz6pkwfwwl has ended; cleaning up temporary state.
Sep 16 17:29:31.000 [debug] conn_close_if_marked(): Cleaning up connection (fd 12).
Sep 16 17:29:31.000 [debug] connection_remove(): removing socket 12 (type Socks), n_conns now 4
Sep 16 17:29:31.000 [debug] connection_free_(): closing fd 12.
Sep 16 17:29:31.000 [debug] conn_write_callback(): socket 16 wants to write.
Sep 16 17:29:31.000 [debug] flush_chunk_tls(): flushed 514 bytes, 0 ready to flush, 0 remain.
Sep 16 17:29:31.000 [debug] connection_handle_write_impl(): After TLS write of 514: 0 read, 543 written
Sep 16 17:29:31.000 [debug] channel_tls_get_overhead_estimate_method(): Estimated overhead ratio for TLS chan 6 is 1.053891
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Decreasing queue size for channel 6 by 541 from 541 to 0
Sep 16 17:29:31.000 [debug] channel_update_xmit_queue_size(): Decreasing global queue size by 541 for channel 6, new size is 0
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue size adjustment by -541 for channel 6
Sep 16 17:29:31.000 [debug] scheduler_adjust_queue_size(): Queue heuristic is now 0
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): global_write_bucket now 1073741824.
Sep 16 17:29:31.000 [debug] connection_bucket_refill_helper(): or_conn->write_bucket now 1073741824.
Sep 16 17:29:32.000 [debug] conn_read_callback(): socket 7 wants to read.
Sep 16 17:29:32.000 [debug] connection_handle_listener_read(): Connection accepted on socket 12 (child of fd 7).
Sep 16 17:29:32.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
Sep 16 17:29:32.000 [debug] connection_add_impl(): new conn type Socks, socket 12, address 127.0.0.1, n_conns 4.
Sep 16 17:29:32.000 [debug] conn_read_callback(): socket 12 wants to read.
Sep 16 17:29:32.000 [debug] read_to_chunk(): Read 33 bytes. 33 on inbuf.
Sep 16 17:29:32.000 [debug] connection_ap_handshake_process_socks(): entered.
Sep 16 17:29:32.000 [debug] parse_socks(): socks4: Everything is here. Success.
Sep 16 17:29:32.000 [debug] connection_ap_handshake_rewrite(): Client asked for d6sfftbz6pkwfwwl.onion:6697
Sep 16 17:29:32.000 [info] connection_ap_handle_onion(): Got a hidden service request for ID 'd6sfftbz6pkwfwwl'
Sep 16 17:29:32.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2580 more seconds.
Sep 16 17:29:32.000 [info] connection_ap_handle_onion(): Descriptor is here. Great.
Sep 16 17:29:32.000 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
Sep 16 17:29:32.000 [notice] Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.

The incriminating mark for close at circuitbuild.c:1543 seems to be from circuit_truncated(). Could this be from natural causes where one of the rend circuit hops died and the circ had to be rebuilt? :/
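
For context, here is roughly what that code path does -- a paraphrase, not a verbatim quote of circuitbuild.c, and the wrapper name is made up:

/* Paraphrase of the circuit_truncated() behavior behind that log line.
 * Clients never send TRUNCATE themselves, so receiving a TRUNCATED cell
 * means some hop's connection broke or an extend failed, and the whole
 * origin circuit is simply torn down.  512 | 8 == 520, matching the
 * "orig reason: 520" in the logs above. */
static void
sketch_handle_truncated(origin_circuit_t *circ)
{
  circuit_mark_for_close(TO_CIRCUIT(circ),
                         END_CIRC_REASON_FLAG_REMOTE |
                         END_CIRC_REASON_CHANNEL_CLOSED);
}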

comment:6 Changed 7 months ago by asn

Seems like the eof from comment:4 was actually caused by my irssi: it failed to receive a PONG message from the ircd for 5 minutes, which is its timeout. We now need to figure that part out.

comment:7 Changed 7 months ago by dgoulet

Owner: set to asn
Status: new → assigned

comment:8 Changed 7 months ago by asn

Status: assigned → needs_information

comment:9 Changed 7 months ago by asn

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final

Deferring this out of 0.3.2 for now, till we learn more info.

comment:10 Changed 7 months ago by asn

Seems like most disconnects happening on the ircd are caused by killing of rend circs:

circuit_mark_for_close_(): Circuit 4054201541 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)

where orig reason: 520 means that the circ died because of END_CIRC_REASON_CHANNEL_CLOSED...

Is it normal for this to occur so frequently??? It happens for me about once a day, and it happens to other people as well.

Last edited 7 months ago by asn

comment:11 Changed 7 months ago by arma

It seems the current conclusion for the client-side disconnect is that it happened because of a (mis)feature in your irc client?

comment:12 in reply to:  10 Changed 7 months ago by arma

Replying to asn:

Seems like most disconnects happening on the ircd are caused by killing of rend circs:

circuit_mark_for_close_(): Circuit 4054201541 marked for close at src/or/circuitbuild.c:1543 (orig reason: 520, new reason: 0)

Whose log is this? Your Tor client still, or is this from the service-side?

If it's the service side, are there more logs? Do they come in clumps, or spread out, or what?

comment:13 Changed 4 months ago by dgoulet

Resolution: invalid
Status: needs_information → closed

After almost a month of ping/pong with a simpler client and service setup, we got very, very few of these, so I'm calling it "IRC protocol + IRC daemon" craziness.
