Opened 11 months ago

Closed 8 months ago

Last modified 8 months ago

#28639 closed defect (worksforme)

After several days, most of the circuits timeout

Reported by: juga Owned by:
Priority: Medium Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version:
Severity: Normal Keywords: sbws-1.0-must-moved-20181128, 10-proposed
Cc: juga, teor Actual Points:
Parent ID: #25925 Points:
Reviewer: Sponsor:

Description

When the scanner is restarted, circuits most of the circuits doesn't timeout.
The destination is reachable.

Child Tickets

TicketStatusOwnerSummaryComponent
#28645closedAdd debugging options to sbws torrcCore Tor/sbws
#28663closedjugasbws stops accumulating, silentlyCore Tor/sbws
#28691closeddirauth longclaw network connectivity impairedCore Tor/DirAuth
#28701closedTry enabling predicted circuits in sbwsCore Tor/sbws

Change History (24)

comment:1 Changed 11 months ago by teor

What does tor say in its logs?

comment:2 Changed 11 months ago by teor

Keywords: sbws-1.0-must-moved-20181128 added
Milestone: sbws 1.0 (MVP must)sbws 1.0.3

Moving all sbws 1.0 must bugs to 1.0.3.

comment:3 Changed 11 months ago by teor

Milestone: sbws 1.0.3sbws 1.0.x

Milestone renamed

comment:4 Changed 11 months ago by teor

Milestone: sbws 1.0.xsbws: 1.0.x

Milestone renamed

comment:5 Changed 11 months ago by teor

Milestone: sbws: 1.0.xsbws: 1.0.x-final

Milestone renamed

comment:6 in reply to:  1 ; Changed 11 months ago by juga

Replying to teor:

What does tor say in its logs?

The main repeated part:

Oct 28 05:43:00.000 [notice] Heartbeat: Tor's uptime is 2 days 17:59 hours, with 7 circuits open. I've sent 15.08 GB and received 482.17 GB.
Oct 28 05:43:00.000 [notice] Average packaged cell fullness: 43.617%. TLS write overhead: 4%
Oct 28 05:50:28.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
Oct 28 05:57:49.000 [notice] No circuits are opened. Relaxed timeout for circuit 40835 (a General-purpose client 2-hop circuit in state doing handshakes with channel state open) to 10000ms. However, it appears the circuit has timed out anyway. 0 guards are live. [53 similar message(s) suppressed in last 3600 seconds]

I can't figure out reasons with that, but maybe you do.

There are also some of the type:

Oct 26 23:33:59.000 [warn] Tried connecting to router at REDACTED, but identity key was not as expected: wanted REDACTED but got REDACTED.

comment:7 in reply to:  6 ; Changed 11 months ago by teor

Replying to juga:

Replying to teor:

What does tor say in its logs?

The main repeated part:

Oct 28 05:43:00.000 [notice] Heartbeat: Tor's uptime is 2 days 17:59 hours, with 7 circuits open. I've sent 15.08 GB and received 482.17 GB.
Oct 28 05:43:00.000 [notice] Average packaged cell fullness: 43.617%. TLS write overhead: 4%
Oct 28 05:50:28.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
Oct 28 05:57:49.000 [notice] No circuits are opened. Relaxed timeout for circuit 40835 (a General-purpose client 2-hop circuit in state doing handshakes with channel state open) to 10000ms. However, it appears the circuit has timed out anyway. 0 guards are live. [53 similar message(s) suppressed in last 3600 seconds]

The key parts of this message are:

No circuits are opened. Relaxed timeout for circuit

I think we can fix this bug by disabling adaptive timeouts in tor. See #28646.

0 guards are live.

I can't figure out reasons with that, but maybe you do.

sbws sets UseEntryGuards 0, so it should never have any guards:
https://github.com/torproject/sbws/blob/master/sbws/globals.py#L19

I need better logs to debug this issue, please implement #28645.

There are also some of the type:

Oct 26 23:33:59.000 [warn] Tried connecting to router at REDACTED, but identity key was not as expected: wanted REDACTED but got REDACTED.

It is safe to paste these logs without redacting them. sbws just contacts relays at random, so there is no need to protect anyone's privacy.

Once you've done #28645 and #28646, please paste the un-redacted logs in this ticket.

comment:8 Changed 11 months ago by teor

Status: newneeds_information

comment:9 in reply to:  7 ; Changed 11 months ago by teor

Replying to teor:

Replying to juga:

Replying to teor:

What does tor say in its logs?

The main repeated part:

Oct 28 05:43:00.000 [notice] Heartbeat: Tor's uptime is 2 days 17:59 hours, with 7 circuits open. I've sent 15.08 GB and received 482.17 GB.
Oct 28 05:43:00.000 [notice] Average packaged cell fullness: 43.617%. TLS write overhead: 4%
Oct 28 05:50:28.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
Oct 28 05:57:49.000 [notice] No circuits are opened. Relaxed timeout for circuit 40835 (a General-purpose client 2-hop circuit in state doing handshakes with channel state open) to 10000ms. However, it appears the circuit has timed out anyway. 0 guards are live. [53 similar message(s) suppressed in last 3600 seconds]

The key parts of this message are:

No circuits are opened. Relaxed timeout for circuit

I think we can fix this bug by disabling adaptive timeouts in tor. See #28646.

We can ignore this message. It is an unhelpful log message in Tor. It happens even when the circuit timeout will not change. See #28698.

0 guards are live.

I can't figure out reasons with that, but maybe you do.

sbws sets UseEntryGuards 0, so it should never have any guards:
https://github.com/torproject/sbws/blob/master/sbws/globals.py#L19

Are you running Tor 0.2.9?

The "entry guards are live" log message is not in master:
https://github.com/torproject/tor/blob/master/src/core/or/circuituse.c#L630

I think that it was removed in 0.3.0.1-alpha:
https://github.com/torproject/tor/commit/68679504323b0a676a446b8fb34b976c9dc66b4f

comment:10 in reply to:  9 Changed 11 months ago by teor

Replying to teor:

Replying to teor:

Replying to juga:

0 guards are live.

I can't figure out reasons with that, but maybe you do.

sbws sets UseEntryGuards 0, so it should never have any guards:
https://github.com/torproject/sbws/blob/master/sbws/globals.py#L19

Are you running Tor 0.2.9?

The "entry guards are live" log message is not in master:
https://github.com/torproject/tor/blob/master/src/core/or/circuituse.c#L630

I think that it was removed in 0.3.0.1-alpha:
https://github.com/torproject/tor/commit/68679504323b0a676a446b8fb34b976c9dc66b4f

I think this is also an unhelpful log message, because there are no entry guards.

I'll need info-level logs to debug this issue.

comment:11 Changed 11 months ago by teor

Can you please paste the full info-level logs in a pastebin?
It looks like these warning logs come after any errors that are disabling the relays.

comment:12 Changed 11 months ago by teor

This issue is an instance of #27049: sbws does not open any 3-hop circuits, so it stalls when all 3-hop circuits time out.

One quick fix is to enable predicted circuits (#28701).

I opened a tor ticket for this issue (#28714).

comment:13 in reply to:  9 Changed 11 months ago by juga

Replying to teor:

Are you running Tor 0.2.9?

Yes, i was, updated now to 0.3.4.9

comment:14 in reply to:  11 ; Changed 11 months ago by juga

Replying to teor:

Can you please paste the full info-level logs in a pastebin?
It looks like these warning logs come after any errors that are disabling the relays.

Complete info log, 5 first minutes from a test instance, running from master (so it has torrc debugging options) where i also added to torrc info log and ConnectionPadding 0. It has 'LearnCircuitBuildTimeout': 0. Tor version 0.3.4.9:

https://paste.debian.net/hidden/db05571a/

Are these 3 length circuits?: exit circ (length 3, last hop CalyxInstitute05).

Any suggestion on what to search for in the log?

comment:15 in reply to:  14 Changed 11 months ago by teor

Replying to juga:

Replying to teor:

Can you please paste the full info-level logs in a pastebin?
It looks like these warning logs come after any errors that are disabling the relays.

Complete info log, 5 first minutes from a test instance, running from master (so it has torrc debugging options) where i also added to torrc info log and ConnectionPadding 0. It has 'LearnCircuitBuildTimeout': 0. Tor version 0.3.4.9:

https://paste.debian.net/hidden/db05571a/

I'm sorry, I need the logs when sbws is stalled.

Are these 3 length circuits?: exit circ (length 3, last hop CalyxInstitute05).

Yes.

Any suggestion on what to search for in the log?

If sbws is stalling because all 3-hop circuits time out:

Replying to teor:

This issue is an instance of #27049: sbws does not open any 3-hop circuits, so it stalls when all 3-hop circuits time out.

One quick fix is to enable predicted circuits (#28701).

I opened a tor ticket for this issue (#28714).

Then you should look for:

  • circuit timeout messages
  • "No circuits are opened"
  • messages that say that Tor has no live guards

Since you changed the tor version, I don't know if the messages will be the same as 0.2.9.

comment:16 in reply to:  12 ; Changed 10 months ago by juga

Replying to teor:

This issue is an instance of #27049: sbws does not open any 3-hop circuits, so it stalls when all 3-hop circuits time out.

I don't understand this sentence.
I guess the 3-hop circuits are open by Tor to download descriptors and network status documents?.
Wouldn't new 3-hop circuits be created when other 3-hop circuits timeout?.

comment:17 in reply to:  11 ; Changed 10 months ago by juga

Replying to teor:

Can you please paste the full info-level logs in a pastebin?

sbws stalled at Dec 10 15:46:57, it was started at 11:57:31.

Info log since it started until it stalled:
https://share.riseup.net/#zPpTz3wTCRdyb2HjaWzUtw

The previous log seaching for some words:

egrep -wi --color 'No circuits are opened|circ \(length 3|Expiring|live' info_stalled_4h.log

Now i'll run it enabling predicted circuits and paste the log tomorrow.
https://paste.debian.net/hidden/339cfb38/

comment:18 in reply to:  16 Changed 10 months ago by teor

Replying to juga:

Replying to teor:

This issue is an instance of #27049: sbws does not open any 3-hop circuits, so it stalls when all 3-hop circuits time out.

I don't understand this sentence.
I guess the 3-hop circuits are open by Tor to download descriptors and network status documents?.

1-hop circuits are opened by tor to download descriptors and network status documents.

3-hop circuits are opened by most tor clients to connect to exits and onion services.

Wouldn't new 3-hop circuits be created when other 3-hop circuits timeout?.

But sbws sets __DisablePredictedCircuits 1 after bootstrapping, so sbws' tor only has its 3-hop circuits from when it bootstraps. When they expire, there are no more 3-hop circuits, so tor stops opening circuits.

comment:19 in reply to:  17 Changed 10 months ago by teor

Replying to juga:

Replying to teor:

Can you please paste the full info-level logs in a pastebin?

sbws stalled at Dec 10 15:46:57, it was started at 11:57:31.

Info log since it started until it stalled:
https://share.riseup.net/#zPpTz3wTCRdyb2HjaWzUtw

The previous log seaching for some words:

egrep -wi --color 'No circuits are opened|circ \(length 3|Expiring|live' info_stalled_4h.log

Now i'll run it enabling predicted circuits and paste the log tomorrow.
https://paste.debian.net/hidden/339cfb38/

I don't see anything wrong with tor in these logs.
Maybe this stall is on the sbws side?

If sbws is stalling because of a bug in sbws, then enabling predicted circuits probably won't fix it.

comment:20 Changed 10 months ago by juga

I can confirm that enabling predicted circuits does not solve the problem of sbws stalling.
Regarding the circuits timeout, i think this happen mostly when a new prioritization loop starts measuring all the relays that failed to be measured before cause they're unstable (for instance many of the UbuntuCore2XX), but this is not making sbws stall.
So i'd open a new ticket to don't prioritize relays that fail to be measured.
I think this ticket can be closed, and i'll explain in #28663 the reasons why i think sbws is stalling.

comment:21 Changed 8 months ago by juga

Keywords: 10-proposed added
Milestone: sbws: 1.0.x-finalsbws: 1.1.x-final

Moving to milestione 1.1 tickets which we don't know if they need to be implemented after 1.0.3 has been released or that can only be done after 1.0 has been released

comment:22 Changed 8 months ago by juga

This has not happened for long time. If it happens again we can reopen it.

comment:23 Changed 8 months ago by juga

Resolution: worksforme
Status: needs_informationclosed

comment:24 in reply to:  22 Changed 8 months ago by teor

Replying to juga:

This has not happened for long time. If it happens again we can reopen it.

No, let's open a new ticket, and reference this one.

Note: See TracTickets for help on using tickets.