Opened 2 years ago

Last modified 11 months ago

#21446 needs_information defect

Number of Introduction Points for a (SingleOnion?) HS seems variable, or degrades with time

Reported by: alecmuffett Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.3.9-alpha
Severity: Normal Keywords: tor-hs, 031-deferred-20170425, 034-triage-20180328
Cc: asn Actual Points: 0.5
Parent ID: Points: 0.5
Reviewer: Sponsor:

Description

I'm helping some people tune a very large Tor Onion site which has recently converted to 2.9.9 and Single Onions.

There have been some strange reachability issues which I am helping triage.

Analysis is tricky because independent copies of the onion exist in two different datacentres, and they both descriptors into the HSDirs.

To date this configuration has not contributed to any outages that we're aware.

For this ticket the specific onion address is "fbsbx2q4mvcl63pw" - though at least one other onion address (out of 3) has been showing the same symptoms.

All onions are configured with "HiddenServiceNumIntroductionPoints 10"

I have been monitoring the Onion descriptors with a Stem script* that prints the HS Descriptor IPs, and over time the listed IPs appear to drop from the expected 10, to as low as 4.

Partial output from the script log, for this morning, is at:

https://gist.github.com/alecmuffett/2910328730012f9af3410a28e5e6071e

At 0800h the IPs in retrieved HSDescriptor (reminder: from one of two purposely-duplicated HS instances) looked like this, lines 1..12 in the Gist; note NumIntroductionPoints/nip=10

Mon 13 Feb 2017 08:00:55 GMT
v=2 age=55 nip=10 pub(2017-02-13 08:00:00) now(2017-02-13 08:00:55) fbsbx2q4mvcl63pw

0: 134.19.177.109:443
1: 178.254.20.134:9001
2: 185.73.220.8:443
3: 188.114.140.245:9001
4: 37.187.4.8:21
5: 46.101.102.71:443
6: 46.4.49.201:9001
7: 62.210.82.44:21
8: 62.210.92.11:9001
9: 83.44.207.3:9001

Yet further down the file, a different descriptor, possibly from the other instance (lines 50-57)

Mon 13 Feb 2017 08:21:03 GMT
v=2 age=1263 nip=6 pub(2017-02-13 08:00:00) now(2017-02-13 08:21:03) fbsbx2q4mvcl63pw

0: 188.118.217.236:443
1: 192.42.116.161:9001
2: 37.187.4.8:21
3: 80.248.208.131:9001
4: 82.165.207.102:9001
5: 88.190.210.240:9001

...only 6 introduction points.

If we assume (?) that it's unlikely for the two daemons to pick the same introduction point, the fact that "188.118.217.236:443" is present in all the descriptors in lines 40..150 suggests that all these descriptors are from the same daemon instance - however the number of introduction points varies.

See attached image for highlight. Is this expected?

*script: https://github.com/alecmuffett/halfagigonion/blob/master/ls-hsdir.py

Child Tickets

TicketStatusOwnerSummaryComponent
#21594closedHidden Services with many intro points delay checking circuits on startupCore Tor/Tor
#21598closedLog a message when a hidden service has fewer introduction points than expectedCore Tor/Tor
#21599closedMake hidden service descriptor creation more consistentCore Tor/Tor
#21600assignedasnHidden service introduction point retries occur at 1 second intervalsCore Tor/Tor
#21622closedteorLog a message when a hidden service delays new introduction point circuitsCore Tor/Tor

Attachments (1)

Screen Shot 2017-02-13 at 09.32.34.png (206.6 KB) - added by alecmuffett 2 years ago.
screencap with highlight

Download all attachments as: .zip

Change History (13)

Changed 2 years ago by alecmuffett

screencap with highlight

comment:1 Changed 2 years ago by alecmuffett

nb: per the Gist, the descriptor is retrieved and analysed at 5 minute intervals.

comment:2 Changed 2 years ago by ahf

Cc: dgoulet asn added
Keywords: tor-hs added
Milestone: Tor: 0.3.1.x-final
Version: Tor: 0.2.9.9

comment:3 Changed 2 years ago by dgoulet

Status: newneeds_information

The screencap, am I looking at 5 different tor instance or the same tor instance but at 5 minutes interval? The former, I would be very worried that the same IPs are picked like that. The latter, there is a clear rotation of IPs that is not sane...

Second, do you have any notice or warn or info logs that could be useful for this?

comment:4 Changed 2 years ago by asn

This does not look good, but I'm also a bit confused.

Looking at the gist, are we looking at the intro points of two separate tor daemons, and we cannot know which logs belong to which host? Are we certain there are two hosts in there?

In general, some logs would help here.

comment:5 Changed 2 years ago by teor

For future reference, if you cc me on these tickets, I might be able to help.

comment:6 Changed 2 years ago by teor

Just to clarify: are these descriptors created using OnionBalance, or using Tor?

Here's what I think is happening:

Issue 1: 5 minute delay before intro circuits are checked

When a hidden service starts, rend_consider_services_intro_points() builds HiddenServiceNumIntroductionPoints + NUM_INTRO_POINTS_EXTRA introduction points.

When HiddenServiceNumIntroductionPoints is between 8 and 10, this means that when rend_consider_services_intro_points() next runs, it checks if it has built more than MAX_INTRO_CIRCS_PER_PERIOD (10), and then ignores this service for INTRO_CIRC_RETRY_PERIOD (5 minutes). Only then will it check any of the circuits for failure.

A similar issue can happen when the number of circuits opened over 5 minutes due to retries, failures (node down), and expiries exceeds 10. There can be up to 3 retries if a node is in the consensus, but refusing connections.

Expiry is triggered after an intro point handles INTRO_POINT_MIN_LIFETIME_INTRODUCTIONS (16384). If you have a million monthly users, you could be rotating introduction points on the order of every hour (they expire after 18-24 hours anyway).

This bug and the scale of your site exacerbate the next issue:

Issue 2: Intro point count can become stale

Just before a hidden service descriptor is rebuilt, rend_consider_services_upload() checks if there are enough intro points using count_established_intro_points(). This uses circuit_established, which is set in rend_service_intro_established(), and cleared in rend_consider_services_intro_points(). Due to the first issue, clearing this flag can take up to 5 minutes.

rend_consider_services_upload() then calls rend_service_update_descriptor(), which does the descriptor update. It uses find_intro_circuit(), which checks for *current* intro circuits before adding them to the descriptor, rather than using circuit_established.

Further Questions

I wonder whether your service is failing a lot of connections: that could make the above issues worse.

I also wonder whether the load your service places on some intro points makes them start dropping connections. This would cause an increased number of retries.

I wonder if we should delay between retries, rather than trying at 0, 1, 2 seconds, then giving up.

And I'd still like to see those logs.

comment:7 Changed 2 years ago by teor

Actual Points: 0.5
Points: 0.5
Version: Tor: 0.2.9.9Tor: 0.2.3.9-alpha

I added child tickets to this bug to address this issue, and detect similar issues in future.

Since these are not security issues, the earliest release they might make it into is 0.3.0, but that's at release candidate status. Since these bugs have been around for several releases, it's up to nickm whether they make it in.

I'm leaving this ticket open in case alecmuffett wants to provide further information.
(In the absence of further information, once all the child tickets are closed, this should close.)

Also see #21302 and #21297 for issues that were fixed in 0.3.0.3-alpha that could also cause similar reachability issues.

comment:8 Changed 2 years ago by nickm

Keywords: 031-deferred-20170425 added
Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final

Triage: batch-defer unowned items of priority Medium or lower to 0.3.2.

comment:9 Changed 19 months ago by dgoulet

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

Both remaining child tickets have been postponed to 033.

comment:10 Changed 15 months ago by dgoulet

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

Moving a bunch of tickets from 033 to 034.

comment:11 Changed 13 months ago by nickm

Keywords: 034-triage-20180328 added

comment:12 Changed 11 months ago by dgoulet

Cc: dgoulet removed
Milestone: Tor: 0.3.4.x-finalTor: unspecified
Note: See TracTickets for help on using tickets.