Lower initial descriptor upload delay for hidden services

changed milestone to %Tor: unspecified

added component::core tor/tor milestone::Tor: unspecified owner::twim priority::medium research reviewer-was-teor-20190422 severity::normal status::needs-revision tor-hs type::enhancement labels

Trac:
0001-Make-separate-initial-descriptor-post-delay-for-ephe.patch

Hmm interesting.

A few things here:

Have you tested that the actual delay here is about 30 seconds? I remember people saying that the whole rend_consider_services_upload() function is borked. I think that would be nice to verify.

Now, if we believe that this delay actually offers security and we reduce it for ADD_ONION services, why not reduce it for all services? We don't really know the threat model of all the people who use ADD_ONION, so I'm not sure if we should take such a global decision.

Personally, I feel this delay can indeed increase security in some use cases, but I also don't like the reachability effect that you mentioned.

I think that your sugestion of making this a parameter of ADD_ONION might be a good approach. Although this assumes that all the people who use ADD_ONION actually understand the security threats here, which is quite doubtful...

Trac:
Keywords: N/A deleted, , research added
Sponsor: N/A to SponsorR-can
Milestone: N/A to Tor: 0.2.???

Replying to asn:

Have you tested that the actual delay here is about 30 seconds? I remember people saying that the whole rend_consider_services_upload() function is borked. I think that would be nice to verify.

I did and it takes exactly 30 seconds. Yes, it is kind of unclear from the code that this delay will be actually 30 seconds.

Now, if we believe that this delay actually offers security and we reduce it for ADD_ONION services, why not reduce it for all services? We don't really know the threat model of all the people who use ADD_ONION, so I'm not sure if we should take such a global decision.

But we don't know the actual security benefit of having it 30sec either.

I think that your sugestion of making this a parameter of ADD_ONION might be a good approach. Although this assumes that all the people who use ADD_ONION actually understand the security threats here, which is quite doubtful...

Yes, but it requires more code. :) I proposed this approach to leave all the crazy logic only for those who need it.

I think one should understand what delay should be set and why (maybe someone do and I don't). But if it still unclear it's better to go on with ADD_ONION flag.

That 30 seconds delay pre-dates my knowledge of hidden service so I have no clue why it was chosen. I doubt very much it was about mitigating any kind of "startup time correlation" attack because a random delay was added to it but then we realized it was borked so in the end it always been 30 seconds...

So indeed... why keep a delay at all for services? Descriptor publication to a directory only happens once all intro points circuit are ready and those are established at startup. If you think about it, it's actually a very specific pattern to detect a service startup. As a guard you see 5 circuits being established at once (yes we need 3 IPs but we launch 2 extras for better luck) then 2 of them dies quickly and you have a 6th circuit almost 30 seconds after the initial launch of those... I'm not saying that by removing that delay we'll make it go away but I really don't see the point of the delay here to hide anything. Thus, I'm all for removing it unless armadev had a reason to add that delay :) 10 years ago :).

Also let's keep in mind that this would be very useful to answer for prop224.

Ideally, when not active, HS and regular clients should look identical on the network to an external observer (see also #17591 (moved)).

Replying to dgoulet:

So indeed... why keep a delay at all for services? Descriptor publication to a directory only happens once all intro points circuit are ready and those are established at startup. If you think about it, it's actually a very specific pattern to detect a service startup. As a guard you see 5 circuits being established at once (yes we need 3 IPs but we launch 2 extras for better luck) then 2 of them dies quickly and you have a 6th circuit almost 30 seconds after the initial launch of those... I'm not saying that by removing that delay we'll make it go away but I really don't see the point of the delay here to hide anything. Thus, I'm all for removing it unless armadev had a reason to add that delay :) 10 years ago :).

Being more specific, this lower bound of 30s was introduced by commit b3f846b313b3cf3191e3a9a54ec1c97227393d3d which reads: In very rare situations new hidden service descriptors were published earlier than 30 seconds after the last change to the service, with the 30 seconds being the current voodoo saying that a descriptor is stable. So I don't see any reason to trust the voodoo and thus have this delay. :)

Also this delay makes it more distinguishable for a passive adversary (ISP) whether a client just set up an onion service or not.

See also #12500 (moved).

Replying to twim:

Being more specific, this lower bound of 30s was introduced by commit b3f846b313b3cf3191e3a9a54ec1c97227393d3d which reads:

Sorry for a typo, it's 33f846b313b3cf3191e3a9a54ec1c97227393d3d. Eventually I've gone wrong, it was introduced even before.

It has jumped to 30s from 5s due to "load on authorities". 11d89141ac0ae0ff371e8da79abe218474e7365c:

+  o Minor bugfixes (hidden services):
+    - Upload hidden service descriptors slightly less often, to reduce
+      load on authorities.

"Load on authorities" is not the point anymore because we don't use V0 since 0.2.2.1-alpha. Thus I think it's safe to drop it back to at least 5s (3s?) for all services. Or even remove it at all?

Replying to twim:

Replying to twim:

Being more specific, this lower bound of 30s was introduced by commit b3f846b313b3cf3191e3a9a54ec1c97227393d3d which reads:

Sorry for a typo, it's 33f846b313b3cf3191e3a9a54ec1c97227393d3d. Eventually I've gone wrong, it was introduced even before.

It has jumped to 30s from 5s due to "load on authorities". 11d89141ac0ae0ff371e8da79abe218474e7365c: {{{
o Minor bugfixes (hidden services):

Upload hidden service descriptors slightly less often, to reduce
 load on authorities.
}}}

"Load on authorities" is not the point anymore because we don't use V0 since 0.2.2.1-alpha. Thus I think it's safe to drop it back to at least 5s (3s?) for all services. Or even remove it at all?

I think ideally any such change should be accompanied by a prop224 patch and a mailing list discussion. It would be great if someone added a section to prop224 specifying this behavior, and made a [tor-dev] thread introducing the patch.

Trac:
Keywords: N/A deleted, proposal-needed? added

I think ideally any such change should be accompanied by a prop224 patch and a mailing list discussion. It would be great if someone added a section to prop224 specifying this behavior, and made a [tor-dev] thread introducing the patch.

Agreed, removing of this delay is too "radical" and should be moved to prop224. Anyway I think that it's safe to restore it back to 5s level and enjoy plain-old-services without useless 30s delay now.

This would also be appreciated by Single Onion Service operators (#17178 (moved)), I've had complaints from those using the test code that descriptor upload takes a while.

However, the threat here is that hidden services that have unstable introduction points now upload their descriptors 6x more often.

Why don't we make the initial upload 5s, and every upload after that 30s? Or even better, some kind of exponential backoff to a few minutes - if you've changed your intro points ten times, we really don't want your eleventh descriptor any time soon.

Replying to teor:

However, the threat here is that hidden services that have unstable introduction points now upload their descriptors 6x more often.

Why don't we make the initial upload 5s, and every upload after that 30s? Or even better, some kind of exponential backoff to a few minutes - if you've changed your intro points ten times, we really don't want your eleventh descriptor any time soon.

Yes, rend_consider_services_upload() function is borked and it's hard to tell what's going on. It's not 6x more often. 30 seconds is the initial delay (after descriptor became dirty). Actual upload period seems (sic!) to be [30s , 30s + rand(2*1h)]. The lower boundary is what is fixed. And fixed high for no actual benefit or security reason, IMO (see comment:8).

Replying to teor:

However, the threat here is that hidden services that have unstable introduction points now upload their descriptors 6x more often.

Why don't we make the initial upload 5s, and every upload after that 30s? Or even better, some kind of exponential backoff to a few minutes - if you've changed your intro points ten times, we really don't want your eleventh descriptor any time soon.

There is a retry timeout for IP circuits if too many fails (see INTRO_CIRC_RETRY_PERIOD). We rely on that for an upper limit of descriptor upload. If an IP keeps failing after a short period (5 minutes), then the IP circuit building retry timeout mechanism kicks in and thus you won't see a zillion descriptor publication. Maybe that's not perfect but that's imo something different from the 30 seconds delay added at startup time.

Now, if your IPs keep failing after the 5 minutes retry window (ex: circuit is closed because bad network), well you indeed need to rebuild a new descriptor with a new IP and publish it but that's OK imo. Adding a delay to publication won't help here because we already have that 5 minutes "wait period" in the first place to avoid too many tries.

I have this feeling that we might be at the point of going on tor-dev@ with this discussion because some of us wants an initial delay or get rid of it or only do something in 224?

"Moved" this discussion to tor-dev@. See https://lists.torproject.org/pipermail/tor-dev/2016-September/011353.html.

Trac:
Status: new to needs_review

My impression of the tor-dev discussion is that we wanted to lower the initial post delay for all hidden services, not just ephemeral hidden services.

Please revise the patch to make this happen, or let me know that's not what we agreed on tor-dev.

Trac:
Keywords: N/A deleted, TorCoreTeam201609 added
Status: needs_review to needs_revision

Trac:
0001-Lower-initial-rend-post-delay-back-to-5s.patch

Replying to teor:

My impression of the tor-dev discussion is that we wanted to lower the initial post delay for all hidden services, not just ephemeral hidden services. Please revise the patch to make this happen, or let me know that's not what we agreed on tor-dev.

Sure, done. Just not sure that we have really agreed upon it.

By the way, when I was testing this patch I found out that 3s in enough for an onion service to stabilize and is much better for UX compared to 5s.

Replying to twim:

Replying to teor:

My impression of the tor-dev discussion is that we wanted to lower the initial post delay for all hidden services, not just ephemeral hidden services. Please revise the patch to make this happen, or let me know that's not what we agreed on tor-dev.

Sure, done. Just not sure that we have really agreed upon it.

By the way, when I was testing this patch I found out that 3s in enough for an onion service to stabilize and is much better for UX compared to 5s.

3s from a desktop or VPS on a fast network in North America or Europe? I'd be keen on some testing on mobile and in other countries, if we want to push the delay as low as it can go.

Replying to teor:

By the way, when I was testing this patch I found out that 3s in enough for an onion service to stabilize and is much better for UX compared to 5s.

3s from a desktop or VPS on a fast network in North America or Europe? I'd be keen on some testing on mobile and in other countries, if we want to push the delay as low as it can go.

From an old desktop over obfs4 bridge and shitty WiFi connection. Don't think that it would be totally different on mobile (Orbot). Also meek should be tested since there are some inherent HTTP delays up to 1s (AFAIK).

Replying to twim:

Replying to teor:

By the way, when I was testing this patch I found out that 3s in enough for an onion service to stabilize and is much better for UX compared to 5s.

3s from a desktop or VPS on a fast network in North America or Europe? I'd be keen on some testing on mobile and in other countries, if we want to push the delay as low as it can go.

From an old desktop over obfs4 bridge and shitty WiFi connection. Don't think that it would be totally different on mobile (Orbot). Also meek should be tested since there are some inherent HTTP delays up to 1s (AFAIK).

OK, can you test meek? Also, let's say we do set the delay to 3s:

how are you testing whether the descriptor has stabilised?
can you warn the user when the descriptor changes after this delay? (perhaps between delay and 2*delay?) That way, we'll get feedback if we've set the delay too low, and users will understand why their onion service is hard to reach. (This might not be possible or easy - if not, that's ok. I am also happy to help you write the patch if I know how you're testing it.)

Replying to teor:

OK, can you test meek? Sure.

Also, let's say we do set the delay to 3s:

how are you testing whether the descriptor has stabilised? I look at the is_dirty dynamics. If it gets dirty after 3s then it is too low. For now it isn't.

can you warn the user when the descriptor changes after this delay? (perhaps between delay and 2*delay?) That way, we'll get feedback if we've set the delay too low, and users will understand why their onion service is hard to reach. (This might not be possible or easy - if not, that's ok. I am also happy to help you write the patch if I know how you're testing it.)

The user can be warned by a log message (easy) or by introducing a control event which is emitted when descriptor gets dirty after it was published (more code, bit complicated). Do we have some statistics on intropoint circuit half-life? I think that this delay should be based on it. And learning this half-life is what we're effectively doing here.

Replying to twim:

Replying to teor:

OK, can you test meek? Sure.

Also, let's say we do set the delay to 3s:

how are you testing whether the descriptor has stabilised? I look at the is_dirty dynamics. If it gets dirty after 3s then it is too low. For now it isn't.

can you warn the user when the descriptor changes after this delay? (perhaps between delay and 2*delay?) That way, we'll get feedback if we've set the delay too low, and users will understand why their onion service is hard to reach. (This might not be possible or easy - if not, that's ok. I am also happy to help you write the patch if I know how you're testing it.)

The user can be warned by a log message (easy) or by introducing a control event which is emitted when descriptor gets dirty after it was published (more code, bit complicated).

A log message would be fine.

Do we have some statistics on intropoint circuit half-life? I think that this delay should be based on it. And learning this half-life is what we're effectively doing here.

I don't think it's something we've ever measured - it's hard to measure safely. We'll just have to guess for now, unfortunately. I'll see if we can measure it using a privacy-preserving method, but that will take time.

Okay, getting back on results. I have tested meek and it works fine. Despite me trying, I got no events when descriptor gets dirty. For sure this should be tested in even more crazy network environment than mine.

At the moment I removed initial posting delay and it still works fine. Because startup time gets revealed anyway (#20262 (moved)) I removed dead code that was supposed to introduce random initial delay. Also it drops a warning when new 3s stabilizing period is too small (3s < was_stable_for < 30s).

Please see the last two patches. Would be thankful for any feedback on this.

Trac:
Status: needs_revision to needs_review

Trac:
0001-Removed-initial-rendezvous-post-dealy-and-dead-code-.patch

Trac:
0002-Show-a-warning-if-3-seconds-is-way-too-small-for-ser.patch

Thanks for these patches. Given the discussion in #20262 (moved), I think we can proceed with removing this code that never worked anyway.

A few comments on patch 0001:

Please don't remove the lower delay for test networks. Instead, make it 0.
All references to rend_service_reveal_startup_time() should be removed, because it is no longer used.

And on 0002:

We generally don't include numbers in the code like old_stabilizing_period = (time_t) 30;, instead, let's keep the old #define, but change its name to include OLD.
The warning message is unhelpful, because the user can't do anything to fix it. It will most likely be triggered when the network is slow or unreliable - let's say that, and tell them that their service may be unreliable as a result. (The message that maybe the "new stabilizing period is too short" is for developers, it should go in a comment above the log message.)

Trac:
Keywords: TorCoreTeam201609 deleted, 030-proposed, TorCoreTeam201610 added
Status: needs_review to needs_revision

Replying to teor:

Thanks for review and comments! Fixed now. Not sure about wording in warn message I chose. The code lives here [1] because it will be too much patch noise here otherwise.

[1] https://github.com/nogoegst/tor/tree/no-initialpostdelay

Thanks, we prefer git branches ;-)

Code review:

Removed references to rend_service_reveal_startup_time():

there are references in the comments as well. grep is your friend.

Move numbers to #define

looks good!

Add REND_DIRTY_DESC_STABILIZING_PERIOD_TESTING

The logic here is inverted: the testing period needs to be used when TestingTorNetwork is 1.

Reword warning message for users and leave a comment for developers

looks good!

In future, please add a short message to the commit saying what's changed. (These ones look like they will squash nicely into one commit, so no need to fix them.)

This also needs a changes file. See tor/changes/ for examples, or read doc/HACKING/CodingStandards.md

Replying to teor:

Removed references to rend_service_reveal_startup_time():

there are references in the comments as well. grep is your friend. Add REND_DIRTY_DESC_STABILIZING_PERIOD_TESTING

The logic here is inverted: the testing period needs to be used when TestingTorNetwork is 1.

Thanks, good catches! Fixed.

In future, please add a short message to the commit saying what's changed. (These ones look like they will squash nicely into one commit, so no need to fix them.) These ones are supposed to be squashed. There is no need to store thashy commits in master. ;)

This also needs a changes file. See tor/changes/ for examples, or read doc/HACKING/CodingStandards.md Yeap, added changes file. As always, not sure about 'feature on ...' string.

Replying to twim:

Replying to teor:

Removed references to rend_service_reveal_startup_time():

there are references in the comments as well. grep is your friend. Add REND_DIRTY_DESC_STABILIZING_PERIOD_TESTING

The logic here is inverted: the testing period needs to be used when TestingTorNetwork is 1.

Thanks, good catches! Fixed.

In future, please add a short message to the commit saying what's changed. (These ones look like they will squash nicely into one commit, so no need to fix them.) These ones are supposed to be squashed. There is no need to store thashy commits in master. ;)

These all look good.

This also needs a changes file. See tor/changes/ for examples, or read doc/HACKING/CodingStandards.md Yeap, added changes file. As always, not sure about 'feature on ...' string.

We don't use "feature on", only "bugfix on". And the changes file is usually shorter.

Here's what I suggest we write for the changes file - they can be hard to get right:

o Minor features (onion services):
  - Reduce onion service initial descriptor upload delay from 30s to 3s.
    If descriptor changes too soon after this (< 30s), log a warning about
    unreliable network connections. Closes ticket 20082.
o Minor bugfixes (onion services):
  - Remove code that claimed to introduce an initial descriptor upload delay,
    but never actually worked. Closes ticket 12500, bugfix on tor-0.0.9pre6.

Here are the commands I used to find which release the bug was introduced in:

git blame src/or/rendservice.c
(look at the commits that changed the function)
git show 4b76fe803
git describe --contains 4b76fe8

Trac:
Reviewer: N/A to teor
Summary: Lower initial descriptor upload delay for ephemeral services to Lower initial descriptor upload delay for hidden services
Milestone: Tor: 0.2.??? to Tor: 0.3.0.x-final

Replying to teor: Thanks, reworded changes file. Also note that there is no initial post delay anymore. The changes file describes precisely what the code does.

btw, can we actually get this in 029?

Thanks teor and twim! This required from me a bit more text and trac is the worst so I did a Gitlab merge request for code review here. See my comments there! Let's make the discussion happened there now which will make our life so much easier :).

@twim, yeah you'll need a Gitlab account, sorry about that but it's worth it. If you had more commit, I can add them to that branch so we don't lose our discussion.

https://gitlab.com/dgoulet/tor/merge_requests/12

Removing proposal-needed? as we don't define upload delays in the specification only time period HOWEVER I found this in rend-spec.txt:

   When uploading descriptors, the hidden service needs to make sure that
   descriptors for different clients are not uploaded at the same time (cf.
   Section 1.1) which is also a limiting factor for the number of clients.

We should make sure we actually follow this as this might be important?

Finally, to answer your question twim:

btw, can we actually get this in 029?

Nope, no chance sorry, 029 is freezed and 030 alpha is coming around the corner.

Trac:
Keywords: 030-proposed, proposal-needed? deleted, N/A added

Hello, I haven't had the time to review the patch yet.

My understanding is that the patc makes the default delay be 3s but we will have a torrc option to bump it up to 30s + random()?

I'm having trouble understanding the point of this torrc option? Who would enable it and for what reason? IMO, it's just going to go unused and contribute to our increasing torrc option bloat.

I feel that there is no point in doing probabilistic delays here without a proper security analysis of what they offer, and I have not seen one of those yet. Just saying "it obfuscates startup time" is not a security analysis IMO. Who does it obfuscate it from, what attacks are prevented, and why did we choose that random value? Most importantly, who should enable that torrc option?

Till this security analysis actually arrives, I'm inclined to just change the default delay to constant 3s, and get done with it without extra torrc options. Please let me know if I'm too absolute.

Replying to dgoulet:

@twim, yeah you'll need a Gitlab account, sorry about that but it's worth it. If you had more commit, I can add them to that branch so we don't lose our discussion.

Sure, I prefer to move review-specific code to something ephemeral to not litter trac.

Removing proposal-needed? as we don't define upload delays in the specification only time period HOWEVER I found this in rend-spec.txt: {{{ When uploading descriptors, the hidden service needs to make sure that descriptors for different clients are not uploaded at the same time (cf. Section 1.1) which is also a limiting factor for the number of clients. }}}

We should make sure we actually follow this as this might be important?

Sure, it's really important and somehow I forgot about it. I think there should be some short random initial delay only for on-disk services since they are being started at the same time (note that this code doesn't actually work now). I'll soon update my branch. The delay for hiding startup time has different purpose and should be treated separately (see #20262 (moved) and my branch obfuscate-startup-time).

Replying to asn:

My understanding is that the patc makes the default delay be 3s but we will have a torrc option to bump it up to 30s + random()? I'm having trouble understanding the point of this torrc option? Who would enable it and for what reason? IMO, it's just going to go unused and contribute to our increasing torrc option bloat.

Nope, please see changes file about how it works. torrc option is for hiding startup time for those who need it and is being discussed in #20262 (moved). There is not torrc option here.

Plus see my previous comment.

I feel that there is no point in doing probabilistic delays here without a proper security analysis of what they offer, and I have not seen one of those yet. Just saying "it obfuscates startup time" is not a security analysis IMO. Who does it obfuscate it from, what attacks are prevented, and why did we choose that random value? Most importantly, who should enable that torrc option?

Again, there is no torrc option here. I have eliminated deterministic delay here. 3s is only for the case when networks fails and we have to reupload descriptor suddenly. Also, as dgoulet has mentioned, I've introduced small random delay in order to unlink on-disk onion services from each other. It's absolutely different delay (see above).

I've squashed and fixed the patch up. Please see ticket20082_030_01 branch at https://gitlab.com/nogoegst/tor.

Trac:
Status: needs_revision to needs_review

Trac:
Keywords: N/A deleted, review-group-11 added

setting owner

Trac:
Owner: N/A to twim
Status: needs_review to assigned

Trac:
Status: assigned to needs_review

Ok so I can see how this is useful and plausibly works well. But my main concern here is that we are making that function more complicated now with more if/else thus possible culprit for something that is very important to get right.

We could maybe break it down more in more granular functions and improve testing drastically. Thoughts?

Trac:
Status: needs_review to needs_information

Replying to dgoulet:

Ok so I can see how this is useful and plausibly works well. But my main concern here is that we are making that function more complicated now with more if/else thus possible culprit for something that is very important to get right.

We could maybe break it down more in more granular functions and improve testing drastically. Thoughts?

I do agree that it would be great to have it simpler. atm I don't see any point where it could be broken down. Please leave your comments at https://gitlab.com/nogoegst/tor/merge_requests/1 if you have any ideas/confusions. Thanks!

I revisited the patch, fixed some culprits there and simplified the code. Updated code lies in my ticket20082_030_02 branch and PR is https://gitlab.com/nogoegst/tor/merge_requests/2. To reduce noise this branch will be rebased in future.

So at the moment I removed code that used to introduce initial post delay for non-ephemeral services. It was unclear how it should work and there was neither agreement upon delay value nor real security research. There should be a discussion about how it should implemented against the specs.

For now it works this way (timeline of an onion service uploads):

     ^                ^              ^    ^
     |                |              |    |
     |       +---(x)--+       +--(x)-+    |
----[a]-----[b]--------------[d]---------[d]----------> t
                      |<--y-->|
                          |
                          +---[w]->


^ - upload
a - initial upload (descriptor is dirty and never uploaded)
b - unscheduled upload (descriptor got dirty, uploaded before)
  x - 'stabilizing' period = 3s
  y - 'unstable' period = 30s
  w - warning if descriptor has changed (d) less than y after
      last upload
d - next scheduled upload (next_upload_time, 1h after last upload,
    may be rescheduled)

Trac:
Keywords: TorCoreTeam201610 deleted, TorCoreTeam201611 added

Unmessing even further by detaching the discussion about initial shuffling of services to #20524 (moved). It's good to have this feature implemented separately.

Trac:
Status: needs_information to needs_review

Commented on the gitlab merge. I made some suggestion on simplifying this even more. I might have said stupid things so feel free to express any disagreement :). Thanks twim!

Trac:
Status: needs_review to needs_revision

Trac:
Keywords: review-group-11 deleted, N/A added

Lower initial descriptor upload delay for hidden services

Child items 0

Activity