#25552 closed defect (fixed)

prop224: Onion service rev counters are useless and actually harmful for scalability

Reported by: asn Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.1.9
Severity: Normal Keywords: tor-hs, prop224, 035-roadmap-master, 035-must, regression, 035-triaged-in-20180711
Cc: franklin, isis Actual Points:
Parent ID: Points: 6+++
Reviewer: dgoulet Sponsor:

Description (last modified by asn)

Armadev discovered that hsv3 revision counters are harmful to scalability since if an onion service is hosted by multiple servers (like the fb one), every server should have visibility of the revision counter if they want to publish a descriptor.

We should figure out whether there is an easy way around that, or whether this is actually a big problem for scalable v3s. We should also consider how this works out with onionbalance-based designs.

Rev counters are there so that HSDirs (and other actors) cannot replay old HS descriptors. However, they are not really needed since now HS descriptors are only replayable for a day (before the blinded key gets refreshed), and also HSDirs could keep a replay cache of the descriptor assigned to a blinded key.

If we decide to rip them off, the way to do it is in two painful steps:
a) Remove rev counter checking from HSDirs, and do a replay cache or something.
b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

Child Tickets

TicketStatusOwnerSummaryComponent
#26549closedRevision counter for v3 ephemeral hidden service is lostCore Tor/Tor

Change History (53)

comment:1 Changed 16 months ago by asn

Description: modified (diff)

comment:2 Changed 16 months ago by franklin

Cc: franklin added

comment:3 Changed 16 months ago by asn

WRT the replay cache idea of step (a) above, we probably do need a replay cache on the HSDirs, because there is a 24 hour window (before we change blinded pubkey) where HSDirs can replace the descriptors on other HSDirs with older versions of the descriptor. We probably want to avoid this and we should use a replay cache for this.

The right way to use a replay cache here is to store the hash of the HSV3 descriptor on the replay cache. We should investigate whether we need to hash the whole descriptor, or the whole descriptor minus the signature (in case the signature is malleable and an attacker can tweak it to bypass replay cache). If we need to do the latter approach, we should add the right data in hs_cache_dir_descriptor_t as part of cache_dir_desc_new().

Implementation plan for step (a) above:

1) Introduce a global replay_cache_t *hs_cache_replay_cache in hs_cache.c next to hs_cache_v3_dir. We should index entries to this replay cache by blinded key, or maybe add an insertion timestamp so that we know when to clean it up.
2 In cache_store_v3_as_dir, remove the revision counter check, and instead query the replay cache for whether we already have seen this descriptor before. If we have seen this descriptor before we should treat it the same way we treat descriptors with a smaller or equal revision counter right now, that is, reject them and log_info.
3) We should clean up the replay cache when we are sure that the blinded key for a descriptor is now useless and will never be used by clients again. We should look in rend-spec-v3.txt to make sure when that is; probably at 24 or 48 hours or so.

comment:4 in reply to:  description ; Changed 16 months ago by arma

Replying to asn:

b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

As I understand it, HSDirs learned how to handle v3 descriptors in 0.3.0.1-alpha, as part of #17238.

According to
https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/CoreTorReleases#Listofreleases
Tor 0.3.0 is already dead. And importantly, the feature isn't in our long-term-stable 0.2.9.

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

comment:5 Changed 16 months ago by teor

This issue was originally reported in #25124 as a bug in Stem.

comment:6 Changed 16 months ago by asn

Keywords: 034-roadmap-proposed added

comment:7 in reply to:  4 ; Changed 16 months ago by teor

Replying to arma:

Replying to asn:

b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

As I understand it, HSDirs learned how to handle v3 descriptors in 0.3.0.1-alpha, as part of #17238.

According to
https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/CoreTorReleases#Listofreleases
Tor 0.3.0 is already dead. And importantly, the feature isn't in our long-term-stable 0.2.9.

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

But hang on, do clients require descriptors to have revision counters?
If so, we can't rip out revision counters on services for a long time.

But we can make HSDirs and clients ignore them,

comment:8 Changed 15 months ago by asn

Latest plan: According to RFC 8032:

   Some systems assume signatures are not malleable: that is, given a
   valid signature for some message under some key, the attacker can't
   produce another valid signature for the same message and key.

   Ed25519 and Ed448 signatures are not malleable due to the
   verification check that decoded S is smaller than l.  Without this
   check, one can add a multiple of l into a scalar part and still pass
   signature verification, resulting in malleable signatures.

We should check if our ed25519 implementations do the above check. If they do, it should be possible to just replay cache the ED25519_SIG_LEN bytes of our ed25519_signature_t. I plan to look at our implementation this week to see if the aboce check is done, then send an email to Ian Goldberg, and if he agrees that it's legit, proceed with this plan.

comment:9 in reply to:  8 Changed 15 months ago by asn

Replying to asn:

Latest plan: According to RFC 8032:

   Some systems assume signatures are not malleable: that is, given a
   valid signature for some message under some key, the attacker can't
   produce another valid signature for the same message and key.

   Ed25519 and Ed448 signatures are not malleable due to the
   verification check that decoded S is smaller than l.  Without this
   check, one can add a multiple of l into a scalar part and still pass
   signature verification, resulting in malleable signatures.

We should check if our ed25519 implementations do the above check. If they do, it should be possible to just replay cache the ED25519_SIG_LEN bytes of our ed25519_signature_t. I plan to look at our implementation this week to see if the aboce check is done, then send an email to Ian Goldberg, and if he agrees that it's legit, proceed with this plan.

Progressed with plan: https://lists.torproject.org/pipermail/tor-dev/2018-April/013074.html

comment:10 Changed 15 months ago by isis

Cc: isis added

comment:11 Changed 15 months ago by dgoulet

Keywords: 034-roadmap-master added; 034-roadmap-proposed removed
Milestone: Tor: unspecifiedTor: 0.3.4.x-final
Owner: set to asn
Status: newassigned

This is already on the roadmap so marking it for 034.

comment:12 Changed 15 months ago by cypherpunks

Parent ID: #25955

comment:13 Changed 15 months ago by isis

Parent ID: #25955

Hey @cypherpunks, we really appreciate you trying to help organise and triage our tickets, but in this case the HS team has a flow that works for them, so please don't set parent tickets on their stuff. Thanks!

Last edited 15 months ago by isis (previous) (diff)

comment:14 Changed 15 months ago by dgoulet

Cc: dgoulet removed
Owner: changed from asn to dgoulet

Taking over this.

To summarize, after a discussion with teor and asn, we'll go with a replaycache that stores a hash of the descriptor *without* the signature. And this would be done _after_ desc signature validation.

comment:15 Changed 15 months ago by dgoulet

Here is a fun fact. We use the revision counter in the computation of the descriptor encryption keys. See spec section HS-DESC-ENCRYPTION-KEYS.

So bottom line, this means that we have to remove it from secret_input computation *but* only if we can't find the counter in the plaintext data of the descriptor ("revision-counter" SP Integer NL).

Code wise, this isn't very complicated but I thought it would be wise to just throw it out there since it affects our crypto construction.

comment:16 in reply to:  7 Changed 15 months ago by dgoulet

Replying to teor:

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

But hang on, do clients require descriptors to have revision counters?

So here is the fun part of this. Client do look at the revision counter when caching but only to decide if they have a newer one in their cache or not. Thus, a revision counter always to 0 for instance wouldn't affect the client cache.

As for HSDir, they won't accept a descriptor with a revision counter that is lower or equal with the one they have in their cache. Meaning that 034+ services will still need to put the revision counter in their descriptor for a while until <= 032 is phased out. Not putting the revision counter in the descriptor for specific HSDir is not trivial amount of engineering work.

Now, this is where it gets messy. When decoding the plaintext part of a descriptor (where the rev. counter is), we hard require the counter to be in it (notice the T1()):

  T1(str_rev_counter, R3_REVISION_COUNTER, EQ(1), NO_OBJ),

Thus, as long as we have 032 and 033 HSDirs and clients on the network, we can't remove the counter from the descriptor else they won't be able to store/access 034+ services as every descriptor will fail to be decoded.

Thus to summarize, the only thing we can do for now is make HSDir use a replaycache instead of revision counter, make client ignore revision counter and make the revision counter optional in the descriptor when decoding it.

But we MUST make the service put the rev counter regardless, with the current mechanism, in the descriptor for a while else it will break client and HSDir <= 033.

Or a more nuclear option, we bump the descriptor version to 4 which won't have the revision counter which will effectively make 034+ the birth of the onion service v4 :P...... not ideal ;).

Last edited 15 months ago by dgoulet (previous) (diff)

comment:17 Changed 15 months ago by dgoulet

Reviewer: asn
Status: assignedneeds_review

Possible branch: ticket25552_034_01.

To be honest, I currently don't see a way for service to stop putting the revision counter without breaking many clients because of comment:15.

Seems like once all HSDir <= 032 are phased out, we can then make the service stop putting it in the descriptor (which will break <= 032 clients...). This means that *right now* (it is in the branch) we have to either make the client ignore the revision counter in the secret in put construction or always use 0 if no rev counter in the descriptor (which I did for simplicity).

Anyway, see the attempt above.

comment:18 Changed 15 months ago by teor

Status: needs_reviewneeds_revision

I'm going to suggest a different strategy:

  1. Make v3 onion services use the descriptor generation timestamp for the revision counter
  2. Backport this change to all tor versions with v3 onion services (0.3.2 and later)

This fix will make v3 onion services scaleable, by allowing multiple services to submit descriptors with a very small probability of revision number collisions. It also retains the property that newer descriptors replace older ones.

We can make a separate decision about replay caches on HSDirs.
We can make a separate decision about removing the revision counter entirely.
If we decide to keep it, we should check that it's a 64-bit field, so it lasts past 2038.

comment:19 Changed 15 months ago by teor

Status: needs_revisionneeds_review

Oh, wait, that doesn't work. It leaks the service's clock / clock skew to HSDirs and clients.

We could add noise and round, but suddenly the patch isn't simple any more.

comment:20 Changed 15 months ago by teor

Status: needs_reviewneeds_revision

HSDirs that allow descriptors with no revision counter should declare a new HSDir protocol version.
We should make services leave out the revision counter if a consensus parameter is set.

Then our upgrade path is:

  1. Wait until most HSDirs support the new protocol
  2. Stop giving the HSDir flag to relays that don't support the new protocol
  3. Make services uses the new protocol by setting the consensus parameter

comment:21 Changed 15 months ago by dgoulet

Status: needs_revisionneeds_review

Some discussion happened on IRC with teor, asn and I. Here is an updated branch:

https://github.com/torproject/tor/pull/80
Branch: ticket25552_034_02

comment:22 Changed 15 months ago by asn

Looks better!

I did another review and pushed some changes to ticket25552_034_02.
Changes are in different commits, so if you don't like some of them feel free to not cherry-pick them! I also added a comment on github.

Do you think we can run this on your relay for a few days (perhaps with increased logging) to ensure that we get no replay situations? I'm mainly afraid of the case where the HS cache clears its descriptor cache, and the HS tries to send us the same descriptor, and we reject it even tho we dont have one. I looked at the code and I believe that this cant happen since each HS descriptor is different (because of encryption salt, etc.), but still want to make sure :)

comment:23 Changed 15 months ago by dgoulet

Ok I've squashed your commit so we can have a clean git history! Your github comment doesn't show so I think you forgot to click Submit review ;).

All good stuff. Everything is untouched except the hs_descriptor_is_replay() that I changed so we don't use the hs_descriptor.c namespace and also I identify the function with as_dir because the hs_cache.c file contains many caches. I'll run this on my relay but it will take at the very least 5+ days since takes 96h to get HSDir flag ;).

But I think in the meantime we can merge upstream else we'll be way to short from the closing of the merge window here...

Branch: ticket25552_034_03

comment:24 Changed 15 months ago by dgoulet

Status: needs_reviewmerge_ready

comment:25 Changed 14 months ago by nickm

Status: merge_readyneeds_revision

Travis says the unit tests are failing here. I confirm:

hs_cache/directory: 
  FAIL src/test/test_hs_cache.c:112: expected log to contain "Possible descriptor replay detected"  Captured logs:

  [directory FAILED]
1/1086 TESTS FAILED. (31 skipped)

I'll also start reviewing the branch.

comment:26 Changed 14 months ago by nickm

I've reviewed the PR. The biggest issue is related to the use of "\n" signature_str, which I believe should be "\n" signature_str " " instead.

Other issues not on the PR:

  1. How hard is it to DoS this into an OOM condition? Do we need to tie it into the OOM system? And by doing so, do we subject ourselves to replay attacks once again?
  1. On point 1: perhaps the replay cache should be thought of, not as a complete replacement for revision counters, but as an alternative to them when they cannot be used? That is, we could enforce the rule that revision counters are non-decreasing, but allow revision counters to remain equal, and use the replay cache to handle only the "equal counter" case.

That way if we need to rip out the cache because of OOM issues in the future, or if we solve the problem of coordinating revision counters between distributed HS providers, we aren't stuck with the cache forever.

  1. As above, the unit tests need fixing.

comment:27 in reply to:  26 Changed 14 months ago by dgoulet

Status: needs_revisionneeds_review

Replying to nickm:

I've reviewed the PR. The biggest issue is related to the use of "\n" signature_str, which I believe should be "\n" signature_str " " instead.

Yes. Fixed on the PR with fixup 23a4c09edb2c9d46.

  1. How hard is it to DoS this into an OOM condition? Do we need to tie it into the OOM system? And by doing so, do we subject ourselves to replay attacks once again?

Right so hooking it up to the OOM could affect the replay attack protection. Thus, this cache can potentially grow to infinite. However, entries are in bytes (32 byte hash + some extra byte). For example, one would need to upload this amount of descriptors in a 36 hour period to reach the size:

100k - ~3MB
1M - ~32MB
5M - ~153MB

What if we add a warning when a certain threshold is reached so we can maybe learn if it is happening in the wild at some point?

Usually, when the OOM goes off on HSDir descriptor, we learn about it pretty quickly.

  1. On point 1: perhaps the replay cache should be thought of, not as a complete replacement for revision counters, but as an alternative to them when they cannot be used? That is, we could enforce the rule that revision counters are non-decreasing, but allow revision counters to remain equal, and use the replay cache to handle only the "equal counter" case.

This won't work with the use case we are trying to fix with this patch.

Basically, multiple services competing (for instance what FB does), the last one uploading needs to be the true one winner. If we still support revision counter, the service that restarted the last will always be the winner.

So we need to dump the rev. counter usage both at the HSDir and Client side for caches all together.

That way if we need to rip out the cache because of OOM issues in the future, or if we solve the problem of coordinating revision counters between distributed HS providers, we aren't stuck with the cache forever.

Other problem with revision counter. We can't simply dump them because it is also used in the encryption key computation by service/client...

So we need this transition period were (1) we make all client/hsdir make it optional in the descriptor and set it to 0 which means if not present, client and service uses 0. Then (2) at some point far in time, we can make service stop putting it because we believe all client/hsdir have updated......

Really not ideal situation but the best we came up with.

  1. As above, the unit tests need fixing.

Done in fixup fc45a82b7bceb4ca

comment:28 Changed 14 months ago by nickm

The changes here look good. Let's keep talking about revision counters, though: the DoS OOM opportunity makes me nervous.

comment:29 Changed 14 months ago by nickm

There's an alternative approach we've been talking about. For preliminaries, see my torspec branch ope_spec, with some corresponding backend implementation code in ope_hax

comment:30 Changed 14 months ago by asn

Also see bug25552_blinding for an alternative design suggested by Nick, which blinds the time(NULL) timestamp with a hash derived from the ephemeral blinded key as such:

    uint32_t BLINDING_FACTOR = SHA3(ephemeral_blinding_key)[4]
    uint64_t REV_COUNTER = now + BLINDING_FACTOR

IIUC, this offers the same properties as the OPE approach: monotonically increasing rev counter, with no state file needed, and with obfucated local time, but it requires time sync between load balancing nodes. It also seems easier to understand/review than the OPE approach.

Nick what you think?

comment:31 Changed 14 months ago by asn

Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final
Status: needs_reviewnew

OK approach from comment:30 is busted after all! We are going with the OPE design from comment:29, however this needs some more work. Pushing to early 035.

comment:32 Changed 13 months ago by asn

Points: 46+++
Status: newneeds_review

Oooff! I have a branch that implements the OPE design for HSv3! It seems to work on my onion service for the past week just fine, and tests pass. This was much harder than I thought, because the hs_service unittests are a mess that depend on too many time sources. I fixed them all and tested them with various kinds of local time using a crazy bash script I hacked up.

You can find a fixup for the spec branch at ope_spec in my github torspec repo. It explains why we cannot use "seconds since TP", and why we have to use "seconds since SRV".

You can find the code at my branch bug25552_ope_draft. It seems to work fine and the code is decent, but I labeled it as draft because there might be some trivial code-style improvements that can be done (in particular im thinking the is_current func argument). I also need to make sure that HSes will not burp if they have an obsolete rev-counter left in their state file. Please check it out and if you like it I will make it real proper: https://github.com/torproject/tor/pull/150

comment:33 Changed 13 months ago by dgoulet

Reviewer: asndgoulet

comment:34 Changed 13 months ago by dgoulet

Status: needs_reviewneeds_revision

Left comments on the PR. No show stoppper but a couple comments I would like to see answered or/and addressed. Thanks!

comment:35 in reply to:  34 Changed 13 months ago by asn

Replying to dgoulet:

Left comments on the PR. No show stoppper but a couple comments I would like to see answered or/and addressed. Thanks!

Thanks for the review David. I pushed two fixups that should address your comments.

Let me know if you like them and I'm gonna rebase everything into a proper branch.

comment:36 Changed 13 months ago by asn

Status: needs_revisionneeds_review

comment:37 Changed 13 months ago by maqp

Hi. I just wanted to chime in that the bug25552_ope_draft appears to be a working solution to the issues with UPLOAD_REJECTED during repeated descriptor publishing that were first reported in ticket #25124Here is the code I'm using to test the fix.

comment:38 Changed 13 months ago by dgoulet

Status: needs_reviewmerge_ready

ok so this branch seems good to go although now Github is complaining of merge conflict (because of the huge ongoing refactoring!).

I'm putting this one in merge_ready but @asn if you have a chance to provide a new one rebased on latest master, might help nickm if the conflicts are too complicated.

comment:39 Changed 13 months ago by asn

Please see branch bug25552_ope in my repo with the rebased branch**

Here is the relevant PR: https://github.com/torproject/tor/pull/185

comment:40 Changed 13 months ago by teor

Does this ticket solve #26549?

comment:41 in reply to:  40 Changed 13 months ago by asn

Replying to teor:

Does this ticket solve #26549?

Yes it does. Replied on that ticket.

comment:42 Changed 13 months ago by maqp

If it's possible, could you please backport this to 0.3.2 and later?

comment:43 in reply to:  42 ; Changed 13 months ago by teor

Keywords: 032-backport-maybe 033-backport-maybe 034-backport-maybe added

Replying to maqp:

If it's possible, could you please backport this to 0.3.2 and later?

This is a very large patch, so we might not backport it.
What would stop you from upgrading to 0.3.5?

comment:44 in reply to:  43 Changed 13 months ago by maqp

Replying to teor:

This is a very large patch, so we might not backport it.
What would stop you from upgrading to 0.3.5?

Nothing really! I just noticed the 0.3.5 milestone was set to December and since you mentioned something about backporting the fix in comment:18, I thought I'd ask if that was possible. If it's a lot of work then I'll of course wait until Tails etc. upgrade to 0.3.5.

comment:45 Changed 13 months ago by nickm

Keywords: 035-roadmap-master added; 034-roadmap-master 032-backport-maybe 033-backport-maybe 034-backport-maybe removed

comment:46 Changed 13 months ago by asn

Keywords: 035-must regression added

Marking this as 035-must. Adding the regression as suggested by the triage guidelines, meaning that this is a regression behavior from v2 to v3.

comment:47 Changed 12 months ago by nickm

Status: merge_readyneeds_revision

I left a few small comments on the branch; looking very good. Only one of the comments (saving the OPE key) should take any engineering effort.

comment:48 Changed 12 months ago by asn

Did the fixes requested, but the branch fails in chutney. Need some more hammer probably.

comment:49 Changed 12 months ago by nickm

If it passed before in chutney, but now it fails, my first suspicion would be that you're not using the correct saved OPE keys. Let me know if you want debugging help.

comment:50 Changed 12 months ago by nickm

Keywords: 035-triaged-in-20180711 added

comment:51 in reply to:  49 Changed 12 months ago by asn

Status: needs_revisionmerge_ready

Replying to nickm:

If it passed before in chutney, but now it fails, my first suspicion would be that you're not using the correct saved OPE keys. Let me know if you want debugging help.

OK, addressed your comments and pushed fixes for a few bugs found during chutney testing.
Thanks for the review!

Let me know if something smells funny.

Pushing this back in merge_ready after also discussing it with David in IRC.

comment:52 Changed 12 months ago by nickm

Squashed, and merged to master.

The squash was trivial and produced 14b507e5207ce7e581c5fc773921f8cf65d08247.

The merge was nontrivial, and produced e2b744ce38edb8901cff3288634c4ebb5b4568b9. It would be good if one of you would review it, then close this ticket. :)

comment:53 Changed 12 months ago by asn

Resolution: fixed
Status: merge_readyclosed

Merge commit looks good to me.
Thanks for the review and the merging!

Closing this! :)

Note: See TracTickets for help on using tickets.