Opened 3 years ago

Closed 3 years ago

#23387 closed defect (fixed)

prop224: HSdir index desynch between client and service

Reported by: asn Owned by:
Priority: Very High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: prop224, tor-hs, review-group-23
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by asn)

David found his client unable to connect to his service. Apparently, they compute different hsdir indices, since it was 12:20UTC (non-overlap period) and the live consensus had valid-after at 11:00UTC (overlap period). Apparently something got confused.

Theory: We use time(NULL) as the time in node_set_hsdir_index() whereas we use the live consensus valid-after in rotate_all_descriptors(). This can cause desynch within the same tor instance. We should probably use the live consensus valid-after in all cases to have a common point of reference, and avoid problems with clock skews.

Child Tickets

Change History (15)

comment:1 Changed 3 years ago by asn

Description: modified (diff)
Summary: prop224: Time period desynch between client and serviceprop224: HSdir index desynch between client and service

comment:2 Changed 3 years ago by dgoulet

See branch: bug23387_032_01

comment:3 Changed 3 years ago by dgoulet

Priority: HighVery High
Status: newneeds_review

comment:4 Changed 3 years ago by asn

Status: needs_reviewneeds_revision

Dumping some comments from my review and testing here:

a) There is no commit msg in daf72124 of why we don't like the previous approach of figuring out whether overlap period just started/ended. I don't really like this new approach of checking consensus->valid_after == tp_start_time, since there is no guarantee that Tor will get all the consensuses: it can skip a consensus, and if it skips the right one then we will never detect an overlap change.

This logic also caused an assert in my testing since we do:

    if (!overlap_mode_is_active && !overlap_mode_just_ended) {

and I accidentally triggered that because I suspended my laptop for 6 hours, which means that Tor missed the overlap-changing consensus, so it never actually rotated descriptors. Not sure how to fix this issue. I think the old logic was more robust; not sure why we ditched it.

b) Also seems like we still need a if (service->desc_next) guard before rotate_service_descriptors() since during bootstrap we continuously think that overlap has ended, and we end up there with desc_next being NULL and we overwrite desc_current with NULL.

c) Also should we change all places we get the current time period num to use the consensus valid after? To reduce desynch possibilities?

c) This final comment is not related to this branch, but I think there is a case of HSDir desynch that we don't handle in the current prop224 spec: Consider an HS with 13:00 consensus, and a client with 11:00 consensus. This means that the client is in time period #N and the HS is in time period #N+1. The HS thinks it's in non-overlap mode so it only uploads the current descriptor for time period #N+1. Meanwhile, the client always downloads the current descriptor so it will attempt to download the descriptor for time period #N which could have expired in theory.

This happens because we have an overlap period covering TP #N+1 from TP #N, but we don't have one that covers TP #N from TP #N+1.

I'm dumping these thoughts here. Depending on my schedule today I might or might not be able to fix these on time.

Last edited 3 years ago by asn (previous) (diff)

comment:5 Changed 3 years ago by asn

OK based on the comment above, I present a less-radical branch at bug23387_asn. This branch does not change the way overlap periods are detected, but still only rotates descs when exiting overlap periods. It also uses valid_after time when computing hsdir indices, and time period nums. Seems to work fine in my testing so far.

I'm now writing a unittest that will test various cases with timing on the client and service. It's pretty advanced so I'll probably finish it tomorrow or the day after.

FWIW, point (c) from above is not handled by the branch above, and might require a ticket of its own.

Last edited 3 years ago by asn (previous) (diff)

comment:6 Changed 3 years ago by asn

Status: needs_revisionneeds_review

OK please check latest bug23387_asn. It also contains a unittest that should provide some more assurance of these weird edge cases.

comment:7 Changed 3 years ago by asn

Pushed spec patch at bug23387. This is needed to better understand the code patch that will soon be posted here.

comment:8 Changed 3 years ago by asn

Status: needs_reviewmerge_ready

Also pushed code branch at bug23387 which is ready for Nick's eyes. Please use the spec branch above as a companion to this patch when reviewing. Commit messages + the spec branch should provide enough motivation for the code changes.

Furthermore, the HS reachability test is pretty awesome and gives good guarantees of the reachability enhancements we designed.

If you have any questions don't hesitate to ask them since this is a non-trivial change.

comment:9 Changed 3 years ago by asn

comment:10 Changed 3 years ago by nickm

Spec branch merged. Thanks for the extra diagrams, dgoulet! Now on to the code...

comment:11 Changed 3 years ago by nickm

Keywords: review-group-23 added

Put 0.3.2 needs_review and merge_ready tickets into review-group-23.

comment:12 Changed 3 years ago by asn

Addressed Nick's review! Thanks!

comment:13 Changed 3 years ago by nickm

Looks good to me.

comment:14 Changed 3 years ago by asn

OK. Pushed another commit which tests that upload_descriptor_to_all()
is in synch with pick_hsdir_v3() for various scenarios. This should give some more confidence and also unittest pick_hsdir_v3().

I think this is good to merge now!

comment:15 Changed 3 years ago by nickm

Resolution: fixed
Status: merge_readyclosed

merging! Please close the other tickets that are fixed by this branch.

Note: See TracTickets for help on using tickets.