Opened 3 years ago

Closed 2 years ago

#16943 closed enhancement (fixed)

Implement prop250 (Random Number Generation During Tor Voting)

Reported by: asn Owned by: dgoulet
Priority: High Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, review-group-3, review-group-4
Cc: teor Actual Points:
Parent ID: #8244 Points: 6
Reviewer: nickm Sponsor: SponsorR-must

Description

This is the ticket for implementing proposal 250: the proposal for having the Tor dirauths collectively generate a fresh random value every day.

Child Tickets

Ticket   Type    Status  Owner   Summary
#17349   defect  closed          Create an ed25519 shared randomness key for dirauths
#17435   defect  closed  asn     Patch dir-spec with the shared randomness info
#19006   defect  closed          [prop250] Pointer corruption and other failures in master and maint-0.2.8
#19012   defect  closed  nickm   Refactor code that looks at voted-on parameters during voting
#19124   defect  closed          Shared Random and Half-Hour Consensuses
#19125   defect  closed  teor    [prop250] Fix a time parsing error on platforms with 32 bit time_t
#19126   defect  closed          Consistently use uint64_t for integers in shared random structs
#19127   defect  closed          Don't crash authorities with more than 254 shared random reveals
#19132   defect  closed          shared random: missing field 'fetch_missing_votes' initializer
#19134   defect  closed          Shared Random: INT_8 means 8 bytes, not 8 bits
#19195   defect  closed          [prop250] Remove unnecessary assertion in sr_compute_srv
#19196   defect  closed          [prop250] Shared random version parsing fails on 32 bit

Change History (87)

comment:1 Changed 3 years ago by asn

OK. Some very initial work can be found at shared-random-refactor1 in my repo.

The main contribution here is refactoring dirvote_compute_consensuses(). This is the function that will eventually create the SR doc, and it was a big clusterfuck. The first three commits of my branch try to refactor it in a way that will allow us to add code to it without complicating it even more.

The final commit (80bd3da) adds in some of the SR stuff that we wrote during camp, plus a few more things. These are just ideas and scaffolding, so don't take it too seriously.
However, I didn't manage to proceed with this too much because the consensus signature code is very thick. It seems to me that more refactoring is needed.

The main problem is that when dirvote.c thinks consensus, it actually thinks of networkstatus. So all consensus-related functions are created with networkstatus documents in mind, which is totally different from the SR doc. You can see this by looking at how pending_consensus_t is used around the code, and also by taking a look at the following functions:

networkstatus_add_detached_signatures()
dirvote_fetch_missing_signatures()
dirvote_add_signatures_to_pending_consensus()
dirvote_add_signatures_to_all_pending_consensuses()
dirvote_add_signatures()
networkstatus_check_consensus_signature()

Also, grep for N_CONSENSUS_FLAVORS to see the various places where tor loops over all the flavors and expects to find networkstatus documents underneath. All the above functions will need to be refactored, or their callers will need to handle the SR doc totally differently.

We will need to find a nice solution to this. We could, for example, introduce a new data structure called consensus_t that is a union hosting either a networkstatus or an SR doc, and then write various methods for interacting with it. I made a start on this, but it required shitloads of changes all around the codebase, so I dropped it.
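For concreteness, here is a minimal sketch of what that tagged-union consensus_t could look like. All type, field, and function names here are illustrative, not actual Tor code:

```c
#include <stddef.h>

/* Hypothetical stand-ins for Tor's real document types. */
typedef struct networkstatus_t { int n_routers; } networkstatus_t;
typedef struct shared_random_doc_t { int n_commits; } shared_random_doc_t;

typedef enum { FLAV_NS, FLAV_SR } consensus_flavor_t;

/* A consensus_t wrapping either document, as suggested above. */
typedef struct consensus_t {
  consensus_flavor_t flavor;
  union {
    networkstatus_t *ns;
    shared_random_doc_t *sr;
  } doc;
} consensus_t;

/* Accessors that dispatch on the flavor; NULL on a flavor mismatch. */
static networkstatus_t *
consensus_get_ns(const consensus_t *c)
{
  return (c->flavor == FLAV_NS) ? c->doc.ns : NULL;
}

static shared_random_doc_t *
consensus_get_sr(const consensus_t *c)
{
  return (c->flavor == FLAV_SR) ? c->doc.sr : NULL;
}
```

Callers that don't care about the flavor pass consensus_t around; flavor-specific code calls the accessor and handles NULL.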

A lame fix here would be to just stuff all the SR info into a networkstatus_t, which might solve a few problems, but it's terrible.

Seems like the "Ah, let's just use a new consensus flavor. What can go wrong." idea is starting to show its complexity :)

comment:2 Changed 3 years ago by asn

So David and I discussed a possible refactoring here that would allow us to introduce SR doc logic to functions like networkstatus_add_detached_signatures().

The idea is that we will make a new data structure called consensus_t which will be used in places where we don't care about the specific consensus flavor. This data structure will be able to contain a networkstatus_t or a shared_random_doc_t or any other weird flavor we come up with in the future.

Then we can have some methods for consensus_t which, based on the flavor, return the appropriate data structure.

Then we will need to refactor all the functions mentioned in the previous comment (and a few more), to be able to do their thing on a networkstatus_t but then also do a similar thing for shared_random_doc_t. This will require lots of hairy code to be refactored which is scary, but doable.

Any opinions on this strategy?

comment:3 Changed 3 years ago by asn

OK. I did some work on figuring out the current phase during voting, and on generating commitments/reveals. You can find it here:
https://gitweb.torproject.org/user/asn/tor.git/log/?h=prop250-voting-day1

comment:4 Changed 3 years ago by asn

OK, second day progress at branch prop250-voting-day2:
https://gitweb.torproject.org/user/asn/tor.git/log/?h=prop250-voting-day2

This branch can do the following things:

  • Authorities can keep track of the protocol run. Authorities know which protocol phase they are in at any given time, both in the real network and in a testing network. (Unit tests included for the real network.)
  • Authorities know when a protocol run starts and can do some basic housekeeping, like wiping counters and generating fresh commitment values for themselves.
  • Authorities can generate actual reveal values. Authorities can generate some dummy commitment values.
  • Initial work has started on encoding commits/reveals to votes. Scaffolding work has also started on parsing commits/reveals from votes.
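The phase tracking in the first bullet could, as a rough sketch, boil down to something like the following. This is a simplification that assumes an hourly voting interval on the real network and a run starting at midnight, and ignores any testing-network start offset; the names mimic but are not the branch's actual code:

```c
#include <time.h>

typedef enum { SR_PHASE_COMMIT, SR_PHASE_REVEAL } sr_phase_t;

#define VOTING_INTERVAL 3600   /* one round per hour on the live network */
#define N_ROUNDS_PER_PHASE 12  /* 12 commit rounds, then 12 reveal rounds */

/* Return the phase for a given valid-after time (seconds since epoch).
 * The run starts at 00:00 UTC: commits for 12 hours, then reveals. */
static sr_phase_t
get_sr_protocol_phase(time_t valid_after)
{
  int round = (int)((valid_after / VOTING_INTERVAL) %
                    (2 * N_ROUNDS_PER_PHASE));
  return (round < N_ROUNDS_PER_PHASE) ? SR_PHASE_COMMIT : SR_PHASE_REVEAL;
}
```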

I also looked at your branch, David. I see you have code for parsing the binary representation of commitments and reveals, and also code for parsing the state file.

I noticed that the state file parsing code is quite similar to the code that we will have to write for parsing commits/reveals from votes, but with statefile code patterns instead of routerparse code patterns.

I'm talking about this type of code:

+  alg = crypto_digest_algorithm_parse_name(value);
+  if (alg == -1 || alg != SR_DIGEST_ALG) {
+  }
+  value = smartlist_get(args, 1);
+  if (digest256_from_base64(identity, value) < 0) {

It would be nice to be able to write utility parsing functions that can be used
by both subsystems so that we don't have duplicate code. I wrote some skeleton
code for the routerparse.c code in case you want to think more of how we could
merge those two logics. Check out commit 39cb7ab.
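A minimal sketch of what such a shared helper could look like, assuming both subsystems hand it the already-tokenized algorithm name and base64-encoded identity. The function name, constant names, and lengths here are illustrative, not Tor's actual API:

```c
#include <string.h>

/* Hypothetical shared helper: the state-file parser and the vote
 * parser both need "algorithm name + encoded digest" validation,
 * so factor the common checks out of both subsystems. */

#define SR_DIGEST_ALG_NAME "sha3-256"
#define SR_B64_IDENTITY_LEN 43  /* unpadded base64 of a 32-byte digest */

/* Return 0 if the algorithm name and encoded identity look valid,
 * -1 otherwise. Same checks whichever subsystem calls us. */
static int
sr_parse_alg_and_identity(const char *alg_name, const char *b64_identity)
{
  if (!alg_name || !b64_identity)
    return -1;
  if (strcmp(alg_name, SR_DIGEST_ALG_NAME) != 0)
    return -1;
  if (strlen(b64_identity) != SR_B64_IDENTITY_LEN)
    return -1;
  return 0;
}
```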

Cheers!

comment:5 Changed 3 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.8.x-final

comment:6 Changed 3 years ago by nickm

Sponsor: SponsorR?

comment:7 Changed 3 years ago by dgoulet

Sponsor: SponsorR? → SponsorR

comment:8 Changed 3 years ago by dgoulet

Points: large

comment:9 Changed 3 years ago by asn

Hello,

please see my branch prop250-voting-day5 for further work on
generating commitments and reveals. This is the first part of
switching from the dummy integer commitments/reveals of
prop250-voting-day4, to the signature-based commitments of prop250.

This is still work in progress, but commit e3b6f4f introduces a very
important unit test that shows a skeleton of how commitments and
reveals are supposed to be generated, parsed and verified. Let me know
if this logic seems acceptable to you, or we should change it.
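The commit/reveal relationship being exercised by that test can be sketched as follows. Note this toy uses 64-bit FNV-1a purely as a stand-in for the real hash, and all names are made up for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy FNV-1a hash, a stand-in for the real H() -- for illustration
 * only; the actual scheme uses a cryptographic hash. */
static uint64_t
toy_hash(const uint8_t *buf, size_t len)
{
  uint64_t h = 0xcbf29ce484222325ULL;
  for (size_t i = 0; i < len; i++) {
    h ^= buf[i];
    h *= 0x100000001b3ULL;
  }
  return h;
}

typedef struct sr_commit_t {
  uint64_t hashed_reveal;  /* H(reveal), published in the commit phase */
  uint64_t reveal;         /* random value, published in the reveal phase */
} sr_commit_t;

/* Commit phase: only H(reveal) is made public. */
static void
generate_sr_commitment(sr_commit_t *commit, uint64_t random_value)
{
  commit->reveal = random_value;
  commit->hashed_reveal =
    toy_hash((const uint8_t *)&commit->reveal, sizeof(commit->reveal));
}

/* Reveal phase: check the revealed value against the earlier commit. */
static int
verify_commit_and_reveal(const sr_commit_t *commit)
{
  return toy_hash((const uint8_t *)&commit->reveal,
                  sizeof(commit->reveal)) == commit->hashed_reveal;
}
```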

To move forward now, we need to introduce the SR-specific ed25519 auth
key and include its certificate in votes, so that we stop using the
ancient RSA fingerprints and can start verifying signatures.

Tasks that need to be done next:

  • Create and start using the ed25519 SR key and certificate chains. Needs proposal!
  • Include the shared secret in the consensus.
  • Use weasel's algorithm in get_testing_network_protocol_phase().
  • Implement the conflict line (see add_conflict_to_sr_state()).
  • Implement the maintenance function and has_majority logic (see state_phase_transition()).
  • Plug ALL the memory leaks (many!)
  • Fix the disk state (but let's do this later so that all the data structures have stabilized).

[FWIW I left the stupid integer-based commits/reveals as
generate_sr_commitment_stupid() in case we want to do some chutney
testing. Eventually, generate_sr_commitment_stupid() will be replaced
by generate_sr_commitment().]

Cheers!

comment:10 in reply to:  9 Changed 3 years ago by asn

Severity: Blocker

Replying to asn:

Hello,

Tasks that need to be done next:

  • Create and start using the ed25519 SR key and certificate chains. Needs proposal!

FWIW, I made ticket #17349 specifically for this task.

comment:11 Changed 3 years ago by asn

I worked a bit on the outdated spec! Please see branch prop250-asn in my torspec repo!

Some major mods are:

  • I described how to get the ed25519 shared randomness keys from the votes.
  • I changed the protocol to start at midnight and last the whole day!
  • I fixed the majority section and the agreement section.

Please see and add your own changes!

comment:12 Changed 3 years ago by asn

Hello,

please see prop250-general-day1 for the latest branch here.

Changes since last time:

  • Calculates shared random value normally and puts it in consensus.
  • Uses ed25519 shared randomness keys.
  • Improved (but not yet ready) disk state logic.
  • Fixed the broken unittests.
  • Uses weasel's algorithm.

comment:13 Changed 3 years ago by asn

OK, here is the latest state of the game here.

In my torspec branch prop250-nosrkeys-v2 you can find prop250 without the SR keys as discussed here: https://lists.torproject.org/pipermail/tor-dev/2015-November/009875.html

In my tor branch prop250-nosrkeys-day1 you can find the corresponding code, where we use the ed25519 master key for referencing, and the RSA identity key to detect multiple commits by a single authority. I also fixed the get_state_valid_until_time() function and wrote some unittests.

I think the next step in our simplification process here is to refactor the decide() step which is not really useful anymore now that we don't do majority or conflicts. I imagine that we could make it so that there is no decide step or voted_commits, and we just move valid authoritative commits directly to the sr state during vote parsing.

comment:14 Changed 3 years ago by asn

Proposal: Please find a more refined version of the proposal at my torspec branch prop250-final-v2:
https://gitweb.torproject.org/user/asn/torspec.git/log/?h=prop250-final-v2

Code: Branch prop250-nosrkeys-day2 in my repo contains the latest code. It fixes some annoying bugs that were making testing harder.

Last edited 3 years ago by asn (previous) (diff)

comment:15 Changed 3 years ago by dgoulet

Latest branch with code rebased on master and improvement: prop250_v4.

This branch has few new things.

  • A random number is now always hashed H(RN) so we avoid leaking bytes from our PRNG to the network.
  • Add SR flag to the vote indicating if the vote participates in the protocol
  • Parse votes and populate an object in the networkstatus_t object. Once done, we only handle those commits once we are sure a vote is valid and from a known authority (we weren't doing that before).
  • Refactor some functions to be able to support the last bullet above and improve naming on some functions.

A few things are still missing that I'll get to:

  • Consensus params for the number of participants.
  • Torrc option to disable SR support (basically making the dirauth not add the SR flag to the vote).
  • Some issues with the disk state, also because we now rely on the RSA fingerprint of a dirauth.

comment:16 in reply to:  15 ; Changed 3 years ago by asn

Replying to dgoulet:

Latest branch with code rebased on master and improvement: prop250_v4.

Great! Please also check my prop250_v4 branch. Based on yours with a few mods.

Specifically:

  • Some small code tweaks and variable/function renames.
  • I removed the disaster formula. It's not useful anymore. Please see commit message for more info.
  • Made the vote parsing logic a bit more robust, I hope.
  • I renamed and changed the semantics of NumSRParticipants. Now, instead of using it just to count the number of participants, we use it to reject SRVs that have been voted by less than NumSRParticipants dirauths. This makes us more robust against edge cases like the one where the 5th SR dirauth appears during the reveal phase and calculates a different SRV than the other 4.
  • Also, now on the majority calculation, I use the total number of dirauths instead of the number of voters. I did this to mirror the way the n_required works in networkstatus_check_consensus_signature().
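Those last two bullets together could be sketched like this: count agreements against the total number of dirauths (mirroring n_required in networkstatus_check_consensus_signature()), using a majority threshold as the example default. Names are made up for illustration:

```c
/* Majority over the total number of dirauths, not just the voters,
 * mirroring how n_required works for consensus signatures. */
static int
get_n_voters_required(int total_authorities)
{
  return (total_authorities / 2) + 1;
}

/* Keep an SRV only if enough authorities voted for it. This rejects
 * e.g. the SRV a late 5th SR dirauth computes on its own during the
 * reveal phase, differing from the other 4. */
static int
should_keep_srv(int num_agreements, int total_authorities)
{
  return num_agreements >= get_n_voters_required(total_authorities);
}
```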

Let me know if you find these things reasonable!!!

More future things:

  • We should start actually voting on the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths as it is now.
  • Also, I've been getting these interesting log messages for a while now:
    [SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
    
    but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.
Last edited 3 years ago by asn (previous) (diff)

comment:17 in reply to:  16 Changed 3 years ago by dgoulet

Replying to asn:

Great! Please also check my prop250_v4 branch. Based on yours with a few mods.

Specifically:

  • Some small code tweaks and variable/function renames.

Commit: d653f8cc17d6e278e16acb8a3101a821f9f42f6e

- Macroify previous/current SRV strings.

Not sure I agree with that. I much prefer typed variables here instead of macros, for strings that we only use locally in this file. We get compiler type-checking that way, whereas macros are pretty yolo in terms of compiler warnings. Do you see a win for macros?

More future things:

  • We should start actually voting on the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths as it is now.

So you mean that before merging 250 we should make an ad hoc patch right now? Most dirauths won't upgrade until a stable release, and since we plan to have this in 028-stable, we can make this a big block.

Hrm, consensus params are options in a file for dirauths, so we have to decide on a "hard" number, which in this case would be 5. And we can see how the SR behaves over time and increase it if needed?

  • Also, I've been getting these interesting log messages for a while now:
    [SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
    
    but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.

Yeah, I have yet to understand this one, but I think that when the state file is read that option can be missing, and thus we get this warning...

comment:18 Changed 2 years ago by asn

Latest spec at prop250-final-v4. Should include everything we have discussed up to and including CCC.

comment:19 Changed 2 years ago by dgoulet

Priority: Medium → High
Severity: Blocker → Normal
Status: new → needs_review
Type: defect → enhancement

Hello! asn and I are very happy to present to you wonderful reviewers the implementation for proposal 250 along with the final specification:

Both branches, in tor.git and torspec.git: dgoulet/prop250_final_v1

Some notes: we've separated this into 7 commits prefixed with prop250:, except the first one, which adds a needed tor_htonll/ntohll function to the tor utils. This code is mostly contained in two *new* files (with their headers): shared-random.{c|h} and shared-random-state.{c|h}.

Our unit test code coverage:

    shared-random-state.c   92.8 %      376 / 405      65.3 %      109 / 167
    shared-random.c         85.1 %      326 / 383      66.9 %      101 / 151

Also, there are attacks on this protocol that we are well aware of, but all are easily detectable, so we've written a DocTor specification that atagar will help us deploy once this is merged.

https://storm.torproject.org/shared/kXO7N2oatC9RzRcZvfLNVaA1yNkZ6m5rODGUYEt08H6

Finally, we expect this code to run for a long time before the shared random values generated by the authorities are used; thus, for now, you will NOT find anything using them.

Please ask questions! This won't be that trivial to review :).

comment:20 Changed 2 years ago by teor

Cc: teor added

comment:21 Changed 2 years ago by dgoulet

Cc: dgoulet removed

After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.

Please find the latest code here: prop250_final_v3

Proposal 250 has been updated with the latest in torspec.git

Last edited 2 years ago by dgoulet (previous) (diff)

comment:22 in reply to:  21 ; Changed 2 years ago by teor

Replying to dgoulet:

After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.

Please find the latest code here: prop250_final_v3

Proposal 250 has been updated with the latest in torspec.git

Code review:

General comments:

Every other multi-word file name in src/or uses underscores.

If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?

There are a lot of missed opportunities for const pointer arguments.

e282227ebc03ae71f1cd6490a2adf2058140c66e:

If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.
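As an aside, a fully portable alternative is to serialize byte-by-byte in network order, which needs neither the runtime htonl test nor the #ifdef. A sketch (this is not Tor's actual tor_htonll implementation; names are illustrative):

```c
#include <stdint.h>

/* Write a 64-bit value in network (big-endian) byte order, one byte
 * at a time. Correct on either endianness with no feature test. */
static void
set_uint64_be(uint8_t *out, uint64_t v)
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint8_t)(v >> (8 * (7 - i)));
}

/* Read a 64-bit big-endian value back out. */
static uint64_t
get_uint64_be(const uint8_t *in)
{
  uint64_t v = 0;
  for (int i = 0; i < 8; i++)
    v = (v << 8) | in[i];
  return v;
}
```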

72baec6805aaab8d62f2deed8fbc658c82a7d086:

shared-random-state.c:

If we use an asymmetric magic number, we can detect when the two halves are swapped.

#define SR_DISK_STATE_MAGIC 42424242

A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?

#define SHARED_RANDOM_N_ROUNDS 12

get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:

time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,
                                                       interval,
                                                       options->TestingV3AuthVotingStartOffset);

get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)

Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase

int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;

get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:

current_round = (beginning_of_current_round / voting_interval) % total_rounds;

get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
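The refactor being suggested could be as small as one shared helper that every caller goes through, so the offset can only be forgotten in one place. A sketch, with constant and option names following the comments above (not the branch's actual code):

```c
#include <time.h>

#define SHARED_RANDOM_N_ROUNDS 12
#define SHARED_RANDOM_N_PHASES 2

/* Single place to compute the round number from a valid-after time,
 * a voting interval, and the TestingV3AuthVotingStartOffset value,
 * so the offset can't be dropped by individual callers. */
static int
get_round_num(time_t valid_after, int voting_interval, int start_offset)
{
  int total_rounds = SHARED_RANDOM_N_ROUNDS * SHARED_RANDOM_N_PHASES;
  return (int)(((valid_after - start_offset) / voting_interval)
               % total_rounds);
}
```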

Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?

Why do we have disk_state_validate() and disk_state_validate_cb()?
Should one call the other? Should they check _magic?
Should they check ValidAfter < ValidUntil?

disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.

disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.

I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.

disk_state_parse_sr_values() can assert its arguments are not NULL.

disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.

In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?

disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.

I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)

We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.

In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.

Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.

shared-random-state.h:

I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?

/* Private methods (only used by shared-random.c): */

shared-random.c:

char b64_decoded[SR_COMMIT_LEN + 2];

Can we either fix #17868 as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?

#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];

Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.

Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?

reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?

In sr_parse_srv(), why do we use base16 for the SRV, but base64 for the commits?

In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.

I have 5 more commits to review. (Please add fixups to the end of the branch so I don't lose my place.)

comment:23 in reply to:  22 ; Changed 2 years ago by dgoulet

Replying to teor:

Thanks for that teor!

New branch in prop250_final_v4.

Every other multi-word file name in src/or uses underscores.

Fixed. Extra commit here since this applies to almost all commits.

If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?

There are a lot of missed opportunities for const pointer arguments.

Fixed most of those. See "fixup" commits

e282227ebc03ae71f1cd6490a2adf2058140c66e:

If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.

Fixed!

72baec6805aaab8d62f2deed8fbc658c82a7d086:

shared-random-state.c:

If we use an asymmetric magic number, we can detect when the two halves are swapped.

#define SR_DISK_STATE_MAGIC 42424242

Fixed

A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?

#define SHARED_RANDOM_N_ROUNDS 12

The part I don't like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge, in a separate ticket.

get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:

time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
                      options->TestingV3AuthVotingStartOffset);

get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)

Changing this part of the code after the extensive testing we did could be very delicate. asn is looking at a possible solution.

Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase

int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;

Definitely better.

get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:

current_round = (beginning_of_current_round / voting_interval) % total_rounds;

get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.

Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?

No; a valid_after time that is now is valid, and that is what we set at startup, since we have no valid_after time until the first consensus.

Why do we have disk_state_validate() and disk_state_validate_cb()?

We must have a callback for the "state file" API; NULL is not accepted. Furthermore, if the callback returns an error, right now tor does a wonderful assert(0)... so it has been ignored.

Should one call the other? Should they check _magic?

This is done with CONFIG_CHECK(), which is called multiple times when you use config_assign().

Should they check ValidAfter < ValidUntil?

Fixed! (But the correct check is >= here :)

disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.

Fixed

disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.

Fixed

I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.

Fixed with the commit above.

disk_state_parse_sr_values() can assert its arguments are not NULL.

Fixed with commit above

disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.

Fixed.

In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?

Indeed, I think you are right. We do put _our_ commit on disk during the commit phase, so better to clean up memory here to be safe.

disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.

Fixed.

I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)

Not much. We avoid an alloc/free every time we update the state, which happens often during a voting period. (Probably not enough to be noticeable, but still.)

We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.

We assume that everything in our memory state is coherent, so the disk update can only use values that are usable. I think it's fine: if the memory state has crazy stuff in it, things will go bad whether or not we put it on disk.

In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.

I've added a needed memwipe in reveal_encode, since we do encode _our_ commitment during the commit phase, so the reveal value should only be in the commit object's memory and not on the stack.

Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.

For the get_, yes, I fixed it, but not for the set_, since digestmap doesn't handle a const.

shared-random-state.h:

I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?

/* Private methods (only used by shared-random.c): */

Ah, by that we mean methods that are _ONLY_ used by shared_random.c, so "private" to the subsystem, but not static because they need to be visible.

shared-random.c:

char b64_decoded[SR_COMMIT_LEN + 2];

Can we either fix #17868 as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?

#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];

I would really rather fix #17868 than add a temporary workaround macro. You'll notice that there is still an XXX: at that place; it's there to flag the obvious badness of the +2.

Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.

Hrm, both the code and the comment say the same thing, or am I missing something?

Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?

We prefer not to, since this is data from the network and so potentially untrusted.

reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?

Plausible. I would do that in a later iteration, though. Having both separated gives us more room for better logging, and if one of them changes in some way, we can adapt easily.

In sr_parse_srv(), why do we use base16 for the SRV, but base64 for the commits?

No reason... We'll make it base64.

In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.

We do, and the top comment of the function details the order of the args. What extra arguments did you have in mind?

I have 5 more commits to review. (Please add fixups to the end of the branch so I don't lose my place.)

comment:24 in reply to:  22 ; Changed 2 years ago by asn

Replying to teor:

Replying to dgoulet:

After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.

Please find the latest code here: prop250_final_v3

Proposal 250 has been updated with the latest in torspec.git

Code review:

General comments:

get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:

time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
                      options->TestingV3AuthVotingStartOffset);

get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)

Hmm, I would like to do that, but it doesn't seem so easy. dirvote_recalculate_timing() has not been called by the time we call state_new(), since we call that at boot, and we haven't voted yet at that point.

I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure it's actually correct. That function updates the global voting_schedule structure, which is then used by other parts of the codebase, and I'm not sure that initializing it with 'now' set to the next 'valid_after' would be correct here :x

comment:25 in reply to:  23 ; Changed 2 years ago by teor

Replying to dgoulet:

Replying to teor:

Thanks for that teor!

New branch in prop250_final_v4.

A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?

#define SHARED_RANDOM_N_ROUNDS 12

The part I don't like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge, in a separate ticket.

Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)

Opened #18295 for this.

get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:

current_round = (beginning_of_current_round / voting_interval) % total_rounds;

get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.

Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.

shared-random.c:

char b64_decoded[SR_COMMIT_LEN + 2];

Can we either fix #17868 as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?

#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];

I would really rather fix #17868 than add a temporary workaround macro. You'll notice there are still XXX: comments at that spot, flagging the obviously bad +2.

Made #17868 a child ticket of this ticket.

Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.

Hrm, both the code and the comment say the same thing, or am I missing something?

The commit_decode() function header comment says:

base64-encode( H(REVEAL) || TIMESTAMP )

The code says:

  /* First is the timestamp (8 bytes). */
  commit->commit_ts = (time_t) tor_ntohll(get_uint64(b64_decoded));
  offset += sizeof(uint64_t);
  /* Next is hashed reveal. */
  memcpy(commit->hashed_reveal, b64_decoded + offset,
         sizeof(commit->hashed_reveal));

Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?

We prefer not to, since this is data from the network and therefore potentially untrusted.

OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN?
Can we make that explicit in a comment?

reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?

Plausible. I would do that in a later iteration, though. Having them separate gives us more room for better logging, and if one of them changes in some way, we can adapt easily.

The risk of leaving them duplicated is that bugs get fixed in one and not the other.
Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.

This has already happened to these two functions:

  • commit_decode() uses offset, reveal_decode() uses a hard-coded 8.
  • the "too small" warning in reveal_decode() is less detailed, but could be identical to commit_decode()

So I have a more radical suggestion for you:

  • you have two groups of 3 struct fields that are the same types, and have the same operators performed on them;
  • abstract these 3 fields into their own struct type (ts, hashed, encoded)
  • add operations for encoding and decoding this type (and perhaps, turning this authority's reveal into its commit)
  • refactor the existing code to call these operations

That way, the interface stays the same, you can add any extra logging you need in the commit- or reveal-specific functions, but you get the benefit of using the same code for encoding/decoding commits and reveals.

In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.

We do, and the comment at the top of the function details the order of the args. What extra argument do you have in mind?

If an authority supplies 5 arguments in its vote, we will ignore the fifth.
There's no comment documenting whether this is intentional or not.
And should we ignore the entire line instead?

comment:26 in reply to:  24 ; Changed 2 years ago by teor

Replying to asn:

Replying to teor:

Replying to dgoulet:

After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.

Please find the latest code here: prop250_final_v3

Proposal 250 has been updated with the latest in torspec.git

Code review:

General comments:

get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:

time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1, interval,
                      options->TestingV3AuthVotingStartOffset);

get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)

Hmm, I would like to do that but it doesn't seem so easy. dirvote_recalculate_timing() has not been called when we call state_new(), since we call it when we boot up and we haven't voted yet at that point.

I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure if it's actually correct. That function updates the global voting_schedule structure which is then used by other parts of the codebase and I'm not sure if initializing it with 'now' set to the next 'valid_after' would be correct here :x

Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?

My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.

Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.

comment:27 Changed 2 years ago by teor

This is my 3rd comment in a row. Please read all 3 comments.

Reviewing 3406980ed700cc7fcd95b9b3774256493805057b:

Nitpick: why add a phase_str for "unknown" when you assert rather than use it?

commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.

It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.

I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).

reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.

sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.

sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.

tor_assert(reveal_num < UINT8_MAX);

Is this OK?
It should be documented somewhere.

In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.

In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?

47ade62b1811c16c28d73ebe59f47daa16cd4522:

AuthDirSharedRandomness needs a tor manual page entry.

There are missed opportunities for const pointers in this commit.

srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.

get_majority_srv_from_votes() could be significantly simplified by adding to container.c:

/** Return the most frequent member of the sorted list of DIGEST256_LEN
 * digests in <b>sl</b>. If count_out is non-NULL, set it to the count
 * of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
  return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}

Then using:

  int count = 0;
  value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
  /* Was this SRV voted by enough auths for us to keep it? */
  if (!should_keep_srv(count)) {
    goto end;
  }

Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.

8e180aa061f0ea39968f05c85796b26f10a81744:

extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.

Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?

  /* A new consensus has been created: pass it to the shared random subsystem
     to update the SR state. */
  sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);

extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.

Let's also make sure we don't assert() on untrusted input in other parsing functions.

extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?

verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.

In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)

8f3f5667754037cccd758eb8631636e2e78fed93:

Looks ok.

3f2014d7aabe2e10163f4f44a8ca46fd5a91f04a:

I'm not going to review the unit tests in detail. (We generally accept unit tests if they work, right?)

src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.

I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.

Five times:

shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib             	0x00007fff975b06e7 abort + 129
3   test                          	0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4   test                          	0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5   test                          	0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6   test                          	0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7   test                          	0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8   test                          	0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9   test                          	0x00000001019df43f main + 639 (testing_common.c:298)

One time:

shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib             	0x00007fff975b06e7 abort + 129
3   test                          	0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
4   test                          	0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
5   test                          	0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
6   test                          	0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
7   test                          	0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
8   test                          	0x00000001019df43f main + 639 (testing_common.c:298)

And I also see 12/720 TESTS FAILED. (1 skipped).

The dgoulet/prop250_final_v4 branch also has similar unit test issues.

Now on to the extra commits in dgoulet/prop250_final_v4:
There are twelve extra commits in this branch.

625e08167d9d8e855ddefa45f8d9902f65122f1f:
0308e7bf92c3699fa7e88140ae5037d8bc0d480b:
7bb6eac9fb6af2c13f61e8ddb9e9c741fa4736d1:
fe710d15af7709f9d88ae9d8b5e451e5b849983b:

I trust the compiler to warn about consts in the wrong places.

18771227176e6f634540d483b8c1b1a6501aec14:
75a243b9aad489929d8ed8273920fcf7fc295ebb:
50ab3ae82b8248a45203c4ee9fc6be5d18522ac8:
c0401326224a994dfe35e39da4aa8849ac053ee3:

Look good.

e5613fad328e78b2477d8e848bb163aa8dcb1e8b:

Thanks for removing the duplicate code here.

08beead18ed2bfdc8f09b17e7628ac9997222e3a:
10ccf50d599896feaf48bf0aecf4a1c2da1fea45:

Better safe than sorry.

c219b23d47e9c742532354eedae10ca0509a1d78:

One I didn't pick up, thanks for fixing it.

I've finished reviewing the code in dgoulet/prop250_final_v4.
Over to you to respond to my comments 25, 26 & 27.

I'd still like to run the SRV code in chutney, is there a chutney template for that?
(I think I'll need to wait for the unit tests to be fixed first.)

comment:28 in reply to:  25 Changed 2 years ago by dgoulet

Replying to teor:

Replying to dgoulet:

Replying to teor:

Thanks for that teor!

New branch in prop250_final_v4.

A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?

#define SHARED_RANDOM_N_ROUNDS 12

The part I do not like about changing this value for test networks is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge, in a separate ticket.

Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)

Opened #18295 for this.

I'll comment on that ticket from now on. Thanks! :)

get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:

current_round = (beginning_of_current_round / voting_interval) % total_rounds;

get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.

Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.

I'll wait on asn's work on this before refactoring anything.

Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?

We prefer not to, since this is data from the network and therefore potentially untrusted.

OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN?
Can we make that explicit in a comment?

I think this check at the start of the function should catch it. Past that, there can't be anything past SR_COMMIT_LEN, I think:

if (strlen(encoded) > SR_COMMIT_BASE64_LEN)

reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?

Plausible. I would do that in a later iteration, though. Having them separate gives us more room for better logging, and if one of them changes in some way, we can adapt easily.

The risk of leaving them duplicated is that bugs get fixed in one and not the other.
Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.

This has already happened to these two functions:

  • commit_decode() uses offset, reveal_decode() uses a hard-coded 8.

The hardcoded 8 is _BAD_... that is a good find.
[snip]
I'll work on something here.

In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.

We do, and the comment at the top of the function details the order of the args. What extra argument do you have in mind?

If an authority supplies 5 arguments in its vote, we will ignore the fifth.
There's no comment documenting whether this is intentional or not.
And should we ignore the entire line instead?

True. We should check whether we exceed the accepted number of arguments and, if so, reject the line.

I'll apply the fixes above and submit them in the next comment. I would like to avoid multiple threads in this ticket :).

comment:29 in reply to:  26 Changed 2 years ago by asn

Replying to teor:

Replying to asn:

Replying to teor:

Replying to dgoulet:

After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.

Please find the latest code here: prop250_final_v3

Proposal 250 has been updated with the latest in torspec.git

Code review:

<snip>

get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)

<snip>

Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?

My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.

Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.

OK, I implemented the above logic in my branch prop250_final_v4:
https://gitweb.torproject.org/user/asn/tor.git/log/?h=prop250_final_v4

The refactoring was a bit messier than I expected, but the branch seems to work in chutney. The tests still pass without any changes, so that's good.

I also moved get_next_valid_after_time() into dirvote.c, as you suggested.

comment:30 Changed 2 years ago by teor

Status: needs_review → needs_revision

I've reviewed asn's prop250_final_v4 up to ab5a2ffed4fa02d3ac8a2e46b8f3c38a318a7d3a.
Thanks for the refactoring, I am much more comfortable that the new timing code uses the same values as the existing timing code.

Still waiting to hear about the unit tests failures and chutney template.

comment:31 Changed 2 years ago by dgoulet

Status: needs_revision → needs_review

Replying to teor:

This is my 3rd comment in a row. Please read all 3 comments.

OK, FYI I've autosquashed the fixups from the last comment. It took me a while
to get it done cleanly, but now it's done. You now have a series of fixups for
this comment.

See branch: prop250_final_v5

Reviewing 3406980ed700cc7fcd95b9b3774256493805057b:

Nitpick: why add a phase_str for "unknown" when you assert rather than use it?

get_phase_str() uses the enum value as the index in the phase_str array.
"unknown" is there as a taint if we ever use an uninit phase value.

commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.

I've refactored that a bit to print the encoded value instead, so we can
match it against the strings in votes or other logs that print the encoded form.

It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.

Fixed (for both this and the commit_log)

I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).

reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.

I've discussed this with asn, and after coding it two different ways, we don't
feel comfortable with either, for two reasons. First, neither option makes the
code very clean, tbh. Second, reveal and commit values are two logically
different objects that just happen to share an encoding format, so we are fine
leaving them as they are: it's only a little code duplication, and they are
simply different things...

sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.

Fixed! (3 different commits to be squashed on different master commit)

sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.

tor_assert(reveal_num < UINT8_MAX);

Is this OK?
It should be documented somewhere.

This is indeed a limitation. Proposal 250 does force INT_1(REVEAL_NUM). I've
pushed an update to torspec mentioning the limitation.

In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.

Good catch! Fixed! (two fixup commits)

In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?

Hrm... an assert sounds a bit scary, since it would just kill an authority.
Most of the tor code simply ignores the return value. So I'm guessing an empty
SRV is valid here in case of an epic failure like that, and the authority
should recover after a while.

47ade62b1811c16c28d73ebe59f47daa16cd4522:

AuthDirSharedRandomness needs a tor manual page entry.

Fixed in b37214db72d42af8b25f7f87a5d3c6894c032115.

There are missed opportunities for const pointers in this commit.

All fixed (hopefully).

srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.

Fixed.

get_majority_srv_from_votes() could be significantly simplified by adding to container.c:

/** Return the most frequent member of the sorted list of DIGEST256_LEN
 * digests in <b>sl</b>. If count_out is non-NULL, set it to the count
 * of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
  return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}

Then using:

  int count = 0;
  value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
  /* Was this SRV voted by enough auths for us to keep it? */
  if (!should_keep_srv(count)) {
    goto end;
  }

Indeed. I've created smartlist_get_most_frequent_digest256_ and reworked
that function accordingly. Fixed.

Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.

Should we really continue that "tradition"? :)

8e180aa061f0ea39968f05c85796b26f10a81744:

extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.

Fixed.

Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?

  /* A new consensus has been created: pass it to the shared random subsystem
     to update the SR state. */
  sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);

It's the consensus that clients use, so we use it to set our SRV value
post-consensus, giving us the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?

extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.

Let's also make sure we don't assert() on untrusted input in other parsing functions.

Indeed, very bad. Fixed!

extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?

sr_info in the networkstatus_t structure is allocated with the ns, so on
error the parsing stops and the object disappears on its own.

    if (extract_shared_random_srvs(ns, tokens) < 0) {
      log_warn(LD_DIR, "SR: Unable to parse SRV(s)");
      goto err;
    }

verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.

Fixed.

In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)

True, it should be LD_DIR here. Fixed.

src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.

Yeah, that Python script is out of date, and SHA3 is not in Python's hashlib
as far as I know, so a fix is pending until we can do it in Python...

I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.

Five times:

shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib                 0x00007fff975b06e7 abort + 129
3   test                              0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4   test                              0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5   test                              0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6   test                              0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7   test                              0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8   test                              0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9   test                              0x00000001019df43f main + 639 (testing_common.c:298)

I think it's fixed. Since I can't reproduce, I hope you can confirm.

One time:

shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib                 0x00007fff975b06e7 abort + 129
3   test                              0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
4   test                              0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
5   test                              0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
6   test                              0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
7   test                              0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
8   test                              0x00000001019df43f main + 639 (testing_common.c:298)

This one is a bit more confusing to me. I don't see any possible way that
assert can be triggered, apart from _already_ having a commit in our state when
adding it. You'll have to tell me if you hit it again after these fixes; then
we'll probably have to dig deeper.

And I also see 12/720 TESTS FAILED. (1 skipped).

[snip]

I'd still like to run the SRV code in chutney, is there a chutney template for that?
(I think I'll need to wait for the unit tests to be fixed first.)

You can with that branch. We'll simply use the default values for the
supermajority. Grep the logs for the "SR:" prefix, and look at the sr-state file.

comment:32 Changed 2 years ago by dgoulet

Owner: set to dgoulet
Status: needs_review → assigned

comment:33 Changed 2 years ago by nickm

Status: assigned → needs_review

comment:34 in reply to:  31 ; Changed 2 years ago by teor

Status: needs_review → needs_revision

Replying to dgoulet:

Replying to teor:

This is my 3rd comment in a row. Please read all 3 comments.

OK, FYI I've autosquashed the fixups from the last comment. It took me a while
to get it done cleanly, but now it's done. You now have a series of fixups for
this comment.

See branch: prop250_final_v5

The last commit I reviewed was c219b23d47e9c742532354eedae10ca0509a1d78 in prop250_final_v4.
This appears to correspond to 4f5e212cbd34c8cfb701b25a007e88eacbac1f1f "Rename shared random files to use underscore" in prop250_final_v5, which includes an autosquash and a rebase to master.

I'm only going to comment on new commits, and any issues I still have with the old code.
(I'll remove any questions, fixes, and explanations I'm ok with.)

There are missed opportunities for const pointers in this commit.

All fixed (hopefully).

Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)

Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.

Should we really continue that "tradition"? :)

I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)

Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?

  /* A new consensus has been created: pass it to the shared random subsystem
     to update the SR state. */
  sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);

It's the consensus that client uses so we use that consensus to set our SRV
value post consensus thus having the same view as client. Does that make sense?
You think we should use FLAV_NS?

Most clients use the microdesc consensus, some still use NS.

I don't mind which consensus we use, but:

  • I think we should have a comment explaining the arbitrary choice, and
  • I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).

I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.

Five times:

shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib                 0x00007fff975b06e7 abort + 129
3   test                              0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4   test                              0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5   test                              0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6   test                              0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7   test                              0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8   test                              0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9   test                              0x00000001019df43f main + 639 (testing_common.c:298)

I think it's fixed. Since I can't reproduce, I hope you can confirm.

I still see 5 of these. One example is:

3   test                          	0x00000001053d4045 sr_generate_our_commit + 709 (shared_random.c:870)
4   test                          	0x00000001053d60f1 sr_state_update + 177 (shared_random_state.c:1068)
5   test                          	0x00000001053d67d3 sr_state_init + 483 (shared_random_state.c:1243)
6   test                          	0x00000001052c9573 test_sr_get_majority_srv_from_votes + 51 (test_shared_random.c:811)

One time:

shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
2   libsystem_c.dylib                 0x00007fff975b06e7 abort + 129
3   test                              0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
4   test                              0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
5   test                              0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
6   test                              0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
7   test                              0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
8   test                              0x00000001019df43f main + 639 (testing_common.c:298)

This one is a bit more confusing to me. I don't see any way that assert can be
triggered other than _already_ having the commit in our state when we add it.
You'll have to tell me if you hit it again after the fixes; if so, we'll
probably have to dig deeper.

I still see this one, same location, line numbers have changed slightly:

3   test                          	0x00000001053d6f4f commit_add_to_state + 175 (shared_random_state.c:212)
4   test                          	0x00000001053d5b50 sr_state_add_commit + 48 (shared_random_state.c:953)
5   test                          	0x00000001052cac6e test_state_transition + 574 (test_shared_random.c:1000)

And I also see 12/720 TESTS FAILED. (1 skipped).

Now I see 7/724 TESTS FAILED. (1 skipped).

6 of these are the crashes above, the final one is:

shared-random/state_update: [forking] 
  FAIL src/test/test_shared_random.c:1190: assert(state->n_commit_rounds == 1): 2 vs 1
  [state_update FAILED]

I'd still like to run the SRV code in chutney, is there a chutney template for that?
(I think I'll need to wait for the unit tests to be fixed first.)

You can with that branch. We'll simply use the default values for the
supermajority. Grep the logs for the "SR:" prefix and check the sr-state file.

OK, I'll wait to use chutney until the unit tests pass with no crashes.

I'm going to look into those crashes myself. If I find anything, I'll post another comment.

comment:35 in reply to:  34 Changed 2 years ago by dgoulet

Status: needs_revisionneeds_review

Replying to teor:

There are missed opportunities for const pointers in this commit.

All fixed (hopefully).

Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)

Fixed 136da34.

Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.

Should we really continue that "tradition"? :)

I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)

Fair enough. Fixed 279764a

Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?

  /* A new consensus has been created: pass it to the shared random subsystem
     to update the SR state. */
  sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);

It's the consensus that clients use, so we use it to set our SRV value
post-consensus, giving us the same view as clients. Does that make sense?
Do you think we should use FLAV_NS instead?

Most clients use the microdesc consensus, some still use NS.

I don't mind which consensus we use, but:

  • I think we should have a comment explaining the arbitrary choice, and

I actually changed it to use FLAV_NS, which is the main flavor: if we can't
create it, we have a problem! I also now check that the consensus pointer
exists before using it.

See 63abee2

  • I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).

Yeah, quite unsure but plausible from what I can see.

[snip]

Now I see 7/724 TESTS FAILED. (1 skipped).

6 of these are the crashes above, the final one is:

Thanks to asn, we found out why the tests were failing! See 30edde4.

See branch: prop250_final_v5 with the fixups.

comment:36 Changed 2 years ago by teor

I've reviewed all the changes up to 30edde4bce7c5f34abb673604cc6e08c43195425.
The code changes are trivially correct, the unit test changes are proven by execution.

All unit tests pass ok with no failures, as does make test-network-all.

I would be ok with this being merged, but I'd like to run it through coverity or clang-scan eventually.

comment:37 Changed 2 years ago by teor

(This is my second comment. Please read both.)

Compiling with clang (with some extra warnings) yields the following integer size complaints:

Fixed in 185189a in my branch prop250_final_v5 in https://github.com/teor2345/tor.git
(You might want to split this into a series of fixups.)

next_valid_after_time should be time_t in:

time_t
get_next_valid_after_time(time_t now)
{
  int next_valid_after_time;

sr_parse_srv has the only remaining implicit integer conversion warning - it would be nice to silence it by casting the result of tor_parse_long() to int:

sr_srv_t *
sr_parse_srv(const smartlist_t *args)
{
  char *value;
  int num_reveals, ok, ret;
  sr_srv_t *srv = NULL;

  tor_assert(args);

  if (smartlist_len(args) < 2) {
    goto end;
  }

  /* First argument is the number of reveal values */
  num_reveals = tor_parse_long(smartlist_get(args, 0),
                               10, 0, INT32_MAX, &ok, NULL);

ts should be time_t:

static void
test_encoding(void *arg)
{
  (void) arg;
  int ret, duper_rand = 42;
  /* Random number is 32 bytes. */
  char raw_rand[32];
  uint64_t ts = 1454333590;

These should be tt_u64_op:

  • tt_int_op(state->n_protocol_runs, ==, 1);
  • tt_int_op(state->n_protocol_runs, ==, 45);
  • tt_int_op(state->n_protocol_runs, ==, 46);

comment:38 Changed 2 years ago by teor

(This is my third comment.)

I also get an assertion failure:

shared-random/utils: [forking] 
  FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
  [utils FAILED]

But only when compiling under certain clang options (OS X):

  • -fstack-protector -fstrict-aliasing -DPARANOIA -fsanitize=undefined-trap -fsanitize-undefined-trap-on-error -fno-sanitize-recover
  • No optimisation (-O0)
  • 64 bit

I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):

2   libsystem_c.dylib             	0x9d7f5c34 abort + 156
3   test                          	0x0057e5bf crash_handler + 383
4   libsystem_platform.dylib      	0x990fb01b _sigtramp + 43
5   ???                           	0xffffffff 0 + 4294967295
6   test                          	0x0057e440 0x1000 + 5755968
7   test                          	0x007d20eb parse_iso_time_ + 1739 (util.c:1713)
8   test                          	0x007d2134 parse_iso_time + 52 (util.c:1723)
9   test                          	0x00702157 config_assign_value + 7239 (confparse.c:346)
10  test                          	0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11  test                          	0x006fa435 config_assign + 2133 (confparse.c:828)
12  test                          	0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13  test                          	0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)

comment:39 Changed 2 years ago by teor

And finally, from clang-scan:

In sr_act_post_consensus, this code can be executed when consensus is NULL - it should be surrounded by if (consensus) like some of the code above it:

  /* Update our state with the valid_after time of the next consensus so once
   * the next voting period start we are ready to receive votes. */
  time_t next_consensus_valid_after =
    get_next_valid_after_time(consensus->valid_after);
  sr_state_update(next_consensus_valid_after);

comment:40 in reply to:  38 ; Changed 2 years ago by dgoulet

Replying to teor:

(This is my third comment.)

I also get an assertion failure:

shared-random/utils: [forking] 
  FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
  [utils FAILED]

But only when compiling under certain clang options (OS X):

  • -fstack-protector -fstrict-aliasing -DPARANOIA -fsanitize=undefined-trap -fsanitize-undefined-trap-on-error -fno-sanitize-recover
  • No optimisation (-O0)
  • 64 bit

I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!

I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):

2   libsystem_c.dylib             	0x9d7f5c34 abort + 156
3   test                          	0x0057e5bf crash_handler + 383
4   libsystem_platform.dylib      	0x990fb01b _sigtramp + 43
5   ???                           	0xffffffff 0 + 4294967295
6   test                          	0x0057e440 0x1000 + 5755968
7   test                          	0x007d20eb parse_iso_time_ + 1739 (util.c:1713)
8   test                          	0x007d2134 parse_iso_time + 52 (util.c:1723)
9   test                          	0x00702157 config_assign_value + 7239 (confparse.c:346)
10  test                          	0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11  test                          	0x006fa435 config_assign + 2133 (confparse.c:828)
12  test                          	0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13  test                          	0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)

I'm not sure what to do about this one. I wonder if this also happens with our current state file, since this is plain ISO time parsing. There's nothing different between this state file and the one we already have in terms of timestamp parsing...

comment:41 Changed 2 years ago by dgoulet

I've fixed the rest, i.e. I took your commit 185189a. sr_act_post_consensus() now has a check on the pointer. The function can't currently be called with a NULL consensus, but better safe for the future.

Since the changes above are easy to confirm without fixup commits, I've created a new branch with everything squashed. You can still find the fixup commits in prop250_final_v5.

Fully squashed and rebased on git master "ready for merge" branch is: prop250_final_v6

comment:42 in reply to:  40 Changed 2 years ago by teor

Status: needs_reviewneeds_revision

Replying to dgoulet:

Replying to teor:

I also get an assertion failure:

shared-random/utils: [forking] 
  FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
  [utils FAILED]

I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!

The memcpy(fp, commit.rsa_identity_fpr, sizeof(fp)); copies the uninitialised commit.rsa_identity_fpr into fp. Therefore, clang can assume that commit.rsa_identity_fpr is all zero (or, alternatively, that it uses memory that is initialised to zero). As a result, you get the same result for both commit_is_authoritative calls.

I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):

2   libsystem_c.dylib             	0x9d7f5c34 abort + 156
3   test                          	0x0057e5bf crash_handler + 383
4   libsystem_platform.dylib      	0x990fb01b _sigtramp + 43
5   ???                           	0xffffffff 0 + 4294967295
6   test                          	0x0057e440 0x1000 + 5755968
7   test                          	0x007d20eb parse_iso_time_ + 1739 (util.c:1713)
8   test                          	0x007d2134 parse_iso_time + 52 (util.c:1723)
9   test                          	0x00702157 config_assign_value + 7239 (confparse.c:346)
10  test                          	0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11  test                          	0x006fa435 config_assign + 2133 (confparse.c:828)
12  test                          	0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13  test                          	0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)

I'm not sure what to do about this one. I wonder if this actually also happens with our current state file since this is plain ISO TIME parsing. Nothing different between this state file and the one we have in terms of parsing timestamp...

You pass 2666 as the year. Any years after 2037 overflow a 32 bit time_t.

See also #18383 for a related issue in the dirauth code that was causing some assertion failures.

comment:43 Changed 2 years ago by teor

Status: needs_revisionneeds_review

My branch prop250_final_v6 has fixes for these.

comment:44 Changed 2 years ago by asn

Please find the latest branch for review at dgoulet's repo at prop250_final_v6.

comment:45 Changed 2 years ago by dgoulet

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final

comment:46 Changed 2 years ago by dgoulet

Sponsor: SponsorRSponsorR-must

comment:47 Changed 2 years ago by nickm

Keywords: TorCoreTeam201604 added

Every postponed needs_review ticket should get a review in April

comment:48 Changed 2 years ago by nickm

Reviewer: nickm

comment:49 Changed 2 years ago by dgoulet

Status: needs_reviewmerge_ready

comment:50 Changed 2 years ago by dgoulet

Since this one is humongous, here is an experiment for code review on Gitlab using the merge request feature:

https://gitlab.com/dgoulet/tor/merge_requests/1

comment:51 Changed 2 years ago by nickm

Status: merge_readyneeds_revision

dgoulet is revising based on early review. More review to follow.

comment:52 Changed 2 years ago by asn

Keywords: TorCoreTeam201605 added; TorCoreTeam201604 removed

comment:53 Changed 2 years ago by dgoulet

Status: needs_revisionneeds_review

Ok new and clean branch for this in ticket16943_029_01

Gitlab merge request is here: https://gitlab.com/dgoulet/tor/merge_requests/3

The current branch contains the fixups from nickm's first review pass. Since the first commit was a bit of a mess, the fixups now apply to multiple commits (instead of only the first one).

comment:54 Changed 2 years ago by dgoulet

Squashed version with new merge request. There was an issue with one fixup commit that didn't apply at the right commit.

https://gitlab.com/dgoulet/tor/merge_requests/4

See branch: ticket16943_029_02

comment:55 Changed 2 years ago by nickm

Status: needs_reviewneeds_revision

I am done reviewing the branch!

comment:56 Changed 2 years ago by teor

With clang on OS X:

src/or/shared_random.c:855:28: warning: size argument in 'strlcpy' call appears
      to be size of the source; expected the size of the destination
      [-Wstrlcpy-strlcat-size]
  strlcpy(dst, buf, sizeof(buf));

comment:57 Changed 2 years ago by teor

#19006 records the following crash on dgoulet's ticket16943_029_02:

Application Specific Information:
crashed on child side of fork pre-exec
*** error for object 0x7fb203c032c0: pointer being freed was not allocated
 

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff9d1a6f06 __pthread_kill + 10
1   libsystem_pthread.dylib       	0x00007fff8daa84ec pthread_kill + 90
2   libsystem_c.dylib             	0x00007fff903bb6e7 abort + 129
3   libsystem_malloc.dylib        	0x00007fff8b468041 free + 425
4   tor                           	0x00000001060086b1 config_free_all + 321 (config.c:841)
5   tor                           	0x000000010606e233 tor_free_all + 179 (main.c:3170)
6   tor                           	0x000000010606e479 tor_cleanup + 313 (main.c:3239)
7   tor                           	0x0000000106067c31 consider_hibernation + 129 (hibernate.c:971)
8   tor                           	0x000000010606c7ce second_elapsed_callback + 590 (main.c:1467)
9   libevent-2.0.5.dylib          	0x0000000106261aa2 event_base_loop + 1871
10  tor                           	0x000000010606c245 do_main_loop + 1525 (main.c:2560)
11  tor                           	0x000000010606e586 tor_main + 230 (main.c:3660)
12  tor                           	0x0000000105fd8ce9 main + 25 (tor_main.c:30)
13  libdyld.dylib                 	0x00007fff982e35ad start + 1

comment:58 Changed 2 years ago by teor

Let's call the last two comments T1 and T2.

T3: This log message is confusing, please fix it:

sr_state_update(): SR: State prepared for new voting 
period (2016-05-10 00:00:00). Current phase is commit (1/0).

I'd prefer commit phase 1, reveal phase 0.

comment:59 Changed 2 years ago by teor

T4: please sort the shared-rand-commit lines in each vote in a stable order (by fingerprint).

comment:60 Changed 2 years ago by teor

T5: please cherry-pick 37655c1 from my branch sr-commit-debug on https://github.com/teor2345/tor.git

I added some logging to track down the i386 issue. It appears that i386 tors can't parse the shared random commits correctly. The commit 37655c1 adds some extra logging, and escapes all the strings that are logged.

comment:61 Changed 2 years ago by teor

You probably want tor_parse_ulong() here, because on i386, min is 1, and max is effectively (long)UINT32_MAX, which is -1.

long tor_parse_long(const char *s, int base, long min,
                    long max, int *ok, char **next);
...
uint32_t version;
...
version = (uint32_t) tor_parse_long(value, 10, 1, UINT32_MAX, NULL, NULL);

I think we could warn in the tor_parse_* functions when this happens, see #19063.

comment:62 Changed 2 years ago by teor

Let's call the bug above T6.

T7: In sr_parse_srv() you parse a string as a long between 0 and INT32_MAX, then cast and assign it to an int. It's good you limit the value to INT32_MAX rather than INT_MAX, which varies by architecture. Can you also use an int32_t for num_reveals and sr_srv_t.num_reveals to make this limit clearer?

int num_reveals, ok, ret;
sr_srv_t *srv = NULL;
...
num_reveals = (int)tor_parse_long(smartlist_get(args, 0),
                               10, 0, INT32_MAX, &ok, NULL);
...
srv->num_reveals = num_reveals;

comment:63 Changed 2 years ago by teor

T8: In sr_compute_srv(), we limit the number of SR authorities to 255, and crash if there are more. This seems brittle: why not warn instead?

tor_assert(reveal_num < UINT8_MAX);

comment:64 Changed 2 years ago by teor

T8: In networkstatus_parse_vote_from_string(), we check if it's a vote, then check K_SR_FLAG, then parse commits and reveals. But we don't do the same for SRVs. What if buggy authorities include SRVs in their vote without the K_SR_FLAG? What if a consensus includes SRVs without the SR flag?

Can we be consistent, and check the SR flag, then parse commits and SRVs as appropriate?

  /* If this is a vote document, check if information about the shared
     randomness protocol is included, and extract it. */
  if (ns->type == NS_TYPE_VOTE) {
    /* Does this authority participate in the SR protocol? */
    tok = find_opt_by_keyword(tokens, K_SR_FLAG);
    if (tok) {
      ns->sr_info.participate = 1;
      /* Get the SR commitments and reveals from the vote. */
      if (extract_shared_random_commits(ns, tokens) < 0) {
        log_warn(LD_DIR, "SR: Unable to parse commits in vote.");
        goto err;
      }
    }
  }
  /* For both a vote and consensus, extract the shared random values. */
  if (ns->type != NS_TYPE_OPINION) {
    if (extract_shared_random_srvs(ns, tokens) < 0) {
      log_warn(LD_DIR, "SR: Unable to parse SRV(s)");
      goto err;
    }
  }

comment:65 Changed 2 years ago by asn

Here is some TODO from my meeting notes:

  • Commit prop250: Only trust known authority when computing SRV seems to reject votes that contain commits by unknown dirauths. This can happen naturally in the network. Instead of handling unknown commits during parsing in extract_shared_random_commits(), maybe we should handle them in should_keep_commit().
  • I got this message on my notice level logs: [notice] SR: State loaded successfully from file /home/user/data_directory/sr-state. I think this message should be demoted to a lesser severity.
  • It would be nice for debugging, to order commits in votes based on authority fpr, so that they are all in the same order in different votes. If it's too hard, let's not do that.
  • We discussed treating SRVs in votes the same way we treat consensus parameters. That is, not needing a real majority, and if at least 3 dirauths agree on the same SRV on their votes, it is copied on the consensus. This is controversial security-wise, so it needs some thinking.
  • Make a state diagram of the SRV protocol for greater understanding. We already have one in the bottom of the notes page, but we might be able to make a better one.

comment:66 Changed 2 years ago by teor

Please see my branch sr-commit-fixes on https://github.com/teor2345/tor.git

37655c1 Improve log messages for SR commit lines (T5)
520e14f Fix a bug in shared random version parsing on 32-bit platforms (T6)
2fd8a21 Consistently use uint32_t for integers in shared random structs (T7)
d6487a3 Silence a clang warning about shared random buffer sizes (T1)
2fd8a21 Make a shared random log message easier to understand (T3)
add23bd Move the SR flag to the same place as the other SR ns string constants
3731984 Don't crash authorities when they have 254 or more shared random reveals (T8, T7)
8a84070 Sort commits in lexicographical order in votes (T4)

T2 was fixed by dgoulet.

comment:67 Changed 2 years ago by isabela

Points: large6

comment:68 Changed 2 years ago by teor

T9: The unit test fixes in ticket16943_029_02 have a bug on platforms where time_t is 32 bit.

When running src/test/test --debug shared-random/state_load_from_disk on 32 bit OS X, I see:

[warn] tor_timegm: Bug: Result does not fit in tor_timegm (on Tor 0.2.9.0-alpha-dev )
[warn] SR: Reading state error: Invalid time '2666-04-19 07:16:00' for keyword 'ValidAfter'

This was introduced with the unit tests in commit e931fef2; let's use 2037 instead.

comment:69 Changed 2 years ago by teor

I've split off everything that hasn't been included in dgoulet/ticket16943_029_02 into child tickets.

Here's what has happened to T1-T9 and my branch sr-commit-fixes:

37655c1 Improve log messages for SR commit lines (T5)
has been fixup'd into another commit in dgoulet/ticket16943_029_02

520e14f Fix a bug in shared random version parsing on 32-bit platforms (T6)
has been fixup'd into another commit in dgoulet/ticket16943_029_02

2fd8a21 Consistently use uint32_t for integers in shared random structs (T7)
Split off into #19126, rebased on the latest dgoulet/ticket16943_029_02

d6487a3 Silence a clang warning about shared random buffer sizes (T1)
issue was fixed by b9b9022 prop250: Pass the dst length to sr_srv_encode()

bc7160c Make a shared random log message easier to understand (T3)
has been fixup'd into another commit in dgoulet/ticket16943_029_02

add23bd Move the SR flag to the same place as the other SR ns string constants
has been fixup'd into another commit in dgoulet/ticket16943_029_02

3731984 Don't crash authorities when they have 254 or more shared random reveals
Split off into #19126

8a84070 Sort commits in lexicographical order in votes (T4)
cherry-picked as 4d956a6 prop250: Sort commits in lexicographical order in votes

The unit test fixes in ticket16943_029_02 have a bug on platforms where time_t is 32 bit (T9)
Split off into #19125

comment:70 Changed 2 years ago by teor

Logged #19132 for a missing initialiser error in clang.

comment:71 Changed 2 years ago by teor

I just split off #19134 because we're doing the SRV hash wrong: we're meant to use 8 bytes of reveal_num and version in network order, not 8 bits. This obsoletes #19127, and modifies #19126.


comment:72 Changed 2 years ago by asn

OK, did another review of dgoulet/ticket16943_029_02 up to b9b9022. I like the new changes!

Please see my branch ticket16943_029_04, which addresses some more issues. Specifically, it fixes the first two points from comment:65, and it also folds in teor's patch from #19126 (which fixes tests on platforms where time_t is 32 bits). It also fixes #19132.

What else remains to be done?


comment:73 Changed 2 years ago by dgoulet

Latest branch with all the fixes discussed: ticket16943_029_05

I still have to incorporate #19012 and push it on the gitlab for code review. Should be coming soon.

comment:74 Changed 2 years ago by dgoulet

Status: needs_revisionneeds_review

Branch ready for review at: https://gitlab.com/dgoulet/tor/merge_requests/5

See description for some extra info about last review since this one has been squashed and rebased on master.

comment:75 Changed 2 years ago by dgoulet

Points: 6

comment:76 Changed 2 years ago by dgoulet

Points: 6

Hrm wait, putting points back on this one since this is related to a huge chunk of code.

comment:77 Changed 2 years ago by nickm

Keywords: TorCoreTeam201605 removed

Remove "TorCoreTeam201605" keyword. The time machine is broken.

comment:78 Changed 2 years ago by nickm

Keywords: review-group-3 added

Move some tickets into review-group-3: they are in 0.2.9, and they are needs_review.

comment:79 Changed 2 years ago by nickm

Keywords: review-group-4 added

comment:80 Changed 2 years ago by nickm

Okay, I made a few comments. Only found a couple things that look like possible bugs

comment:81 Changed 2 years ago by nickm

Status: needs_reviewneeds_revision

comment:82 Changed 2 years ago by dgoulet

Status: needs_revisionneeds_review

Everything has been addressed except one detail; please see the merge request comments.

Fixup commits in ticket16943_029_05 (also pushed on the gitlab).

comment:83 Changed 2 years ago by nickm

Status: needs_reviewneeds_revision

added more comments.

comment:84 Changed 2 years ago by dgoulet

Status: needs_revisionneeds_review

Please see discussion on the merge request.

Fixup commits in ticket16943_029_05 (also pushed on the gitlab).

comment:85 Changed 2 years ago by dgoulet

Status: needs_reviewmerge_ready

Ok, the fully squashed branch is in ticket16943_029_05-squashed. The following diff between the current branch under review (with fixup commits) and the squashed branch is empty for me:

$ git diff ticket16943_029_05..ticket16943_029_05-squashed
$

Unit tests pass, chutney works, and our test network is currently running this branch on my dirauth. Once it's merged upstream, I'll ask the testnet operators to update to master so we can track any bugs from now on.

@nickm, finally, you won't be able to close this ticket because of open child tickets. Just say so here once it's merged; I'll take care of moving the tickets to another parent and closing the appropriate children.

comment:86 Changed 2 years ago by nickm

m

e

r

g

e

d

!

comment:87 Changed 2 years ago by dgoulet

Resolution: fixed
Status: merge_readyclosed

Wow code, Much feature, Such closing!

Note: See TracTickets for help on using tickets.