You can find my python script in the guardiness branch of https://git.torproject.org/user/asn/hax.git. You need stem to run it. Please read the README to understand the components of the script and how it's supposed to be run.
{{{
$ time python databaser.py consensus_dir/tmp/
...
DEBUG:root:Parsing consensus 2014-06-05-07-00-00-consensus-microdesc (2204/2206)!
DEBUG:root:Parsing consensus 2014-08-23-17-00-00-consensus-microdesc (2205/2206)!
DEBUG:root:Parsing consensus 2014-06-17-00-00-00-consensus-microdesc (2206/2206)!

real    49m17.897s
user    40m11.616s
sys     0m25.204s

$ time python guardfraction.py
DEBUG:root:Read db file with 2206 consensuses info...

real    0m1.329s
user    0m1.212s
sys     0m0.096s
}}}
Basically, populating the database with 2200 consensuses (3 months) takes 49 minutes, but is only going to happen once. Then calculating the guardfraction given the guardiness only takes a second. That's better performance than the summary files, and much easier to sysadmin and understand.
The script should be ready for review. There are two or three XXXs I need to fix before deployment, but they are not super important. Please check the README for usage instructions.
BTW, the repository also contains 4MB of consensus files that are used for the unittests. They are in test/test_consensuses/ and you can use them as a simple corpus for the scripts too.
To run the unittests, do the following terrible thing in the project topdir:
=== consensus.py ===
line 17: Are we sure all dirauths will run the parser at the same time offset? If some dirauths run just before the hour and some just after the hour there could be off-by-one differences even if the dirauths all have essentially the same view of the history.
line 34: probably best to use stem.Flag.GUARD instead of 'Guard'
line 56: I think your schema should do this, since you require the consensus dates to be unique
=== guards_ds.py ===
write_output_file:
I think what you're trying to do here, by appending everything to f_str before writing is like an atomic write? But I think file.write() could still fail, and in the meantime you've truncated the output file by opening it w+. If you want to do an atomic replacement, you need to do a two-phase commit, e.g. write to a tempfile and then rename it after the write is complete.
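For illustration, the write-to-tempfile-then-rename pattern might look like this sketch (`atomic_write` is an illustrative helper, not the script's actual code):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temporary file in the same directory, fsync it, then
    # rename() it over the target. rename() is atomic on POSIX as long
    # as both paths are on the same filesystem, so readers see either
    # the old file or the complete new one, never a truncated file.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```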
=== guard_fraction.py ===
lines 92-94: same comment as consensus.py; it would seem prudent to round to the nearest hour so that we don't get off-by-one errors with different dirauths having a database that is off by one consensus.
I pushed some new code, addressing most of your comments and some of Sebastian's comments.
The most important change is that now the guardfraction script doesn't delete old consensuses from the database. Instead it just ignores them by default. I added a new CLI switch to delete them.
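The switch might look something like this sketch (the flag name here is illustrative; the actual one in the script may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# By default, expired consensuses stay in the database and are simply
# ignored by the guardfraction calculation; this switch deletes them.
parser.add_argument("--delete-expired", action="store_true",
                    help="delete expired consensuses from the database "
                         "instead of just ignoring them")
args = parser.parse_args(["--delete-expired"])
```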
Also, I started using the sqlite datetime() functions which are a bit smarter about days of the month etc.
=== consensus.py: ===
line 17: Are we sure all dirauths will run the parser at the same time offset? If some dirauths run just before the hour and some just after the hour there could be off-by-one differences even if the dirauths all have essentially the same view of the history.
You have a point here, but unfortunately I didn't find a way to round sqlite's datetime() down to the hour. I will look more into this.
That said, I'm going to give dirauth operators a cron job that they are supposed to use. The cron job will run at around 30 minutes past the hour, every hour. So that should not be that big of a problem.
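For what it's worth, sqlite's strftime() can express the rounding by hard-coding the minutes and seconds in the format string (a sketch, not the script's actual query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A format string that pins ":00:00" rounds any timestamp down to the
# start of its hour.
(rounded,) = conn.execute(
    "SELECT strftime('%Y-%m-%d %H:00:00', '2014-06-05 07:42:13')"
).fetchone()
print(rounded)  # 2014-06-05 07:00:00
```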
line 34: probably best to use stem.Flag.GUARD instead of 'Guard'
Done.
line 56: I think your schema should do this, since you require the consensus dates to be unique
Done.
=== guards_ds.py ===
write_output_file:
I think what you're trying to do here, by appending everything to f_str before writing is like an atomic write? But I think file.write() could still fail, and in the meantime you've truncated the output file by opening it w+. If you want to do an atomic replacement, you need to do a two-phase commit, e.g. write to a tempfile and then rename it after the write is complete.
Hm, I was not actually trying to do an atomic write. That style is left over from when I started developing the script: instead of writing to a file, I logged to stdout to check that the output was correct.
If you think I should try to do an atomic write there, tell me.
=== guard_fraction.py: ===
lines 92-94: same comment as consensus.py; it would seem prudent to round to the nearest hour so that we don't get off-by-one errors with different dirauths having a database that is off by one consensus.
Looks plausible, I could see myself trying it out. A couple of things:

The cron script should never output a single thing in the case where no errors happened, but if an error happened it should always make sure that gets reported.

There probably wants to be a vacuum somewhere so the database file gets shrunk periodically.

There should be a protection against parallel execution of the script, because that will happen eventually.

What is the deal with not importing a consensus if it isn't valid anymore? I don't understand the logic. The duplicate protection should prevent duplicates, so what's the issue?
For the output format, I thought we had talked about giving a way to enumerate which consensuses were missing. Maybe that could at least be an optional command? Also, instead of
n-inputs <number of consensuses parsed> <number of months considered>
maybe
n-inputs <number of consensuses parsed> <number of all consensuses in voting period> <number of months considered>
is more appropriate/easier to understand? Because otherwise "how many hours is a month" becomes an issue.
What happens with consensuses generated at the half-hour mark? Can they accidentally get added, and if they do, do they distort results?
Looks plausible, I could see myself trying it out. A couple of things: The cron script should never output a single thing in the case where no errors happened, but if an error happened it should always make sure that gets reported.
Fixed that, I think. Now it should be totally silent if everything went well, but still cry if things went bad. I'm testing this right now on my system.
There probably wants to be a vacuum somewhere so the database file gets shrunk periodically.
I didn't do this yet.
There should be a protection against parallel execution of the script, because that will happen eventually.
Fixed with flock(1).
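In Python the same effect could be had with fcntl.flock() (a sketch; the actual fix wraps the cron invocation with the flock(1) command-line utility, and the lock file path here is hypothetical):

```python
import fcntl
import sys

# Holding an exclusive, non-blocking lock on this file for the lifetime
# of the process keeps a second invocation from running concurrently.
lock_file = open("/tmp/guardfraction.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("another guardfraction instance is already running")
```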
What is the deal with not importing a consensus if it isn't valid anymore? I don't understand the logic. The duplicate protection should prevent duplicates, so what's the issue?
You are right. Now the date check happens only on guardfraction.py. The databaser script will just import whatever is found in the directory it was pointed to.
For the output format, I thought we had talked about giving a way to enumerate which consensuses were missing. Maybe that could at least be an optional command?
Did this. It's now the -m switch in guardfraction.py. Check it out!
Also, instead of
{{{
n-inputs <number of consensuses parsed> <number of months considered>
}}}
maybe
{{{
n-inputs <number of consensuses parsed> <number of all consensuses in voting period> <number of months considered>
}}}
is more appropriate/easier to understand? Because otherwise "how many hours is a month" becomes an issue.
As discussed in IRC, I changed the period to be in days and not in months. This is much better. And I also output the ideal number of consensuses in the output file.
What happens with consensuses generated at the half-hour mark? Can they accidentally get added, and if they do, do they distort results?
I think they can get added but they don't really distort results.
They will definitely not be considered by the list-missing-consensuses functionality.
Hmmmmm. I worry slightly about using stem for parsing here
... remind me, does stem accept unrecognized elements and fields and
so forth in documents, or does it reject those? I thought
it sometimes needed to get patched when we added new fields. If
that's right, this will make us fail when the format evolves.
If I'm wrong about stem's behavior, great.
If I'm right, we need to give it a way to be more tolerant.
Regardless, we should make sure that we can recover from
misformatted consensuses.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
Probably this needs a LICENSE or COPYING file. And a license.
Do we verify consensuses now?
README should explain how and what to backup.
The cron script probably shouldn't be called cron.sh, right?
Is the cron script smart enough to exit if something fails?
In guardfraction.py ... double-check that max_days is positive? And not huge?
Check that we're not going backwards in time?
Do you need an index on consensus.consensus_date to make delete and count fast enough?
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
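A gap check over the (valid-after, fresh-until) series might look like this sketch (`find_gaps` is an illustrative helper, not the script's function):

```python
from datetime import datetime

def find_gaps(consensuses):
    """Return the gaps in a list of (valid_after, fresh_until) datetime
    pairs, sorted by valid_after. A gap exists whenever one consensus
    stops being fresh before the next one becomes valid, regardless of
    whether the consensuses land on one-hour boundaries."""
    gaps = []
    for (_, fu1), (va2, _) in zip(consensuses, consensuses[1:]):
        if fu1 < va2:
            gaps.append((fu1, va2))
    return gaps
```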
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
Hmmmmm. I worry slightly about using stem for parsing here ... remind me, does stem accept unrecognized elements and fields and so forth in documents, or does it reject those? I thought it sometimes needed to get patched when we added new fields. If that's right, this will make us fail when the format evolves.
Hi Nick. It's not clear to me what the spec change being proposed here is. To more generally answer your question though, new descriptor lines are fine for Stem as long as they don't violate the spec in some way. They're available via the get_unrecognized_lines() method on the Descriptor base class.
Thanks for the review. I fixed some but not all stuff and pushed the changes to my branch.
Some comments:
Hmmmmm. I worry slightly about using stem for parsing here
... remind me, does stem accept unrecognized elements and fields and
so forth in documents, or does it reject those? I thought
it sometimes needed to get patched when we added new fields. If
that's right, this will make us fail when the format evolves.
If I'm wrong about stem's behavior, great.
If I'm right, we need to give it a way to be more tolerant.
Regardless, we should make sure that we can recover from
misformatted consensuses.
Good point. As Damian said, stem filters unknown lines into a special list. I can do some testing on this soon, to make sure it works fine.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
That's also a good point.
Currently, I'm using the guard's fingerprint as its unique identifier. The problem, I guess, is that in the near future each guard will have two such fingerprints, and further in the future it will have only one, but not the one it has now.
How should this be fixed I wonder? Should I change the schema and output file format to support multiple fingerprints? So that we include both RSA and Ed25519?
To do so, I will have to change the output file format quite a bit again... Both Python and Tor code will need to change, and unittests will need to be rewritten. Should I change the output file format to be yaml or something? :/ But then how will Tor parse it...
Probably this needs a LICENSE or COPYING file. And a license.
Done. Used unlicense.
Do we verify consensuses now?
No we don't. #11045 is the corresponding ticket. Damian asked me to take care of that ticket, which might happen, but not any time very soon.
However, I don't worry too much about this. For now, the cron script will just copy the consensus from Tor's data directory to the guardfraction directory every hour, so the consensus should be authentic. I'm also going to suggest to dirauth operators that they run the cron script for a month or so before enabling the code, to have a sufficiently populated guardfraction db.
README should explain how and what to backup.
Hm, what do you mean backup?
The cron script probably shouldn't be called cron.sh, right?
True. Renamed to guardfraction_cron.sh.
Is the cron script smart enough to exit if something fails?
Done. Now it's smarter!
In guardfraction.py ... double-check that max_days is positive? And not huge?
Done.
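A check along those lines might look like this (a sketch; the helper name and the upper bound are illustrative, not the script's actual code):

```python
import argparse

def positive_days(value):
    # Hypothetical argparse type for a max-days style option: reject
    # zero, negative, and implausibly large values before they reach
    # the database queries.
    days = int(value)
    if days <= 0:
        raise argparse.ArgumentTypeError("must be a positive number of days")
    if days > 3650:
        raise argparse.ArgumentTypeError("more than ten years seems wrong")
    return days
```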
Check that we're not going backwards in time?
Hm, how should I do that?
Should I take the latest consensus from the database, and make sure that the current time is later than the valid-after of that consensus?
Do you need an index on consensus.consensus_date to make delete and count fast enough?
I think that's a good idea since it's used in multiple WHERE clauses, but I'm not very good at databases. I added the index in any case.
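For reference, the index creation is a one-liner (sketch schema, not the script's actual one). One caveat worth knowing: a UNIQUE constraint on the column already gives sqlite an implicit index, so an explicit one may be redundant, though harmless.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sketch schema; the real one has more columns and constraints.
conn.execute("CREATE TABLE consensus (consensus_date TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS consensus_date_idx "
             "ON consensus (consensus_date)")
# EXPLAIN QUERY PLAN shows whether a WHERE clause actually uses it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM consensus "
    "WHERE consensus_date < '2014-06-01'"
).fetchall()
```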
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
Yeah, this was a hard problem. The current implementation is a rough hack to give dirauth operators a rough idea of which consensuses are missing...
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
I think it's right. At least its output seems to be correct, and the unittests test it.
Which parentheses don't match?
Thanks for the code review again! We can discuss this during the next tor meeting if you want :)
Good point. As Damian said, stem filters unknown lines into a special list. I can do some testing on this soon, to make sure it works fine.
Sounds good.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
That's also a good point.
Currently, I'm using the guard's fingerprint as its unique identifier. The problem, I guess, is that in the near future each guard will have two such fingerprints, and further in the future it will have only one, but not the one it has now.
How should this be fixed I wonder? Should I change the schema and output file format to support multiple fingerprints? So that we include both RSA and Ed25519?
To do so, I will have to change the output file format quite a bit again... Both Python and Tor code will need to change, and unittests will need to be rewritten. Should I change the output file format to be yaml or something? :/ But then how will Tor parse it...
I would suggest the following: don't worry about it for now. We can add an option to tell it to output a new format, and have tor accept the old format and the new one.
If we're really smart, we can add a version line as the start for the output.
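Such a header could be as simple as this sketch ("guardfraction-file-version" and `write_header` are made-up names here, not an agreed format):

```python
OUTPUT_FORMAT_VERSION = 1  # hypothetical version marker

def write_header(f, n_parsed, n_total, n_days):
    # A version line up front lets tor keep accepting the old format
    # while also recognizing newer, extended ones.
    f.write("guardfraction-file-version %d\n" % OUTPUT_FORMAT_VERSION)
    f.write("n-inputs %d %d %d\n" % (n_parsed, n_total, n_days))
```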
README should explain how and what to backup.
Hm, what do you mean backup?
Let's say the system gets wrecked. What files would we want a copy of to restore it?
Check that we're not going backwards in time?
Hm, how should I do that?
Should I take the latest consensus from the database, and make sure that the current time is later than the valid-after of that consensus?
That's a good way.
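The check described above could be sketched like this (`check_not_backwards` is an illustrative helper, not the script's code):

```python
from datetime import datetime

def check_not_backwards(latest_valid_after, now=None):
    # Compare the newest valid-after in the database against the current
    # time; if the clock is behind it, time apparently went backwards
    # and it is safer to refuse to run.
    now = now or datetime.utcnow()
    if now < latest_valid_after:
        raise RuntimeError("current time predates the newest consensus; "
                           "refusing to run with a backwards clock")
```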
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
Yeah, this was a hard problem. The current implementation is a rough hack to give dirauth operators a rough idea of which consensuses are missing...
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
I think it's right. At least its output seems to be correct, and the unittests test it.
Which parentheses don't match?
Okay, if it works okay, and performs okay, that's fine.
Thanks for the code review again! We can discuss this during the next tor meeting if you want :)