You can find my python script in the guardiness branch of https://git.torproject.org/user/asn/hax.git. You need stem to run it. Please read the README to understand the components of the script and how it's supposed to be run.
{{{
$ time python databaser.py consensus_dir/tmp/
...
DEBUG:root:Parsing consensus 2014-06-05-07-00-00-consensus-microdesc (2204/2206)!
DEBUG:root:Parsing consensus 2014-08-23-17-00-00-consensus-microdesc (2205/2206)!
DEBUG:root:Parsing consensus 2014-06-17-00-00-00-consensus-microdesc (2206/2206)!

real    49m17.897s
user    40m11.616s
sys     0m25.204s

$ time python guardfraction.py
DEBUG:root:Read db file with 2206 consensuses info...

real    0m1.329s
user    0m1.212s
sys     0m0.096s
}}}
Basically, populating the database with 2200 consensuses (3 months) takes 49 minutes, but is only going to happen once. Then calculating the guardfraction given the guardiness only takes a second. That's better performance than the summary files, and much easier to sysadmin and understand.
The script should be ready for review. There are two or three XXXs I need to fix before deployment, but they are not super important. Please check the README for usage instructions.
BTW, the repository also contains 4MB of consensus files that are used for the unittests. They are in test/test_consensuses/ and you can use them as a simple corpus for the scripts too.
To run the unittests, do the following terrible thing in the project topdir:
=== consensus.py ===
line 17: Are we sure all dirauths will run the parser at the same time offset? If some dirauths run just before the hour and some just after the hour there could be off-by-one differences even if the dirauths all have essentially the same view of the history.
line 34: probably best to use stem.Flag.GUARD instead of 'Guard'
line 56: I think your schema should do this, since you require the consensus dates to be unique
=== guards_ds.py ===
write_output_file:
I think what you're trying to do here, by appending everything to f_str before writing is like an atomic write? But I think file.write() could still fail, and in the meantime you've truncated the output file by opening it w+. If you want to do an atomic replacement, you need to do a two-phase commit, e.g. write to a tempfile and then rename it after the write is complete.
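For illustration, the write-to-tempfile-then-rename pattern might look like this sketch (`atomic_write` is an illustrative helper, not the script's actual code):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temporary file in the same directory, fsync it, then
    # rename() it over the target. rename() is atomic on POSIX as long
    # as both paths are on the same filesystem, so readers see either
    # the old file or the complete new one, never a truncated file.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```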
=== guard_fraction.py ===
lines 92-94: same comment as consensus.py; it would seem prudent to round to the nearest hour so that we don't get off-by-one errors with different dirauths having a database that is off by one consensus.
I pushed some new code, addressing most of your comments and some of Sebastian's comments.
The most important change is that now the guardfraction script doesn't delete old consensuses from the database. Instead it just ignores them by default. I added a new CLI switch to delete them.
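The switch might look something like this sketch (the flag name here is illustrative; the actual one in the script may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# By default, expired consensuses stay in the database and are simply
# ignored by the guardfraction calculation; this switch deletes them.
parser.add_argument("--delete-expired", action="store_true",
                    help="delete expired consensuses from the database "
                         "instead of just ignoring them")
args = parser.parse_args(["--delete-expired"])
```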
Also, I started using the sqlite datetime() functions which are a bit smarter about days of the month etc.
=== consensus.py: ===
line 17: Are we sure all dirauths will run the parser at the same time offset? If some dirauths run just before the hour and some just after the hour there could be off-by-one differences even if the dirauths all have essentially the same view of the history.
You have a point here, but unfortunately I didn't find a way to round sqlite's datetime() down to the hour. I will look more into this.
That said, I'm going to give dirauth operators a cron job that they are supposed to use. The cron job will run at around 30 minutes past the hour, every hour. So that should not be that big of a problem.
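For what it's worth, sqlite's strftime() can express the rounding by hard-coding the minutes and seconds in the format string (a sketch, not the script's actual query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A format string that pins ":00:00" rounds any timestamp down to the
# start of its hour.
(rounded,) = conn.execute(
    "SELECT strftime('%Y-%m-%d %H:00:00', '2014-06-05 07:42:13')"
).fetchone()
print(rounded)  # 2014-06-05 07:00:00
```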
line 34: probably best to use stem.Flag.GUARD instead of 'Guard'
Done.
line 56: I think your schema should do this, since you require the consensus dates to be unique
Done.
=== guards_ds.py ===
write_output_file:
I think what you're trying to do here, by appending everything to f_str before writing is like an atomic write? But I think file.write() could still fail, and in the meantime you've truncated the output file by opening it w+. If you want to do an atomic replacement, you need to do a two-phase commit, e.g. write to a tempfile and then rename it after the write is complete.
Hm, I was not actually trying to do an atomic write. That style is left over from when I started developing the script: instead of writing to a file, I logged to stdout to check that the output was correct.
If you think I should try to do an atomic write there, tell me.
=== guard_fraction.py: ===
lines 92-94: same comment as consensus.py; it would seem prudent to round to the nearest hour so that we don't get off-by-one errors with different dirauths having a database that is off by one consensus.
Looks plausible, I could see myself trying it out. A couple of things:

The cron script should never output a single thing in the case where no errors happened, but if an error happened it should always make sure that gets reported.

There probably wants to be a vacuum somewhere so the database file gets shrunk periodically.

There should be a protection against parallel execution of the script, because that will happen eventually.

What is the deal with not importing a consensus if it isn't valid anymore? I don't understand the logic. The duplicate protection should prevent duplicates, so what's the issue?
For the output format, I thought we had talked about giving a way to enumerate which consensuses were missing. Maybe that could at least be an optional command? Also, instead of
n-inputs <number of consensuses parsed> <number of months considered>
maybe
n-inputs <number of consensuses parsed> <number of all consensuses in voting period> <number of months considered>
is more appropriate/easier to understand? Because otherwise "how many hours is a month" becomes an issue.
What happens with consensuses generated at the half-hour mark? Can they accidentally get added, and if they do, do they distort results?
Looks plausible, I could see myself trying it out. A couple of things: The cron script should never output a single thing in the case where no errors happened, but if an error happened it should always make sure that gets reported.
Fixed that, I think. Now it should be totally silent if everything went well, but still cry if things went bad. I'm testing this right now on my system.
There probably wants to be a vacuum somewhere so the database file gets shrunk periodically.
I didn't do this yet.
There should be a protection against parallel execution of the script, because that will happen eventually.
Fixed with flock(1).
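In Python the same effect could be had with fcntl.flock() (a sketch; the actual fix wraps the cron invocation with the flock(1) command-line utility, and the lock file path here is hypothetical):

```python
import fcntl
import sys

# Holding an exclusive, non-blocking lock on this file for the lifetime
# of the process keeps a second invocation from running concurrently.
lock_file = open("/tmp/guardfraction.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("another guardfraction instance is already running")
```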
What is the deal with not importing a consensus if it isn't valid anymore? I don't understand the logic. The duplicate protection should prevent duplicates, so what's the issue?
You are right. Now the date check happens only on guardfraction.py. The databaser script will just import whatever is found in the directory it was pointed to.
For the output format, I thought we had talked about giving a way to enumerate which consensuses were missing. Maybe that could at least be an optional command?
Did this. It's now the -m switch in guardfraction.py. Check it out!
Also, instead of
{{{
n-inputs <number of consensuses parsed> <number of months considered>
}}}
maybe
{{{
n-inputs <number of consensuses parsed> <number of all consensuses in voting period> <number of months considered>
}}}
is more appropriate/easier to understand? Because otherwise "how many hours is a month" becomes an issue.
As discussed in IRC, I changed the period to be in days and not in months. This is much better. And I also output the ideal number of consensuses in the output file.
What happens with consensuses generated at the half-hour mark? Can they accidentally get added, and if they do, do they distort results?
I think they can get added but they don't really distort results.
They will definitely not be considered by the list-missing-consensuses functionality.
Hmmmmm. I worry slightly about using stem for parsing here
... remind me, does stem accept unrecognized elements and fields and
so forth in documents, or does it reject those? I thought
it sometimes needed to get patched when we added new fields. If
that's right, this will make us fail when the format evolves.
If I'm wrong about stem's behavior, great.
If I'm right, we need to give it a way to be more tolerant.
Regardless, we should make sure that we can recover from
misformatted consensuses.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
Probably this needs a LICENSE or COPYING file. And a license.
Do we verify consensuses now?
README should explain how and what to backup.
The cron script probably shouldn't be called cron.sh, right?
Is the cron script smart enough to exit if something fails?
In guardfraction.py ... double-check that max_days is positive? And not huge?
Check that we're not going backwards in time?
Do you need an index on consensus.consensus_date to make delete and count fast enough?
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
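A gap check over the (valid-after, fresh-until) series might look like this sketch (`find_gaps` is an illustrative helper, not the script's function):

```python
from datetime import datetime

def find_gaps(consensuses):
    """Return the gaps in a list of (valid_after, fresh_until) datetime
    pairs, sorted by valid_after. A gap exists whenever one consensus
    stops being fresh before the next one becomes valid, regardless of
    whether the consensuses land on one-hour boundaries."""
    gaps = []
    for (_, fu1), (va2, _) in zip(consensuses, consensuses[1:]):
        if fu1 < va2:
            gaps.append((fu1, va2))
    return gaps
```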
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
Hmmmmm. I worry slightly about using stem for parsing here ... remind me, does stem accept unrecognized elements and fields and so forth in documents, or does it reject those? I thought it sometimes needed to get patched when we added new fields. If that's right, this will make us fail when the format evolves.
Hi Nick. It's not clear to me what the spec change being proposed here is. To more generally answer your question though, new descriptor lines are fine for Stem as long as they don't violate the spec in some way. They're available via the get_unrecognized_lines() method on the Descriptor base class.
Thanks for the review. I fixed some but not all stuff and pushed the changes to my branch.
Some comments:
Hmmmmm. I worry slightly about using stem for parsing here
... remind me, does stem accept unrecognized elements and fields and
so forth in documents, or does it reject those? I thought
it sometimes needed to get patched when we added new fields. If
that's right, this will make us fail when the format evolves.
If I'm wrong about stem's behavior, great.
If I'm right, we need to give it a way to be more tolerant.
Regardless, we should make sure that we can recover from
misformatted consensuses.
Good point. As Damian said, stem filters unknown lines into a special list. I can do some testing on this soon, to make sure it works fine.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
That's also a good point.
Currently, I'm using the guard's fingerprint as its unique identifier. The problem, I guess, is that in the near future each guard will have two such fingerprints, and further in the future it will have only one, but not the one it has now.
How should this be fixed I wonder? Should I change the schema and output file format to support multiple fingerprints? So that we include both RSA and Ed25519?
To do so, I will have to change the output file format quite a bit again... Both Python and Tor code will need to change, and unittests will need to be rewritten. Should I change the output file format to be yaml or something? :/ But then how will Tor parse it...
Probably this needs a LICENSE or COPYING file. And a license.
Done. Used unlicense.
Do we verify consensuses now?
No we don't. #11045 is the corresponding ticket. Damian asked me to take care of that ticket, which might happen, but not any time very soon.
However, I don't worry too much about this. For now, the cron script will just copy the consensus from Tor's data directory to the guardfraction directory every hour, so the consensus should be authentic. I'm also going to suggest to dirauth operators that they run the cron script for a month or so before enabling the code, to have a sufficiently populated guardfraction db.
README should explain how and what to backup.
Hm, what do you mean backup?
The cron script probably shouldn't be called cron.sh, right?
True. Renamed to guardfraction_cron.sh.
Is the cron script smart enough to exit if something fails?
Done. Now it's smarter!
In guardfraction.py ... double-check that max_days is positive? And not huge?
Done.
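A check along those lines might look like this (a sketch; the helper name and the upper bound are illustrative, not the script's actual code):

```python
import argparse

def positive_days(value):
    # Hypothetical argparse type for a max-days style option: reject
    # zero, negative, and implausibly large values before they reach
    # the database queries.
    days = int(value)
    if days <= 0:
        raise argparse.ArgumentTypeError("must be a positive number of days")
    if days > 3650:
        raise argparse.ArgumentTypeError("more than ten years seems wrong")
    return days
```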
Check that we're not going backwards in time?
Hm, how should I do that?
Should I take the latest consensus from the database, and make sure that the current time is later than the valid-after of that consensus?
Do you need an index on consensus.consensus_date to make delete and count fast enough?
I think that's a good idea since it's used in multiple WHERE clauses, but I'm not very good at databases. I added the index in any case.
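For reference, the index creation is a one-liner (sketch schema, not the script's actual one). One caveat worth knowing: a UNIQUE constraint on the column already gives sqlite an implicit index, so an explicit one may be redundant, though harmless.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sketch schema; the real one has more columns and constraints.
conn.execute("CREATE TABLE consensus (consensus_date TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS consensus_date_idx "
             "ON consensus (consensus_date)")
# EXPLAIN QUERY PLAN shows whether a WHERE clause actually uses it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM consensus "
    "WHERE consensus_date < '2014-06-01'"
).fetchall()
```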
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
Yeah, this was a hard problem. The current implementation is a rough hack to give dirauth operators a rough idea of which consensuses are missing...
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
I think it's right. At least its output seems to be correct, and the unittests test it.
Which parentheses don't match?
Thanks for the code review again! We can discuss this during the next tor meeting if you want :)
Good point. As Damian said, stem filters unknown lines into a special list. I can do some testing on this soon, to make sure it works fine.
Sounds good.
The output format should be made extensible so that it can handle more than
one ID format. Soon, Ed25519+RSA identity pairs will exist. After
that, Ed25519-alone will exist.
The schema should also probably handle this situation.
That's also a good point.
Currently, I'm using the guard's fingerprint as its unique identifier. The problem, I guess, is that in the near future each guard will have two such fingerprints, and further in the future it will have only one, but not the one it has now.
How should this be fixed I wonder? Should I change the schema and output file format to support multiple fingerprints? So that we include both RSA and Ed25519?
To do so, I will have to change the output file format quite a bit again... Both Python and Tor code will need to change, and unittests will need to be rewritten. Should I change the output file format to be yaml or something? :/ But then how will Tor parse it...
I would suggest the following: don't worry about it for now. We can add an option to tell it to output a new format, and have tor accept the old format and the new one.
If we're really smart, we can add a version line as the start for the output.
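Such a header could be as simple as this sketch ("guardfraction-file-version" and `write_header` are made-up names here, not an agreed format):

```python
OUTPUT_FORMAT_VERSION = 1  # hypothetical version marker

def write_header(f, n_parsed, n_total, n_days):
    # A version line up front lets tor keep accepting the old format
    # while also recognizing newer, extended ones.
    f.write("guardfraction-file-version %d\n" % OUTPUT_FORMAT_VERSION)
    f.write("n-inputs %d %d %d\n" % (n_parsed, n_total, n_days))
```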
README should explain how and what to backup.
Hm, what do you mean backup?
Let's say the system gets wrecked. What files would we want a copy of to restore it?
Check that we're not going backwards in time?
Hm, how should I do that?
Should I take the latest consensus from the database, and make sure that the current time is later than the valid-after of that consensus?
That's a good way.
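The check described above could be sketched like this (`check_not_backwards` is an illustrative helper, not the script's code):

```python
from datetime import datetime

def check_not_backwards(latest_valid_after, now=None):
    # Compare the newest valid-after in the database against the current
    # time; if the clock is behind it, time apparently went backwards
    # and it is safer to refuse to run.
    now = now or datetime.utcnow()
    if now < latest_valid_after:
        raise RuntimeError("current time predates the newest consensus; "
                           "refusing to run with a backwards clock")
```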
find_missing_hours_from_list() isn't actually right: We don't guarantee a one-hour timer. Instead you need to check for gaps in the (valid-after, fresh-until) series of times. But this would mean that just counting consensuses isn't right. Maybe each consensus needs a duration-fresh field.
Yeah, this was a hard problem. The current implementation is a rough hack to give dirauth operators a rough idea of which consensuses are missing...
How does the nested query in read_db_file() perform? Is it even right? I don't think the parentheses balance.
I think it's right. At least its output seems to be correct, and the unittests test it.
Which parentheses don't match?
Okay, if it works okay, and performs okay, that's fine.
Thanks for the code review again! We can discuss this during the next tor meeting if you want :)