Sebastian keeps running into situations where he wants to know why fluxe3 fell out of the consensus.
Sometimes descriptors get voted on but don't get enough votes to make it into the consensus. Do we keep those? Perhaps we only keep the ones that gabelmoo has? Should we try to be more thorough?
I guess since the v2 statuses are still around, maybe we do get most of the other descriptors. Is it just a matter of how to export them on the relay-search page?
Hm.
Right now we collect descriptors from three different sources:
1. Every hour, we rsync part of gabelmoo's data directory to learn about the new consensus, new votes, and new descriptors.
2. Every hour, after doing 1, we look at the new consensus to find referenced votes and descriptors that we don't have. We download missing consensuses, votes, and descriptors from three directory authorities via HTTP.
3. Once a day, we rsync the full output of weasel's directory-archive script, which downloads consensuses, votes, and all descriptors referenced from there.
So, if only moria1 knows about a descriptor that is not referenced from a consensus or a vote, we won't collect it. What would be the best way to download all descriptors from all directory authorities? If we find a good solution, I'd like to stop doing 1 in the list above.
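For context, the HTTP downloads in 2 boil down to fetching resources by digest from an authority's DirPort. Here is a minimal sketch of that kind of fetch, assuming a host:port string for the authority and the /tor/server/d/<digest> resource from dir-spec; the class and method names are made up for illustration and are not the actual metrics-db code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DescriptorFetcher {

      /* Fetch one server descriptor by its hex-encoded digest from a
       * directory authority's DirPort. */
      public static String fetchServerDescriptor(String authority,
          String hexDigest) throws Exception {
        URL url = new URL("http://" + authority + "/tor/server/d/" + hexDigest);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setReadTimeout(60 * 1000);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
            new InputStreamReader(connection.getInputStream()))) {
          String line;
          while ((line = br.readLine()) != null) {
            sb.append(line).append("\n");
          }
        }
        return sb.toString();
      }
    }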
Here's how we can remove item 1 from my last comment's list and start collecting all descriptors that don't get into the consensus:
Once a day, we download /tor/server/all and /tor/extra/all from all eight directory authorities. These files contain all known server and extra-info descriptors at the time. The files are 4.3M and 5.7M in size as of today.
Is it acceptable to download another 10M from the directory authorities per day? And is once per day sufficient?
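To make the proposed daily job concrete, here is a rough sketch (not the actual metrics-db implementation) that fetches both resources from every authority over HTTP and stores the raw responses for later parsing. The authority addresses, timeout, and output directory are placeholders.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class DailyAllDescriptorsDownload {

      /* Placeholder addresses; the real job would list all eight
       * directory authorities' DirPorts. */
      private static final String[] AUTHORITIES = {
          "authority-1.example:80", "authority-2.example:80" };

      private static final String[] RESOURCES = {
          "/tor/server/all", "/tor/extra/all" };

      public static void main(String[] args) throws Exception {
        Files.createDirectories(Paths.get("in"));
        for (String authority : AUTHORITIES) {
          for (String resource : RESOURCES) {
            URL url = new URL("http://" + authority + resource);
            HttpURLConnection connection =
                (HttpURLConnection) url.openConnection();
            connection.setReadTimeout(5 * 60 * 1000);
            String fileName = "in/" + authority.replace(':', '_')
                + resource.replace('/', '_');
            try (InputStream in = connection.getInputStream()) {
              /* Store the raw response; parsing happens in a later step. */
              Files.copy(in, Paths.get(fileName),
                  StandardCopyOption.REPLACE_EXISTING);
            }
          }
        }
      }
    }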
I wonder if this approach might be insufficient for your requirements. It will tell you about descriptors that the authorities have accepted and have decided to keep. It won't tell us about descriptors that the authorities immediately rejected, or ones that they decided (for whatever reason) to drop or replace.
Do we care about those factors?
As for the information about download size: you can make it much smaller. First, instead of downloading "all", download "all.z". Second, instead of downloading all extra-info descriptors, read through the descriptors in tor/server/all.z to see which ones you are missing, and download only those. I'd bet these approaches combined would save 60-80% of the expected download size.
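The second suggestion could look roughly like this in the metrics code: scan the downloaded server descriptors for their "extra-info-digest" lines and keep only the digests we don't have yet. This is a sketch; "alreadyStored" stands in for whatever lookup metrics-db actually uses.

    import java.io.BufferedReader;
    import java.io.StringReader;
    import java.util.Set;
    import java.util.TreeSet;

    public class MissingExtraInfoFinder {

      /* Return the extra-info digests referenced from the given server
       * descriptors that are not in our local store yet. */
      public static Set<String> findMissingExtraInfoDigests(
          String serverDescriptors, Set<String> alreadyStored)
          throws Exception {
        Set<String> missing = new TreeSet<>();
        BufferedReader br = new BufferedReader(
            new StringReader(serverDescriptors));
        String line;
        while ((line = br.readLine()) != null) {
          if (line.startsWith("extra-info-digest ")) {
            String digest = line.split(" ")[1];
            if (!alreadyStored.contains(digest)) {
              missing.add(digest);
            }
          }
        }
        return missing;
      }
    }

The resulting digests could then be requested in batches, presumably via /tor/extra/d/, which should indeed cut the download size considerably.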
I wonder if this approach might be insufficient for your requirements. It will tell you about descriptors that the authorities have accepted and have decided to keep. It won't tell us about descriptors that the authorities immediately rejected, or ones that they decided (for whatever reason) to drop or replace.
Do we care about those factors?
That's a fine question. I can't say. I guess Sebastian or arma have an answer. From a metrics POV, we're only interested in the descriptors that are referenced from consensuses and maybe votes. But I understand the need to collect unreferenced descriptors for debugging purposes.
What reasons are there for an authority to reject or drop a descriptor? a) unable to parse and b) changes are cosmetic come to mind. I'm somewhat concerned about a) here. If we want to include descriptors that the directory authorities cannot parse, I'll have to improve the metrics code for parsing descriptors. I'd prefer to not include descriptors from case a), though. Descriptors from case b) should be fine to archive. Are there other reasons for the authorities to drop or reject descriptors?
As for the information about download size: you can make it much smaller. First, instead of downloading "all", download "all.z".
Right. We should do that for all downloads, I guess.
Second, instead of downloading all extra-info descriptors, read through the descriptors in tor/server/all.z to see which ones you are missing, and download only those. I'd bet these approaches combined would save 60-80% of the expected download size.
I wonder if this approach might be insufficient for your requirements. It will tell you about descriptors that the authorities have accepted and have decided to keep. It won't tell us about descriptors that the authorities immediately rejected, or ones that they decided (for whatever reason) to drop or replace.
Do we care about those factors?
That's a fine question. I can't say. I guess Sebastian or arma have an answer. From a metrics POV, we're only interested in the descriptors that are referenced from consensuses and maybe votes. But I understand the need to collect unreferenced descriptors for debugging purposes.
What reasons are there for an authority to reject or drop a descriptor? a) unable to parse and b) changes are cosmetic come to mind. I'm somewhat concerned about a) here. If we want to include descriptors that the directory authorities cannot parse, I'll have to improve the metrics code for parsing descriptors. I'd prefer to not include descriptors from case a), though. Descriptors from case b) should be fine to archive. Are there other reasons for the authorities to drop or reject descriptors?
Without more information about what descriptors people want to collect, I'll assume that whatever we learn by downloading /tor/server/all.z and /tor/extra/all once per day is sufficient. Please let me know if it's not.
As for the information about download size: you can make it much smaller. First, instead of downloading "all", download "all.z".
Right. We should do that for all downloads, I guess.
I added ".z" to all URLs except for extra-info descriptors. It seems that directory authorities first compress extra-info descriptors and then concatenate the results. I know that this is permitted in the specification. Unfortunately, I cannot handle that easily in Java. After spending two hours on this problem, I decided that developer time is more valuable than bandwidth and removed the ".z" for extra-info descriptors. Everything else works fine with ".z". I'm happy to accept a patch if someone wants to look closer at the Java problem.
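For anyone who wants to look at that patch, one possible approach (a sketch, untested against the authorities' actual output) is to treat the response as a sequence of concatenated zlib streams and restart an Inflater at the first unconsumed byte whenever the current stream finishes:

    import java.io.ByteArrayOutputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class ConcatenatedZlibDecompressor {

      /* Inflate a byte array that may consist of several zlib streams
       * concatenated back to back. */
      public static byte[] decompress(byte[] compressed)
          throws DataFormatException {
        ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int offset = 0;
        while (offset < compressed.length) {
          Inflater inflater = new Inflater();
          inflater.setInput(compressed, offset, compressed.length - offset);
          while (!inflater.finished()) {
            int len = inflater.inflate(buffer);
            if (len > 0) {
              decompressed.write(buffer, 0, len);
            } else if (!inflater.finished()) {
              /* Truncated input; return what we have so far. */
              inflater.end();
              return decompressed.toByteArray();
            }
          }
          /* Skip past the finished stream and continue with the next
           * concatenated stream, if any. */
          offset = compressed.length - inflater.getRemaining();
          inflater.end();
        }
        return decompressed.toByteArray();
      }
    }

The standard InflaterInputStream stops at the end of the first stream, which would explain why the straightforward approach didn't work here.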
Second, instead of downloading all extra-info descriptors, read through the descriptors in tor/server/all.z to see which ones you are missing, and download only those. I'd bet these approaches combined would save 60-80% of the expected download size.
Okay, that should work. Is once per day enough?
I tried downloading /tor/server/all.z and all the extra-info descriptors referenced from there, and then downloaded /tor/extra/all. The latter gave me new descriptors that were not referenced from the server descriptors I had. We're trying to collect all descriptors in the network, so I enabled downloading both /tor/server/all.z and /tor/extra/all once per day.
As next steps, I'm going to check whether we still need to import gabelmoo's cached-* files and how we can add a per-authority timeout to avoid being delayed by extremely slow authorities.
Hrm, I'm not sure if fetching once per day is sufficient, because wouldn't that mean that we don't learn about a descriptor that was published at hour 1, then superseded at hour 19?
Hrm, I'm not sure if fetching once per day is sufficient, because wouldn't that mean that we don't learn about a descriptor that was published at hour 1, then superseded at hour 19?
Ugh. You're right. I was under the impression that a directory would tell us the full content of cached-descriptors[.new] when asked for /tor/server/all. Looks like I was wrong. It tells us exactly one descriptor per fingerprint.
Does that mean we need to extend Tor, or is there some other way to learn the descriptors in cached-descriptors[.new], cached-extrainfo[.new], and cached-microdescs[.new]?
If we need to extend Tor, how about we add three new URLs to request the descriptor identifiers of all known server descriptors, extra-info descriptors, and microdescriptors. We could also make sure that the directory authorities store all descriptors that might be interesting for debugging, including those with cosmetic changes. metrics-db could then compare the identifiers to see which descriptors it is missing and download them by ID. This probably needs a (short) proposal.
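To sketch what metrics-db could do with such identifier lists (the new URLs themselves are hypothetical at this point): subtract the digests we already have from the digests an authority reports, then build batched download requests, joining several digests with "+" as dir-spec already allows for /tor/server/d/.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class MissingDescriptorRequests {

      /* Build /tor/server/d/ request paths for all reported digests we
       * don't have yet, batching several digests per request. */
      public static List<String> buildRequestUrls(Set<String> reportedDigests,
          Set<String> storedDigests, int digestsPerRequest) {
        Set<String> missing = new TreeSet<>(reportedDigests);
        missing.removeAll(storedDigests);
        List<String> urls = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int inBatch = 0;
        for (String digest : missing) {
          sb.append(inBatch == 0 ? "/tor/server/d/" : "+").append(digest);
          if (++inBatch == digestsPerRequest) {
            urls.add(sb.toString());
            sb = new StringBuilder();
            inBatch = 0;
          }
        }
        if (inBatch > 0) {
          urls.add(sb.toString());
        }
        return urls;
      }
    }

The same idea would apply to extra-info descriptors and microdescriptors with their respective download URLs.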
I haven't thought about it for very long, but I wouldn't object to a /tor/server/complete or something that includes now-unused or never-actually-used descriptors that the authority has lying around.
I also wouldn't object to a Tor patch that writes rejected descriptors to a file somewhere. We'd want to come up with some mechanism for making sure the file doesn't grow to infinity.
(Does this mean this is now a 'Tor directory authority' task?)
Yes, let's make this a 'Tor Directory Authority' task, because the next step is to change Tor. Once Tor supports the new URLs, we can turn it into a 'Metrics' task again, and I'm going to let metrics-db download the new URL once a day or more often if required.
I'm afraid I cannot make the Tor side of this a high-priority task, because it's an unpredictable time sink for me.
Trac: Owner: karsten to N/A; Component: Metrics to Tor Directory Authority