Opened 2 years ago

Last modified 7 months ago

#18798 assigned enhancement

Analyze descriptor completeness

Reported by: iwakeh Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Normal Keywords: ctip, metrics-2018
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I started a wiki page here.

Child Tickets

Change History (19)

comment:1 Changed 2 years ago by iwakeh

Status: newneeds_information

To make sure I look at the right things: Is the choice to look at the lines starting with M-2016 correct?

What other data from the logs might be of use?

What could cause the regular increases of missing descriptor counts?

Last edited 2 years ago by iwakeh (previous) (diff)

comment:2 Changed 2 years ago by karsten

Thanks for starting this analysis!

Let me first speculate about the cause for those daily patterns that you found there. And let me start by giving an example from recent logs to explain the format of M- lines:

M-2016-04-11T22:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0279 -> 0.0279)
M-2016-04-11T23:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0279 -> 0.0279)
M-2016-04-11T23:00:00Z -> D-597C4455AF049B147337BBFF35CE4817676339FF5C94E971A05D416FD1A2DD95 (0.0279 -> 0.0558)
M-2016-04-12T00:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0280 -> 0.0558)
M-2016-04-12T00:00:00Z -> D-597C4455AF049B147337BBFF35CE4817676339FF5C94E971A05D416FD1A2DD95 (0.0280 -> 0.0558)
  1. The first line means that there's a microdescriptor with digest 38F2.. missing from the microdescriptor consensus with valid-after time 2016-04-11 22:00:00. That missing microdescriptor adds a value of 0.0279 to the total missing descriptor count which is then 0.0279. The idea is to only warn if that total value passes 1.0.
  2. The second line says that the same missing microdescriptor is also referenced from the microdescriptor consensus with valid-after time 2016-04-11 23:00:00. Given that we shouldn't double-count that missing descriptor, we're not increasing the total count there.
  3. The third line mentions another microdescriptor with digest 597C.. that is missing, and in this case it's referenced from the microdescriptor consensus with valid-after time 2016-04-11 23:00:00. That one raises the total count by another 0.0279 to then 0.0558.
  4. I guess the remaining two lines are self-explanatory at this point.

Now, what could be the reason for the daily pattern you found there? First of all this has to do with the Tor network growing and shrinking over the day (surprise!). My guess is that quite a few of the relays of which we're missing microdescriptors leave the network during some part of the day and rejoin at a later time. So, when your numbers go up again, those are microdescriptors that we're still missing at that point, not newly missing microdescriptors. At least that's my guess, I didn't confirm it with real data.

Another reason for the high increase could be that you're double-counting missing descriptors by counting a descriptor that's missing in n consensus n times. Again, I didn't look whether that's how you're counting things, I'm just guessing.

Regarding your other question of which lines to look at, the M- lines are only a small part of what we're interested in. In theory, everything after Missing referenced descriptors: is relevant for this analysis. Each of those lines lists a descriptor that references another descriptor that we're missing, which includes lines starting with:

  • S-: a server descriptor references an extra-info descriptor that is missing,
  • V-: a vote references a server descriptor that we're missing,
  • C-: a consensus references a server descriptor that we're missing, and
  • M-: a microdescriptor consensus references a microdescriptor that is missing (see above).

I guess it would be interesting to have statistics on all four types of missing descriptors (or three if we count a server descriptor referenced from a vote or a consensus as the same). Did I only give you three days of logs? If so, I should give you at least a month of logs. In particular the disk-full problem would skew the results a bit.

Thanks!

comment:3 Changed 2 years ago by iwakeh

This was the information I was looking for, thanks!

I only have those few days of logs and would like to see more data for this analysis :-)

What other resources might provide useful background information for me (other than Tor specs)?
For example, reports about regular Tor network patterns etc.

As a next step I'll provide the other types of statistics.

comment:4 Changed 2 years ago by karsten

There, I just sent you more logs for this analysis.

Hmm, I don't have a good answer to your question about other resources. I might have made it sound too obvious that there's churn in the Tor network. I know that, because I analyzed relay uptimes years ago when designing the requirements for relays to become useful hidden-service directories. But I can't think of a good source to learn similar facts about the Tor network. For example, the graphs on Metrics only show daily averages, so those graphs would not have helped, either. I'd say it's easiest if you look for things that you find unusual, and I try to help making sense of those. Note that I only guessed about the reasons for the daily pattern in your graph, I can't say for sure what would have caused them.

comment:5 Changed 2 years ago by iwakeh

Found the logs, thanks.

I'm not worried, if there aren't any other resources; I just wanted to make sure I didn't miss out on any relevant and obvious information :-)

comment:6 Changed 2 years ago by iwakeh

Just added the new analysis here.

What happened on April 1st?

Other than that there aren't that many descriptors missing.

It would be nice to have even more logs soon. Is past log data available?

What further analysis would make sense now? Have a look at the actually downloaded descriptor files?

comment:7 Changed 2 years ago by karsten

Thanks for the update! As mentioned briefly yesterday, but also for the record here, the disk ran full on April 1st, so that's what caused those problems.

I agree that there aren't many referenced descriptors missing. That's a good outcome of this analysis, and it's a sign that the current logic to fetch missing descriptors is working okay.

To be honest, I don't have a good explanation for those many missing microdescriptors. The week-long pattern there is probably the result from most microdescriptors being replaced after a week. I don't yet understand why CollecTor wouldn't be able to fetch missing microdescriptors during that week. We might be looking at a bug there, either in the collection or in the logging. But I'd say put that under low priority for now, because microdescriptors are the least important descriptors we're collecting. In theory (!), we would be able to generate our own microdescriptors from server descriptors, so missing some of them is not a big deal.

But here's something that we're yet missing in the analysis and that just crossed my mind! We're only looking at missing referenced descriptors, but we're totally ignoring missing referencing descriptors, namely consensuses, microdescriptor consensuses (less important), and votes. There are no log lines for those descriptors, but there should be 1 new consensus, 1 new microdescriptor consensus, and 9 new votes every hour. The part that makes it so important to get these descriptors is that they become unavailable after that hour. That's different from referenced descriptors, which is why we seem to be recovering well from those disk-full problems. That might look different with consensuses, microdescriptor consensuses, and votes.

In theory, finding out whether any of those are missing should be a matter of fetching descriptor tarballs and counting files. Would you want to make graphs for those counts and put them next to the missing referenced descriptors graphs?

comment:8 Changed 2 years ago by iwakeh

I added the referencing descriptor analysis to the todo list.

Another question regarding the logging statements:

What do the following numbers mean, i.e. how reliable are these numbers?
Also, the difference between "importing" and "downloading"?

2016-04-21 20:08:29 INFO org.torproject.ernie.db.relaydescs.ArchiveWriter checkMissingDescriptors Finished writing relay descriptors to disk.
While importing relay descriptors from local Tor data directories, we stored 1 consensus(es), 0 microdesc consensus(es), 9 vote(s), 9 certificate(s), 784 server descriptor(s), 767 extra-info descriptor(s), and 0 microdescriptor(s) to disk.
While downloading relay descriptors from the directory authorities, we stored 0 consensus(es), 1 microdesc consensus(es), 0 vote(s), 0 certificate(s), 44 server descriptor(s), 44 extra-info descriptor(s), and 75 microdescriptor(s) to disk.
Statistics on the completeness of written relay descriptors:
V, 2016-04-21 20:00:00, 8114/8114 S (100.0%), 8111/8111 E (100.0%)
V, 2016-04-21 20:00:00, 7934/7934 S (100.0%), 7931/7931 E (100.0%)
V, 2016-04-21 20:00:00, 7969/7969 S (100.0%), 7966/7966 E (100.0%)
V, 2016-04-21 20:00:00, 7934/7934 S (100.0%), 7931/7931 E (100.0%)
V, 2016-04-21 20:00:00, 8099/8099 S (100.0%), 8096/8096 E (100.0%)
V, 2016-04-21 20:00:00, 7934/7934 S (100.0%), 7931/7931 E (100.0%)
V, 2016-04-21 20:00:00, 8118/8118 S (100.0%), 8115/8115 E (100.0%)
V, 2016-04-21 20:00:00, 7934/7934 S (100.0%), 7931/7931 E (100.0%)
V, 2016-04-21 20:00:00, 8079/8079 S (100.0%), 8076/8076 E (100.0%)
C, 2016-04-21 20:00:00, 9/9 V (100.0%), 7155/7155 S (100.0%), 7153/7153 E (100.0%)
M, 2016-04-21 20:00:00, 7155/7155 M (100.0%)

And:
Why are there so many "404 Not found" like

2016-04-22 09:08:51 FINE org.torproject.ernie.db.relaydescs.RelayDescriptorDownloader downloadResourceFromAuthority Downloaded http://154.35.32.5/tor/extra/d/eba9d8540ab196082eb5b8ce5b88c264ac4682e2+ec1f657931f383050e0676c10694ac5129cd6a0f+edd2c259c793fe800cec14ffffd4ce3726f6c9ab+ee5838444f89659d0358fdee3b326f7d6fc7607a+ee8d7295f465e015d16488a043208da30f90e8c7+efc3ceacd5a50dcc8e7c1cd89c3620e75d2e03ca+efc4ed691d99f7b4d8b2a0f14418b298649ca4ac+f0512e9c7f2444e3e480a06ca1a4137e7344e97b+f076e485dd2bd10a342d0fa7d5bdbe4c4dcfd62e+f095e676bd2264b70d26bc3f4bd5d2c770ed9ca3+f0c374adf1096adc0cd54dd6d48d9fe75892d837+f0c87b38a3603cfda11f9e796da65208463a91e3+f120142e2cd82effb3f8a14a61fa62277d45a92e+f16d6a5c1d236b7a56cdf491163d825d03663068+f193d3422168b9bf1097e01cd299b2ad493e7ef6+f29c619aae973ae7052e30fc4065252ec063ca67+f2e1e462e74ba9af9a2f078906447965c3264b12+f32d0c792511b9f9d2d26b576206fc69c8388449+f4623ec336ed83c5dc7fe6a7c88d519abe345c0d+f49cdcd06bffaa28e2d7735b088a1e7af5823a65+f4e60697bc1eb4899993bea2af1593d8410c1b1e+f547156e1eec23858548818eaf94fc8495a2c355+f567b500d0a45cbb21efa8211179d53cd089af5f+f5848db6200401e1634c956940e69dd165211b7e+f6d11790ba60c07c6d6855725f8f93d085e5651b+f791b8a476db1e6ff0e80196bca5f5a0ee14b7d8+f805ffac5b20c0c74e654a14334ccee21b273fec+f8be01a62ce55196a5396638929f2417a56871b2+f937342ed4ddb8eaae2f24b65e060c0a1efd067e+f97802f6f800997d4c8e1d6a8810e152a1047f12+f9c67c9973a7cb17d84ed44b7fa12d0a50780273+f9d821c071760f365a515db487f8e080db295683+f9d852d3fcc1eb7f2b52ae8d3da1d5788dfbb844+f9eca45a51938dc424b84a3a0de6c05531534290+f9f68effe2b846caf01776bdb28069b28f129d24+fa33a938a577148fa6f95007ae868d890c4cc726+fa40d36e1a6c4679b5d52fb935e7da45088e7bd6+fa4d44ebe08afd4ac77bd605c795040d5e659dc7+fad968587da998d29f144990c75051d469ce7423+faf4aedbfb47ed82ae7f9dd19c53aef7e96fd037+fb48f07c350a1f8062862c89ae9a2d22ce194309+fc5ac1718bbfd61f364382f21aff31004a1476c3+fcc1184ce33e2d9262bae11c850cd61c8b054c90+fccf823342c858b0aef14d7678170e7a3db53cda+fce65f424aeda39e0fb747a8fb71c50a8f11bee9+fcf5e07b6f710e0ae5640165a77c02ccbed5c9b8+fd080aae6fc3a0c9a3672c04cafcdae4cc261a07+fd21d67244376f85f5d023253dd975c7fc19264a+fda2f8d27a90fbbede4b4c9cdafd1aecd10a0f63+fddc63bd930b3671c28b0478c1a0e5345db3a039+fe0f93b940bcad4c1302213b792d9d6a3d904c84+fe1a42f2db7bd684459f5cbc72a1377b04a4059a+fea1405d7d35d823625d91c25d94daa89ae28e43+fec8d0b3179d3e09081e245a9216c15147aa5717+ff04fa1e0de9f01e15193798934f027b01988a87 -> 404 (0 bytes)

Does this mean that just one of the requested descriptors wasn't found?
Or, that there was no answer, or ...?
Hope I didn't overlook anything obvious in the tor-spec.

comment:9 in reply to:  8 ; Changed 2 years ago by karsten

Replying to iwakeh:

I added the referencing descriptor analysis to the todo list.

Great!

Another question regarding the logging statements:

What do the following numbers mean, i.e. how reliable are these numbers?
Also, the difference between "importing" and "downloading"?

[...]

CollecTor receives its descriptors from two sources:

  1. it first fetches copies of cached descriptors from directory authority gabelmoo using wget on the command line outside of the directory protocol and imports them (the "import" step);
  2. it looks which descriptors it's still missing and fetches them via the directory protocol from most directory authorities except a few that have been too slow in the past (the "download" step).

I believe that the numbers in the logs are reliable, but I don't know for certain.

And:
Why are there so many "404 Not found" like
[...]

Does this mean that just one of the requested descriptors wasn't found?
Or, that there was no answer, or ...?
Hope I didn't overlook anything obvious in the tor-spec.

Fine question, let's simply try it out. The current extra-info descriptor digest of TorLand1 is 352c235cd4134d0d8ff41c1f88c653a450bd4137:

curl -v http://154.35.32.5/tor/extra/d/352c235cd4134d0d8ff41c1f88c653a450bd4137 > /dev/null   # 200 OK
curl -v http://154.35.32.5/tor/extra/d/0000000000000000000000000000000000000000 > /dev/null   # 404 Not found
curl -v http://154.35.32.5/tor/extra/d/352c235cd4134d0d8ff41c1f88c653a450bd4137+0000000000000000000000000000000000000000 > /dev/null   # 200 OK

So, looks like "404 Not found" means that none of the requested descriptor was found. Maybe the spec agrees, I didn't look.

comment:10 Changed 2 years ago by iwakeh

a quick overview of 48 hours of missing referenced descs on a new instance can be found here.

comment:11 Changed 2 years ago by karsten

Here are some random ideas why there might be missing referenced descriptors on your CollecTor instance:

  1. My CollecTor instance does not download descriptors from 194.109.206.212, because in May 2014 that directory authority left connections open without writing any bytes, and we're not handling timeouts very well yet. I don't expect that to still be the case, but if you're seeing log lines with that authority indicating problems, maybe take that address out of DownloadFromDirectoryAuthorities.
  2. I'm not setting DownloadAllServerDescriptors and DownloadAllExtraInfoDescriptors in my CollecTor instance. It might be that these settings are the cause for your missing numbers going down once per day. Maybe the logs tell you more, or maybe you'll need to add logs for whenever your instance downloads all descriptors. By the way, here's what I noted down when I disabled those settings: "By downloading "all" descriptors, we only learn the most recent descriptors for all known servers, not all known descriptors. That’s not exactly what we’d expect. There’s also a potential problem with the result: the authority’s own descriptor ends with a double newline which might confuse metrics-lib; unless we split up concatenated descriptors differently in metrics-db. Found out both things on May 8, 2014 when looking more into #11648."
  3. I'm also not setting CompressRelayDescriptorDownloads. Here's where I noted down why: "2014-04-29: changed to 0, because of "java.io.EOFException: Unexpected end of ZLIB input stream"". I also noted down this: "The reason for broken compressed downloads might have been #11648, which should be fixed by August/September. The current default in metrics-db for compressing downloads is 0. That's bad. Consider fixing this once all directory authorities have upgraded. Have they?"

And here are some suggestions for finding out more about the missing descriptors:

  • For serverdesc missing (referenced by votes), can you plot how many of those are missing by how many votes? I wouldn't worry so much about missing descriptors being referenced from a single vote but more about missing descriptors being referenced from (almost) all votes.
  • Regarding the extrainfo missing, have these been published by relays that only published a single server descriptor or relays that have been around for a longer time? Again, more worried in the latter case.

You'll notice that I'm mostly guessing here, because I don't know what could be going wrong. But I think you're in a good position to spot a bug or three here. Thanks for looking into this!

comment:12 Changed 2 years ago by iwakeh

Interesting background info!

ad 1. 194.109.206.212 is now behaving only 200 and 404. So, no need to exclude it.
ad 2. I turned off the DownloadAll... properties in order to see, what will change here.
ad 3. There are no Exceptions logged. Only the occasional "Could not parse" warning (cf. ticket #19170). therefor I didn't change the compression setting. I also want to first see some results of the change to the downloadall settings in 2.

More soon.

Last edited 2 years ago by iwakeh (previous) (diff)

comment:13 in reply to:  9 ; Changed 2 years ago by iwakeh

curl -v http://154.35.32.5/tor/extra/d/352c235cd4134d0d8ff41c1f88c653a450bd4137 > /dev/null   # 200 OK
curl -v http://154.35.32.5/tor/extra/d/0000000000000000000000000000000000000000 > /dev/null   # 404 Not found
curl -v http://154.35.32.5/tor/extra/d/352c235cd4134d0d8ff41c1f88c653a450bd4137+0000000000000000000000000000000000000000 > /dev/null   # 200 OK

So, looks like "404 Not found" means that none of the requested descriptor was found. Maybe the spec agrees, I didn't look.

Your empirical result coincides with the spec:

6.2. HTTP status codes

  Tor delivers the following status codes.  Some were chosen without much
  thought; other code SHOULD NOT rely on specific status codes yet.

  200 -- the operation completed successfully
      -- the user requested statuses or serverdescs, and none of the ones we
         requested were found (0.2.0.4-alpha and earlier).

  304 -- the client specified an if-modified-since time, and none of the
         requested resources have changed since that time.

  400 -- the request is malformed, or
      -- the URL is for a malformed variation of one of the URLs we support,
          or
      -- the client tried to post to a non-authority, or
      -- the authority rejected a malformed posted document, or

  404 -- the requested document was not found.
      -- the user requested statuses or serverdescs, and none of the ones
         requested were found (0.2.0.5-alpha and later).

  503 -- we are declining the request in order to save bandwidth
      -- user requested some items that we ordinarily generate or store,
         but we do not have any available.

Is the "none" in the 200 description a typo? Shouldn't it be "at least one"? Or am I missing something obvious?

comment:14 in reply to:  11 Changed 2 years ago by iwakeh

And here are some suggestions for finding out more about the missing descriptors:

  • For serverdesc missing (referenced by votes), can you plot how many of those are missing by how many votes? I wouldn't worry so much about missing descriptors being referenced from a single vote but more about missing descriptors being referenced from (almost) all votes.
  • Regarding the extrainfo missing, have these been published by relays that only published a single server descriptor or relays that have been around for a longer time? Again, more worried in the latter case.

I'll add some plots for both questions.
Many of the missing serverdesc will most likely be referenced by the majority of votes, b/c the missing descs referenced from consensus always goes up at the same time.

ad 2. turning off DownloadAll... doesn't make much of a difference, yet (cf. here).

comment:15 in reply to:  13 Changed 2 years ago by karsten

Replying to iwakeh:

  200 -- [...]
      -- the user requested statuses or serverdescs, and none of the ones we
         requested were found (0.2.0.4-alpha and earlier).
  [...]

  404 -- [...]
      -- the user requested statuses or serverdescs, and none of the ones
         requested were found (0.2.0.5-alpha and later).

Is the "none" in the 200 description a typo? Shouldn't it be "at least one"? Or am I missing something obvious?

So, in the third or fourth attempt I read that as follows:

If a client requests server descriptors from a directory and none can be found, a Tor version 0.2.0.4-alpha and earlier returns code 200 and a version 0.2.0.5-alpha and later returns code 404.

comment:16 Changed 2 years ago by iwakeh

Status: needs_informationaccepted

This makes sense. Maybe, it needs to be mentioned that all versions return 200 if one or more results were found. Then it would be comprehensible on first reading ;-)

Setting status to assigned as all open questions are answered.
Todo: add detailed plots about the missing descriptor (cf. comment:11)

comment:17 Changed 17 months ago by karsten

Summary: analysis of descriptor completenessAnalyze descriptor completeness
Type: taskenhancement

comment:18 Changed 11 months ago by karsten

Keywords: metrics-2018 added

comment:19 Changed 7 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: acceptedassigned

Move to metrics-team as these are not worked on by me during the next week.

Note: See TracTickets for help on using tickets.