Part 1: Analysis of Referenced Descriptor Completeness

This page summarizes the current findings. The discussion and questions can be found here.

Log Entries

The archiving component of CollecTor logs the missing descriptors of various types in a special format.

The following log entry explanation was extracted from Karsten's description in ticket 18798.

M-2016-04-11T22:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0279 -> 0.0279)
M-2016-04-11T23:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0279 -> 0.0279)
M-2016-04-11T23:00:00Z -> D-597C4455AF049B147337BBFF35CE4817676339FF5C94E971A05D416FD1A2DD95 (0.0279 -> 0.0558)
M-2016-04-12T00:00:00Z -> D-38F20E16457647CCFF5BD131692D5FCA129E87DC210B456DA983AB291141C85D (0.0280 -> 0.0558)
M-2016-04-12T00:00:00Z -> D-597C4455AF049B147337BBFF35CE4817676339FF5C94E971A05D416FD1A2DD95 (0.0280 -> 0.0558)
  1. The first line means that there's a microdescriptor with digest 38F2.. missing from the microdescriptor consensus with valid-after time 2016-04-11 22:00:00. That missing microdescriptor adds a value of 0.0279 to the total missing descriptor count which is then 0.0279. The idea is to only warn if that total value passes 1.0.
  2. The second line says that the same missing microdescriptor is also referenced from the microdescriptor consensus with valid-after time 2016-04-11 23:00:00. Given that we shouldn't double-count that missing descriptor, we're not increasing the total count there.
  3. The third line mentions another microdescriptor with digest 597C.. that is missing, and in this case it's referenced from the microdescriptor consensus with valid-after time 2016-04-11 23:00:00. That one raises the total count by another 0.0279 to then 0.0558.
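The counting rule described in the three points above can be sketched in Python. This is only an illustration, not CollecTor's actual code: the regular expression is inferred from the sample lines, and the names `parse_entry` and `running_total` are hypothetical.

```python
import re

# Assumed layout, inferred from the sample log lines:
#   <referrer> -> <missing descriptor> (<weight> -> <running total>)
ENTRY = re.compile(
    r"^(?P<referrer>\S+) -> (?P<missing>\S+) "
    r"\((?P<weight>\d+\.\d+) -> (?P<total>\d+\.\d+)\)$"
)


def parse_entry(line):
    """Return (referrer, missing, weight, logged_total), or None if no match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return m["referrer"], m["missing"], float(m["weight"]), float(m["total"])


def running_total(lines):
    """Sum the weights, counting each missing descriptor only once."""
    seen = set()
    total = 0.0
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue  # skip lines that are not missing-descriptor entries
        _, missing, weight, _ = parsed
        if missing not in seen:
            seen.add(missing)
            total += weight
    return total
```

Under these assumptions, a warning would be issued only once `running_total(...)` exceeds 1.0.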

Other examples of log entries listing missing descriptors:

C-2016-03-19T07:00:00Z -> S-BD9E2444C8416A29467463F6B228CEB75B1216B7 (0.0281 -> 0.0281)
S-000A13E991700CB0A356CD08DDC0CDAB022F8B7E -> E-8A8DB3818A2CEE9D2844F8A9AD6FB89E04CFA7D1 (0.0100 -> 8.6512)
V-2016-03-19T09:00:00Z-14C131DFC5C6F93646BE72FA1401C02A8DF2E8B4 -> S-010612B70E18CB3E0CCA72A464E8FD683FDF029B (0.0254 -> 15.5266)

The short explanation for all four types:

  • S-: a server descriptor references an extra-info descriptor that is missing,
  • V-: a vote references a server descriptor that we're missing,
  • C-: a consensus references a server descriptor that we're missing, and
  • M-: a microdescriptor consensus references a microdescriptor that is missing (see above).
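As a small sketch, the four prefixes listed above can be mapped to referrer types as follows. The dictionary and the function name `referrer_type` are hypothetical and only mirror the list; they are not taken from CollecTor.

```python
# Prefix of the left-hand side of a log entry -> type of the referrer.
REFERRER_TYPES = {
    "S": "server descriptor",
    "V": "vote",
    "C": "consensus",
    "M": "microdescriptor consensus",
}


def referrer_type(referrer):
    """Return the referrer type for an identifier such as 'C-2016-03-19T07:00:00Z'."""
    prefix = referrer.split("-", 1)[0]
    return REFERRER_TYPES.get(prefix)  # None for an unknown prefix
```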

Method

The missing-descriptor log entries are parsed and collected in sets keyed by the time-stamp of the log entry and the referrer type. Using sets avoids counting a missing descriptor more than once when it is referenced by multiple entities (e.g. different votes or different microdescriptor consensuses). Missing server descriptors are tracked separately for votes and consensuses, i.e., a missing server descriptor referenced by both a vote and a consensus increases the count for both types.

From these sets two numbers are calculated for each time-stamp and referrer type:

  • the number of descriptors currently missing for that referrer type and
  • the number of newly missing descriptors, i.e., those that were not yet missing in the previous run.
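The two numbers can be computed with plain set operations. The following is a minimal sketch, not the script used for this analysis: the function name `count_missing` and the input shape are hypothetical, and it assumes that the "previous run" is the previous time-stamp for which the same referrer type has at least one entry.

```python
from collections import defaultdict


def count_missing(observations):
    """Count current and newly missing descriptors per run and referrer type.

    observations: iterable of (run_timestamp, referrer_type, missing_id)
    tuples; timestamps must sort chronologically (e.g. ISO 8601 strings).
    Returns {(run_timestamp, referrer_type): (current_count, new_count)}.
    """
    by_run = defaultdict(set)
    for run, rtype, missing in observations:
        # Sets drop duplicates, e.g. one descriptor referenced by several votes.
        by_run[(rtype, run)].add(missing)

    counts = {}
    previous = {}  # referrer type -> set of missing ids from its previous run
    for rtype, run in sorted(by_run):  # per type, runs in ascending order
        current = by_run[(rtype, run)]
        new = current - previous.get(rtype, set())
        counts[(run, rtype)] = (len(current), len(new))
        previous[rtype] = current
    return counts
```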

Data

The log files cover 2016-03-08 to 2016-04-13, with gaps from 2016-03-09 to 2016-03-18 and from 2016-03-24 to 2016-03-31.

There was one known incident of a full server hard drive that prevented storing descriptors around 2016-03-19.

Another peak in missing descriptors is visible around 2016-04-01, which is also explained by a full hard drive.

Deciles

The following deciles are calculated without excluding the peaks:

referenced by     0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
consensus          0    0    0    0    0    0    0    0    0    0  1339
votes              0    0    0    0    0    0    0    0    1    3  2375
server             0    2    2    3    7   11   16   19   26   35    55
microconsensus     0    3    4    5    7    8   12   15   26   56   798
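For reference, deciles of this kind can be computed from the per-run counts as sketched below. The page does not state which tool or quantile definition produced the table, so the linear-interpolation ("inclusive") method and the function name `deciles` are assumptions.

```python
import statistics


def deciles(counts):
    """Return the 0%, 10%, ..., 100% quantiles of a list of counts.

    Needs at least two data points. The 'inclusive' method interpolates
    linearly between the sorted values and treats the minimum and maximum
    as the 0% and 100% quantiles.
    """
    cuts = statistics.quantiles(counts, n=10, method="inclusive")
    return [min(counts)] + cuts + [max(counts)]
```

A single extreme run, such as the ones around the full-hard-drive incidents, affects only the 100% column, which is why the peaks can be left in without distorting the lower deciles much.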

Graphs

Each of the following diagrams shows the total number of missing descriptors in a lighter color and the number of newly encountered missing descriptors in a darker color.

The y-axis depicts the count, the x-axis the time of measurement.

Counts are discrete, so the lines connecting the data points only serve to make the graphs easier to read; they are not an interpolation for the time between measurements.

Total Picture

April 1st Closeup

April 2nd to April 13th

Last modified on Apr 29, 2016, 6:44:13 PM
