#24729 closed defect (fixed)

Find reason for 'null' values in Onionoo document

Reported by: Dbryrtfbcbhgf
Owned by: karsten
Priority: High
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Major
Keywords:
Cc: sjcjonker@…, irl
Actual Points:
Parent ID: #24155
Points:
Reviewer: iwakeh
Sponsor:

Description (last modified by iwakeh)

See comment:1 for a data example.
Next steps:

  • determine possible cause of null values (incl. data check on CollecTor)
  • discuss whether this is a desirable feature or a bug
  • fix the cause, if necessary

Initial report:

This relay shows that there was 0 read/write bytes per second for almost a month even though the Consensus Weight stays above 8000. https://atlas.torproject.org/#details/5D86AFD7CE409251E67B373B4F0E780A0F41C944

Child Tickets

Attachments (3)

issue.png (270.7 KB) - added by Dbryrtfbcbhgf 20 months ago.
Image showing the metrics issue
relay-search-upsampled.png (144.6 KB) - added by karsten 19 months ago.
3-month-graph-with-24-hour-resolution.png (233.6 KB) - added by karsten 19 months ago.

Download all attachments as: .zip

Change History (26)

Changed 20 months ago by Dbryrtfbcbhgf

Attachment: issue.png added

Image showing the metrics issue

comment:1 Changed 20 months ago by irl

Component: Metrics/Relay Search → Metrics/Onionoo

These graphs are based on the data coming out of Onionoo, so if this is a bug, it's most likely a bug in Onionoo.

https://onionoo.torproject.org/bandwidth?lookup=2949AE0F05FEC29FBB503C60F77D25E696B81808 looks like:

{"version":"5.0",
"build_revision":"0bce98a",
"relays_published":"2017-12-24 08:00:00",
"relays":[
{"fingerprint":"5D86AFD7CE409251E67B373B4F0E780A0F41C944","write_history":{"1_month":{"first":"2017-11-23 06:00:00","last":"2017-12-24 02:00:00","interval":14400,"factor":4671.676676676677,"count":186,"values":[821,889,863,949,999,783,652,633,625,746,720,698,752,710,854,865,878,796,400,597,689,740,788,812,567,823,832,530,530,489,404,446,459,467,555,500,333,369,335,385,309,343,299,338,408,490,514,462,409,514,490,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,348,348,301,271,289,373,451,295,468,453,338,384,292,329,348]},"3_months":{"first":"2017-09-23 06:00:00","last":"2017-12-24 06:00:00","interval":43200,"factor":5453.75975975976,"count":185,"values":[206,244,255,313,192,215,146,203,219,275,260,298,263,291,281,320,245,356,248,320,299,297,264,280,278,313,254,239,216,231,213,330,254,291,284,277,231,293,264,272,153,194,169,212,177,197,176,203,194,241,187,248,220,252,231,286,298,279,229,185,77,161,109,115,122,157,141,126,118,107,101,76,90,125,131,114,107,138,90,176,78,213,86,92,100,94,87,124,87,48,104,80,125,102,6,105,140,184,177,210,266,259,299,381,311,465,394,448,442,390,490,653,624,655,664,766,727,999,910,967,794,897,751,803,591,597,617,741,512,633,629,540,382,423,343,294,280,403,395,420,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,298,246,320,359,287,298]},"1_year":{"first":"2017-09-10 00:00:00","last":"2017-12-23 00:00:00","interval":172800,"factor":4918.484484484485,"count":53,"values":[1,24,108,279,383,226,244,283,210,292,320,324,316,301,275,307,294,202,209,241,274,275,128,151,111,128,141,134,112,89,94,197,334,449,548,751,999,900,706,641,400,405,459,550,476,473,424,381,371,317,262,353,336]}},"read_history":{"1_month":{"first":"2017-11-23 06:00:00","last":"2017-12-24 02:00:00","interval":14400,"factor":4700.518518518518,"count":186,"values":[821,889,862,949,999,785,653,634,626,747,720,699,754,712,855,866,878,796,400,597,688,739,787,810,566,819,829,529,528,488,404,445,458,463,553,499,332,368,334,382,308,343,299,337,407,488,512,461,408,513,489,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,350,350,302,272,290,374,452,295,467,451,338,382,292,330,347]},"3_months":{"first":"2017-09-23 06:00:00","last":"2017-12-24 
06:00:00","interval":43200,"factor":5479.079079079079,"count":185,"values":[205,243,254,311,191,214,145,202,218,273,258,297,262,290,280,319,244,354,247,318,298,295,262,278,277,312,253,238,216,231,214,329,253,291,284,277,231,292,264,271,152,193,169,212,177,197,176,203,194,242,187,248,221,252,231,286,299,280,230,185,77,161,109,115,122,157,141,126,118,107,101,75,90,125,131,114,107,138,90,176,78,213,86,92,100,93,86,124,87,48,104,80,125,102,6,105,140,184,177,209,265,259,299,380,311,465,394,448,442,389,490,653,623,655,664,766,727,999,912,969,797,898,752,803,592,599,619,743,513,633,628,539,382,422,343,293,280,402,395,420,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,300,247,321,359,287,298]},"1_year":{"first":"2017-09-10 00:00:00","last":"2017-12-23 00:00:00","interval":172800,"factor":4946.875875875876,"count":53,"values":[1,24,107,277,380,224,242,281,208,290,319,322,314,299,274,306,293,201,209,241,274,275,128,151,111,128,141,134,112,88,94,197,333,448,547,750,999,900,707,640,399,404,459,549,476,473,427,383,375,320,264,355,336]}}}
],
"bridges_published":"2017-12-24 08:04:04",
"bridges":[
]}

There seem to be a lot of null values in there.
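As a rough illustration of what those null values mean in practice, the following Python sketch decodes the 1-month write history above, assuming the documented Onionoo semantics: null marks an unknown data point, and each non-null value is multiplied by the history's factor to recover bytes per second. The URL and fingerprint are taken from this ticket; the rest is only a demonstration.

    # Sketch: count null (unknown) data points and decode the remaining values.
    import json
    import urllib.request

    URL = ("https://onionoo.torproject.org/bandwidth"
           "?lookup=5D86AFD7CE409251E67B373B4F0E780A0F41C944")

    with urllib.request.urlopen(URL) as response:
        document = json.load(response)

    history = document["relays"][0]["write_history"]["1_month"]
    values = history["values"]
    nulls = sum(1 for v in values if v is None)
    # Per the Onionoo protocol, non-null values are multiplied by "factor"
    # to obtain bytes per second; null stays "unknown".
    bytes_per_second = [None if v is None else v * history["factor"] for v in values]

    print(f"{nulls} of {len(values)} data points are null (unknown)")
    print("first decoded value (bytes/s):", bytes_per_second[0])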

comment:2 Changed 20 months ago by iwakeh

Description: modified (diff)
Summary: This relay shows that there was 0 read/write bytes per second for almost a month → Find reason for 'null' values in Onionoo document

comment:3 Changed 20 months ago by karsten

Here's what happened:

  • The relay upgraded from a tor software version reporting bandwidth for 4 hour intervals to 0.3.2.6-alpha on December 2 and later downgraded to a pre-0.3.2.6-alpha version on December 21. During that time it reported bandwidth values for 24 hour intervals.
  • Onionoo's 1 month graph has a fixed data point interval of 4 hours. It cannot report data with a resolution of only 1 data point per 24 hours. That's why Onionoo includes all those null values, simply meaning that it doesn't know how much bandwidth the relay read or wrote during that time.
  • Relay search displays those null values as 0 values, which it shouldn't do. Knowing that a relay pushed 0 bytes is a different thing than not knowing how many bytes it pushed.
  • Only the 1 month and 3 months graphs are affected, because they have a data resolution of 4 hours and 12 hours, respectively. Look at the 1 year graph with a data resolution of 48 hours and which doesn't have a gap in December 2017.
  • Note that this is the same issue that leads to blank 1 month and 3 month graphs and non-blank 1 year graphs for new relays.

Is this a bug in Onionoo? Probably.

What can we do? We could go through all reported data points for a given graph and pick the largest interval as data point interval. That is, if all data points have an interval of 4 hours, we produce graph data with 1 data point per 4 hours. But if there's at least 1 reported data point with an interval of 24 hours, we aggregate all data for that graph to 1 data point per 24 hours.

A possible downside is that the graphs for 3 months, 1 month, 1 week, and 3 days will basically contain the exact same data. Maybe we can do something smart to avoid providing overly redundant data, but not so smart that we end up moving too much logic into clients.
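To make the "pick the largest interval" idea concrete, here is a small Python sketch. It is an illustration only, not Onionoo code, and the data points are hypothetical (start timestamp, reporting interval in seconds, bytes transferred during that interval).

    # Sketch: aggregate all reported data points to the largest reported interval.
    from collections import defaultdict

    def aggregate(points):
        bucket_size = max(interval for _, interval, _ in points)
        buckets = defaultdict(lambda: [0, 0])   # bucket start -> [bytes, seconds covered]
        for start, interval, total_bytes in points:
            bucket = start - start % bucket_size
            buckets[bucket][0] += total_bytes
            buckets[bucket][1] += interval
        # Average bytes/s over the seconds actually covered in each bucket.
        return {start: b / s for start, (b, s) in sorted(buckets.items())}

    points = [(0, 14400, 7.2e9), (14400, 14400, 6.8e9),   # two 4-hour reports
              (86400, 86400, 40.0e9)]                      # one 24-hour report after the upgrade
    print(aggregate(points))   # one averaged value per 24-hour bucket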

comment:4 Changed 20 months ago by karsten

Here's a possibly smarter idea. We add a 6 month history object with a data resolution of 1 day. And we skip graphs if the reported data interval of even a single data point is higher than the graph interval.

Let me illustrate this idea by giving a table of all documents containing histories together with their graph intervals (columns), data resolution (cells), and maximum number of data points (in parentheses):

document  | 3 days           | 1 week       | 1 month       | 3 months       | 6 months    | 1 year       | 5 years
clients   | -                | 1 day (7)    | 1 day (31)    | 1 day (92)     | 1 day (183) | 2 days (183) | 10 days (183)
bandwidth | 15 minutes (288) | 1 hour (168) | 4 hours (186) | 12 hours (184) | 1 day (183) | 2 days (183) | 10 days (183)
uptime    | -                | 1 hour (168) | 4 hours (186) | 12 hours (184) | 1 day (183) | 2 days (183) | 10 days (183)
weights   | -                | 1 hour (168) | 4 hours (186) | 12 hours (184) | 1 day (183) | 2 days (183) | 10 days (183)

Example: The 3 days bandwidth graph has a data resolution of 15 minutes which means that it can hold up to 288 data points (3 * 24 * 60 / 15 = 288).
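The per-graph maxima in the table follow directly from the graph period divided by the data resolution. A tiny sketch; the exact period lengths (31, 92, 183, 366 days, and so on) are assumed here only so that the computed counts match the table above.

    # Sketch: maximum number of data points = graph period / data resolution (in hours).
    HOUR = 1
    DAY = 24 * HOUR

    graphs = {"3 days":   (3 * DAY,        0.25),   # 15 minutes
              "1 week":   (7 * DAY,        1),      # 1 hour
              "1 month":  (31 * DAY,       4),      # 4 hours
              "3 months": (92 * DAY,       12),     # 12 hours
              "6 months": (183 * DAY,      24),     # 1 day
              "1 year":   (366 * DAY,      48),     # 2 days
              "5 years":  (5 * 366 * DAY,  240)}    # 10 days

    for name, (period, resolution) in graphs.items():
        print(name, int(period / resolution))       # e.g. "3 days 288", "1 month 186"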

The suggested changes are to add a 6 months graph with a resolution of 1 day and also to drop the shorter graphs from clients documents with a data resolution of 1 day. (The latter is unrelated to this change, but why not clean up here now that we discovered this redundancy.)

How does this sound?

comment:5 Changed 20 months ago by sjcjonker

Cc: sjcjonker@… added

comment:6 Changed 20 months ago by teor

This is related to the metrics ticket #24155 about the tor reporting change in #23856. There's also #24829 about the graphs we show.

Maybe #24155 should be the parent here? (How does metrics do parent tickets?)

comment:7 in reply to:  6 Changed 19 months ago by karsten

Parent ID: #24155

Replying to teor:

This is related to the metrics ticket #24155 about the tor reporting change in #23856. There's also #24829 about the graphs we show.

Maybe #24155 should be the parent here? (How does metrics do parent tickets?)

Sure, can't hurt. (I'd say we do parent tickets in a similar way as other teams.)

comment:8 in reply to:  4 Changed 19 months ago by karsten

Replying to karsten:

Here's a possibly smarter idea. We add a 6 month history object with a data resolution of 1 day. And we skip graphs if the reported data interval of even a single data point is higher than the graph interval.

irl, iwakeh, I'm curious what you think about this idea!

comment:9 Changed 19 months ago by karsten

Owner: changed from metrics-team to karsten
Status: new → accepted

Accepting this ticket as something I'm going to work on over the next week.

comment:10 Changed 19 months ago by karsten

Status: accepted → needs_review

Setting to needs_review to get a review of the idea above: add 6 month graphs and skip graphs as soon as we run into a single data point with a reporting period that is too long.

comment:11 Changed 19 months ago by iwakeh

Reviewer: iwakeh

comment:12 Changed 19 months ago by iwakeh

The suggestion in comment:4 seems ok; technically, all data for shorter graphs can be taken from there.
Can the clients using the data provide all useful/needed graphs with this change and deal with possibly omitted data?

comment:13 in reply to:  12 Changed 19 months ago by karsten

Replying to iwakeh:

The suggestion in comment:4 seems ok; technically, all data for shorter graphs can be taken from there.

Agreed.

Can the clients using the data provide all useful/needed graphs with this change and deal with possibly omitted data?

As of now, clients cannot handle this yet. But I think that's #24831, so at least Relay Search will be able to handle this at some point.

However, I thought more about omitting data. Maybe we should not do that. Here's a new idea:

  • We add a 6 month history object with a data resolution of 1 day. (Same as above.)
  • We drop the shorter graphs from clients documents with a data resolution of 1 day. (Same as above.)
  • If we compile a graph from a history only containing entries with a data resolution that is too low for the graph (e.g., data is provided for 24 hours, but the graph shows data for 12 hours), we skip that graph. (Same as we do right now, same as above.)
  • If we compile a graph from a history also containing entries with a data resolution that is too low along with entries with the high-enough data resolution (e.g., data is provided for 4 hour intervals, then for 24 hour intervals, then again for 4 hour intervals), we include that graph and interpolate data as necessary. (This suggestion is new. The earlier suggestion was to drop this graph.)

Here's why I think that this suggestion is better:

  • Even if we add a 6 month graph today, the data in current status files for 3 to 6 months ago is already compressed to a resolution of 2 days. We'd basically have to wait 3 months for 6 month graphs to first appear, or we'd have to re-import data. Ewww.
  • I'm worried that there are edge cases where we're compressing data in status files a tiny bit too early. The effect might be that we're not providing any graphs at all.

Here's a possible issue that I see with this suggestion:

  • Whenever a relay changes its reporting interval, graphs will suddenly be a lot less volatile, which might confuse relay operators wondering what happened. But I think the explanation that their relay is now reporting data at a lower resolution should be obvious enough to quickly answer those questions.

I did not implement this, because I'd still want us to resolve #16513 first. Raising priority on that even more, so that we can finally resolve this one.

How does this plan sound?
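For illustration, one plausible reading of the interpolation idea above is to spread a coarse 24-hour average across the finer 4-hour slots it covers, so the 1-month graph keeps its 4-hour resolution without null gaps. The Python sketch below shows that reading; it is not Onionoo's actual implementation, and the report values are hypothetical bytes/s averages.

    # Sketch: upsample coarse reports into the fixed 4-hour graph slots by repetition.
    GRAPH_INTERVAL = 14400           # 4 hours, in seconds

    def upsample(reports):
        """reports: list of (start timestamp, interval in seconds, bytes per second)."""
        slots = {}
        for start, interval, rate in reports:
            for slot in range(start, start + interval, GRAPH_INTERVAL):
                slots[slot] = rate   # the coarse average repeats in every finer slot
        return [slots.get(t) for t in range(0, max(slots) + GRAPH_INTERVAL, GRAPH_INTERVAL)]

    reports = [(0, 14400, 480_000), (14400, 14400, 510_000),   # 4-hour reports
               (28800, 86400, 460_000)]                         # one 24-hour report
    print(upsample(reports))   # the 24-hour average appears six times in a row (flat segment)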

Changed 19 months ago by karsten

Attachment: relay-search-upsampled.png added

comment:14 Changed 19 months ago by karsten

Hmm, no, I don't like my last suggestion anymore after trying it out based on the #16513 changes. Those interpolated/upsampled points look much more awkward than I had expected. We'd mainly shift confusion from missing points to points that look like glitches. Also, we don't really need a 3 month graph and a 6 month graph.


New plan:

  • Short-term fix:
    • We change just the bandwidth graph for 3 months to a data resolution of 24 hours rather than 12 hours. That way it can accommodate new statistics along with old statistics.
    • We fix Relay Search to plot null as a missing data point rather than the value 0. That's going to fix the 1 month graph, and it's the right thing to do anyway.
  • Medium-term fix:
    • We start retaining data in statuses on 24 hour granularity rather than 48 hours for up to 6 months.
    • In 3 months from now, we change the 3 months graph to a 6 months graph with a resolution of 24 hours.
    • Also in 3 months from now, we change Relay Search to display a 6 months graph rather than the 3 months graph.
  • Long-term fix:
    • We stop giving out data for fixed intervals and provide all data in a single history object along with a normalized x axis with timestamps.
    • We teach Relay Search to draw different graphs based on this single history object. Basically, it will need to learn how to downsample data points that are too detailed for a graph showing a long period of time.

I can try this out this afternoon. Does this make sense?
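As a rough sketch of the long-term idea, a single history object with timestamps could be downsampled client-side to whatever resolution a given graph needs; the Python below is an assumption about that approach, not Relay Search code.

    # Sketch: downsample (timestamp, bytes_per_second) pairs to coarser buckets by averaging;
    # buckets with no data stay None (drawn as gaps, not zeros).
    def downsample(history, bucket_seconds):
        if not history:
            return []
        first = history[0][0] - history[0][0] % bucket_seconds
        last = history[-1][0]
        buckets = []
        for start in range(first, last + 1, bucket_seconds):
            inside = [rate for ts, rate in history
                      if start <= ts < start + bucket_seconds and rate is not None]
            buckets.append(sum(inside) / len(inside) if inside else None)
        return buckets

    # Hypothetical hourly data covering 3 days, downsampled to one value per day.
    history = [(t * 3600, 400_000 + (t % 5) * 10_000) for t in range(0, 72)]
    print(downsample(history, 86400))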

comment:15 in reply to:  14 Changed 19 months ago by iwakeh

Replying to karsten:

Hmm, no, I don't like my last suggestion anymore after trying it out based on the #16513 changes. Those interpolated/upsampled points look much more awkward than I had expected. We'd mainly shift confusion from missing points to points that look like glitches. Also, we don't really need a 3 month graph and a 6 month graph.

Makes sense.

...
New plan:

  • Short-term fix:
    • We change just the bandwidth graph for 3 months to a data resolution of 24 hours rather than 12 hours. That way it can accommodate new statistics along with old statistics.

Let's take a look and try how this influences the resulting graphs.

  • We fix Relay Search to plot null as a missing data point rather than the value 0. That's going to fix the 1 month graph, and it's the right thing to do anyway.

This is a ticket, i.e., planned already, afaik.

  • Medium-term fix:
    • We start retaining data in statuses on 24 hour granularity rather than 48 hours for up to 6 months.

The granularity of one day is a good choice, imo.

  • In 3 months from now, we change the 3 months graph to a 6 months graph with a resolution of 24 hours.
  • Also in 3 months from now, we change Relay Search to display a 6 months graph rather than the 3 months graph.

Sure, the graphing window should rather be set and computed at the client side.

  • Long-term fix:
    • We stop giving out data for fixed intervals and provide all data in a single history object along with a normalized x axis with timestamps.

A fine goal and the way to go.

  • We teach Relay Search to draw different graphs based on this single history object. Basically, it will need to learn how to downsample data points that are too detailed for a graph showing a long period of time.

Clients should be able to handle the new data.

I can try this out this afternoon. Does this make sense?

Yes. It might be good to also hear more from the client/Relay Search side here, and to get an opinion on what users expect to see graphed, e.g. six vs. three months, granularity, etc.

Last edited 19 months ago by iwakeh (previous) (diff)

Changed 19 months ago by karsten

Attachment: 3-month-graph-with-24-hour-resolution.png added

comment:16 Changed 19 months ago by karsten

Replying to iwakeh:

Replying to karsten:

Hmm, no, I don't like my last suggestion anymore after trying it out based on the #16513 changes. Those interpolated/upsampled points look much more awkward than I had expected. We'd mainly shift confusion from missing points to points that look like glitches. Also, we don't really need a 3 month graph and a 6 month graph.

Makes sense.

...
New plan:

  • Short-term fix:
    • We change just the bandwidth graph for 3 months to a data resolution of 24 hours rather than 12 hours. That way it can accommodate new statistics along with old statistics.

Let's take a look and try how this influences the resulting graphs.

Alright, I implemented this and attached a sample graph showing 3 months of the relay in question in this ticket:


The graph on the left is the current graph. It contains null values in the middle.

The graph on the right is the new graph. It has fewer data points, but it does include the part in the middle as well as an additional part on the right that is not contained in the graph on the left.

Please review the last four commits in my task-16513-24729 branch which is based on your earlier task-16513-2 branch:

  • 8a14e83 is an important fix related to #16513.
  • 08932a6 is what I suggested as "Tor time" in #16513 and which is similar to your suggestion in #25091.
  • b00d44a changes the data resolution of the bandwidth 3 months graph.
  • 5769dca retains histories on a 24 hour granularity for up to 6 months rather than 3.

(Feel free to comment on the first two commits on #16513, if you prefer, and on the last two commits on this ticket.)

  • We fix Relay Search to plot null as a missing data point rather than the value 0. That's going to fix the 1 month graph, and it's the right thing to do anyway.

This is a ticket, i.e., planned already, afaik.

Okay.

  • Medium-term fix:
    • We start retaining data in statuses on 24 hour granularity rather than 48 hours for up to 6 months.

The granularity of one day is a good choice, imo.

Done. See the fourth commit above.

  • In 3 months from now, we change the 3 months graph to a 6 months graph with a resolution of 24 hours.
  • Also in 3 months from now, we change Relay Search to display a 6 months graph rather than the 3 months graph.

Sure, the graphing window should rather be set and computed at the client side.

Right, though in this case I basically meant to change the label from "3 Months" to "6 Months". Computing things would be left to the long-term fix below.

  • Long-term fix:
    • We stop giving out data for fixed intervals and provide all data in a single history object along with a normalized x axis with timestamps.

A fine goal and the way to go.

Agreed.

  • We teach Relay Search to draw different graphs based on this single history object. Basically, it will need to learn how to downsample data points that are too detailed for a graph showing a long period of time.

Clients should be able to handle the new data.

Agreed.

I can try this out this afternoon. Does this make sense?

Yes. It might be good to also hear more from the client/Relay Search side here, and to get an opinion on what users expect to see graphed, e.g. six vs. three months, granularity, etc.

Hopefully, the attached graphs are sufficient to make a decision on the short-term fix. And yes, the medium-term and long-term changes should be discussed more on their own tickets.

So, assuming this review goes through, how do we proceed, and in which order?

  1. Squash and merge to master.
  2. Put out a release, possibly adapting the change log.
  3. Update the specification page on metrics-web.
  4. Coordinate deployment by making backups and then updating.
  5. Create tickets for the suggested medium-term and long-term changes.
  6. Maybe also create a ticket for removing redundant graphs in the clients document.

comment:17 Changed 19 months ago by iwakeh

The commits look fine (also commented on #16513 for the two related to that ticket) and pass the usual tests and checks.

I leave this on review as the solution plan and graphs might need feedback from others, too.

Next steps in comment:16 seem fine to me.

comment:18 in reply to:  17 Changed 19 months ago by karsten

Cc: irl added

Replying to iwakeh:

The commits look fine (also commented on #16513 for the two related to that ticket) and pass the usual tests and checks.

Sounds great. Thanks for checking!

I leave this on review as the solution plan and graphs might need feedback from others, too.

irl, that would be you. Any thoughts? :)

Next steps in comment:16 seem fine to me.

Okay.

comment:19 Changed 19 months ago by karsten

Actually, we might as well move forward with what we have. The changes to 3 months graphs are easily reversible, because they don't affect files in status/ but only files in out/. If we later find out we want something else, we can simply write new out/ files. Let's just deploy what we have, because the current situation is bugging too many users.

As commented on #16513, I squashed all fixup/squash commits and pushed to master. I'm now preparing a release. Leaving this ticket open until we're done with all 6 steps above.

comment:20 Changed 19 months ago by karsten

Status: needs_review → merge_ready

comment:21 Changed 19 months ago by irl

This looks fine to me. Sorry for taking a while to get to replying.

comment:22 Changed 19 months ago by karsten

Ah, great. Sorry for moving forward without waiting for your reply. I thought it would be good to get this deployed sooner rather than later. We can adjust things later on as needed.

comment:23 in reply to:  16 Changed 19 months ago by karsten

Resolution: fixed
Status: merge_ready → closed

Replying to karsten:

  1. Squash and merge to master.

Done yesterday.

  2. Put out a release, possibly adapting the change log.

Done yesterday.

  3. Update the specification page on metrics-web.

Actually, the specification page does not say which level of detail a given graph is supposed to have, and we did not yet add a new 6 month graph. So, we don't have to update the specification page. Which also makes sense, because we did not change the protocol version number.

  4. Coordinate deployment by making backups and then updating.

Done yesterday.

  5. Create tickets for the suggested medium-term and long-term changes.

Created #25175 and #25176 for this today.

  6. Maybe also create a ticket for removing redundant graphs in the clients document.

Created #25177 for this today.

I think that addresses all next steps. Closing as fixed. Thanks for all the input!

Note: See TracTickets for help on using tickets.