Opened 4 months ago

Closed 3 months ago

#25196 closed defect (fixed)

Cut off recent dates from several CSV files

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Statistics Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: iwakeh Sponsor:


In commit 8831ece we stopped cutting off data from CSV files. From the commit message: "When graphing data from CSV files it's not our job to make sure the data is stable enough to be graphed. That would mean that whoever uses our CSV files directly would have to make that sure by themselves. If data is too recent to be graphed, it shouldn't be included in the CSV files. As a side effect this makes the graphing process a little easier."

Looks like we should cut off a day or two more from hidserv.csv. I'm going to attach a sample graph soon.

Let's also create new tickets for similar findings.

Child Tickets

Attachments (2)

hidserv-frac-reporting-2017-11-10-2018-02-09.png (60.7 KB) - added by karsten 4 months ago.
cut-off-recent-dates-2018-03-06.pdf (31.2 KB) - added by karsten 3 months ago.

Download all attachments as: .zip

Change History (14)

comment:1 Changed 4 months ago by karsten

comment:2 Changed 4 months ago by karsten

Hint: it's just the last date, February 9, where the fraction drops off. It's February 9 while I'm writing this comment.

comment:3 Changed 3 months ago by karsten

Owner: changed from metrics-team to karsten
Status: newaccepted

comment:4 Changed 3 months ago by karsten

Status: acceptedneeds_review

comment:5 Changed 3 months ago by karsten

Cc: metrics-team added
Reviewer: iwakeh

Optimistically assigning this review to iwakeh. Feel free to give it back if you don't want it.

comment:6 Changed 3 months ago by karsten

Priority: MediumHigh

Raising priority to high, as this is already confusing users.

comment:7 Changed 3 months ago by karsten

Priority: HighVery High

Now that metrics-lib/CollecTor releases are out of the way, it's time to fix this.

comment:8 Changed 3 months ago by karsten

Priority: Very HighMedium
Status: needs_reviewneeds_revision

I'd like to generalize this patch and also include the other statistics files. I'm currently waiting on some more data points from a local metrics-web instance. As soon as I have something, I'll revise the patch and set this ticket to needs_review again.

Changed 3 months ago by karsten

comment:9 Changed 3 months ago by karsten

Status: needs_revisionneeds_review
Summary: Cut off recent dates from hidserv.csvCut off recent dates from several CSV files

I set up a local metrics-web instance and modified it to run once per hour and not cut off any dates at all. I'm attaching a PDF file showing how statistics for given dates (colors) change (y axis) over the UTC day of March 6 (x axis). If a colored line changes much over the day, we cannot reasonable include it yet and need to cut off that date. There's a trade-off of holding back a statistic that is still changing too much vs. delaying a statistic more than necessary and not being able to act on the data.

Here's what I think we should do for all current statistics files:

  • servers.csv: We currently cut off 2 days (today = 2018-03-06 and the day before = 2018-03-05), but it would be sufficient to cut off just 1 day (today). The reason is that this file is based on consensuses and referenced server descriptors, all of which are typically available at the end of a day.
  • ipv6servers.csv: Same as servers.csv, except that we don't cut off anything yet, though I think we should, following the same rationale as above.
  • advbwdist.csv: Same as servers.csv, except that we already cut off just 1 day, so there's no need to change anything here.
  • bandwidth.csv: This file is based on statistics reported in extra-info descriptors, and those might take more time to come in. We're also not doing any estimates on the numbers we go so far, but we're simply adding up what we have. So, if 5% of statistics are still missing, those missing statistics will still change the end result by 5%. I suggest to wait 3 days. We currently cut off 4, but I think 3 should be sufficient. The better (long-term) solution would be to compensate missing data by extrapolating what we have, but we're not there yet.
  • connbidirect2.csv: Same as for bandwidth.csv, except that we're providing averages where missing descriptors don't affect the result as much. Cutting of 2 days will be fine (today and yesterday).
  • clients.csv and userstats-combined.csv: Same as for connbidirect2.csv, except that we're being smarter about estimating numbers from given reports. Cutting of 2 days will be enough (today and yesterday).
  • hidserv.csv: Same as clients.csv et al., except we're being quite smart about extrapolating reported statistics, so that we might even cut off just 1 day. But let's do 2 days as before to be on the safe side.
  • torperf-1.1.csv: OnionPerf only provides completed days, so it depends on when we get those files and whether we get all of them at once. I'm less certain here, but I think we're doing okay by cutting off 2 days.
  • webstats.csv: I don't have good data, because was down for a couple days now. This might also change after switching to CollecTor's webstats module. I'd say we don't touch this now and revisit it after switching to CollecTor.

Please review commit 450d9f1 in my updated task-25196 branch. If possible, I'd like to make changes tomorrow (Thursday).

comment:10 Changed 3 months ago by iwakeh

Regarding webstats our spec states in section 4.1:

In addition, log lines are treated differently according to the date they contain:

During an import process the sanitizer takes all log line dates into account and determines the reference interval as stretching from the oldest date to the youngest date encountered. Depending on the reference interval log lines are not yet processed, if their date is on the edges of the reference interval, i.e., the date is not at least a day younger than the older endpoint or the date is only LIMIT days older than the younger endpoint, where LIMIT is initially set to two, but this might change if necessary.
If the younger endpoint of the reference interval coincides with the current system date, the day before is used as the new younger reference interval endpoint, which ensures that the sanitizer won't publish logs prematurely, i.e., before there is a chance that they are complete. Thus, processing of log lines carrying such date is postponed.
All log lines with dates for which the sanitizer already published a log file are discarded in order to avoid altering published logs.

This means that logs are published (earliest) two days before today; two days before current system day. And, as soon as logs are published no more log lines with these dates will be added to new logs.

Last edited 3 months ago by iwakeh (previous) (diff)

comment:11 Changed 3 months ago by iwakeh

Status: needs_reviewmerge_ready

According to the attached statistics all the proposed cut-off day changes make sense. The code implements the necessary changes.
As tordir.sql is touched anyway, could the ORDER BY 1, 2, 3 ... lines be changed using the column names, too?

The date computations in java should start using LocalDate and friends, I'll add that to a different/new ticket.

A cut-off for webstats seems to be unnecessary. These can be computed as soon as a log for a date is available (cf. reasoning in comment:10), b/c no more information for a date with a published log will be added in future logs.

All in all, merge ready.

comment:12 Changed 3 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Thanks for reviewing! I changed the ORDER BY 1, 2, 3 parts in tordir.sql, squashed, and rebased to master. Let's tackle the dates in that new ticket. And okay, let's not cut off anything from webstats for now and see how that goes. Closing. Thanks!

Note: See TracTickets for help on using tickets.