Opened 4 years ago

Last modified 17 months ago

#16555 assigned enhancement

Make user statistics more robust against outliers

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords: metrics-2018
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

tl;dr: From June 11 to 13, 2015, the number of bridge users briefly went up from around 20k to 140k. A closer investigation of the underlying data revealed that the aggregate statistics reported by a single bridge were responsible for this major spike. The estimation method used for user statistics should be made robust against outliers, possibly by applying the more recently developed techniques that are used to extrapolate hidden-service statistics.

Here are more details about that single bridge reporting almost unbelievably high statistics: It's the bridge with nickname "solemnizersfiaun" and hashed fingerprint 420C39C86B0E71F653E18552B28B9189DA2F1377 that reported having served up to 80k users. But from the bandwidth statistics it looks like that bridge actually answered a huge number of consensus requests during those days in June. It pushed up to 20 MB/s, which is probably rather unusual for a bridge. A closer look at the descriptor tells us that most of these bytes were used to answer directory requests. (I didn't do the math on whether such a burst over a few hours would be sufficient to write 800k compressed consensuses.) So, either the bridge is telling us the truth, or it's lying to us in a very sophisticated way.
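The back-of-the-envelope math mentioned in parentheses above could be sketched as follows. This is only a rough check, not a result from the ticket: the compressed consensus size of 500,000 bytes is a hypothetical placeholder that would need to be replaced with the real size from June 2015, and the function name is made up for illustration. Only the 800k consensuses and the 20 MB/s peak rate come from the description.

```python
def hours_to_serve(num_consensuses, consensus_size_bytes, rate_bytes_per_sec):
    """Return the hours needed to write the given number of consensuses
    at a constant write rate."""
    total_bytes = num_consensuses * consensus_size_bytes
    return total_bytes / rate_bytes_per_sec / 3600.0


# ASSUMPTION: hypothetical compressed consensus size, not taken from the ticket.
ASSUMED_CONSENSUS_SIZE_BYTES = 500_000

# From the ticket: 800k consensuses, peak rate of 20 MB/s.
hours = hours_to_serve(800_000, ASSUMED_CONSENSUS_SIZE_BYTES, 20_000_000)
print(f"{hours:.1f} hours at a sustained 20 MB/s")
```

Under that assumed size, the bridge would have to sustain its peak rate for several hours, so whether the reported numbers are plausible depends strongly on the actual consensus size and on how long the burst really lasted.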

And it's not only that bridge that reported very high statistics in June. There's another bridge with nickname "Unnamed" and hashed fingerprint 82F37B9A8400A1E0C0730D8E4639150AE11AC640 that reported having served around 10k users on June 18 and 22. Similarly, that bridge reported extremely high traffic during those days. I didn't look for more bridges, but it's possible that others also reported unusual numbers that just didn't stand out as much as these.

So, I'm not sure if we'll find out what exactly happened there, but it seems very unrealistic that these directory requests were generated by actual human users. That's why I think we should remove these outliers in our estimation method.
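To illustrate what removing outliers could look like, here is a minimal sketch that drops per-bridge values above an upper Tukey fence (third quartile plus 1.5 times the interquartile range) before summing. This is a generic technique picked for illustration; it is not the current user estimation method and not necessarily the technique used for hidden-service statistics. The function name, the factor of 1.5, and the sample numbers are all hypothetical.

```python
import statistics


def drop_high_outliers(values, k=1.5):
    """Return the values at or below the upper Tukey fence Q3 + k * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in values if v <= upper_fence]


# ASSUMPTION: made-up per-bridge user counts for one day; the last value
# stands in for a single bridge reporting an implausibly high number.
reported = [12, 8, 15, 20, 9, 11, 14, 80_000]

kept = drop_high_outliers(reported)
print("naive total: ", sum(reported))
print("robust total:", sum(kept))
```

A real implementation would have to work on the reported directory request statistics rather than on final user counts, and it would need to decide whether excluded bridges should be dropped entirely or replaced by an extrapolated value.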

Child Tickets

Attachments (5)

userstats-bridge-country.png (9.3 KB) - added by karsten 4 years ago.
solemnizersfiaun-clients.png (14.9 KB) - added by karsten 4 years ago.
solemnizersfiaun-bandwidth.png (16.1 KB) - added by karsten 4 years ago.
Unnamed-clients.png (17.9 KB) - added by karsten 4 years ago.
Unnamed-bandwidth.png (19.3 KB) - added by karsten 4 years ago.


Change History (10)

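Changed 4 years ago by karsten

Attachment: userstats-bridge-country.png added

Changed 4 years ago by karsten

Attachment: solemnizersfiaun-clients.png added

Changed 4 years ago by karsten

Attachment: solemnizersfiaun-bandwidth.png added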

Changed 4 years ago by karsten

Attachment: Unnamed-clients.png added

Changed 4 years ago by karsten

Attachment: Unnamed-bandwidth.png added

comment:1 Changed 4 years ago by karsten

I just attached PNG versions of the graphs linked above, in case they are gone or look different next week.

comment:2 Changed 2 years ago by karsten

Severity: Normal
Type: defect → enhancement

comment:3 Changed 18 months ago by karsten

Component: Metrics/Website → Metrics/Statistics

Moving all tickets to Metrics/Statistics that are related to the data-aggregating modules rather than the website parts of metrics-web.

comment:4 Changed 17 months ago by karsten

Keywords: metrics-2018 added

comment:5 Changed 17 months ago by karsten

Owner: set to metrics-team
Status: new → assigned