In version 6.5, which was published on January 24, we changed the update URL from https://www.torproject.org/dist/torbrowser/update_2/ to https://aus1.torproject.org/dist/torbrowser/update_2/ (ticket #19481 (moved)). The www.torproject.org/dist/ URL was redirected to dist.torproject.org, which then returned the result, whereas the aus1.torproject.org URL returns the result directly without a redirect. A possible explanation for the drop starting 2017-01-24 is that update pings were counted twice before the URL change because of the redirect.
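One way to check this theory would be to look at how the www.torproject.org side answered those requests around the switch. Here is a sketch of such a query, reusing the schema of the query further down in this ticket; the date range and the set of response codes are just guesses, and it assumes www.torproject.org logs are in the database at all:

-- Sketch: compare per-site response codes around the 2017-01-24 switch.
-- Assumes www.torproject.org logs exist in the database; the date range
-- and the 301/302 codes are illustrative guesses.
SELECT log_date, site, response_code, SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/update\__/%'
   AND resource_string NOT LIKE '%.xml'
   AND method = 'GET'
   AND response_code IN (200, 301, 302)
   AND log_date BETWEEN '2017-01-20' AND '2017-01-28'
 GROUP BY log_date, site, response_code
 ORDER BY log_date, site, response_code;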
In version 6.5.2 we changed the update_2 part of the URL to update_3, with ticket #19316 (moved). Initially, metrics didn't count the update_3 requests as update pings, which caused a drop in the update pings graph, but this has since been fixed. What we can see now is an increase in update pings around the 4th and 5th of April, but it does not seem related to the URL change, as there was no release around that time. I don't know the reason for this increase in update pings.
Good question. It looks like we're missing some other resource string used for update pings, but I don't know which one. Here are the requests we're including, by month, site, and resource_part up to `update_[23]/`:
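Roughly, the per-month aggregation looks like the daily query shown further down, grouped by month instead of by day (a sketch only; the per-month listing itself is what the database produces):

-- Sketch of a per-month aggregation; same filters as the daily query below.
SELECT date_trunc('month', log_date)::date AS month, site,
       substr(resource_string, 1,
              strpos(resource_string, 'update_') + 8) AS resource_part,
       SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/update\__/%'
   AND resource_string NOT LIKE '%.xml'
   AND response_code = 200
   AND method = 'GET'
 GROUP BY month, site, resource_part
 ORDER BY month, count DESC;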
I don't see what is missing, or if something is missing.
Would it be possible to run the same request, for the days January 24, 25, 26 (when the update pings dropped), and April 4, 5, 6 (when they increased), to try to understand what changed? Maybe seeing which type of URL dropped or increased on those days can tell us if we are missing something.
> I don't see what is missing, or if something is missing.
>
> Would it be possible to run the same request, for the days January 24, 25, 26 (when the update pings dropped), and April 4, 5, 6 (when they increased), to try to understand what changed? Maybe seeing which type of URL dropped or increased on those days can tell us if we are missing something.
Sure, here's the output:
webstats=> SELECT log_date, site,
webstats->        substr(resource_string, 1,
webstats(>               strpos(resource_string, 'update_') + 8) AS resource_part,
webstats->        SUM(count) AS count
webstats->   FROM files NATURAL JOIN requests NATURAL JOIN resources
webstats->   WHERE resource_string LIKE '%/torbrowser/update\__/%'
webstats->     AND resource_string NOT LIKE '%.xml'
webstats->     AND response_code = 200
webstats->     AND method = 'GET'
webstats->     AND (log_date = '2017-01-24'
webstats(>       OR log_date = '2017-01-25'
webstats(>       OR log_date = '2017-01-26'
webstats(>       OR log_date = '2017-04-04'
webstats(>       OR log_date = '2017-04-05'
webstats(>       OR log_date = '2017-04-06')
webstats->   GROUP BY log_date, site, resource_part
webstats->   ORDER BY log_date, count DESC;

  log_date  |          site          |               resource_part                |  count
------------+------------------------+--------------------------------------------+---------
 2017-01-24 | dist.torproject.org    | /torbrowser/update_2/                      | 2025386
 2017-01-24 | aus1.torproject.org    | /torbrowser/update_2/                      |   33549
 2017-01-24 | archive.torproject.org | /tor-package-archive/torbrowser/update_2/ |       1
 2017-01-25 | dist.torproject.org    | /torbrowser/update_2/                      |  692113
 2017-01-25 | aus1.torproject.org    | /torbrowser/update_2/                      |  151832
 2017-01-26 | aus1.torproject.org    | /torbrowser/update_2/                      |  381621
 2017-01-26 | dist.torproject.org    | /torbrowser/update_2/                      |  362971
 2017-01-26 | archive.torproject.org | /tor-package-archive/torbrowser/update_2/ |       2
 2017-04-04 | aus1.torproject.org    | /torbrowser/update_2/                      |  655434
 2017-04-04 | dist.torproject.org    | /torbrowser/update_2/                      |   50278
 2017-04-04 | archive.torproject.org | /tor-package-archive/torbrowser/update_2/ |       8
 2017-04-05 | aus1.torproject.org    | /torbrowser/update_2/                      | 1488508
 2017-04-05 | dist.torproject.org    | /torbrowser/update_2/                      |   51111
 2017-04-05 | archive.torproject.org | /tor-package-archive/torbrowser/update_2/ |      23
 2017-04-06 | aus1.torproject.org    | /torbrowser/update_2/                      | 1847522
 2017-04-06 | dist.torproject.org    | /torbrowser/update_2/                      |   50576
 2017-04-06 | archive.torproject.org | /tor-package-archive/torbrowser/update_2/ |      11
(17 rows)
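Summing the per-site rows gives rough daily totals (my own arithmetic from the table above):

2017-01-24: 2,025,386 + 33,549 + 1 = 2,058,936
2017-01-25: 692,113 + 151,832 = 843,945
2017-01-26: 381,621 + 362,971 + 2 = 744,594
2017-04-04: 655,434 + 50,278 + 8 = 705,720
2017-04-05: 1,488,508 + 51,111 + 23 = 1,539,642
2017-04-06: 1,847,522 + 50,576 + 11 = 1,898,109

So the January totals drop from roughly 2.06M to 0.74M over two days while traffic shifts from dist.torproject.org to aus1.torproject.org, and the April totals rise from roughly 0.71M to 1.90M almost entirely on aus1.torproject.org.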
Would you want to play with the database yourself? It's ~3G uncompressed, so it shouldn't be that hard to dump and compress it. You'd have to create a local PostgreSQL database and import that file, and then you could run requests like this yourself. (I'd still be around to help with the schema as needed!)
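For reference, importing such a dump into a local PostgreSQL instance could look roughly like this from within psql (a minimal sketch; "webstats" and "webstats-dump.sql" are placeholder names, and a custom-format dump would need pg_restore instead):

-- Minimal sketch, run inside psql by a user allowed to create databases.
-- "webstats" and "webstats-dump.sql" are placeholders, not the actual names.
CREATE DATABASE webstats;
\c webstats
\i webstats-dump.sql
-- After the import, queries like the one above can be run locally.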
Trac: Summary: tor browser update URL change and the update ping metrics to Investigate drop in Tor Browser update pings in early 2017, possibly caused by update URL change
Interestingly, it seems this is happening again with an X.5 release. It looks like we need a better theory, assuming both incidents can be explained by the same underlying cause.
Trac: Summary: Investigate drop in Tor Browser update pings in early 2017, possibly caused by update URL change to Investigate drop in Tor Browser update pings in early 2017 and 2018
The drop from 2018-01-24 seems to be related to the release of Tor Browser 7.5. However, I can't find any change between 7.0.11 and 7.5 that could explain it. The app.update.* prefs seem to be the same in both versions.
> Would you want to play with the database yourself? It's ~3G uncompressed, so it shouldn't be that hard to dump and compress it. You'd have to create a local PostgreSQL database and import that file, and then you could run requests like this yourself. (I'd still be around to help with the schema as needed!)
Yes, if you can send me a dump of this database, I will look more closely at the numbers from the drop around 2018-01-24 to try to understand it.
This one seems related to the new release. However, it is surprising to see the number increase over 2 days, stay stable for around 12 days, and then decrease back to the previous level over 4 days. It is also the first time we see a big increase in signature downloads: it seems signatures were downloaded around 1.2M times in those 12 days.
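A per-day count of signature downloads could be pulled from the same database along these lines (a sketch; the .asc suffix is my assumption about how signature files show up in resource_string, and the date window is only illustrative):

-- Sketch: per-day, per-site counts of signature downloads.
-- Assumes signature files end in .asc; the date window is illustrative.
SELECT log_date, site, SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/%.asc'
   AND response_code = 200
   AND method = 'GET'
   AND log_date BETWEEN '2018-01-20' AND '2018-02-15'
 GROUP BY log_date, site
 ORDER BY log_date, count DESC;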
We had another annual drop recently. Looking at the graphs, it seems release-related (but not major-release-related).
Trac: Summary: Investigate drop in Tor Browser update pings in early 2017 and 2018 to Investigate drop in Tor Browser update pings in early 2017, 2018, and 2019
We can see that the update pings halved for both alpha and stable at the end of January 2019 (with the releases of versions 8.0.5 and 8.5a7). And they doubled again in May, but only for stable.
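To separate the two channels directly in the database, something like the following could work, assuming the channel is visible in the path (e.g. .../update_3/release/ vs .../update_3/alpha/, which is my assumption) and using an illustrative date window:

-- Sketch: per-day update pings split by channel. The release/alpha path
-- components and the date window are assumptions, not verified here.
SELECT log_date,
       CASE WHEN resource_string LIKE '%/update\_3/release/%' THEN 'stable'
            WHEN resource_string LIKE '%/update\_3/alpha/%' THEN 'alpha'
            ELSE 'other' END AS channel,
       SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/update\__/%'
   AND resource_string NOT LIKE '%.xml'
   AND response_code = 200
   AND method = 'GET'
   AND log_date BETWEEN '2019-01-15' AND '2019-06-15'
 GROUP BY log_date, channel
 ORDER BY log_date, channel;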
Moving to Metrics/Analysis, because this ticket is more about understanding the data than improving the data-processing module that provides the data. (Sorry for not adding anything new, just doing spring cleaning.)
Trac: Component: Metrics/Statistics to Metrics/Analysis
Trac: Summary: Investigate drop in Tor Browser update pings in early 2017, 2018, and 2019 to Investigate drop in Tor Browser update pings in early 2017, 2018, 2019 and 2020
I took another look at the database to see if what we're seeing here is an artifact of processing web server request logs.
First, I looked at requested sites to see if this is in any way related to the change from www.tp.o/dist.tp.o to aus1.tp.o that was mentioned 3 years ago, or to similar changes after that. Here's a graph:
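The per-site numbers behind such a graph can be pulled with roughly this query (a sketch reusing the filters from the query earlier in the ticket):

-- Sketch: per-day update ping counts broken down by requested site.
SELECT log_date, site, SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/update\__/%'
   AND resource_string NOT LIKE '%.xml'
   AND response_code = 200
   AND method = 'GET'
 GROUP BY log_date, site
 ORDER BY log_date, site;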
No, this is not the reason. There was just one such switch, and even though it coincides with the first drop, there was no other switch after that.
Second, I looked at requested servers to see if this is maybe related to web servers being added to or removed from the rotation. For example, it could be that some servers do not properly sync their request logs to the metrics server. Here's a graph:
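The per-server breakdown would be roughly the same aggregation keyed on the serving host, assuming the files table records it in a column (called server here for illustration; the actual column name may differ):

-- Sketch: per-day update ping counts broken down by web server.
-- The "server" column name is an assumption about the schema.
SELECT log_date, server, SUM(count) AS count
  FROM files NATURAL JOIN requests NATURAL JOIN resources
 WHERE resource_string LIKE '%/torbrowser/update\__/%'
   AND resource_string NOT LIKE '%.xml'
   AND response_code = 200
   AND method = 'GET'
 GROUP BY log_date, server
 ORDER BY log_date, server;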
There have been quite a few web servers over the years, so this graph is a bit harder to read. That's why I included both absolute and relative numbers. The first drop is a bit confusing, as it is also related to the site change, but the other three drops do not show any correlation with server changes at the start or end of the drop. The second drop had server changes in the middle of the drop, but apparently those did not affect the numbers much. The third and fourth drops did not have any server changes.
All in all, I don't see a bug in the data processing part, at least not an obvious one. It seems to me that these requests are real.
I'm changing the ticket type to task, because a defect in a Metrics/* component implies that there's a bug in our code, and right now I don't see that that's the case.