Opened 3 years ago

Closed 3 years ago

#20331 closed enhancement (fixed)

Speed up Metrics graphs by up to 4 seconds

Reported by: karsten Owned by:
Priority: Medium Milestone:
Component: Metrics/Website Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Some of the .csv files on Metrics have grown over time, with clients.csv being the largest with a current size of 25M. Still, we're reading these files every time we're drawing a graph, which may be a graph with slightly different parameters like start or end date. This idea was okay when files were small, but it's making updates to graphs slower that they have to be. On my laptop with an SSD drive it takes 4 seconds to read that file to memory. I'd guess that it takes a few seconds on the server, too, and that's waiting time for the user between hitting the "Update graph" button and receiving a new graph image back. In other words, graphs could be even more interactive if we resolved this issue.

I'm aware of this issue for a while now, but I somehow only thought of fixing this by putting .csv files into a database, which would require a bit of code. Today I came up with a simpler idea: we could simply cache .csv file contents in memory. All .csv files require 89M on disk, and even if they grow to double the current size that still seems plausible to keep in memory.

I briefly looked into the memoise package that's also available in Debian stable (r-cran-memoise), but the version in Debian doesn't support the timeout parameter, and I'm also not sure whether that mechanism would clutter memory over time by not invalidating older copies. But assuming we can get that running, here's a minimal example with a recent version of that package which does support the timeout parameter (with *** where read.csv() returns file contents from memory):

> require(memoise)
Loading required package: memoise
> read.csv <- memoise(utils::read.csv, ~timeout(10))
> 
> while (TRUE) {
+   print(paste(Sys.time(), " Reading clients.csv"))
+   read.csv("clients.csv")
+   print(paste(Sys.time(), " Sleeping for a second"))
+   Sys.sleep(1)
+ }
[1] "2016-10-10 11:18:11  Reading clients.csv"
[1] "2016-10-10 11:18:15  Sleeping for a second"
[1] "2016-10-10 11:18:16  Reading clients.csv"     ***
[1] "2016-10-10 11:18:16  Sleeping for a second"
[1] "2016-10-10 11:18:17  Reading clients.csv"     ***
[1] "2016-10-10 11:18:17  Sleeping for a second"
[1] "2016-10-10 11:18:18  Reading clients.csv"     ***
[1] "2016-10-10 11:18:18  Sleeping for a second"
[1] "2016-10-10 11:18:19  Reading clients.csv"     ***
[1] "2016-10-10 11:18:19  Sleeping for a second"
[1] "2016-10-10 11:18:20  Reading clients.csv"
[1] "2016-10-10 11:18:24  Sleeping for a second"
[1] "2016-10-10 11:18:25  Reading clients.csv"     ***
[1] "2016-10-10 11:18:25  Sleeping for a second"
[1] "2016-10-10 11:18:26  Reading clients.csv"     ***
[1] "2016-10-10 11:18:26  Sleeping for a second"
[1] "2016-10-10 11:18:27  Reading clients.csv"     ***
[1] "2016-10-10 11:18:27  Sleeping for a second"
[1] "2016-10-10 11:18:28  Reading clients.csv"     ***
[1] "2016-10-10 11:18:28  Sleeping for a second"

We could set the timeout to 1 hour or so, to ensure that processed .csv file contents are at most 1 hour old. These files are only updated twice per day anyway, so that should be fine.

A minor downside is that the first user in a given hour will experience a delay of a few seconds for reading the .csv file to memory. This could be removed by an approach where we're periodically checking whether file contents have changed.

Something else I didn't evaluate are race conditions. What if we're reading a .csv file to memory and a second request comes in requiring us to read that .csv file?

Final note: let's not overengineer this. If we switch to another graphing engine we'll lose this effort. Still, it seems like low-hanging fruit to eliminate this delay of a few seconds.

Child Tickets

Change History (1)

comment:1 Changed 3 years ago by karsten

Resolution: fixed
Status: newclosed

Alright, I made some good progress here by switching from read.csv() to load():

https://gitweb.torproject.org/metrics-web.git/commit/?id=1f90b72368bfd2a15ee60efb7fbbcc62b3c03cdc

Performance gain as compared to pre-a91f2dc:

  • userstats-relay-country: 3.0 seconds
  • userstats-bridge-country: 3.5 seconds
  • userstats-bridge-transport: 3.5 seconds
  • userstats-bridge-version: 3.5 seconds
  • userstats-bridge-combined: 2.7 seconds

I'll call this done, in particular with regard to not overengineering this. Closing.

Note: See TracTickets for help on using tickets.