Opened 6 weeks ago

Last modified 5 weeks ago

#28799 assigned enhancement

Use readr's read_csv() to speed up drawing graphs

Reported by: karsten
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords:
Cc: metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by karsten)

Let's use R.cache to speed up drawing graphs. I already prepared a patch that I'm going to post here as soon as I have a ticket number. From the commit message:

Over two years ago, in commit 1f90b72 from October 2016, we made our user graphs faster by no longer reading the large .csv file on demand. Instead we read it once as part of the daily update, saved it to disk as an .RData file using R's save() function, and loaded it back into memory using R's load() function when drawing a graph.

This approach worked okay. It just had two disadvantages:

  1. We had to write a small amount of R code for each graph type, which is why we only did it for graphs with large .csv files.
  2. Running these small R scripts as part of the daily update made it harder to move away from Ant towards a Java-only execution model.
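
For readers unfamiliar with the earlier save()/load() pattern, here is a minimal sketch of the same idea: parse once during the daily update, store the parsed result, and load only that result when drawing. It is written in Python rather than R and is not the project's code. The function names, file names, and the use of pickle as a stand-in for .RData are all assumptions made for illustration.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def prepare_snapshot(csv_path: Path, snapshot_path: Path) -> None:
    """Daily-update step (hypothetical): parse the CSV once and store the rows."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    with snapshot_path.open("wb") as handle:
        pickle.dump(rows, handle)


def load_snapshot(snapshot_path: Path) -> list:
    """Graph-drawing step (hypothetical): load parsed rows, never the CSV."""
    with snapshot_path.open("rb") as handle:
        return pickle.load(handle)


def demo() -> list:
    # Made-up sample data; the real files are far larger.
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "example.csv"
        csv_path.write_text("date,users\n2018-12-01,10\n2018-12-02,12\n")
        snapshot_path = Path(tmp) / "example.pickle"
        prepare_snapshot(csv_path, snapshot_path)
        return load_snapshot(snapshot_path)
```

The sketch also shows the first disadvantage listed above: every data file needs its own preparation step wired into the daily update.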

The new approach implemented in this commit uses R.cache, which caches data for use by concurrent Rserve clients. The first time we read a .csv file we save it to the cache, and all subsequent times we just load it back from the cache. We're using the file name and last-modified time as the cache key to avoid using stale data. We're also clearing the cache on startup to avoid running out of disk space.
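
The keying scheme described above can be sketched as follows. This is a Python illustration of the idea only, not the R.cache API and not the code in the commit. The class name `CsvCache`, the `reads` counter, and the in-memory dictionary (standing in for R.cache's on-disk store) are assumptions.

```python
import csv
import os
import tempfile
from pathlib import Path


class CsvCache:
    """Hypothetical cache keyed on (file name, last-modified time)."""

    def __init__(self) -> None:
        # A new instance starts empty, mirroring "clear the cache on startup".
        self._entries = {}
        self.reads = 0  # how many times a file was actually parsed

    def get(self, path: Path) -> list:
        # A changed modification time yields a new key, so stale rows
        # are never returned for an updated file.
        key = (str(path), path.stat().st_mtime_ns)
        if key not in self._entries:
            with path.open(newline="") as handle:
                self._entries[key] = list(csv.DictReader(handle))
            self.reads += 1
        return self._entries[key]


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.csv"
        path.write_text("date,users\n2018-12-01,10\n")
        cache = CsvCache()
        first = cache.get(path)
        cache.get(path)  # second call is served from the cache
        reads_before = cache.reads
        # Simulate the daily update: new content and a later timestamp.
        path.write_text("date,users\n2018-12-01,11\n")
        stat = path.stat()
        os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
        updated = cache.get(path)
        return first, reads_before, updated, cache.reads
```

Entries stored under outdated keys are never evicted in this sketch. That illustrates why an approach like this needs some form of cleanup, such as the clearing on startup mentioned above.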

One somewhat unwanted side effect is that drawing the first graph from a new .csv file may take a few more seconds as compared to drawing subsequent graphs. This seems acceptable, though.

Requires installing the R.cache package from CRAN, which is available on Debian as r-cran-r.cache.

Edit: Turns out that we don't want R.cache but readr's read_csv() instead. See comments below.

Change History (9)

comment:1 Changed 6 weeks ago by karsten

Status: assigned → needs_review

comment:2 Changed 6 weeks ago by notirl

Status: needs_review → needs_information

This commit looks OK. I'm not sure about the approach though. We had talked about using the same CSVs for these graphs as we make available for download so that we don't have two different CSVs and it is easier to plot custom graphs using our code as a starting point.

For the graphs that I've been making for various requests I've been using the readr library which works nicely with the tidyr universe of packages. What would the performance impact be of reading the CSVs from a ramdisk instead of caching them in R?

comment:3 in reply to:  2 Changed 6 weeks ago by karsten

Replying to notirl:

> This commit looks OK. I'm not sure about the approach though. We had talked about using the same CSVs for these graphs as we make available for download so that we don't have two different CSVs and it is easier to plot custom graphs using our code as a starting point.
>
> For the graphs that I've been making for various requests I've been using the readr library which works nicely with the tidyr universe of packages. What would the performance impact be of reading the CSVs from a ramdisk instead of caching them in R?

That's an interesting idea. Couple thoughts:

  • Where and when would we write the per-graph CSV files that would then become the starting point for graphs and partial CSV file exports?
    • If we use R for this, the code will be rather simple, but we'd still have an R part in our daily updater which we're currently trying to make Java-only.
    • We could execute some R code to write per-graph CSV files when starting Rserve, but we'd have to re-run it whenever the daily updater has finished. Sounds like it could get messy.
    • If we move this code to Java, we might want to look into statistics libraries that do something similar to what tidyr/dplyr do. The current approach with Java Collections classes is a bit limited.
  • The ramdisk sounds like it would be just as fast as the cache I'm suggesting. But how would we make sure it always has the most recent data, including after reboots?

Happy to discuss this more!

comment:4 Changed 6 weeks ago by karsten

Oh, and last time I looked, readr was not in Debian stable. Looks like it's in backports now, which I guess would work.

comment:5 Changed 5 weeks ago by karsten

I'm just looking into using readr instead of R.cache. So far, worst-case performance is ~1 second for the "Bridge users by country and transport" graph, which is acceptable. Using readr would also mean that we no longer have to run any R code as part of the daily updater, and we could still look into more sophisticated solutions next year.

comment:6 Changed 5 weeks ago by karsten

Status: needs_information → needs_review

Alright, turns out that readr works even better than R.cache! Some stats:

                            load()  R.cache n=1  R.cache n>1  read_csv()
userstats-relay-country      1.148        4.564        1.879       1.198
userstats-bridge-country     0.842        1.746        1.711       0.990
userstats-bridge-transport   0.761        1.805        1.714       0.769
userstats-bridge-version     0.774        1.696        1.707       0.842
userstats-bridge-combined    8.937       12.346        9.448       1.222
webstats-tb                  0.355        3.132        0.371       0.691
webstats-tb-platform         2.916        0.341        0.392       0.635
webstats-tb-locale           3.035        0.484        0.456       0.779
webstats-tm                  0.185        0.248        0.219       0.435

We're currently using load() to load the .RData files back to memory that we prepared as part of the daily update. My previous suggestion was to use R.cache, with performance varying depending on whether we have read a CSV file before. The latest suggestion is to use read_csv() from the readr package.
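
The ticket does not say how these numbers were measured. As a rough illustration of separating a first read (n=1) from repeated reads (n>1), a timing harness could look like the sketch below. It is Python rather than R, and `read_rows`, `time_reads`, and the sample data are hypothetical.

```python
import csv
import tempfile
import time
from pathlib import Path


def read_rows(path: Path) -> list:
    """Loader under test (hypothetical): parse the whole CSV file."""
    with path.open(newline="") as handle:
        return list(csv.reader(handle))


def time_reads(path: Path, loader, repeats: int = 3) -> list:
    """Return the wall-clock duration in seconds of each successive call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        durations.append(time.perf_counter() - start)
    return durations


def demo() -> list:
    # Made-up sample file; real measurements would use the actual CSVs.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["date", "users"])
            writer.writerows([f"2018-12-{day:02d}", day] for day in range(1, 29))
        return time_reads(path, read_rows)
```

The first element of the returned list corresponds to an n=1 measurement and the later elements to n>1. For a loader that caches, the two can differ noticeably.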

I'd say readr is the clear winner, despite minimal performance decreases for some of the user graphs.

Please review commit 323bfbf in my task-28799-2 branch.

comment:7 Changed 5 weeks ago by irl

Status: needs_review → merge_ready

Glad to hear that this is fast enough that we don't need to cache. That is awesome!

The commit looks good to me. It is nice to see the new code looking a lot clearer and more readable.

comment:8 Changed 5 weeks ago by karsten

Description: modified
Priority: Medium → Low
Status: merge_ready → accepted
Summary: Use R.cache to speed up drawing graphs → Use readr's read_csv() to speed up drawing graphs

Thanks for looking! Merged with a small tweak, and deployed.

Setting back to accepted for the remaining graphs, which can follow once we have gathered some more experience with this new approach. That could easily happen in 2019.

comment:9 Changed 5 weeks ago by karsten

Owner: changed from karsten to metrics-team
Status: accepted → assigned