Opened 3 weeks ago
Last modified 13 days ago
#29346 new enhancement
Document why our CSV files are in tidy/long format and how to process them
Reported by: | karsten | Owned by: | metrics-team |
---|---|---|---|
Priority: | Medium | Milestone: | |
Component: | Metrics/Website | Version: | |
Severity: | Normal | Keywords: | |
Cc: | metrics-team, gaba | Actual Points: | |
Parent ID: | | Points: | |
Reviewer: | | Sponsor: | |
Description
This ticket is based on a discussion in Brussels.
The issue we talked about is that it can sometimes be difficult to import our per-graph CSV files into applications like LibreOffice Calc or services like CKAN and make charts out of them.
The reason is that we chose to use tidy/"long" data formats for our CSV files. For example, the following lines are contained in the relayflags.csv file:
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
[...]
However, charting applications expect the data in the messy/"wide" format:
date,Exit,Fast,Guard,Running,Stable
2007-10-27,602,1126,244,1254,586
2007-10-28,592,1115,293,1244,578
[...]
In Brussels we briefly discussed changing our formats accordingly, to please LibreOffice Calc et al. However, after giving this some more thought, I'm opposed to this idea.
There are reasons why we picked the tidy format in the first place: it's more flexible, because we can add or remove values (such as a new flag) without having to add or remove columns. It's also somewhat easier to handle with statistics tools/languages like R and the very powerful tidyverse libraries. See also Hadley Wickham's Tidy Data paper, which is a really good read on this topic: https://www.jstatsoft.org/article/view/v059i10
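To illustrate the flexibility argument, here is a minimal sketch in Python (standard library only) that averages the relay counts per flag from the sample rows quoted above. The helper name `mean_relays_per_flag` is made up for this example. The point is that the code never names a specific flag, so it should keep working if flags are added or removed.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from relayflags.csv as quoted in this ticket.
LONG_CSV = """\
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def mean_relays_per_flag(long_text):
    """Average the 'relays' value per 'flag' (hypothetical helper)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(long_text)):
        # Flags are discovered from the data; none is hard-coded here.
        totals[row["flag"]] += int(row["relays"])
        counts[row["flag"]] += 1
    return {flag: totals[flag] / counts[flag] for flag in totals}


means = mean_relays_per_flag(LONG_CSV)
for flag, mean in means.items():
    print(flag, mean)
```

With a wide file, the same calculation would need to know the column names up front, or would have to be adjusted whenever a column is added.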
What can we do? I don't want to make the data harder to process for anyone, and sometimes LibreOffice Calc or CKAN can be great tools to get a first impression of a data set. We also cannot expect everyone to use R or SPSS or MATLAB. But maybe we can solve this with better documentation rather than changing the way we're doing things.
The magic word here seems to be: pivot table. This random blog post that I just found seems to be a good start for people wanting to wrangle our tidy data into whatever they need for making charts: https://blog.datawrapper.de/pivottables/
And this random CKAN plugin that I did not try out could be a way to teach CKAN how to use our tidy data formats for its preview visualizations: https://github.com/routetopa/ckanext-pivottable
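For people who would rather script the conversion than use a spreadsheet, the pivot step could look roughly like the following sketch. It uses only Python's standard `csv` module and the sample rows from relayflags.csv quoted above. The helper name `pivot_long_to_wide` and its parameters are invented for this example, and it assumes each date/flag combination occurs at most once.

```python
import csv
import io

# Sample rows copied from relayflags.csv as quoted in this ticket.
LONG_CSV = """\
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def pivot_long_to_wide(long_text, index="date", column="flag", value="relays"):
    """Turn long-format CSV text into wide-format CSV text (hypothetical helper)."""
    table = {}    # index value -> {column value -> cell value}
    columns = []  # column values in order of first appearance
    for row in csv.DictReader(io.StringIO(long_text)):
        table.setdefault(row[index], {})[row[column]] = row[value]
        if row[column] not in columns:
            columns.append(row[column])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index] + columns)
    for key in sorted(table):
        # Leave a cell empty when a combination is missing in the input.
        writer.writerow([key] + [table[key].get(c, "") for c in columns])
    return out.getvalue()


wide = pivot_long_to_wide(LONG_CSV)
print(wide)
```

For the sample rows this yields the wide layout shown earlier in the description, with one column per flag and one row per date.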
So, how about we document the reasons for choosing tidy data formats on the Statistics page and link to a few tutorials for processing our data with common charting tools? Ideally, we would add links rather than write a lot of text on our own, though.
Does this sound plausible?
Change History (3)
comment:1 Changed 3 weeks ago by
Cc: gaba added
comment:2 Changed 2 weeks ago by
comment:3 Changed 13 days ago by
There are other tools (like OpenRefine) that can reshape data the way a pivot table does.
With LibreOffice:
https://schoolofdata.org/handbook/courses/gentle-introduction-exploring-and-understanding-data/
With OpenRefine:
https://openup.org.za/articles/openrefine-unpivot-tutorial.html
This sounds plausible; however, I think we would become the de facto maintainer of the pivottable plugin, which was developed as part of a fixed-term EU funding arrangement that seems to have ended in 2017. At least the JavaScript library it wraps seems to be currently maintained. There is probably also some pivot table function in LibreOffice that I've not seen, so documenting the format is certainly helpful.