Opened 9 months ago

Last modified 8 months ago

#29346 new enhancement

Document why our CSV files are in tidy/long format and how to process them

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords:
Cc: metrics-team, gaba
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

This ticket is based on a discussion in Brussels.

The issue we talked about is that it can sometimes be difficult to import our per-graph CSV files into applications like LibreOffice Calc or services like CKAN and make charts out of them.

The reason is that we chose to use tidy/"long" data formats for our CSV files. For example, the following lines are contained in the relayflags.csv file:

date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
[...]

However, charting applications expect the data in the messy/"wide" format:

date,Exit,Fast,Guard,Running,Stable
2007-10-27,602,1126,244,1254,586
2007-10-28,592,1115,293,1244,578
[...]

In Brussels we briefly discussed changing our formats accordingly, to please LibreOffice Calc et al. However, after giving this some more thought, I'm opposed to this idea.

There are reasons why we picked the tidy format in the first place: it's more flexible, because a new or retired category (here, a relay flag) only adds or removes rows, so we never have to worry about adding or removing columns. It's also somewhat easier to handle with statistics tools/languages like R and the very powerful tidyverse libraries. See also Hadley Wickham's Tidy Data paper, which is a really good read on this topic: https://www.jstatsoft.org/article/view/v059i10
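To illustrate the flexibility point, here is a minimal Python sketch (standard library only) that computes the change in relay count per flag from the sample rows quoted above. The function name change_per_flag is made up for this example and is not part of any existing tooling. The point is that the code never names a flag, so it would keep working unchanged if a flag were added or removed.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the relayflags.csv excerpt in this ticket.
LONG_CSV = """date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def change_per_flag(long_csv_text):
    """Hypothetical helper: return {flag: last value - first value}."""
    series = defaultdict(list)
    for row in csv.DictReader(io.StringIO(long_csv_text)):
        series[row["flag"]].append((row["date"], int(row["relays"])))
    changes = {}
    for flag, points in series.items():
        points.sort()  # ISO 8601 dates sort chronologically as strings
        changes[flag] = points[-1][1] - points[0][1]
    return changes


if __name__ == "__main__":
    for flag, delta in sorted(change_per_flag(LONG_CSV).items()):
        print(flag, delta)
```

With the wide format, the same computation would need to know the column names up front, or treat every column except "date" as a flag.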

What can we do? I don't want to make the data harder to process for anyone, and sometimes LibreOffice Calc or CKAN can be great tools to get a first impression of a data set. We also cannot expect everyone to use R or SPSS or MATLAB. But maybe we can solve this with better documentation rather than by changing the way we're doing things.

The magic word here seems to be: pivot table. This random blog post that I just found seems to be a good start for people wanting to wrangle our tidy data into whatever they need for making charts: https://blog.datawrapper.de/pivottables/
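For people who would rather script the conversion than click through a pivot table dialog, the same operation can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not something we ship: the function name pivot_long_to_wide and its parameters are invented for this example, and the input is the relayflags.csv excerpt from above.

```python
import csv
import io

# Sample rows copied from the relayflags.csv excerpt in this ticket.
LONG_CSV = """date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def pivot_long_to_wide(long_csv_text, index="date", column="flag", value="relays"):
    """Hypothetical helper: turn tidy/long CSV text into wide CSV text."""
    rows = list(csv.DictReader(io.StringIO(long_csv_text)))
    # Every distinct value in the `column` field becomes one output column.
    columns = sorted({row[column] for row in rows})
    table = {}
    for row in rows:
        table.setdefault(row[index], {})[row[column]] = row[value]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index] + columns)
    for key in sorted(table):
        # Missing combinations are left empty rather than guessed.
        writer.writerow([key] + [table[key].get(c, "") for c in columns])
    return out.getvalue()


if __name__ == "__main__":
    print(pivot_long_to_wide(LONG_CSV), end="")
```

For the sample rows this reproduces the wide layout shown earlier in this description. Note that the columns come out in alphabetical order, which happens to match that example.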

And this random CKAN plugin that I did not try out could be a way to teach CKAN how to use our tidy data formats for its preview visualizations: https://github.com/routetopa/ckanext-pivottable

So, how about we document the reasons for choosing tidy data formats on the Statistics page and link to a few tutorials for processing our data with common charting tools? Ideally, we would add links rather than write a lot of text of our own, though.

Does this sound plausible?

Change History (3)

comment:1 Changed 8 months ago by gaba

Cc: gaba added

comment:2 Changed 8 months ago by irl

This sounds plausible. However, I think we would become the de facto maintainer of the pivottable plugin, which was developed as part of a fixed-term EU funding arrangement that seems to have ended in 2017. At least the JavaScript library it wraps seems to be currently maintained. There is probably also some pivot table function in LibreOffice that I've not seen, so documenting the format is certainly helpful.