The main reason for removing them from the public interface is to reduce future maintenance effort. These files will still exist and we will still use them internally; we just wouldn't make them available for download anymore. The reduced maintenance effort mainly comes from not having to write similar specifications when adding new pre-aggregated CSV files.
Note that the data will still be available via the "Download data as CSV." links on all graph pages. See #25382 (moved) for that.
What reasons might exist for keeping these CSV files available? Are people using them who cannot easily switch to the new, ad-hoc generated CSV files?
If we decide to take these files down, we'll have to give users a reasonable heads-up. 1-2 months seems reasonable. We should announce that decision at least on stats.html and possibly in other places, too.
irl, I vaguely remember that you were planning to use the stats/*.csv files for something related to metrics-bot. What was that, and would that also work with the new graph-based CSV files?
Trac: Cc: metrics-team, dcf to metrics-team, dcf, irl
I might be almost unique in this, but I do actively use the existing full user metric .csv files for the infolabe-anomalies mailing list (http://lists.infolabe.net/lists/listinfo/infolabe-anomalies). This is carrying out anomaly detection on a nightly basis looking at user numbers.
I'm not entirely sure to what extent this data would still be available under the proposed change. I'm fine to update my code a bit to change URLs, etc, but if the data disappeared entirely it would harm some of my research quite considerably.
In a more general sense, I'd strongly advocate for an 'open data' principle. You never know what research the next person coming along might do, and what it might contribute. At the moment it's a simple curl/wget to get that data -- if I had to wrangle a web interface as a precondition to carrying out analysis on Tor metrics then it would (possibly) be extremely awkward.
Again, though, maybe I'm misunderstanding the proposed changes.
#25240 is the relevant ticket, with the idea being to add some tweets to metrics-bot about the number of users. It sounds to me that the new graph based ones are better for this use case than the raw files under stats/.
A few questions though:
Would we update stats.html to describe the format of the new CSV files along with the API for retrieving them programmatically?
Do we describe how we get from CollecTor to the new graph based CSV files (which would include the steps necessary for getting to the stats/*.csv files)?
I'm not opposed to removing them from the website for the reason that I feel they are intermediate artefacts of the analysis/visualisation process, as long as we are not creating them with secret data that would not be accessible to others (afaik it's all in CollecTor) and as long as we describe our process.
For most use cases I think the graph based CSV files are going to be easier for new researchers to get started with anyway.
Your use case is special, yet we should make sure it's supported.
FWIW, you wouldn't have to fight with web interfaces to get the data, as the new CSV files are also available via curl/wget, just with optional parameters.
But it might be that none of the new CSV files has all the data you want. We should fix that and ensure that your use case continues to work even after taking down the old CSV files.
#25240 is the relevant ticket, with the idea being to add some tweets to metrics-bot about the number of users. It sounds to me that the new graph based ones are better for this use case than the raw files under stats/.
A few questions though:
Would we update stats.html to describe the format of the new CSV files along with the API for retrieving them programmatically?
Ideally, the new CSV files won't need as much description as the old ones. Column names are chosen in reference to graphs, units are the same as in the graph, parameters are the same as for the graph, and so on. The current plan is to include a short comment header in the CSV files, see #25387 (moved). Ideally, we wouldn't have to write and maintain another HTML page for that. It's yet one more thing we need to write whenever we want to provide a new graph, and we'd want to make it easier to add new graphs, not harder.
Do we describe how we get from CollecTor to the new graph based CSV files (which would include the steps necessary for getting to the stats/*.csv files)?
Yes, this is still relevant. And this deserves an HTML page somewhere. Could be stats.html or a new page.
Today I wanted to make a graph that uses userstats-combined.csv, like in comment:5:ticket:20495. The graph shows, for one transport, the top countries that use that transport. (It's the opposite of https://metrics.torproject.org/userstats-bridge-combined.html, which shows, for one country, the top transports used by that country.) It would be less convenient to make my graph without userstats-combined.csv.
for one transport, the top countries that use that transport
Currently the arguments that are passed to the graphing backend are only date ranges and countries, so plotting for any one thing that isn't a country won't be supported. In the new approach, would the CSV for all countries still be usable for this? If not, we should consider what other similar use cases we might be breaking.
Thanks for the input! I'm still not sure what's the best thing to do here. But we're not going to decide anything before the Tor meeting in Rome. And when we change something, that will be after discussing it more here first.
1. We started out with providing the CSV files linked from stats.html. These are quite detailed, which is great for researchers, but also somewhat difficult to handle for less technical users.
2. We now provide 1 CSV file per graph that is generated ad-hoc to contain just the data that is displayed in the graph. These files are likely more usable for non-technical users, but as we see here they lack details for researchers.
3. How about we find a middle ground? We could provide 1 CSV file per graph that contains all data that could possibly be displayed in that graph. Basically, it would be the same link as in 2., but there wouldn't be any parameters. That's still less data than in 1. but more data than in 2. We would likely have to add a legend to the CSV file header to explain the columns, but we're planning to do that anyway (#25446 (moved)).
Thoughts on this?
Or would it be a lot easier to discuss this idea with example data at hand? If so, I can produce some data.
Are we strongly against the idea of providing two CSV files? I'd like to see the current CSV that only contains the data used to produce the plot, and then additionally the full CSV pre-filtering that would contain all the data.
This would work for the use case where you want to do your own processing on the data and would also work for the use case where someone wanted to produce plots using the same data that we have already filtered and processed.
For the full CSV file, a header would probably be useful. It may also be useful to have an HTML page that contains a list of all the available CSV files but the specifications for those files could be documented in the headers of the CSVs. We wouldn't need to list the individual pre-filtered CSV files on that page.
Are we strongly against the idea of providing two CSV files?
I have been thinking a lot about this yesterday, and I think the answer is: yes.
Providing two types of CSV files pretty much doubles our effort for adding new aggregations or graphs as well as changing or removing parts. I'd prefer the process for adding or improving graphs to get easier, not harder.
Let's try to provide just one type of CSV files, assuming that we don't break existing, valid use cases.
But let's find a way to stop providing our pre-aggregated statistics files. They are not the best interface that we can provide. And they are an interface that can become quite painful to maintain in the future.
I'd like to see the current CSV that only contains the data used to produce the plot, and then additionally the full CSV pre-filtering that would contain all the data.
This would work for the use case where you want to do your own processing on the data and would also work for the use case where someone wanted to produce plots using the same data that we have already filtered and processed.
For the full CSV file, a header would probably be useful. It may also be useful to have an HTML page that contains a list of all the available CSV files but the specifications for those files could be documented in the headers of the CSVs. We wouldn't need to list the individual pre-filtered CSV files on that page.
Understood, I think.
Here's another suggestion:
We provide 1 CSV file per graph that is parameterized by default and that can also be requested without any parameters. The link on the graph page would contain the same parameters as the graph, so that the CSV file content would be pretty close to what's shown in the graph. Except that the file might contain a few more columns. But the header would explain those columns. And the header would also say that it's possible to drop parameters to get more data for different parameter combinations of this graph.
Let's make this more concrete by adding sample data:
The CSV link on the current Relay users graph page would read (line break added for visibility):
    #
    # The Tor Project
    #
    # URL: https://metrics.torproject.org/userstats-relay-country.csv?start=2018-02-07&end=2018-05-08&country=all&events=off
    #
    # Insert some specification...
    #
    date,country,users,downturns,upturns,lower,upper
    2018-02-07,,4071868,,,,
    2018-02-08,,3815277,,,,
    2018-02-09,,4000274,,,,
    [...]
    2018-05-03,,2296101,,,,
    2018-05-04,,2341577,,,,
    2018-05-05,,2229328,,,,
Now, if someone's interested in data for all dates, a break-down by all possible countries, and possible censorship events, they'd simply take out all parameters and fetch the following file (link does not work yet):
    #
    # The Tor Project
    #
    # URL: https://metrics.torproject.org/userstats-relay-country.csv
    #
    # Insert some specification...
    #
    date,country,users,downturns,upturns,lower,upper
    2011-03-06,a1,1443,,,,
    2011-03-06,a2,424,,,,
    2011-03-06,ae,8395,,,,
    [...]
    2018-05-06,zw,245,FALSE,FALSE,122,389
    2018-05-06,,2220344,,,,
    2018-05-06,??,25797,,,,
For comparison, the current CSV file, that we wouldn't provide anymore, starts and ends with the following lines:
Note that the bridge user data would still be available on the various bridge users graphs.
And we could discuss whether it makes sense to include the frac column in the relay users CSV file or not. If we include it, it would be there in the parameterized CSV file as well as the non-parameterized CSV file. I guess this is a trade-off between usability ("less is more") and usefulness ("more details can help").
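To illustrate the suggestion above, here is a minimal Python sketch of how a client might build both the parameterized and the parameter-free link and read the result. The helper names build_csv_url and parse_metrics_csv are made up for this example, the sample text is a short excerpt of the rows quoted above, and the parameter-free link was not live at the time of this comment, so nothing is fetched here.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "https://metrics.torproject.org/userstats-relay-country.csv"


def build_csv_url(base_url, params=None):
    """Return the URL with query parameters, or the bare URL if none are given."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


def parse_metrics_csv(text):
    """Drop '#' comment header lines and return the data rows as dicts."""
    data_lines = [line for line in text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(data_lines))))


# Excerpt of the sample file from this comment (most rows left out).
SAMPLE = """\
#
# The Tor Project
#
# URL: https://metrics.torproject.org/userstats-relay-country.csv
#
# Insert some specification...
#
date,country,users,downturns,upturns,lower,upper
2011-03-06,a1,1443,,,,
2018-05-06,zw,245,FALSE,FALSE,122,389
2018-05-06,,2220344,,,,
"""

# Link as it would appear on the graph page, with the graph's parameters.
graph_url = build_csv_url(
    BASE_URL,
    {"start": "2018-02-07", "end": "2018-05-08", "country": "all", "events": "off"},
)
# Same file without parameters, for all dates and countries.
full_url = build_csv_url(BASE_URL)

rows = parse_metrics_csv(SAMPLE)
for row in rows:
    print(row["date"], row["country"] or "(all)", row["users"])
```

The same parsing step would apply to both variants, since under this proposal they share the header and the column set and differ only in which rows are included.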
We provide 1 CSV file per graph that is parameterized by default and that can also be requested without any parameters. The link on the graph page would contain the same parameters as the graph, so that the CSV file content would be pretty close to what's shown in the graph. Except that the file might contain a few more columns. But the header would explain those columns. And the header would also say that it's possible to drop parameters to get more data for different parameter combinations of this graph.
[...]
Thoughts?
I believe you have just described a method of implementing the two CSV files for the two use cases I wanted to make sure we supported. (:
Alright, I now have some code to review. This code implements my latest suggestion above. The R code is tested, the one-line Java code change is not. Please take a look at my task-25383 branch with two commits, 0975ca1 and d3b280a.
If this branch looks reasonable, I'll test-deploy it for a few days to let others try out the new behavior. It's fully backward-compatible (it only adds columns and rows), so we're not breaking anything by deploying it and going back to master shortly after.
Next steps after that, in no particular order:
Decide where to add the legend (Java or R).
Discuss whether we want to use wide/long format for these CSVs. Yes, we should have had this discussion a few weeks back, but it's better to have it next week than never.
Decide how we announce and make changes in the future, in particular backward-incompatible ones. For example, Onionoo has a "next_major_version_scheduled" field to announce backward-incompatible changes, and we need something like that, too.
Add a note to stats.html saying when it's going to go away.
Add a note to CSV file header saying it's still BETA until the same date as mentioned on stats.html, maybe with 2 or 4 weeks overlap.
Alright, I now have some code to review. This code implements my latest suggestion above. The R code is tested, the one-line Java code change is not. Please take a look at my task-25383 branch with two commits, 0975ca1 and d3b280a.
The second commit looks ok. Regarding the first commit: Interesting, I wasn't aware of the !! operator in R. But, why can't we change the (Java) calls of the R functions and have different names for parameters and columns? This would be more readable.
...
Decide how we announce and make changes in the future, in particular backward-incompatible ones. For example, Onionoo has a "next_major_version_scheduled" field to announce backward-incompatible changes, and we need something like that, too.
Good question! Should this be a part of the comment in the csv file?
Alright, I now have some code to review. This code implements my latest suggestion above. The R code is tested, the one-line Java code change is not. Please take a look at my task-25383 branch with two commits, 0975ca1 and d3b280a.
The second commit looks ok. Regarding the first commit: Interesting, I wasn't aware of the !! operator in R. But, why can't we change the (Java) calls of the R functions and have different names for parameters and columns? This would be more readable.
I thought about that, too, but figured it would be a bigger change now. But I'm open to the idea. What do you suggest? Add _ to parameter names? Or _param? Like:
Decide how we announce and make changes in the future, in particular backward-incompatible ones. For example, Onionoo has a "next_major_version_scheduled" field to announce backward-incompatible changes, and we need something like that, too.
Good question! Should this be a part of the comment in the csv file?
Bumping this ticket for the open review part. Or should we merge the changes up to d3b280a and leave the refactoring patch 2dbfc52 as improvement for June?
I made more progress on this ticket. Going through the remaining steps from comment 16 above:
Next steps after that, in no particular order:
Decide where to add the legend (Java or R).
Maybe the CSV file header is not the right place for this legend after all. The specification of parameters and columns can be quite long, and if we also plan to include scheduled and past changes, the header section will be even longer. Oh, and whatever we write here won't change in the CSV file that somebody downloaded, until they decide to download a new CSV file from us.
I tried out something else: extend our existing stats.html to also cover the per-graph CSV files. The CSV file header could then include a link to that page or possibly even a subsection on that page.
I'll post a branch shortly.
Discuss whether we want to use wide/long format for these CSVs. Yes, we should have had this discussion a few weeks back, but it's better to have it next week than never.
I made remarks in the extended stats.html page to change the format. This could be the first scheduled change that would become effective a couple weeks later.
Decide how we announce and make changes in the future, in particular backward-incompatible ones. For example, Onionoo has a "next_major_version_scheduled" field to announce backward-incompatible changes, and we need something like that, too.
We could include remarks like the ones I made on stats.html, and we might even add a change log to the top of that page to summarize past and upcoming changes.
Add a note to stats.html saying when it's going to go away.
In the page.
Add a note to CSV file header saying it's still BETA until the same date as mentioned on stats.html, maybe with 2 or 4 weeks overlap.
I did not touch CSV file headers yet. Once we have a fixed deprecation date, let's include it there.
Does it make sense to specify our per-graph CSV files there, rather than in the CSV file header?
Is the format with two subsections Parameters and Columns okay? Is something missing?
Are specifications roughly correct/plausible?
Do the suggestions make sense? The rule of thumb for deciding which columns we need was: "it should require a code change to change columns, and neither the user should be able to control which columns exist by their choice of parameters, nor should the available data have any influence on that."
Regarding timing, how about we deploy this page still in July, make suggested changes by August 15, take out pre-aggregated stats files by September 15, and handle any questions coming out of that in the two weeks before the Mexico City meeting?
Changing priority back to high for the still-in-July bit. Thanks!
Trac: Priority: Medium to High Status: assigned to needs_review
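As a rough illustration of the rule of thumb quoted above, the following Python sketch writes a CSV file whose columns are fixed in code, so neither the request parameters nor the available data can change them. This is only an assumption about how such a rule could look in practice; the actual implementation is in R and Java, and the write_rows helper is hypothetical. The column list is the one from the relay users sample earlier in this ticket.

```python
import csv
import io

# Fixed in code: changing this list is the only way to change the columns.
COLUMNS = ["date", "country", "users", "downturns", "upturns", "lower", "upper"]


def write_rows(rows):
    """Write rows with a fixed column set, leaving missing values empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# A row without censorship-event data still gets all seven columns.
without_events = write_rows([{"date": "2018-05-06", "country": "", "users": 2220344}])

# A row with event data uses the very same columns.
with_events = write_rows([{
    "date": "2018-05-06", "country": "zw", "users": 245,
    "downturns": "FALSE", "upturns": "FALSE", "lower": 122, "upper": 389,
}])

print(without_events)
print(with_events)
```

With the default settings, csv.DictWriter raises a ValueError for a row that carries a key outside COLUMNS, which fits the idea that adding a column should require a code change.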
Does it make sense to specify our per-graph CSV files there, rather than in the CSV file header?
I think yes. The CSV files are machine-readable first, human-readable is not the priority for these files.
Is the format with two subsections Parameters and Columns okay? Is something missing?
I think this is OK. Perhaps we need an example GET request to start this document off. We're really documenting an HTTP API, not just the individual files. Perhaps we need to say that either we do, or we don't, guarantee the ordering of the columns. If we add/remove columns later would we change the ordering? Especially with removal, would we pad with nulls or something like that? (Anything by transport is particularly affected by this).
Are specifications roughly correct/plausible?
Nothing is jumping out at me as obviously wrong. I haven't considered every one thoroughly yet though.
Do the suggestions make sense? The rule of thumb for deciding which columns we need was: "it should require a code change to change columns, and neither the user should be able to control which columns exist by their choice of parameters, nor should the available data have any influence on that."
The suggestions do make sense, and would solve the immediate column ordering issue (although we should still make a statement as to what we would expect to happen in the future). I commented on #26950 (moved) separately.
Regarding timing, how about we deploy this page still in July, make suggested changes by August 15, take out pre-aggregated stats files by September 15, and handle any questions coming out of that in the two weeks before the Mexico City meeting?
Does it make sense to specify our per-graph CSV files there, rather than in the CSV file header?
I think yes. The CSV files are machine-readable first, human-readable is not the priority for these files.
Agreed.
Is the format with two subsections Parameters and Columns okay? Is something missing?
I think this is OK. Perhaps we need an example GET request to start this document off. We're really documenting an HTTP API, not just the individual files.
Good point. I'll add something.
Perhaps we need to say that either we do, or we don't, guarantee the ordering of the columns. If we add/remove columns later would we change the ordering? Especially with removal, would we pad with nulls or something like that? (Anything by transport is particularly affected by this).
Good questions. I'd rather not guarantee the ordering of columns and instead require users to refer to columns by name, not by index. In particular, the null padding sounds like it would implicitly stop us from removing unnecessary columns, which seems bad. So, yes, let's state this at the start of the page. I'll add something.
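On referring to columns by name: a small Python sketch of what that means for a client. The two inputs carry the same data, but the second one has a different column order and an extra frac column. That second layout is purely hypothetical and only there to show that name-based access keeps working. The total_users helper is likewise made up for this example.

```python
import csv
import io


def total_users(csv_text, country=""):
    """Sum the 'users' column for one country, looking columns up by name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["users"]) for row in reader if row["country"] == country)


# Two totals taken from the sample data earlier in this ticket.
ORIGINAL = "date,country,users\n2018-05-05,,2229328\n2018-05-06,,2220344\n"

# Hypothetical later layout: columns reordered and a 'frac' column added.
REORDERED = "date,users,frac,country\n2018-05-05,2229328,,\n2018-05-06,2220344,,\n"

print(total_users(ORIGINAL))
print(total_users(REORDERED))
```

A client that indexed into each row by position would read the wrong field from the second layout, which is the case the statement at the start of the page is meant to rule out.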
Are specifications roughly correct/plausible?
Nothing is jumping out at me as obviously wrong. I haven't considered every one thoroughly yet though.
Okay.
Do the suggestions make sense? The rule of thumb for deciding which columns we need was: "it should require a code change to change columns, and neither the user should be able to control which columns exist by their choice of parameters, nor should the available data have any influence on that."
The suggestions do make sense, and would solve the immediate column ordering issue (although we should still make a statement as to what we would expect to happen in the future). I commented on #26950 (moved) separately.
Perfect!
Regarding timing, how about we deploy this page still in July, make suggested changes by August 15, take out pre-aggregated stats files by September 15, and handle any questions coming out of that in the two weeks before the Mexico City meeting?
Sounds good to me!
merge_ready for this page.
Setting to needs_revision for the page header part. Will move it back to needs_review once I have something.
Alright, I just pushed commit f7793c7 with a new page header to my task-25383-2 branch. Please take a look. We can still tweak the text there and provide more information. However, if possible, I'd like to deploy this page tomorrow and also announce it on tor-dev@, so that people have at least two weeks as a heads up before suggested changes become effective. Thanks!
Awesome! Thanks for the review! Merged, deployed, and announced on tor-dev@. Created #26998 (moved) and #27000 (moved) for the scheduled changes. Other than that, I think we're done here. Closing. Thanks again!
Trac: Status: merge_ready to closed Resolution: N/A to fixed