Write Python version of filter.R to parse Torperf's new .mergedata format

added component::archived/torperf points::4 priority::high resolution::duplicate status::closed type::enhancement labels

Trac:
Owner: karsten to N/A
Status: new to assigned

Setting priority to major, because our analyses depend on visualizing all Torperf data, not just that part in the .data files. Blocking on #2672 (closed). I think Tom offered to work on this ticket.

Trac:
Priority: normal to major
Component: Metrics to Torperf
Points: N/A to 4

#2672 (closed) is fixed now, so we can continue working on this ticket.

tomb said he started looking into this, so I'm assigning the ticket to him.

Trac:
Owner: N/A to tomb

Tom, did you make any progress on this? If not, I might steal this ticket from you, because Mike and I want to parse and graph new Torperf results next week.

Stealing.

Trac:
Owner: tomb to karsten

See branch task2687 in my public repository for a possible patch.

Trac:
Status: assigned to needs_review

Tom, you said you also have a partial patch? Can you make that available somewhere?

Trac:
Cc: N/A to tomb

Trac:
parsemergedata.R

Partial merge data parser

I have added my draft mergedata parser, the TODO items are commented in the code.

Thanks for attaching your code. It's very interesting to read someone else's R code. I learn a lot by doing so. :)

However, I'm afraid neither of our attempts are sufficient yet. Here are a few comments:

I'd rather want to avoid adding another dependency with the "stringr" package, unless we have to. I replaced the str_* functions with standard R functions, e.g., unlist(strsplit()), which seemed to work.
Writing to CSV doesn't work yet. I'd like to know if we can export the parsed data easily.
The major issue is that parsing takes much too long. I parsed 1 week of 50 KB downloads containing 4247 rows, 2013 of which are measurements. Your script took 2:54 minutes for this task or 1:47 minutes when using the standard R functions instead of the stringr stuff. My script takes 0:25 minutes for this task which is still far too much. For comparison, reading the output CSV file takes only 314 milliseconds. People will want to parse months or even years of data coming from a dozen Torperf runs or more. This shouldn't take hours, but minutes. So, we should aim for at most a few seconds for the week of data. Plus, the script should scale linearly for more data, which I'm not sure is the case for our attempts.

If we cannot find an efficient way to parse these files, let's take one step back. What data formats are there that allow us to add or remove columns easily and that can be parsed efficiently in R? CSV is fast, but inflexible. Space-separated key-value-pairs are flexible, but apparently slow. What else is there? XML?

I wrote Karsten yesterday with some brainstorming on how to improve efficiency. I have had some new thoughts since then which I will prototype today and then document.

With respect to the csv output, yes this is easy. I just had that bit commented out so I could concentrate on the data structure and input checking. I'll add the output functionality now.

Pasting your email and replying to it here:

What I am saying is, maybe R is just the wrong tool for the really string heavy stuff. I could write a small parser in c, using lex and yacc so that the parser can be an efficient state machine. This parser could then be called from an R script. The parser does the front end string processing and can dump it into the csv. We then read the csv into the R code to crunch the stats.

Seems like this would use the best features of each tool. I can certainly make my current R approach output to csv, that is just a few lines of code at the bottom. I was focusing on testing the data structure before producing text output.

Since the input language is so simple and has a regular level grammar the state machine will be super efficient since there is no need for a lookahead or LR parsing the way there would be with a context free grammar. The advantage is that the state machine would be run byte by byte over the input in a single pass. Very low memory requirement since you only need to buffer on an as needed basis. You don't have to read in the characters as a large matrix which may be what R does. I don't know how many lines R buffers at once, but with lex and yacc you know the buffer is a small constant size. That way we really know that our O(n) single pass through the text doesn't have any hidden side costs.

I haven't given up hope to use R as the tool to parse Torperf output yet. After all, we are free to choose whatever data format we want, so we can still make it fit our parsing capabilities. What I'd really want to avoid is to add yet another technology to the Torperf zoo. And parsing lines with key-value pairs shouldn't be a hard problem for R.

Do you know which part of our R approaches slows us down so horribly? I think R is just mad at us for using a for loop and wants us to use a real R thing instead. But I haven't figured out a better way to parse stuff.

Replying to karsten:

Pasting your email and replying to it here:

What I am saying is, maybe R is just the wrong tool for the really string heavy stuff. I could write a small parser in c, using lex and yacc so that the parser can be an efficient state machine. This parser could then be called from an R script. The parser does the front end string processing and can dump it into the csv. We then read the csv into the R code to crunch the stats.

Seems like this would use the best features of each tool. I can certainly make my current R approach output to csv, that is just a few lines of code at the bottom. I was focusing on testing the data structure before producing text output.

Yes, it would be a good approach if R really couldn't parse your file efficiently.

Since the input language is so simple and has a regular level grammar the state machine will be super efficient since there is no need for a lookahead or LR parsing the way there would be with a context free grammar. The advantage is that the state machine would be run byte by byte over the input in a single pass. Very low memory requirement since you only need to buffer on an as needed basis. You don't have to read in the characters as a large matrix which may be what R does.

Not likely. It looks very much like R just reads in a line at a time using whatever buffering stdio provides.

I don't know how many lines R buffers at once, but with lex and yacc you know the buffer is a small constant size. That way we really know that our O(n) single pass through the text doesn't have any hidden side costs.

It's not a single pass through the text. Each time you process an input line, you copy all of the preceding lines:

117	    mergedata_vector <- c(mergedata_vector, my.mergedata)

That's O(n^2^) right there.

Replying to rransom:

{{{ 117 mergedata_vector <- c(mergedata_vector, my.mergedata) }}}

That's O(n^2^) right there.

That's my guess, too. In theory, it could be implemented more efficiently than O(n^2^), but apparently it isn't. Both Tom's and my earlier approaches use code like this.

I searched for how others solve similar problems and came up with some lapply, sapply, and do.call magic. See branch task2687 in my public repository for an alternative implementation. This code finishes parsing the file mentioned above in 4.3 seconds. Note that this is still a prototype that only parses the .data pieces of the .mergedata files. But in theory, it should work for all fields in .mergedata.

Hey all!

If we can hold on for just a little while, I have a new version almost complete, but I really want to finish it before I post it. My new version incorporates code from Karsten's version as well as mine. I'd love to get it totally finished, but I do have to test the comparative efficiency of the program against both good data, and data with some errors mixed in. I've got this test code written, but I want to get it all wrapped together in a branch before you all evaluate it.

BTW: The reason I didn't post my complete email to Karsten was because it has some ideas that were already out of date since I had done further thinking.

W/r/t adding all the data into a big vector, this may be the problem, and it is what I meant when I said above that "I have some ideas" about where to look for the problem.

If R makes any kind of sense at all this could not possibly be O(n^2) since it is tail recursion. If list insertion at the tail of a vector is any more than adding a pointer to the already allocated memory for the line, then R is not the right choice. This is not impossible. My experience with R like languages makes me intuit that insertion at the end of a vector is a pointer operation, but I have not yet gotten the chance to learn the details of R that are not listed in the language specification which doesn't say. I know that it is constant time in other list oriented languages such as LISP, Haskel, and SML.

I plan to try i/o code more in a c style with data output in small buffer flushes rather than being in memory in a single large data structure, however I used the data structure approach because of my general impression that this is more in the spirit of R. If I understand correctly, R was designed to handle really large matrices, which is the structure I chose. I think this is almost a defining feature of R.

Bottom line: I am comparing several approaches to find the best one. I will post a working if not perfect version soon.

Replying to tomb:

W/r/t adding all the data into a big vector, this may be the problem, and it is what I meant when I said above that "I have some ideas" about where to look for the problem.

If R makes any kind of sense at all this could not possibly be O(n^2^) since it is tail recursion. If list insertion at the tail of a vector is any more than adding a pointer to the already allocated memory for the line, then R is not the right choice. This is not impossible. My experience with R like languages makes me intuit that insertion at the end of a vector is a pointer operation, but I have not yet gotten the chance to learn the details of R that are not listed in the language specification which doesn't say. I know that it is constant time in other list oriented languages such as LISP, Haskel, and SML.

NO.

A vector is an array. By using the vector concatenation operation, you were explicitly creating a new array in each iteration, containing a copy of the preceding array and one additional element.

It is possible to build a vector from beginning to end in amortized linear time: Store a pointer to a mutable vector and the number of elements already stored in the vector in a mutable data structure, add each new element to the vector by mutating the vector and element count (known in Common LISP as a vector's ‘tail pointer’), and double the length of the vector (its ‘capacity’ in Python; I don't know what Common LISP calls this) every time you run out of room and need to copy all preceding elements to a new vector.

It is also possible to build a linked list of processed lines in reverse order: In LISP terms, replace your list with (CONS NEW-ELEMENT OLD-LIST).

Or you can build a linked list of processed lines in the same order in which you received them, by mutating the ‘CDR’s of list cells. lapply (known in Scheme as map and in Common LISP as MAPCAR) most likely does this.

I plan to try i/o code more in a c style with data output in small buffer flushes rather than being in memory in a single large data structure, however I used the data structure approach because of my general impression that this is more in the spirit of R. If I understand correctly, R was designed to handle really large matrices, which is the structure I chose. I think this is almost a defining feature of R.

Use the R standard library. Library functions are more likely to be properly designed, implemented, and optimized than anything you can write in the R interpreted language without digging into the guts of how R is implemented.

@rransom: will make your suggested changes

I'm somewhat worried about the time we spend on this ticket. Maybe we should have two substeps here:

Come up with a prototype in R that parses files containing key=value pairs as efficiently as possible. There's no need to make this code beautiful yet, just fast. AFAIK, 4.3 seconds are the lowest bid so far, but I hope we can do better.
Decide if the prototype is fast enough to stick with R and/or the .mergedata format. If not, discuss which piece should be replaced? If the prototype is fast enough, clean up the prototype and call this ticket done.

I'm mainly interested if we can define the .mergedata format as the new standard Torperf output format. Or rather, I want to tweak that format a little bit which I'm going to write in a separate ticket. But I need to know whether the general key=value format is something we can work with.

Trac:
parsemergedata.2.R

I broke karsten's and my code down into it's barest essentials.

Even with everything but the actual parsing and output removed they are far too slow.

I ran R's native profiling to find out what they were spending their time on, and found it to be string manipulation and output.

I found no significant difference between aggregation of output in data frame, vector, or immediate appending to the native output buffer. I went through about a dozen possible implementations and found none of them to be more than a small constant factor different in run time.

I profiled memory consumption and found that all the versions I experimented with had modest O(n) memory consumption.

Tentative conclusion: R is ill suited to significant string manipulation Tentative recommendation: Let R crunch numbers and stats, but do the string manipulation in a different language. Why not move the string manipulation into the programs that provide the .data and .mergedata?

Replying to tomb:

Tentative conclusion: R is ill suited to significant string manipulation Tentative recommendation: Let R crunch numbers and stats, but do the string manipulation in a different language.

Okay. I didn't expect R to be incapable of handling this data format, because R is really fast at parsing CSV files, tables, and so on. But I agree with you. Let's stop trying to use R for this task.

Why not move the string manipulation into the programs that provide the .data and .mergedata?

You mean why not produce both the .mergedata format and another format that R can handle more easily? Why would we need the .mergedata format then? We should agree on a single data format that describes Torperf data.

If we find another format that R can handle more easily, we should only use that format. But we want to make sure that the data format can be extended easily. For example, if we want to add another parameter like CBT, we want to do that without breaking old stuff. Or we might want to have some fields show up in only some of the measurements, like hidden service substeps, but without writing NA for them in all other measurements. And we want to be able to remove fields that we don't need anymore. The key=value formats seems more flexible than CSV here. See #3036 (closed) for the Torperf data format discussion.

So, what can we do about this ticket? How about we rewrite filter.R in Python? The rest of Torperf is written in Python, so that we can expect people to have that available. I'm changing the ticket summary accordingly. If you disagree about Python or rewriting filter.R in it, we can always change it to something else.

Thanks!

Trac:
Status: needs_review to assigned
Owner: karsten to N/A
Summary: Update filter.R to parse Torperf's new .mergedata format to Write Python version of filter.R to parse Torperf's new .mergedata format

The filter.R script in my task2687 branch broke a few days ago when applying it to a .mergedata file that had the DIDTIMEOUT key in some but not all of the lines. Time to come up with a replacement!

Closing. #3285 (closed) is the new ticket.

Trac:
Resolution: N/A to duplicate
Status: assigned to closed

closed

changed time estimate to 32h

mentioned in issue #3036 (closed)

mentioned in issue #3285 (closed)

Write Python version of filter.R to parse Torperf's new .mergedata format

Child items ...

Activity