Our current model for distinguishing failures, timeouts, and successes is rather simple/arbitrary/confusing:
Timeout: We count any measurement with DIDTIMEOUT=1 and/or with DATACOMPLETE<1 as timeout.
Failure: We count any measurement that doesn't have DIDTIMEOUT=1 and that has DATACOMPLETE>=1 and READBYTES<FILESIZE as failure.
Success: We count everything else as success.
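To make the current rule explicit, here is a minimal sketch of it, assuming a .tpf measurement has already been parsed into a dict of numeric values for the fields named above:
{{{
# Minimal sketch of the current classification rule, assuming a .tpf
# measurement parsed into a dict with numeric DIDTIMEOUT, DATACOMPLETE,
# READBYTES, and FILESIZE values.
def classify(m):
    if m["DIDTIMEOUT"] == 1 or m["DATACOMPLETE"] < 1:
        return "timeout"
    if m["READBYTES"] < m["FILESIZE"]:
        return "failure"
    return "success"
}}}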
We're plotting timeouts and failures here. This is not as useful as it could be.
It would be so much better to enumerate all possible failure cases and include failure information in .tpf output files. Examples:
Turns out that long-running tor instances sometimes fail to keep up-to-date directory information (#29743 (moved)), and as a result OnionPerf cannot make measurements.
Sometimes streams are closed with reason TIMEOUT or DESTROY (pages 1 and 2 of the first attachment on #29744 (moved)), and I bet there are several subcases in each of these.
Regarding timeouts, it does happen that streams are closed by OnionPerf (pages 3 and 4 of that same attachment on #29744 (moved)).
There are likely more, less frequent failure cases that I either did not include in the #29744 (moved) graphs or did not run into at all in the logs I looked at.
Can we enumerate all or at least the most typical failure cases and define criteria for clearly distinguishing them from each other and from timeouts and from successes?
Can we also try to unambiguously identify these failure cases in existing tor/torctl/tgen logs that we process for .tpf files, so that we could include failure case IDs for them in the .tpf files?
Worth noting that the OnionPerf json output records the following error codes when a transfer fails, which come directly from TGen:
{{{
shd-tgen-transfer.c: case TGEN_XFER_ERR_NONE
shd-tgen-transfer.c: case TGEN_XFER_ERR_AUTH
shd-tgen-transfer.c: case TGEN_XFER_ERR_READ
shd-tgen-transfer.c: case TGEN_XFER_ERR_WRITE
shd-tgen-transfer.c: case TGEN_XFER_ERR_TIMEOUT
shd-tgen-transfer.c: case TGEN_XFER_ERR_STALLOUT
shd-tgen-transfer.c: case TGEN_XFER_ERR_PROXY
shd-tgen-transfer.c: case TGEN_XFER_ERR_MISC
}}}
We should consider making OnionPerf include the code (NONE, AUTH, READ etc) in .tpf output.
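As a quick sanity check, these codes could be tallied straight from the json output; the sketch below assumes that transfers are nested under data -> source -> tgen -> transfers and carry is_error/error_code fields, which may not match the actual layout exactly:
{{{
import json
from collections import Counter

# Sketch: tally TGen error codes per OnionPerf analysis json file.
# The nesting (data -> source -> tgen -> transfers) is an assumption.
def count_error_codes(path):
    with open(path) as f:
        analysis = json.load(f)
    codes = Counter()
    for source in analysis.get("data", {}).values():
        for transfer in source.get("tgen", {}).get("transfers", {}).values():
            if transfer.get("is_error"):
                codes[transfer.get("error_code", "NONE")] += 1
    return codes
}}}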
I can also work towards analyzing archived data to see how these codes map to the examples above and classify them. For now, these are my initial thoughts:
A 'NONE' error corresponds to a successful transfer
A 'TIMEOUT' error is recorded by OP if the TGen client fails to complete a transfer in a certain amount of time (defined in the TGen model graph)
A 'STALLOUT' error is recorded by OP if the TGen client fails to transfer any bytes in a certain amount of time (defined in the TGen model graph)
However, OP records a 'READ' error if the TGen server times out during a transfer, and possibly due to other failures as well
Streams being destroyed usually result in a 'PROXY' error; these can be further correlated with the torctl logs to define subcases
Interesting! These sound like they could be very useful to better understand how the application observes failures and timeouts. But as you say, we'll probably have to further correlate these error codes with torctl logs to get a better insight into failures observed by the tor process.
Ideally, we'd have combined error cases/codes for failures/timeouts observed by the application and the tor process, rather than including OnionPerf's error code and then something from torctl events in .tpf output. Or we'll have to specify somewhere which combinations are possible and what exactly they mean. This is probably not trivial to do, but if we don't do it, others will try it and do it slightly wrong, so it's better if we do it.
Can I help with the digging? Maybe we can share the digging work without duplicating the effort?
Thank you for your patience with this. The plan is to extend the code at https://github.com/ana-cc/OPvalidator, on the basis that tor and tgen measurement results in the OP json output can be matched up by source port and time - and so we can get a lot more information per measurement, such as combined tor/tgen error codes and full path for each measurement. This is half a day's worth of work.
The next step after writing code is to analyze some logs where we have an idea of what caused a failure. I will need some logs where we observed failure cases - do you have these?
Your idea to extend that code that matches tgen logs and tor control port event logs sounds interesting. Is that going to replace OnionPerf's analysis.py? If yes, why don't you extend or replace that code rather than start a new code base?
However, I wonder if we could start simpler here by simply looking at the tgen logs alone:
For an initial classification of failure cases it might be sufficient to learn when a request fails and how. Like, in which request phase does a request fail and how much time has elapsed up to that point? Maybe the tgen logs also tell us how a request failed, that is, whether the tor process sent an error or tgen ran into a timeout or stallout (even though we're setting stallout high enough that this is currently not the case) or checksum error or whatever. It would be good to know what fraction of requests succeeded and what fractions failed at the various request stages. This is all based on tgen information, which is the application point of view that treats tor as a black box.
The next step after that, for me, would be to match tgen logs with tor control port event logs. I wonder why we'd be using the source port for this. Is that to handle potentially overlapping requests? Do we handle cases where a source port is re-used over the day, by including time? And what do we do if no corresponding source port is found in the other log file, or is that scenario unrealistic/impossible? In short, this sounds complicated and potentially error-prone. Maybe we could simplify this by doing the matching solely based on timing information? And do you think we could also match tor logs (not control port events) by using the same timing information? Assuming that there's anything interesting in these logs.
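To make my question concrete, here is roughly how I picture the port-plus-time matching (just a sketch; endpoint_local and unix_ts_start are fields I see in the json, whereas the stream-side field names are made up for illustration):
{{{
# Sketch: match a tgen transfer to a tor stream by local port plus a time
# window, so that re-used source ports later in the day don't collide.
# Stream-side field names (source_port, unix_ts_start) are illustrative only.
def match_stream(transfer, streams, max_skew=60.0):
    port = int(transfer["endpoint_local"].rsplit(":", 1)[1])
    candidates = [s for s in streams
                  if s["source_port"] == port
                  and abs(s["unix_ts_start"] - transfer["unix_ts_start"]) <= max_skew]
    if not candidates:
        return None  # no matching stream found in the other log
    return min(candidates,
               key=lambda s: abs(s["unix_ts_start"] - transfer["unix_ts_start"]))
}}}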
Sadly, the weekend is almost here and I likely won't be able to spend much time on this analysis over the weekend. But if I find time, I'll start by reading tgen logs and writing little helper tools to classify failure cases solely based on tgen logs. I'll share measurement identifiers of some sort for failure cases as I find them.
Please disregard my suggestion to drop the source port entirely. I hadn't thought of requests being started every 5 minutes and running for up to 60 minutes. I still had the torperf model in mind where 50 KiB requests would be started every 5 minutes and killed 5 seconds before the next request. Same with 1 MiB/5 MiB requests with different timings. So, this suggestion didn't really make sense. Sorry for the possible confusion.
In the spirit of release early release often, I'm sharing some early results here:
$ grep "_changeState" onionperf_2019-0*.tgen.log | sed 's/.*moving from state //' | sed 's/ to state />/' | tr '\n' ';' | sed 's/CONNECT/_CONNECT/g' | tr '_' '\n' | sort | uniq -c | sort -nr18476 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS; 81 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>ERROR;COMMAND>ERROR; 5 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD; 4 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;SUCCESS>ERROR;PAYLOAD>ERROR; 3 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;SUCCESS>ERROR;RESPONSE>ERROR; 3 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>ERROR; 2 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;SUCCESS>ERROR;PAYLOAD>ERROR; 2 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;PAYLOAD>ERROR; 1 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS; 1
Explanation: These are the state changes for >18k OnionPerf runs performed by op-ab since mid-January. Each measurement starts with CONNECT, followed by a state change to INIT, then CHOICE, and so on. The first line contains the successful runs, whereas most of the following lines contain errors.
I need to take a break now, but after the break I'll look more closely at all of those lines except the first to see what happened. I hope to find out why these measurements broke. Maybe I'll have to look at tor controller event logs or tor logs. Still, looking at 101 cases manually sounds like it could work. We'll see!
Where there is an error that stops the flow of state changes, the json will only contain the states that succeeded. My theory is that we can infer the phase where a transfer failed directly from the json output of OnionPerf. I've been working on a script that produces a *.csv output with all the information from the json for all onionperf errors. You can see a sample *.csv file here (can't attach as trac thinks it's spam):
To test the theory above, I suggest we compare your results with the results of my python script over the same data. Here is the python script you can run over the same data you analyzed to see if we get the same results (i.e., the same phase where the transfer errored). The *.csv records other information as well. You need python3 and python3-pandas installed for the script to run.
[Also, the script matches whether there was an error recorded in the tor stream for each transfer. Matching is done on the port+timestamp. So far I can see that only onionperf 'proxy' errors are directly linked to tor errors... let me know if you need more info on what the columns in the csv are.]
Updated the script and csv file to actually show tor circuit recorded errors and fixed a column name - the previous version did not dump this info. Please use this version for comparison. Note that the column 'state_failed' records the first state that was NOT found in the json. Use the same links to download.
Hmm, I'm running into this issue on a fresh Debian Stretch with python3-pandas installed:
{{{
$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
File "op_errors.py", line 5, in
import utils
ImportError: No module named 'utils'
}}}
Thanks for spotting this; delete the lines that say 'import utils', 'import logging', and 'import re'. Initially the script was using those, but I simplified a lot and don't need those libraries anymore. I will update on my end and push to git.
Now, when I remove those three lines and re-run the script, I get the following error:
{{{
$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'failure_reason_local_y'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "op_errors.py", line 91, in <module>
    main()
  File "op_errors.py", line 85, in main
    all_csv_data["failure_reason_local"] = all_csv_data["failure_reason_local_y"]
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/lib/python3/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'failure_reason_local_y'
}}}
This error seems to go away if I unxz the single file contained in my jsonfiles/ directory. It would be nice if the script either handles xz-compressed files or ignores them (maybe after printing out a warning) and complains about not having any not-xz-compressed input files to process.
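As a sketch of what I mean, assuming the script simply open()s each file in the directory today, the standard lzma module would make the compression transparent:
{{{
import lzma

# Sketch: open a json file whether or not it is xz-compressed, so that
# compressed archives are handled instead of silently breaking the merge.
def open_maybe_xz(path):
    if path.endswith(".xz"):
        return lzma.open(path, "rt")
    return open(path)
}}}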
However, after unxz-ing the input file, I get this error:
{{{
$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "op_errors.py", line 91, in <module>
    main()
  File "op_errors.py", line 53, in main
    tgen_dfObj = pd.DataFrame.from_dict(tgen_data, orient="index", columns= ["endpoint_local", "error_code", "endpoint_remote", "is_error", "total_bytes_read", "transfer_id", "unix_ts_end", "unix_ts_start", "elapsed_seconds"])
TypeError: from_dict() got an unexpected keyword argument 'columns'
}}}
Which Git repository is that where you're developing this script? I'd like to pull from that rather than editing the script file I have here.
Hi, this has been pushed to https://github.com/ana-cc/OPvalidator.
Ok, so my development machine runs testing; the issue is that your version of pandas is older.
I've pushed a version that keeps compatibility with version 0.19 (the one in stretch).
I also improved my initial analysis a few days ago by first reordering tgen log lines by transfer and then looking at state change sequences.
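The reordering step looks roughly like this (a sketch; the regular expression approximates the transfer-level _changeState lines quoted further down in this ticket and may need tweaking, and transport-level lines would need a second pattern):
{{{
import re
from collections import defaultdict

# Sketch: group tgen transfer-level _changeState lines by transfer id and
# record the per-transfer sequence of state changes.
LINE_RE = re.compile(r"_tgentransfer_changeState\] transfer (\S+?,\d+),"
                     r".*moving from state (\S+) to state (\S+)")

def state_sequences(log_path):
    sequences = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                transfer_id, old_state, new_state = m.groups()
                sequences[transfer_id].append(old_state + ">" + new_state)
    return sequences
}}}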
A quick comparison of your script and my results shows that your script detects only 29 of the 95 failures that my updated script detects.
Here are the sequences found by my script together with their frequencies:
{{{
 81 OP_UNKNOWN_1 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA ERROR, transport_error NONE STATUS, transfer COMMAND ERROR, transfer_error NONE PROXY]
  6 OP_UNKNOWN_2 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE PAYLOAD, transport SUCCESS ERROR, transport_error NONE READ, transfer PAYLOAD ERROR, transfer_error NONE READ]
  3 OP_UNKNOWN_3 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transport SUCCESS ERROR, transport_error NONE READ, transfer RESPONSE ERROR, transfer_error NONE AUTH]
  2 OP_UNKNOWN_4 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE ERROR, transfer_error NONE AUTH]
  2 OP_TIMEOUT_5 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE PAYLOAD, transfer PAYLOAD ERROR, transfer_error NONE TIMEOUT]
  1 OP_TIMEOUT_6 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE ERROR, transfer_error NONE TIMEOUT]
}}}
The script only showed errors that were correlated with a circuit id (an inner join, in other words). I've updated it so it outputs all errors in the csv (e.g., I find a total of 123 errors between 10/01 and 10/04). Thanks for spotting this, the updated version is on git!
Okay, that adds a lot more of the cases I'm seeing. Still, 3 are missing:
{{{
OP_TIMEOUT_5: transfer5m,572
OP_UNKNOWN_1: transfer50k,563 and transfer50k,581
}}}
Any idea why these are still missing?
(Note how we'll later be able to re-use these fixes to your script to further classify the failures! So even if this bug hunting might seem annoying at the moment, we might benefit from it later on.)
I can't actually seem to find these in any of my json files or tgen logs just by grepping - could you tell me which tor log file this came from by any chance?
Looks like they are all contained in the same file. Are you missing that one maybe?
{{{
$ grep -m1 "transfer5m,572," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 11:33:05 1547292785.630821 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer5m,572,op-ab,GET,5242880,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state RESPONSE
$ grep -m1 "transfer50k,563," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 10:47:59 1547290079.082864 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer50k,563,op-ab,GET,51200,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state ERROR
$ grep -m1 "transfer50k,581," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 12:19:58 1547295598.455009 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer50k,581,op-ab,GET,51200,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state ERROR
}}}
Ok, thank you for that (I was indeed missing that day). This was very interesting to debug, and unfortunately it is not a bug in my script:
I downloaded the op-ab hosted tgen and torctl logs for 01-13 and recreated the json analysis file from them. It turns out the analysis file on op-ab is different from the one recreated from the same logs on my computer. I believe something happened and onionperf was cut short in the middle of the analysis for that particular day (my file is 16MB, op-ab's is 7.7MB). When I recreate any of the other json files from their respective logs (I have tried this for ~10 other days), the contents and sizes match up.
The good news is, if you run the python script on the json file produced from the logs, all the missing errors are found.
However, this is concerning. I have asked irl to examine op-ab for any clues as to what could have happened at midnight that day. As a first step, I am thinking of writing a script that rebuilds the analysis from the logs and validates the existing results. What do you think?
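A first cut of that validation could simply compare the transfer ids in the archived json with those in a freshly rebuilt one (again a sketch, assuming the data -> source -> tgen -> transfers nesting):
{{{
import json

# Sketch: flag transfers that appear in the rebuilt analysis but are
# missing from the archived one, as with the truncated 01-13 file.
def transfer_ids(path):
    with open(path) as f:
        analysis = json.load(f)
    ids = set()
    for source in analysis.get("data", {}).values():
        ids.update(source.get("tgen", {}).get("transfers", {}).keys())
    return ids

def validate(archived_path, rebuilt_path):
    return sorted(transfer_ids(rebuilt_path) - transfer_ids(archived_path))
}}}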
Hey, that's cool, thanks for finding that last remaining bug! (Do you think we could move that discussion to another ticket? It seems important to get that fixed, but it's not directly related to figuring out in what ways OnionPerf measurements can fail.)
I guess the next step here is trying to find out why those ~100 measurements failed. I'll try to start looking at that this afternoon, but I'd say feel free to do the same in parallel, as time permits. Thanks!
This plot uses your script with a minor extension:
{{{
diff --git a/op_errors.py b/op_errors.py
index 1c8b278..7169e4d 100644
--- a/op_errors.py
+++ b/op_errors.py
@@ -131,6 +131,7 @@ def main():
         #if there are no failures at all in the circuit data then the csv column will simply be left empty
         pass
     header = [
+        'unix_ts_end',
         'hostname_local', 'transfer_id', 'is_error', 'error_code', 'state_failed', 'total_seconds',
         'endpoint_remote', 'total_bytes_read', 'circuit_id', 'stream_id','buildtime_seconds', 'failure_reason_local',
}}}
I fed it with all OnionPerf .json files that we have.
Then I combined the three fields error_code, failure_reason_local (if present), and failure_reason_remote (if present, and only if failure_reason_local is present, too) into a combined error code.
The result is that we have 11 combined error codes now, which are all in the graph.
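For reference, the combination rule is roughly the following (a sketch; the column names are those from the script's csv output, and the separator is my own choice):
{{{
# Sketch: fold error_code, failure_reason_local, and failure_reason_remote
# into one combined code; failure_reason_remote is only appended when
# failure_reason_local is present, as described above.
def combined_error_code(row):
    code = row["error_code"]
    local = row.get("failure_reason_local")
    remote = row.get("failure_reason_remote")
    if local:
        code += "/" + local
        if remote:
            code += "/" + remote
    return code
}}}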
The next step will be to understand in more detail what causes these errors. For example:
READ is a fun one. The cases I looked at (all from op-ab) were all onion service cases. The server had completed sending the response, and all data was "in flight". Yet, some time later, the client had its connection closed shortly before receiving the last remaining bytes. This could be a bug. Still, needs closer investigation.
acute, if you'd like to take a look, too, maybe write down which combined error codes you're going to look at, so that we can avoid duplicating effort. (Thanks for all your efforts so far!)
I would like to tackle the proxy errors - they appear to be a good place to start with as they seem to be correlated with tor errors/logs. I'm also thinking of trying to replicate them. If this is ok with you, I'll start having a look :)
I made a hopefully even more useful plot that distinguishes between public and onion server measurements and that groups errors by OnionPerf/tgen (prefix: TGEN) and tor (prefix: TOR). Admittedly, it's a lot of information in a single plot. It probably needs a fair amount of textual explanation to be readable for Tor Metrics users.
However, I'm afraid that I'm running out of time for the failure analysis. Other stuff piles up here, and even though I'd very much enjoy hunting bugs, I cannot help with this analysis anymore. Sorry!
I'm wondering how we could add a graph like this to Tor Metrics. We earlier speculated about including more failure information in .tpf files, but we would then have to produce new .tpf files for past measurements. Is that trivial to do?
Or would we have to switch to .json files for Tor Metrics? If we'll have to use .json files, this may be out of scope for the next months, otherwise we could add a graph like this to Tor Metrics very soon.
I'm happy to come up with a couple of paragraphs to describe each failure case - already have explanations for some of the proxy errors. Would that be helpful?
The tpf files are built from the json files at the time of analysis. This means we have to change the analysis code and then rerun the analysis for every single pair of logs. Do we have all the logs?
Current PRs to be merged into the analysis code relate to #29374 (moved), and immediately after that #26673 (moved); then this can be my next priority for the analysis.
Yes, a couple of paragraphs describing each failure case would be very helpful!
Regarding lengths, I think that we could use different lengths depending on intended audience:
very short, user-oriented explanations for the graph page,
medium-sized, one or two sentence explanations for the .tpf format specification page, and
as-detailed-as-we-want explanations on Trac tickets where we discuss failure cases and why they occurred.
So, yes, anything you write about these cases will be useful!
Sounds perfect! Yes, I think we still have all the logs, and I can reprocess them here. I'll wait for the changed analysis code then. Thanks!
The code for changing the analysis output to include the error code combinations has been written, but I'm waiting for previous PRs to be merged before submitting a new one.
I've also written a reprocessing module for onionperf, and have tested the reprocessing of 3 months' worth of files from op-ab. These files have the optional PARTIAL fields (as per #26673 (moved)) as well as an optional ERRORCODE field where there were errors:
I've uploaded the reprocessed files here if you want to have a play:
Alright, I extended metrics-lib to parse these fields and metrics-web to import them into the database. Everything looks fine!
Excellent! Thank you for the update!
What's the plan for merging things into OnionPerf? And where would I find the code/patch for reprocessing logs? Thanks!
All the PRs have now been submitted and are awaiting review. You could install the latest dev version from the following github branch and run onionperf --reprocess on the logs, maybe in a test environment:
Would you mind writing down a quick how-to for installing/running this on a vanilla Debian/Ubuntu host? Like, what packages are required, where do we get Stem from, do we need python or python3, is an installation required, etc.? It seems like a lot of things have changed, and the README isn't fully up-to-date anymore. Thanks!
For a Debian stable installation, the instructions in https://github.com/torproject/onionperf/ appear to work, with the only change being that tor needs to be built from source or installed from backports. Here is a step-by-step short version:
To call reprocessing, you need to point OP at the directories where the logs are, and then it does the rest. You can additionally tell it where to put the results. This will multiprocess the logs, expanding to the number of available cores. On my 4-core i5 it took about 45 minutes to process 3 months' worth of data from op-ab.
Will update the project README with all this info as well!
Thanks for writing those instructions! I was able to reprocess three months' worth of logs from op-ab, op-hk, op-nl, and op-us. Yay!
One thing I noticed was that the script uses the hostname for the SOURCE field and for the file name. This doesn't matter if reprocessing is done on the same host as the measurements, but that might not be the case with months/years of logs. Maybe it's worth pointing this out, so that whoever reprocesses past logs knows to change the hostname first.
Next step is to wait for the GitHub branch to be reviewed and merged? Setting reviewer to irl and priority to high, so that we can ideally start reprocessing logs early next week. Thanks!
Trac: Priority: Medium to High; Reviewer: N/A to irl; Status: new to needs_review
Ah - it is very easy to allow for the SOURCE to be given on the command line as a parameter - will make sure to add this to the branch and include it in the docs. You are correct, this now just needs merging - Thanks for bumping the priority!
I've only just added the -n or --nickname command line argument which specifies the SOURCE where the logs were produced. This has not yet been merged: https://github.com/torproject/onionperf/pull/22