Opened 3 months ago

Closed 3 weeks ago

Last modified 3 weeks ago

#29787 closed enhancement (fixed)

Enumerate possible failure cases and include failure information in .tpf output

Reported by: karsten
Owned by: metrics-team
Priority: High
Milestone:
Component: Metrics/Onionperf
Version:
Severity: Normal
Keywords:
Cc: metrics-team, acute
Actual Points:
Parent ID:
Points:
Reviewer: irl
Sponsor:

Description

Our current model for distinguishing failures, timeouts, and successes is rather simple/arbitrary/confusing:

  • Timeout: We count any measurement with DIDTIMEOUT=1 and/or with DATACOMPLETE<1 as timeout.
  • Failure: We count any measurement that doesn't have DIDTIMEOUT=1 and that has DATACOMPLETE>=1 and READBYTES<FILESIZE as failure.
  • Success: We count everything else as success.
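
For concreteness, the three rules above can be sketched in Python (a minimal sketch, assuming a measurement is a dict keyed by the .tpf field names used above):

```python
def classify(measurement):
    """Return 'timeout', 'failure', or 'success' per the current model."""
    didtimeout = measurement.get("DIDTIMEOUT", 0)
    datacomplete = measurement.get("DATACOMPLETE", 0)
    # Timeout: DIDTIMEOUT=1 and/or DATACOMPLETE<1
    if didtimeout == 1 or datacomplete < 1:
        return "timeout"
    # Failure: completed on time but read fewer bytes than the file size
    if measurement.get("READBYTES", 0) < measurement.get("FILESIZE", 0):
        return "failure"
    # Success: everything else
    return "success"
```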

We're plotting timeouts and failures here. This is not as useful as it could be.

It would be so much better to enumerate all possible failure cases and include failure information in .tpf output files. Examples:

  • Turns out that long-running tor instances sometimes fail to keep their directory information up to date (#29743), and as a result OnionPerf cannot make measurements.
  • Sometimes streams are closed with reason TIMEOUT or DESTROY (pages 1 and 2 of the first attachment on #29744), and I bet there are several subcases in each of these.
  • Regarding timeouts, it does happen that streams are closed by OnionPerf (pages 3 and 4 of that same attachment on #29744).
  • There are likely more failure cases, probably less frequent ones, that I either did not include in the #29744 graphs or did not run into at all in the logs I looked at.

Can we enumerate all or at least the most typical failure cases and define criteria for clearly distinguishing them from each other and from timeouts and from successes?

Can we also try to unambiguously identify these failure cases in existing tor/torctl/tgen logs that we process for .tpf files, so that we could include failure case IDs for them in the .tpf files?

Child Tickets

Attachments (2)

op_errors-2019-04-24.png (180.3 KB) - added by karsten 2 months ago.
op_errors-2019-04-24-2.png (165.1 KB) - added by karsten 2 months ago.


Change History (37)

comment:1 Changed 3 months ago by acute

Worth noting that the OnionPerf json output records the following error codes when a transfer fails, which come directly from TGen:

shd-tgen-transfer.c: case TGEN_XFER_ERR_NONE
shd-tgen-transfer.c: case TGEN_XFER_ERR_AUTH
shd-tgen-transfer.c: case TGEN_XFER_ERR_READ
shd-tgen-transfer.c: case TGEN_XFER_ERR_WRITE
shd-tgen-transfer.c: case TGEN_XFER_ERR_TIMEOUT
shd-tgen-transfer.c: case TGEN_XFER_ERR_STALLOUT
shd-tgen-transfer.c: case TGEN_XFER_ERR_PROXY
shd-tgen-transfer.c: case TGEN_XFER_ERR_MISC

We should consider making OnionPerf include the code (NONE, AUTH, READ etc) in .tpf output.
I can also work towards analyzing archived data to see how these codes map to the examples above and classify them. For now, these are my initial thoughts:

  • A 'NONE' error corresponds to a successful transfer
  • A 'TIMEOUT' error is recorded by OP if the TGen client fails to complete a transfer in a certain amount of time (defined in the TGen model graph)
  • A 'STALLOUT' error is recorded by OP if the TGen client fails to transfer any bytes in a certain amount of time (defined in the TGen model graph)
  • However, OP records a 'READ' error if the TGen server times out during a transfer, and possibly due to other failures
  • Streams being destroyed usually result in a 'PROXY' error; these can be further correlated with the torctl logs to define subcases
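
The initial interpretation above could be captured as a lookup table, e.g. (a sketch only; the descriptions are tentative, and `TGEN_ERROR_HINTS` is an invented name, not anything in OnionPerf):

```python
# Tentative mapping of TGen error codes to likely meanings, per the
# initial thoughts above; subcases (especially for PROXY) still need
# to be confirmed against torctl logs.
TGEN_ERROR_HINTS = {
    "NONE": "successful transfer",
    "TIMEOUT": "TGen client did not complete the transfer in time (TGen model graph limit)",
    "STALLOUT": "TGen client transferred no bytes for too long (TGen model graph limit)",
    "READ": "TGen server timed out, or another read-side failure",
    "PROXY": "stream destroyed; correlate with torctl logs for subcases",
}

def describe(error_code):
    """Return a tentative description for a TGen error code."""
    return TGEN_ERROR_HINTS.get(error_code, "unclassified; needs investigation")
```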

Will update after digging some more.

comment:2 Changed 3 months ago by karsten

Interesting! These sound like they could be very useful to better understand how the application observes failures and timeouts. But as you say, we'll probably have to further correlate these error codes with torctl logs to get a better insight into failures observed by the tor process.

Ideally, we'd have combined error cases/codes for failures/timeouts observed by the application and the tor process, rather than including OnionPerf's error code and then something from torctl events in .tpf output. Or we'll have to specify somewhere which combinations are possible and what exactly they mean. This is probably not trivial to do, but if we don't do it, others will try it and do it slightly wrong, so it's better if we do it.

Can I help with the digging? Maybe we can share the digging work without duplicating the effort?

comment:3 Changed 3 months ago by acute

Thank you for your patience with this. The plan is to extend the code at https://github.com/ana-cc/OPvalidator, on the basis that tor and tgen measurement results in the OP json output can be matched up by source port and time - and so we can get a lot more information per measurement, such as combined tor/tgen error codes and full path for each measurement. This is half a day's worth of work.

The next step after writing code is to analyze some logs where we have an idea of what caused a failure. I will need some logs where we observed failure cases - do you have these?

comment:4 Changed 3 months ago by karsten

Hi acute!

Your idea to extend that code that matches tgen logs and tor control port event logs sounds interesting. Is that going to replace OnionPerf's analysis.py? If yes, why don't you extend or replace that code rather than start a new code base?

However, I wonder if we could start simpler here by simply looking at the tgen logs alone:

  1. For an initial classification of failure cases it might be sufficient to learn when a request fails and how. Like, in which request phase does a request fail and how much time has elapsed up to that point? Maybe the tgen logs also tell us how a request failed, that is, whether the tor process sent an error or tgen ran into a timeout or stallout (even though we're setting stallout high enough that this is currently not the case) or checksum error or whatever. It would be good to know what fraction of requests succeeded and what fractions failed at the various request stages. This is all based on tgen information, which is the application point of view that treats tor as a black box.
  1. The next step after that, for me, would be to match tgen logs with tor control port event logs. I wonder why we'd be using the source port for this. Is that to handle potentially overlapping requests? Do we handle cases where a source port is re-used over the day, by including time? And what do we do if no corresponding source port is found in the other log file, or is that scenario unrealistic/impossible? In short, this sounds complicated and potentially error-prone. Maybe we could simplify this by doing the matching solely based on timing information? And do you think we could also match tor logs (not control port events) by using the same timing information? Assuming that there's anything interesting in these logs.
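
To illustrate point 2, matching by source port plus timestamp proximity might look like this (a sketch with invented record shapes, not OnionPerf's actual data structures; the 60-second skew is an arbitrary assumption):

```python
def match_stream(transfer, streams, max_skew=60.0):
    """Find the tor stream with the same source port whose start time is
    closest to the transfer's start, within max_skew seconds. Including
    time handles source ports being re-used over the day."""
    candidates = [s for s in streams
                  if s["source_port"] == transfer["source_port"]
                  and abs(s["unix_ts_start"] - transfer["unix_ts_start"]) <= max_skew]
    if not candidates:
        return None  # no corresponding stream found in the other log
    return min(candidates,
               key=lambda s: abs(s["unix_ts_start"] - transfer["unix_ts_start"]))
```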

Sadly, the weekend is almost here and I likely won't be able to spend much time on this analysis over the weekend. But if I find time, I'll start by reading tgen logs and writing little helper tools to classify failure cases solely based on tgen logs. I'll share measurement identifiers of some sort for failure cases as I find them.

Thanks!

comment:5 Changed 3 months ago by karsten

Please disregard my suggestion to drop the source port entirely. I hadn't thought of requests being started every 5 minutes and running for up to 60 minutes. I still had the torperf model in mind where 50 KiB requests would be started every 5 minutes and killed 5 seconds before the next request. Same with 1 MiB/5 MiB requests with different timings. So, this suggestion didn't really make sense. Sorry for the possible confusion.

In the spirit of release early release often, I'm sharing some early results here:

$ grep "_changeState" onionperf_2019-0*.tgen.log | sed 's/.*moving from state //' | sed 's/ to state />/' | tr '\n' ';' | sed 's/CONNECT/_CONNECT/g' | tr '_' '\n' | sort | uniq -c | sort -nr
18476 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;
  81 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>ERROR;COMMAND>ERROR;
   5 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;
   4 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;SUCCESS>ERROR;PAYLOAD>ERROR;
   3 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;SUCCESS>ERROR;RESPONSE>ERROR;
   3 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>ERROR;
   2 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;SUCCESS>ERROR;PAYLOAD>ERROR;
   2 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;PAYLOAD>ERROR;
   1 CONNECT>INIT;INIT>CHOICE;CHOICE>REQUEST;REQUEST>RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;
   1 

Explanation: These are the state changes for >18k OnionPerf runs performed by op-ab since mid-January. Each measurement starts with CONNECT, followed by a state change to INIT, then CHOICE, and so on. The first line contains the successful runs, whereas most of the following lines contain errors.

I need to take a break now, but after the break I'll look more closely at all of those lines except the first to see what happened. I hope to find out why these measurements broke. Maybe I'll have to look at tor controller event logs or tor logs. Still, looking at 101 cases manually sounds like it could work. We'll see!
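
For reference, the same aggregation as the grep/sed pipeline above could be done in Python, grouping state changes by transfer before counting sequences (a sketch; the regex assumes the `_tgentransfer_changeState` line format shown in the tgen logs):

```python
import re
from collections import Counter

# Matches "... transfer NAME,ID,... moving from state A to state B"
CHANGE = re.compile(
    r"transfer (\S+?),(\d+),.* moving from state (\w+) to state (\w+)")

def sequence_counts(lines):
    """Group state changes per transfer and count distinct sequences."""
    sequences = {}
    for line in lines:
        m = CHANGE.search(line)
        if not m:
            continue
        name, tid, src, dst = m.groups()
        sequences.setdefault((name, tid), []).append(f"{src}>{dst}")
    return Counter(";".join(seq) for seq in sequences.values())
```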

comment:6 Changed 3 months ago by acute

It appears the tgen states are recorded by onionperf in the json output:

RESPONSEA;RESPONSEA>RESPONSEB;RESPONSEB>RESPONSEC;RESPONSEC>SUCCESS;COMMAND>RESPONSE;RESPONSE>PAYLOAD;PAYLOAD>CHECKSUM;CHECKSUM>SUCCESS;

...corresponds to:

    "proxy_choice": 0.003267, //CHOICE
    "proxy_init": 0.000639,   //INIT
    "proxy_request": 0.007366, //REQUEST
    "proxy_response": 1.042777, //RESPONSE A/B/C
    "response": 1.246581, //RESPONSE
    "payload_progress" //PAYLOAD
    "socket_connect": 0.000555, //CONNECT
    "checksum": 13.956267, //CHECKSUM
    "command": 10.765423, //COMMAND
    [...]

Where there is an error that stops the flow of state changes, the json will only contain the states that succeeded. My theory is that we can infer the phase where a transfer failed directly from the json output of OnionPerf. I've been working on a script that produces a *.csv output with all the information from the json for all onionperf errors. You can see a sample *.csv file here (can't attach as trac thinks it's spam):
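
That inference could be sketched as follows (the ordering of the expected state keys is my assumption, reconstructed from the state-change sequences quoted earlier):

```python
# Walk the expected timing keys in order and report the first one
# missing from a transfer's elapsed-seconds dict; that marks the phase
# where the transfer failed. Key names follow the json excerpt above.
EXPECTED = ["socket_connect", "proxy_init", "proxy_choice", "proxy_request",
            "proxy_response", "command", "response", "payload_progress",
            "checksum"]

def first_failed_state(elapsed):
    for state in EXPECTED:
        if state not in elapsed:
            return state
    return None  # all states present: the transfer completed
```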

To test the theory above, I suggest we compare your results with the results of my python script over the same data. Here is the python script you can run over the same data you analyzed, to see if we can get the same results (i.e. the same phase where the transfer errored). The *.csv records other information as well. You need python3 and python3-pandas installed for the script to run.

python3 op_errors.py --path folder-with-json-files

[Also, the script records whether there was an error in the tor stream for each transfer. Matching is done on port+timestamp. So far I can see only onionperf 'proxy' errors are directly linked to tor errors... let me know if you need more info on what the columns in the csv are.]

comment:7 Changed 3 months ago by acute

Updated the script and csv file to actually show tor circuit recorded errors and fixed a column name - previous version did not dump this info. Please use this version for comparison. Note that column 'state failed' records the first state that was NOT found in the json. Use the same links to download.

comment:8 Changed 3 months ago by karsten

Hmm, I'm running into this issue on a fresh Debian Stretch with python3-pandas installed:

$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "op_errors.py", line 5, in <module>
    import utils
ImportError: No module named 'utils'

comment:9 in reply to:  8 Changed 3 months ago by acute

Replying to karsten:

Hmm, I'm running into this issue on a fresh Debian Stretch with python3-pandas installed:

$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "op_errors.py", line 5, in <module>
    import utils
ImportError: No module named 'utils'

Thanks for spotting this. Delete the lines that say 'import utils', 'import logging', and 'import re'; initially the script was using those, but I simplified a lot and don't need the libraries anymore. I will update on my end and push to git.

comment:10 Changed 3 months ago by karsten

Now, when I remove those three lines and re-run the script, I get the following error:

$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'failure_reason_local_y'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "op_errors.py", line 91, in <module>
    main()
  File "op_errors.py", line 85, in main
    all_csv_data["failure_reason_local"] = all_csv_data["failure_reason_local_y"]
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/lib/python3/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'failure_reason_local_y'

This error seems to go away if I unxz the single file contained in my jsonfiles/ directory. It would be nice if the script either handled xz-compressed files or ignored them (maybe after printing a warning) and complained about not having any non-xz-compressed input files to process.
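
Handling both cases might look like this (a minimal sketch, not the script's actual structure; `load_json_files` is an invented name):

```python
import json
import lzma
import os

def load_json_files(path):
    """Read both plain and xz-compressed .json files from a directory,
    warning about anything else."""
    results = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if name.endswith(".json.xz"):
            # lzma.open in text mode decompresses transparently
            with lzma.open(full, "rt") as f:
                results[name] = json.load(f)
        elif name.endswith(".json"):
            with open(full) as f:
                results[name] = json.load(f)
        else:
            print(f"Warning: skipping unrecognized file {name}")
    return results
```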

However, after unxz-ing the input file, I get this error:

$ python3 op_errors.py --path jsonfiles/
Traceback (most recent call last):
  File "op_errors.py", line 91, in <module>
    main()
  File "op_errors.py", line 53, in main
    tgen_dfObj = pd.DataFrame.from_dict(tgen_data, orient="index", columns= ["endpoint_local", "error_code", "endpoint_remote", "is_error", "total_bytes_read", "transfer_id", "unix_ts_end", "unix_ts_start", "elapsed_seconds"])
TypeError: from_dict() got an unexpected keyword argument 'columns'

Which Git repository is that where you're developing this script? I'd like to pull from that rather than editing the script file I have here.

Thanks!

comment:11 Changed 3 months ago by acute

Hi, this has been pushed to https://github.com/ana-cc/OPvalidator.
Ok, so my development machine runs Debian testing; the issue is that your version of pandas is older.
I've pushed a version that keeps compatibility with version 0.19 (the one in Stretch).

comment:12 Changed 3 months ago by karsten

Alright, I got your script running. Yay!

I also improved my initial analysis a few days ago by first reordering tgen log lines by transfer and then looking at state change sequences.

A quick comparison of your script and my result shows that your script detects only a subset of 29 of the 95 failures that my updated script detects.

Here are the sequences found by my script together with their frequencies:

   81 OP_UNKNOWN_1 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA ERROR, transport_error NONE STATUS, transfer COMMAND ERROR, transfer_error NONE PROXY]
   6 OP_UNKNOWN_2 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE PAYLOAD, transport SUCCESS ERROR, transport_error NONE READ, transfer PAYLOAD ERROR, transfer_error NONE READ]
   3 OP_UNKNOWN_3 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transport SUCCESS ERROR, transport_error NONE READ, transfer RESPONSE ERROR, transfer_error NONE AUTH]
   2 OP_UNKNOWN_4 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE ERROR, transfer_error NONE AUTH]
   2 OP_TIMEOUT_5 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE PAYLOAD, transfer PAYLOAD ERROR, transfer_error NONE TIMEOUT]
   1 OP_TIMEOUT_6 [transport CONNECT INIT, transport INIT CHOICE, transport CHOICE REQUEST, transport REQUEST RESPONSEA, transport RESPONSEA RESPONSEB, transport RESPONSEB RESPONSEC, transport RESPONSEC SUCCESS, transfer COMMAND RESPONSE, transfer RESPONSE ERROR, transfer_error NONE TIMEOUT]

Your script finds:

 16 OP_UNKNOWN_1 (missing 65!)
  6 OP_UNKNOWN_2
  3 OP_UNKNOWN_3
  2 OP_UNKNOWN_4
  1 OP_TIMEOUT_5 (missing 1!)
  1 OP_TIMEOUT_6

Here are the missing transfer IDs:

transfer50k:11585 transfer50k:6186 transfer50k:563 transfer1m:3946 transfer1m:6079 transfer50k:10503 transfer50k:10594 transfer50k:10975 transfer50k:11040 transfer50k:11225 transfer50k:1154 transfer50k:11586 transfer50k:11587 transfer50k:11771 transfer50k:11888 transfer50k:12189 transfer50k:12475 transfer50k:12680 transfer50k:12723 transfer50k:1301 transfer50k:13433 transfer50k:13478 transfer50k:15107 transfer50k:15893 transfer50k:15902 transfer50k:16012 transfer50k:17344 transfer50k:17724 transfer50k:17791 transfer50k:18207 transfer50k:18257 transfer50k:18449 transfer50k:2460 transfer50k:2473 transfer50k:2903 transfer50k:3201 transfer50k:3310 transfer50k:3844 transfer50k:4601 transfer50k:4685 transfer50k:5067 transfer50k:5665 transfer50k:5754 transfer50k:581 transfer50k:5907 transfer50k:5912 transfer50k:6468 transfer50k:6551 transfer50k:6738 transfer50k:6886 transfer50k:7089 transfer50k:7266 transfer50k:7481 transfer50k:7503 transfer50k:7586 transfer50k:8200 transfer50k:8259 transfer50k:8277 transfer5m:14290 transfer5m:14570 transfer5m:2020 transfer5m:5208 transfer5m:5394 transfer5m:5422 transfer5m:6319 (these are the 65 OP_UNKNOWN_1 cases)

transfer5m:572 (this is the 1 missing OP_TIMEOUT_5 case)

Would you want to take a look why your script does not detect these failures?

comment:13 Changed 3 months ago by acute

The script only showed errors that were correlated with a circuit id (an inner join, in other words). I've updated it so it outputs all errors in the csv (e.g. I find a total of 123 errors between 10/01 and 10/04). Thanks for spotting this, the updated version is on git!
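
In plain-Python terms, the difference between the old and the fixed behavior is the difference between an inner join and a left join (toy data, invented for illustration; pandas expresses the same thing with `merge(..., how="inner")` vs. `how="left"`):

```python
errors = [{"transfer_id": "t1", "error_code": "PROXY"},
          {"transfer_id": "t2", "error_code": "READ"},
          {"transfer_id": "t3", "error_code": "PROXY"}]
circuits = {"t1": 11, "t3": 42}  # transfer_id -> circuit_id

def join(errors, circuits, keep_unmatched):
    """Attach circuit ids to errors. With keep_unmatched=False this is
    an inner join and silently drops errors without a circuit; with
    keep_unmatched=True (the fix) unmatched errors are kept, with
    circuit_id set to None."""
    rows = []
    for e in errors:
        cid = circuits.get(e["transfer_id"])
        if cid is None and not keep_unmatched:
            continue
        rows.append({**e, "circuit_id": cid})
    return rows
```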

comment:14 Changed 3 months ago by karsten

Okay, that adds a lot more of the cases I'm seeing. Still, 3 are missing:

OP_TIMEOUT_5: transfer5m,572
OP_UNKNOWN_1: transfer50k,563 and transfer50k,581

Any idea why these are still missing?

(Note how we'll later be able to re-use these fixes to your script to further classify the failures! So even if this bug hunting might seem annoying at the moment, we might benefit from it later on.)

comment:15 in reply to:  14 Changed 3 months ago by acute

Replying to karsten:

Okay, that adds a lot more of the cases I'm seeing. Still, 3 are missing:

OP_TIMEOUT_5: transfer5m,572
OP_UNKNOWN_1: transfer50k,563 and transfer50k,581

Any idea why these are still missing?

I can't actually seem to find these in any of my json files or tgen logs just by grepping - could you tell me which tor log file this came from by any chance?

comment:16 Changed 2 months ago by karsten

Looks like they are all contained in the same file. Are you missing that one maybe?

$ grep -m1 "transfer5m,572," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 11:33:05 1547292785.630821 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer5m,572,op-ab,GET,5242880,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state RESPONSE
$ grep -m1 "transfer50k,563," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 10:47:59 1547290079.082864 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer50k,563,op-ab,GET,51200,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state ERROR
$ grep -m1 "transfer50k,581," *.log
onionperf_2019-01-13_23:59:59.tgen.log:2019-01-12 12:19:58 1547295598.455009 [info] [shd-tgen-transfer.c:363] [_tgentransfer_changeState] transfer transfer50k,581,op-ab,GET,51200,(null),0,state=COMMAND,error=NONE moving from state COMMAND to state ERROR

comment:17 Changed 2 months ago by acute

Ok thank you for that (was indeed missing that day). This was very interesting to debug, and it is not a bug in my script unfortunately:

I downloaded the op-ab hosted tgen and torctl logs for 01-13 and recreated the json analysis file from them. It turns out the analysis file on op-ab is different from the one recreated from the same logs on my computer. I believe something happened and onionperf was cut short in the middle of the analysis for that particular day (my file is 16MB, op-ab's is 7.7MB). When I recreate any of the other json files from their respective logs (I have tried for ~10 other days), the contents and sizes match up.

The good news is, if you run the python script on the json file produced from the logs, all the missing errors are found.

However, this is concerning. I have asked irl to examine op-ab for any clues as to what could have happened at midnight that day. As a first step, I am thinking of writing a script that rebuilds the analysis from the logs and validates the existing results. What do you think?

To recreate my steps:

  wget https://op-ab.onionperf.torproject.net/tgen-client/log_archive/onionperf_2019-01-13_23%3A59%3A59.tgen.log
  wget https://op-ab.onionperf.torproject.net/tor-client/log_archive/onionperf_2019-01-13_23%3A59%3A59.torctl.log
  onionperf analyze --tgen onionperf_2019-01-13_23:59:59.tgen.log --torctl onionperf_2019-01-13_23:59:59.torctl.log
  unxz onionperf.analysis.json.xz                    
  wget https://op-ab.onionperf.torproject.net/2019-01-13.onionperf.analysis.json.xz
  unxz 2019-01-13.onionperf.analysis.json.xz                    
  diff onionperf.analysis.json 2019-01-13.onionperf.analysis.json

comment:18 Changed 2 months ago by karsten

Hey, that's cool, thanks for finding that last remaining bug! (Do you think we could move that discussion to another ticket; it seems important to get that fixed, but it's not directly related to figure out in what ways OnionPerf measurements can fail.)

I guess the next step here is trying to find out why those ~100 measurements failed. I'll try to start looking at that this afternoon, but I'd say feel free to do the same in parallel, as time permits. Thanks!

comment:19 in reply to:  18 Changed 2 months ago by acute

Replying to karsten:

Hey, that's cool, thanks for finding that last remaining bug! (Do you think we could move that discussion to another ticket; it seems important to get that fixed, but it's not directly related to figure out in what ways OnionPerf measurements can fail.)

This issue has been filed under bug #30238.

Changed 2 months ago by karsten

Attachment: op_errors-2019-04-24.png added

comment:20 Changed 2 months ago by karsten

Alright, I finally made some progress here!

Last things first, I made the following plot:

[attachment: op_errors-2019-04-24.png]
This plot uses your script with a minor extension:

diff --git a/op_errors.py b/op_errors.py
index 1c8b278..7169e4d 100644
--- a/op_errors.py
+++ b/op_errors.py
@@ -131,6 +131,7 @@ def main():
             #if there are no failures at all in the circuit data then the csv column will simply be left empty
             pass
         header = [
+            'unix_ts_end', 'hostname_local',
             'transfer_id', 'is_error', 'error_code', 'state_failed',
             'total_seconds', 'endpoint_remote', 'total_bytes_read',
             'circuit_id', 'stream_id','buildtime_seconds', 'failure_reason_local',

I fed it with all OnionPerf .json files that we have.

Then I combined the three fields error_code, failure_reason_local (if present), and failure_reason_remote (if present, and only if failure_reason_local is present, too) into a combined error code.

The result is that we have 11 combined error codes now, which are all in the graph.
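
That combination rule could be sketched as follows (the `/` separator is my assumption; the field names follow the csv columns quoted in the diff above):

```python
def combined_error_code(error_code, failure_reason_local=None,
                        failure_reason_remote=None):
    """Combine error_code with failure_reason_local (if present) and
    failure_reason_remote (only if failure_reason_local is present too)
    into one combined code."""
    parts = [error_code]
    if failure_reason_local:
        parts.append(failure_reason_local)
        if failure_reason_remote:
            parts.append(failure_reason_remote)
    return "/".join(parts)
```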

The next step will be to understand in more detail what causes these errors. For example:

  • READ is a fun one. The cases I looked at (all from op-ab) were all onion service cases. The server had completed sending the response, and all data was "in flight". Yet, some time later, the client had its connection closed shortly before receiving the last remaining bytes. This could be a bug. Still, needs closer investigation.

acute, if you'd like to take a look, too, maybe write down which combined error codes you're going to look at, so that we can avoid duplicating effort. (Thanks for all your efforts so far!)

comment:21 in reply to:  20 Changed 2 months ago by acute

Replying to karsten:

Alright, I finally made some progress here!

Last things first, I made the following plot:

This plot looks amazing, thank you for this!

acute, if you'd like to take a look, too, maybe write down which combined error codes you're going to look at, so that we can avoid duplicating effort. (Thanks for all your efforts so far!)

I would like to tackle the proxy errors - they appear to be a good place to start with as they seem to be correlated with tor errors/logs. I'm also thinking of trying to replicate them. If this is ok with you, I'll start having a look :)

comment:22 Changed 2 months ago by karsten

Sounds like a fine plan, please go ahead! Thanks!

Changed 2 months ago by karsten

Attachment: op_errors-2019-04-24-2.png added

comment:23 Changed 2 months ago by karsten

I made a hopefully even more useful plot that distinguishes between public and onion server and that groups errors by OnionPerf/tgen (prefix: TGEN) and tor (prefix: TOR). Admittedly, it's a lot of information in a single plot. It probably needs a fair amount of textual explanation to be readable for Tor Metrics users.

[attachment: op_errors-2019-04-24-2.png]
However, I'm afraid that I'm running out of time for the failure analysis. Other stuff piles up here, and even though I'd very much enjoy hunting bugs, I cannot help with this analysis anymore. Sorry!

I'm wondering how we could add a graph like this to Tor Metrics. We earlier speculated about including more failure information in .tpf files, but we would then have to produce new .tpf files for past measurements. Is that trivial to do?

Or would we have to switch to .json files for Tor Metrics? If we'll have to use .json files, this may be out of scope for the next months, otherwise we could add a graph like this to Tor Metrics very soon.

comment:24 in reply to:  23 ; Changed 2 months ago by acute

Replying to karsten:

I made a hopefully even more useful plot that distinguishes between public and onion server and that groups errors by OnionPerf/tgen (prefix: TGEN) and tor (prefix: TOR). Admitted, it's a lot of information in a single plot. It probably needs a fair amount of textual explanation to be readable for Tor Metrics users.
However, I'm afraid that I'm running out of time for the failure analysis.

I'm happy to come up with a couple of paragraphs to describe each failure case - already have explanations for some of the proxy errors. Would that be helpful?

I'm wondering how we could add a graph like this to Tor Metrics. We earlier speculated about including more failure information in .tpf files, but we would then have to produce new .tpf files for past measurements. Is that trivial to do? Or would we have to switch to .json files for Tor Metrics?

The tpf files are built from the json files at the time of analysis. This means we have to change the analysis code and then rerun the analysis for every single pair of logs. Do we have all the logs?
The current PRs to be merged into the analysis code relate to #29374, and immediately after that #26673; then this can be my next priority for the analysis.

comment:25 in reply to:  24 Changed 2 months ago by karsten

Replying to acute:

Replying to karsten:

I made a hopefully even more useful plot that distinguishes between public and onion server and that groups errors by OnionPerf/tgen (prefix: TGEN) and tor (prefix: TOR). Admitted, it's a lot of information in a single plot. It probably needs a fair amount of textual explanation to be readable for Tor Metrics users.
However, I'm afraid that I'm running out of time for the failure analysis.

I'm happy to come up with a couple of paragraphs to describe each failure case - already have explanations for some of the proxy errors. Would that be helpful?

Yes, that would be very helpful!

Regarding lengths, I think that we could use different lengths depending on intended audience:

  • very short, user-oriented explanations for the graph page,
  • medium-sized, one or two sentence explanations for the .tpf format specification page, and
  • as-detailed-as-we-want explanations on Trac tickets where we discuss failure cases and why they occurred.

So, yes, anything you write about these cases will be useful!

I'm wondering how we could add a graph like this to Tor Metrics. We earlier speculated about including more failure information in .tpf files, but we would then have to produce new .tpf files for past measurements. Is that trivial to do? Or would we have to switch to .json files for Tor Metrics?

The tpf files are built from the json files at the time of analysis. This means we have to change the analysis code and then rerun the analysis for every single pair of logs. Do we have all the logs?
The current PRs to be merged into the analysis code relate to #29374, and immediately after that #26673; then this can be my next priority for the analysis.

Sounds perfect! Yes, I think we still have all the logs, and I can reprocess them here. I'll wait for the changed analysis code then. Thanks!

comment:26 Changed 6 weeks ago by acute

The code for changing the analysis output to include the error code combinations has been written, but I'm waiting for previous PRs to be merged before submitting a new one.
I've also written a reprocessing module for onionperf, and have tested the reprocessing of 3 months' worth of files from op-ab. These files have the optional PARTIAL fields (as per #26673) as well as an optional ERRORCODE field where there were errors.

I've uploaded the reprocessed files here if you want to have a play:

https://erg.abdn.ac.uk/~ana/reprocessing_test/
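For illustration, a measurement line in a reprocessed .tpf file might then look roughly like the following. The line format (space-separated KEY=VALUE pairs) is the existing Torperf results format; the specific PARTIAL and ERRORCODE values shown here are invented for the example:

```
DATACOMPLETE=1555000000.72 DIDTIMEOUT=0 ERRORCODE=TGEN/READ FILESIZE=51200 PARTIAL10240=2.25 PARTIAL20480=3.11 READBYTES=20480 SOURCE=op-ab START=1555000000.12 ...
```

Both new fields are optional, so .tpf files without them stay valid for existing parsers.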

comment:27 Changed 5 weeks ago by karsten

Alright, I extended metrics-lib to parse these fields and metrics-web to import them into the database. Everything looks fine!

What's the plan for merging things into OnionPerf? And where would I find the code/patch for reprocessing logs? Thanks!

comment:28 in reply to:  27 ; Changed 4 weeks ago by acute

Replying to karsten:

Alright, I extended metrics-lib to parse these fields and metrics-web to import them into the database. Everything looks fine!

Excellent! Thank you for the update!

What's the plan for merging things into OnionPerf? And where would I find the code/patch for reprocessing logs? Thanks!

All the PRs have now been submitted and are awaiting review. You could install the latest dev version from the following GitHub branch and run onionperf --reprocess on the logs, maybe in a test environment:

https://github.com/ana-cc/onionperf/commits/steins_gate

comment:29 in reply to:  28 ; Changed 4 weeks ago by karsten

Replying to acute:

All the PRs have now been submitted and are awaiting review. You could install the latest dev version from the following GitHub branch and run onionperf --reprocess on the logs, maybe in a test environment:

https://github.com/ana-cc/onionperf/commits/steins_gate

Would you mind writing down a quick how-to for installing/running this on a vanilla Debian/Ubuntu host? Like, what packages are required, where do we get Stem from, do we need python or python3, is an installation required, etc.? It seems like a lot of things have changed, and the README isn't fully up-to-date anymore. Thanks!

comment:30 in reply to:  29 Changed 4 weeks ago by acute

Replying to karsten:

Replying to acute:

Would you mind writing down a quick how-to for installing/running this on a vanilla Debian/Ubuntu host? Like, what packages are required, where do we get Stem from, do we need python or python3, is an installation required, etc.? It seems like a lot of things have changed, and the README isn't fully up-to-date anymore. Thanks!

For a Debian stable installation, the instructions in https://github.com/torproject/onionperf/ appear to work, the only change being that tor needs to be built from source or installed from backports. Here is a short step-by-step version:

sudo apt install git cmake make build-essential gcc libigraph0-dev libglib2.0-dev python-dev libxml2-dev python-lxml python-networkx python-scipy python-matplotlib python-numpy libevent-dev libssl-dev

echo 'deb http://deb.debian.org/debian stretch-backports main' | sudo tee -a /etc/apt/sources.list
sudo apt update
sudo apt-get -t stretch-backports install python-stem tor


git clone https://github.com/shadow/tgen.git
cd tgen
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/home/$USER/.local
make
sudo ln -s ~/tgen/build/tgen /usr/bin/tgen

git clone https://github.com/ana-cc/onionperf
cd onionperf
git checkout steins_gate
sudo python setup.py build
sudo python setup.py install

To run reprocessing, you point OnionPerf at the directories where the logs are, and it does the rest. You can additionally tell it where to put the results. It processes the logs in parallel, using all available cores; on my 4-core i5 it took about 45 minutes to process 3 months' worth of data from op-ab.

onionperf reprocess --tgen-dir onionperf-data/tgen-client/ --torctl-dir onionperf-data/tor-client/ --prefix my-new-results/
Last edited 4 weeks ago by acute

comment:31 in reply to:  29 Changed 4 weeks ago by acute

Would you mind writing down a quick how-to for installing/running this on a vanilla Debian/Ubuntu host? Like, what packages are required, where do we get Stem from, do we need python or python3, is an installation required, etc.? It seems like a lot of things have changed, and the README isn't fully up-to-date anymore. Thanks!

Will update the project README with all this info as well!

comment:32 Changed 4 weeks ago by karsten

Priority: Medium → High
Reviewer: irl
Status: new → needs_review

Thanks for writing those instructions! I was able to reprocess three months' worth of logs from op-ab, op-hk, op-nl, and op-us. Yay!

One thing I noticed is that the script uses the local hostname for the SOURCE field and for the file name. This doesn't matter if reprocessing is done on the same host as the measurements, but that might not be the case with months/years of logs. Maybe it's worth pointing this out, so that whoever reprocesses past logs knows to change the hostname before starting.
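Until such an option exists, one stopgap would be to patch the SOURCE field in the generated files after the fact. A minimal sketch, assuming GNU sed and the space-separated KEY=VALUE line format of .tpf files (the sample line, file name, and source name op-ab are made up for the example):

```shell
# Create a hypothetical .tpf measurement line that carries the
# reprocessing host's name instead of the measuring host's name.
printf 'DATACOMPLETE=1.0 SOURCE=reprocessing-host START=0.5\n' > sample.tpf

# Rewrite whatever SOURCE= value is present to the measuring host's name.
sed -i 's/SOURCE=[^ ]*/SOURCE=op-ab/' sample.tpf

grep 'SOURCE=' sample.tpf
```

If the hostname is also embedded in the .tpf file name, the files would need renaming as well.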

Next step is to wait for the GitHub branch to be reviewed and merged? Setting reviewer to irl and priority to high, so that we can ideally start reprocessing logs early next week. Thanks!

comment:33 Changed 4 weeks ago by acute

Ah - it is very easy to allow the SOURCE to be given as a command-line parameter - I will make sure to add this to the branch and include it in the docs. You are correct, this now just needs merging - thanks for bumping the priority!

comment:34 Changed 3 weeks ago by irl

Resolution: fixed
Status: needs_review → closed

This has been merged.

comment:35 Changed 3 weeks ago by acute

I've only just added the -n (or --nickname) command-line argument, which specifies the SOURCE where the logs were produced. This has not yet been merged: https://github.com/torproject/onionperf/pull/22

Example usage:
onionperf reprocess --tgen-dir onionperf-data/tgen-client/ --torctl-dir onionperf-data/tor-client/ --prefix my-new-results/ -n op-ab.torproject.net
