We need a way to measure, over time, split up by type of directory
request, how much bandwidth a client uses for requests and for
responses.
Right now my approach is to capture the size of the inbuf in
connection_dir_client_reached_eof and log a message with the status
code and the purpose of the request (DIR_PURPOSE_*).
We should include both directory lookups that are successful and those
that are not.
For the case where we interact with a directory and the status code is
erroneous, we log that using the above approach. For the case where we
are simply unable to reach a given directory, we should log an error
reason as well as how much data had been fetched before the error
was encountered. This should be possible in
connection_dir_request_failed.
- We should, if possible, measure this with and without circuit
  overhead. But not if it's too hard.
- We should, if possible, count failed circuits that are opened only
  for directory requests. But not if it's too hard.
I haven't looked into the circuit/channel code yet. I'll do that once
the first two bullets are solved, but hopefully this shouldn't be too
difficult.
How should we go about this measuring code in general:
I don't think the log message approach is suitable to enter the codebase.
Elsewhere in the code we collect metrics and save them to a set of
"stats" files. Should we use the same approach for this measurement, so
that it can be reused at a later point in time? Someone might, say,
look into using an entirely different compression algorithm than what
zlib provides, or a similar task that requires knowledge of the
bandwidth currently used by directory requests/responses.
One of the requests I have for my "onboarding period" is to get a lot
of feedback on timeframes: I currently have Tor running locally that
emits information on failed directory requests, as well as bytes read
and written per request. If we go for using an additional stats file,
I should be able to have some code ready for review by noon Monday.
Once that code is in, I'll continue with support for measurements that
include the overhead of the circuit connection.
Once that is done we should have a talk about testing the various
scenarios listed in the parent ticket (#21205 (moved)).
> We need a way to measure, over time, split up by type of directory
> request, how much bandwidth a client uses for requests and for
> responses.
> Right now my approach is to capture the size of the inbuf in
> connection_dir_client_reached_eof and log a message with the status
> code and the purpose of the request (DIR_PURPOSE_*).
This is a plausible place to look at the response size. It doesn't, however, cover request size; some requests are significant in length.
I think all of the requests should go out through directory_send_command, so that would be the place to instrument for those.
> We should include both directory lookups that are successful and those
> that are not.
> For the case where we interact with a directory and the status code is
> erroneous, we log that using the above approach. For the case where we
> are simply unable to reach a given directory, we should log an error
> reason as well as how much data had been fetched before the error
> was encountered. This should be possible in
> connection_dir_request_failed.
> - We should, if possible, measure this with and without circuit
>   overhead. But not if it's too hard.
> - We should, if possible, count failed circuits that are opened only
>   for directory requests. But not if it's too hard.
> I haven't looked into the circuit/channel code yet. I'll do that once
> the first two bullets are solved, but hopefully this shouldn't be too
> difficult.
I hope so too. The kinds of overhead here are:
- TLS record headers
- Cells used to create and tear down circuits and streams
- Circuit headers and relay cell headers
- Padding bytes in relay data cells that aren't full
For some of those it might be "good enough" to estimate the overhead.
> How should we go about this measuring code in general:
> I don't think the log message approach is suitable to enter the codebase.
Doing that at level "info" or "debug" would be okay for debugging purposes, but it doesn't seem to me that this is the best approach for us to actually use for statistics reporting.
> Elsewhere in the code we collect metrics and save them to a set of
> "stats" files. Should we use the same approach for this measurement, so
> that it can be reused at a later point in time? Someone might, say,
> look into using an entirely different compression algorithm than what
> zlib provides, or a similar task that requires knowledge of the
> bandwidth currently used by directory requests/responses.
Currently, I think we (mostly?) use that pattern only when the statistics are ones that we're recording persistently across runs and want to include in router descriptors. I wouldn't mind extending it if it makes sense in this case, though.
> One of the requests I have for my "onboarding period" is to get a lot
> of feedback on timeframes: I currently have Tor running locally that
> emits information on failed directory requests, as well as bytes read
> and written per request. If we go for using an additional stats file,
> I should be able to have some code ready for review by noon Monday.
> Once that code is in, I'll continue with support for measurements that
> include the overhead of the circuit connection.
I've added some enhancements to the log statements in the directory code, which should give us some, but not all, of the information we need.
I'd like these 3 patches to be reviewed. They are located in the bugs/21206 branch on https://gitlab.com/ahf/tor. The direct link to the commits is: https://gitlab.com/ahf/tor/commits/bugs/21206
I'll proceed with looking into the different kinds of overhead there might be.
I've updated my branch to keep track of the number of sent and received RELAY_DATA cells. For each directory connection this gives us:
- The number of bytes sent/received in the payloads.
- The number of cells containing data that have been passed over the directory connection.
We should therefore be able to estimate:
- Relay cell headers.
- Padding bytes.
I'm unsure how we should measure/estimate the TLS record sizes and the circuit headers/cells needed to setup the directory connection. I'll have to dig into that to be sure.
I've left the two additional unsigned data_cells_received and unsigned data_cells_sent members on dir_connection_t unguarded, since their lifetime is relatively short. If we only want them in very specific builds, I'll wrap them in a guard.
I should also say that I'd like to use this opportunity to learn how many consensus fetches a client in each of these scenarios does per day. It is supposed to be a new fetch every 2-4 hours, meaning on average 8 fetches per day. But in the past (I'll dig up the ticket # if I can find it) I've noticed that we seem to fetch more often than that. Here we are with some instrumentation infrastructure for learning the answer!