We need a way to measure, over time, split up by type of directory
request, how much bandwidth a client uses for requests and for
responses.
Right now my approach is to capture the size of the inbuf in
connection_dir_client_reached_eof and log a message with the status
code and the purpose of the request (DIR_PURPOSE_*).
We should include both directory lookups that are successful and those
that are not.
For the case where we interact with a directory and the status code is
erroneous, we log that using the above approach. For the case where we
are simply unable to reach a given directory, we should log an error
reason as well as how much data had been fetched before the error
was encountered. This should be possible in
connection_dir_request_failed.
- We should, if possible, measure this with and without circuit
  overhead. But not if it's too hard.
- We should, if possible, count failed circuits that are opened only
  for directory requests. But not if it's too hard.
I haven't looked into the circuit/channel code yet. I'll do that once
the first two bullets are solved, but hopefully this shouldn't be too
difficult.
How should we go about this measuring code in general:
I don't think the log message approach is suitable to enter the codebase.
Elsewhere in the code we collect metrics and save them to a set of
"stats" files. Should we use the same approach for this measurement, so
that it can be reused at a later point in time? Someone might, say,
look into using an entirely different compression algorithm than what
zlib provides, or a similar task that requires knowledge of the
bandwidth currently used by directory requests/responses.
One of the requests I have for my "onboarding period" is to get a lot
of feedback on timeframes: I currently have Tor running locally that
emits information on failed directory requests, as well as bytes read
and written per request. If we go for using an additional stats file,
I should be able to have some code ready for review by noon Monday.
Once that code is in, I'll continue with support for measurements that
include the overhead of the circuit connection.
Once that is done we should have a talk about testing the various
scenarios listed in the parent ticket (#21205 (moved)).
> We need a way to measure, over time, split up by type of directory
> request, how much bandwidth a client uses for requests and for
> responses.
> Right now my approach is to capture the size of the inbuf in
> connection_dir_client_reached_eof and log a message with the status
> code and the purpose of the request (DIR_PURPOSE_*).
This is a plausible place to look at the response size. It doesn't, however, cover request size; some requests are significant in length.
I think all of the requests should go out through directory_send_command, so that would be the place to instrument for those.
> We should include both directory lookups that are successful and those
> that are not.
> For the case where we interact with a directory and the status code is
> erroneous, we log that using the above approach. For the case where we
> are simply unable to reach a given directory, we should log an error
> reason as well as how much data had been fetched before the error
> was encountered. This should be possible in
> connection_dir_request_failed.
> - We should, if possible, measure this with and without circuit
>   overhead. But not if it's too hard.
> - We should, if possible, count failed circuits that are opened only
>   for directory requests. But not if it's too hard.
> I haven't looked into the circuit/channel code yet. I'll do that once
> the first two bullets are solved, but hopefully this shouldn't be too
> difficult.
I hope so too. The kinds of overhead here are:
- TLS record headers
- Cells used to create and tear down circuits and streams
- Circuit headers and relay cell headers
- Padding bytes in relay data cells that aren't full
For some of those it might be "good enough" to estimate the overhead.
> How should we go about this measuring code in general:
> I don't think the log message approach is suitable to enter the codebase.
Doing that at level "info" or "debug" would be okay for debugging purposes, but it doesn't seem to me that this is the best approach for us to actually use for statistics reporting.
> Elsewhere in the code we collect metrics and save them to a set of
> "stats" files. Should we use the same approach for this measurement, so
> that it can be reused at a later point in time? Someone might, say,
> look into using an entirely different compression algorithm than what
> zlib provides, or a similar task that requires knowledge of the
> bandwidth currently used by directory requests/responses.
Currently, I think we (mostly?) use that pattern only when the statistics are ones that we're recording persistently across runs and want to include in router descriptors. I wouldn't mind extending it if it makes sense in this case, though.
> One of the requests I have for my "onboarding period" is to get a lot
> of feedback on timeframes: I currently have Tor running locally that
> emits information on failed directory requests, as well as bytes read
> and written per request. If we go for using an additional stats file,
> I should be able to have some code ready for review by noon Monday.
> Once that code is in, I'll continue with support for measurements that
> include the overhead of the circuit connection.
I've added some enhancements to the log statements in the directory code, which should give us some, but not all, of the information we need.
I'd like these 3 patches to be reviewed. They are located in the bugs/21206 branch on https://gitlab.com/ahf/tor. The direct link to the commits is: https://gitlab.com/ahf/tor/commits/bugs/21206
I'll proceed with looking into the different kinds of overhead there might be.
I've updated my branch to keep track of the number of sent and received RELAY_DATA cells. For each directory connection this gives us:
- The number of bytes sent/received in the payloads.
- The number of cells containing data that have been passed over the directory connection.
We should therefore be able to estimate:
- Relay cell headers.
- Padding bytes.
I'm unsure how we should measure/estimate the TLS record sizes and the circuit headers/cells needed to setup the directory connection. I'll have to dig into that to be sure.
I've left the two additional unsigned data_cells_received and unsigned data_cells_sent members on dir_connection_t unguarded, since their lifetime is relatively short. If we only want them in very specific builds, I'll wrap them in a guard.
I should also say that I'd like to use this opportunity to learn how many consensus fetches a client in each of these scenarios does per day. It is supposed to be a new fetch every 2-4 hours, meaning on average 8 fetches per day. But in the past (I'll dig up the ticket # if I can find it) I've noticed that we seem to fetch more often than that. Here we are with some instrumentation infrastructure for learning the answer!