Opened 3 years ago

Last modified 2 months ago

#21206 assigned defect

Measure client up/down bandwidth for directory requests, split by type

Reported by: nickm
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: sponsor4, 034-triage-20180328, 034-removed-20180328
Cc: ahf
Actual Points:
Parent ID: #21205
Points: 3
Reviewer: nickm
Sponsor: Sponsor4

Description

See parent ticket for context.

We need a way to measure, over time, split up by type of directory request, how much bandwidth a client uses for requests and for responses.

We should include both directory lookups that are successful and those that are not.

We should, if possible, measure this with and without circuit overhead. But not if it's too hard.

We should, if possible, count failed circuits that are opened only for directory requests. But not if it's too hard.

Change History (23)

comment:1 Changed 3 years ago by ahf

Owner: set to ahf
Status: new → assigned

comment:2 Changed 3 years ago by ahf

> We need a way to measure, over time, split up by type of directory
> request, how much bandwidth a client uses for requests and for
> responses.

Right now my approach is to catch the size of the inbuf in
connection_dir_client_reached_eof and log a message with the status
code and the purpose of the request (DIR_PURPOSE_*).

> We should include both directory lookups that are successful and those
> that are not.

For the case where we interact with a directory and the status code
indicates an error, we log that using the above approach. For the case
where we are simply unable to reach a given directory, we should log an
error reason as well as how much data had been fetched before the error
was encountered. This should be possible in
connection_dir_request_failed.

> We should, if possible, measure this with and without circuit
> overhead. But not if it's too hard.
>
> We should, if possible, count failed circuits that are opened only for
> directory requests. But not if it's too hard.

I haven't looked into the circuit/channel code yet. I'll do that once
the first two points are solved, but hopefully this shouldn't be too
difficult.

How should we go about this measurement code in general?

  • I don't think the log message approach is suitable for entering the codebase.
  • Elsewhere in the code we collect metrics and save them to a set of "stats" files. Should we use the same approach for this measurement, so that it can be reused later? Someone might, say, look into using an entirely different compression algorithm than what zlib provides, or take on a similar task that requires knowing how much bandwidth directory requests and responses currently use.
  • One of the requests I have for my "onboarding period" is to get a lot of feedback on timeframes: I currently have tor running locally that emits information on failed directory requests as well as bytes read and written for a request. If we go for an additional stats file, I should be able to have some code working by noon Monday that could be reviewed. Once that code is in, I'll continue with support for measurements that include the overhead of the circuit connection.


Once that is done we should have a talk about testing the various
scenarios listed in the parent ticket (#21205).

comment:3 in reply to:  2 Changed 3 years ago by nickm

Replying to ahf:

> > We need a way to measure, over time, split up by type of directory
> > request, how much bandwidth a client uses for requests and for
> > responses.
>
> Right now my approach is to catch the size of the inbuf in
> connection_dir_client_reached_eof and log a message with the status
> code and the purpose of the request (DIR_PURPOSE_*).

This is a plausible place to look at the response size. It doesn't, however, cover request size -- some requests are significant in length.

I think all of the requests should go out through directory_send_command, so that would be the place to instrument for those.
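
A minimal sketch of how the two instrumentation points discussed above could feed per-purpose counters. Everything prefixed with ex_ is hypothetical and only stands in for Tor's real types and DIR_PURPOSE_* values; the comments note where each helper might be called, assuming requests go out through directory_send_command and responses are accounted for in connection_dir_client_reached_eof or connection_dir_request_failed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical purpose codes standing in for Tor's DIR_PURPOSE_* values. */
typedef enum {
  EX_PURPOSE_FETCH_CONSENSUS = 0,
  EX_PURPOSE_FETCH_SERVERDESC,
  EX_PURPOSE_FETCH_MICRODESC,
  EX_PURPOSE_MAX_
} ex_dir_purpose_t;

/* Hypothetical per-purpose counters; not a structure from the Tor source. */
typedef struct {
  uint64_t n_requests;     /* requests sent */
  uint64_t n_failed;       /* requests that failed or got an error status */
  uint64_t bytes_sent;     /* request bytes written */
  uint64_t bytes_received; /* response bytes read, including partial reads */
} ex_dir_stats_t;

static ex_dir_stats_t ex_stats[EX_PURPOSE_MAX_];

/* Assumed call site: where the request is written out
 * (directory_send_command in the discussion above). */
void
ex_note_request_sent(ex_dir_purpose_t purpose, size_t request_len)
{
  assert(purpose < EX_PURPOSE_MAX_);
  ex_stats[purpose].n_requests++;
  ex_stats[purpose].bytes_sent += request_len;
}

/* Assumed call sites: connection_dir_client_reached_eof for completed
 * responses, connection_dir_request_failed for unreachable directories.
 * bytes_read is whatever had been fetched before the outcome was known. */
void
ex_note_response(ex_dir_purpose_t purpose, size_t bytes_read, int failed)
{
  assert(purpose < EX_PURPOSE_MAX_);
  ex_stats[purpose].bytes_received += bytes_read;
  if (failed)
    ex_stats[purpose].n_failed++;
}
```

Keeping one counter set per purpose means the split by request type falls out of the table index, and failed lookups still contribute whatever bytes were read before the failure.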

> > We should include both directory lookups that are successful and those
> > that are not.
>
> For the case where we interact with a directory and the status code
> indicates an error, we log that using the above approach. For the case
> where we are simply unable to reach a given directory, we should log an
> error reason as well as how much data had been fetched before the error
> was encountered. This should be possible in
> connection_dir_request_failed.

> > We should, if possible, measure this with and without circuit
> > overhead. But not if it's too hard.
> >
> > We should, if possible, count failed circuits that are opened only for
> > directory requests. But not if it's too hard.
>
> I haven't looked into the circuit/channel code yet. I'll do that once
> the first two points are solved, but hopefully this shouldn't be too
> difficult.

I hope so too. The kinds of overhead here are:

  • TLS record headers
  • Cells used to create and tear down circuits and streams
  • Circuit headers and relay cell headers.
  • Padding bytes in relay data cells that aren't full.

For some of those it might be "good enough" to estimate the overhead.

> How should we go about this measurement code in general?
>
>   • I don't think the log message approach is suitable for entering the codebase.

Doing that at level "info" or "debug" would be okay for debugging purposes, but it doesn't seem to me that this is the best approach for us to actually use for statistics reporting.

>   • Elsewhere in the code we collect metrics and save them to a set of "stats" files. Should we use the same approach for this measurement, so that it can be reused later? Someone might, say, look into using an entirely different compression algorithm than what zlib provides, or take on a similar task that requires knowing how much bandwidth directory requests and responses currently use.

Currently, I think we (mostly?) use that pattern only when the statistics are ones that we're recording persistently across runs and want to include in router descriptors. I wouldn't mind extending it, though, if it makes sense in this case.

>   • One of the requests I have for my "onboarding period" is to get a lot of feedback on timeframes: I currently have tor running locally that emits information on failed directory requests as well as bytes read and written for a request. If we go for an additional stats file, I should be able to have some code working by noon Monday that could be reviewed. Once that code is in, I'll continue with support for measurements that include the overhead of the circuit connection.


sounds good!

comment:4 Changed 3 years ago by ahf

Status: assigned → needs_review

I've added some enhancements to the log statements in the directory code, which should give us some, but not all, of the information we need.

I'd like these 3 patches to be reviewed. They are located in the bugs/21206 branch on https://gitlab.com/ahf/tor. The direct link to the commits is: https://gitlab.com/ahf/tor/commits/bugs/21206

I'll proceed with looking into the different kinds of overhead there might be.

comment:5 Changed 3 years ago by nickm

Status: needs_review → merge_ready

This all looks good to me.

comment:6 Changed 3 years ago by ahf

Status: merge_ready → needs_revision

Landed some changes to my branch that haven't been reviewed yet. Changing to 'needs revision' before asking for another review.

comment:7 Changed 3 years ago by ahf

Status: needs_revision → needs_review

I've updated my branch to keep track of the number of RELAY_DATA cells sent and received. This should give us the following numbers, which feed into the overheads mentioned above:

  • The number of bytes sent/received in the payloads.
  • The number of cells containing data that have been passed over the directory connection.

We should therefore be able to estimate:

  • Relay cell headers.
  • Padding bytes.
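
The estimate described above can be sketched as follows. This is an illustration under stated assumptions, not code from the branch: the ex_ names are hypothetical, and the sizes (a 509-byte cell payload with an 11-byte relay header, leaving at most 498 data bytes per RELAY_DATA cell) are assumed from the Tor protocol specification rather than taken from this ticket.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, from the Tor protocol specification. */
#define EX_CELL_PAYLOAD_SIZE 509
#define EX_RELAY_HEADER_SIZE 11
#define EX_RELAY_DATA_MAX (EX_CELL_PAYLOAD_SIZE - EX_RELAY_HEADER_SIZE)

/* Hypothetical result type for the estimate. */
typedef struct {
  uint64_t header_bytes;  /* relay cell headers across all data cells */
  uint64_t padding_bytes; /* unused data space in cells that weren't full */
} ex_cell_overhead_t;

/* Estimate relay-header and padding overhead from the two counters:
 * the number of RELAY_DATA cells and the payload bytes they carried. */
ex_cell_overhead_t
ex_estimate_cell_overhead(uint64_t data_cells, uint64_t payload_bytes)
{
  ex_cell_overhead_t out;
  /* The cells cannot have carried more than their total capacity. */
  assert(payload_bytes <= data_cells * EX_RELAY_DATA_MAX);
  out.header_bytes = data_cells * EX_RELAY_HEADER_SIZE;
  out.padding_bytes = data_cells * EX_RELAY_DATA_MAX - payload_bytes;
  return out;
}
```

For example, 1000 payload bytes carried in 3 cells would give 33 header bytes and 494 padding bytes. The sketch deliberately leaves out the circuit ID and command bytes of each cell, the cells used to build and tear down circuits and streams, and TLS record headers.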

I'm unsure how we should measure/estimate the TLS record sizes and the circuit headers/cells needed to set up the directory connection. I'll have to dig into that to be sure.

I've left the two additional members on dir_connection_t, unsigned data_cells_received and unsigned data_cells_sent, unguarded, since their lifetime is relatively short. If we only want them around in very specific builds, I'll wrap them in a guard.

comment:8 Changed 3 years ago by nickm

Keywords: review-group-16 added

comment:9 Changed 2 years ago by nickm

Reviewer: nickm
Sponsor: Sponsor4

comment:10 Changed 2 years ago by nickm

Status: needs_review → new

lgtm; merging to master for 0.3.1. Please close if done?

comment:11 Changed 2 years ago by nickm

Keywords: review-group-16 removed

comment:12 Changed 2 years ago by nickm

Keywords: TorCoreTeam201702 added

comment:13 Changed 2 years ago by arma

ahf or others: anything more to be done here?

comment:14 Changed 2 years ago by arma

I guess I should say I'd like this opportunity to learn how many consensus fetches a client in each of these scenarios does per day. It is supposed to be a new fetch every 2-4 hours, meaning on average 8 fetches per day. But in the past (I'll dig up the ticket # if I can find it) I've noticed that we seem to fetch more often than that. Here we are with some instrumentation infrastructure for learning the answer!
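
The arithmetic behind the expected figure, and a way to compare it against logged fetch times, can be sketched as below. The function names are hypothetical, and the expected rate assumes the interval is spread evenly across the 2-4 hour window (mean 3 hours), which is what the "8 fetches per day" figure implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expected fetches per day if each interval is drawn evenly from
 * [min_hours, max_hours]: 24 divided by the mean interval. */
double
ex_expected_fetches_per_day(double min_hours, double max_hours)
{
  assert(min_hours > 0 && max_hours >= min_hours);
  return 24.0 / ((min_hours + max_hours) / 2.0);
}

/* Observed rate from n fetch timestamps (seconds, ascending):
 * n-1 intervals spread over the time between first and last fetch. */
double
ex_observed_fetches_per_day(const int64_t *ts, size_t n)
{
  assert(n >= 2 && ts[n - 1] > ts[0]);
  return (double)(n - 1) * 86400.0 / (double)(ts[n - 1] - ts[0]);
}
```

An observed rate well above the expected one would support the suspicion that clients fetch more often than intended.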

comment:15 Changed 2 years ago by nickm

Remove Sponsor4 keyword, now that Sponsor4 is the value of the Sponsor field.

comment:16 Changed 2 years ago by nickm

Keywords: sponsor4 TorCoreTeam201702 → sponsor4, TorCoreTeam201702

comment:17 Changed 2 years ago by nickm

Keywords: TorCoreTeam201702 removed

comment:18 Changed 2 years ago by nickm

Status: new → assigned

ahf, I'm assigning you these compression tickets, but please let me know if you'd like me to take any of them on. cheers!

comment:19 Changed 18 months ago by teor

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.4.x-final

These feature and bugfix tickets have no patches. The earliest they will get done is 0.3.4.

comment:20 Changed 17 months ago by nickm

Keywords: 034-triage-20180328 added

comment:21 Changed 17 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:22 Changed 17 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:23 Changed 2 months ago by gaba

Cc: ahf added
Owner: ahf deleted

Liberating some of the tickets that ahf had.
