Opened 6 years ago

Last modified 21 months ago

#8786 needs_revision enhancement

Add extra-info line that tracks the number of consensus downloads over each pluggable transport

Reported by: asn Owned by:
Priority: Low Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: pt, tor-bridge, metrics privcount-maybe needs-design
Cc: Actual Points:
Parent ID: Points: 3
Reviewer: Sponsor: SponsorQ-can

Description

In #5040, Karsten suggested to add yet another line for measuring obfsbridge stats.

He wants a dirreq-v3-transport line with the exact same format as bridge-ip-transports, that counts consensus fetches instead of direct connections. This will improve the granularity of bridge statistics, and it will help us count users accurately in scenarios like flashproxy (where each client is actually a flashproxy bridge).

This means that we should be considering the GEOIP_CLIENT_NETWORKSTATUS_V2 and GEOIP_CLIENT_NETWORKSTATUS events in this case, instead of GEOIP_CLIENT_CONNECT.

Child Tickets

Change History (41)

comment:1 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final

comment:2 Changed 5 years ago by nickm

Keywords: 026-triaged-1 026-deferrable added
Priority: normalminor
Type: defectenhancement

comment:3 Changed 5 years ago by dcf

See here for a case where this affected measurement of two transports running on the same bridge:

comment:4 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: 0.2.7.x-final

comment:5 Changed 5 years ago by nickm

Status: newassigned

comment:6 Changed 5 years ago by nickm

Keywords: 027-triaged-1-out added

Marking triaged-out items from first round of 0.2.7 triage.

comment:7 Changed 5 years ago by nickm

Milestone: Tor: 0.2.7.x-finalTor: 0.2.???

Make all non-needs_review, non-needs_revision, 027-triaged-1-out items belong to 0.2.???

comment:8 Changed 4 years ago by karsten

Severity: Normal
Status: assignedneeds_review

I implemented this feature (and the tightly related feature to track the number of consensus downloads by IP version) in branch task-8786 in my public repository. It's yet untested except for unit tests (and we should only test it with PublishServerDescriptor 0 in the public network).

Review much appreciated. I'll happily review another branch (of comparable size) in exchange.

comment:9 Changed 4 years ago by nickm

Milestone: Tor: 0.2.???Tor: 0.2.8.x-final

If the code's done, it can get reviewed for 0.2.8

comment:10 Changed 4 years ago by dcf

I don't know the code, but I read through the patch.

The patch uses the "%d" format specifier for the output of round_to_next_multiple_of in geoip_get_requests_by_transport_history, but uses the "%u" format specifier in geoip_get_requests_by_version_history.

It surprised me that geoip_get_requests_by_version_history returns NULL when both the v4 and v6 counts are 0, instead of returning "v4=0,v6=0". But I guess it matches what geoip_get_requests_by_transport_history does when all counts are 0.

comment:11 in reply to:  10 Changed 4 years ago by karsten

Replying to dcf:

I don't know the code, but I read through the patch.

Thanks! Much appreciated! (Please let me know if you have a branch for me to review, and I'll take a look.)

The patch uses the "%d" format specifier for the output of round_to_next_multiple_of in geoip_get_requests_by_transport_history, but uses the "%u" format specifier in geoip_get_requests_by_version_history.

Agreed. Changed to "%u" in both cases.

It surprised me that geoip_get_requests_by_version_history returns NULL when both the v4 and v6 counts are 0, instead of returning "v4=0,v6=0". But I guess it matches what geoip_get_requests_by_transport_history does when all counts are 0.

Right, I think returning a count of 0 has little value. But I went even further and changed "v4=8,v6=0" to just "v4=8", because that "v6=0" doesn't provide much value, either.

Changes pushed as separate commit to the same branch. Thanks again!

comment:12 Changed 4 years ago by asn

Hello. The patch looks reasonable. I see it hasn't got any real world testing yet. Would you like me to run this on my bridge for 2-3 days and see what kind of results we get?

comment:13 Changed 4 years ago by karsten

Status: needs_reviewneeds_revision

Quick update: asn was so kind to run this branch on LeifEricson, one of the bridges shipped with Tor Browser. After 24 hours that bridge reported the following lines:

dirreq-v3-transport <OR>=28872
dirreq-v3-version v4=28872
[...]
bridge-ip-versions v4=31040,v6=0
bridge-ip-transports <OR>=8,obfs3=22072,obfs4=8280,scramblesuit=688

Obviously, all directory requests are being attributed to <OR> connections and coming in via IPv4. While the latter is at least plausible, the former is clearly wrong. This is caused by passing NULL as third parameter to geoip_note_client_seen in the following part of directory.c:

    if (1) {
      struct in_addr in;
      tor_addr_t addr;
      if (tor_inet_aton((TO_CONN(conn))->address, &in)) {
        tor_addr_from_ipv4h(&addr, ntohl(in.s_addr));
        geoip_note_client_seen(GEOIP_CLIENT_NETWORKSTATUS,
                               &addr, NULL,
                               time(NULL));

I'm not sure yet how we can learn the transport name at this point. IIRC, conn is a dir_connection_t, but in case of tunneled directory requests there are also an edge_connection_t and an or_connection_t involved. And in order to learn the transport, we'll have to check the channel_t to invoke its get_transport_name function.

Here's a wild idea: we use conn->dirreq_id to find the circuit_t in circuit_get_globel_list() with matching dirreq_id, and then we can learn the transport via TO_OR_CIRCUIT(circ)->p_chan. Totally untested, of course.

Speaking of tunneled vs. direct directory requests, there's also the possibility that a directory request is made to the DirPort, and in that case we couldn't say it's using transport <OR>. Maybe we'll have to introduce a new transport <DIR> to count those directory requests in dirreq-v3-transport. I don't expect there to be many such requests, but if it's easy to be correct here, let's try that.

Regarding dirreq-v3-version, when looking at the code above it's unclear whether the distinction between IPv4 and IPv6 actually works. Without testing, I'd say that we should try passing (TO_CONN(conn))->address to geoip_note_client_seen rather than messing with it first.

All the above is a mix of note-to-self and request-for-help. If something of the above seems obviously wrong or right to somebody here, please do let me know. Otherwise, I'll hack on this in a few days.

comment:14 Changed 4 years ago by karsten

Status: needs_revisionneeds_review

There, I implemented my idea above and gave it some testing using Chutney. Not sure if it's the most elegant approach, but I think it works. See the updated branch here.

asn, mind reviewing the change and giving this branch another chance on your bridge, ideally with PublishServerDescriptor 0 being set and you and I meeting in #tor-dev 24.000000000 hours later?

comment:15 in reply to:  14 Changed 4 years ago by asn

Cc: athena added

Replying to karsten:

There, I implemented my idea above and gave it some testing using Chutney. Not sure if it's the most elegant approach, but I think it works. See the updated branch here.

asn, mind reviewing the change and giving this branch another chance on your bridge, ideally with PublishServerDescriptor 0 being set and you and I meeting in #tor-dev 24.000000000 hours later?

Patch looks like it could work. I'm now running it on my bridge.

Ideally we would want to functionify all the extra code and put it on a separate function, to not pollute directory_handle_command_get() further.

Also, I'm still wondering if there is an O(1) way to get from dir_connection_t to channel_t by following links (maybe through connection_t). Because the current approach is O(n) where n is the number of open circuits. Not sure if this is premature optimization or not. Nick or Andrea would know.

Will get back to you tomorrow about the results.

comment:16 Changed 4 years ago by nickm

Not premature optimization, but I do have a plan to make our lookups of this kind much faster without so many of the silly tricks we've done in the past.

comment:17 Changed 4 years ago by nickm

Owner: set to karsten
Status: needs_reviewassigned

setting owner

comment:18 Changed 4 years ago by nickm

Owner: set to karsten
Status: needs_reviewassigned

setting owner

comment:19 Changed 4 years ago by nickm

Status: assignedneeds_review

setting owner

comment:20 Changed 4 years ago by nickm

Quick notes:

  • The new block in directory_handle_command_get should be its own function; directory_handle_command_get is big enough as it is.
  • I think v3_reqs_by_version should be uint64_t, for peace of mind
  • What is the type of the target for v3_reqs_by_transport ? uint32_t? Something crammed into a uintptr_t? Something else? The documentation should say.

Otherwise looks fine. I'm wondering whether this should wait for 0.2.9 though so that I can merge an appropriate fast lookup function for you.

comment:21 Changed 4 years ago by nickm

Status: needs_reviewneeds_revision

comment:22 Changed 4 years ago by karsten

nickm, thanks a lot for the review! I worked on those suggestion and pushed them to the same branch. But let me start with another fix that I started working on before reading your review:

tl;wr: My branch (commit cda77b4) is broken, though only in a way that it logs something inaccurate, not in a harmful way. I believe that neither the bridges' configurations nor the extended OR port implementation are to blame. I wrote a fix for this.

After adding extensive logging to all code having to do with dirreq_id and letting it run for hours, I found the issue with my branch (commit cda77b4). Here's what happened:

  • The client sends a BEGIN_DIR cell on its circuit to the bridge.
  • The bridge assigns dirreq_id 9779 to the circuit and the circuit's p_chan, opens an edge connection and a dir connection, and assigns the same dirreq_id 9779 to both new connections. It then sends back a cell (CONNECTED?) to the client and waits for the client to send its GET request via that circuit.
  • The client sends another BEGIN_DIR cell on the same circuit.
  • The bridge assigns dirreq_id 9780 to the circuit and its p_chan, overwriting previous value 9779. It also creates a new edge connection and a new dir connection and sets their dirreq_id to 9780, but that's irrelevant here.
  • The client sends the GET request to the dir connection with dirreq_id 9779, which is a request for a consensus.
  • The bridge attempts to find circuit with dirreq_id 9779 to find out the transport name of its p_chan, but there is no circuit with that dirreq_id anymore, because it was overwritten with 9780 above. The bridge falls back to counting this request as coming in via the <OR> transport, which is wrong. It should probably count it as <??>, but really we should fix this bug and count it as obfs3 in this case.
  • The client sends a second GET request for the bridge's server descriptor to the dir connection with dirreq_id 9780.

I'm also pasting the logs below:

Feb 22 14:34:48.463 [notice] [...]
Feb 22 14:34:48.673 [notice] XXX8786 Assigning new dirreq_id 9779 to circuit with p_circ_id 3181807694 and non-zero dirreq_id 9776 and to p_chan with global_identifier 5 and previous dirreq_id 9776.
Feb 22 14:34:48.673 [notice] XXX8786 Assigning dirreq_id 9779 to new edge connection.
Feb 22 14:34:48.674 [notice] XXX8786 Assigning dirreq_id 9779 to new dir connection.
Feb 22 14:34:48.674 [notice] XXX8786 Assigning new dirreq_id 9780 to circuit with p_circ_id 3181807694 and non-zero dirreq_id 9779 and to p_chan with global_identifier 5 and previous dirreq_id 9779.
Feb 22 14:34:48.674 [notice] XXX8786 Assigning dirreq_id 9780 to new edge connection.
Feb 22 14:34:48.674 [notice] XXX8786 Assigning dirreq_id 9780 to new dir connection.
Feb 22 14:34:48.675 [notice] XXX8786 Request for /tor/status-vote/current/consensus-microdesc/1AFF35+2C2591+E34739.
Feb 22 14:34:48.675 [notice] XXX8786 Counting directory request for /tor/status-vote/current/consensus-microdesc/1AFF35+2C2591+E34739 with dirreq_id 9779 as <OR>, because we didn't find a circuit with matching dirreq_id.
Feb 22 14:34:48.676 [notice] XXX8786 Request for /tor/server/authority.
Feb 22 14:34:53.461 [notice] [...]

The bug here is that the dirreq_id-assigning code does not take concurrent directory requests over the same circuit into account. We can expect that the stats based on dirreq_id, in particular "dirreq-v3-tunneled-dl" lines, are partially broken, too. Fortunately, we're not using those lines for anything critical. That's the bad news.

The good news is that I wrote a fix for counting requests by transport. Please see commit a743f3a in the updated branch task-8786 in my public repository. The commit message says what was fixed. Pasting here, though it overlaps with the description above:

    The previous approach to look up the transport name of a directory
    request by going through the list of circuits and finding the circuit
    with same dirreq_id was broken.  That dirreq_id may already have been
    overwritten by a newly arriving BEGIN_DIR cell before the HTTP GET
    request that we're about to include in the stats arrives and gets
    parsed.  This issue also affects "dirreq-v3-tunneled-dl" lines, but
    fixing those is independent of implementing this new feature.
    
    The new approach is to store the transport name in the locally created
    dir connection when processing the BEGIN_DIR cell.  That way it cannot
    be changed by a subsequent BEGIN_DIR cell on the same circuit.  The
    downside is a new char* in dir_connection_t, the upside is that we can
    avoid iterating over all circuits.

nickm, I also tried to address your suggestions in subsequent commit 1508787.

I would like to get this branch tested on asn's and mrphs' bridges and on an IPv6 bridge before getting it merged. It also contains a temp commit for testing that must be reverted before merging! Leaving this ticket at needs_revision until that is all done. Deferring until 0.2.9 is fine by me.

comment:23 Changed 4 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final

Thanks for all the hard work, Karsten! I think that deferring is the right choice at this point.

comment:24 Changed 3 years ago by nickm

Sponsor: SponsorS-can

comment:25 Changed 3 years ago by nickm

Points: 3

comment:26 Changed 3 years ago by nickm

Keywords: nickm-deferred-20160905 added
Milestone: Tor: 0.2.9.x-finalTor: 0.2.???

Deferring many tickets that are in needs_revision with no progress noted for a while, where I think they could safely wait till 0.3.0 or later.

Please feel free to move these back to 0.2.9 if you finish the revisions soon.

comment:27 Changed 3 years ago by teor

Milestone: Tor: 0.2.???Tor: 0.3.???

Milestone renamed

comment:28 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:29 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:30 Changed 2 years ago by nickm

Keywords: 027-triaged-in added

comment:31 Changed 2 years ago by nickm

Keywords: 027-triaged-in removed

comment:32 Changed 2 years ago by nickm

Keywords: 027-triaged-1-out removed

comment:33 Changed 2 years ago by nickm

Keywords: 026-triaged-1 removed

comment:34 Changed 2 years ago by nickm

Keywords: nickm-deferred-20160905 removed

comment:35 Changed 2 years ago by nickm

Keywords: 026-deferrable removed

comment:36 Changed 2 years ago by nickm

Cc: athena removed
Keywords: metrics privcount needs-design added; flashproxy removed
Summary: Add extra-info line that tracks the number of consensus downloads of each pluggable transportsAdd extra-info line that tracks the number of consensus downloads over each pluggable transport

comment:37 Changed 2 years ago by teor

Keywords: privcount-maybe added; privcount removed

PrivCount doesn't do bridge or PT statistics yet, nor does it do consensus downloads yet.

comment:38 Changed 2 years ago by nickm

Sponsor: SponsorS-canSponsorQ-can

comment:39 Changed 21 months ago by karsten

Owner: karsten deleted
Status: needs_revisionassigned

I don't think I'll find the time to build this in the foreseeable future. And maybe we'll be using PrivCount for these statistics by whenever I'd find the time again. I'm unassigning this ticket from myself, to have a more accurate view on what tickets I'm actually working on.

comment:40 Changed 21 months ago by karsten

Status: assignedneeds_revision

comment:41 Changed 21 months ago by teor

We have a "no new stats outside of PrivCount" guideline.
We can make a list and set priorities in a few months' time.

Note: See TracTickets for help on using tickets.