Opened 9 years ago

Closed 9 years ago

#2372 closed enhancement (implemented)

Export BridgeDB's pool assignments

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Circumvention/BridgeDB
Version:
Severity:
Keywords:
Cc: nickm
Actual Points:
Parent ID: #2537
Points:
Reviewer:
Sponsor:

Description

Following the (actually newly introduced) tradition of summarizing IRC
discussions in Trac or email, here's what Roger and I discussed today:

I'm interested in learning whether keeping a certain fraction of bridges
unassigned, that is not distributing them via email or HTTP, is a good
idea. AIUI, the idea was to have a small set of fresh bridges in case we
come up with a new distribution channel or want to give out fresh bridges
manually. This idea might fail if people who run a bridge that ends up in
the unallocated pool decide that their bridge is not being useful. They
might turn off their bridge or delete their keys in order to get a new
fingerprint and end up in another pool. If many people do so, it might be
better to allocate all bridges to pools directly and start a new pool
whenever there's a new distribution channel. Given the high churn of
bridges, we might have a sufficient set of fresh bridges in that pool very
soon. Also, if we want to give out bridges manually, we might give out
bridges from the other pools which may have higher uptime than bridges in
the unallocated pool. Allocating all bridges also means we don't have to
explain to bridge operators why their bridge is also useful even if it
doesn't have any users right now.

So, we need to export pool assignments from BridgeDB somehow. Currently,
we have log files of the following format:

Jan 10 01:41:14 [DEBUG] Leaving bridge 1.2.3.4:443 dddddddddddddddddddddddddddddddddddddddd unallocated
Jan 10 01:41:14 [DEBUG] Adding bridge 2.3.4.5:443 eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee to IP ring 1 (port-443 subring)
Jan 10 01:41:14 [DEBUG] Adding bridge 2.3.4.5:443 eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee to IP ring 1 (stable subring)
Jan 10 01:41:14 [DEBUG] Adding bridge 2.3.4.5:443 eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee to IP ring 1

If we want to analyze bridge pool assignments we need a better data format
than this log format. Here's a proposed data format for bridge pool
assignments (with sanitized IP addresses and fingerprints):

bridge-pool-assignment 2011-01-10 01:41:14
b 127.0.0.1:443 abcdef0123456789abcdef0123456789abcdef01
b 127.0.0.1:443 0123456789abcdef0123456789abcdef01234567
s IP ring 1 (port-443 subring)
s IP ring 1 (stable subring)
s IP ring 1

The timestamp in the bridge-pool-assignment line is the time when the
assignment is written to disk (twice an hour). Lines starting with b
contain IP address, port, and fingerprint of a bridge. For sanitizing
purposes, I replaced bridge IP addresses with 127.0.0.1 and bridge
identities with their SHA-1 hashes. That's the same approach that we take
for sanitizing bridge descriptors. Lines starting with s contain the
rings or subrings that a bridge is allocated to.
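A minimal sketch of the sanitizing step described above, in Python. The function name is hypothetical, and it assumes the SHA-1 hash is taken over the 20 raw identity bytes rather than over the 40-character hex string; the ticket does not spell that out.

```python
import binascii
import hashlib


def sanitize_bridge_line(line):
    """Sanitize one 'b <ip>:<port> <fingerprint>' line (hypothetical helper)."""
    keyword, address, fingerprint = line.split()
    if keyword != "b":
        raise ValueError("not a bridge line: %r" % line)
    # Keep the port, but replace the IP address with 127.0.0.1.
    port = address.rsplit(":", 1)[1]
    # Assumption: hash the decoded identity bytes, not the hex text.
    raw_identity = binascii.unhexlify(fingerprint)
    hashed = hashlib.sha1(raw_identity).hexdigest()
    return "b 127.0.0.1:%s %s" % (port, hashed)
```

Lines starting with s would pass through unchanged, since ring names carry no bridge-specific data.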

Possible questions that I'm trying to answer with these data are:

  1. Do bridges ever switch pools?
  2. Is bridge uptime affected by the pool assignment?
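The first question could be answered from a series of exported assignments roughly like this. This is only a sketch: the function name and the snapshot structure (a time-ordered list of timestamp and fingerprint-to-pool mappings) are assumptions, not part of BridgeDB.

```python
def find_pool_switches(snapshots):
    """Return (fingerprint, timestamp, old_pool, new_pool) for every change.

    snapshots is a list of (timestamp, {fingerprint: pool}) tuples,
    ordered by time. Both names are hypothetical.
    """
    switches = []
    last_pool = {}  # fingerprint -> pool seen in the most recent snapshot
    for timestamp, assignments in snapshots:
        for fingerprint, pool in sorted(assignments.items()):
            previous = last_pool.get(fingerprint)
            if previous is not None and previous != pool:
                switches.append((fingerprint, timestamp, previous, pool))
            last_pool[fingerprint] = pool
    return switches
```

A bridge that disappears for a while and comes back in another pool is also reported, because the last known pool is remembered across snapshots.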

Are there reasons not to publish the sanitized versions of these bridge
pool assignments? Are any sensitive data left that we need to remove?

Can we change the BridgeDB code to export its pool assignment in this
format (without the sanitizing which I would do on a metrics machine)?

Child Tickets

Attachments (1)

0001-Tweak-function-to-dump-all-assignments-a-bit.patch (3.8 KB) - added by karsten 9 years ago.


Change History (24)

comment:1 Changed 9 years ago by arma

I don't see an issue with the sanitization approach you describe. Again, the best plan there is probably to write up a quick summary of what exactly the transform is, and for the items in the sanitized form, why you believe they're safe and/or why you still want them. Then when that settles, publish some sample sanitized output and let people pick at it.

One issue that comes to mind that we might want to research is how often a given bridge moves IP address. The method you describe above would lose that info, yes? Whereas if we do a keyed hash of the IP address (and never disclose the key), we could distinguish "same" from "different". I remember we had the keyed hash design in some other sanitization context, but I don't remember which one -- how is the idea working out in that other context?

(It's possible that we already do the keyed hash for the regular bridge descriptors, so we would just need to match up the sha1(fingerprint) in this file with the sha1(fingerprint) in that file and we could look up the IP address. In which case maybe there's merit in doing the same keyed hash in both places, to ease the job of future researchers.)
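A sketch of the keyed-hash idea in Python. The comment does not name a construction, so HMAC-SHA256 is an assumption here, as are the function and parameter names. With a secret key that is never disclosed, equal addresses map to equal outputs, so "same" and "different" stay distinguishable without revealing the address.

```python
import hashlib
import hmac


def keyed_ip_hash(secret_key, ip_address):
    """Return a keyed hash of an IP address (hypothetical helper).

    secret_key is a bytes object that is never published.
    HMAC-SHA256 is an assumed choice; the ticket only says "keyed hash".
    """
    message = ip_address.encode("ascii")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

Using the same key for bridge descriptors and pool assignments would let researchers match the two data sets, as suggested above.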

The main question that I want to answer with this data actually is "what's the correlation between which pool the bridge is in and whether that bridge sees a lot of use from a given country". My guess is there are periods of time where the http bridges are wildly popular in China, and then periods where they are pretty much unused (e.g. because they're not reachable). I wonder how it looks for other countries.

(There are variations of this question that I also want to know the answer to, that don't require this data at all, such as "what's the correlation between the bridge's ORPort and its use in various countries".)

As for changing bridgedb to export its pool assignments in this format, that's fine by me. It will be much easier for you to pick through than having somebody grep lines from logs. Ask Andrew which python dev person you should point at.

comment:2 Changed 9 years ago by karsten

Thanks for the feedback!

I'm going to write that quick summary and send it to or-dev today. If
nobody screams at me until, say, end of this week, I'll publish some
sample output from September 2010 (the oldest data that I have).

We discussed including a (keyed) hash of a bridge's IP address in
sanitized bridge descriptors a year ago. If we decide that it's a good
idea we should also include the same (keyed) hash in the bridge pool
assignments. We can revisit this question if you want, but we might want
to do this in a separate task. I'll look up the mails and open a new
ticket later today.

comment:3 in reply to: 2 Changed 9 years ago by karsten

Replying to karsten:

I'm going to write that quick summary and send it to or-dev today. If
nobody screams at me until, say, end of this week, I'll publish some
sample output from September 2010 (the oldest data that I have).

Posted to or-dev.

We discussed including a (keyed) hash of a bridge's IP address in
sanitized bridge descriptors a year ago. If we decide that it's a good
idea we should also include the same (keyed) hash in the bridge pool
assignments. We can revisit this question if you want, but we might want
to do this in a separate task. I'll look up the mails and open a new
ticket later today.

See #2435. Let's move the discussion about hashed IP addresses there. Once we have a design we can easily adapt it to the bridge pool assignments.

comment:4 Changed 9 years ago by karsten

Cc: nickm added

Can we extend BridgeDB to write a new bridge pool assignment file using the format described above whenever it assigns bridges to rings and copy that file to our bridge sanitizing node? That node will take care of sanitizing.

Alternatively, can we periodically grep BridgeDB's logs for "Leaving bridge .* unallocated|Adding bridge .* to" and copy the output to our bridge sanitizing node? If we take this approach, the bridge sanitizing node will take care of reformatting and sanitizing the logs.
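For the second option, the reformatting on the sanitizing node could look roughly like the sketch below. It assumes the log lines have exactly the shape shown in the description; the function name is hypothetical. It does not emit the bridge-pool-assignment header, because the log timestamps carry no year.

```python
import re

# Matches the two kinds of [DEBUG] lines quoted in the description.
LOG_LINE = re.compile(
    r"\[DEBUG\] (?:"
    r"Leaving bridge (?P<u_addr>\S+) (?P<u_fp>[0-9a-f]{40}) unallocated"
    r"|Adding bridge (?P<a_addr>\S+) (?P<a_fp>[0-9a-f]{40}) to (?P<ring>.+)"
    r")$"
)


def reformat_log(lines):
    """Group ring memberships per bridge and return 'b' and 's' lines."""
    # fingerprint -> (address, [rings]); insertion order is kept (Python 3.7+).
    bridges = {}
    for line in lines:
        match = LOG_LINE.search(line.rstrip("\n"))
        if match is None:
            continue  # unrelated log line
        if match.group("u_fp"):
            bridges.setdefault(match.group("u_fp"), (match.group("u_addr"), []))
        else:
            entry = bridges.setdefault(match.group("a_fp"), (match.group("a_addr"), []))
            entry[1].append(match.group("ring"))
    out = []
    for fingerprint, (address, rings) in bridges.items():
        out.append("b %s %s" % (address, fingerprint))
        out.extend("s %s" % ring for ring in rings)
    return out
```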

comment:5 Changed 9 years ago by nickm

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

comment:6 Changed 9 years ago by nickm

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.
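A rough sketch of what such a dump function could look like. The class below is a hypothetical stand-in, not BridgeDB's real BridgeHolder, and it sorts fingerprints alphabetically where real code would walk ring positions. It writes one line per (distributor, subring, bridge), so a bridge in several subrings appears several times and would need the postprocessing mentioned in comment 5.

```python
import io


class SketchDistributor(object):
    """Hypothetical holder: maps subring names to bridge fingerprints."""

    def __init__(self, name):
        self.name = name
        self.subrings = {}  # subring name -> list of fingerprints

    def insert(self, subring, fingerprint):
        self.subrings.setdefault(subring, []).append(fingerprint)

    def dump_assignments(self, out):
        """Write all assignments to an open file-like object."""
        for subring in sorted(self.subrings):
            # Assumption: alphabetical order stands in for ring position.
            for fingerprint in sorted(self.subrings[subring]):
                out.write("%s %s %s\n" % (self.name, subring, fingerprint))


holder = SketchDistributor("https")
holder.insert("stable", "e" * 40)
holder.insert("port-443", "e" * 40)
buf = io.StringIO()
holder.dump_assignments(buf)
```

Unallocated bridges and database-only distributors are not covered by this sketch; they would need a separate code path.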

comment:7 in reply to: 6 Changed 9 years ago by karsten

Replying to nickm:

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

I don't understand the second half of your last sentence. I noticed that bridges appear in multiple places in the logs, because they occur multiple times in the bridge-descriptors file, right? But shouldn't the same bridge be allocated to the very same ring and subrings when parsing its descriptors, because the allocation is based on the bridge identity and the networkstatus-bridges file?

The suggested log format was derived from the log file I had. We can change the format if that makes things easier. How about we take a) the subrings including IP ring X, stable, or port-443 subring, and b) the bridge IP address out of the new log format? I don't know what to do with the information that a bridge is contained in IP ring X, and all flags, ports, and IP addresses are contained in the bridge descriptors that we already have. The information I'm hoping to learn from the new log format is whether a bridge is allocated to the email or web pool or not allocated at all. How about this new format:

bridge-pool-assignment 2011-01-10 01:41:14
abcdef0123456789abcdef0123456789abcdef01 unallocated
0123456789abcdef0123456789abcdef01234567 web
4567890987654321234567890abcdefedcbabcde email

Replying to nickm:

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.

I don't know much about Python, but I can try to work on a BridgeDB patch. I have a working BridgeDB installed on a local machine here. Can you give me some more guidance how to implement the format described in this comment?

I also don't quite understand the last sentence of your second comment. The logs I had (see original task description above) contained lines for unallocated bridges. What distributors are there that only exist in the database, and why do we have to treat them specially?

Thanks!

comment:8 Changed 9 years ago by karsten

Parent ID: #2537

comment:9 in reply to: 7 Changed 9 years ago by nickm

Replying to karsten:

Replying to nickm:

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

I don't understand the second half of your last sentence. I noticed that bridges appear in multiple places in the logs, because they occur multiple times in the bridge-descriptors file, right? But shouldn't the same bridge be allocated to the very same ring and subrings when parsing its descriptors, because the allocation is based on the bridge identity and the networkstatus-bridges file?

Yeah. My point was that there's nothing in principle that keeps a bridge from being assigned to more than one place at a time. I don't think there's anything in bridgedb that does that currently though.

The suggested log format was derived from the log file I had. We can change the format if that makes things easier. How about we take a) the subrings including IP ring X, stable, or port-443 subring, and b) the bridge IP address out of the new log format? I don't know what to do with the information that a bridge is contained in IP ring X, and all flags, ports, and IP addresses are contained in the bridge descriptors that we already have. The information I'm hoping to learn from the new log format is whether a bridge is allocated to the email or web pool or not allocated at all. How about this new format:

bridge-pool-assignment 2011-01-10 01:41:14
abcdef0123456789abcdef0123456789abcdef01 unallocated
0123456789abcdef0123456789abcdef01234567 web
4567890987654321234567890abcdefedcbabcde email

Probably also doable. Let me see what I can do. I'm thinking of including the extra information that you say you don't know what to do with, just in case later there's a use for it.

Replying to nickm:

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.

I don't know much about Python, but I can try to work on a BridgeDB patch. I have a working BridgeDB installed on a local machine here. Can you give me some more guidance how to implement the format described in this comment?

I hacked up some totally untested code in branch "dump" in my public bridgedb repo. It doesn't handle unallocated bridges yet, and nothing calls it yet, but it should be a good starting point for a patch. Let me know if you have any questions about it.

I also don't quite understand the last sentence of your second comment. The logs I had (see original task description above) contained lines for unallocated bridges. What distributors are there that only exist in the database, and why do we have to treat them specially?

Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

comment:10 Changed 9 years ago by karsten

I tweaked your branch a bit. Please see the attached patch (I don't have a public repository yet). The new format is:

$fingerprint email[ port=$port][ flag=$flag]
$fingerprint https ring=$ring[ port=$port][ flag=$flag]
$fingerprint unallocated

I think the code that calls this needs to move somewhere else, and we should make this configurable. Right now, everything is appended to a file assignments.log.

I'm also unsure what to do with the bucket thing. kaner, do you have an idea how we could include the information to which bucket an unallocated bridge was assigned? Maybe we can write something like $fingerprint unallocated bucket=$bucketname for them?

comment:11 in reply to: 9 Changed 9 years ago by kaner


Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

They exist in the database. Is that not enough?

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

comment:12 in reply to: 11 Changed 9 years ago by karsten

Replying to kaner:

Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

They exist in the database. Is that not enough?

From reading the code it seems that bucket assignments are made persistent in the database. But I didn't play with BridgeDB enough to be sure.

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

Note that sanitizing the output is not an issue here. This can (and should) happen on the same machine that sanitizes bridge descriptors, so that we have a single place to change the sanitization process, if necessary.

comment:13 Changed 9 years ago by karsten

I fixed a bug in my patch above and pushed a new version to branch dump in my shiny new public repository.

comment:14 Changed 9 years ago by nickm

Status: new → needs_review

Looks okay to me. Shall I merge?

comment:15 in reply to: 12 Changed 9 years ago by kaner

Replying to karsten:

They exist in the database. Is that not enough?

From reading the code it seems that bucket assignments are made persistent in the database. But I didn't play with BridgeDB enough to be sure.

Yes, they are.

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

It writes all pool assignments to file. After running it, you have, for instance: (actual numbers changed)

$ wc -l *.brdgs
9000 email-2011-02-13.brdgs
9000 https-2011-02-13.brdgs
400 twitter-2011-02-13.brdgs
1000 unallocated-2011-02-13.brdgs

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

comment:16 in reply to: 15 Changed 9 years ago by karsten

Owner: set to karsten
Status: needs_review → assigned

Replying to nickm:

Looks okay to me. Shall I merge?

Not yet. I want to extend it to output whether an unallocated bridge is assigned to a file bucket. Will set the needs_review flag once I'm done.

Replying to kaner:

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

It writes all pool assignments to file. After running it, you have, for instance: (actual numbers changed)

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

I didn't know that --dump-bridges writes email/https/unallocated bridges to disk, too. Why's that?

Also, I don't think --dump-bridges does what I want for a number of reasons. 1) I want to learn about changing assignments as soon as BridgeDB loads new bridges, not when an external script is running. 2) I want to learn about assignments at any point in time, not only the most recent assignment at the time of syncing data. 3) I'm more interested in the fingerprint than IP:port.

comment:17 in reply to: 16 Changed 9 years ago by kaner

Replying to karsten:

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

I didn't know that --dump-bridges writes email/https/unallocated bridges to disk, too. Why's that?

Somebody (possibly nickm or arma) asked for it. I guess they didn't really have a particular use in mind yet. Maybe, back at that time, statistics? ;)

Also, I don't think --dump-bridges does what I want for a number of reasons. 1) I want to learn about changing assignments as soon as BridgeDB loads new bridges, not when an external script is running. 2) I want to learn about assignments at any point in time, not only the most recent assignment at the time of syncing data. 3) I'm more interested in the fingerprint than IP:port.

Ok. To close this thread of discussion: After IRC conversation with karsten, it seems the two features 'statistics' and 'dump bridges' aren't really meant to merge.

comment:18 in reply to: 16 Changed 9 years ago by karsten

Status: assigned → needs_review

Replying to karsten:

Replying to nickm:

Looks okay to me. Shall I merge?

Not yet. I want to extend it to output whether an unallocated bridge is assigned to a file bucket. Will set the needs_review flag once I'm done.

Please review branch dump in my public repository. If this branch looks okay, please merge or tell me to do so. This branch outputs the following format to a file called assignments.log:

$fingerprint email[ port=$port][ flag=$flag]
$fingerprint https ring=$ring[ port=$port][ flag=$flag]
$fingerprint unallocated[ bucket=$bucketname]
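On the consuming side, one line of this format could be parsed roughly as follows. This is a sketch with a hypothetical function name. It assumes whitespace-separated key=value tokens and keeps a list of values per key, since the format above does not say whether a key such as flag may repeat. Bucket names containing spaces would not survive this approach.

```python
def parse_assignment_line(line):
    """Parse '$fingerprint $distributor[ key=value]...' (hypothetical helper).

    Returns (fingerprint, distributor, details), where details maps each
    key (ring, port, flag, bucket) to the list of values seen for it.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("malformed assignment line: %r" % line)
    fingerprint, distributor = parts[0], parts[1]
    details = {}
    for token in parts[2:]:
        key, separator, value = token.partition("=")
        if not separator:
            raise ValueError("expected key=value, got %r" % token)
        details.setdefault(key, []).append(value)
    return fingerprint, distributor, details
```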

comment:19 Changed 9 years ago by nickm

I don't see anything wrong with it. How tested is it?

comment:20 Changed 9 years ago by kaner

Just one minor comment: Do you need to have "bridge-pool-assignment" at the beginning of each line of the log file? If it never changes anyway, is there a reason to have it? (I guess there is) Also, the name of the log file might go to bridgedb.conf? It's not really important, just eye-catching.

comment:21 in reply to: 20 Changed 9 years ago by karsten

Status: needs_review → assigned

Replying to nickm:

I don't see anything wrong with it. How tested is it?

Glad that you like the patch. It's tested with a single bridge descriptor snapshot. I'll run some more tests today and set needs_review again.

Replying to kaner:

Just one minor comment: Do you need to have "bridge-pool-assignment" at the beginning of each line of the log file? If it never changes anyway, is there a reason to have it? (I guess there is)

There's only one bridge-pool-assignment line per dump of assignments. That line contains the timestamp when we wrote these assignments, which is important for later evaluation. The individual assignment lines no longer contain this string; they consist only of the $fingerprint ... parts described three comments above.
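Since every dump is appended to the same file, a reader has to split it at the header lines. A sketch, assuming the header has the form "bridge-pool-assignment YYYY-MM-DD HH:MM:SS" as in the description; the function name is hypothetical.

```python
from datetime import datetime

HEADER = "bridge-pool-assignment "


def split_dumps(lines):
    """Split an appended assignments file into (timestamp, lines) dumps."""
    dumps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(HEADER):
            # A header starts a new dump and carries its write time.
            written = datetime.strptime(line[len(HEADER):], "%Y-%m-%d %H:%M:%S")
            dumps.append((written, []))
        elif dumps:
            dumps[-1][1].append(line)
        else:
            raise ValueError("assignment line before any header: %r" % line)
    return dumps
```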

Also, the name of the log file might go to bridgedb.conf? It's not really important, just eye-catching.

Good point. I'll add a config option for this.

comment:22 Changed 9 years ago by karsten

Status: assigned → needs_review

So, I tested the patch extensively and found three bugs! :) Two bugs are in kaner's file bucket code (#2688), and one bug was in this code (we were dumping all unallocated bridges in the database, but we only want running bridges). I fixed my bug and added a config option for the assignments file. Fixing kaner's bugs is unrelated to this patch. That means this patch is ready to be merged!

Please see (but don't merge) branch dump in my public repository for the fix and the new config option.

What I want to get merged (or merge myself) is in branch dump-rebased. If people are happy with this patch, I'll merge it.

comment:23 Changed 9 years ago by karsten

Resolution: implemented
Status: needs_review → closed
11:29:47 < kaner> karsten: looks good
11:30:33 < kaner> i reviewed the -rebased branch

Merged and deployed. Closing.
