Export BridgeDB's pool assignments

added component::circumvention/bridgedb owner::karsten parent::2537 priority::medium resolution::implemented status::closed type::enhancement labels

I don't see an issue with the sanitization approach you describe. Again, the best plan there is probably to write up a quick summary of what exactly the transform is, and for the items in the sanitized form, why you believe they're safe and/or why you still want them. Then when that settles, publish some sample sanitized output and let people pick at it.

One issue that comes to mind that we might want to research is how often a given bridge moves IP address. The method you describe above would lose that info, yes? Whereas if we do a keyed hash of the IP address (and never disclose the key), we could distinguish "same" from "different". I remember we had the keyed hash design in some other sanitization context, but I don't remember which one -- how is the idea working out in that other context?

(It's possible that we already do the keyed hash for the regular bridge descriptors, so we would just need to match up the sha1(fingerprint) in this file with the sha1(fingerprint) in that file and we could look up the IP address. In which case maybe there's merit in doing the same keyed hash in both places, to ease the job of future researchers.)
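
To make the "same" versus "different" idea concrete, here is a minimal sketch of a keyed hash over an IP address. It assumes HMAC-SHA1 as the construction, which this thread does not specify; the key and the addresses are made-up placeholders, and `keyed_ip_hash` is a hypothetical helper name.

```python
import hashlib
import hmac


def keyed_ip_hash(ip_address, secret_key):
    """Return a hex HMAC-SHA1 digest of an IP address under a secret key."""
    return hmac.new(secret_key, ip_address.encode("ascii"), hashlib.sha1).hexdigest()


# Hypothetical key and documentation-range addresses, for illustration only.
key = b"never-published-secret"
first = keyed_ip_hash("192.0.2.10", key)
again = keyed_ip_hash("192.0.2.10", key)
moved = keyed_ip_hash("198.51.100.7", key)

print(first == again)  # same address under the same key
print(first == moved)  # bridge moved to a different address
```

As long as the key stays secret, a reader of the sanitized data can tell that a bridge changed address without learning either address.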

The main question that I want to answer with this data actually is "what's the correlation between which pool the bridge is in and whether that bridge sees a lot of use from a given country". My guess is there are periods of time where the http bridges are wildly popular in China, and then periods where they are pretty much unused (e.g. because they're not reachable). I wonder how it looks for other countries.

(There are variations of this question that I also want to know the answer to, that don't require this data at all, such as "what's the correlation between the bridge's ORPort and its use in various countries".)

As for changing bridgedb to export its pool assignments in this format, that's fine by me. It will be much easier for you to pick through than having somebody grep lines from logs. Ask Andrew which python dev person you should point at.

Thanks for the feedback!

I'm going to write that quick summary and send it to or-dev today. If nobody screams at me until, say, end of this week, I'll publish some sample output from September 2010 (the oldest data that I have).

We discussed including a (keyed) hash of a bridge's IP address in sanitized bridge descriptors a year ago. If we decide that it's a good idea we should also include the same (keyed) hash in the bridge pool assignments. We can revisit this question if you want, but we might want to do this in a separate task. I'll look up the mails and open a new ticket later today.

Replying to karsten:

I'm going to write that quick summary and send it to or-dev today. If nobody screams at me until, say, end of this week, I'll publish some sample output from September 2010 (the oldest data that I have).

Posted to or-dev.

We discussed including a (keyed) hash of a bridge's IP address in sanitized bridge descriptors a year ago. If we decide that it's a good idea we should also include the same (keyed) hash in the bridge pool assignments. We can revisit this question if you want, but we might want to do this in a separate task. I'll look up the mails and open a new ticket later today.

See #2435 (moved). Let's move the discussion about hashed IP addresses there. Once we have a design we can easily adapt it to the bridge pool assignments.

Can we extend BridgeDB to write a new bridge pool assignment file using the format described above whenever it assigns bridges to rings and copy that file to our bridge sanitizing node? That node will take care of sanitizing.

Alternatively, can we periodically grep BridgeDB's logs for "Leaving bridge .* unallocated|Adding bridge .* to" and copy the output to our bridge sanitizing node? If we take this approach, the bridge sanitizing node will take care of reformatting and sanitizing the logs.
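
The grep alternative could look roughly like the sketch below. Only the pattern is taken from the paragraph above; the sample lines are invented placeholders, because the exact BridgeDB log layout is not shown in this ticket, and `filter_assignment_lines` is a hypothetical name.

```python
import re

# Pattern copied from the comment above; everything else is an assumption.
ASSIGNMENT_RE = re.compile(r"Leaving bridge .* unallocated|Adding bridge .* to")


def filter_assignment_lines(lines):
    """Yield only the log lines that mention a pool assignment."""
    for line in lines:
        if ASSIGNMENT_RE.search(line):
            yield line.rstrip("\n")


# Invented sample lines, not real BridgeDB log output.
sample_log = [
    "Leaving bridge <bridge-1> unallocated\n",
    "Adding bridge <bridge-2> to <ring>\n",
    "Some unrelated log message\n",
]
matched = list(filter_assignment_lines(sample_log))
for line in matched:
    print(line)
```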

Trac:
Cc: N/A to nickm

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.

Replying to nickm:

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

I don't understand the second half of your last sentence. I noticed that bridges appear in multiple places in the logs, because they occur multiple times in the bridge-descriptors file, right? But shouldn't the same bridge be allocated to the very same ring and subrings when parsing its descriptors, because the allocation is based on the bridge identity and the networkstatus-bridges file?

The suggested log format was derived from the log file I had. We can change the format if that makes things easier. How about we take a) the subrings including IP ring X, stable, or port-443 subring, and b) the bridge IP address out of the new log format? I don't know what to do with the information that a bridge is contained in IP ring X, and all flags, ports, and IP addresses are contained in the bridge descriptors that we already have. The information I'm hoping to learn from the new log format is whether a bridge is allocated to the email or web pool or not allocated at all. How about this new format:

bridge-pool-assignment 2011-01-10 01:41:14
abcdef0123456789abcdef0123456789abcdef01 unallocated
0123456789abcdef0123456789abcdef01234567 web
4567890987654321234567890abcdefedcbabcde email
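
For whoever ends up reading this format on the sanitizing side, a minimal parsing sketch follows. It assumes exactly the layout proposed above (one header line, then one `fingerprint pool` pair per line); `parse_assignments` is a hypothetical helper, not existing code.

```python
from datetime import datetime

SAMPLE = """\
bridge-pool-assignment 2011-01-10 01:41:14
abcdef0123456789abcdef0123456789abcdef01 unallocated
0123456789abcdef0123456789abcdef01234567 web
4567890987654321234567890abcdefedcbabcde email
"""


def parse_assignments(text):
    """Return (timestamp, {fingerprint: pool}) for one assignment block."""
    lines = text.strip().splitlines()
    keyword, date, time = lines[0].split()
    if keyword != "bridge-pool-assignment":
        raise ValueError("unexpected header: %r" % lines[0])
    timestamp = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M:%S")
    pools = {}
    for line in lines[1:]:
        fingerprint, pool = line.split()
        pools[fingerprint] = pool
    return timestamp, pools


timestamp, pools = parse_assignments(SAMPLE)
print(timestamp, len(pools))
```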

Replying to nickm:

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.

I don't know much about Python, but I can try to work on a BridgeDB patch. I have a working BridgeDB installed on a local machine here. Can you give me some more guidance how to implement the format described in this comment?

I also don't quite understand the last sentence of your second comment. The logs I had (see original task description above) contained lines for unallocated bridges. What distributors are there that only exist in the database, and why do we have to treat them specially?

Thanks!

Trac:
Parent: N/A to #2537 (moved)

Replying to karsten:

Replying to nickm:

Adding another log and using it to log more information should be pretty easy. Dumping all allocations should be relatively simple, but the format above is a little nontrivial to output. The simplest format would be sorted by distributor, then by subring, then by ring position, so any bridge that appeared in multiple places would need to be postprocessed into one place.

I don't understand the second half of your last sentence. I noticed that bridges appear in multiple places in the logs, because they occur multiple times in the bridge-descriptors file, right? But shouldn't the same bridge be allocated to the very same ring and subrings when parsing its descriptors, because the allocation is based on the bridge identity and the networkstatus-bridges file?

Yeah. My point was that there's nothing in principle that keeps a bridge from being assigned to more than one place at a time. I don't think there's anything in bridgedb that does that currently, though.

The suggested log format was derived from the log file I had. We can change the format if that makes things easier. How about we take a) the subrings including IP ring X, stable, or port-443 subring, and b) the bridge IP address out of the new log format? I don't know what to do with the information that a bridge is contained in IP ring X, and all flags, ports, and IP addresses are contained in the bridge descriptors that we already have. The information I'm hoping to learn from the new log format is whether a bridge is allocated to the email or web pool or not allocated at all. How about this new format:

{{{
bridge-pool-assignment 2011-01-10 01:41:14
abcdef0123456789abcdef0123456789abcdef01 unallocated
0123456789abcdef0123456789abcdef01234567 web
4567890987654321234567890abcdefedcbabcde email
}}}

Probably also doable. Let me see what I can do. I'm thinking of including the extra information that you say you don't know what to do with, just in case later there's a use for it.

Replying to nickm:

To add: if that approach I just described is good enough, I think the way you'd want to do it is to add a function to BridgeHolder that dumped its contents to a string or an open file or something. We'd need to think a little, though, about how that would work with unallocated bridges and bridges assigned to "distributors" that only exist in the DB.

I don't know much about Python, but I can try to work on a BridgeDB patch. I have a working BridgeDB installed on a local machine here. Can you give me some more guidance how to implement the format described in this comment?

I hacked up some totally untested code in branch "dump" in my public bridgedb repo. It doesn't handle unallocated bridges yet, and nothing calls it yet, but it should be a good starting point for a patch. Let me know if you have any questions about it.

I also don't quite understand the last sentence of your second comment. The logs I had (see original task description above) contained lines for unallocated bridges. What distributors are there that only exist in the database, and why do we have to treat them specially?

Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

I tweaked your branch a bit. Please see the attached patch (I don't have a public repository yet). The new format is:

$fingerprint email[ port=$port][ flag=$flag]
$fingerprint https ring=$ring[ port=$port][ flag=$flag]
$fingerprint unallocated

I think the code that calls this needs to move somewhere else, and we should make this configurable. Right now, everything is appended to a file assignments.log.
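
As a rough illustration of how one line in this format could be built, here is a sketch. It is not the code from the attached patch; `format_assignment` and the example values for ring, port, and flag are assumptions.

```python
def format_assignment(fingerprint, distributor, ring=None, port=None, flag=None):
    """Build one assignment line; optional fields are omitted when None."""
    parts = [fingerprint, distributor]
    if distributor == "https":
        # The https form always carries a ring in the format above.
        if ring is None:
            raise ValueError("https assignments need a ring")
        parts.append("ring=%s" % ring)
    if distributor != "unallocated":
        if port is not None:
            parts.append("port=%s" % port)
        if flag is not None:
            parts.append("flag=%s" % flag)
    return " ".join(parts)


# Hypothetical values, for illustration only.
fp = "0123456789abcdef0123456789abcdef01234567"
print(format_assignment(fp, "https", ring=3, port=443, flag="stable"))
print(format_assignment(fp, "email", port=443))
print(format_assignment(fp, "unallocated"))
```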

I'm also unsure what to do with the bucket thing. kaner, do you have an idea how we could include the information to which bucket an unallocated bridge was assigned? Maybe we can write something like $fingerprint unallocated bucket=$bucketname for them?

Trac:
0001-Tweak-function-to-dump-all-assignments-a-bit.patch

Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

They exist in the database. Is that not enough?

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

Replying to kaner:

Thanks to kaner's "buckets" thing, there are some distributors that just mean that some "unallocated" bridges are written out to files. These bridges (like other unallocated bridges) don't currently exist at all in-memory for bridgedb. I'm starting to think that choice was iffy.

They exist in the database. Is that not enough?

From reading the code it seems that bucket assignments are made persistent in the database. But I didn't play with BridgeDB enough to be sure.

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

Note that sanitizing the output is not an issue here. This can (and should) happen on the same machine that sanitizes bridge descriptors, so that we have a single place to change the sanitization process, if necessary.

I fixed a bug in my patch above and pushed a new version to branch dump in my shiny new public repository.

Looks okay to me. Shall I merge?

Trac:
Status: new to needs_review

Replying to karsten:

They exist in the database. Is that not enough?

From reading the code it seems that bucket assignments are made persistent in the database. But I didn't play with BridgeDB enough to be sure.

Yes, they are.

Maybe I'm totally off, but I don't think you need any tweaks in BridgeDB for what Karsten wants to accomplish -- if I understood correctly. I think he could just call `TorBridgeDB --dump-bridges' and read bridge pool assignments from those files. Of course a script needs to run over those and sanitize them. But that's another discussion maybe.

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

It writes all pool assignments to file. After running it, you have, for instance: (actual numbers changed)

$ wc -l *.brdgs
9000 email-2011-02-13.brdgs
9000 https-2011-02-13.brdgs
400 twitter-2011-02-13.brdgs
1000 unallocated-2011-02-13.brdgs

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

Replying to nickm:

Looks okay to me. Shall I merge?

Not yet. I want to extend it to output whether an unallocated bridge is assigned to a file bucket. Will set the needs_review flag once I'm done.

Replying to kaner:

I don't think that --dump-bridges is doing what I want. Doesn't that just write the file buckets to disk? I want the assignments of bridges to the https/email distributors as well as the unallocated bridges. Also, I want the assignment file to be updated whenever BridgeDB loads new network statuses and descriptors and refreshes its internal state.

It writes all pool assignments to file. After running it, you have, for instance: (actual numbers changed)

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

I didn't know that --dump-bridges writes email/https/unallocated bridges to disk, too. Why's that?

Also, I don't think --dump-bridges does what I want for a number of reasons. 1) I want to learn about changing assignments as soon as BridgeDB loads new bridges, not when an external script is running. 2) I want to learn about assignments at any point in time, not only the most recent assignment at the time of syncing data. 3) I'm more interested in the fingerprint than IP:port.

Trac:
Status: needs_review to assigned
Owner: N/A to karsten

Replying to karsten:

Each *.brdgs file contains ip:port pairs. You decide if that is what you need.

I didn't know that --dump-bridges writes email/https/unallocated bridges to disk, too. Why's that?

Somebody (possibly nickm or arma) asked for it. I guess they didn't really have a particular use in mind yet. Maybe, back at that time, statistics? ;)

Also, I don't think --dump-bridges does what I want for a number of reasons. 1) I want to learn about changing assignments as soon as BridgeDB loads new bridges, not when an external script is running. 2) I want to learn about assignments at any point in time, not only the most recent assignment at the time of syncing data. 3) I'm more interested in the fingerprint than IP:port.

Ok. To close this thread of discussion: After IRC conversation with karsten, it seems the two features 'statistics' and 'dump bridges' aren't really meant to merge.

Replying to karsten:

Replying to nickm:

Looks okay to me. Shall I merge?

Not yet. I want to extend it to output whether an unallocated bridge is assigned to a file bucket. Will set the needs_review flag once I'm done.

Please review branch dump in my public repository. If this branch looks okay, please merge or tell me to do so. This branch outputs the following format to a file called assignments.log:

$fingerprint email[ port=$port][ flag=$flag]
$fingerprint https ring=$ring[ port=$port][ flag=$flag]
$fingerprint unallocated[ bucket=$bucketname]
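
A reader for these lines might look like the sketch below. It assumes that every optional field is a whitespace-free `key=value` token and that no key repeats on a line; neither is confirmed by the branch. `parse_assignment_line` and the bucket name are hypothetical.

```python
def parse_assignment_line(line):
    """Split one line into (fingerprint, distributor, {key: value})."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("malformed assignment line: %r" % line)
    fingerprint, distributor = fields[0], fields[1]
    options = {}
    for field in fields[2:]:
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % field)
        options[key] = value
    return fingerprint, distributor, options


# Hypothetical line; the bucket name is a placeholder.
example = "0123456789abcdef0123456789abcdef01234567 unallocated bucket=examplebucket"
parsed = parse_assignment_line(example)
print(parsed)
```

Returning the options as a dictionary means a future field can be added to the log without changing the reader.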

Trac:
Status: assigned to needs_review
