Opened 7 years ago

Closed 10 months ago

#3015 closed enhancement (fixed)

Enhance bucket functionality

Reported by: kaner
Owned by: isis
Priority: Medium
Milestone:
Component: Obfuscation/BridgeDB
Version:
Severity: Blocker
Keywords: bridgedb-dist
Cc:
Actual Points:
Parent ID:
Points: 1
Reviewer:
Sponsor: SponsorM

Description

The bucket mechanism in BridgeDB currently lacks two important features:

a) Only give out bridges that are fresh, meaning those that have been seen recently enough to be considered worth giving out.
b) Give the user the opportunity to add a number of fresh bridges to a certain bucket pool each time BridgeDB -d is run. This makes it much easier to only mail a delta to a given bridge distributor person plus gives that user a predictable number of new bridges each time.
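Feature b) amounts to topping up a bucket with a fixed number of fresh bridges on each run and recording only the newly added bridges, so that just the delta needs to be mailed out. A minimal sketch of that idea (the function and variable names here are invented for illustration, not BridgeDB's actual API):

```python
def top_up_bucket(bucket, unallocated, n):
    """Move up to n fresh bridges from the unallocated pool into the
    bucket and return only the newly added ones (the delta to mail)."""
    delta = unallocated[:n]
    del unallocated[:n]
    bucket.extend(delta)
    return delta

# Each BridgeDB -d run would add a predictable number of new bridges:
bucket = ["old-bridge-1"]
pool = ["new-bridge-1", "new-bridge-2", "new-bridge-3"]
delta = top_up_bucket(bucket, pool, 2)  # delta holds the two new bridges
```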

Child Tickets

Change History (10)

comment:1 Changed 7 years ago by kaner

Owner: changed from kaner to karsten
Status: new → assigned

Please review the bug3015 branch in my BridgeDB repo. Thanks in advance!

comment:2 Changed 7 years ago by kaner

Status: assigned → needs_review

comment:3 in reply to:  description Changed 7 years ago by karsten

Owner: changed from karsten to kaner
Status: needs_review → assigned

Replying to kaner:

The bucket mechanism in BridgeDB currently lacks two important features:

a) Only give out bridges that are fresh, meaning those that have been seen recently enough to be considered worth giving out.

Wait a minute. I thought we were only giving out bridges that were contained in the last known network status. Aren't they fresh by definition?

b) Give the user the opportunity to add a number of fresh bridges to a certain bucket pool each time BridgeDB -d is run. This makes it much easier to only mail a delta to a given bridge distributor person plus gives that user a predictable number of new bridges each time.

I don't quite understand how that works. If we always add a number of new bridges, doesn't that mean we'll soon run out of new bridges? The bucket files cannot grow forever while still having unallocated bridges. Or rather, finding the correct rate that matches the current bridge churn will be difficult.

When making changes like these, can you first update the spec to say what things you want to change? I'm under the impression that we (including Roger) are not really sure what we want here.

comment:4 Changed 7 years ago by karsten

So, after discussing this on IRC and trying out the code, I think BridgeDB does not remove non-running bridges from file buckets. At least the number of bridges never decreased, but always stayed the same or increased in my tests. This also makes sense in the code, because Bucket.py does not import anything from Bridges.py, and Bridges.py is where we store which bridges are running.

(I think what confused me on IRC was that when we dump bridge pool assignments, we look at the running bridges in Bridges.py and obtain the pseudo distributors from Buckets.py. We do not, however, simply iterate over all the bridges in Buckets.py. That's why we only dump pool assignments of running bridges.)

The fact that we dump non-running bridges to file buckets is a bug, and we shouldn't fix it by only removing bridges that haven't been running for a day or two. I don't see why distributing bridges via HTTPS/email should be different from distributing them via Twitter/social networks, such that we make sure we give out only running bridges in the former case but not in the latter. And we do have the information about which bridges are running, so I think we should use it.

Here's my idea:

We could make the BucketManager in Buckets.py, that decides which bridges are written to which file, extend from Bridges.BridgeHolder in the same way as the *Distributor classes in Dist.py do. That's how the BucketManager would learn which bridges are running at the moment.

We could also remove the command-line option to dump bridges to buckets and simply dump them whenever we're done reloading the network status and bridge descriptors. External tools wouldn't have to run BridgeDB with the command-line option, but could simply read the files whenever they like.

As a side-effect, this approach would implement the first half of #2755 by removing the time gap between dumping bridges to file buckets (and thereby possibly changing their pseudo distributors) and dumping assignments upon the next HUP.

Does that make sense?
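The idea above could be sketched roughly as follows. This is only an illustration of the proposed design, not BridgeDB's actual code: `BridgeHolder` here is a stand-in for `Bridges.BridgeHolder`, and the `insert()` signature is an assumption about how the *Distributor classes receive bridges.

```python
# Sketch: the BucketManager receives running bridges through the same
# insert() hook the *Distributor classes use, so non-running bridges
# never reach a bucket, and dumping happens after every reload instead
# of via a command-line option.

class BridgeHolder:
    """Stand-in for Bridges.BridgeHolder: anything that accepts bridges."""
    def insert(self, bridge):
        raise NotImplementedError

class BucketManager(BridgeHolder):
    """Collects running bridges and writes them to per-bucket files."""
    def __init__(self, bucket_files):
        # bucket_files: mapping of bucket name -> output file path
        self.bucket_files = bucket_files
        self.assignments = {name: [] for name in bucket_files}

    def insert(self, bridge):
        # Called once per running bridge on every descriptor reload;
        # here we naively fill the emptiest bucket first.
        bucket = min(self.assignments, key=lambda n: len(self.assignments[n]))
        self.assignments[bucket].append(bridge)

    def dump(self):
        # Dump whenever reloading finishes; external tools simply read
        # the files whenever they like.
        for name, bridges in self.assignments.items():
            with open(self.bucket_files[name], "w") as f:
                for b in bridges:
                    f.write("%s\n" % b)
```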

comment:5 in reply to: 4 Changed 7 years ago by kaner

Replying to karsten:

So, after discussing this on IRC and trying out the code, I think BridgeDB does not remove non-running bridges from file buckets. At least the number of bridges never decreased, but always stayed the same or increased in my tests. This also makes sense in the code, because Bucket.py does not import anything from Bridges.py, and Bridges.py is where we store which bridges are running.

That's definitely the case. The reason is simple: currently, the bucket mechanism does not use the "splitter -> distributor -> rings" bridges imported from the descriptors, but reads them from the database.

Basically, there are two ways to implement the buckets feature:

a) Similar to what Karsten did for the dump pool assignments feature. Trigger a dump each time SIGHUP is received and only dump bridges that are in the current descriptor file. This approach has the advantage of fitting in nicely with the rest of BridgeDB's distribution mechanisms.

b) Read bridges directly from the database. This provides more flexibility than a), because not only can the bridges seen in the latest descriptor file be given out, but also bridges that may not have been online for a few hours (people might shut down their bridges on the weekend, for instance, or at night). Also, bridges can be dumped without interfering with the running BridgeDB instance, because a separate instance can be triggered via the command line at any time. The disadvantage, of course, is that it distributes bridges differently from the rest of BridgeDB.

Initially, nobody seemed to object to this tradeoff, so I went with the implementation as it now stands. Adding a configurable timespan that eliminates bridges that are too old to give out seems okay to me. This is what part a) of the patch on the bug3015 branch does.
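The configurable timespan described here boils down to a last-seen cutoff when reading bridges from the database. A rough sketch of that filter, using an in-memory SQLite database — the table and column names are invented for illustration and do not reflect BridgeDB's real schema:

```python
import sqlite3
import time

# Configurable timespan: drop bridges not seen within the last 6 hours.
MAX_AGE = 6 * 3600

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bridges (hex_key TEXT, last_seen INTEGER)")
now = int(time.time())
conn.executemany("INSERT INTO Bridges VALUES (?, ?)",
                 [("fresh1", now - 3600),        # seen an hour ago: keep
                  ("stale1", now - 48 * 3600)])  # seen two days ago: drop

# Only hand out bridges whose last-seen timestamp is recent enough.
rows = conn.execute("SELECT hex_key FROM Bridges WHERE last_seen >= ?",
                    (now - MAX_AGE,)).fetchall()
fresh = [r[0] for r in rows]
```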

Anyway, since you don't seem to be a big fan of this approach (and a)'s design is cleaner), I'm willing to let it go and switch back to approach a).

We could make the BucketManager in Buckets.py, that decides which bridges are written to which file, extend from Bridges.BridgeHolder in the same way as the *Distributor classes in Dist.py do. That's how the BucketManager would learn which bridges are running at the moment.

We could also remove the command-line option to dump bridges to buckets and simply dump them whenever we're done reloading the network status and bridge descriptors. External tools wouldn't have to run BridgeDB with the command-line option, but could simply read the files whenever they like.

As a side-effect, this approach would implement the first half of #2755 by removing the time gap between dumping bridges to file buckets (and thereby possibly changing their pseudo distributors) and dumping assignments upon the next HUP.

Does that make sense?

It does, if I understood correctly. Maybe we should try to merge the dump pool assignments and bucket mechanisms somewhat after all. On closer inspection, I think large parts of the code for both features would look quite similar if we go with this solution.

Thanks for your suggestions.

comment:6 in reply to:  5 Changed 7 years ago by karsten

Replying to kaner:

Basically, there are two ways to implement the buckets feature:

Your a) is my favorite here. The main argument for me is that it fits more nicely into the rest of BridgeDB. Given that we want to add lots of new features to BridgeDB soon, I'd really prefer making the code clearer and removing special cases.

If we wanted the flexibility of b) to give out non-running bridges that we saw in the last 24 hours, we could still do that by storing when we last saw a bridge in Bridges.py. We could even define that we only want to give out running bridges via HTTPS/email, but that we're fine with 1-day-old bridges for file buckets. So this flexibility is not tied to your approach b), but can be implemented in a), too. Anyway, I'd rather consider this a bug, not a feature.
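The per-distributor freshness rule suggested here could look something like the following. The distributor names, thresholds, and the one-hour approximation of "running" are all illustrative assumptions, not BridgeDB's actual policy:

```python
import time

# Illustrative freshness policy: running-only for interactive
# distributors, up to one day old for file buckets.
FRESHNESS = {
    "https": 0,                # must be in the latest networkstatus
    "email": 0,
    "file-bucket": 24 * 3600,  # a day-old bridge is still acceptable
}

def eligible(distributor, last_seen, now=None):
    """Return True if a bridge last seen at `last_seen` (epoch seconds)
    may still be handed out by `distributor`."""
    now = now if now is not None else time.time()
    max_age = FRESHNESS[distributor]
    if max_age == 0:
        # "Running" is approximated here as seen within the last hour;
        # the real check would consult the running-bridge state in
        # Bridges.py instead.
        max_age = 3600
    return now - last_seen <= max_age
```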

Maybe we should try to merge the dump pool assignments and the bucket mechanisms somewhat, after all. On closer inspection, I think large parts of the code for both features would look quite similar, if we decide for this solution.

The two features are similar, but still different. For example, we don't have to write file buckets for the HTTPS and email distributors as well as for unallocated bridges. Also, the formats of the assignments.log file and the file buckets are slightly different. I'd say it's fine to copy the approach taken in dumping assignments for dumping file buckets, but we shouldn't try too hard to merge the two yet.

I'm afraid the bulk of the work here will be up to you. How can I help?

comment:7 Changed 3 years ago by isis

Owner: changed from kaner to isis

At this point, the Buckets have been unused for many years, largely due to the design's requirement of continual manual intervention on the part of the person operating/maintaining BridgeDB; i.e., the process of adding a new bucket and giving it to someone consists of:

  1. Isis has to edit the BridgeDB config file,
  2. commit the config file,
  3. push the commit to the bridgedb-admin.git repo,
  4. ssh into the BridgeDB server,
  5. pull the bridgedb-admin.git changes,
  6. kill the running BridgeDB server,
  7. restart the BridgeDB server,
  8. send a SIGUSR1 signal to the BridgeDB process,
  9. find the bucket file,
  10. scp the bucket file somewhere,
  11. OpenPGP encrypt it,
  12. and, finally, email (or otherwise distribute) the file to the person/group that it needs to go to.

While it might be fair to call me lazy for not wanting to do this, I believe it is fair to say that this original design, while perhaps sufficient for some use cases which no longer exist, is no longer providing a solution to any current problems with bridge distribution.

In order to come up with a better design, we're going to need to think of the intended use cases for Buckets (perhaps #13727 and/or #13570?), and how we'd like to automate getting that data to the party it's intended for, because I simply flat out refuse to do a 12-step process to give bridges to someone.

comment:8 Changed 3 years ago by isis

Keywords: bridgedb-dist added
Status: assigned → needs_information

comment:9 Changed 3 years ago by isis

I should also add that, because the design of Bucket.py conflicts with the current development direction for #12505, #12506, #12029, #12030, and #12031, and because:

  • the code in Bucket.py is no longer used, for reasons mentioned in my previous comment,
  • the code in Bucket.py never had unit/integration tests, and
  • the SQL schema governing how bridges assigned to Buckets are stored in the database is clunky and highly inefficient (one shouldn't have to loop over every bridge BridgeDB has seen in the past 9 years to find the ones currently assigned to Buckets), and it grows linearly less efficient every time a new bridge somewhere comes into existence,

I am tempted to remove Bucket.py nearly entirely until it can be redesigned with some particular use-case(s) in mind. In removing the code, I'd like to retain:

  • functionality for extracting data about bridges still assigned (or historically assigned) to buckets from the databases (unfortunately, this will still retain the nasty for loop mentioned above, unless we want to update the SQL schema once more before migrating to the new databases in #12030)

and remove:

  • Bucket.py,
  • the ability to dump buckets to .brdg files,
  • the ability to create new buckets via the BUCKET_FILES config setting, and
  • the SIGUSR1 handler for dumping buckets to .brdg files
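The linear scan complained about above could, in a redesigned schema, be replaced by an indexed lookup on the distributor column. A sketch with an invented schema (BridgeDB's real tables and column names differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Bridges (
    hex_key TEXT PRIMARY KEY,
    distributor TEXT)""")
# An index on the distributor column lets SQLite locate bucket-assigned
# bridges directly instead of scanning every bridge ever seen.
conn.execute("CREATE INDEX idx_distributor ON Bridges (distributor)")
conn.executemany("INSERT INTO Bridges VALUES (?, ?)",
                 [("aaaa", "https"), ("bbbb", "bucket-ngo"),
                  ("cccc", "email"), ("dddd", "bucket-ngo")])

# Fetch only the bridges currently assigned to buckets.
bucketed = conn.execute(
    "SELECT hex_key FROM Bridges WHERE distributor LIKE 'bucket-%' "
    "ORDER BY hex_key").fetchall()
```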

comment:10 in reply to:  9 Changed 10 months ago by isis

Points: 1
Resolution: fixed
Severity: Blocker
Sponsor: SponsorM
Status: needs_information → closed

Replying to isis:

I should also add that, because the design of Bucket.py conflicts with the current development direction for #12505, #12506, #12029, #12030, and #12031, and because:

  • the code in Bucket.py is no longer used, for reasons mentioned in my previous comment,
  • the code in Bucket.py never had unit/integration tests, and
  • the SQL schema governing how bridges assigned to Buckets are stored in the database is clunky and highly inefficient (one shouldn't have to loop over every bridge BridgeDB has seen in the past 9 years to find the ones currently assigned to Buckets), and it grows linearly less efficient every time a new bridge somewhere comes into existence,

I am tempted to remove Bucket.py nearly entirely until it can be redesigned with some particular use-case(s) in mind. In removing the code, I'd like to retain:

  • functionality for extracting data about bridges still assigned (or historically assigned) to buckets from the databases (unfortunately, this will still retain the nasty for loop mentioned above, unless we want to update the SQL schema once more before migrating to the new databases in #12030)

and remove:

  • Bucket.py,
  • the ability to dump buckets to .brdg files,
  • the ability to create new buckets via the BUCKET_FILES config setting, and
  • the SIGUSR1 handler for dumping buckets to .brdg files

Fixed in my fix/3015-remove-buckets branch.
