Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#4499 closed task (implemented)

Investigate scaling points to handle more bridges

Reported by: runa
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords: SponsorE20120315
Cc: aagbsn, arma
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The current bridge infrastructure relies on a central bridge authority to collect, distribute, and publish bridge relay descriptors. We believe the current infrastructure can handle up to 10,000 bridges.

The scaling points involve the database of descriptors, the metrics portal and its ability to handle this many descriptors for analysis, and the reachability testing part of the code for the bridge authority. We should investigate scaling points to handle more than 10,000 bridge descriptors.


Attachments (1)

bridge-scaling.png (56.1 KB) - added by karsten 8 years ago.
Scalability of Tor's bridge infrastructure


Change History (11)

comment:1 Changed 8 years ago by phobos

The last sentence is the key here. We don't need to build and deploy a scalable bridge infrastructure. We need to write down thoughts and notes about how to scale the bridge db system to 100,000 or 1,000,000 bridges.

comment:2 Changed 8 years ago by karsten

Owner: set to karsten
Status: new → assigned

I'm going to put some thought into this. Grabbing the ticket.

comment:3 Changed 8 years ago by runa

Summary: Infrastructure to support more bridges → Investigate scaling points to handle more bridges

comment:4 Changed 8 years ago by runa

Karsten has previously said that we'll want to look very carefully at the bridge authority, BridgeDB, metrics, and maybe others. He's mostly worried about the bridge authority, but BridgeDB and metrics will have to be extended as well.

comment:5 Changed 8 years ago by aagbsn

Cc: aagbsn added

comment:6 Changed 8 years ago by karsten

I started this analysis by writing a small tool to generate sample data for BridgeDB and metrics-db. This tool takes the contents of one of Tonga's bridge tarballs as input, copies them a given number of times, and overwrites the first two bytes of relay fingerprints in every copy with 0000, 0001, etc. The tool also fixes references between network statuses, server descriptors, and extra-info descriptors. This is sufficient to trick BridgeDB and metrics-db into thinking that relays in the copies are distinct relays. I used the tool to generate tarballs with 2, 4, 8, 16, 32, and 64 times as many bridge descriptors in them.
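The fingerprint rewriting step described above can be sketched roughly as follows. This is only an illustration, not the tool used for the analysis: the function names are made up, it assumes fingerprints are 40-character hex strings, and it leaves out the part that fixes references between network statuses, server descriptors, and extra-info descriptors.

```python
def rewrite_fingerprint(fingerprint: str, copy_index: int) -> str:
    """Overwrite the first two bytes (four hex characters) of a hex
    fingerprint with the copy index: 0000, 0001, and so on.

    Hypothetical helper; assumes a 40-character hex fingerprint.
    """
    if not 0 <= copy_index <= 0xFFFF:
        raise ValueError("copy index must fit in two bytes")
    return f"{copy_index:04X}" + fingerprint[4:]


def make_copies(fingerprints, copies):
    """Return one list of rewritten fingerprints per copy, so that each
    copy looks like a set of distinct relays."""
    return [
        [rewrite_fingerprint(fp, index) for fp in fingerprints]
        for index in range(copies)
    ]
```

Calling `make_copies(fingerprints, 64)` on the fingerprints from a single tarball would give 64 sets that differ only in their first two bytes, which matches the largest multiplier used in the analysis.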

In the next step I fed the tarballs into BridgeDB and metrics-db. BridgeDB reads the network statuses and server descriptors from the latest tarball and writes them to a local database. metrics-db sanitizes two half-hourly created tarballs every hour, establishes an internal mapping between descriptors, and writes sanitized descriptors with fixed references to disk.

The attached graph shows the results.

The upper graph shows how the tarballs grow in size with more bridge descriptors in them. This growth is, unsurprisingly, linear. One thing to keep in mind here is that bandwidth and storage requirements for the hosts transferring and storing bridge tarballs grow with the tarballs. We'll want to pay extra attention to disk space running out on those hosts.

The middle graph shows how long BridgeDB takes to load descriptors from a tarball. This graph is linear, too, which indicates that BridgeDB can handle an increase in the number of bridges pretty well. One thing I couldn't check is whether BridgeDB's ability to serve client requests is in any way affected during the descriptor import. I assume it'll be fine. Aaron, are there other things in BridgeDB that I overlooked that may not scale?

The lower graph shows how metrics-db can or cannot handle more bridges. The growth is slightly worse than linear. In any case, the absolute time required to handle 25K bridges is worrisome (I didn't try 50K). metrics-db runs in an hourly cronjob, and if that cronjob doesn't finish within 1 hour, we cannot start the next run and will be missing some data. We might have to sanitize bridge descriptors in a different thread or process than the one that fetches all the other metrics data. I can also look for a Java library that handles .gz-compressed files faster than the one we're using. So, we can probably handle 25K bridges somehow, and maybe even 50K. Somehow.
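To illustrate the failure mode with the hourly cronjob, here is a minimal sketch of a guard that lets a new run notice that the previous one is still going and skip itself instead of overlapping. It is not how metrics-db does this (metrics-db is written in Java); the `single_run` name and the lock file approach are assumptions made for the example, and a crashed run would leave a stale lock file behind that somebody has to clean up.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_run(lock_path):
    """Yield True if this process got the lock, False if another run
    already holds it. Hypothetical helper for illustration only."""
    try:
        # O_EXCL makes the open fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        # Record our PID so a stuck run can be identified by hand.
        os.write(fd, str(os.getpid()).encode())
        yield True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A cron entry point could wrap its work in `with single_run(path) as got_lock:` and log a skipped run when `got_lock` is false, which at least makes the missing data visible.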

Finally, note that I left out the most important part of this analysis: can Tonga, or more generally, a single bridge authority handle this increase in bridges? I'm not sure how to test such a setting, at least not without running 50K bridges in a private network. I could imagine this requires some more sophisticated sample data generation, including getting the crypto right and then talking to Tonga's DirPort. If there's an easy way to test this, I'll do it. If not, we can always hope for the best. What could go wrong?

Changed 8 years ago by karsten

Attachment: bridge-scaling.png added

Scalability of Tor's bridge infrastructure

comment:7 Changed 8 years ago by arma

If we end up with way too many bridges, here are a few things we'll want to look at updating:

  • Tonga still does a reachability test on each bridge every 21 minutes or so. Eventually the number of TLS handshakes it's doing will overwhelm its CPU.
  • The tarballs we make every half hour have substantial overlap. If we have tens of thousands of descriptors, we would want to get smarter at sending diffs over to BridgeDB.
  • Somebody should check whether BridgeDB's interaction with users freezes while it's reading a new set of data.
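As a back-of-the-envelope sketch of the reachability point, the handshake rate can be estimated from the 21-minute interval. This assumes one TLS handshake per bridge per test and tests spread evenly over the interval, neither of which is confirmed here; the function name is made up.

```python
def handshakes_per_second(bridges: int, interval_minutes: float = 21.0) -> float:
    """Average TLS handshakes per second if every bridge is tested once
    per interval (assumption: one handshake per test, evenly spread)."""
    return bridges / (interval_minutes * 60.0)


if __name__ == "__main__":
    for count in (10_000, 25_000, 50_000, 100_000):
        rate = handshakes_per_second(count)
        print(f"{count:>7} bridges: {rate:6.1f} handshakes/s")
```

Under these assumptions, 10,000 bridges come to about 8 handshakes per second, and the rate grows linearly with the number of bridges.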

comment:8 Changed 8 years ago by karsten

Cc: arma added

Updated the PDF. Comments welcome. Otherwise this is the PDF to submit to sponsor E on March 15.

comment:9 Changed 8 years ago by karsten

Resolution: implemented
Status: assigned → closed

Published tech report on the metrics website. Closing.

comment:10 Changed 7 years ago by karsten

Keywords: SponsorE20120315 added
Milestone: Sponsor E: March 15, 2012 deleted

Switching from using milestones to keywords for sponsor deliverables. See #6365 for details.
