Opened 8 years ago

Closed 8 years ago

#2435 closed enhancement (implemented)

Preserving hashed IP addresses in sanitized bridge descriptors

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/CollecTor Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Roger mentioned in a comment on #2372:

One issue that comes to mind that we might want to research is how often a given bridge moves IP address. The method you describe above would lose that info, yes? Whereas if we do a keyed hash of the IP address (and never disclose the key), we could distinguish "same" from "different". I remember we had the keyed hash design in some other sanitization context, but I don't remember which one -- how is the idea working out in that other context?

(It's possible that we already do the keyed hash for the regular bridge descriptors, so we would just need to match up the sha1(fingerprint) in this file with the sha1(fingerprint) in that file and we could look up the IP address. In which case maybe there's merit in doing the same keyed hash in both places, to ease the job of future researchers.)

When we discussed this topic the last time, I suggested replacing bridge IP addresses with something very similar to this:

  H(IP address + bridge identity + secret)[:3]

The input IP address is the 4-byte long binary representation of the bridge's current IP address. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is an arbitrary, sufficiently long (say, 20 bytes), secure random string that does not change over time and that is only known to the machine running the bridge descriptor sanitizer plus backups. H is SHA-1. The [:x] operator means that we pick the x most significant bytes of the result.

The original transformation used 4 bytes of the output, but I changed this to use only 3 bytes here. The idea is to write the resulting "IP addresses" as 10.x.x.x in the sanitized descriptors to make it clear that these are not public IP addresses. I want to avoid confusion with the non-sanitized IP addresses in exit policies. I'm aware of the higher collision probability, but the probability and impact of missing an IP address change are still sufficiently low.
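To make the transformation concrete, here is a minimal Python sketch of it as described above: SHA-1 over IP address, bridge identity, and secret, truncated to 3 bytes and written as 10.x.x.x. The function name sanitize_ip and its argument layout are made up for illustration and are not taken from the sanitizer's code.

```python
import hashlib
import ipaddress


def sanitize_ip(ip: str, identity: bytes, secret: bytes) -> str:
    """Map a bridge IP address to a 10.x.x.x pseudo-address.

    Hypothetical helper: computes H(IP address + bridge identity + secret)[:3]
    with H = SHA-1, as proposed in the ticket description.
    """
    ip_bytes = ipaddress.IPv4Address(ip).packed  # 4-byte binary representation
    if len(identity) != 20:
        raise ValueError("bridge identity must be 20 bytes")
    digest = hashlib.sha1(ip_bytes + identity + secret).digest()
    x = digest[:3]  # the 3 most significant bytes of the hash
    return "10.%d.%d.%d" % (x[0], x[1], x[2])


if __name__ == "__main__":
    import os

    identity = bytes(range(20))  # placeholder fingerprint, not a real bridge
    secret = os.urandom(20)      # in practice: generated once and kept private
    print(sanitize_ip("192.0.2.1", identity, secret))
```

The same inputs always yield the same 10.x.x.x address, so an address change of one bridge shows up as a change in the sanitized descriptors for as long as the secret stays the same.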

The resulting "IP address" helps us detect whether a specific bridge has changed its IP address. It does not tell us if two bridges run on the same IP address. It also does not tell us when a bridge changes its fingerprint but keeps its IP address.

The two important pieces of this transformation are that a) someone who learns a bridge's identity cannot guess the bridge's previous IP addresses (which would have been possible without using the secret); b) someone who guesses the secret cannot guess the IP addresses of all bridges (which would have been possible without using the bridge identity).

There are more details about preserving hashed IP addresses in this thread.

Child Tickets

Ticket  Type         Status  Owner    Summary
#2505   enhancement  closed  karsten  Implement bridge descriptor secret manager in metrics-db

Attachments (2)

bridge-ips-scatter-2011-02-10.png (35.2 KB) - added by karsten 8 years ago.
Scatter plot: Unique IP addresses of bridges running at least 24 hours in Nov 2008
bridge-ips-ecdf-2011-02-10.png (28.6 KB) - added by karsten 8 years ago.
Cumulative distribution: Unique IP addresses per day of bridges running at least 24 hours in Nov 2008

Change History (7)

comment:1 Changed 8 years ago by karsten

Christian and I discussed this approach some more. Christian is concerned that someone might brute force the secret. The attacker could set up a few bridges, remember their IP addresses and bridge identities, look up the sanitized descriptors in our archives, and try out which secret leads to the same 10.x.x.x address in our descriptors. This attack could be performed offline. He suggests using a much longer secret and changing it regularly.
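The offline attack can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the secret is deliberately shrunk to 2 bytes so that the search finishes quickly, and the names sanitize_ip and brute_force are hypothetical. With a realistically long random secret the same loop would not terminate in any useful time, which is the point of making the secret longer.

```python
import hashlib
import ipaddress
from itertools import product


def sanitize_ip(ip, identity, secret):
    """H(IP address + bridge identity + secret)[:3] with SHA-1, as 10.x.x.x."""
    ip_bytes = ipaddress.IPv4Address(ip).packed
    x = hashlib.sha1(ip_bytes + identity + secret).digest()[:3]
    return "10.%d.%d.%d" % (x[0], x[1], x[2])


def brute_force(observations, secret_len):
    """Return every secret of secret_len bytes consistent with all observations.

    Each observation is (real IP, bridge identity, sanitized address), which
    is what an attacker running their own bridges could collect from the
    public archives.
    """
    matches = []
    for candidate in product(range(256), repeat=secret_len):
        secret = bytes(candidate)
        if all(sanitize_ip(ip, ident, secret) == seen
               for ip, ident, seen in observations):
            matches.append(secret)
    return matches


if __name__ == "__main__":
    true_secret = bytes([0x12, 0x34])  # toy 2-byte secret, for the demo only
    identity = bytes(range(20))        # placeholder fingerprint
    observations = [
        (ip, identity, sanitize_ip(ip, identity, true_secret))
        for ip in ("192.0.2.1", "198.51.100.7")
    ]
    print(brute_force(observations, 2))
```

Because only 3 bytes of output are published, a single observation may match several candidate secrets by chance. Each additional bridge the attacker controls narrows the candidate list further.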

I somewhat dislike the idea of changing the secret regularly, because it means we cannot compare the sanitized IP addresses of multiple intervals easily. But we're probably safer by changing it, e.g., monthly. Using a longer secret, say, 40 or 60 bytes (or even longer?), is a fine idea, too.

comment:2 Changed 8 years ago by karsten

Ian suggests on or-dev using a 31-byte secret here. The idea is to fit IP address, bridge identity, and secret into one SHA block, which has room for 447 bits of input (512 bits minus the mandatory 1-bit padding marker and the 64-bit length field). The IP address is 32 bits and the bridge identity is 160 bits, so we have 255 bits left, or 31 bytes because we're byte-aligned.

Ian also suggests using SHA-256 instead of SHA-1, mostly because SHA-1 shouldn't be used for anything new at this point.
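The arithmetic behind the 31 bytes, together with a SHA-256 variant of the transformation, could look like the following Python sketch. The constant and function names are made up for illustration, and whether metrics-db implements it exactly this way is not stated here.

```python
import hashlib
import ipaddress

BLOCK_BITS = 512        # SHA-1 and SHA-256 both process 512-bit blocks
PADDING_BITS = 1 + 64   # mandatory '1' bit plus 64-bit message length
IP_BITS = 32
IDENTITY_BITS = 160

MESSAGE_ROOM_BITS = BLOCK_BITS - PADDING_BITS                # 447
SECRET_BITS = MESSAGE_ROOM_BITS - IP_BITS - IDENTITY_BITS    # 255
SECRET_LEN = SECRET_BITS // 8                                # 31 whole bytes


def sanitize_ip_sha256(ip, identity, secret):
    """Same construction as in the description, but with H = SHA-256."""
    if len(identity) != 20 or len(secret) != SECRET_LEN:
        raise ValueError("expected 20-byte identity and 31-byte secret")
    ip_bytes = ipaddress.IPv4Address(ip).packed
    # 4 + 20 + 31 = 55 bytes = 440 bits, which fits into a single block.
    x = hashlib.sha256(ip_bytes + identity + secret).digest()[:3]
    return "10.%d.%d.%d" % (x[0], x[1], x[2])


if __name__ == "__main__":
    import os

    print(SECRET_LEN)
    print(sanitize_ip_sha256("192.0.2.1", bytes(range(20)), os.urandom(SECRET_LEN)))
```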

comment:3 Changed 8 years ago by karsten

Status: changed from new to assigned

Yesterday I finished the implementation of hashed IP addresses in metrics-db (#2505). I also sanitized some old bridge descriptors from 2008 with the new algorithm last night.

Here's an early analysis of sanitized bridge descriptors containing IP address hashes. The idea of the analysis is to compare the number of unique IP addresses of a bridge with the number of statuses that contain this bridge.

There are two graphs in the attachments. The first graph shows a scatter plot of unique IP addresses against days of operation. Only bridges with at least 24 hours of operation are shown. The accumulation of points at the lower left of the graph represents bridges with only a few days of operation. These bridges are probably not as useful for bridge users, because they are unavailable most of the time. In contrast, the accumulation of points with almost 30 days of operation and only very few unique IP addresses indicates stable bridges on static IP addresses, which are probably the most useful for bridge users. Points close to the dashed line indicate bridges that change their IP address once a day. Points above the dashed line are probably not as useful for clients either, because they change their IP address more than once per day. These bridges are only useful if bridge users download new bridge descriptors for known bridges from the bridge authority.

The second graph shows the cumulative fraction of bridges having a given number of unique IP addresses per day. Again, the dashed line indicates bridges on dynamic IP addresses that change their IP address once a day. Two thirds of the bridges either have static IP addresses or change their address at most once a day. This leaves us with one third of bridges changing their IP address more often than that.
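For anyone who wants to redo this kind of analysis on the sanitized descriptors, a rough Python sketch follows. It assumes the descriptors have already been parsed into (fingerprint, day, sanitized IP address) tuples. That input format and both function names are assumptions for illustration, not the scripts used to produce the graphs.

```python
from collections import defaultdict


def unique_ips_per_day(observations):
    """Return {fingerprint: unique sanitized IP addresses / days of operation}."""
    ips = defaultdict(set)
    days = defaultdict(set)
    for fingerprint, day, ip in observations:
        ips[fingerprint].add(ip)
        days[fingerprint].add(day)
    return {fp: len(ips[fp]) / len(days[fp]) for fp in ips}


def fraction_at_most(ratios, threshold=1.0):
    """Fraction of bridges with at most `threshold` unique addresses per day."""
    values = list(ratios.values())
    return sum(v <= threshold for v in values) / len(values)


if __name__ == "__main__":
    # Made-up sample data, not taken from the November 2008 descriptors.
    sample = [
        ("A", 1, "10.1.1.1"), ("A", 2, "10.1.1.1"), ("A", 3, "10.1.1.1"),
        ("B", 1, "10.0.0.1"), ("B", 1, "10.0.0.2"),
        ("B", 2, "10.0.0.3"), ("B", 2, "10.0.0.4"),
        ("C", 1, "10.2.2.1"), ("C", 2, "10.2.2.2"),
    ]
    ratios = unique_ips_per_day(sample)
    print(ratios)
    print(fraction_at_most(ratios))
```

Evaluating fraction_at_most at increasing thresholds gives the points of a cumulative distribution like the one in the second graph. A threshold of 1.0 corresponds to the dashed line.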

The next steps are:

  • Update the specification-like description of our sanitizing process here.
  • Post the sanitized descriptors from November 2008 to or-dev for others to look at.
  • Sanitize the 2.5 years of descriptors that we have once again and make them available on the metrics website.

I'm planning to do the first two items today and publish the sanitized descriptors next Tuesday (assuming the sanitizing process finishes by then).

comment:4 Changed 8 years ago by karsten

The technical report is updated, and or-dev has another mail from me on the topic. Starting to sanitize descriptors using the new algorithm...

comment:5 Changed 8 years ago by karsten

Resolution: implemented
Status: changed from assigned to closed

New tarballs are available and announced on tor-dev. Closing.
