** Session: PrivCount Technical Issues **
10/03/2018
12:00 -- 13:00
Session lead: teor
Scope: Solve remaining issues to securely implement PrivCount.
Agenda:
- Action bounds - Consistent derivation
- Adding new statistics - how to estimate value and cost
- Binning outputs for privacy
- Debugging protocol implementation
- Floating point and differential privacy issues
- Malicious inputs - prevent them from modifying output too much
- Metrics statistical analysis
- Noise
- How much excess noise to handle malicious inputs
- How to divide noise among relays
- Issue with low per-relay noise and possible zero truncation
- Function to sample Gaussian noise
- Versioning
- Ensure noise is correctly despite multiple versions
Initial questions:
- Question: What is the vision for running PrivCount? Answer: Goal is every relay submits data via PrivCount.
- Question: Do we need a minimum number of relays running PrivCount before we started outputting statistics? Answer: Not necessarily for the noise, but perhaps just to provide some level of aggregation.
Discussion:
- Action bounds
- How do we do consistent derivation?
- We should run a client and see what activity gets generated. What user applications get tested isn't clear.
- We could perhaps use measurements about popular client applications to determine what user activity to simulate
- Adding new statistics
- How do we estimate value for purposes?
- We can ramp up.
- Binning outputs for privacy
- Perhaps bin size could be the action bound.
- Do the rounding after the aggregation to handle floating point issues. Reference: "On significance of the least significant bits for differential privacy", Mironov, https://dl.acm.org/citation.cfm?id=2382264.
- Some small multiple (say, 10) of the action bound seems like a reasonable bin size
- Debugging protocol implementation
- Run protocol on a test network
- Debugging mode when each relays individually adds sufficient noise to its statistic, and compare to side-by-side to current method
- Send a cell on a circuit on a circuit that says "measure me" with an ID that has the experiment number, or just pin a node that isn't in the consensus
- Issue with low per-relay noise and possible zero truncation
- Used fixed-point arithmetic, where the decimal point position is chosen based on:
- the expected value of the statistic: avoid overflowing the ≈ 2^60^ available bits of the shamir secret (our field elements are 2^62^ - 2^30^ - 1, which can contain 61-bit 2's complement integers. We use the remaining 2^61^ - 2^30^ - 1 values to detect overflow with probability ≈ 0.5)
- the noise allocated to the statistic: allow enough (how much? 8 extra bits?) precision in the smallest possible noise added by a relay (a relay with consensus weight 1 will add sqrt(1/24 hours * 1/50 million total consensus weight) * noise allocation ≈ 1 / 2^15^ * noise allocation)
- Function to sample Gaussian noise
- Need to re-implement sampling from a Gaussian distribution
- Use discrete Gaussian distribution, lattice-based cryptography implementations should have this, e.g. implementation of BV (Brakerski and Vaikuntanathan)
- one alternative: [https://arxiv.org/abs/1303.6257 Sampling exactly from the normal distribution]
- [http://exrandom.sourceforge.net/ Project page], [https://sourceforge.net/projects/exrandom/files/distrib/ tarballs], also implemented in MPFR as mpfr_nrandom
- Algorithm D outputs Gaussian-distributed integers with standard deviation σ, using approximately (1/0.715)log,,2,,σ uniformly random bits per sample
- Since our Gaussians will be scaled along with our counters, we will be sampling Gaussians with σ between 2^15^ and 2^60^, using approximately 21-84 bits per sample
- This method sets the low bits of the Gaussian directly from the uniformly random input bits, much like [https://gitweb.torproject.org/torspec.git/tree/proposals/288-privcount-with-shamir.txt#n452 Appendix C of the PrivCount shamir spec]
- another alternative: [https://eprint.iacr.org/2013/383.pdf Lattice Signatures and Bimodal Gaussians]
- [http://bliss.di.ens.fr/ Original source code], [https://wiki.strongswan.org/projects/strongswan/wiki/BLISS production-grade C implementation]
- Algorithm 12 in Section 6 outputs Gaussian-distributed integers with standard deviation σ for σ = kσ,,2,, (k integer, σ,,2,, = sqrt(1/(2ln2)) ≈ 0.849), using approximately 4 + 2log,,2,,σ uniformly random bits per sample
- Since our Gaussians will be scaled along with our counters, we will be sampling Gaussians with σ between 2^15^ and 2^60^, using approximately 34-124 bits per sample
- We will round up σ to the nearest kσ,,2,,. These large σ make the difference between kσ,,2,, and σ negligible
- background: [https://www.math.auckland.ac.nz/~sgal018/gen-gaussians.pdf SAMPLING FROM DISCRETE GAUSSIANS FOR LATTICE-BASED CRYPTOGRAPHY ON A CONSTRAINED DEVICE]
- [https://github.com/shaih/HElib HELib (Homomorphic Encryption in C++)] ~~should also have this~~ [https://github.com/shaih/HElib/blob/a49dd5210037c187ca4ea05ec8f532697262db20/src/NumbTh.cpp#L694 uses the box-mueller transform in floating-point and rounds to the nearest integer]
- Malicious inputs
- How to prevent small number of relays from modifying output too much
- Report individual statistics with large amount of noise to check for reasonable size, then aggregate inputs with smaller noise
- Divide relays into buckets and report aggregate values from each bucket, choose at least bucket using trust or bandwidth or flags, choose shared random value to generate buckets produced *after* the inputs have been received
- Specific proposal: choose a few (say, 20) of the largest relays for one bucket, choose a few relays randomly based on bandwidth (say, 20), and also release the network-wide aggregate. Use the bucket values to "check" the network aggregate.
- Metrics statistical analysis
- PrivCount needs to tell Metrics *which* relays provided inputs.
- PrivCount also needs to tell Metrics the standard deviation of the noise.
- Sampling error needs to be taken into account. Ian wrote an email describing how to do this.
- Noise
- How much excess noise to handle malicious inputs
- How to divide noise among relays
- How to share noise across statistics.
- Divide it based on consensus weight. Then make some assumption about maximum fraction of bandwidth that is malicious or fails, which answers how excess noise is added.
- Have a minimum requirement also for number (or weight) of relays that need submit inputs to proceed with the aggregation.
- Versioning
- Ensure noise is correctly despite multiple versions
- Noise parameters and scaling
- If relays know the noise allocated to each statistic (and the number of relays on each version?), they can calculate how much noise to add to the statistics they are collecting