Opened 7 years ago

Closed 2 years ago

#6003 closed project (wontfix)

Quantitative user studies of how people use Tor

Reported by: phobos
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity: Normal
Keywords: SponsorZ
Cc: runa
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Run a quantitative user study of how people use Tor. Many, many people want to know the top 100 destinations seen by tor exit relays. We'll need to figure out safe ways to collect this data in bulk, analyze it, and publish it for others to use/see/manipulate. Further questions involve top destinations over the day, weeks, months, and years for the tor network as a whole.

The current answer may be "there is no safe way to collect this data", but we won't know until we do some research and arrive at an answer. A few researchers have simply recorded all traffic, or samples of traffic, from their exit relays and then run afoul of ethics boards and human-subjects review at their institutions.

Researching a better method, publishing the method, and testing the method in a simulator may be one possible process.

Change History (7)

comment:1 Changed 7 years ago by runa

Cc: runa added

comment:2 Changed 7 years ago by karsten

Component: Metrics Website → Analysis

Interesting idea. I agree that we should think about safe ways to do it, rather than letting others do it unsafely. Thinking aloud here:

I could imagine that exit nodes report either of the following two statistics to the directory authorities:

  1. top-10/20/50 domain names resolved in the last 24 hours;
  2. top-10/20/50 IP addresses exited to in the last 24 hours.

I don't know the details of domain name resolution in Tor, but I think 1 isn't impossible to implement, and it would answer your question better than 2. For either statistic, Tor would only give out the top 10/20/50 results, round them up to multiples of some number, aggregate over 24 hours, etc. There could also be a threshold below which the exit node doesn't report any specific results and just reports "other" domain names or IP addresses.
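A minimal sketch of the per-relay reporting rule described above, assuming the exit already holds a mapping from destination (domain name or IP address) to a 24-hour count. The function name `summarize_destinations` and the parameters `top_n`, `bin_size`, and `threshold` are hypothetical placeholders for the values that would still need to be chosen, not anything Tor implements:

```python
import math
from collections import Counter


def summarize_destinations(counts, top_n=10, bin_size=8, threshold=16):
    """Reduce raw per-destination counts to a coarse top-N report.

    counts    -- mapping of destination -> observations in the last 24 hours
    top_n     -- how many destinations may be named at most (assumed value)
    bin_size  -- reported counts are rounded up to a multiple of this (assumed)
    threshold -- destinations seen fewer times than this are never named (assumed)
    """
    report = {}
    # Only the top-N destinations are candidates for being named at all.
    for name, count in Counter(counts).most_common(top_n):
        if count >= threshold:
            # Round up to the next multiple of bin_size to hide exact counts.
            report[name] = math.ceil(count / bin_size) * bin_size
    # Everything that was not named is folded into a single "other" bucket.
    other = sum(c for name, c in counts.items() if name not in report)
    if other:
        report["other"] = math.ceil(other / bin_size) * bin_size
    return report


example = summarize_destinations(
    {"a.example": 101, "b.example": 17, "c.example": 3, "d.example": 2},
    top_n=2,
)
```

Whether rounding and a threshold are enough to protect users is exactly the open question here; the sketch only shows the mechanics, not a privacy argument.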

So, I think we could come up with parameters to protect users' privacy enough here (in particular if we can convince other researchers to use our data instead of doing their let's-log-everything approach). The exit nodes would report these statistics in their extra-info descriptors to the directory authorities where we collect them, make them public to anyone, and analyze them. As part of this analysis we could also extrapolate numbers to compensate for missing statistics and present an overall top-100 list for a given day, week, month, or year.
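The analysis step could look roughly like the following sketch. It assumes each exit's report is a mapping from destination to an already-rounded count, possibly with an `"other"` bucket, and that we know what fraction of total exit capacity the reporting relays represent. Dividing by that fraction is only one simple way to extrapolate and is an assumption here; `network_top` and `reporting_fraction` are hypothetical names:

```python
from collections import Counter


def network_top(reports, reporting_fraction, top_n=100):
    """Merge per-relay reports into one extrapolated top-N list.

    reports            -- iterable of dicts: destination -> reported count
    reporting_fraction -- share of exit capacity that reported, in (0, 1]
    top_n              -- length of the resulting list
    """
    if not 0 < reporting_fraction <= 1:
        raise ValueError("reporting_fraction must be in (0, 1]")
    totals = Counter()
    for report in reports:
        for name, count in report.items():
            # The catch-all bucket carries no destination, so it is skipped.
            if name != "other":
                totals[name] += count
    # Scale up to compensate for exits that did not report statistics.
    return [
        (name, round(count / reporting_fraction))
        for name, count in totals.most_common(top_n)
    ]


ranking = network_top(
    [{"a.example": 104, "other": 8}, {"a.example": 16, "b.example": 24}],
    reporting_fraction=0.5,
)
```

Summing the daily reports over longer windows would give the weekly, monthly, or yearly lists in the same way.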

Is this very roughly what you had in mind? If so, we should move the discussion to the tor-dev mailing list for added fun.

(Moving this ticket to the Analysis component for the same reason as #6002.)

comment:3 Changed 7 years ago by karsten

Keywords: SponsorZ added
Milestone: Sponsor Z: November 1, 2013 deleted

Switching from using milestones to keywords for sponsor deliverables. See #6365 for details.

comment:4 Changed 7 years ago by phobos

Sounds like what the ticket was getting at. The challenges here are few:

  1. Figuring out how to do this in a privacy preserving way.
  2. Generating data sets over time for publication.
  3. Keeping away from the slippery slope of "if you can monitor it, you can censor it".

comment:5 Changed 7 years ago by karsten

1 and 2 are fine to discuss on tor-dev. I'm optimistic that we can find privacy-preserving solutions here. Whenever this ticket gets funded or we decide to do it anyway even without funding, I'm happy to move the discussion there and do some initial studies.

3 might be more difficult. The thing is, we can already monitor what domain names we resolve---that is, if I understand domain name resolution on exit relays correctly---and to what destinations we exit. So, if somebody follows the rationale "if you can monitor it, you can censor it", that would already apply. However, I'm less worried about the statistics discussed in this ticket than about those in #6002. Both the domain names to resolve and the destination IP addresses to exit to are information that exit relays need to have anyway. That's different than #6002 where we don't even have protocol information yet that we could use to censor given protocols. Implementing #6002 would change that, whereas implementing this ticket wouldn't add a single bit of information to what exit relays already know.

comment:6 Changed 3 years ago by cass

Severity: Normal

This ticket is tagged SponsorZ, but it looks like progress stalled four years ago and the path forward isn't clear. Is this still a thing for which we need funding?

comment:7 Changed 2 years ago by karsten

Resolution: wontfix
Status: new → closed

Closing tickets in Metrics/Analysis that have been created 5+ years ago and not seen progress recently, except for the ones that "nickm-cares" about.
