Opened 9 years ago

Closed 8 years ago

#2718 closed task (implemented)

Analyze Tor usage data for ways to automatically detect country-wide blockings

Reported by: karsten
Owned by: tomb
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords:
Cc: tomb@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Every now and then, there are country-wide blockings of Tor. In most cases we learn about these events from users telling us that Tor has stopped working for them. This may work okay, but given that we already have usage data per country, we should be able to detect blockings ourselves, preferably automatically and with as few false positives as possible.

I already spent some time on a censorship detector that takes our usage data as input and tells us whenever the usage on a given day falls outside an expected interval. But I'm afraid I don't know enough math to push this further, at least not without reading more about time series analysis. Maybe someone wants to pick this up?

Here's where I am:

We take our estimated daily user numbers as input. Our goal is to give out a warning whenever the estimated user number from a given country drops below a predicted value. This predicted value is not static but should depend on previous values; therefore we should use time series analysis. We want to model the user numbers for days 1..n-1, predict a value for day n, and warn if the actual value for day n is lower than the predicted value minus some error.

I read some stuff about time series analysis and came up with the ARIMA model. Thankfully, the ARIMA model is already implemented in R.
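
For illustration, here's a rough sketch in Python of the warning rule I have in mind. The real code (described below) is in R; the statsmodels-based version here, the ARIMA order, and the confidence level are only placeholders:

  # Illustrative sketch only -- not the R code attached to this ticket.
  from statsmodels.tsa.arima.model import ARIMA

  def check_latest_value(users):
      """users: pandas Series of estimated daily user numbers for one country."""
      history, today = users.iloc[:-1], users.iloc[-1]
      fit = ARIMA(history, order=(1, 1, 1)).fit()       # order is a guess
      forecast = fit.get_forecast(steps=1)
      lower = forecast.conf_int(alpha=0.01).iloc[0, 0]  # lower bound of the 99% interval
      if today < lower:
          print("warning: %d users observed, expected at least %d" % (today, lower))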

I'm going to upload some R code to the metrics-tasks repository once I have a ticket number (see comment below). The R code generates a PDF that shows on which days we'd receive a warning. I'm also going to attach the PDF to this ticket. Here's how you can run the R code yourself:

$ wget https://metrics.torproject.org/csv/direct-users.csv
$ R --slave -f detect-censorship.R

Possible next steps are a) finding good parameters for the ARIMA model, b) trying other time series models, and c) extending the approach to bridge users. Once we have a useful approach for estimated daily user numbers, we should d) try to get rid of day-based statistics which have a delay of 1--2 days and make the approach work for directory request stats and connecting bridge user stats to get results more quickly. The final step is to e) integrate the R code with the metrics website and execute it every few hours.

Child Tickets

Attachments (2)

detect-censorship.pdf (30.1 KB) - added by karsten 9 years ago.
Censorship detector warnings
dirreqs.csv.bz2 (1.1 MB) - added by karsten 9 years ago.
Raw directory requests by country


Change History (17)

Changed 9 years ago by karsten

Attachment: detect-censorship.pdf added

Censorship detector warnings

comment:1 Changed 9 years ago by karsten

Owner: karsten deleted
Status: new → assigned

See https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-2718 for the source code as promised in the ticket description.

comment:2 Changed 9 years ago by karsten

George Danezis and I discussed the idea of a censorship detector in private mail. I'm moving that discussion here with his permission:

I spent the past few days looking at the .csv with the numbers of users of Tor from different domains. I think I have a first algorithm that will allow you to analyse drops in usage and raise automatic alarms if they are "unexpected". I attach, for example, the graphs for China (an obvious example) but also Tunisia (which is more subtle):
http://ephemer.sec.cl.cam.ac.uk/~gd216/tor_detect/img/019-cn-censor.png
http://ephemer.sec.cl.cam.ac.uk/~gd216/tor_detect/img/003-tn-censor.png

You can find all the graphs in which I detected possible censorship in the past 6 months.
http://ephemer.sec.cl.cam.ac.uk/~gd216/tor_detect/img/

How does this work? Here are a few notes:

  • [Wide Variation] It seems that the Tor network is not very stable generally -- numbers of users fluctuate from one day to another, sometimes with no obvious reason and across jurisdictions. This may be due to relays coming and going or the capacity of the network going up or down for other reasons. For this reason it is not reliable to look at the numbers of users from each jurisdiction and infer if they are lower than normal -- since normal varies widely.
  • [Difficult to study each series separately] Given the above, we define "normal" from one day to the next as following a global trend, i.e. changing in a similar fashion to the rest of the network and other jurisdictions. This means the current censorship detection algorithm will fail if Tor numbers start going globally down at the same time there is a censorship event, or if somehow an attack globally cripples the network.
  • [Modelling inter-domain trends] It turns out there is significant variance between jurisdictions as to whether the numbers of users are trending up or down from one time period to the next. Therefore we look at the trends of the 50 jurisdictions with the most users, eliminate outliers, and fit a normal distribution to the rest of the trends (as percentage changes). This is part of our model for predicting the number of users at any specific jurisdiction: from one time period to the next the numbers should fluctuate within the plausible range of the global trend -- i.e. within the most likely values (probability 0.9999) of this normal distribution.
  • [Small-sample random variation] Some jurisdictions have very few users, and this will create false alarms -- a drop of 50% when you have 5000 users is a significant event, whereas the same drop when you see 10 users might just be due to random fluctuation. Therefore we model the number of users at any point in time as a Poisson distribution with mean equal to the observed number of users. This takes care of the random fluctuation. This is the second key part of our prediction model: instead of taking past numbers at face value, we consider that the true number of users lies somewhere in the most likely region of the Poisson distribution (again the region of 0.9999 probability).
  • [Full Model] Given the Poisson model for the number of users at a given time and the model for trends, we can predict the region within which we expect the next value to lie (with probability about 0.999). In the graphs this is the grey area. When the next observation falls within the prediction, we do not raise an alarm; when it falls outside the prediction, we raise an alarm. Green dots are unexpected up-trends, and orange dots are unexpected down-trends (possible censorship). (An illustrative sketch of this prediction step follows after these notes.)
  • [Results] You can see the graphs for all jurisdictions in which I detected a potential censorship event in the past 6 months. They are lexicographically ordered in terms of the number of possible censorship events. The top ones are ("down" are potential censorship events, "up" unexpected rises, and "affected" the number of users on the last day observed in the jurisdiction):

=======================
Report for 2010-09-11 to 2011-03-15
=======================

cn -- down: 19 (up: 28 affected: 728)
mm -- down: 18 (up: 14 affected: 50)
si -- down: 16 (up: 17 affected: 254)
ir -- down: 13 (up: 23 affected: 8168)
ph -- down: 13 (up: 32 affected: 4265)
hr -- down: 13 (up: 15 affected: 284)
eg -- down: 10 (up: 7 affected: 673)
kr -- down: 9 (up: 12 affected: 22434)
pk -- down: 8 (up: 9 affected: 385)
zw -- down: 8 (up: 8 affected: 18)
tw -- down: 7 (up: 7 affected: 2178)
ba -- down: 7 (up: 7 affected: 63)
ly -- down: 7 (up: 11 affected: 10)
cm -- down: 5 (up: 2 affected: 23)
tz -- down: 5 (up: 8 affected: 18)
ga -- down: 5 (up: 3 affected: 4)
rs -- down: 4 (up: 3 affected: 250)
et -- down: 4 (up: 5 affected: 150)
mk -- down: 4 (up: 5 affected: 49)
tn -- down: 3 (up: 4 affected: 517)
lb -- down: 3 (up: 6 affected: 92)
dj -- down: 3 (up: 1 affected: 19)
vc -- down: 3 (up: 3 affected: 6)
fo -- down: 3 (up: 3 affected: 2)
vn -- down: 2 (up: 3 affected: 1549)
sy -- down: 2 (up: 0 affected: 569)
bd -- down: 2 (up: 4 affected: 457)
aw -- down: 2 (up: 2 affected: 14)
zm -- down: 2 (up: 2 affected: 7)
gy -- down: 2 (up: 4 affected: 4)
ls -- down: 2 (up: 1 affected: 2)

  • [Estimation delay window] One parameter of the model is the length of the time periods. In other words: are we trying to model from today's numbers of users what is going on tomorrow, or what is going on next week? Using the previous day gives nice tight predictions, BUT some jurisdictions show a really freakish weekly pattern -- thus I settled for a 7-day window. This means that the value for today is used to predict the value of the same day a week in the future.
  • [Freakish weekly patterns] Some jurisdictions show a very strange weekly pattern that even the 7-day window detector sometimes mistakes (?) for an attack. Have a look at the series for "kr" (South Korea): there is a weekly variation between 5K and 20K users -- high on weekdays and low on the weekend. This is not typical -- it is the only jurisdiction where such a variation is observed. Do you know why that is? Other jurisdictions with similar pronounced weekly patterns are: tw (Taiwan), et (Ethiopia), id (Indonesia). What is going on there?
  • [Blind spots] The detector looks for differences in the numbers of users within a jurisdiction as well as across them, and detects any anomalies. This means that alerts are raised when there is a change -- if you have been under censorship forever, there is no unexpected drop and no alert is raised. Similarly, if for the time window chosen the rate of change falls within the expected range (which can be significant), no alert is raised. A cunning censor (with time on their hands) will lower the numbers slowly enough to evade detection given a short window -- I recommend you run the algorithm with multiple windows to detect such strategies. It is also difficult to tell whether Tor is blocked or the whole country is offline (see Libya over the past few weeks).
  • [Validation] Needless to say, I have no labelled data to validate the results I get. They vaguely "make sense", but of course how do we know whether some of the reported alerts are in fact artefacts of a poor prediction model? (See the jurisdictions with weekly trends, for example.) In some respects this does not really matter: in practice the algorithm gives at most a handful of reports every day, so the volume is small enough for a human to "keep an eye" on the reports and make a call about their severity (given input from other sources as well -- like news, for example).
  • [Early warnings] Even events that do not follow a trend might give you an "early warning" -- Burma, for example, showed an alert in April 2010 followed by a rise in users, then followed by a massive crash and censorship. Iran (not shown) also gave a couple of alerts more than a year ago, which may have been tests of the system they now use all the time.
  • [Code] All of the above is implemented using an ugly 300-line Python script with dependencies on scipy, numpy and matplotlib. I am cleaning it up and will be happy to pass it on once it is stable and pretty.
  • [Model refinement] This is a first, rough-and-ready model that I plan on refining further: (a) automatically select the time-window (b) learn the traffic self-similarity (c) offer a full Bayesian model + a particle filter based sampler for whether an unexpected event is occurring. I would be most happy for any feedback on this initial model -- what is useful, what is useless, do you want more / less sensitivity, do you know of events not detected, other sources of information for prediction, etc.
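
To make this concrete, here is a rough sketch of the prediction step as I read it from the notes above. This is illustrative only and not George's actual detector.py: the function names are made up, outlier removal is skipped, and all jurisdictions are used instead of only the top 50.

  import numpy as np
  from scipy.stats import norm, poisson

  def plausible_range(series, country, t, interval=7, mass=0.9999):
      """series: dict mapping country code to a list of daily user estimates.
      Returns the (low, high) range expected for `country` on day `t`."""
      # Global trend: ratio of users now vs. one interval ago, across
      # jurisdictions, modelled as a normal distribution.
      ratios = [s[t] / s[t - interval] for s in series.values() if s[t - interval] > 0]
      trend_lo, trend_hi = norm.interval(mass, loc=np.mean(ratios), scale=np.std(ratios))

      # Small-sample noise: treat the earlier observation as the mean of a
      # Poisson distribution and take its most likely region.
      count_lo, count_hi = poisson.interval(mass, series[country][t - interval])

      # Combine: the new observation should lie within the trend range
      # applied to the plausible range of the earlier count.
      return count_lo * trend_lo, count_hi * trend_hi

  def alert(series, country, t, interval=7):
      low, high = plausible_range(series, country, t, interval)
      observed = series[country][t]
      if observed < low:
          return "down"  # possible censorship event
      if observed > high:
          return "up"    # unexpected rise
      return None

In this reading, the "down" and "up" counts in the report above are simply tallies of these alerts per jurisdiction over the report period.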

comment:3 in reply to:  2 Changed 9 years ago by karsten

And here's my reply to George:

Replying to karsten:

  • [Wide Variation] It seems that the Tor network is not very stable generally -- numbers of users fluctuate from one day to another, sometimes with no obvious reason and across jurisdictions. This may be due to relays coming and going or the capacity of the network going up or down for other reasons. For this reason it is not reliable to look at the numbers of users from each jurisdiction and infer if they are lower than normal -- since normal varies widely.

Right, the data we have has a huge variation. The good news is that we're
going to have a higher fraction of relays reporting usage data in the near
future. I hope that our data quality will become better then.

Speaking of input data to your algorithm. How much does your algorithm
care about absolute numbers, and would it be able to process raw
observations made by relays and/or bridges? These raw observations would
tell you what fraction of requests or unique IP addresses were seen at a
single relay or bridge coming from a given country. For example:

  dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800,[...]

  bridge-ips sa=48,us=40,de=32,ir=32,[...]

If you want to have a look, I can provide you with CSV-formatted data and
tell you more details.
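
As a trivial illustration of the format (a made-up snippet, not code from our repositories), such a line could be parsed like this:

  line = "dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800"
  counts = dict((cc, int(n)) for cc, n in
                (item.split("=") for item in line.split(" ", 1)[1].split(",")))
  # counts == {'us': 2368, 'de': 1680, 'kr': 1048, 'fr': 800}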

  • [Difficult to study each series separately] [...]
  • [Modelling inter-domain trends] [...]
  • [Small-sample random variation] [...]
  • [Full Model] [...]

The assumptions made here all sound reasonable to me. I guess we have to
start somewhere and watch if the censorship detector results "make sense"
to us.

  • [Estimation delay window] One parameter of the model is the length of

the time periods. In other words: are we trying to model from today's
numbers of users what is going on tomorrow, or what is going on next
week? Using the previous day gives nice tight predictions, BUT some
jurisdictions show a really freakish weekly pattern -- thus I settled
for a 7-day window. This means that the value for today is used to
predict the value of the same day a week in the future.

I wonder if we can use a 1-day window and a 7-day window at the same time.
The 7-day window is giving us a few (likely false) alerts that a 1-day
window wouldn't. There's probably some influence from the day before and
some influence from the week before.

  • [Freakish weekly patterns] [...]

I have no idea what's going on. This is something that Roger might have
an answer for.

  • [Blind spots] [...]
  • [Validation] [...]
  • [Early warnings] [...]

The assumptions and conclusions made here make sense to me, too.

  • [Code] All of the above is implemented using an ugly 300-line Python

script with dependencies on scipy, numpy and matplotlib. I am cleaning
it up and will be happy to pass it on once it is stable and pretty.

Yes, please. I'm interested in the code, even if it's dirty Python. If
the code is at least somewhat readable, there's no need to clean it up.
I'm fine running Python code with whatever dependencies are necessary to
get this started. Once we have a good idea what's going on, I might
rewrite the relevant parts in Java, R, and ggplot2 for better integration
with our existing codebase and to facilitate maintenance. But right now,
dirty Python is perfectly fine for a prototype phase.

Do you mind if we put your code in Tor's Git repository for metrics code
here?

https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-2718

  • [Model refinement] This is a first, rough-and-ready model that I plan

on refining further: (a) automatically select the time-window (b) learn
the traffic self-similarity (c) offer a full Bayesian model + a particle
filter based sampler for whether an unexpected event is occurring. I
would be most happy for any feedback on this initial model -- what is
useful, what is useless, do you want more / less sensitivity, do you
know of events not detected, other sources of information for
prediction, etc.

The comments here are my first thoughts. I might have more thoughts when
I see the code.

comment:4 Changed 9 years ago by karsten

And here's another reply from George, also pasted here with his permission:

[Individual relays] Roger mentioned to me that you have data from individual relays that is quantised to the closest power of 2. I have not yet started thinking about using those. We could pretend indeed that they each are a "separate tor" and run the detector on them -- this is still likely to miss instances of censorship if the numbers are very low. It is something we will have to investigate after seeing the data feeds.

[Better model] You are right that we can make a more complex model that combines observations from the past 1, 7 or 28 days -- I am happy to work on that next.

[Code] I attach the python code after the most mild clean up. Right now it has no interface and you have to manually edit the source to get to the variables of interest:

def main():

    # Change these to customize script
    CSV_FILE = "direct-users.csv"
    GRAPH_DIR = "img"
    INTERV = 7
    DAYS = 6 * 31
    [...]

You can also use the functions & classes to define a set of graphs you want with different days, intervals, colors, etc. (the code should be happy being loaded as a library). It's under a Clear BSD license (BSD + no-patent license -- since I do not want to compromise my employer's IP unknowingly). Feel free to put it in version control.

[Freakish weekly patterns in South Korea] Please do tell me they are not normal, and there is some explanation about them.

comment:5 in reply to:  2 Changed 9 years ago by karsten

Here's a reply to George's first mail from Robert Ransom, pasted with his permission:

Replying to karsten:

  • [Freakish weekly patterns] Some jurisdictions show a very strange weekly pattern that even the 7-day window detector sometimes mistakes (?) for an attack. Have a look at the series for "kr" (South Korea): there is a weekly variation between 5K and 20K users -- high on weekdays and low on the weekend. This is not typical -- it is the only jurisdiction where such a variation is observed. Do you know why that is? Other jurisdictions with similar pronounced weekly patterns are: tw (Taiwan), et (Ethiopia), id (Indonesia). What is going on there?

Perhaps employers and/or schools in KR censor their employees' and/or
students' Internet access heavily, and Tor is very popular there as a
firewall circumvention tool but not as a privacy tool. Or perhaps
employers in KR are paranoid and require their employees to use Tor
when browsing at work, and their employees don't care quite so much
about their own personal privacy when browsing at home on the weekends.

comment:6 in reply to:  4 Changed 9 years ago by karsten

Replying to George Danezis:

[Individual relays] Roger mentioned to me that you have data from individual relays that is quantised to the closest power of 2. I have not yet started thinking about using those. We could pretend indeed that they each are a "separate tor" and run the detector on them -- this is still likely to miss instances of censorship if the numbers are very low. It is something we will have to investigate after seeing the data feeds.

I'm attaching a CSV file of the directory requests that relays see and report. The format is as follows:

  • fingerprint: Hex-formatted SHA-1 hash of identity fingerprint
  • statsend: ISO-formatted time when the stats interval ends
  • seconds: Stats interval length in seconds, typically 24 hours
  • ??: Directory requests that could not be resolved
  • a1: Directory requests from anonymous proxies
  • a2: Directory requests from satellite providers
  • ad: Directory requests from Andorra
  • ae: Directory requests from the United Arab Emirates
  • [...] See ISO 3166-1 alpha-2 country codes
  • zw: Directory requests from Zimbabwe
  • zy: Total directory requests from all countries

The request numbers are rounded up to the next multiple of 8, minus 4. That is, 1 to 8 requests == "4", 9 to 16 requests == "12", 17 to 24 requests == "20", etc.
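
A tiny illustration of that rounding rule (made up here, not code from the metrics code base):

  def reported_requests(true_count):
      # Round up to the next multiple of 8, then subtract 4 (true_count >= 1).
      return ((true_count + 7) // 8) * 8 - 4

  assert reported_requests(1) == 4 and reported_requests(8) == 4
  assert reported_requests(9) == 12 and reported_requests(17) == 20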

[Better model] You are right that we can make a more complex model that combines observations from the past 1, 7 or 28 days -- I am happy to work on that next.

Yes, that would be interesting. I don't think that whatever happened 28 days ago can have a significant influence on the prediction, but I don't really know.

[Code] I attach the python code after the most mild clean up. Right now it has no interface and you have to manually edit the source to get to the variables of interest:

I added your code to the metrics-tasks repository here.

Changed 9 years ago by karsten

Attachment: dirreqs.csv.bz2 added

Raw directory requests by country

comment:7 Changed 8 years ago by arma

Component: Metrics → Analysis

comment:8 Changed 8 years ago by karsten

I finally had a closer look at George's code and graphs. This looks like a great start!

I wonder how we can move this forward. My suggestion would be to describe George's approach in 2--3 pages of LaTeX and put daily updated graphs on the metrics website. This would allow people to compare possible censorship events in our results to real-world events and give us some feedback. Here's what this plan involves:

  • Write a tiny tech report describing George's censorship detector. This report would briefly motivate the problem of detecting censorship based on our daily user number estimates and then dive into the math behind detector.py. I could write this report, but it wouldn't be as accurate as if George wrote it. Or I could make a start and George corrects/rewrites the parts that I got wrong. Or George could write it himself. George?
  • Run detector.py on the metrics server. Should be straightforward to make cron grab the latest direct-users.csv and run the script once per day. I can take care of this.
  • Generate graphs using our own graphing engine. We should use R and our own graphing engine to integrate the results more closely into our website. I could imagine adding a checkbox "[ ] Show possible censorship events if available (BETA)" below the Source drop-down box on https://metrics.torproject.org/users.html#direct-users. The result would be that the graph would not only contain the estimated user number, but also a gray ribbon for the expected range and green/yellow points for upturns and downturns. This would require disabling graph generation in detector.py and writing the expected user range per country and day to a file (see the sketch after this list). I hope my Python skills will be sufficient to do this.
  • Show a table of recent possible censorship events. We could add a short table of countries with possible censorship events in the past months to the website. This table would go on the same page below the "Update graph" button and have a BETA label, too. There would also be a sentence above the table linking to the report mentioned above. The table content would be similar to the summary.txt file generated by the Python script. I can write the necessary code to parse this file and put the content on the website.
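
To sketch the handoff mentioned in the graphing bullet above (the file name and column layout below are made up, not what detector.py currently does):

  import csv

  def write_expected_ranges(path, ranges):
      """ranges: iterable of (date, country, lower, upper) tuples with the
      expected user range per country and day."""
      with open(path, "w", newline="") as f:
          writer = csv.writer(f)
          writer.writerow(["date", "country", "lower", "upper"])
          writer.writerows(ranges)

The R graphing code could then read this file and draw the gray ribbon from the lower/upper columns.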

Is this plan reasonable?

comment:9 in reply to:  8 Changed 8 years ago by karsten

Replying to karsten:

  • Generate graphs using our own graphing engine. We should use R and our own graphing engine to integrate the results more closely into our website. I could imagine adding a checkbox "[ ] Show possible censorship events if available (BETA)" below the Source drop-down box on https://metrics.torproject.org/users.html#direct-users. The result would be that the graph would not only contain the estimated user number, but also a gray ribbon for the expected range and green/yellow points for upturns and downturns. This would require disabling graph generation in detector.py and writing the expected user range per country and day to a file. I hope my Python skills will be sufficient to do this.

This is now implemented. See, for example, this graph for possible censorship events in Iran visualized as blue and red dots. There are still some false positives, but overall the results are a great start.

comment:10 Changed 8 years ago by karsten

See #3624 for adding a table of possibly censored countries to the metrics website. We should complete that task first and then this one.

comment:11 Changed 8 years ago by karsten

We're now running detector.py every 6 hours on the server, and https://metrics.torproject.org/users.html#censorship-events lists the top-10 countries by possible censorship events.

The next thing to do would be to write the tech report explaining George's algorithm.

comment:12 Changed 8 years ago by karsten

I just added a README that attempts to explain the algorithm to the metrics-tasks repository.

comment:13 Changed 8 years ago by tomb

Cc: tomb@… added
Component: Analysis → Metrics Utilities
Owner: set to tomb
Status: assigned → accepted

I have created child ticket #4180 to use the existing python code to pre-process the data, and to move the statistics part to R.

I am also grabbing ownership of this ticket.

I have changed the Component to Metrics Utilities because I intend this project to grow into a utility maintained over time rather than a one-off.

comment:14 in reply to:  13 Changed 8 years ago by karsten

Component: Metrics Utilities → Analysis

Replying to tomb:

I have created child ticket #4180 to use the existing python code to pre-process the data, and to move the statistics part to R.

Makes sense. See my comment there.

I am also grabbing ownership of this ticket.

Great!

I have changed the Component to Metrics Utilities because I intend this project to grow into a utility maintained over time rather than a one-off.

Right, having a tool and maintaining it over time is the ultimate goal. But I think it's too early to write that tool now. This ticket is mainly about comparing different detection algorithms. Once we agree on an algorithm, we should open a new ticket and turn that algorithm into a tool. It's unclear whether that'll be a stand-alone tool like ExoneraTor or VisiTor, which are supposed to be run by users and which are part of the Metrics Utilities component, or a web-based thing, e.g., part of the Metrics Website. For now, I think it's best to leave this ticket in the Analysis component. Changing that back.

comment:15 Changed 8 years ago by karsten

Resolution: implemented
Status: accepted → closed

We have an approach for detecting country-wide blockings deployed. Former child ticket #4180 is now a parent-less ticket aiming to improve the deployed censorship detector. No need to keep this ticket open. Closing.
