Opened 7 years ago

Closed 13 months ago

#6180 closed task (wontfix)

Detecting censorship in HTML pages

Reported by: hellais
Owned by: hellais
Priority: Medium
Milestone:
Component: Archived/Ooni
Version:
Severity: Normal
Keywords: ooni_wishlist, ooni_research, archived-closed-2018-07-04
Cc: ioerror, isis
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The HTTP test of OONI makes a TCP connection to the target host and sends an HTTP request to obtain a webpage. If a webpage is retrieved, it may be one served by the censor rather than by the real site.
The issue is understanding whether such a page is the legitimate response or a block page. How do we do this?

The naive way to do so is to make the same request over Tor and check whether the result matches the one retrieved over the live network. This has some problems, though: for example, if the site is geolocalized, the page fetched through Tor will differ from the one fetched locally.
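
A minimal sketch of this comparison, assuming a local Tor client with its SOCKS port on 9050 and the third-party requests library installed with SOCKS support; the URL is only an example:

{{{#!python
import requests

# Assumes a Tor client listening on the default SOCKS port 9050 and
# that requests was installed with SOCKS support (pysocks).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch(url, proxies=None):
    """Fetch a page body, optionally through a proxy."""
    return requests.get(url, proxies=proxies, timeout=30).text

def differs_over_tor(url):
    # Naive byte-for-byte comparison: geolocalized or dynamic pages
    # will produce false positives.
    return fetch(url) != fetch(url, proxies=TOR_PROXIES)

print(differs_over_tor("http://example.com/"))
}}}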

Another simple approach is to keep a database of content lengths of websites, but this will also fail if the block page happens to be close in size to the real page.
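
A sketch of that lookup, assuming a pre-built mapping from URL to expected body length; the entries and the 10% tolerance are illustrative:

{{{#!python
# Hypothetical database of expected body lengths, built ahead of time
# from measurements taken on an uncensored network.
EXPECTED_LENGTHS = {"http://example.com/": 1256}

def looks_blocked(url, body, tolerance=0.1):
    """Flag a body whose length deviates from the baseline by more
    than `tolerance` (a fraction of the expected length)."""
    expected = EXPECTED_LENGTHS.get(url)
    if expected is None:
        return None  # no baseline, cannot say
    return abs(len(body) - expected) > tolerance * expected
}}}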

Another approach is to find a smart fuzzy matching algorithm for the fetched page.
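
One possible starting point is the standard library's difflib; a sketch, with an arbitrary 0.8 threshold:

{{{#!python
import difflib

def similarity(page_a, page_b):
    """Return a 0..1 similarity ratio between two HTML bodies."""
    return difflib.SequenceMatcher(None, page_a, page_b).ratio()

def is_probably_blocked(expected_body, measured_body, threshold=0.8):
    # Below the threshold, the measured page likely isn't the real one.
    return similarity(expected_body, measured_body) < threshold
}}}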

Other ideas?

Child Tickets

Change History (8)

comment:1 Changed 7 years ago by isis

Status: new → needs_review

If we are packaging Tor, then could we simply tell the OONI copy of Tor to allow .exit notation and always direct it at the same reliable exit node for a given portion of the test? Or specify only those reliable nodes in a given region? This does obviously increase fingerprintability and potentially even the attack surface.
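
A minimal torrc sketch of that idea; the exit fingerprint is a placeholder, and the AllowDotExit option may not be available in every Tor version:

{{{
# Pin all circuits to one known-reliable exit (placeholder fingerprint).
ExitNodes $ABCDEF0123456789ABCDEF0123456789ABCDEF01
StrictNodes 1
# Permit hostname.exit addresses so a request can name its exit explicitly.
AllowDotExit 1
}}}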

Another workaround to the geolocation problem might be limiting the test to only request non-geolocalized pages, i.e. not google.com or facebook.com or whatever. There are plenty of vanilla static webpages out there; we'd just have to be sure that as OONI grows we don't hammer them, and also not pick ones so obscure that we're essentially revealing to that server which users run OONI tests.

As for your last suggestion, we could use the HTTP fuzzy match algorithm from the captive portal test, which is not yet ported to the new Twisted-based framework but is in its old version in the captive portal branch of my OONI repo on github. That algorithm could definitely use some smarts, however.
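
For comparison, a token-based variant of such a fuzzy match (a sketch, not the captive portal test's actual algorithm), treating each page as a set of words and measuring Jaccard similarity:

{{{#!python
import re

def tokens(body):
    """Lowercased word tokens of an HTML body."""
    return set(re.findall(r"[a-z0-9]+", body.lower()))

def jaccard(page_a, page_b):
    """Jaccard similarity between the pages' token sets (0..1)."""
    a, b = tokens(page_a), tokens(page_b)
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / float(len(a | b))
}}}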

comment:2 Changed 7 years ago by karsten

Keywords: SponsorH201206 added
Milestone: Sponsor H: June 2012

Switching from using milestones to keywords for sponsor deliverables. See #6365 for details.

comment:3 Changed 7 years ago by hellais

We discussed this at length during the OONI hackfest with mct, and this is what we talked about:

We can divide the possible approaches for distinguishing a blocked page from the correct result page into two categories: statistical approaches and heuristics.

# Statistical approach

  • Bayesian
  • SVM
  • Fuzzy signature matching
  • Other classifiers

What we want to do is run these classifiers on a corpus of block pages and real websites; a sketch of one such classifier follows.
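
This sketch uses scikit-learn (an assumption; no library is settled on in this ticket), with a toy corpus standing in for the real collection:

{{{#!python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for collected block pages (label 1) and
# known-good snapshots of real sites (label 0).
pages = [
    "access to this site has been blocked by court order",
    "this website is not available in your country",
    "welcome to example com home news contact",
    "latest articles subscribe to our newsletter",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(pages)

classifier = MultinomialNB()
classifier.fit(features, labels)

measured = vectorizer.transform(["this site has been blocked"])
print(classifier.predict(measured))  # [1] -> classified as a block page
}}}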

# Heuristics

  • List of regexps that match known censorship block pages (a sketch follows)
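
The patterns in this sketch are invented examples, not an actual fingerprint list:

{{{#!python
import re

# Hypothetical fingerprints of known block pages.
BLOCKPAGE_PATTERNS = [
    re.compile(r"access.{0,20}denied", re.I),
    re.compile(r"blocked by (court )?order", re.I),
    re.compile(r"this (web)?site (has been|is) blocked", re.I),
]

def matches_known_blockpage(body):
    """Return the first matching pattern, or None."""
    for pattern in BLOCKPAGE_PATTERNS:
        if pattern.search(body):
            return pattern.pattern
    return None
}}}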

We also talked about having clients tell the backend what they got as a response and having the backend figure out whether such a page should be a block page or the correct result.
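
A sketch of what such a client report might contain; this schema is entirely hypothetical:

{{{#!python
import json

# Hypothetical report a client could submit; the backend would compare
# the body and its length against known baselines for the URL.
report = {
    "url": "http://example.com/",
    "status_code": 200,
    "headers": {"Content-Type": "text/html"},
    "body_length": 1256,
    "body": "<html>...</html>",
}
print(json.dumps(report))
}}}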

comment:4 Changed 7 years ago by isis

We also talked about having clients tell the backend what they got as a response and having the backend figure out whether such a page should be a block page or the correct result.

This is similar to what Bismark does: they have the client test node call back to a server through an ssh tunnel and log in to a restricted shell, where it sets up a recovery tunnel and does a mysqldump. There was also a script to email the person whose router is running the tests if no updates had been made in a while.

Obviously we'd need to deal with several privacy issues, but if we wind up being allowed to run HSs on Mlab nodes, then we could possibly have the HTTP comparison done through that.

I have done a bit of research into support vector machines and of course have studied Bayesian inference, but I'm not a machine learning expert. I do know, from the experience of spending two years training a lexical fully-recurrent backpropagating neural network, that training is about as much fun as punching yourself in the face.

And, though I have not worked with them, and it is also a fast-progressing field, I believe that SVMs have trouble fitting when the training and data sets are large, because the radius function (presumably the radial basis function kernel) doesn't center on the data points correctly. There is also another thing which is much simpler and easier to train, called a Relevance Vector Machine, which is basically just the covariance between the training and experimental sets, applied against a Gaussian distribution over a multidimensional space that represents "the test field"; defining that field in an optimized fashion is what leads to the kernel trick.
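
For reference, the kernel being gestured at above is presumably the Gaussian radial basis function, which is centered on each data point:

{{{
k(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)
}}}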

I do not know. I think that if there exists a feasible machine learning algorithm for detecting whether a page has been changed (if that even happens), or for giving us a regex set describing the blocks, then the censors could use it just as well to find the pages to block.

That said, I looked into libraries for hacking on this. There is a thing called OrangePy which looks pretty good, and I've played with PyBrain before and it wasn't too bad.

comment:5 Changed 7 years ago by hellais

Keywords: ooni_wishlist ooni_research added; SponsorH201206 removed

comment:6 Changed 7 years ago by hellais

Summary: Detecting censorship in HTTP pages → Detecting censorship in HTML pages

comment:7 Changed 20 months ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

comment:8 Changed 13 months ago by teor

Keywords: archived-closed-2018-07-04 added
Resolution: wontfix
Status: needs_review → closed

Close all tickets in archived components
