OONI's HTTP test makes a TCP connection to the target host and sends an HTTP request to fetch a webpage. If a webpage is retrieved, it may be one served by the censor.
The issue is understanding whether such a page is the legitimate response or a block page. How do we do this?
The naive way to do so is to also fetch the page over Tor and check whether it matches the one fetched over the live network. This has some problems, though: for example, if the site is geolocated, the content served over Tor will differ.
Another simple approach is to keep a database of content lengths of websites, but this will also fail if the block page happens to be close in length to the real web page.
Another approach is to find a smart fuzzy-matching algorithm for the tested page.
Other ideas?
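As a very rough illustration of the comparison ideas above, the check could start as simply as flagging a page whose length deviates too much from a control fetch (e.g. one made over Tor). This is only a sketch; the function name and the 10% tolerance are made up for illustration, not existing OONI code:

```python
# Minimal sketch: flag the fetched page as suspicious if its length deviates
# too much from a control body (e.g. fetched over Tor or from a database of
# known lengths). The 10% tolerance is an arbitrary placeholder.

def looks_like_block_page(experiment_body, control_body, tolerance=0.1):
    control_len = len(control_body)
    if control_len == 0:
        return True  # an empty control is itself suspicious
    deviation = abs(len(experiment_body) - control_len) / float(control_len)
    return deviation > tolerance
```

As noted above, a check this crude fails whenever the block page is close in size to the real one, which is part of why smarter matching is being discussed.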
If we are packaging Tor, then could we simply tell the OONI copy of Tor to allow .exit notation (AllowDotExit) and always direct it at the same reliable exit node for a given portion of the test? Or specify only those reliable nodes in a given region? This obviously increases fingerprintability and potentially even the attack surface.
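Pinning the exits in the bundled Tor could look roughly like this in its torrc (the option names are stock Tor options; the fingerprints below are placeholders, not real relays):

```
# Hypothetical torrc snippet for the Tor that OONI ships.
# Allow the hostname.exit notation so a request can name its exit.
AllowDotExit 1
# Restrict circuits to a fixed set of trusted exits (placeholder fingerprints).
ExitNodes $AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,$BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
StrictNodes 1
```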
Another workaround to the geolocation problem might be limiting the test to requesting only non-geolocated pages, i.e. not google.com or facebook.com or whatever. There are plenty of vanilla static webpages out there; we'd just have to be sure that as OONI grows we don't hammer them, and also not to pick ones so obscure that we're essentially revealing to that server which users run OONI tests.
As for your last suggestion, we could use the HTTP fuzzy-match algorithm from the captive portal test, which is not yet ported to the new Twisted-based framework but is, in its old version, in the captive portal branch of my OONI repo on GitHub. That algorithm could definitely use some smarts, however.
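In the meantime, a stand-in for that fuzzy match could be something as simple as difflib's similarity ratio. To be clear, this is not the captive portal code, just a placeholder sketch, and the 0.8 threshold is arbitrary:

```python
import difflib

def fuzzy_match(expected_body, fetched_body, threshold=0.8):
    """Return True if the two HTTP bodies are 'close enough'.

    SequenceMatcher is slow on large bodies; a real implementation would
    probably normalize and truncate them first.
    """
    ratio = difflib.SequenceMatcher(None, expected_body, fetched_body).ratio()
    return ratio >= threshold
```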
We discussed this at length during the OONI hackfest with mct, and this is what we talked about:
We can divide the possible approaches for distinguishing a block page from the correct result page into two categories: statistical and heuristic.
Statistical approach
Bayesian
SVM
Fuzzy signature matching
Other classifiers
What we want to do is run these classifiers on a corpus of block pages and real websites.
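As one concrete example on the statistical side, a Bayesian classifier trained on such a labelled corpus could be put together roughly like this. The library choice (scikit-learn) and the corpus format are assumptions for illustration only:

```python
# Hypothetical sketch: train a text classifier on labelled page bodies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_blockpage_classifier(block_pages, real_pages):
    """block_pages / real_pages: lists of HTML bodies (strings)."""
    bodies = block_pages + real_pages
    labels = [1] * len(block_pages) + [0] * len(real_pages)
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(bodies, labels)
    return clf

# classifier = train_blockpage_classifier(block_corpus, real_corpus)
# classifier.predict([fetched_body])  # 1 -> likely block page
```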
Heuristics
A list of regexps that match known censorship pages
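The heuristic side would boil down to a curated fingerprint list applied to the fetched body, along these lines. The patterns below are invented examples, not real signatures; a real list would be curated per country/ISP:

```python
import re

# Invented example patterns for illustration only.
BLOCKPAGE_PATTERNS = [
    re.compile(r"this (site|page) has been blocked", re.IGNORECASE),
    re.compile(r"access to this website is restricted", re.IGNORECASE),
]

def matches_blockpage_heuristics(body):
    return any(pattern.search(body) for pattern in BLOCKPAGE_PATTERNS)
```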
We also talked about having clients tell the backend what they got as a response and having the backend figure out if such a page should be a block page or the correct result.
This is similar to what Bismark does: they have the client test node call back to a server through an SSH tunnel and log in to a restricted shell, where it sets up a recovery tunnel and does a mysqldump. There was also a script to email the person whose router is running the tests if no updates had been made in a while.
Obviously we'd need to deal with several privacy issues, but if we wind up being allowed to run hidden services on M-Lab nodes, then we could possibly have the HTTP comparison done through that.
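A sketch of the client side of that "report and let the backend classify" idea might look like the following. The endpoint, payload fields, and the use of the requests library are all assumptions made for illustration; none of this is an existing OONI API:

```python
import json
import requests

def submit_http_response(backend_url, url, status_code, headers, body):
    """Send the fetched response to a hypothetical backend for classification."""
    report = {
        "requested_url": url,
        "status_code": status_code,
        "headers": dict(headers),
        "body": body,
    }
    # The backend would run its classifiers/heuristics and answer with a verdict.
    resp = requests.post(backend_url + "/classify", data=json.dumps(report),
                         headers={"Content-Type": "application/json"})
    return resp.json().get("verdict")  # e.g. "block_page" or "ok"
```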
I have done a bit of research into support vector machines and have of course studied Bayesian inference, but I'm not a machine learning expert. I do know, from the experience of spending two years training a lexical fully-recurrent backpropagating neural network, that training is about as much fun as punching yourself in the face. And though I have not worked with them, and it is a fast-progressing field, I believe that SVMs have trouble with fitting when the training and data sets are large, because the radius function thing (I forget what that function is called) doesn't center on the data points correctly. There is also another thing which is much, much simpler and easier to train, called a Relevance Vector Machine, which is basically just the covariance between the training and experimental sets applied against a Gaussian distribution over a multidimensional space representing "the test field"; defining that test field in an optimized fashion is what leads to the kernel trick.
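(The "radius function" above is presumably the radial basis function kernel.) For what it's worth, an RBF-kernel SVM over whatever numeric features we end up extracting from pages would be along these lines; scikit-learn and the feature choices are my assumptions, not something OONI already uses:

```python
# Sketch only: an RBF-kernel SVM over numeric feature vectors extracted from
# pages (e.g. body length, title similarity, header count).
from sklearn.svm import SVC

def train_page_svm(feature_vectors, labels):
    # labels: 1 = block page, 0 = real page
    model = SVC(kernel="rbf")
    model.fit(feature_vectors, labels)
    return model
```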
I do not know. I think that if there exists a feasible machine-learning algorithm for determining whether a page has been changed (if that even happens), or for giving us a regex set describing the blocks, then the censors could use it just as well to find the pages.
That said, I looked into libraries for hacking on this. There is a thing called OrangePy which looks pretty good, and I've played with PyBrain before and it wasn't too bad.