OONI's HTTP test makes a TCP connection to the target host and sends an HTTP request to fetch a webpage. If a webpage is retrieved, it may be one served by the censor.
The issue is understanding whether such a page is the legitimate response or a block page. How do we do this?
The naive way to do so is to also fetch the page over Tor and check whether it matches the one fetched over the live network. This has some problems, though: for example, if the site is geolocated, the content served over Tor will differ.
Another simple approach is to keep a database of content lengths of websites, but this will also fail if the block page happens to be close in length to the real web page.
Another approach is to find a smart fuzzy-matching algorithm for the tested page.
Other ideas?
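As a very rough illustration of the comparison ideas above, the check could start as simply as flagging a page whose length deviates too much from a control fetch (e.g. one made over Tor). This is only a sketch; the function name and the 10% tolerance are made up for illustration, not existing OONI code:

```python
# Minimal sketch: flag the fetched page as suspicious if its length deviates
# too much from a control body (e.g. fetched over Tor or from a database of
# known lengths). The 10% tolerance is an arbitrary placeholder.

def looks_like_block_page(experiment_body, control_body, tolerance=0.1):
    control_len = len(control_body)
    if control_len == 0:
        return True  # an empty control is itself suspicious
    deviation = abs(len(experiment_body) - control_len) / float(control_len)
    return deviation > tolerance
```

As noted above, a check this crude fails whenever the block page is close in size to the real one, which is part of why smarter matching is being discussed.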
If we are packaging Tor, then could we simply tell the OONI copy of Tor to allow .exit notation (AllowDotExit) and always direct it at the same reliable exit node for a given portion of the test? Or specify only those reliable nodes in a given region? This obviously increases fingerprintability and potentially even the attack surface.
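Pinning the exits in the bundled Tor could look roughly like this in its torrc (the option names are stock Tor options; the fingerprints below are placeholders, not real relays):

```
# Hypothetical torrc snippet for the Tor that OONI ships.
# Allow the hostname.exit notation so a request can name its exit.
AllowDotExit 1
# Restrict circuits to a fixed set of trusted exits (placeholder fingerprints).
ExitNodes $AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,$BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
StrictNodes 1
```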
Another workaround to the geolocation problem might be limiting the test to requesting only non-geolocated pages, i.e. not google.com or facebook.com or whatever. There are plenty of vanilla static webpages out there; we'd just have to be sure that as OONI grows we don't hammer them, and also not to pick ones so obscure that we're essentially revealing to that server which users run OONI tests.
As for your last suggestion, we could use the HTTP fuzzy-match algorithm from the captive portal test, which is not yet ported to the new Twisted-based framework but is, in its old version, in the captive portal branch of my OONI repo on GitHub. That algorithm could definitely use some smarts, however.
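In the meantime, a stand-in for that fuzzy match could be something as simple as difflib's similarity ratio. To be clear, this is not the captive portal code, just a placeholder sketch, and the 0.8 threshold is arbitrary:

```python
import difflib

def fuzzy_match(expected_body, fetched_body, threshold=0.8):
    """Return True if the two HTTP bodies are 'close enough'.

    SequenceMatcher is slow on large bodies; a real implementation would
    probably normalize and truncate them first.
    """
    ratio = difflib.SequenceMatcher(None, expected_body, fetched_body).ratio()
    return ratio >= threshold
```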
We discussed this at length during the OONI hackfest with mct, and this is what we talked about:
We can divide the possible approaches for distinguishing a block page from the correct result page into two categories: statistical and heuristic.
Statistical approach
Bayesian
SVM
Fuzzy signature matching
Other classifiers
What we want to do is run these classifiers on a corpus of block pages and real websites.
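As one concrete example on the statistical side, a Bayesian classifier trained on such a labelled corpus could be put together roughly like this. The library choice (scikit-learn) and the corpus format are assumptions for illustration only:

```python
# Hypothetical sketch: train a text classifier on labelled page bodies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_blockpage_classifier(block_pages, real_pages):
    """block_pages / real_pages: lists of HTML bodies (strings)."""
    bodies = block_pages + real_pages
    labels = [1] * len(block_pages) + [0] * len(real_pages)
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(bodies, labels)
    return clf

# classifier = train_blockpage_classifier(block_corpus, real_corpus)
# classifier.predict([fetched_body])  # 1 -> likely block page
```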
Heuristics
A list of regexps that match known censorship pages
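The heuristic side would boil down to a curated fingerprint list applied to the fetched body, along these lines. The patterns below are invented examples, not real signatures; a real list would be curated per country/ISP:

```python
import re

# Invented example patterns for illustration only.
BLOCKPAGE_PATTERNS = [
    re.compile(r"this (site|page) has been blocked", re.IGNORECASE),
    re.compile(r"access to this website is restricted", re.IGNORECASE),
]

def matches_blockpage_heuristics(body):
    return any(pattern.search(body) for pattern in BLOCKPAGE_PATTERNS)
```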
We also talked about having clients tell the backend what they got as a response and having the backend figure out if such a page should be a block page or the correct result.
This is similar to what Bismark does: they have the client test node call back to a server through an SSH tunnel and log in to a restricted shell, where it sets up a recovery tunnel and does a mysqldump. There was also a script to email the person whose router is running the tests if no updates had been made in a while.
Obviously we'd need to deal with several privacy issues, but if we wind up being allowed to run hidden services on M-Lab nodes, then we could possibly have the HTTP comparison done through that.
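A sketch of the client side of that "report and let the backend classify" idea might look like the following. The endpoint, payload fields, and the use of the requests library are all assumptions made for illustration; none of this is an existing OONI API:

```python
import json
import requests

def submit_http_response(backend_url, url, status_code, headers, body):
    """Send the fetched response to a hypothetical backend for classification."""
    report = {
        "requested_url": url,
        "status_code": status_code,
        "headers": dict(headers),
        "body": body,
    }
    # The backend would run its classifiers/heuristics and answer with a verdict.
    resp = requests.post(backend_url + "/classify", data=json.dumps(report),
                         headers={"Content-Type": "application/json"})
    return resp.json().get("verdict")  # e.g. "block_page" or "ok"
```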
I have done a bit of research into support vector machines and have of course studied Bayesian inference, but I'm not a machine learning expert. I do know, from the experience of spending two years training a lexical fully-recurrent backpropagating neural network, that training is about as much fun as punching yourself in the face. And though I have not worked with them, and it is a fast-progressing field, I believe that SVMs have trouble with fitting when the training and data sets are large, because the radius function thing (I forget what that function is called) doesn't center on the data points correctly. There is also another thing which is much, much simpler and easier to train, called a Relevance Vector Machine, which is basically just the covariance between the training and experimental sets applied against a Gaussian distribution over a multidimensional space representing "the test field"; defining that test field in an optimized fashion is what leads to the kernel trick.
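(The "radius function" above is presumably the radial basis function kernel.) For what it's worth, an RBF-kernel SVM over whatever numeric features we end up extracting from pages would be along these lines; scikit-learn and the feature choices are my assumptions, not something OONI already uses:

```python
# Sketch only: an RBF-kernel SVM over numeric feature vectors extracted from
# pages (e.g. body length, title similarity, header count).
from sklearn.svm import SVC

def train_page_svm(feature_vectors, labels):
    # labels: 1 = block page, 0 = real page
    model = SVC(kernel="rbf")
    model.fit(feature_vectors, labels)
    return model
```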
I do not know. I think that if there exists a feasible machine-learning algorithm for determining whether a page has been changed (if that even happens), or for giving us a regex set describing the blocks, then the censors could use it just as well to find the pages.
That said, I looked into libraries for hacking on this. There is a thing called OrangePy which looks pretty good, and I've played with PyBrain before and it wasn't too bad.