Opened 4 months ago

Last modified 37 hours ago

#33010 new task

Monitor cloudflare captcha rate: do a periodic onionperf-like query to a cloudflare-hosted static site

Reported by: arma
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Ideas
Version:
Severity: Normal
Keywords: network-health gsoc-ideas
Cc: gk, pili, haxxpop
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by arma)

We should track the rate that cloudflare gives captchas to Tor users over time.

My suggested way of doing that tracking is to sign up a very simple static webpage to be fronted by cloudflare, and then fetch it via Tor over time, and record and graph the rates of getting a captcha vs getting the real page.

The reason for the "simple static page" is to make it really easy to distinguish whether we're getting hit with a captcha. The "distinguishing one dynamic web page from another" challenge makes exitmap tricky in the general case, but we can remove that variable here.
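Because the page content is fully under our control, the check can be a plain comparison. A minimal sketch, assuming the static page body is exactly "Hello world!" and that a challenge page mentions the word "captcha" somewhere (both are assumptions, and `classify_response` is a hypothetical helper name):

```python
# Assumption: the exact body of the static page we host behind Cloudflare.
EXPECTED_BODY = "Hello world!"


def classify_response(status_code, body, expected=EXPECTED_BODY):
    """Classify one fetch as 'real_page', 'captcha', or 'other'."""
    # The real page is the only response that matches our known content.
    if status_code == 200 and body.strip() == expected:
        return "real_page"
    # Assumption: challenge pages contain the word "captcha" in some casing.
    if "captcha" in body.lower():
        return "captcha"
    # Timeouts, error pages, and anything unexpected end up here.
    return "other"
```

Keeping a separate "other" bucket avoids counting network errors as captchas.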

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

Once we get the infrastructure set up, we would be smart to run a second one which is just wget or curl or lynx or something, i.e. which doesn't behave like Tor Browser, in order to be able to track the difference between how Cloudflare responds to Tor Browser vs other browsers.

I imagine that Cloudflare should be internally tracking how they're handling Tor requests, but having a public tracker (a) gives the data to everybody, and (b) helps Cloudflare have a second opinion in case their internal data diverges from the public version.

The Berkeley ICSI group did research that included this sort of check:
https://www.freehaven.net/anonbib/#differential-ndss2016
https://www.freehaven.net/anonbib/#exit-blocking2017
but what I have in mind here is essentially a simpler subset of this research, skipping the complicated part of "how do you tell what kind of response you got" and with an emphasis on automation and consistency.

There are two interesting metrics to track over time: one is the fraction of exit relays that are getting hit with captchas, and the other is the chance that a Tor client, choosing an exit relay in the normal weighted fashion, will get hit by a captcha.
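The two metrics differ only in whether each exit counts once or in proportion to how often clients pick it. A rough sketch, assuming each measured exit comes with a selection weight (for example its consensus bandwidth); real Tor path selection also depends on exit policies and consensus weight parameters, which this ignores, and the field names are hypothetical:

```python
def captcha_metrics(exits):
    """Compute (fraction_of_exits_hit, weighted_client_probability).

    exits: list of dicts with
      'weight'  -- assumed exit selection weight, e.g. consensus bandwidth
      'captcha' -- True if the fetch through this exit hit a captcha
    """
    if not exits:
        raise ValueError("no exits measured")
    hit = [e for e in exits if e["captcha"]]
    # Metric 1: every exit relay counts equally.
    fraction_of_exits = len(hit) / len(exits)
    # Metric 2: exits count in proportion to their selection weight.
    total_weight = sum(e["weight"] for e in exits)
    client_probability = sum(e["weight"] for e in hit) / total_weight
    return fraction_of_exits, client_probability
```

With one large clean exit and two small blocked ones, the first metric is high while the second stays low, which is why tracking both is informative.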

Then there are other interesting patterns to look for, e.g. "are certain IP addresses punished consistently and others never punished, or is whether you get a captcha much more probabilistic and transient?" And does that pattern change over time?
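One simple way to start on the "consistent vs. transient" question is to label each exit by its history. A sketch, assuming observations are stored as (exit IP, got captcha) pairs; the function name and labels are hypothetical:

```python
from collections import defaultdict


def classify_exits(observations):
    """Label each exit IP as 'always', 'never', or 'sometimes' challenged.

    observations: iterable of (exit_ip, got_captcha) pairs collected over time.
    """
    counts = defaultdict(lambda: [0, 0])  # ip -> [captcha_count, total_count]
    for ip, got_captcha in observations:
        if got_captcha:
            counts[ip][0] += 1
        counts[ip][1] += 1

    labels = {}
    for ip, (hits, total) in counts.items():
        if hits == total:
            labels[ip] = "always"
        elif hits == 0:
            labels[ip] = "never"
        else:
            labels[ip] = "sometimes"
    return labels
```

Running this over successive time windows and comparing the label distributions would show whether the pattern changes over time.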

Child Tickets

Ticket | Status | Owner | Summary | Component
#34287 | new | metrics-team | Make the dataset accessible and downloadable | Metrics/Ideas
#34288 | new | metrics-team | Integrate more web browsers/fetchers and their older versions | Metrics/Ideas
#34289 | new | metrics-team | Integrate Cloudflare API not to change Cloudflare settings manually | Metrics/Ideas
#34290 | new | metrics-team | Enhance the available visualizations on the dashboard | Metrics/Ideas
#34291 | new | metrics-team | Create an API for enabling 3rd party interactions with the system | Metrics/Ideas
#34294 | new | metrics-team | Integrate Tor Stem | Metrics/Ideas
#34297 | new | metrics-team | Explore different options other than crontab to have a flexible scheduling system | Metrics/Ideas

Attachments (2)

captchawtf.png (2.4 MB) - added by cypherpunks 6 weeks ago.
https://captcha.wtf/
firefoxhttpproxy.jpg (82.9 KB) - added by cypherpunks 2 weeks ago.

Change History (32)

comment:1 Changed 4 months ago by arma

Cc: pili added

This project would be a great gsoc idea.

comment:2 Changed 4 months ago by cypherpunks

Please consider setting up both an IPv4-only and an IPv6-only domain (one with only an A record, one with only an AAAA record) so that they can be tested individually. An exit's IPv4 and IPv6 addresses may be punished differently, as if they belonged to different relays, even though it is the same exit and only the protocol differs.

The first-seen date of a fingerprint could also be important, to find out whether only "fresh" exit IPs can reach their destinations for a short period of time until they are burned with endless troll captchas.

We may already have a nice static site ready to mirror: the Tor blog?

This will hopefully help to prove all the frustration and headache that cloudflaw is throwing at all Tor users on a daily basis.

For every UA that is not a browser, I guess fail rates of >90%.
It's UA discrimination, from my personal experience.

Last edited 2 weeks ago by qbi

comment:3 Changed 4 months ago by arma

Cc: haxxpop added

Cc'ing haxxpop too, so he can follow along. In an ideal world, Cloudflare would collaborate on making this external monitoring tool be useful for everybody involved. Maybe they even want to put an intern on it this summer. :)

comment:4 Changed 4 months ago by haxxpop

It's quite important to make the request headers look like the ones from Tor Browser as much as possible because sometimes we consider traffic from Tor network with not-Tor-Browser-look-like headers as malicious.
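As an illustration of what "look like Tor Browser" could mean at the HTTP level, here is a sketch that builds a request with browser-like headers. The header values are assumptions modeled on a Firefox 68 ESR based Tor Browser and should be verified against a real Tor Browser before use. It only builds the request; it sets nothing at the TLS layer and does not route through Tor, which needs a SOCKS-capable client.

```python
import urllib.request

# Assumption: values modeled on a Firefox 68 ESR based Tor Browser.
# Capture the headers of a real Tor Browser and compare before relying on these.
TOR_BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_torbrowser_like_request(url):
    """Return a urllib Request carrying the browser-like headers above."""
    return urllib.request.Request(url, headers=dict(TOR_BROWSER_LIKE_HEADERS))
```

A second fetcher that deliberately keeps its default headers would then serve as the "not Tor Browser" comparison.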

comment:5 Changed 4 months ago by cypherpunks

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

Tor Browser does not upgrade immediately, so that's not much of a reason to use a real web browser. However, Cloudflare fingerprints the TLS handshake Client Hello (cipher suites and groups in TLS 1.3) to tell a real Tor Browser from a spoofed one. To pass, one must build curl with NSS and set the correct headers and cipher suites.

Cloudflare also has different levels of protection, and some grandfathered protection levels have no Tor Browser whitelisting. We should test them all.

Last edited 2 weeks ago by qbi

comment:6 Changed 4 months ago by cypherpunks

There is also the case where only subresource requests trigger a captcha, which is never displayed to the user. This makes sites break with no way to resolve it, because the user cannot see the captcha!

Example site: https://kiwiirc.com/nextclient/

Open the network panel in dev tools and visit the link. You will see that the JavaScript resources return 403 Forbidden and require a captcha, but this is not displayed to the user. If you open the 403 URLs in the URL bar, they work without a problem. The difference is the Accept header.

The only thing worse than reCRAPtcha is invisible reCRAPtcha. At least a visible captcha I can solve and then access the site. An invisible captcha is just access denied without telling you.

What is the point of a captcha if it cannot be seen, Cloudflare?

Last edited 2 weeks ago by qbi

comment:8 Changed 4 months ago by karsten

Component: changed from Metrics/Exit Scanner to Metrics/Ideas

This better fits into our Ideas subcomponent.

comment:9 Changed 4 months ago by gaba

Keywords: gsoc-ideas added

comment:10 Changed 4 months ago by cypherpunks

If the Tor Project could provide a list of sites that block Tor, that would be useful.

example link. Green checkmark: Tor passed, Red: Tor browser simulation denied. Also this link.

Last edited 2 weeks ago by qbi

comment:11 in reply to:  2 Changed 4 months ago by cypherpunks

Replying to cypherpunks:

It's UA discrimination from my personal experience.

This is true. See "Browser vendor discrimination".
(It's not as secure as Tor Browser, but there are people who use Chromium/Firefox over the Tor daemon.)

Last edited 2 weeks ago by qbi

comment:12 in reply to:  description Changed 3 months ago by gk

Replying to arma:

[snip]

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

The alt-svc is not kicking in with the first load. So, if we use a really simple static page (that is with nothing dynamic and no sub resources being requested subsequently) we should not hit that complicating factor.

That said, using Tor Browser for the case where we actually want to find out the Tor Browser experience seems like a thing we should investigate, if only for the reason mentioned in comment:4. There is tor-browser-selenium and various forks that should do the trick in combination with stem.

comment:13 Changed 3 months ago by arma

Description: modified (diff)

(fix typo)

comment:14 Changed 3 months ago by woswos

WARNING: I changed the way these two domains are registered on Cloudflare. All pages and subdomains still exist in the way explained here. That being said, now captcha.wtf has only IPv4 entries and exit11.online has IPv6 entries as suggested by everyone. Previously, I had trouble with getting an IPv6 address to my server.

Please take a look at this wiki page for the most up to date information.

Rest of the original post:

I wanted to conduct a few simple experiments on this issue. I will start by explaining my setup and continue with the experiments themselves.

Domain Setup
I registered two domains (captcha.wtf and exit11.online) with IPv4 records on Cloudflare. After playing with Cloudflare settings, I understood that domain owners have an important role in the way Cloudflare blocks Tor users.

A new free Cloudflare account comes with a default security level (like the security levels in the Tor browser and as comment:5 mentioned), and the default security level doesn't explicitly block Tor users. I am not saying Cloudflare is innocent, but they don't mention a possible Tor user blocking at this security level. However, Tor shows up as a country on the Cloudflare firewall settings, and it is possible to block Tor users based on this firewall rule. I think they have a list of Tor exit node IPs, and they use this list to perform the filtering. They "offer" JS and Captcha challenges in addition to simple blocking, as shown in the image below:

https://bottomless-pit.barkin.io/tor-firewall-rules.png

I think that's why some Tor users face more captcha challenges at higher Tor browser security levels. JavaScript is blocked at higher security levels, and they can't pass the Cloudflare JS challenges.

Also, if a firewall rule related to Tor is set, Cloudflare applies that rule (for example, the never-ending captcha challenge) all the time even if the user has somehow managed to pass the challenge 5 seconds ago - I think that is the part all of us hate, it just creates an endless loop. A sample Cloudflare firewall record below shows that the same IP address is continuously challenged over and over again, even after successfully passing the captcha challenge.

https://bottomless-pit.barkin.io/tor-firewall-1.png

exit11.online has the default Cloudflare configuration without any additional firewall or protection. I am guessing that this would be the case with most of the average Cloudflare users. I also registered the bypass.exit11.online subdomain, which bypasses the Cloudflare proxy and only utilizes Cloudflare as a DNS hosting service.

https://bottomless-pit.barkin.io/tor-cloudflare-exit11.png

captcha.wtf has the default Cloudflare configuration with the additional firewall configuration for blocking Tor users, as I have mentioned previously. I registered this second domain to see the difference between using the default Cloudflare settings and adding additional firewall rules. I also registered the bypass.captcha.wtf subdomain, which bypasses the Cloudflare proxy and only utilizes Cloudflare as a DNS hosting service.

https://bottomless-pit.barkin.io/tor-cloudflare-wtf.png

https://bottomless-pit.barkin.io/tor-cloudflare-wtf-firewall.png

Both of these domains have a very simple static "Hello world!" page at /index.html, and there is a more complicated page at /complex.html that loads resources from different locations. Additionally, captcha.wtf & exit11.online have SSL certificates issued by Cloudflare and bypass.captcha.wtf & bypass.exit11.online have SSL certificates issued by Let's Encrypt. I thought that these might have an effect on the way Cloudflare behaves.

Experimenting
Later, I used the Python script mentioned in comment:7 (it uses httplib) and the tor-browser-selenium mentioned in comment:12 to conduct a few simple experiments. I wrote another script to fetch different domain combinations via tor-browser-selenium and Python's httplib. For example, fetching bypass.exit11.online, exit11.online, exit11.online/complex.html, and bypass.exit11.online/complex.html via both tor-browser-selenium and Python's httplib.

Results
After fetching each combination about 100 times at one-minute intervals, the domain with the default configuration (exit11.online) was not blocked a single time via either Tor or httplib. However, the domain with the additional firewall configuration (captcha.wtf) was blocked every single time when fetched via Tor. Of course, both of the bypass subdomains were fine since the Cloudflare proxy was disabled, but I wanted to test it anyway.
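For reference, a per-combination block rate like the ones described here can be tallied as follows. This is a sketch, not the script used for the experiment; the tuple layout and the name `block_rates` are assumptions:

```python
from collections import Counter


def block_rates(results):
    """Return {(target, fetcher): fraction of fetches that were blocked}.

    results: iterable of (target, fetcher, was_blocked) tuples, one per fetch.
    """
    totals = Counter()
    blocked = Counter()
    for target, fetcher, was_blocked in results:
        key = (target, fetcher)
        totals[key] += 1
        if was_blocked:
            blocked[key] += 1
    # Counter returns 0 for keys that were never blocked.
    return {key: blocked[key] / totals[key] for key in totals}
```

Keying on the (target, fetcher) pair keeps results for the same domain separate when it is fetched with different clients.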

Possible Conclusions
I'm sure my simple tests are not enough at all to draw a meaningful conclusion, but these results make me question the role of domain owners in this endless captcha problem. The domain with default Cloudflare configurations didn't block Tor users, but the domain with extra firewall configuration set by the domain owner banned Tor users all the time. However, again, this is an observation based on my very limited experiments.

I want to conduct more advanced experiments based on your feedback to address the metrics mentioned in the original ticket and find possible patterns in the recorded data.

Please feel free to use both of these domains for further testing.

Last edited 37 hours ago by woswos

comment:15 Changed 2 months ago by gk

Some ideas worth keeping in mind, which irl brought up the other day:

Is there an IPv4/IPv6 difference?
Does it matter which day of the week/time of the day sites are getting visited?
Does size of the exit relay play a role (larger might carry "more" abusive traffic)?
If we check Tor Browser we should have a Firefox control group (maybe with FPI and RFP on)/other tool using just tor (curl/Firefox).
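The day-of-week/time-of-day question in the list above could be explored by bucketing measurements. A minimal sketch, assuming each sample is a (Unix timestamp, got captcha) pair and using UTC; the function name is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timezone


def rate_by_weekday_hour(samples):
    """Return {(weekday, hour): captcha rate}, with weekday 0 = Monday, in UTC.

    samples: iterable of (unix_timestamp, got_captcha) pairs.
    """
    buckets = defaultdict(lambda: [0, 0])  # key -> [captcha_count, total_count]
    for ts, got_captcha in samples:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        key = (t.weekday(), t.hour)
        if got_captcha:
            buckets[key][0] += 1
        buckets[key][1] += 1
    return {key: hits / total for key, (hits, total) in buckets.items()}
```

The same bucketing idea would work for exit relay size by replacing the time key with a bandwidth bracket.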

Last edited 2 months ago by gk

comment:16 Changed 2 months ago by woswos

I made additions to the repository I mentioned in comment:14 and deployed the code to a cloud server, specifically the automated_fetcher_influxdb example.

Now, the server is fetching the captcha.wtf & exit11.online pages and their combinations with & without the Tor browser at 15-minute intervals. The full list of URLs tested is here. The results are then sent to an InfluxDB database.
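For readers unfamiliar with InfluxDB, one fetch result could be encoded in InfluxDB line protocol roughly as below. The schema actually used by automated_fetcher_influxdb is not shown in this ticket, so the measurement, tag, and field names here are hypothetical:

```python
def _escape_tag(value):
    """Escape commas, equals signs, and spaces, as line protocol requires for tags."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB line protocol string."""
    head = measurement
    if tags:
        head += "," + ",".join(
            f"{k}={_escape_tag(v)}" for k, v in sorted(tags.items())
        )
    parts = []
    for k, v in sorted(fields.items()):
        if isinstance(v, bool):  # check bool before int: bool is a subclass of int
            parts.append(f"{k}={'true' if v else 'false'}")
        elif isinstance(v, int):
            parts.append(f"{k}={v}i")  # trailing 'i' marks an integer field
        elif isinstance(v, float):
            parts.append(f"{k}={v}")
        else:
            escaped = str(v).replace("\\", "\\\\").replace('"', '\\"')
            parts.append(f'{k}="{escaped}"')
    return f"{head} {','.join(parts)} {timestamp_ns}"
```

Storing the URL and fetch method as tags and the outcome as fields lets a dashboard group captcha rates by either dimension.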

I created a public Grafana dashboard at dashboard.captcha.wtf to quickly visualize the collected data. You can visit the dashboard to see the data collected so far. I will add more panels and analysis to the dashboard as I implement more metrics to track.

Note: captcha.wtf & exit11.online websites and the automated_fetcher_influxdb code are not hosted on the same server. They all have different IP addresses if anyone is wondering.

Last edited 2 months ago by woswos

comment:17 Changed 2 months ago by woswos

I wanted to share this lovely(!) patent, just in case anyone missed it:

Blocking via an unsolvable CAPTCHA
https://patents.google.com/patent/US9407661

comment:18 in reply to:  17 ; Changed 8 weeks ago by cypherpunks

Replying to woswos:

Blocking via an unsolvable CAPTCHA
https://patents.google.com/patent/US9407661

Yes, they own a so-called Troll Captcha patent, and reCAPTCHA effectively presents you with this type of unsolvable captcha. Or, when connecting through an exit node, you simply get the treatment the patent describes: "generate an unsolvable challenge-response test based on identifying the request as being associated with the malicious activity."

And the high volume of requests that any exit node processes is already enough to count as "associated with the malicious activity".

But did you notice cloudflare seems to have changed captcha provider from recaptcha to ?

Last edited 2 weeks ago by qbi

comment:19 in reply to:  18 Changed 8 weeks ago by cypherpunks

Replying to cypherpunks:

Replying to woswos:

But did you notice cloudflare seems to have changed captcha provider from recaptcha to ?

Yes, to hcaptcha.com.

Here you can see a Cloudflare-fronted website that delivers a captcha to the user every time, and of course it has changed from reCAPTCHA to hCaptcha too, so you can see the difference directly. As an example site, look at:

https://captcha.website/

This means you should expect even more captchas delivered to users, because it is now a business model (get webmasters to use the "free" Cloudflare service and present users with captchas that earn money) with every captcha presented:

runs on the Ethereum blockchain. Websites earn Human Tokens (HMT)
whenever users use the hCaptcha widget on their site,
and machine learning companies pay Human Tokens to get their data labeled.

The Value in Data Labeling
When you use hCaptcha, companies bid on the work your users do as they prove their humanity.
You get the rewards.

source: hcaptcha.com

Last edited 2 weeks ago by qbi

comment:20 Changed 6 weeks ago by woswos

Wow, I noticed the hCaptcha update, but I didn't know Cloudflare was making money out of this. Based on the blog post, it seems like Cloudflare still utilizes the same decision mechanism to present CAPTCHAs. I wonder how that mechanism will change with the hCaptcha update.

I started collecting data before the hcaptcha update. Let me try to add a new panel to the dashboard to see the presented CAPTCHA rate changes over time.

Changed 6 weeks ago by cypherpunks

Attachment: captchawtf.png added

comment:21 Changed 5 weeks ago by cypherpunks

Found a similar repo for these checks at https://github.com/shawa/cfcheck

Attempted Cloudflare CAPTCHA detection on a given site across sample of Tor exits

Last edited 2 weeks ago by qbi

comment:22 Changed 2 weeks ago by cypherpunks

Can you register 2 other domains under another identity (e.g. a friend's) and test them too? (And do not disclose the domain names in public, to rule out the chance that Cloudflare whitelists them.)

Last edited 2 weeks ago by qbi

comment:23 Changed 2 weeks ago by cypherpunks

The domain with default Cloudflare configurations didn't block Tor users

You must be joking.
Have you tried 'Firefox 7X + HTTP proxy (Privoxy*, for example) + Tor (as the SOCKS proxy behind the HTTP proxy)'?

  • Scenario example: the user uses Adblock DNS

forward-socks4 / 127.0.0.1:9050 .


"Browser Vendor" difference
https://trac.torproject.org/projects/tor/wiki/doc/tgcw_cloudflare_ethics

Last edited 2 weeks ago by qbi

comment:24 Changed 2 weeks ago by cypherpunks

And Chrome + Tor. Captcha party.

Last edited 2 weeks ago by qbi

comment:25 Changed 2 weeks ago by cypherpunks

    desc = """Check if a web site returns a CloudFlare CAPTCHA using tor
    browser. By default, this tool is looking for the
    'Attention Required! | Cloudflare' text within the fetched web site.
    """
$ getweb --tor https://example.com/hello.php
URL opened
Got response:
Start::
hello
end::
URL closed
$ 

Cloudflare returns a translated string to the client based on your browser's language.
You had better look for '| Cloudflare', or 'Cloudflare' (best).
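A sketch of that language-independent check, restricted to the page title so that a site which merely mentions Cloudflare in its body is not misclassified. It assumes the challenge page title always contains "Cloudflare" regardless of language; the class and function names are hypothetical:

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_cloudflare_challenge(html):
    """Return True if the page title mentions Cloudflare."""
    parser = _TitleParser()
    parser.feed(html)
    # Assumption: only the "Cloudflare" part of the title survives translation.
    return "Cloudflare" in parser.title
```

For a static test page whose own title is known, this avoids depending on the English 'Attention Required!' prefix.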

Last edited 2 weeks ago by qbi

Changed 2 weeks ago by cypherpunks

Attachment: firefoxhttpproxy.jpg added

comment:26 Changed 2 weeks ago by cypherpunks

https://trac.torproject.org/projects/tor/raw-attachment/ticket/33010/firefoxhttpproxy.jpg

Here is an image of captcha.wtf rejecting me.

(for auto-redirect I use BCMA addon)

Last edited 2 weeks ago by qbi

comment:27 Changed 2 weeks ago by cypherpunks

Expected long-term impact

    Helping Tor users browse the internet without sacrificing privacy and getting discriminated 

Do you know that CloudFlare is now tracking all users, not just Tor users, via the js_chl_bypass GET parameter?

Last edited 2 weeks ago by qbi

comment:28 Changed 2 weeks ago by cypherpunks

CAPTCHA_Monitoring_Project_Diagram.png

Incorrect:
"Websites that use Cloudflare as CDN only"

Correct:
"Websites that do not route through Cloudflare"

OR

"Websites that use Cloudflare as DNS only"

For bypass.*, the user's browser goes directly to your WWW server.

captcha.wtf. 299 IN A 104.18.32.206 [CF]
captcha.wtf. 299 IN A 104.18.33.206 [CF]

exit11.online. 299 IN A 104.27.148.85 [CF]
exit11.online. 299 IN A 104.27.149.85 [CF]

bypass.captcha.wtf. 299 IN A 216.244.91.100 [US Wowrack]

bypass.exit11.online. 299 IN A 216.244.91.100 [US Wowrack]
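The same distinction can be checked programmatically by testing whether a resolved address falls inside a Cloudflare range. A sketch, assuming the two networks below (a partial, possibly outdated subset; the authoritative list is published by Cloudflare and changes over time):

```python
import ipaddress

# Assumption: partial list of Cloudflare IPv4 ranges, for illustration only.
CLOUDFLARE_NETS = [
    ipaddress.ip_network("104.16.0.0/13"),
    ipaddress.ip_network("104.24.0.0/14"),
]


def is_behind_cloudflare(ip):
    """Return True if the address falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUDFLARE_NETS)
```

Applied to the records above, the captcha.wtf and exit11.online addresses match, while the bypass.* address (216.244.91.100) does not.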

Last edited 2 weeks ago by qbi

comment:29 Changed 2 weeks ago by ϲypherpunks

Hi, I have read your wiki page entry, @woswos.

I have seen that under point 1 of doc/CAPTCHAMonitor you took my comment into account.
The topic was dual-stack differences. But now, after the change from reCAPTCHA to hCaptcha, this might change: at the time of writing, hCaptcha does not support IPv6 at all, whereas reCAPTCHA did. While you can visit an IPv6 website, the captcha page includes hCaptcha over IPv4.

comment:30 in reply to:  28 Changed 4 days ago by woswos

Replying to cypherpunks:

Incorrect:
"Websites that use Cloudflare as CDN only"

Correct:
"Websites that do not route through Cloudflare"
...

Fixed it, thank you for the feedback.
