#10809 closed defect (fixed)

reCAPTCHAs on bridges.torproject.org are impossible to solve for humans

Reported by: lunar Owned by: isis
Priority: major Milestone:
Component: BridgeDB Version:
Keywords: bridgdb-0.1.5 Cc: sysrqb, isis
Actual Points: Parent ID:
Points:

Description

Several users complained that the CAPTCHAs displayed on bridges.torproject.org were very hard, if not impossible, for them to solve. For me it usually takes 10 to 20 tries, when I do not give up. The 'H' in CAPTCHA means “humans”. If humans cannot even solve the CAPTCHAs offered by reCAPTCHA, then we have a problem.

Can we consider migrating away from reCAPTCHA? Also even if we proxy the queries to Google, not relying on Google would make me feel better.

Otherwise, the problem needs to be taken to reCAPTCHA admins.

Child Tickets

Attachments (1)

e2gZ9Etz.jpg (59.4 KB) - added by isis 18 months ago.
example gimp-captcha


Change History (18)

comment:1 Changed 19 months ago by sysrqb

I've been considering this, as well. infinity0 pointed us to Asirra [0] (yes, I know, Microsoft) a few days ago, which looks promising, but we can also consider using something that runs locally, or maybe even try both at the same time for a month (alternate between them for each request) and see which one appears to work better. Other suggestions are also welcomed.

[0] https://research.microsoft.com/en-us/um/redmond/projects/asirra/

comment:2 Changed 19 months ago by torland

I always have to magnify the captcha by pressing Ctrl-+ several times in order to be able to solve it. Without that it is really hard. Showing the captcha larger would surely help. Is it possible to enlarge it by default?

comment:3 Changed 19 months ago by isis

  • Owner set to isis
  • Status changed from new to accepted

There is also a python API which basically scripts GIMP to create a local cache of CAPTCHAs, made by the SpiderOak developers: https://spideroak.com/code

It's a bit annoying because it's not packaged properly (there's just a tarball and a checksum hash; it's not in PyPI or distro repos), so updates will be annoying, but it was the only decent thing I found when I looked into this a few months ago.

comment:4 Changed 19 months ago by sysrqb

I sent an email to recaptcha support. Let's see what they say.

comment:5 Changed 19 months ago by sysrqb

The reCAPTCHA team should be whitelisting bridges.torproject.org soon. This will hopefully greatly improve the situation.

comment:6 follow-up: Changed 19 months ago by isis

I spent about half an hour last night reviewing and testing the spideroak python/gimp captcha generation script. It works. I tweaked it a bit (to make it harder, actually, since by default it's only 5 letters/numbers). (no commits yet because i only fiddled with it)

Before I go any farther, there are the following open questions about doing local captcha generation:

  1. Can this run on a headless server?
  2. This is highly resource intensive, on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?
  3. Is this something we want?
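
For a rough sense of scale, the timing in question 2 can be turned into a per-CAPTCHA cost. This is only back-of-the-envelope arithmetic based on the single laptop measurement quoted above; the function names are made up for illustration, and a real server may be faster or slower.

```python
# Measurement quoted above: ~8 minutes for 2,000 CAPTCHAs on one laptop.
SECONDS_PER_BATCH = 8 * 60
CAPTCHAS_PER_BATCH = 2000


def seconds_per_captcha():
    """Average generation cost of one CAPTCHA at the measured rate."""
    return SECONDS_PER_BATCH / CAPTCHAS_PER_BATCH


def generation_minutes(count):
    """Minutes needed to generate `count` CAPTCHAs at the measured rate."""
    return count * SECONDS_PER_BATCH / CAPTCHAS_PER_BATCH / 60


if __name__ == "__main__":
    print("seconds per CAPTCHA:", seconds_per_captcha())
    # 10,000 is a hypothetical daily batch size, not a measured request count.
    print("minutes for 10,000:", generation_minutes(10000))
```

At this rate a batch can be sized once the actual request volume is known, which the ticket does not state.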

For making this easier for mobile users, there is this SE answer which includes a mobile captcha HTML5 trace system, however it uses HTML5 canvases.

Last edited 19 months ago by isis

comment:7 Changed 19 months ago by lunar

Just for the record, the CAPTCHAs still look very difficult to me currently.

comment:8 in reply to: ↑ 6 Changed 19 months ago by massar

..

Before I go any farther, there are the following open questions about doing local captcha generation:

  1. Can this run on a headless server?

Likely yes (in the extreme case just run an X server inside VNC or so); likely the biggest problem is the amount of randomness that is available to the system, hence a hardware RNG should be present on the box.

  2. This is highly resource intensive, on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?

I would suggest having one or more machines supply a bunch of captchas in bulk to BridgeDB. Do we have an idea of how many queries are being made at the moment? It should then be easy to provide a daily bunch of captchas to the host.

  3. Is this something we want?

I would say, yes IMHO, as the current captchas are unreadable. Also they depend on an external rather untrustworthy entity (IMHO); though they are unable to see the source of the queries, they can at least see that queries are being made, how many there are etc and possibly correlate them with other events that they have their eyes on (eg if people are silly and use their public DNS system...).

comment:9 Changed 19 months ago by grarpamp

recaptcha(google) doesn't just solve books anymore, they read house numbers for maps, names, and whatever else they need into their monolith. If helping to develop that bothers you, you need to ditch external captchas in favor of internal engines. One such engine...
http://www.phpcaptcha.org/ (SecurImage)
http://www.graphicsmagick.org/

If you get word from google why they seem to serve hard captchas to most Tor IPs and easy ones to non-Tor IPs, post it here.

comment:10 follow-up: Changed 19 months ago by lunar

The CAPTCHA looks just the same as before. Maybe their support team could look into what is happening?

comment:11 Changed 19 months ago by lunar

Users keep complaining.

comment:12 Changed 18 months ago by lunar

At least 3 new complaints on the help desk today.

comment:13 in reply to: ↑ 10 Changed 18 months ago by isis

  • Keywords bridgdb-0.1.5 added

Replying to lunar:

The CAPTCHA looks just the same as before. Maybe their support team could look into what is happening?

See #10834, which is merged for 0.1.5 (current deployment is 0.1.4).

Also see https://github.com/isislovecruft/gimp-captcha

Last edited 18 months ago by isis

comment:14 follow-up: Changed 18 months ago by isis

  • Cc sysrqb isis added
  • Priority changed from normal to major

Okay, my branch for this work is done, and it seems to work well, but I have not yet written unittests for it. I would prefer it also had more documentation on how to generate the CAPTCHAs.

However, the work itself is ready for review while I finish the unittests. I am changing the priority to 'major' because of all the complaints going to the support desk.

This adds support for using CAPTCHAs from a local directory (created with my Gimp+Python CAPTCHA generation scripts). It also works with my branch for #11127.
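
As a rough illustration of serving CAPTCHAs from a local directory, the sketch below picks a random image from a cache and returns it with its expected answer. It is not the code from the branch: the function name `pick_captcha`, the `.jpg` extension, and the convention that the file name (minus extension) is the answer are all assumptions made for the example.

```python
import os
import random


def pick_captcha(cache_dir, rng=None):
    """Return (image_bytes, answer) for a random CAPTCHA in cache_dir.

    Assumption for this sketch: each image is named "<answer>.jpg".
    """
    # SystemRandom draws from the OS entropy source rather than a seeded PRNG.
    rng = rng or random.SystemRandom()
    names = sorted(n for n in os.listdir(cache_dir) if n.lower().endswith(".jpg"))
    if not names:
        raise LookupError("no CAPTCHA images found in %r" % cache_dir)
    name = rng.choice(names)
    with open(os.path.join(cache_dir, name), "rb") as handle:
        image = handle.read()
    answer = os.path.splitext(name)[0]
    return image, answer
```

The answer itself would of course never be sent to the client in the clear; only the image and some protected form of the answer would be.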

comment:15 in reply to: ↑ 14 ; follow-up: Changed 18 months ago by sysrqb

Replying to isis:

This adds support for using CAPTCHAs from a local directory (created with my Gimp+Python CAPTCHA generation scripts). It also works with my branch for #11127.

It looks sane! (I actually reviewed your fix/11127-recaptcha-ssl_10809r1_r1, but putting GimpCaptcha review here)

I haven't reviewed GimpCaptchaTests yet, nor run the code, but based on the review I think there are only two things that we might want to change.

1) (as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a <blah>CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).

2) the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))
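
A minimal sketch of the time-pinning idea in point 2, using only the Python standard library. It does not reproduce the actual `GimpCaptcha.createChallenge()` or `GimpCaptcha.check()` code; the function names, the challenge layout (HMAC tag followed by the encrypted answer), and stamping with the index of the issuing period rather than the "next" period are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

PERIOD_SECONDS = 300  # five minutes; an arbitrary example value
TAG_LENGTH = hashlib.sha256().digest_size  # 32 bytes


def period_index(now):
    """Number of whole periods since the epoch at time `now`."""
    return int(now) // PERIOD_SECONDS


def _tag(key, index, encrypted_answer):
    # The period index is prepended to the encrypted answer before MACing.
    message = str(index).encode("ascii") + b";" + encrypted_answer
    return hmac.new(key, message, hashlib.sha256).digest()


def create_challenge(key, encrypted_answer, now=None):
    """Return tag || encrypted_answer, with the tag bound to the current period."""
    now = time.time() if now is None else now
    return _tag(key, period_index(now), encrypted_answer) + encrypted_answer


def check_challenge(key, challenge, now=None):
    """Accept only challenges issued in the current or the previous period."""
    now = time.time() if now is None else now
    tag, encrypted_answer = challenge[:TAG_LENGTH], challenge[TAG_LENGTH:]
    current = period_index(now)
    return any(
        hmac.compare_digest(tag, _tag(key, index, encrypted_answer))
        for index in (current, current - 1)
    )
```

With this layout a challenge stays valid for between one and two periods, depending on when in its period it was issued, and the server needs no per-challenge state.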

Changed 18 months ago by isis

example gimp-captcha

comment:16 in reply to: ↑ 15 Changed 18 months ago by isis

  • Status changed from accepted to needs_revision

Replying to sysrqb:

Replying to isis:

This adds support for using CAPTCHAs from a local directory (created with my Gimp+Python CAPTCHA generation scripts). It also works with my branch for #11127.

It looks sane! (I actually reviewed your fix/11127-recaptcha-ssl_10809r1_r1, but putting GimpCaptcha review here)

I haven't reviewed GimpCaptchaTests yet, nor run the code, but based on the review I think there are only two things that we might want to change.

1) (as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a <blah>CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).


I am thinking of making this a separate enhancement ticket, since I think the fine people helping the support desk will have a better quality of life if we first make human-passable Turing tests.

One thing that has just occurred to me is that, if either reCaptcha or the gimp-captchas are considered easier, and we have a probabilistic wrapper resource for choosing one or the other, couldn't a user just refresh until they get the easier one? I mean, the webserver isn't stateful between one request and the next. Making it stateful would mean rewriting most of it.
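
To put a number on the refresh concern, here is a small sketch of a stateless probabilistic selector and the chance that a user reaches the system they prefer within a few page loads. The names `choose_backend` and `chance_of_preferred` and the backend labels are hypothetical; this is not BridgeDB code.

```python
import random


def choose_backend(p_gimp, rng=random):
    """Pick "gimp" with probability p_gimp, otherwise "recaptcha"."""
    # random() returns a float in [0.0, 1.0).
    return "gimp" if rng.random() < p_gimp else "recaptcha"


def chance_of_preferred(p, loads):
    """Probability of seeing the preferred system at least once in `loads`
    independent page loads, if each load shows it with probability p."""
    return 1.0 - (1.0 - p) ** loads


if __name__ == "__main__":
    for loads in (1, 2, 3, 5):
        print(loads, chance_of_preferred(0.5, loads))
```

Because each request is an independent draw, the chance of never seeing the preferred system shrinks geometrically with every refresh, so a stateless split mostly measures which system users prefer rather than limiting them to one.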

2) the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))


Yeah, I totally agree. There is a TODO comment about it in the commit message for eeb6956ed7f7ddd0f2592c17f4a5d58a580fb878.

comment:17 Changed 18 months ago by isis

  • Resolution set to fixed
  • Status changed from needs_revision to closed

Merged for version 0.1.5 in this commit. See #11215 for the follow-up ticket on timestamps/expiry. Sysrqb and I agreed that this wouldn't be deployed yet; the deployed version, for now, will use the branches from #11127 to #10834 to try to solve the CAPTCHA difficulty problems.
