Opened 8 months ago

Closed 5 months ago

Last modified 5 months ago

#9529 closed task (implemented)

Replace check.tp.o webservers with Arlo's Go version

Reported by: arma
Owned by:
Priority: normal
Milestone:
Component: Tor Check
Version:
Keywords:
Cc: arlolra, dkantola@…, adam@…
Actual Points:
Parent ID:
Points:

Description

https://github.com/arlolra/check
is a (not yet completed) rewrite of TorCheck.py but it comes with its own webserver -- which should help both the "the cgi has to parse a bunch of stuff on startup each time" problem and the "oh my god we're running how many apaches" problem (which are related).

To do the replacement though, we need to enumerate all the things that check.tp.o serves right now.

Here's a start:
A) The index page, including variants of it like https://check.torproject.org/?lang=en-US&small=1&uptodate=1 that TBB 2.x uses as its homepage. Apparently there's also a /?TorButton=true variant.

B) https://check.torproject.org/RecommendedTBBVersions (a flat text file that gets fetched by millions of people)

C) https://check.torproject.org/cgi-bin/TorBulkExitList.py which is another cgi that should eventually get rewritten. (Its current design is particularly painful because of the number of dns queries it launches.)

D) https://exitlist.torproject.org/exit-addresses used to be the static file that tordnsel exports. It's currently 403'ed by webserver configuration, since some jerks are scraping it at maximum speed. But I'd like to expose it again someday. I mention exitlist because it's on the same apache as check currently, but there's no reason it necessarily needs to move at the same time.

Currently we run tordnsel on the same machine as check, since the check cgi's do dns queries and it's best to do those over localhost. I'd like to move tordnsel elsewhere though, to better partition things. We can do that only once none of the check processes do these dns queries.

Child Tickets

Ticket  Summary                                      Owner
#6590   Update Chinese translation for Tor Check     Ry
#8866   Include Arabic translation for TorCheck.pot  Ry
#10023  TBB version warning not working?

Change History (20)

comment:1 Changed 8 months ago by phobos

Is this check.tpo (tor or not) functionality still needed when TBB 3.x is released?

As for exit list, it's a volunteer service we've offered to the world for a few years. We could simply shut it down or get someone to sponsor it to keep it alive. HP and some forum software have it baked into their products assuming we'll run it forever for free.

comment:2 in reply to: ↑ 1 Changed 8 months ago by arma

Replying to phobos:

Is this check.tpo (tor or not) functionality still needed when TBB 3.x is released?

Yes, at least until we want to break TBB 2.x.

As for exit list, it's a volunteer service we've offered to the world for a few years. We could simply shut it down or get someone to sponsor it to keep it alive. HP and some forum software have it baked into their products assuming we'll run it forever for free.

The exit list is the thing that check uses, and also the thing that bulk exit list uses. They're all tied together.

bridgedb relies on a bulk exit list output, so it can treat Tor exits specially. So I don't think it can go away entirely.

comment:3 Changed 8 months ago by phobos

exitlist/tordnsel is a separate piece of software which can be run standalone on various servers. Bridgedb probably should run its own copy, rather than rely on a central server.

As for check.tpo website, it shouldn't exist at all. The functionality of it should be moved to the browser, ala TBB 3.x. Over the life of check.tpo, we've replaced perl with python, and apparently are ready to replace python with go. We keep re-writing the same bad architecture in the cool language of the day. And to be clear, the bad architecture is to have the entire tor browser userbase hit a single website to learn "tor or not".

All that being said, the basic questions about this go implementation are around scaling. We seem to sustain 40-70 requests per second throughout the day[1]. We peak at 500 requests per second at really busy times, such as last week when check went down. Can Arlo's code handle this? How much memory is consumed on average? How many cpu cores does it need to handle all of this? Or is the answer to deploy it and find out?

[1] https://munin.torproject.org/torproject.org/sergii.torproject.org/index.html#apache

comment:4 Changed 8 months ago by tup

  • Cc dkantola@… added

comment:5 Changed 8 months ago by mccajm

  • Cc adam@… added

I would be interested in contributing to this if Arlo would like a hand. I'm also able to benchmark if needed.

500 req/s seems like an _easily_ achievable number. Although all software is different, we've benchmarked similar go services up to 65k req/s. That used 15% of a core and a few hundred MB of RAM. I need to read TorCheck.py further to understand what I'm missing, but from speaking to arma the current problem seems to be the wait time: spawning a cgi process and iowait for things like dns. Is the current Apache compiled with a modified MAX_CLIENTS? If the connections are taking longer than a second, Apache will only be able to handle 255 simultaneous connections.

A few thoughts on the go code:

  • Check seems to serve pages which either occasionally change (like RecommendedTBBVersions and exit-addresses) or a segment changes based on a dns lookup, such as the index page. The total amount of data here is really small. I would try to eliminate disk reads where possible by loading these into a buffer and serving them directly from there. These buffers could be reloaded in response to a SIGHUP.
  • Responses should be gzipped to close connections more quickly.
  • I don't think the mutexes around ExitMap are necessary. The variable is only written to in LoadLists.

comment:6 in reply to: ↑ 5 Changed 8 months ago by arlolra

I would be interested in contributing to this if Arlo would like a hand. I'm also able to benchmark if needed.

I'm happy to have the help. Feel free to send pull requests. Some benchmarking seems like a great place to start.

  • Check seems to serve pages which either occasionally change (like RecommendedTBBVersions and exit-addresses) or a segment changes based on a dns lookup, such as the index page. The total amount of data here is really small. I would try to eliminate disk reads where possible by loading these into a buffer and serving them directly from there. These buffers could be reloaded in response to a SIGHUP.

Agreed. There's already a listener to reload the exit list.
https://github.com/arlolra/check/blob/master/check.go#L285-L294

This makes me think we should inline the css file and remove that extra request.

  • Responses should be gzipped to close connections more quickly.

Yup. check2.torproject.org seems to have gzip enabled already.

  • I don't think the mutexes around ExitMap are necessary. The variable is only written to in LoadLists.

Maps in golang aren't thread safe and LoadLists is signalled to run in another channel.

comment:7 in reply to: ↑ 6 Changed 8 months ago by arlolra

  • I don't think the mutexes around ExitMap are necessary. The variable is only written to in LoadLists.

Maps in golang aren't thread safe and LoadLists is signalled to run in another channel.

On second thought, I was just going on the fact that it's not defined what happens when you read and write to them simultaneously.

But this is just a pointer swap. You're right, it can be removed.
https://github.com/arlolra/check/blob/master/check.go#L148-L151
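For reference, one lock-free way to do this kind of swap while keeping readers race-free under the Go memory model is an atomic pointer. This is a sketch under assumptions, not what check.go does: exitMap, loadLists and isExit are hypothetical stand-ins, and atomic.Pointer needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// exitMap is a hypothetical stand-in for check's ExitMap: exit IP -> true.
type exitMap map[string]bool

// current points at the map readers should use. A new map is fully built
// before it is published, so readers never observe a half-filled map.
var current atomic.Pointer[exitMap]

// loadLists builds a fresh map and publishes it with one atomic store.
func loadLists(ips []string) {
	m := make(exitMap, len(ips))
	for _, ip := range ips {
		m[ip] = true
	}
	current.Store(&m)
}

// isExit reports whether ip is in the most recently published map.
func isExit(ip string) bool {
	m := current.Load()
	if m == nil {
		return false // nothing loaded yet
	}
	return (*m)[ip]
}

func main() {
	// Addresses are from the documentation range, not real exits.
	loadLists([]string{"192.0.2.1"})
	fmt.Println(isExit("192.0.2.1"), isExit("192.0.2.2")) // true false

	loadLists([]string{"192.0.2.2"})
	fmt.Println(isExit("192.0.2.1"), isExit("192.0.2.2")) // false true
}
```

The published map is never modified after the store, which is what makes concurrent reads safe; only the pointer changes.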

comment:8 in reply to: ↑ 5 Changed 8 months ago by phobos

Replying to mccajm:

500 req/s seems like an _easily_ achievable number. Although all software is different, we've benchmarked similar go services up to 65k req/s. That used 15% of a core and a few hundred MB of RAM. I need to read TorCheck.py further to understand what I'm missing, but from speaking to arma the current problem seems to be the wait time: spawning a cgi process and iowait for things like dns. Is the current Apache compiled with a modified MAX_CLIENTS? If the connections are taking longer than a second, Apache will only be able to handle 255 simultaneous connections.

We run wsgi in daemon mode, which pre-spawns 50 python processes along with 40 threads per process. There is no cgi load time per request. From looking at the server when it's busy, torcheck.py is the bottleneck, not apache nor tordnsel. Tordnsel typically returns answers in 0.03 seconds. We haven't tried debugging torcheck.py to see what's so slow inside it.

The current apache is stock debian, we don't compile software on and for production machines. Generally, if it's not in debian nor torproject repos, we don't want to run it. Apache on the machine can handle 800 maxclients using the worker mpm model.

comment:9 Changed 8 months ago by arlolra

A and C are implemented as described in #9204 and running on https://check2.torproject.org. B and D can be served from memory as mccajm suggests.

Some preliminary benchmarking (https://github.com/arlolra/check/issues/4) shows it to be rather performant: 2k req/s, which comfortably meets the requirements.

phobos, I appreciate your concerns. My efforts here are mainly in response to arma's cry: https://lists.torproject.org/pipermail/tor-talk/2013-August/029306.html

comment:10 Changed 8 months ago by phobos

Sounds good. My concerns are to make sure we're not implementing something worse. Otherwise, let's deploy it and see how it goes.

comment:11 Changed 8 months ago by arlolra

There's a chance to review copy and translations in #9654.

comment:12 Changed 8 months ago by Ry

Regarding loading B & D from memory in go itself, if we're proxying through apache does it make any sense? If they're static/rarely updated files then would it not be better to just serve them straight from apache? The OS should be caching them to memory anyway assuming we've got some spare? As an added benefit the OS knows when the files get updated :)

See http://httpd.apache.org/docs/2.2/caching.html (In-Memory Caching)

comment:13 in reply to: ↑ 10 Changed 7 months ago by arlolra

Replying to phobos:

Sounds good. My concerns are to make sure we're not implementing something worse. Otherwise, let's deploy it and see how it goes.

I think we've reached the point where it is at least no worse,
https://check2.torproject.org/cgi-bin/TorBulkExitList.py?ip=123.123.123.123&port=443

Time to deploy?

comment:14 Changed 6 months ago by arma

Sounds great.

So the deployment process is that you change the apache vhosts files on check to use your new one, ask it to reload, and then we wait to see who complains about something being broken?

I say go for it.

(Did you see the list of IPs that it got wrong, that I think Philipp sent? But that is not in principle a reason to delay, since the current one is even wronger.)

comment:15 in reply to: ↑ 14 Changed 6 months ago by arlolra

Replying to arma:

So the deployment process is that you change the apache vhosts files on check to use your new one, ask it to reload, and then we wait to see who complains about something being broken?

No, check2 is currently just running in screen. The deployment process is closer to making the vhosts changes you describe, installing the service as described here,

https://github.com/arlolra/check#setup

and the cron to keep it up-to-date.

(Did you see the list of IPs that it got wrong, that I think Philipp sent? But that is not in principle a reason to delay, since the current one is even wronger.)

I hadn't. Just opened an issue for further investigation,

https://github.com/arlolra/check/issues/21

comment:16 Changed 5 months ago by arlolra

This is live so can probably be closed.

Do we want B) and D) above?

Also, there's a link to munin up there. It'd be great if I could see that or if some sort of nagios alert was sent to me if the site was unreachable.

comment:17 Changed 5 months ago by phobos

Yes, we need B. We should also provide D if easy to do.

Munin is used for historical tracking of server resources. We don't really rely on Icinga/nagios to alert us to issues. There is a public channel on irc.torproject.org which echoes the current alerts.

comment:18 Changed 5 months ago by arlolra

  • Resolution set to implemented
  • Status changed from new to closed

Ok, B) is up again at https://check.torproject.org/RecommendedTBBVersions

As for D), I exposed it at https://check.torproject.org/exit-addresses
Not the same subdomain as before but at least it's there now.

comment:19 Changed 5 months ago by arma

Is this exit-addresses file the same one as tordnsel exports, or is this a combination of what tordnsel says and what check adds to it (e.g. for relays that tordnsel didn't mention)?

comment:20 Changed 5 months ago by arlolra

It's the same one that tordnsel exports, symlinked into the DocumentRoot.

If you want, check can export all the data that it knows about, maybe in a more convenient format, like JSON.
