https://github.com/arlolra/check
is a (not yet completed) rewrite of TorCheck.py, but it comes with its own webserver, which should help with both the "the cgi has to parse a bunch of stuff on startup each time" problem and the "oh my god, we're running how many apaches" problem (which are related).
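For illustration only (this is not Arlo's actual code, and the file name and handler are made up), a minimal sketch of the pattern a self-contained Go webserver makes possible: parse the data once at startup in one long-lived process, instead of re-parsing it in a fresh cgi process on every request:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"strings"
)

// loadExits stands in for the "parse a bunch of stuff" step that the cgi
// currently repeats on every hit; here it runs once, at startup.
func loadExits(path string) (map[string]bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	exits := make(map[string]bool)
	for _, ip := range strings.Fields(string(data)) {
		exits[ip] = true
	}
	return exits, nil
}

func main() {
	exits, err := loadExits("exit-ips.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		host, _, _ := net.SplitHostPort(r.RemoteAddr)
		fmt.Fprintf(w, "IsTor=%v\n", exits[host])
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```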
To do the replacement though, we need to enumerate all the things that check.tp.o serves right now.
D) https://exitlist.torproject.org/exit-addresses used to be the static file that tordnsel exports. It's currently 403'ed by webserver configuration, since some jerks are scraping it at maximum speed. But I'd like to expose it again someday. I mention exitlist because it's on the same apache as check currently, but there's no reason it necessarily needs to move at the same time.
Currently we run tordnsel on the same machine as check, since the check cgis do dns queries and it's best to do those over localhost. I'd like to move tordnsel elsewhere though, to better partition things. We can do that only once none of the check processes do these dns queries.
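For context on what those lookups involve, here is a rough sketch of a TorDNSEL "ip-port" query as I understand its documented format (reversed candidate-exit IP, then target port, then reversed target IP, under ip-port.exitlist.torproject.org, answering 127.0.0.2 on a match); the addresses below are placeholders:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// reverse turns "1.2.3.4" into "4.3.2.1", as the DNSEL zone expects.
func reverse(ip string) string {
	p := strings.Split(ip, ".")
	for i, j := 0, len(p)-1; i < j; i, j = i+1, j-1 {
		p[i], p[j] = p[j], p[i]
	}
	return strings.Join(p, ".")
}

func main() {
	// "Is candidateIP a Tor exit that can reach targetIP:targetPort?"
	candidateIP, targetIP, targetPort := "192.0.2.1", "198.51.100.2", "443"
	q := fmt.Sprintf("%s.%s.%s.ip-port.exitlist.torproject.org",
		reverse(candidateIP), targetPort, reverse(targetIP))
	addrs, err := net.LookupHost(q)
	if err != nil {
		fmt.Println("no record (not an exit, or lookup failed):", err)
		return
	}
	fmt.Println("answer:", addrs) // 127.0.0.2 means "yes, it's an exit"
}
```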
Is this check.tpo (tor or not) functionality still needed when TBB 3.x is released?
As for the exit list, it's a volunteer service we've offered to the world for a few years. We could simply shut it down or get someone to sponsor it to keep it alive. HP and some forum software have it baked into their products, assuming we'll run it forever for free.
Is this check.tpo (tor or not) functionality still needed when TBB 3.x is released?
Yes, at least until we want to break TBB 2.x.
As for the exit list, it's a volunteer service we've offered to the world for a few years. We could simply shut it down or get someone to sponsor it to keep it alive. HP and some forum software have it baked into their products, assuming we'll run it forever for free.
The exit list is the thing that check uses, and also the thing that bulk exit list uses. They're all tied together.
bridgedb relies on a bulk exit list output, so it can treat Tor exits specially. So I don't think it can go away entirely.
exitlist/tordnsel is a separate piece of software which can be run standalone on various servers. Bridgedb probably should run its own copy, rather than rely on a central server.
As for the check.tpo website, it shouldn't exist at all. Its functionality should be moved into the browser, a la TBB 3.x. Over the life of check.tpo, we've replaced perl with python, and apparently we're ready to replace python with go. We keep rewriting the same bad architecture in the cool language of the day. And to be clear, the bad architecture is having the entire tor browser userbase hit a single website to learn "tor or not".
All that being said, the basic questions about this go implementation are around scaling. We seem to sustain 40-70 requests per second throughout the day[1]. We peak at 500 requests per second at really busy times, such as last week when check went down. Can Arlo's code handle this? How much memory is consumed on average? How many cpu cores does it need to handle all of this? Or is the answer to deploy it and find out?
I would be interested in contributing to this if Arlo would like a hand. I'm also able to benchmark if needed.
500 req/s seems like an easily achievable number. Although all software is different, we've benchmarked similar go services up to 65k req/s. That used 15% of a core and a few hundred MB of RAM. I need to read TorCheck.py further to understand what I'm missing, but from speaking to arma the current problem seems to be wait time: spawning a cgi process and iowait for things like dns. Is the current Apache compiled with a modified MAX_CLIENTS? If the connections are taking longer than a second, Apache will only be able to handle 255 simultaneous connections.
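For what it's worth, here is roughly how I'd start measuring the go code in-process (a sketch in a _test.go file; the handler below is a stand-in for whatever check2 actually serves), before confirming the numbers with an external load generator:

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// handler is a placeholder for the real check handler under test.
var handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	io.WriteString(w, "Congratulations. This browser is configured to use Tor.\n")
})

// Run with: go test -bench . -benchmem
// ns/op gives a rough ceiling on requests per second per core, and
// -benchmem shows the allocations behind the memory numbers.
func BenchmarkIndex(b *testing.B) {
	srv := httptest.NewServer(handler)
	defer srv.Close()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			b.Fatal(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```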
A few thoughts on the go code:
Check seems to serve pages that either change occasionally (like RecommendedTBBVersions and exit-addresses) or have a segment that changes based on a dns lookup, such as the index page. The total amount of data here is really small. I would try to eliminate disk reads where possible by loading these into a buffer and serving them directly from there. These buffers could be reloaded in response to a SIGHUP (see the sketch after this list).
Responses should be gzipped to close connections more quickly.
I don't think the mutexes around ExitMap are necessary. The variable is only written to in LoadLists.
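To make the first two points concrete, a rough sketch of one way to do it (the file name and route are placeholders, and real code would check Accept-Encoding before forcing gzip):

```go
package main

import (
	"compress/gzip"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// cached holds the current contents of one small, rarely-changing file;
// atomic.Value lets request handlers read a consistent snapshot while a
// reload swaps it out.
var cached atomic.Value

func load(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Printf("reload %s: %v", path, err)
		return
	}
	cached.Store(data)
}

func main() {
	const path = "RecommendedTBBVersions" // placeholder path
	load(path)
	if cached.Load() == nil {
		log.Fatalf("initial load of %s failed", path)
	}

	// Reload the in-memory copy on SIGHUP instead of rereading per request.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			load(path)
		}
	}()

	http.HandleFunc("/RecommendedTBBVersions", func(w http.ResponseWriter, r *http.Request) {
		// Serve straight from memory, gzip-compressed.
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		gz.Write(cached.Load().([]byte))
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```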
I would be interested in contributing to this if Arlo would like a hand. I'm also able to benchmark if needed.
I'm happy to have the help. Feel free to send pull requests. Some benchmarking seems like a great place to start.
Check seems to serve pages that either change occasionally (like RecommendedTBBVersions and exit-addresses) or have a segment that changes based on a dns lookup, such as the index page. The total amount of data here is really small. I would try to eliminate disk reads where possible by loading these into a buffer and serving them directly from there. These buffers could be reloaded in response to a SIGHUP.
500 req/s seems like an easily achievable number. Although all software is different, we've benchmarked similar go services up to 65k req/s. That used 15% of a core and a few hundred MB of RAM. I need to read TorCheck.py further to understand what I'm missing, but from speaking to arma the current problem seems to be wait time: spawning a cgi process and iowait for things like dns. Is the current Apache compiled with a modified MAX_CLIENTS? If the connections are taking longer than a second, Apache will only be able to handle 255 simultaneous connections.
We run wsgi in daemon mode, which pre-spawns 50 python processes along with 40 threads per process. There is no cgi load time per request. From looking at the server when it's busy, torcheck.py is the bottleneck, not apache or tordnsel. Tordnsel typically returns answers in 0.03 seconds. We haven't tried debugging torcheck.py to see what's so slow inside it.
The current apache is stock debian; we don't compile software on or for production machines. Generally, if it's not in the debian or torproject repos, we don't want to run it. Apache on the machine can handle 800 maxclients using the worker mpm model.
Regarding loading B & D into memory in go itself: if we're proxying through apache, does it make any sense? If they're static, rarely-updated files, wouldn't it be better to just serve them straight from apache? The OS should be caching them in memory anyway, assuming we've got some spare. As an added benefit, the OS knows when the files get updated :)
So the deployment process is that you change the apache vhosts files on check to use your new one, ask it to reload, and then we wait to see who complains about something being broken?
I say go for it.
(Did you see the list of IPs that it got wrong, that I think Philipp sent? But that is not in principle a reason to delay, since the current one is even wronger.)
So the deployment process is that you change the apache vhosts files on check to use your new one, ask it to reload, and then we wait to see who complains about something being broken?
No, check2 is currently just running in screen. The deployment process is closer to making the vhosts changes you describe, installing the service as described here,
(Did you see the list of IPs that it got wrong, that I think Philipp sent? But that is not in principle a reason to delay, since the current one is even wronger.)
I hadn't. Just opened an issue for further investigation,
Yes, we need B. We should also provide D if easy to do.
Munin is used for historical tracking of server resources. We don't really rely on Icinga/nagios to alert us to issues. There is a public channel on irc.torproject.org which echoes the current alerts.
Is this exit-addresses file the same one as tordnsel exports, or is this a combination of what tordnsel says and what check adds to it (e.g. for relays that tordnsel didn't mention)?