setup a cache frontend for the blog
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- point DNS to caches (done)
- raise DNS back to 1h if all goes well. (done!)
Post launch tasks:
- update documentation to fill in information from template (done)
- hit ratio stats into Prometheus, separate ticket? (done, although missing syslog buffer, see #32461 (moved))
- convert existing varnish setups into Nginx (probably requires Puppet refactoring, see #32462 (moved))
Disaster recovery:
- flip DNS back to backend
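The wait steps in the launch checklist follow from how DNS caching works: after a TTL is lowered, resolvers may keep the old record for up to the previous TTL, so the next change is only safe once that interval has passed. A minimal sketch of that reasoning, assuming hypothetical helper names and taking the 10 minute and 3 minute TTLs and the 2019-11-05 16:00:00 timestamp from the checklist; it ignores resolvers that do not honor TTLs:

```python
from datetime import datetime, timedelta


def earliest_safe_change(ttl_lowered_at: datetime, previous_ttl: timedelta) -> datetime:
    """Return the time after which no well-behaved resolver should still
    hold a record cached under the previous (longer) TTL."""
    return ttl_lowered_at + previous_ttl


def worst_case_rollback(current_ttl: timedelta) -> timedelta:
    """Upper bound on how long clients may keep using the old answer after
    DNS is flipped back to the backend (the disaster recovery step)."""
    return current_ttl


# Values from the checklist: TTL lowered from 10 minutes to 3 minutes
# around 2019-11-05 16:00:00.
lowered_at = datetime(2019, 11, 5, 16, 0, 0)
print(earliest_safe_change(lowered_at, timedelta(minutes=10)))
print(worst_case_rollback(timedelta(minutes=3)))
```

The same logic explains why the TTL is only raised back to 1h once everything looks fine: a long TTL makes the rollback window just as long.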
Activity
Trac:
Parent Ticket: #32090 (moved)
Trac:
Description: design docs in https://help.torproject.org/tsa/howto/cache/
to
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives
- setup secondary node with Puppet, say cache-02.tpo
- validation benchmark against both nodes
- lower DNS to 300 seconds, wait an hour
- flip DNS to the cache node, wait and monitor for 5 minutes
- raise DNS back to 1h if all goes well.
Status: assigned to accepted
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives
- setup secondary node with Puppet, say cache-02.tpo
- validation benchmark against both nodes
- lower DNS to 300 seconds, wait an hour
- flip DNS to the cache node, wait and monitor for 5 minutes
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to pantheon
ATS Host header bug was solved (set pristine_host_hdr to 1) and benchmarks performed, results in the wiki.
now setup nginx and setup benchmarks as well, on a secondary VPS hosted in hetzner cloud.
benchmarks seem to show Nginx performance is comparable to or faster than ATS.
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (in progress, missing puppet config)
- validation benchmark against both nodes
- lower DNS to 300 seconds, wait an hour
- flip DNS to the cache node, wait and monitor for 5 minutes
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to pantheon
i sent the following to TPA to seek approval:
Subject: leaning towards nginx as a cache server
To: "TorProject.org admins" <torproject-admin@torproject.org>
Date: Mon, 28 Oct 2019 21:28:07 -0400
Hi!
I've done some benchmarks of ATS vs Nginx, just to confirm Nginx was comparably fast enough, as I had concerns about that. I had heard rumours its caching layer was sub-par and wanted to make sure it would survive the load. It passed the tests with great success.
It's also easier to configure.
You can see my detailed analysis here:
https://help.torproject.org/tsa/howto/cache/
And the steps remaining for the launch are detailed here:
https://trac.torproject.org/projects/tor/ticket/32239
If you have no objections, I would most likely perform the first production switch tomorrow or wednesday, to see if we can handle the load.
A.
--
Antoine Beaupré
torproject.org system administration
tpa doesn't seem to have objections.
tested that logged in users don't hit the cache on nginx, so that's good.
will setup nginx on both nodes next, with this puppet module: https://forge.puppet.com/puppet/nginx
first had to clean up puppet because there were versioning conflicts in the third party modules since the stdlib upgrade.
new node is up and works, configured with puppet and the new nginx module.
copied the cipher suite from the apache config, but i'm not sure about that, so I opened #32351 (moved) to follow up on the suite in apache too.
next step is to deploy on the original node with puppet, run sanity tests against both nodes, then flip the switch. whoohoo!
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (partial)
- lower DNS to 300 seconds, wait an hour (set TTL to 10min, waiting)
- flip DNS to the cache node, wait and monitor for 5 minutes
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to pantheon
the original node is now setup with puppet as well. ran into a problem when trying to figure out hit ratios: those stats are available only in the commercial version.
we might need to pipe stuff through mtail to get those metrics in prometheus. in the meantime, maybe we can still launch without those? :/
the TTL is still low, and i am thinking of launching tomorrow if nothing else comes up. i've changed the procedure slightly to add the caching servers in the pool instead of replacing the backend completely. that way we have a smoother transition and can fall back more easily if something goes wrong.
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- lower DNS to 3 minutes
- add one node to the DNS, check if traffic flows properly after 10 minutes
- add the other node to DNS, again checking traffic
- if all is well, remove backend from DNS
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to backend
forgot that we need to open firewall
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall
- lower DNS to 3 minutes
- add one node to the DNS, check if traffic flows properly after 10 minutes
- add the other node to DNS, again checking traffic
- if all is well, remove backend from DNS
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to backend
lowered TTL to 3 minutes, getting ready for launch
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- add one node to the DNS, check if traffic flows properly after 10 minutes
- add the other node to DNS, again checking traffic
- if all is well, remove backend from DNS
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to backend
i've flipped DNS to the new nodes.
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- point DNS to caches
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to backend
i've reverted to the originally planned procedure where we just flip the switch because it's a simpler procedure. i've also setup a cache.tpo alias to point to the cluster of machines so we can get other sites in and out of rotation with a single CNAME, instead of having to maintain possibly multiple entries under multiple entries.
traffic now seems to be flowing into the nodes without noticeable problems. load is negligible:
Load average: 0.03 0.02 0.00
we have space for 12GB of cache on cache-02:
anarcat@cache-02:~$ df -h /var/cache/nginx/
Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/croot   19G  3.0G   15G  18% /
and ~7GB on cache01:
root@cache01:~# df -h /var/cache/nginx/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       9.8G  1.9G  7.4G  21% /
that's probably what we should pay closest attention to, actually, since it's not clear nginx will do the right thing with out of disk space conditions.
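One way to keep an eye on this is to check free space on the filesystem holding the cache directory. The following is a minimal sketch, not what is deployed here: the function name and the 10% threshold are assumptions for illustration, and only the /var/cache/nginx/ path comes from the output above.

```python
import shutil


def cache_disk_report(path: str, min_free_fraction: float = 0.10) -> dict:
    """Report disk usage for the filesystem containing `path`.

    `min_free_fraction` is an arbitrary illustrative threshold, not a value
    taken from this ticket.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    return {
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "free_bytes": usage.free,
        "free_fraction": free_fraction,
        "low": free_fraction < min_free_fraction,
    }


if __name__ == "__main__":
    # "." is used so the example runs anywhere; on a cache node the path
    # would be /var/cache/nginx/.
    report = cache_disk_report(".")
    print(f"{report['free_fraction']:.0%} free, low={report['low']}")
```

A check like this only reports the state of the disk; it says nothing about how nginx itself behaves when the disk fills up, which is the open question here.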
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- point DNS to caches (done)
- raise DNS back to 1h if all goes well.
Disaster recovery:
- flip DNS back to backend
as predicted, we have between 2 and 4 mbit/s although that traffic is present on both nodes, so it remains to be seen what the actual traffic sum is. we might be a little above double the expected average, but it should still be within acceptable boundaries in terms of costs.
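To get a feel for what those rates mean in monthly transfer, the per-node figures can be summed and converted. A rough sketch, assuming a sustained average rate, a 30-day month and decimal terabytes (all assumptions; only the range of 2 to 4 mbit/s on each of the two nodes comes from the observation above):

```python
SECONDS_PER_DAY = 86_400


def monthly_terabytes(mbit_per_s: float, days: int = 30) -> float:
    """Convert a sustained rate in Mbit/s into decimal terabytes per month."""
    bytes_per_second = mbit_per_s * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY * days / 1e12


# Two nodes, each seeing between 2 and 4 Mbit/s.
NODES = 2
for per_node in (2, 4):
    total = per_node * NODES
    print(f"{total} Mbit/s total is about {monthly_terabytes(total):.2f} TB/month")
```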
added monitoring of https://blog.tpo in nagios, to make really sure that keeps working.
last remaining step here is to make sure the hit ratio is correct, and update the documentation to fill in the "tutorial" and "howto" sections from the template.
there is also longer-term work here to convert existing varnish instances, probably. updated the summary to cover for that.
Trac:
Status: accepted to merge_ready
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- point DNS to caches (done)
- raise DNS back to 1h if all goes well. (done!)
Post launch tasks:
- update documentation to fill in information from template
- hit ratio stats into Prometheus, separate ticket?
- convert existing varnish setups into Nginx (probably requires Puppet refactoring), separate ticket?
Disaster recovery:
- flip DNS back to backend
at first glance, our hit ratio is not great. i used the following lnav log format parser:
{
    "cache" : {
        "title" : "Tor Nginx cache Log Format",
        "description" : "Log format used by the Tor Project for its Nginx cache servers.",
        "url" : "http://example.com/log-format.html",
        "regex" : {
            "basic" : {
                "pattern" : "^(?<client>[\\w\\d:.-]+) - (?<user>[\\w\\d:.-]+) \\[(?<timestamp>\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}(?:\\.\\d+)?(?: [+-]\\d{4}))?\\] \"(?<method>\\w+) (?<path>\\S+) (?<transport>\\S+)\" (?<status_code>\\d+) (?<resp_size>\\d+|-) \"(?<referer>[^\"]+)\" \"-\" (?<upstream_addr>\\S+) (?<upstream_response_time>[\\d.-]+) (?<upstream_cache_status>\\S+) (?<request_time>[\\d.-]+)$"
            }
        },
        "level-field" : "upstream_cache_status",
        "level" : {
            "error" : "MISS",
            "warning" : "BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED",
            "info" : "HIT"
        },
        "value" : {
            "client" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "user" : { "kind" : "string", "identifier" : true },
            "method" : { "kind" : "string", "identifier" : true },
            "path" : { "kind" : "string", "identifier" : true },
            "transport" : { "kind" : "string", "identifier" : true },
            "status_code" : { "kind" : "integer", "identifier" : true },
            "resp_size" : { "kind" : "integer", "identifier" : true },
            "referer" : { "kind" : "string", "identifier" : true },
            "upstream_addr" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "upstream_response_time" : { "kind" : "float", "identifier" : true },
            "upstream_cache_status" : { "kind" : "string", "identifier" : true },
            "request_time" : { "kind" : "float", "identifier" : true }
        },
        "sample" : [
            {
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:09 +0000] \"GET /blog/feed HTTP/2.0\" 200 483311 \"-\" \"-\" - - HIT 0.373",
                "level" : "info"
            },
            {
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /new-release-tor-browser-90 HTTP/2.0\" 200 26764 \"https://www.google.com/\" \"-\" [2620:12a:8000::2]:443 0.416 MISS 0.420",
                "level" : "error"
            },
            {
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /robots.txt HTTP/1.1\" 200 1594 \"-\" \"-\" [2620:12a:8001::2]:443 0.020 MISS 0.020",
                "level" : "error"
            }
        ]
    }
}
and then ran lnav on the access logfile, then ran the following query (after hitting the ; key):
select count(*), upstream_cache_status from logline where status_code < 300 group by upstream_cache_status;
which gave the following results:
count(*) │ upstream_cache_status
       6 │ -
       7 │ EXPIRED
    2925 │ HIT
     680 │ MISS
       5 │ REVALIDATED
That's an 80% hit ratio, and that's when ignoring any status code above 2xx. If those are included, our hit ratio is around 72%.
So there is still some work to be done to improve the hit ratio.
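The same tally can be reproduced outside lnav. The sketch below is an approximation of the log format above rewritten for Python's `re` module (which uses `(?P<name>...)` for named groups), not the parser actually used; the function names are made up for illustration. It is fed the three sample lines from the format definition and the counts reported by the query.

```python
import re
from collections import Counter

# Simplified Python rendering of the lnav pattern above.
LOG_RE = re.compile(
    r'^(?P<client>\S+) - (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) (?P<transport>\S+)" '
    r'(?P<status_code>\d+) (?P<resp_size>\d+|-) '
    r'"(?P<referer>[^"]+)" "-" '
    r'(?P<upstream_addr>\S+) (?P<upstream_response_time>[\d.-]+) '
    r'(?P<upstream_cache_status>\S+) (?P<request_time>[\d.-]+)$'
)


def count_by_cache_status(lines, max_status=300):
    """Count requests per upstream_cache_status, keeping status_code < max_status."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and int(match["status_code"]) < max_status:
            counts[match["upstream_cache_status"]] += 1
    return counts


def hit_ratio(counts):
    """Fraction of counted requests served as HIT."""
    total = sum(counts.values())
    return counts.get("HIT", 0) / total if total else 0.0


SAMPLE_LINES = [
    '0.0.0.0 - - [07/Nov/2019:21:28:09 +0000] "GET /blog/feed HTTP/2.0" 200 483311 "-" "-" - - HIT 0.373',
    '0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] "GET /new-release-tor-browser-90 HTTP/2.0" 200 26764 "https://www.google.com/" "-" [2620:12a:8000::2]:443 0.416 MISS 0.420',
    '0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] "GET /robots.txt HTTP/1.1" 200 1594 "-" "-" [2620:12a:8001::2]:443 0.020 MISS 0.020',
]

# Counts reported by the lnav query for status codes below 300.
REPORTED = {"-": 6, "EXPIRED": 7, "HIT": 2925, "MISS": 680, "REVALIDATED": 5}

print(count_by_cache_status(SAMPLE_LINES))
print(f"hit ratio from reported counts: {hit_ratio(REPORTED):.1%}")
```

Note that this counts only HIT as a hit; whether REVALIDATED or EXPIRED responses should count either way is a judgment call and barely changes the result with these numbers.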
this is how it looks on pantheon's side:
ie. it went from 15841 to 4957 visits in a day, a reduction of about 69% (roughly 31% of the visits still reach the backend), which roughly matches the overall hit ratio (70%) we're seeing. a 70% reduction should bump us from the current "extra large" (300k visits per month, 1000$/mth) down to "medium" (50k, 300$) or more likely "large" (150k, 600$) because we'd still hit 90k visits per month, or more: 5k visits per day is almost exactly 150k visits per month, depending on the month.
so if we want to get back to the 300$/mth, we need to work on that hit ratio, big time. there are probably a few things we can do to work on that, but that will need more investigation.
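The arithmetic behind that estimate can be checked quickly. This is a sketch using only the visit counts and plan figures quoted in this comment; the helper names are hypothetical, and treating the plan limits as inclusive is an assumption.

```python
# (name, visits per month, price in $/month), as quoted in the comment above.
TIERS = [
    ("medium", 50_000, 300),
    ("large", 150_000, 600),
    ("extra large", 300_000, 1000),
]


def reduction(before: int, after: int) -> float:
    """Fraction of visits that no longer reach the backend."""
    return 1 - after / before


def smallest_tier(monthly_visits: int):
    """Return the cheapest (name, price) whose limit covers monthly_visits,
    or None if even the largest tier is too small."""
    for name, limit, price in TIERS:
        if monthly_visits <= limit:
            return name, price
    return None


print(f"reduction: {reduction(15841, 4957):.0%}")
for days in (30, 31):
    monthly = 4957 * days
    print(days, "days:", monthly, "visits ->", smallest_tier(monthly))
```

With these numbers a 30-day month lands just under the "large" limit and a 31-day month just over it, which is why the result depends on the month.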
i tweaked the cache so it keeps inactive entries for longer than the default (10m) and caches 404s and redirects (which isn't done by default).
there are pages that return "bad" headers like this one:
anarcat@angela:~(master)$ curl -sSLI https://blog.torproject.org/new-release-tor-browser-90a5 | grep -i cache-control
cache-control: must-revalidate, no-cache, private
that's a problem on the drupal side of things.
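To spot such pages in bulk, the header value can be split into directives and checked for the ones that keep a shared cache from storing the response. A minimal sketch with hypothetical function names; it applies a deliberately simplified rule (any of no-cache, no-store or private means "not cached") and does not model everything a real cache takes into account.

```python
# Directives that, under the simplified rule used here, prevent caching.
UNCACHEABLE = {"no-cache", "no-store", "private"}


def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"') or None
    return directives


def blocking_directives(value: str) -> set:
    """Return the directives in `value` that would keep the page out of the cache."""
    return UNCACHEABLE.intersection(parse_cache_control(value))


# Header returned by the page above.
print(sorted(blocking_directives("must-revalidate, no-cache, private")))
# A cache-friendly header, for comparison (made-up example value).
print(sorted(blocking_directives("public, max-age=300")))
```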
Integrated the mtail metrics in prometheus, which should record the hit ratio (and other performance information from nginx logs). Cooked up a quick and dirty grafana dashboard to show the hit rate which is, according to those metrics, now closer to 87%:
very satisfying!
next step is to pipe the logs directly into mtail, instead of writing them to disk.
i think this is done. i split the remaining work in separate tickets:
- syslog buffer for hit stats, see #32461 (moved)
- convert existing varnish setups into Nginx, see #32462 (moved)
the docs have been updated as best I could, which about wraps it up:
https://help.torproject.org/tsa/howto/cache/
Trac:
Description changed to:
design docs in https://help.torproject.org/tsa/howto/cache/
launch checklist:
- alternatives listing and comparison (done)
- deploy a test virtual machine by hand, say cache-01.tpo (done)
- benchmark the different alternatives (done, ATS and nginx comparable)
- setup secondary node with Puppet, say cache-02.tpo (done)
- validation benchmark against both nodes (done)
- lower DNS to 10 minutes, wait an hour (done)
- open firewall (done)
- lower DNS to 3 minutes (done, around 2019-11-05 16:00:00)
- point DNS to caches (done)
- raise DNS back to 1h if all goes well. (done!)
Post launch tasks:
- update documentation to fill in information from template (done)
- hit ratio stats into Prometheus, separate ticket? (done, although missing syslog buffer, see #32461 (moved))
- convert existing varnish setups into Nginx (probably requires Puppet refactoring, see #32462 (moved))
Disaster recovery:
- flip DNS back to backend
Resolution: N/A to fixed
Status: merge_ready to closed
- Trac closed
- anarcat mentioned in issue #32351 (moved)
- anarcat mentioned in issue #32461 (moved)
- anarcat mentioned in issue #32462 (moved)
- Trac mentioned in issue #32090 (moved)
- anarcat mentioned in issue tpo/tpa/team#32461 (closed)
- anarcat mentioned in issue tpo/tpa/team#32351