Opened 3 months ago

Closed 2 months ago

Last modified 2 months ago

#32239 closed task (fixed)

setup a cache frontend for the blog

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Service - cache
Version:
Severity: Normal
Keywords:
Cc: isabela, anarcat, hiro
Actual Points:
Parent ID: #32090
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

design docs in https://help.torproject.org/tsa/howto/cache/

launch checklist:

  1. alternatives listing and comparison (done)
  2. deploy a test virtual machine by hand, say cache-01.tpo (done)
  3. benchmark the different alternatives (done, ATS and nginx comparable)
  4. set up secondary node with Puppet, say cache-02.tpo (done)
  5. validation benchmark against both nodes (done)
  6. lower DNS TTL to 10 minutes, wait an hour (done)
  7. open firewall (done)
  8. lower DNS TTL to 3 minutes (done, around 2019-11-05 16:00:00)
  9. point DNS to caches (done)
  10. raise DNS TTL back to 1h if all goes well (done!)

Post launch tasks:

  1. update documentation to fill in information from template (done)
  2. hit ratio stats into Prometheus, separate ticket? (done, although missing syslog buffer, see #32461)
  3. convert existing varnish setups into Nginx (probably requires Puppet refactoring, see #32462)

Disaster recovery:

  1. flip DNS back to the backend (see the dig sketch below)
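A minimal verification sketch for any of these DNS flips, assuming nothing beyond the stock dig tool and the real blog hostname:

dig +noall +answer blog.torproject.org
# the answer section lists the current target (cache nodes or the backend)
# and the remaining TTL, which is what the launch steps above manipulate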

Child Tickets

Attachments (4)

snap-2019.11.07-17.00.49.png (34.3 KB) - added by anarcat 3 months ago.
snap-2019.11.08-18.13.15.png (31.1 KB) - added by anarcat 3 months ago.
snap-2019.12.09-11.28.37.png (58.5 KB) - added by anarcat 6 weeks ago.
snap-2019.12.09-11.30.12.png (226.7 KB) - added by anarcat 6 weeks ago.


Change History (26)

comment:1 Changed 3 months ago by anarcat

Description: modified (diff)
Status: assigned → accepted

comment:2 Changed 3 months ago by anarcat

Description: modified (diff)

comment:3 Changed 3 months ago by anarcat

TODO:

  • firewall
  • DNS? /etc/hosts?
  • install and config
  • basic hits to see if it works
  • eval benchmark tools
  • do benchmark (from a new test machine?)

comment:4 Changed 3 months ago by anarcat

installed and tested ATS. results are not great: i can't get it to forward the original Host header, which would make switchovers extremely painful and possibly break SEO (which requires a canonical hostname, e.g. blog.torproject.org, which everything redirects to).

comment:5 Changed 3 months ago by anarcat

Description: modified (diff)

the ATS Host header bug was solved (set pristine_host_hdr to 1) and benchmarks were performed; results are in the wiki.
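for reference, that change is a one-liner in ATS's records.config; a minimal sketch, assuming a stock trafficserver install (the reload command may differ per version):

# records.config: pass the client's original Host header through to the origin
CONFIG proxy.config.url_remap.pristine_host_hdr INT 1

# apply without a full restart
traffic_ctl config reload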

also set up nginx and ran benchmarks against it, on a secondary VPS hosted in the Hetzner cloud.

benchmarks seem to show Nginx performance is comparable or faster than ATS.
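the ticket doesn't record the exact benchmark invocation; as an illustration only, a run with ab against one cache node while forcing the blog's canonical Host header could look like this (hostname, request count and concurrency are made up):

ab -n 10000 -c 100 -H "Host: blog.torproject.org" https://cache-01.torproject.org/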

comment:6 Changed 3 months ago by anarcat

i sent the following to TPA to seek approval:

Subject: leaning towards nginx as a cache server
To: "TorProject.org admins" <torproject-admin@…>
Date: Mon, 28 Oct 2019 21:28:07 -0400

Hi!

I've done some benchmarks of ATS vs Nginx, just to confirm Nginx was
comparably fast enough, as I had concerns about that. I had heard
rumours its caching layer was sub-par and wanted to make sure it would
survive the load. It passed the tests with great success.

It's also easier to configure.

You can see my detailed analysis here:

https://help.torproject.org/tsa/howto/cache/

And the steps remaining for the launch are detailed here:

https://trac.torproject.org/projects/tor/ticket/32239

If you have no objections, I would most likely perform the first
production switch tomorrow or wednesday, to see if we can handle the
load.

A.

--
Antoine Beaupré
torproject.org system administration

comment:7 Changed 3 months ago by anarcat

tpa doesn't seem to have objections.

tested that logged-in users don't hit the cache on nginx, so that's good.
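for illustration, one common way to express such a bypass explicitly in nginx looks like this; the SESS pattern for Drupal session cookies is an assumption, not a copy of the deployed config:

# http context: flag requests that carry a session cookie
map $http_cookie $skip_cache {
    default  0;
    ~SESS    1;
}

# server/location context: such requests go to the backend and are not stored
proxy_cache_bypass $skip_cache;
proxy_no_cache     $skip_cache;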

will set up nginx on both nodes next, with this puppet module: https://forge.puppet.com/puppet/nginx

first had to clean up puppet because there were versioning conflicts in the third-party modules since the stdlib upgrade.

comment:8 Changed 3 months ago by anarcat

Description: modified (diff)

new node is up and works, configured with puppet and the new nginx module.

copied the cipher suite from the apache config, but i'm not sure about that, so I opened #32351 to follow up on the suite in apache too.

next step is to deploy on the original node with puppet, run sanity tests against both nodes, then flip the switch. whoohoo!

comment:9 Changed 3 months ago by anarcat

Description: modified (diff)

the original node is now set up with puppet as well. ran into a problem when trying to figure out hit ratios: those stats are only available in nginx's commercial version.

we might need to pipe stuff through mtail to get those metrics in prometheus. in the meantime, maybe we can still launch without those? :/

the TTL is still low, and i am thinking of launching tomorrow if nothing else comes up. i've changed the procedure slightly to *add* the caching servers to the pool instead of replacing the backend completely. that way we have a smoother transition and can fall back more easily if something goes wrong.

comment:10 Changed 3 months ago by anarcat

Description: modified (diff)

forgot that we need to open the firewall

comment:11 Changed 3 months ago by anarcat

Description: modified (diff)

lowered TTL to 3 minutes, getting ready for launch

comment:12 Changed 3 months ago by anarcat

Description: modified (diff)

i've flipped DNS to the new nodes.

comment:13 Changed 3 months ago by anarcat

Description: modified (diff)

i've reverted to the originally planned procedure where we just flip the switch, because it's simpler. i've also set up a cache.tpo alias pointing to the cluster of machines, so we can move other sites in and out of rotation with a single CNAME instead of having to maintain possibly multiple entries for each site.
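in zone-file terms the idea looks roughly like this (illustrative records and placeholder addresses only, not the actual torproject.org zone):

cache  IN A      192.0.2.10   ; cache-01 (placeholder address)
cache  IN A      192.0.2.11   ; cache-02 (placeholder address)
blog   IN CNAME  cache
; moving a site in or out of rotation is then a single CNAME change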

traffic now seems to be flowing into the nodes without noticeable problems. load is negligible:

Load average: 0.03 0.02 0.00

we have space for 12GB of cache on cache-02:

anarcat@cache-02:~$ df -h /var/cache/nginx/
Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/croot   19G  3.0G   15G  18% /

and ~7GB on cache01:

root@cache01:~# df -h /var/cache/nginx/
Filesystem        Size  Used Avail Use% Mounted on
/dev/sda1         9.8G  1.9G  7.4G  21% /

that's probably what we should pay closest attention to, actually, since it's not clear nginx will do the right thing under out-of-disk-space conditions.
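one mitigation is to let nginx cap the cache itself: the cache manager evicts entries once max_size is exceeded. a sketch of the directive, with made-up zone name and sizes:

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=blog:10m
                 max_size=12g inactive=60m use_temp_path=off;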

comment:14 Changed 3 months ago by anarcat

as predicted, we are seeing between 2 and 4 mbit/s of traffic, although that traffic is present on *both* nodes, so it remains to be seen what the actual traffic sum is. we might be a little above double the expected average, but it should still be within acceptable boundaries in terms of costs.

comment:15 Changed 3 months ago by anarcat

Description: modified (diff)
Status: accepted → merge_ready

added monitoring of https://blog.tpo in nagios, to make really sure *that* keeps working.
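the check itself is a standard HTTP(S) probe; something along these lines with the stock monitoring-plugins (the exact command in our nagios config may differ):

/usr/lib/nagios/plugins/check_http -H blog.torproject.org -S -u /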

last remaining step here is to make sure the hit ratio is correct, and update the documentation to fill in the "tutorial" and "howto" sections from the template.

there is also longer-term work here to convert existing varnish instances, probably. updated the summary to cover that.

comment:16 Changed 3 months ago by anarcat

at first glance, our hit ratio is not great. i used the following lnav log format parser:

{   
    "cache" : {
        "title" : "Tor Nginx cache Log Format",
        "description" : "Log format used by the Tor Project for its Nginx cache servers.",
        "url" : "http://example.com/log-format.html",
        "regex" : {
            "basic" : {
                "pattern" : "^(?<client>[\\w\\d:.-]+) - (?<user>[\\w\\d:.-]+) \\[(?<timestamp>\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}(?:\\.\\d+)?(?: [+-]\\d{4}))?\\] \"(?<method>\\w+) (?<path>\\S+) (?<transport>\\S+)\" (?<status_code>\\d+) (?<resp_size>\\d+|-) \"(?<referer>[^\"]+)\" \"-\" (?<upstream_addr>\\S+) (?<upstream_response_time>[\\d.-]+) (?<upstream_cache_status>\\S+) (?<request_time>[\\d.-]+)$"
            }
        },
        "level-field" : "upstream_cache_status",
        "level" : {
            "error" : "MISS",
            "warning" : "BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED",
            "info" : "HIT"
        },
        "value" : {
            "client" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "user" : { "kind" : "string", "identifier" : true },
            "method" : { "kind" : "string", "identifier" : true },
            "path" : { "kind" : "string", "identifier" : true },
            "transport" : { "kind" : "string", "identifier" : true },
            "status_code" : { "kind" : "integer", "identifier" : true },
            "resp_size" : { "kind" : "integer", "identifier" : true },
            "referer" : { "kind" : "string", "identifier" : true },
            "upstream_addr" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "upstream_response_time" : { "kind" : "float", "identifier" : true },
            "upstream_cache_status" : { "kind" : "string", "identifier" : true },
            "request_time" : { "kind" : "float", "identifier" : true }
        },
        "sample" : [
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:09 +0000] \"GET /blog/feed HTTP/2.0\" 200 483311 \"-\" \"-\" - - HIT 0.373",
                "level" : "info"
            },
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /new-release-tor-browser-90 HTTP/2.0\" 200 26764 \"https://www.google.com/\" \"-\" [2620:12a:8000::2]:443 0.416 MISS 0.420",          
                "level" : "error"
            },
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /robots.txt HTTP/1.1\" 200 1594 \"-\" \"-\" [2620:12a:8001::2]:443 0.020 MISS 0.020",
                "level" : "error"
            }
        ]
    }
}

and then ran lnav on the access logfile and executed the following query (after hitting the ; key):

select count(*), upstream_cache_status from logline where status_code < 300 group by upstream_cache_status;

which gave the following results:

count(*)  upstream_cache_status
       6  -
       7  EXPIRED
    2925  HIT
     680  MISS
       5  REVALIDATED

That's an 80% hit ratio, and that's when ignoring any status code above 2xx. If those are included, our hit ratio is around 72%.

So there is still some work to be done to improve the hit ratio.
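for a quick check without lnav, the same breakdown can be pulled from the raw access log with awk; in the format above the cache status is the second-to-last field (the field position and the log path are assumptions of this sketch, and it skips the status-code filter):

awk '{ status[$(NF-1)]++ } END { for (s in status) print status[s], s }' /var/log/nginx/access.log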

Changed 3 months ago by anarcat

comment:17 Changed 3 months ago by anarcat

this is how it looks on pantheon's side:


i.e. it went from 15841 to 4957 visits in a day, a drop to about 30% of the previous traffic (a ~70% reduction), which roughly matches the overall hit ratio (70%) we're seeing. that reduction should bump us from the current "extra large" plan (300k visits per month, 1000$/mth) down to "medium" (50k, 300$) or more likely "large" (150k, 600$), because we'd still get 90k visits per month or more - 5k visits per day is almost exactly 150k visits per month, depending on the month.

so if we want to get back to 300$/mth, we need to work on that hit ratio, big time. there are probably a few things we can do to improve it, but that will need more investigation.

comment:18 Changed 3 months ago by anarcat

i tweaked the cache so it keeps inactive entries for longer than the default (10m) and caches 404s and redirects (which isn't done by default).
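in nginx terms that tweak boils down to directives like the following; the times are illustrative, not the values actually deployed:

# raise proxy_cache_path's "inactive" parameter above its 10m default, e.g.:
#   proxy_cache_path ... inactive=24h ...;
# and explicitly allow caching of 404s and redirects:
proxy_cache_valid 404      5m;
proxy_cache_valid 301 302  10m;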

there are pages that return "bad" headers like this one:

anarcat@angela:~(master)$ curl -sSLI https://blog.torproject.org/new-release-tor-browser-90a5 | grep -i cache-control
cache-control: must-revalidate, no-cache, private

that's a problem on the drupal side of things.

comment:19 Changed 3 months ago by anarcat

this change seems to have pushed the hit ratio from 80 to 85% with status codes < 300 and from 70 to 75% for all hits.

next step is to hook this up in prometheus to get proper monitoring of the hit ratio.

Changed 3 months ago by anarcat

comment:20 Changed 3 months ago by anarcat

Integrated the mtail metrics in prometheus, which should record the hit ratio (and other performance information from nginx logs). Cooked up a quick and dirty grafana dashboard to show the hit rate, which is, according to those metrics, now closer to 87%:


very satisfying!

next step is to pipe the logs directly into mtail, instead of writing them to disk first.
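for the record, the counting side of this can be a very small mtail program; the following is a sketch with made-up metric names, not the program actually deployed:

counter cache_requests_total by cache_status

# pick the upstream_cache_status field out of each log line
/ (?P<cache_status>HIT|MISS|EXPIRED|BYPASS|STALE|UPDATING|REVALIDATED|-) [\d.-]+$/ {
  cache_requests_total[$cache_status]++
}

the grafana hit-rate panel is then just the HIT rate divided by the total, e.g. sum(rate(cache_requests_total{cache_status="HIT"}[5m])) / sum(rate(cache_requests_total[5m])) in PromQL.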

comment:21 Changed 2 months ago by anarcat

Description: modified (diff)
Resolution: fixed
Status: merge_ready → closed

i think this is done. i split the remaining work into separate tickets:

  • syslog buffer for hit stats, see #32461
  • convert existing varnish setups into Nginx, see #32462

the docs have been updated as best I could, which about wraps it up:

https://help.torproject.org/tsa/howto/cache/

comment:22 Changed 2 months ago by anarcat

Component: Internal Services/Tor Sysadmin Team → Internal Services/Service - cache

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat
