Opened 3 months ago

Closed 2 months ago

Last modified 2 months ago

#32239 closed task (fixed)

setup a cache frontend for the blog

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Service - cache
Version:
Severity: Normal
Keywords:
Cc: isabela, anarcat, hiro
Actual Points:
Parent ID: #32090
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

design docs in https://help.torproject.org/tsa/howto/cache/

launch checklist:

  1. alternatives listing and comparison (done)
  2. deploy a test virtual machine by hand, say cache-01.tpo (done)
  3. benchmark the different alternatives (done, ATS and nginx comparable)
  4. set up secondary node with Puppet, say cache-02.tpo (done)
  5. validation benchmark against both nodes (done)
  6. lower DNS TTL to 10 minutes, wait an hour (done)
  7. open firewall (done)
  8. lower DNS TTL to 3 minutes (done, around 2019-11-05 16:00:00)
  9. point DNS to caches (done)
  10. raise DNS TTL back to 1h if all goes well (done!)

Post launch tasks:

  1. update documentation to fill in information from template (done)
  2. hit ratio stats into Prometheus, separate ticket? (done, although missing syslog buffer, see #32461)
  3. convert existing varnish setups into Nginx (probably requires Puppet refactoring, see #32462)

Disaster recovery:

  1. flip DNS back to the backend (see the dig sketch below)
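A minimal verification sketch for any of these DNS flips, assuming nothing beyond the stock dig tool and the real blog hostname:

dig +noall +answer blog.torproject.org
# the answer section lists the current target (cache nodes or the backend)
# and the remaining TTL, which is what the launch steps above manipulate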

Child Tickets

Attachments (4)

snap-2019.11.07-17.00.49.png (34.3 KB) - added by anarcat 3 months ago.
snap-2019.11.08-18.13.15.png (31.1 KB) - added by anarcat 3 months ago.
snap-2019.12.09-11.28.37.png (58.5 KB) - added by anarcat 6 weeks ago.
snap-2019.12.09-11.30.12.png (226.7 KB) - added by anarcat 6 weeks ago.


Change History (26)

comment:1 Changed 3 months ago by anarcat

Description: modified (diff)
Status: assigned → accepted

comment:2 Changed 3 months ago by anarcat

Description: modified (diff)

comment:3 Changed 3 months ago by anarcat

TODO:

  • firewall
  • DNS? /etc/hosts?
  • install and config
  • basic hits to see if it works
  • eval benchmark tools
  • do benchmark (from a new test machine?)

comment:4 Changed 3 months ago by anarcat

installed and tested ATS. results are not great: i can't get it to forward the original Host header, which would make switchovers extremely painful and possibly break SEO (which requires a canonical hostname, e.g. blog.torproject.org, which everything redirects to).

comment:5 Changed 3 months ago by anarcat

Description: modified (diff)

the ATS Host header bug was solved (set pristine_host_hdr to 1) and benchmarks were performed; results are in the wiki.
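for reference, that change is a one-liner in ATS's records.config; a minimal sketch, assuming a stock trafficserver install (the reload command may differ per version):

# records.config: pass the client's original Host header through to the origin
CONFIG proxy.config.url_remap.pristine_host_hdr INT 1

# apply without a full restart
traffic_ctl config reload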

also set up nginx and ran benchmarks against it, on a secondary VPS hosted in the Hetzner cloud.

benchmarks seem to show Nginx performance is comparable or faster than ATS.
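the ticket doesn't record the exact benchmark invocation; as an illustration only, a run with ab against one cache node while forcing the blog's canonical Host header could look like this (hostname, request count and concurrency are made up):

ab -n 10000 -c 100 -H "Host: blog.torproject.org" https://cache-01.torproject.org/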

comment:6 Changed 3 months ago by anarcat

i sent the following to TPA to seek approval:

Subject: leaning towards nginx as a cache server
To: "TorProject.org admins" <torproject-admin@…>
Date: Mon, 28 Oct 2019 21:28:07 -0400

Hi!

I've done some benchmarks of ATS vs Nginx, just to confirm Nginx was
comparably fast enough, as I had concerns about that. I had heard
rumours its caching layer was sub-par and wanted to make sure it would
survive the load. It passed the tests with great success.

It's also easier to configure.

You can see my detailed analysis here:

https://help.torproject.org/tsa/howto/cache/

And the steps remaining for the launch are detailed here:

https://trac.torproject.org/projects/tor/ticket/32239

If you have no objections, I would most likely perform the first
production switch tomorrow or wednesday, to see if we can handle the
load.

A.

--
Antoine Beaupré
torproject.org system administration

comment:7 Changed 3 months ago by anarcat

tpa doesn't seem to have objections.

tested that logged-in users don't hit the cache on nginx, so that's good.
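for illustration, one common way to express such a bypass explicitly in nginx looks like this; the SESS pattern for Drupal session cookies is an assumption, not a copy of the deployed config:

# http context: flag requests that carry a session cookie
map $http_cookie $skip_cache {
    default  0;
    ~SESS    1;
}

# server/location context: such requests go to the backend and are not stored
proxy_cache_bypass $skip_cache;
proxy_no_cache     $skip_cache;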

will set up nginx on both nodes next, with this puppet module: https://forge.puppet.com/puppet/nginx

first had to clean up puppet because there were versioning conflicts in the third-party modules since the stdlib upgrade.

comment:8 Changed 3 months ago by anarcat

Description: modified (diff)

new node is up and works, configured with puppet and the new nginx module.

copied the cipher suite from the apache config, but i'm not sure about that, so I opened #32351 to follow up on the suite in apache too.

next step is to deploy on the original node with puppet, run sanity tests against both nodes, then flip the switch. whoohoo!

comment:9 Changed 3 months ago by anarcat

Description: modified (diff)

the original node is now set up with puppet as well. ran into a problem when trying to figure out hit ratios: those stats are only available in nginx's commercial version.

we might need to pipe stuff through mtail to get those metrics in prometheus. in the meantime, maybe we can still launch without those? :/

the TTL is still low, and i am thinking of launching tomorrow if nothing else comes up. i've changed the procedure slightly to *add* the caching servers to the pool instead of replacing the backend completely. that way we have a smoother transition and can fall back more easily if something goes wrong.

comment:10 Changed 3 months ago by anarcat

Description: modified (diff)

forgot that we need to open the firewall

comment:11 Changed 3 months ago by anarcat

Description: modified (diff)

lowered TTL to 3 minutes, getting ready for launch

comment:12 Changed 3 months ago by anarcat

Description: modified (diff)

i've flipped DNS to the new nodes.

comment:13 Changed 3 months ago by anarcat

Description: modified (diff)

i've reverted to the originally planned procedure where we just flip the switch, because it's simpler. i've also set up a cache.tpo alias pointing to the cluster of machines, so we can move other sites in and out of rotation with a single CNAME instead of having to maintain possibly multiple entries for each site.
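in zone-file terms the idea looks roughly like this (illustrative records and placeholder addresses only, not the actual torproject.org zone):

cache  IN A      192.0.2.10   ; cache-01 (placeholder address)
cache  IN A      192.0.2.11   ; cache-02 (placeholder address)
blog   IN CNAME  cache
; moving a site in or out of rotation is then a single CNAME change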

traffic now seems to be flowing into the nodes without noticeable problems. load is negligible:

Load average: 0.03 0.02 0.00

we have space for 12GB of cache on cache-02:

anarcat@cache-02:~$ df -h /var/cache/nginx/
Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/croot   19G  3.0G   15G  18% /

and ~7GB on cache01:

root@cache01:~# df -h /var/cache/nginx/
Filesystem        Size  Used Avail Use% Mounted on
/dev/sda1         9.8G  1.9G  7.4G  21% /

that's probably what we should pay closest attention to, actually, since it's not clear nginx will do the right thing under out-of-disk-space conditions.
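one mitigation is to let nginx cap the cache itself: the cache manager evicts entries once max_size is exceeded. a sketch of the directive, with made-up zone name and sizes:

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=blog:10m
                 max_size=12g inactive=60m use_temp_path=off;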

comment:14 Changed 3 months ago by anarcat

as predicted, we are seeing between 2 and 4 mbit/s of traffic, although that traffic is present on *both* nodes, so it remains to be seen what the actual traffic sum is. we might be a little above double the expected average, but it should still be within acceptable boundaries in terms of costs.

comment:15 Changed 3 months ago by anarcat

Description: modified (diff)
Status: accepted → merge_ready

added monitoring of https://blog.tpo in nagios, to make really sure *that* keeps working.
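the check itself is a standard HTTP(S) probe; something along these lines with the stock monitoring-plugins (the exact command in our nagios config may differ):

/usr/lib/nagios/plugins/check_http -H blog.torproject.org -S -u /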

last remaining step here is to make sure the hit ratio is correct, and update the documentation to fill in the "tutorial" and "howto" sections from the template.

there is also longer-term work here to convert existing varnish instances, probably. updated the summary to cover that.

comment:16 Changed 3 months ago by anarcat

at first glance, our hit ratio is not great. i used the following lnav log format parser:

{   
    "cache" : {
        "title" : "Tor Nginx cache Log Format",
        "description" : "Log format used by the Tor Project for its Nginx cache servers.",
        "url" : "http://example.com/log-format.html",
        "regex" : {
            "basic" : {
                "pattern" : "^(?<client>[\\w\\d:.-]+) - (?<user>[\\w\\d:.-]+) \\[(?<timestamp>\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}(?:\\.\\d+)?(?: [+-]\\d{4}))?\\] \"(?<method>\\w+) (?<path>\\S+) (?<transport>\\S+)\" (?<status_code>\\d+) (?<resp_size>\\d+|-) \"(?<referer>[^\"]+)\" \"-\" (?<upstream_addr>\\S+) (?<upstream_response_time>[\\d.-]+) (?<upstream_cache_status>\\S+) (?<request_time>[\\d.-]+)$"
            }
        },
        "level-field" : "upstream_cache_status",
        "level" : {
            "error" : "MISS",
            "warning" : "BYPASS|EXPIRED|STALE|UPDATING|REVALIDATED",
            "info" : "HIT"
        },
        "value" : {
            "client" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "user" : { "kind" : "string", "identifier" : true },
            "method" : { "kind" : "string", "identifier" : true },
            "path" : { "kind" : "string", "identifier" : true },
            "transport" : { "kind" : "string", "identifier" : true },
            "status_code" : { "kind" : "integer", "identifier" : true },
            "resp_size" : { "kind" : "integer", "identifier" : true },
            "referer" : { "kind" : "string", "identifier" : true },
            "upstream_addr" : { "kind" : "string", "identifier" : true, "collate": "ipaddress" },
            "upstream_response_time" : { "kind" : "float", "identifier" : true },
            "upstream_cache_status" : { "kind" : "string", "identifier" : true },
            "request_time" : { "kind" : "float", "identifier" : true }
        },
        "sample" : [
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:09 +0000] \"GET /blog/feed HTTP/2.0\" 200 483311 \"-\" \"-\" - - HIT 0.373",
                "level" : "info"
            },
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /new-release-tor-browser-90 HTTP/2.0\" 200 26764 \"https://www.google.com/\" \"-\" [2620:12a:8000::2]:443 0.416 MISS 0.420",          
                "level" : "error"
            },
            {   
                "line" : "0.0.0.0 - - [07/Nov/2019:21:28:10 +0000] \"GET /robots.txt HTTP/1.1\" 200 1594 \"-\" \"-\" [2620:12a:8001::2]:443 0.020 MISS 0.020",
                "level" : "error"
            }
        ]
    }
}

and then ran lnav on the access logfile and executed the following query (after hitting the ; key):

select count(*), upstream_cache_status from logline where status_code < 300 group by upstream_cache_status;

which gave the following results:

count(*)  upstream_cache_status
       6  -
       7  EXPIRED
    2925  HIT
     680  MISS
       5  REVALIDATED

That's an 80% hit ratio, and that's when ignoring any status code above 2xx. If those are included, our hit ratio is around 72%.

So there is still some work to be done to improve the hit ratio.
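for a quick check without lnav, the same breakdown can be pulled from the raw access log with awk; in the format above the cache status is the second-to-last field (the field position and the log path are assumptions of this sketch, and it skips the status-code filter):

awk '{ status[$(NF-1)]++ } END { for (s in status) print status[s], s }' /var/log/nginx/access.log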

Changed 3 months ago by anarcat

comment:17 Changed 3 months ago by anarcat

this is how it looks on pantheon's side:


i.e. it went from 15841 to 4957 visits in a day, a drop to about 30% of the previous traffic (a ~70% reduction), which roughly matches the overall hit ratio (70%) we're seeing. that reduction should bump us from the current "extra large" plan (300k visits per month, 1000$/mth) down to "medium" (50k, 300$) or more likely "large" (150k, 600$), because we'd still get 90k visits per month or more - 5k visits per day is almost exactly 150k visits per month, depending on the month.

so if we want to get back to 300$/mth, we need to work on that hit ratio, big time. there are probably a few things we can do to improve it, but that will need more investigation.

comment:18 Changed 3 months ago by anarcat

i tweaked the cache so it keeps inactive entries for longer than the default (10m) and caches 404s and redirects (which isn't done by default).
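in nginx terms that tweak boils down to directives like the following; the times are illustrative, not the values actually deployed:

# raise proxy_cache_path's "inactive" parameter above its 10m default, e.g.:
#   proxy_cache_path ... inactive=24h ...;
# and explicitly allow caching of 404s and redirects:
proxy_cache_valid 404      5m;
proxy_cache_valid 301 302  10m;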

there are pages that return "bad" headers like this one:

anarcat@angela:~(master)$ curl -sSLI https://blog.torproject.org/new-release-tor-browser-90a5 | grep -i cache-control
cache-control: must-revalidate, no-cache, private

that's a problem on the drupal side of things.

comment:19 Changed 3 months ago by anarcat

this change seems to have pushed the hit ratio from 80 to 85% with status codes < 300 and from 70 to 75% for all hits.

next step is to hook this up in prometheus to get proper monitoring of the hit ratio.

Changed 3 months ago by anarcat

comment:20 Changed 3 months ago by anarcat

Integrated the mtail metrics in prometheus, which should record the hit ratio (and other performance information from nginx logs). Cooked up a quick and dirty grafana dashboard to show the hit rate, which is, according to those metrics, now closer to 87%:


very satisfying!

next step is to pipe the logs directly into mtail, instead of writing them to disk first.
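for the record, the counting side of this can be a very small mtail program; the following is a sketch with made-up metric names, not the program actually deployed:

counter cache_requests_total by cache_status

# pick the upstream_cache_status field out of each log line
/ (?P<cache_status>HIT|MISS|EXPIRED|BYPASS|STALE|UPDATING|REVALIDATED|-) [\d.-]+$/ {
  cache_requests_total[$cache_status]++
}

the grafana hit-rate panel is then just the HIT rate divided by the total, e.g. sum(rate(cache_requests_total{cache_status="HIT"}[5m])) / sum(rate(cache_requests_total[5m])) in PromQL.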

comment:21 Changed 2 months ago by anarcat

Description: modified (diff)
Resolution: fixed
Status: merge_ready → closed

i think this is done. i split the remaining work into separate tickets:

  • syslog buffer for hit stats, see #32461
  • convert existing varnish setups into Nginx, see #32462

the docs have been updated as best I could, which about wraps it up:

https://help.torproject.org/tsa/howto/cache/

comment:22 Changed 2 months ago by anarcat

Component: Internal Services/Tor Sysadmin Team → Internal Services/Service - cache

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat
