Opened 7 weeks ago

Closed 6 weeks ago

Last modified 5 weeks ago

#32660 closed defect (fixed)

onionoo-backend is killing the ganeti cluster

Reported by: anarcat Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

hello!

today i noticed that, since last friday (UTC) morning, there have been pretty big spikes on the internal network between the ganeti nodes, every hour. it looks like this in grafana:


We can clearly see a correlation between the two nodes' traffic, in reverse (one node's transmit matches the other's receive). This was confirmed using iftop and tcpdump on the nodes during a surge.

It seems this is due to onionoo-backend-01 blasting the disk and CPU for some reason. These are the disk I/O graphs for that host, which correlate pretty cleanly with the graphs above:


This was confirmed by an inspection of DRBD, the mechanism that synchronizes the disks across the network. It seems there's a huge surge of writes on the network every hour, lasting anywhere between 20 and 30 minutes. This was (somewhat) confirmed by running:

watch -n 0.1 -d cat /proc/drbd

on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in DRBD. 13 and 17 are the web nodes, so that's expected - probably log writes? But device ID 4 is onionoo-backend, which is what led me to the big traffic graph.

could someone from metrics investigate?

can i just turn off this machine altogether, considering it's basically trying to murder the cluster every hour? :)

Child Tickets

Attachments (8)

snap-2019.12.02-16.06.11.png (93.4 KB) - added by anarcat 7 weeks ago.
snap-2019.12.02-16.30.33.png (27.9 KB) - added by anarcat 7 weeks ago.
snap-2019.12.04-17.14.49.png (143.6 KB) - added by anarcat 6 weeks ago.
snap-2019.12.04-17.23.36.png (138.8 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.36.33.png (248.7 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.38.52.png (158.5 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.43.28.png (167.6 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.49.27.png (108.9 KB) - added by anarcat 6 weeks ago.


Change History (26)

Changed 7 weeks ago by anarcat

Changed 7 weeks ago by anarcat

comment:1 Changed 7 weeks ago by anarcat

Description: modified (diff)

attach screenshots and further explanations.

the TL;DR here is: can i shut down this backend?

comment:2 Changed 7 weeks ago by karsten

Let's first find out what's happening there. We were planning to stop this instance this afternoon and set up a new one on the same host. If we don't know what's going wrong, we might see the same issue with the new instance.

So, this seems like something caused by the hourly updater. Can you tell us whether omeiense and/or oo-hetzner-03 show similar loads at roughly the same times?

comment:3 Changed 7 weeks ago by irl

I have suspended the hourly updater on this host, but this is the normal expected operation for Onionoo's hourly updater. There are state files that get updated every run including those for which nothing has changed. This is a known flaw in Onionoo but until now it hasn't been a problem.

karsten is going to look at how difficult it would be to reduce the number of writes performed. Is the problem total IO or is it just the writes? Are reads cached? i.e. if we read it again to compare before writing, does that help?

There's very little we can do about the CPU load. We already use optimized parsing libraries for JSON, and quite simple parsers for Tor descriptors. Metrics does involve some computation. If CPU load is a problem then perhaps the Ganeti cluster is the wrong place for Onionoo to live and we need something else.

comment:4 Changed 7 weeks ago by anarcat

Let's first find out what's happening there. We were planning to stop this instance this afternoon and set up a new one on the same host. If we don't know what's going wrong, we might see the same issue with the new instance.

So, this seems like something caused by the hourly updater. Can you tell us whether omeiense and/or oo-hetzner-03 show similar loads at roughly the same times?

They do look similar, now that you mention it:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100

The peak bandwidth usage is bigger on onionoo-backend-01, but that might just be because the disks are faster; the peak is bigger but shorter, so the transfer size is probably equivalent.

I have suspended the hourly updater on this host, but this is the normal expected operation for Onionoo's hourly updater. There are state files that get updated every run including those for which nothing has changed. This is a known flaw in Onionoo but until now it hasn't been a problem.

Okay, maybe I'm being overly cautious then. It might be a problem in the mid-to-long term in the new cluster because of the way it is structured: all writes are replicated between the two servers, because they act as a redundant cluster. If one node goes down, the other can take over on the fly. It also allows us to migrate the machines between the two servers more or less in real time.

Maybe we should make an exception for this host, and keep the data only on one server. This would have a few implications:

  1. if the server goes down, we need to restore from backups, so there's up to 24h of data loss
  2. when we reboot the servers, the machine will go down for the duration of the reboot
  3. moving the machine around if we decommission the server will require manual work

karsten is going to look at how difficult it would be to reduce the number of writes performed. Is the problem total IO or is it just the writes? Are reads cached? i.e. if we read it again to compare before writing, does that help?

A napkin calculation tells me we're writing about 50GiB of data to disk every hour. That seems like a *lot*!

(I base this on the graphs, which seem to average about 36MiB/s for 35 minutes on onionoo-backend-01, which means around 56GiB. oo-hetzner-03 writes 21MiB/s for 35 minutes, which means about 46GiB. The average between the two is about 50GiB.)

Is that about right? What *are* you writing in there? :)
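For reference, napkin math like the above boils down to rate × duration. A throwaway sketch of that estimate (the 25 MiB/s figure below is a hypothetical round number for illustration, not a measurement from the graphs):

```java
public class NapkinMath {

    // Total data written, in GiB, at a sustained rate (MiB/s) over a
    // duration (minutes): rate * seconds, converted from MiB to GiB.
    static double gibWritten(double mibPerSecond, double minutes) {
        return mibPerSecond * minutes * 60.0 / 1024.0;
    }

    public static void main(String[] args) {
        // Hypothetical: a sustained 25 MiB/s for a 35-minute run.
        System.out.printf("~%.1f GiB per run%n", gibWritten(25, 35)); // prints ~51.3 GiB per run
    }
}
```

So even a modest sustained write rate adds up to tens of GiB per hourly run, which is why the replicated DRBD traffic is so visible.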

There's very little we can do about the CPU load. We already use optimized parsing libraries for JSON, and quite simple parsers for Tor descriptors. Metrics does involve some computation. If CPU load is a problem then perhaps the Ganeti cluster is the wrong place for Onionoo to live and we need something else.

I don't mind the CPU load so much, actually; we have some capacity there. And we do have capacity on the network too - it's a gigabit link, after all. It's just that this single node is already taking 10% of that capacity during those peaks, so I was worried it was an anomaly.

But maybe this is much ado about nothing. It just seems strange that we write all that data all the time...

comment:5 Changed 6 weeks ago by karsten

Nah, we're not writing 50 GiB per hour. Here are some statistics:

2019-12-04 10:06:30,905 INFO o.t.m.o.cron.Main:216 Document store
    7 list operations performed
    1,859,665 files listed
    1,867,140 files stored
    2.0 GiB stored
    1,874,298 files retrieved
    2.5 GiB retrieved
    0 files removed

I'm currently looking into avoiding some of those writes by comparing file digests before writing. Will post results here once I have them.

comment:6 Changed 6 weeks ago by anarcat

okay, maybe there's something wrong with my math then... i'll go back to my homework, but i really wonder where that discrepancy is coming from... could there be other writes that are not covered by those stats?

we don't seem to be hitting the swap file, so that shouldn't be it...

Changed 6 weeks ago by anarcat

comment:7 Changed 6 weeks ago by anarcat

i had trouble regenerating the report that gave me the 50GiB figure today, so here's a more direct link:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100&from=1575328800000&to=1575331800000

and here's a screenshot


Here you can clearly see all three servers (from left to right: omeiense, onionoo-backend-01, oo-hetzner-03) almost maxing out their disks for a significant amount of time. the older backends (omeiense and hetzner) can barely finish in time for the next job: they both took 47 minutes to write. the new backend is faster and finishes in a little over 20 minutes, but they all sit above 50% disk utilization, up to 100% for the rightmost one. they write between 10 and 40MiB/s, if I read those graphs right (and if we can trust those stats).

i'm still learning how to do Prometheus queries, so maybe i'm not doing this right, but this query:

increase(node_disk_written_bytes_total{instance=~'omeiense\\.torproject\\.org:9100|onionoo-backend-01\\.torproject\\.org:9100|oo-hetzner-03\\.torproject\\.org:9100'}[1h])

seems to say the servers write between 35GB (omeiense) and 58GB (hetzner) every hour:


i'll try to think of another tracing tool i could fire up to keep track of who's writing what, but i admit i'm kind of running out of ideas. ;)

Last edited 6 weeks ago by anarcat (previous) (diff)

Changed 6 weeks ago by anarcat

comment:8 Changed 6 weeks ago by karsten

Here's an attempt to reduce writes:

2019-12-05 09:45:45,384 INFO o.t.m.o.cron.Main:216 Document store
    7 list operations performed
    951,149 files listed
    56,448 files stored
    1.4 GiB stored
    912,471 files not rewritten   <--
    641.3 MiB not rewritten       <--
    983,247 files retrieved
    2.8 GiB retrieved
    0 files removed

The idea is to read existing files before writing them, compute their digests, compute digests of the content to be written, and then abort the write if nothing has changed.

It looks like this saves us from overwriting a lot of unchanged files. It doesn't save as many bytes, but that might be secondary. Of course, this patch comes at the price of doing more reads and digest computations. It might be worth it, just saying that it's not free.
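The idea described above could be sketched roughly like this. This is a simplified standalone illustration, not the actual Onionoo patch; the class and method names are made up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestCompareWriter {

    /**
     * Writes newContent to file only if it differs from what is already
     * on disk, comparing SHA-256 digests. Returns true if a write happened.
     */
    static boolean storeIfChanged(Path file, String newContent)
            throws IOException, NoSuchAlgorithmException {
        byte[] newBytes = newContent.getBytes(StandardCharsets.UTF_8);
        if (Files.exists(file)) {
            byte[] oldDigest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(file));
            byte[] newDigest = MessageDigest.getInstance("SHA-256")
                    .digest(newBytes);
            if (Arrays.equals(oldDigest, newDigest)) {
                return false; // unchanged: skip the write, save the replicated I/O
            }
        }
        Files.write(file, newBytes);
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path state = Files.createTempFile("state", ".json");
        System.out.println(storeIfChanged(state, "{\"running\":true}"));  // true: differs from the empty temp file
        System.out.println(storeIfChanged(state, "{\"running\":true}"));  // false: identical content, write skipped
        System.out.println(storeIfChanged(state, "{\"running\":false}")); // true: content changed
        Files.delete(state);
    }
}
```

The tradeoff mentioned above is visible here: every store now costs an extra read plus two digest computations, which only pays off when most files turn out to be unchanged.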

I'll bring this patch up at today's team meeting. We should discuss this and possibly merge it before doing new deployments on onionoo-backend-0[12].

And then we could look at those graphs once more to see if we need to tweak things more. Thanks for making these graphs!

comment:9 Changed 6 weeks ago by karsten

Status: new → needs_review

comment:10 Changed 6 weeks ago by anarcat

before we start doing optimization tradeoffs, especially ones involving more CPU time, I think we should diagnose the problem further. running checksums can take a while and might impose extra load on the server... couldn't we use timestamps or some faster mechanism?

we should also diagnose why our numbers differ. there might be another process writing all that data in there...

someone(tm) should run iotop during the peak periods to figure that out. :) i'll see if i can do that soon.

comment:11 in reply to:  9 Changed 6 weeks ago by irl

Status: needs_review → merge_ready

Replying to anarcat:

before we start doing optimization tradeoffs, especially ones involving more CPU time, I think we should diagnose the problem further. running checksums can take a while and might impose extra load on the server... couldn't we use timestamps or some faster mechanism?

I don't know enough about XFS to know if filesystem attributes can be trusted.

Replying to karsten:

irl, please review commit 4b7b13d in my task-32660 branch.

The change looks good, if we decide we want to go this route. The JVM should use the Intel SHA extensions on recent processors, so I don't think we're going to see a huge increase in CPU load.

comment:12 Changed 6 weeks ago by anarcat

I don't know enough about XFS to know if filesystem attributes can be trusted.

I would be very surprised if they can't, unless you're doing something funky with those. That said, filesystems, in general, can hardly be trusted :p

The change looks good, if we decide we want to go this route. The JVM should use the Intel SHA extensions on recent processors, so I don't think we're going to see a huge increase in CPU load.

Okay.

What do you think about the discrepancy between the estimated writes onionoo self-reports and what Prometheus collects?

comment:13 Changed 6 weeks ago by karsten

Oh, I'm not concerned about increase in CPU load. I used the same code to compute SHA-256 file digests for #31204 where doing this for 643M of data took merely 5 seconds.
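A digest-speed figure like that is easy to sanity-check with a micro-benchmark along these lines (a rough sketch: the zero-filled input and 1 MiB chunk size are arbitrary choices, and actual timings will vary by CPU):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestBench {

    // SHA-256 over `mib` MiB of zero bytes, fed in 1 MiB chunks.
    static byte[] sha256OfZeros(int mib) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] chunk = new byte[1 << 20]; // 1 MiB of zeros
        for (int i = 0; i < mib; i++) {
            md.update(chunk);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        sha256OfZeros(643); // roughly the data volume quoted above
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("SHA-256 over 643 MiB took %.2f s%n", seconds);
    }
}
```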

Other mechanisms like timestamps don't work here, because we're actually producing new content for these files and only then learn that previously written files already have the same content.

We could do something smarter by detecting whether something has changed in the underlying data that we're formatting and writing, but that requires more development time than we currently have available.

I'll deploy this change now, and then we can see how much that bought us. If this doesn't fully solve the issue, which might well be the case, we can think more about that discrepancy between the two numbers.

comment:14 Changed 6 weeks ago by karsten

Patch is merged, released, and deployed on both backends. Hourly updates are much faster now. Does this show in the graphs?

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

comment:15 Changed 6 weeks ago by anarcat

Description: modified (diff)
Resolution: fixed
Status: merge_ready → closed

wow, that *is* a huge improvement! check this out:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=1575563766753&to=1575650166753&var-node=omeiense.torproject.org:9100&var-node=oo-hetzner-03.torproject.org

in particular:


large reduction in CPU and memory usage, significant reduction in load!


also a *dramatic* reduction in disk utilization! in particular, all that writing was significantly reduced... but what i find the most interesting is this:


i.e. we write less, but we don't read more! even though we're computing all those checksums, we don't impose extra load on the disks from that reading, which is one thing I was worried about.

but even if we did read more (which we don't), it would still be a worthwhile tradeoff because (1) we can cache those reads and (2) we (obviously) don't need to replicate reads across the cluster.

i can't confirm the effect on the actual ganeti cluster because irl (thankfully! :) has turned off those jobs on onionoo-backend-01, but i'm now confident the cluster will be happier with this work if/when we turn it back on.

thank you so much for taking the extra time to fix this and take care of our hardware. sometimes it's easier to throw hardware at a problem, but this seemed like a case where we could improve our algos a little, and I'm glad it worked out. :)

all in all, i think this can be marked as fixed, at least it is for me. i'll let other tickets speak for the rest of the work on this onionoo stuff. from what i understand, there needs to be extra work to bring that other backend online (or build a new one?) but i'll let you folks figure out the next steps. :)

do ping me if you need help on that!

cheers, and thanks again!

comment:16 Changed 6 weeks ago by karsten

Thanks for checking, and glad to hear that things worked out this well! We'll take care of coordinating new backend instances. Thanks again!

comment:17 Changed 5 weeks ago by irl

Is this code released now? If so, we should update metrics-cloud with this version, ready for deployment in January. A second backend host is requested in #32763.

comment:18 in reply to:  17 Changed 5 weeks ago by karsten

Replying to irl:

Is this code released now? If so, we should update metrics-cloud with this version, ready for deployment in January. A second backend host is requested in #32763.

Yes, this code is released, see #32684. Deploying this version to two backends in January sounds fine to me!
