Opened 7 weeks ago

Closed 6 weeks ago

Last modified 5 weeks ago

#32660 closed defect (fixed)

onionoo-backend is killing the ganeti cluster

Reported by: anarcat Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

hello!

today i noticed that, since last friday (UTC) morning, there have been pretty big spikes on the internal network between the ganeti nodes, every hour. it looks like this in grafana:


We can clearly see a correlation between the two nodes' traffic, in reverse (one node's transmit matches the other's receive). This was confirmed using iftop and tcpdump on the nodes during a surge.

It seems this is due to onionoo-backend-01 blasting the disk and CPU for some reason. These are the disk I/O graphs for that host, which correlate pretty cleanly with the graphs above:


This was confirmed by an inspection of DRBD, the mechanism that synchronizes the disks across the network. It seems there's a huge surge of writes on the network every hour, lasting anywhere between 20 and 30 minutes. This was (somewhat) confirmed by running:

watch -n 0.1 -d cat /proc/drbd

on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in DRBD. 13 and 17 are the web nodes, so that's expected - probably log writes? But device ID 4 is onionoo-backend, which is what led me to the big traffic graph.

could someone from metrics investigate?

can i just turn off this machine altogether, considering it's basically trying to murder the cluster every hour? :)

Child Tickets

Attachments (8)

snap-2019.12.02-16.06.11.png (93.4 KB) - added by anarcat 7 weeks ago.
snap-2019.12.02-16.30.33.png (27.9 KB) - added by anarcat 7 weeks ago.
snap-2019.12.04-17.14.49.png (143.6 KB) - added by anarcat 6 weeks ago.
snap-2019.12.04-17.23.36.png (138.8 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.36.33.png (248.7 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.38.52.png (158.5 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.43.28.png (167.6 KB) - added by anarcat 6 weeks ago.
snap-2019.12.06-11.49.27.png (108.9 KB) - added by anarcat 6 weeks ago.


Change History (26)

Changed 7 weeks ago by anarcat

Changed 7 weeks ago by anarcat

comment:1 Changed 7 weeks ago by anarcat

Description: modified (diff)

attach screenshots and further explanations.

the TL;DR here is: can i shut down this backend?

comment:2 Changed 7 weeks ago by karsten

Let's first find out what's happening there. We were planning to stop this instance this afternoon and set up a new one on the same host. If we don't know what's going wrong, we might see the same issue with the new instance.

So, this seems like something caused by the hourly updater. Can you tell us whether omeiense and/or oo-hetzner-03 show similar loads at roughly the same times?

comment:3 Changed 7 weeks ago by irl

I have suspended the hourly updater on this host, but this is the normal expected operation for Onionoo's hourly updater. There are state files that get updated every run including those for which nothing has changed. This is a known flaw in Onionoo but until now it hasn't been a problem.

karsten is going to look at how difficult it would be to reduce the number of writes performed. Is the problem total IO or is it just the writes? Are reads cached? i.e. if we read it again to compare before writing, does that help?

There's very little we can do about the CPU load. We already use optimized parsing libraries for JSON, and quite simple parsers for Tor descriptors. Metrics does involve some computation. If CPU load is a problem then perhaps the Ganeti cluster is the wrong place for Onionoo to live and we need something else.

comment:4 Changed 7 weeks ago by anarcat

Let's first find out what's happening there. We were planning to stop this instance this afternoon and set up a new one on the same host. If we don't know what's going wrong, we might see the same issue with the new instance.

So, this seems like something caused by the hourly updater. Can you tell us whether omeiense and/or oo-hetzner-03 show similar loads at roughly the same times?

They do look similar, now that you mention it:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100

The peak bandwidth usage is bigger on onionoo-backend-01, but that might just be because the disks are faster; the peak is bigger but shorter, so the transfer size is probably equivalent.

I have suspended the hourly updater on this host, but this is the normal expected operation for Onionoo's hourly updater. There are state files that get updated every run including those for which nothing has changed. This is a known flaw in Onionoo but until now it hasn't been a problem.

Okay, maybe I'm being overly cautious then. It might be a problem in the mid-to-long term in the new cluster because of the way it is structured: all writes are replicated between the two servers, because they act as a redundant cluster. If one node goes down, the other can take over on the fly. It also allows us to migrate the machines between the two servers more or less in real time.

Maybe we should make an exception for this host, and keep the data only on one server. This would have a few implications:

  1. if the server goes down, we need to restore from backups, so there's up to 24h of data loss
  2. when we reboot the servers, the machine will go down for the duration of the reboot
  3. moving the machine around if we decommission the server will require manual work

karsten is going to look at how difficult it would be to reduce the number of writes performed. Is the problem total IO or is it just the writes? Are reads cached? i.e. if we read it again to compare before writing, does that help?

A napkin calculation tells me we're writing about 50GiB of data to disk every hour. That seems like a *lot*!

(I base this on the graphs, which seem to average about 36MiB/s for 35 minutes on onionoo-backend-01, which means around 56GiB. oo-hetzner-03 writes 21MiB/s for 35 minutes, which means about 46GiB. The average between the two is about 50GiB.)

Is that about right? What *are* you writing in there? :)
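For reference, napkin math like the above boils down to rate × duration. A throwaway sketch of that estimate (the 25 MiB/s figure below is a hypothetical round number for illustration, not a measurement from the graphs):

```java
public class NapkinMath {

    // Total data written, in GiB, at a sustained rate (MiB/s) over a
    // duration (minutes): rate * seconds, converted from MiB to GiB.
    static double gibWritten(double mibPerSecond, double minutes) {
        return mibPerSecond * minutes * 60.0 / 1024.0;
    }

    public static void main(String[] args) {
        // Hypothetical: a sustained 25 MiB/s for a 35-minute run.
        System.out.printf("~%.1f GiB per run%n", gibWritten(25, 35)); // prints ~51.3 GiB per run
    }
}
```

So even a modest sustained write rate adds up to tens of GiB per hourly run, which is why the replicated DRBD traffic is so visible.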

There's very little we can do about the CPU load. We already use optimized parsing libraries for JSON, and quite simple parsers for Tor descriptors. Metrics does involve some computation. If CPU load is a problem then perhaps the Ganeti cluster is the wrong place for Onionoo to live and we need something else.

I don't mind the CPU load so much, actually; we have some capacity there. And we do have capacity on the network too - it's a gigabit link, after all. It's just that this single node is already taking 10% of that capacity during those peaks, so I was worried it was an anomaly.

But maybe this is much ado about nothing. It just seems strange that we write all that data all the time...

comment:5 Changed 6 weeks ago by karsten

Nah, we're not writing 50 GiB per hour. Here are some statistics:

2019-12-04 10:06:30,905 INFO o.t.m.o.cron.Main:216 Document store
    7 list operations performed
    1,859,665 files listed
    1,867,140 files stored
    2.0 GiB stored
    1,874,298 files retrieved
    2.5 GiB retrieved
    0 files removed

I'm currently looking into avoiding some of those writes by comparing file digests before writing. Will post results here once I have them.

comment:6 Changed 6 weeks ago by anarcat

okay, maybe there's something wrong with my math then... i'll go back to my homework, but i really wonder where that discrepancy is coming from... could there be other writes that are not covered by those stats?

we don't seem to be hitting the swap file, so that shouldn't be it...

Changed 6 weeks ago by anarcat

comment:7 Changed 6 weeks ago by anarcat

i had trouble regenerating the report that gave me the 50GiB figure today, so here's a more direct link:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100&from=1575328800000&to=1575331800000

and here's a screenshot


Here you can clearly see all three servers (from left to right: omeiense, onionoo-backend-01, oo-hetzner-03) almost maxing out their disks for a significant amount of time. the older backends (omeiense and hetzner) can barely finish in time for the next job: they both took 47 minutes to write. the new backend is faster and finishes in a little over 20 minutes, but they all sit above 50% disk utilization, up to 100% for the rightmost one. they write between 10 and 40MiB/s, if I read those graphs right (and if we can trust those stats).

i'm still learning how to do Prometheus queries, so maybe i'm not doing this right, but this query:

increase(node_disk_written_bytes_total{instance=~'omeiense\\.torproject\\.org:9100|onionoo-backend-01\\.torproject\\.org:9100|oo-hetzner-03\\.torproject\\.org:9100'}[1h])

seems to say the servers write between 35GB (omeiense) and 58GB (hetzner) every hour:


i'll try to think of another tracing tool i could fire up to keep track of who's writing what, but i admit i'm kind of running out of ideas. ;)

Last edited 6 weeks ago by anarcat (previous) (diff)

Changed 6 weeks ago by anarcat

comment:8 Changed 6 weeks ago by karsten

Here's an attempt to reduce writes:

2019-12-05 09:45:45,384 INFO o.t.m.o.cron.Main:216 Document store
    7 list operations performed
    951,149 files listed
    56,448 files stored
    1.4 GiB stored
    912,471 files not rewritten   <--
    641.3 MiB not rewritten       <--
    983,247 files retrieved
    2.8 GiB retrieved
    0 files removed

The idea is to read existing files before writing them, compute their digests, compute digests of the content to be written, and then abort the write if nothing has changed.

It looks like this saves us from overwriting a lot of unchanged files. It doesn't save as many bytes, but that might be secondary. Of course, this patch comes at the price of doing more reads and digest computations. It might be worth it, just saying that it's not free.
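The idea described above could be sketched roughly like this. This is a simplified standalone illustration, not the actual Onionoo patch; the class and method names are made up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestCompareWriter {

    /**
     * Writes newContent to file only if it differs from what is already
     * on disk, comparing SHA-256 digests. Returns true if a write happened.
     */
    static boolean storeIfChanged(Path file, String newContent)
            throws IOException, NoSuchAlgorithmException {
        byte[] newBytes = newContent.getBytes(StandardCharsets.UTF_8);
        if (Files.exists(file)) {
            byte[] oldDigest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(file));
            byte[] newDigest = MessageDigest.getInstance("SHA-256")
                    .digest(newBytes);
            if (Arrays.equals(oldDigest, newDigest)) {
                return false; // unchanged: skip the write, save the replicated I/O
            }
        }
        Files.write(file, newBytes);
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path state = Files.createTempFile("state", ".json");
        System.out.println(storeIfChanged(state, "{\"running\":true}"));  // true: differs from the empty temp file
        System.out.println(storeIfChanged(state, "{\"running\":true}"));  // false: identical content, write skipped
        System.out.println(storeIfChanged(state, "{\"running\":false}")); // true: content changed
        Files.delete(state);
    }
}
```

The tradeoff mentioned above is visible here: every store now costs an extra read plus two digest computations, which only pays off when most files turn out to be unchanged.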

I'll bring this patch up at today's team meeting. We should discuss this and possibly merge it before doing new deployments on onionoo-backend-0[12].

And then we could look at those graphs once more to see if we need to tweak things more. Thanks for making these graphs!

comment:9 Changed 6 weeks ago by karsten

Status: new → needs_review

comment:10 Changed 6 weeks ago by anarcat

before we start doing optimization tradeoffs, especially ones involving more CPU time, I think we should diagnose the problem further. running checksums can take a while and might impose extra load on the server... couldn't we use timestamps or some faster mechanism?

we should also diagnose why our numbers differ. there might be another process writing all that data in there...

someone(tm) should run iotop during the peak periods to figure that out. :) i'll see if i can do that soon.

comment:11 in reply to:  9 Changed 6 weeks ago by irl

Status: needs_review → merge_ready

Replying to anarcat:

before we start doing optimization tradeoffs, especially ones involving more CPU time, I think we should diagnose the problem further. running checksums can take a while and might impose extra load on the server... couldn't we use timestamps or some faster mechanism?

I don't know enough about XFS to know if filesystem attributes can be trusted.

Replying to karsten:

irl, please review commit 4b7b13d in my task-32660 branch.

The change looks good, if we decide we want to go this route. The JVM should use the Intel SHA extensions on recent processors, so I don't think we're going to see a huge increase in CPU load.

comment:12 Changed 6 weeks ago by anarcat

I don't know enough about XFS to know if filesystem attributes can be trusted.

I would be very surprised if they can't, unless you're doing something funky with those. That said, filesystems, in general, can hardly be trusted :p

The change looks good, if we decide we want to go this route. The JVM should use the Intel SHA extensions on recent processors, so I don't think we're going to see a huge increase in CPU load.

Okay.

What do you think about the discrepancy between the estimated writes onionoo self-reports and what Prometheus collects?

comment:13 Changed 6 weeks ago by karsten

Oh, I'm not concerned about increase in CPU load. I used the same code to compute SHA-256 file digests for #31204 where doing this for 643M of data took merely 5 seconds.
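A digest-speed figure like that is easy to sanity-check with a micro-benchmark along these lines (a rough sketch: the zero-filled input and 1 MiB chunk size are arbitrary choices, and actual timings will vary by CPU):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestBench {

    // SHA-256 over `mib` MiB of zero bytes, fed in 1 MiB chunks.
    static byte[] sha256OfZeros(int mib) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] chunk = new byte[1 << 20]; // 1 MiB of zeros
        for (int i = 0; i < mib; i++) {
            md.update(chunk);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        sha256OfZeros(643); // roughly the data volume quoted above
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("SHA-256 over 643 MiB took %.2f s%n", seconds);
    }
}
```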

Other mechanisms like timestamps don't work here, because we're actually producing new content for these files and only then learn that previously written files already have the same content.

We could do something smarter by detecting whether something has changed in the underlying data that we're formatting and writing, but that requires more development time than we currently have available.

I'll deploy this change now, and then we can see how much that bought us. If this doesn't fully solve the issue, which might well be the case, we can think more about that discrepancy between the two numbers.

comment:14 Changed 6 weeks ago by karsten

Patch is merged, released, and deployed on both backends. Hourly updates are much faster now. Does this show in the graphs?

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

Changed 6 weeks ago by anarcat

comment:15 Changed 6 weeks ago by anarcat

Description: modified (diff)
Resolution: fixed
Status: merge_ready → closed

wow, that *is* a huge improvement! check this out:

https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=1575563766753&to=1575650166753&var-node=omeiense.torproject.org:9100&var-node=oo-hetzner-03.torproject.org

in particular:


large reduction in CPU and memory usage, significant reduction in load!


also a *dramatic* reduction in disk utilization! in particular, all that writing was significantly reduced... but what i find the most interesting is this:


i.e. we write less, but we don't read more! even though we're computing all those checksums, we don't impose extra load on the disks from that reading, which is one thing I was worried about.

but even if we did read more (which we don't), it would still be a worthwhile tradeoff because (1) we can cache those reads and (2) we (obviously) don't need to replicate reads across the cluster.

i can't confirm the effect on the actual ganeti cluster because irl (thankfully! :) has turned off those jobs on onionoo-backend-01, but i'm now confident the cluster will be happier with this work if/when we turn it back on.

thank you so much for taking the extra time to fix this and take care of our hardware. sometimes it's easier to throw hardware at a problem, but this seemed like a case where we could improve our algos a little, and I'm glad it worked out. :)

all in all, i think this can be marked as fixed, at least it is for me. i'll let other tickets speak for the rest of the work on this onionoo stuff. from what i understand, there needs to be extra work to bring that other backend online (or build a new one?) but i'll let you folks figure out the next steps. :)

do ping me if you need help on that!

cheers, and thanks again!

comment:16 Changed 6 weeks ago by karsten

Thanks for checking, and glad to hear that things worked out this well! We'll take care of coordinating new backend instances. Thanks again!

comment:17 Changed 5 weeks ago by irl

Is this code released now? If so, we should update metrics-cloud with this version, ready for deployment in January. A second backend host is requested in #32763.

comment:18 in reply to:  17 Changed 5 weeks ago by karsten

Replying to irl:

Is this code released now? If so, we should update metrics-cloud with this version, ready for deployment in January. A second backend host is requested in #32763.

Yes, this code is released, see #32684. Deploying this version to two backends in January sounds fine to me!
