Opened 2 years ago

Last modified 12 months ago

#19249 assigned defect

Onionoo server runs out of memory when importing a full month of data

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords: metrics-2018
Cc: iwakeh Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I had to re-import all of May on the Onionoo mirror because it was offline for more than three days. Now it's running out of memory in the shut-down process. Logs and exception below:

2016-06-01 09:30:33,944 INFO o.t.o.cron.Main:92 Going to run one-time updater ... 
2016-06-01 09:30:34,002 INFO o.t.o.cron.Main:130 Initializing.
2016-06-01 09:30:34,005 INFO o.t.o.cron.Main:133 Acquired lock
2016-06-01 09:30:34,005 DEBUG o.t.o.cron.Main:152 Started update ...
2016-06-01 09:30:34,007 INFO o.t.o.cron.Main:155 Initialized descriptor source
2016-06-01 09:30:34,012 INFO o.t.o.cron.Main:159 Initialized document store
2016-06-01 09:30:34,029 INFO o.t.o.cron.Main:163 Initialized status update runner
2016-06-01 09:30:34,040 INFO o.t.o.cron.Main:168 Initialized document writer runner
2016-06-01 09:30:34,041 INFO o.t.o.cron.Main:176 Downloading descriptors.
2016-06-01 09:30:34,041 INFO o.t.o.u.DescriptorSource:64 Loading: RELAY_CONSENSUSES
2016-06-01 09:33:02,861 INFO o.t.o.u.DescriptorSource:64 Loading: RELAY_SERVER_DESCRIPTORS
2016-06-01 09:35:39,639 INFO o.t.o.u.DescriptorSource:64 Loading: RELAY_EXTRA_INFOS
2016-06-01 09:38:10,562 INFO o.t.o.u.DescriptorSource:64 Loading: EXIT_LISTS
2016-06-01 09:38:51,159 INFO o.t.o.u.DescriptorSource:64 Loading: BRIDGE_STATUSES
2016-06-01 09:40:29,716 INFO o.t.o.u.DescriptorSource:64 Loading: BRIDGE_SERVER_DESCRIPTORS
2016-06-01 09:41:58,737 INFO o.t.o.u.DescriptorSource:64 Loading: BRIDGE_EXTRA_INFOS
2016-06-01 09:43:44,958 INFO o.t.o.cron.Main:184 Reading descriptors.
2016-06-01 09:43:44,959 INFO o.t.o.u.DescriptorSource:153 Reading archived descriptors...
2016-06-02 02:51:09,249 INFO o.t.o.u.DescriptorSource:200 Read archived descriptors
2016-06-02 02:51:09,249 DEBUG o.t.o.u.DescriptorSource:84 Reading recent RELAY_SERVER_DESCRIPTORS ...
2016-06-02 02:53:01,224 INFO o.t.o.u.DescriptorSource:129 Read recent relay server descriptors
2016-06-02 02:53:01,224 DEBUG o.t.o.u.DescriptorSource:88 Reading recent RELAY_EXTRA_INFOS ...
2016-06-02 03:31:50,889 INFO o.t.o.u.DescriptorSource:132 Read recent relay extra-info descriptors
2016-06-02 03:31:50,890 DEBUG o.t.o.u.DescriptorSource:91 Reading recent EXIT_LISTS ...
2016-06-02 03:32:13,478 INFO o.t.o.u.DescriptorSource:135 Read recent exit lists
2016-06-02 03:32:13,479 DEBUG o.t.o.u.DescriptorSource:94 Reading recent RELAY_CONSENSUSES ...
2016-06-02 08:29:52,761 INFO o.t.o.u.DescriptorSource:126 Read recent relay network consensuses
2016-06-02 08:29:52,765 DEBUG o.t.o.u.DescriptorSource:97 Reading recent BRIDGE_SERVER_DESCRIPTORS ...
2016-06-02 08:31:32,294 INFO o.t.o.u.DescriptorSource:141 Read recent bridge server descriptors
2016-06-02 08:31:32,295 DEBUG o.t.o.u.DescriptorSource:101 Reading recent BRIDGE_EXTRA_INFOS ...
2016-06-02 09:22:29,247 INFO o.t.o.u.DescriptorSource:144 Read recent bridge extra-info descriptors
2016-06-02 09:22:29,247 DEBUG o.t.o.u.DescriptorSource:104 Reading recent BRIDGE_STATUSES ...
2016-06-02 09:23:44,681 INFO o.t.o.u.DescriptorSource:138 Read recent bridge network statuses
2016-06-02 09:23:44,682 INFO o.t.o.cron.Main:186 Updating internal status files.
2016-06-02 09:23:44,682 DEBUG o.t.o.u.StatusUpdateRunner:36 Begin update of NodeDetailsStatusUpdater
2016-06-02 09:25:02,021 INFO o.t.o.u.NodeDetailsStatusUpdater:379 Read node statuses
2016-06-02 09:25:12,500 INFO o.t.o.u.NodeDetailsStatusUpdater:381 Started reverse domain name lookups
2016-06-02 09:31:29,006 INFO o.t.o.u.NodeDetailsStatusUpdater:383 Looked up cities and ASes
2016-06-02 09:31:29,112 INFO o.t.o.u.NodeDetailsStatusUpdater:385 Calculated path selection probabilities
2016-06-02 09:31:29,224 INFO o.t.o.u.NodeDetailsStatusUpdater:387 Computed effective and extended families
2016-06-02 09:31:29,252 INFO o.t.o.u.NodeDetailsStatusUpdater:389 Finished reverse domain name lookups
2016-06-02 09:34:37,918 INFO o.t.o.u.NodeDetailsStatusUpdater:391 Updated node and details statuses
2016-06-02 09:34:37,918 INFO o.t.o.u.StatusUpdateRunner:38 NodeDetailsStatusUpdater updated status files
2016-06-02 09:34:37,918 DEBUG o.t.o.u.StatusUpdateRunner:36 Begin update of BandwidthStatusUpdater
2016-06-02 09:34:37,918 INFO o.t.o.u.StatusUpdateRunner:38 BandwidthStatusUpdater updated status files
2016-06-02 09:34:37,918 DEBUG o.t.o.u.StatusUpdateRunner:36 Begin update of WeightsStatusUpdater
2016-06-02 09:34:37,918 INFO o.t.o.u.StatusUpdateRunner:38 WeightsStatusUpdater updated status files
2016-06-02 09:34:37,919 DEBUG o.t.o.u.StatusUpdateRunner:36 Begin update of ClientsStatusUpdater
2016-06-02 09:47:57,798 INFO o.t.o.u.StatusUpdateRunner:38 ClientsStatusUpdater updated status files
2016-06-02 09:47:57,799 DEBUG o.t.o.u.StatusUpdateRunner:36 Begin update of UptimeStatusUpdater
2016-06-02 10:28:41,049 INFO o.t.o.u.StatusUpdateRunner:38 UptimeStatusUpdater updated status files
2016-06-02 10:28:41,049 INFO o.t.o.cron.Main:194 Updating document files.
2016-06-02 10:28:41,049 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing SummaryDocumentWriter
2016-06-02 10:29:03,109 INFO o.t.o.w.SummaryDocumentWriter:97 Wrote summary document files
2016-06-02 10:29:03,110 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing DetailsDocumentWriter
2016-06-02 10:33:58,870 INFO o.t.o.w.DetailsDocumentWriter:46 Wrote details document files
2016-06-02 10:33:58,870 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing BandwidthDocumentWriter
2016-06-02 11:45:04,633 INFO o.t.o.w.BandwidthDocumentWriter:54 Wrote bandwidth document files
2016-06-02 11:45:04,634 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing WeightsDocumentWriter
2016-06-02 12:19:42,480 INFO o.t.o.w.WeightsDocumentWriter:55 Wrote weights document files
2016-06-02 12:19:42,481 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing ClientsDocumentWriter
2016-06-02 12:23:22,577 INFO o.t.o.w.ClientsDocumentWriter:84 Wrote clients document files
2016-06-02 12:23:22,577 DEBUG o.t.o.w.DocumentWriterRunner:28 Writing UptimeDocumentWriter
2016-06-02 12:39:58,477 INFO o.t.o.w.UptimeDocumentWriter:57 Wrote uptime document files
2016-06-02 12:39:58,477 INFO o.t.o.cron.Main:199 Shutting down.
2016-06-02 12:39:58,477 DEBUG o.t.o.u.DescriptorSource:204 Writing parse histories for recent descriptors...
2016-06-02 12:39:58,492 INFO o.t.o.cron.Main:202 Wrote parse histories

karsten@onionoo:/srv/onionoo.thecthulhu.com/onionoo$ java -DLOGBASE=/srv/onionoo.thecthulhu.com/onionoo/log-cron/ -Xmx4g -jar dist/onionoo-3.1.0.jar --single-run && java -DLOGBASE=/srv/onionoo.thecthulhu.com/onionoo/log-cron/ -Xmx4g -jar dist/onionoo-3.1.0.jar
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuilder.append(StringBuilder.java:132)
        at org.torproject.onionoo.docs.DocumentStore.writeNodeStatuses(DocumentStore.java:710)
        at org.torproject.onionoo.docs.DocumentStore.flushDocumentCache(DocumentStore.java:669)
        at org.torproject.onionoo.cron.Main.shutDown(Main.java:205)
        at org.torproject.onionoo.cron.Main.run(Main.java:121)
        at org.torproject.onionoo.cron.Main.runOrScheduleExecutions(Main.java:93)
        at org.torproject.onionoo.cron.Main.main(Main.java:32)

karsten@onionoo:/srv/onionoo.thecthulhu.com/onionoo$ ls -lh in/archive/
total 6.9G
-rw-r--r-- 1 karsten karsten 2.9G May 31 23:40 bridge-descriptors-2016-05.tar
-rw-r--r-- 1 karsten karsten 1.1G May 30 23:19 consensuses-2016-05.tar
-rw-r--r-- 1 karsten karsten 1.5G May 31 15:57 extra-infos-2016-05.tar
-rw-r--r-- 1 karsten karsten 1.5G May 31 12:41 server-descriptors-2016-05.tar

I didn't really start investigating. Note that it takes over 24 hours to do the processing, so we cannot reproduce this bug as easily.

Child Tickets

Change History (7)

comment:1 Changed 2 years ago by iwakeh

An initial analysis:

With enough memory provided the archive data import of a month doesn't take that long.

First I altered the code to not run the download part in order to get a clearer picture there, i.e.
the download descriptors part is omitted (hope, this isn't totally unreasonable).

I ran the imports with the tars listed above and an otherwise empty Onionoo environment.
Using 4g to 6g it needs less than two hours for the entire import:

2016-06-05 18:47:47,866 INFO o.t.o.cron.Main:191 Reading descriptors.
2016-06-05 18:47:47,867 INFO o.t.o.u.DescriptorSource:153 Reading archived descriptors...
...
2016-06-05 20:23:10,832 INFO o.t.o.cron.Main:236 Cleaning up.
2016-06-05 20:23:10,833 INFO o.t.o.cron.Main:242 Done.
2016-06-05 20:23:10,833 INFO o.t.o.cron.Main:246 Releasing lock.
2016-06-05 20:23:10,865 INFO o.t.o.cron.Main:248 Released lock

Using way less causes the OOM, e.g. only 2g, which happens pretty quickly.

When 3g are provided the processing runs for a very, very long time as visible in the log excerpt in the description. This happens b/c the jvm is garbage collection most of the time to stay running, but there is hardly any processing time left for the onionoo updater. With such a scarce amount of memory it just hits oom anyway (as in the initial error description).

I assume, that the full onionoo single run will need memory >= 6g to run comfortably when importing a month worth of data.
The 4g are not sufficient and just keep the jvm busy collecting garbage and the processing takes ages.

Hope, it is possible to increase the memory! As an immediate measure at least.

Still other means of reducing the memory consumption should be investigated.
Those should be new trac issues then.

PS: I also caught the nfe's (#19259).

comment:2 Changed 2 years ago by karsten

Update: I retried the import with 6g instead of 4g:

2016-06-06 08:27:34,845 INFO o.t.o.cron.Main:92 Going to run one-time updater ... 
[..]
2016-06-07 01:28:11,260 INFO o.t.o.cron.Main:241 Released lock

So, it took quite a while but finished without problems. Good to know.

Next step will be to re-import again, because this again took longer than 3 days. Let's see if 6g are enough for 2016-05 and 2016-06 descriptors.

comment:3 Changed 2 years ago by iwakeh

Oh, I think these roughly 17 hours were too long and look unhealthy.

I'd suggest moving it up to 12g if possible for the bigger import.

Did you monitor cpu usage during the import? How many cores does the server have?

comment:4 Changed 2 years ago by karsten

The server has 16g of which 8g are free. I guess I could easily go up to 8g for this import, but 12g might be too much.

I did not monitor CPU usage, but I can take a look during the upcoming import. What in particular are you interested in? The server has 4 cores, AFAICS.

comment:5 Changed 2 years ago by iwakeh

OK, 8g might work, too.
I can only guesstimate how much additional memory the full run needs. And, June data is not yet a full month worth in addition.

Regarding cpu usage:
I noticed that the import usually busies about two cores. The jvm uses parallelization a lot.
So, when all cores start getting busy for too long the problem has started. At least that's what I observed on a machine with more than four cores.

comment:6 Changed 12 months ago by karsten

Keywords: metrics-2018 added

comment:7 Changed 12 months ago by karsten

Owner: set to metrics-team
Status: newassigned
Note: See TracTickets for help on using tickets.