Opened 14 months ago

Last modified 9 months ago

#25141 new defect

enabling CellStatistics results in gigabytes of incremental memory consumption

Reported by: starlight Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.3.2.9
Severity: Major Keywords: 029-backport, tor-dos, 034-triage-20180328, 034-removed-20180328, 031-unreached-backport
Cc: karsten, robgjansen, iwakeh Actual Points:
Parent ID: #24806 Points: 1
Reviewer: Sponsor:

Description

This was causing my relay to crash every two days till I figured it out. At a minimum Tor should warn about memory consumption when this option is enabled. IMO the present behavior is broken and the feature should be redesigned or eliminated.

Child Tickets

Change History (19)

comment:1 Changed 14 months ago by starlight

Component: - Select a componentCore Tor/Tor
Version: Tor: 0.3.2.9

comment:2 Changed 14 months ago by teor

Keywords: 031-backport 029-backport ddos must-fix-before-033-stable added
Milestone: Tor: 0.3.3.x-final
Parent ID: #24806
Points: 1
Priority: MediumHigh
Severity: NormalMajor

See #24806 for some background, particularly https://trac.torproject.org/projects/tor/ticket/24806#comment:13

Marking for backport, let's get this done soon, because it makes the DDoS worse.

Edit: spacing

Last edited 14 months ago by teor (previous) (diff)

comment:3 Changed 14 months ago by arma

Can you give us any hints about what goes wrong? How quickly does the bloating happen?

I just ran my fast non-exit relay for an hour or so under valgrind with CellStatistics on, and it didn't find any leaks.

Now I'm running it without valgrind, and it doesn't seem to be bigger than normal.

Does it happen only at certain points, e.g. where it publishes a new descriptor with these stats?

Is there anything else I should do to trigger it?

comment:4 Changed 14 months ago by arma

Ok, my tor relay, now that it's not under valgrind, has been slowly growing.

It could be actual leaks that only happen when the relay is fast enough to do something. It could be memory fragmentation. Or it could be that we're simply keeping a whole lot of extra stuff in memory.

comment:5 Changed 14 months ago by arma

Yep, this thing just grows and grows. I'm up to 6 gigs of ram used.

comment:6 Changed 14 months ago by arma

Cc: karsten added

We're keeping a whole lot of stuff in memory.

diff --git a/src/or/rephist.c b/src/or/rephist.c
index 15fb674..bb59b6b 100644
--- a/src/or/rephist.c
+++ b/src/or/rephist.c
@@ -2390,6 +2390,8 @@ rep_hist_add_buffer_stats(double mean_num_cells_in_queue,
   if (!circuits_for_buffer_stats)
     circuits_for_buffer_stats = smartlist_new();
   smartlist_add(circuits_for_buffer_stats, stats);
+  log_info(LD_HIST, "circuits_for_buffer_stats now size %d",
+           smartlist_len(circuits_for_buffer_stats));
 }
 
 /** Remember cell statistics for circuit <b>circ</b> at time

Every time a circuit closes, we malloc some stuff and add it onto this smartlist. And we only do something with the contents of the smartlist after 24 hours.

During this overload period, my relay handles many hundreds of millions of circuits in a day. Keeping *anything* about each one of them is going to be too much.

What do we actually use this option for? Some experiment we did in the distant past? It sounds like "recommend against enabling it" is the first step. And then either redesigning it to accomplish whatever goals remain, or stripping it out.

I am cc'ing Karsten because I bet he can help us with the 'distant past' part.

comment:7 in reply to:  6 Changed 14 months ago by yawning

Replying to arma:

I am cc'ing Karsten because I bet he can help us with the 'distant past' part.

commit b493a2ccb97e00f4fe3acb5c59c941c2babaeebb
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Sun Jul 5 19:53:25 2009 +0200

    If configured, write cell statistics to disk periodically.

And then either redesigning it to accomplish whatever goals remain, or stripping it out.

Instead of deciles, a combination of variance, skew and kurtosis could be used (a la Commons Math's AbstractStorelessUnivariateStatistic).

Last edited 14 months ago by yawning (previous) (diff)

comment:8 Changed 14 months ago by karsten

Cc: robgjansen iwakeh added

I'd like to take a closer look and also discuss this at the metrics team meeting on Thursday. Copying iwakeh and robgjansen who I think might have an opinion on this, too.

comment:9 Changed 14 months ago by dgoulet

@arma, CellStatistics still needs to be explicitly enabled in the torrc. Is your valgrind to 6GB of RAM have it enabled? If not, something else is causing massive memory usage (again...)

comment:10 Changed 14 months ago by dgoulet

Just as an FYI, I've opened #25149 to hook the rephist layer in the OOM handler.

comment:11 Changed 14 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

I'm thinking this is an 0.3.4 item, unless there's some reason that people really need to have CellStatistics turned on.

comment:12 Changed 14 months ago by robgjansen

TL;DR

These stats are useful for traffic modeling. I think a fine plan is to recommend to phase out cell statistics, with the goal of replacing them in the long term with more useful privacy-preserving measurements using the new PrivCount protocol when its ready, such as bytes-per-stream and streams-per-circuit.

Data point about RAM usage:

I've had the CellStatistics option enabled on my fast relays for a couple of years now. I've noticed that RAM usage on any of the relays would occasionally increase above 2 GiB (even when I temporarily experimented with MaxMemInQueues 512 MB), but I never linked it to CellStatistics. I've now disabled CellStatistics because RAM has recently become a bit of a problem for other reasons.

Do we still care about cell statistics?

Theoretically, cell statistics can help us understand how well our test networks model the conditions of the real Tor network, which would be useful for test networks that actually try to faithfully model the conditions of the real Tor network (like Shadow). For example, we could run a bunch of clients in the test network that are initiating stream transfers according to some understanding of client usage, and then we collect the cell stats on the relays in the test network and compare them to the stats from the public network. How close we are provides a measure of fidelity. I actually did this in the original Shadow paper.

However, circuit-level cell information by itself is not the greatest for informing how the clients in the test network should generate traffic in the first place. For that, we need a combination of stream-level and circuit-level information. From a client modeling perspective, I need to know how many streams each client should create, how much to download on each stream, how many pauses to add and how long to pause, etc. Circuits and cells are a second-level effect of the client model.

The privacy implications are also worth considering (some concerns raised here, and moved here). I think individual relay stats are much more specific that what we need; for modeling purposes, we lose very little utility by using aggregate network results, but they are much safer.

comment:13 Changed 13 months ago by teor

Keywords: 033-must added; must-fix-before-033-stable removed

must-fix-before-033-stable is now 033-must

comment:14 Changed 13 months ago by teor

Keywords: 033-must removed

Looks like we don't need this in 033

comment:15 Changed 13 months ago by dgoulet

Keywords: tor-dos added; ddos removed

comment:16 Changed 12 months ago by nickm

Keywords: 034-triage-20180328 added

comment:17 Changed 12 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:18 Changed 12 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:19 Changed 9 months ago by teor

Keywords: 031-unreached-backport added; 031-backport removed

0.3.1 is end of life, there are no more backports.
Tagging with 031-unreached-backport instead.

Note: See TracTickets for help on using tickets.