#2704 closed defect (fixed)
BandwidthObserved higher than BandwidthBurst
Reported by: | keb | Owned by: | Sebastian |
---|---|---|---|
Priority: | High | Milestone: | Tor: 0.2.2.x-final |
Component: | Core Tor/Tor | Version: | Tor: 0.2.2.23-alpha |
Severity: | Keywords: | 0.2.2.23-alpha bandwidth observed tor-relay | |
Cc: | Actual Points: | ||
Parent ID: | Points: | ||
Reviewer: | Sponsor: |
Description
got this line looking at my relay:
Bandwidth (Max/Burst/Observed - In Bps): 340992 / 1022976 / 30676070
1h20 after an upgrading from 0.2.2.22-alpha to 0.2.2.23-alpha
Child Tickets
Attachments (1)
Change History (30)
Changed 8 years ago by
Attachment: | state-20110310 added |
---|
comment:1 Changed 8 years ago by
Sure enough, BWHistoryWriteMaxima has values like 30689280. Are we writing them wrong, or using them wrong?
comment:2 Changed 8 years ago by
Milestone: | → Tor: 0.2.2.x-final |
---|---|
Version: | → Tor: 0.2.2.23-alpha |
Roger suspects that the new remember-your-observed-bandwidth stuff in rephist.c in 0.2.2.23-alpha may be implicated.
comment:3 Changed 8 years ago by
The functions to scrutinize are rep_hist_update_bwhist_state_section, and rep_hist_load_bwhist_state_section , both in rephist.c .
comment:4 Changed 8 years ago by
i should mention the 1st report line above is from
http://torstatus.all.de/router_detail.php?FP=D3ED911183DF31FDB744A22F94BE6A856AC77D12
but when i got some output from blutmagie's status page a few hours later, it looks ok
http://torstatus.blutmagie.de/router_detail.php?FP=d3ed911183df31fdb744a22f94be6a856ac77d12
so it is a case of one website misinterpreting the units in which that field is stored in the database?
comment:5 Changed 8 years ago by
maxval = b->maxima[i] / NUM_SECS_ROLLING_MEASURE;
implies that maxval is a per-second amount, whereas b->maxima is a per-10-second amount. So the Maxima we write to the state file is a per-second amount.
Then when we read it, we
v = tor_parse_uint64(cp, 10, 0, UINT64_MAX, &ok, NULL); if (have_maxima) { maxstr = smartlist_get(s_maxima, cp_sl_idx); mv = tor_parse_uint64(maxstr, 10, 0, UINT64_MAX, &ok_m, NULL); mv *= NUM_SECS_ROLLING_MEASURE; } else { /* No maxima known; guess average rate to be conservative. */ mv = v / s_interval; }
and then
b->max_total = mv;
where max_total is also a per-10-second amount. So
mv = v / s_interval;
ought to be
mv = v / s_internal * NUM_SECS_ROLLING_MEASURE;
But that's not our bug, since it would produce smaller numbers not larger numbers. But it is a bug.
comment:6 Changed 8 years ago by
Also, why is the argument for rep_hist_load_bwhist_state_section() called s_begins yet the state line is called "BWHistoryWriteEnds"? That sounds like a recipe for more confusion.
comment:7 Changed 8 years ago by
This could be it:
add_obs(b, start, v);
We are reading in a per-900-second value, and then telling our rephist system that at second 'start', we saw all of those bytes get written. I suspect this causes the per-10-second calculation to be 1/10 of the per-900-second value.
comment:8 Changed 8 years ago by
moria1 has funny-looking values here too:
BWHistoryReadValues 536190976,409089024,532183040,497468416,... BWHistoryReadMaxima 53618688,40908800,53218304,49745920,...
Notice the pattern?
comment:9 Changed 8 years ago by
(The following redundant note is added to catch user searches.)
Vidalia "Tor Bandwidth Usage" chart is affected by this issue. The vertical scale is much larger than the actual transfer numbers, so the indicator hardly leaves the floor.
comment:10 Changed 8 years ago by
Priority: | normal → major |
---|
comment:11 Changed 8 years ago by
Status: | new → needs_review |
---|
One way to resolve this one for 0.2.2 is to revert the patch that caused it.
Then we could reschedule that patch for 0.2.3, but not merge it in yet because of this bug.
comment:12 Changed 8 years ago by
Hrmm, I hope we're not using int32_t's anywhere related to this. We're approaching maxint observed bandwidth on several relays. I guess we'll find out how robust all associated tor and non-tor code is soon :).
comment:13 Changed 8 years ago by
I think part of the analysis here is wrong. Specifically, the suggestion to do
mv = v / s_internal * NUM_SECS_ROLLING_MEASURE;
is good, but doesn't influence the other bug this ticket is about, because the above fix is only relevant when we don't know the maximum. If we know the maximum, we already do
mv *= NUM_SECS_ROLLING_MEASURE;
comment:14 Changed 8 years ago by
Owner: | set to Sebastian |
---|---|
Status: | needs_review → assigned |
Oh, also I think we should not just revert the patch. I have a fix that seems to work in local testing, cleaning it up to add it here.
comment:16 Changed 8 years ago by
We ran into a bug on IRC:
18:56 < nickm> hang on. That loop, where we do "while (start < now)". Is that
right? If this interval is not the most recent one, then won't
we add the observation far too many times?
and a simplification:
18:57 < nickm> also you can simplify the overflow logic a lot
18:57 < nickm> instead of overflow, have, say, "actual_interval_len". Set it
to either s_interval or to now-start.
18:57 < nickm> then compute cur = v / actual_interval_len
comment:18 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | needs_review → closed |
Thanks! Rebased (for the fixup) & merged.
comment:19 Changed 8 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:20 Changed 8 years ago by
Apparently arma says this isn't fixed for him on moria1, and that he's talked about it with sebastian. Somebody please post here?
comment:21 Changed 8 years ago by
moria1's current descriptor says bandwidth 512000 10485760 1306223, with the burst well within above the max observed.
Looking at the past days, moria reported very high values shortly after starting up, and the values then dropped. This didn't appear to change significantly with the patch vs without it.
comment:22 Changed 8 years ago by
I can't really reproduce the issue with fluxe3, however: I just restarted it (on 0.2.2.24-alpha), and in its descriptor it said bandwidth 921600 10240000 83463884 - so we don't have an inflated value here. hrm.
comment:23 Changed 8 years ago by
Eh, disregard the comment above. I guess my maths sucks a little more than usual today
comment:24 Changed 8 years ago by
Here's another bug in rep_hist_update_bwhist_state_section(), I think:
tor_asprintf(&cp, U64_FORMAT, U64_PRINTF_ARG(b->max_total & ~0x3ff)); smartlist_add(*s_maxima, cp); }} It seems we need to divide b->max_total by NUM_SECS_ROLLING_MEASURE, no? Also, this following change from 105b94b7 seems to be strange to me, but I can't put my finger on why so far. {{{ /* Clean up maxima and observed */ - /* Do we really want to zero this for the purpose of max capacity? */ for (i=0; i<NUM_SECS_ROLLING_MEASURE; ++i) { b->obs[i] = 0; } b->total_obs = 0; - for (i=0; i<NUM_TOTALS; ++i) { - b->maxima[i] = 0; - } - b->max_total = 0; }}}
comment:25 Changed 8 years ago by
Status: | reopened → needs_review |
---|
ok, so I think we should definitely fix that bug soon. This does lead to inflated Maxima history in the state file. bug2704 for the patch.
comment:27 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | needs_review → closed |
I'm calling this one fixed. This time for sure! Thanks.
comment:28 Changed 6 years ago by
Keywords: | tor-relay added |
---|
comment:29 Changed 6 years ago by
Component: | Tor Relay → Tor |
---|
state file from tor data dir