Opened 4 years ago

Closed 2 years ago

#16723 closed enhancement (wontfix)

randomize HH:MM in AccountingStart for a more even distribution of hibernating relay resources

Reported by: cypherpunks
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords:
Cc: tyseom
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

AccountingStart defaults to 0:00 local time. This results in many relays waking up from hibernation at the same second.

What about randomizing the default value on first start when an AccountingMax config is present?
(Write the randomized time to disk and read that file the next time tor starts.)

Here are some numbers (from relays probably using a daily quota):

| timestamp (second granularity) | #relays |
|--------------------------------|---------|
| 2015-08-03 04:00:00            |      25 |
| 2015-08-03 00:00:00            |      26 |
| 2015-08-02 22:00:00            |      30 |

(first column shows the timestamp when relays awake, second column shows how many)

from the tor manual:

AccountingStart day|week|month [day] HH:MM

All times are local, and given in 24-hour time. (Default: "month 1 0:00")

Child Tickets

Change History (20)

comment:1 Changed 4 years ago by cypherpunks

Changing the default won't change much if these relays explicitly set their start time to '0:00'.

Last edited 4 years ago by cypherpunks

comment:2 Changed 4 years ago by nickm

Status: new → needs_information

AccountingStart is the time at which the period starts, not the time at which relays wake up. The wakeup time is determined by estimating our bandwidth, and trying to pick a random start point that will still allow us to consume all our accountingbytes.

Is there a place where the documentation explains this badly?

The calculation is done in accounting_set_wakeup_time(). For more information, see the big comment in hibernate.c , near the start.
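The idea described here might be sketched like this (an illustrative simplification in Python, not the actual C code in accounting_set_wakeup_time(); the function name and parameters are invented):

```python
import random

def pick_wakeup_time(interval_start, interval_length, accounting_max, estimated_rate):
    """Pick a random wakeup time (seconds since epoch) early enough that
    the relay can still spend its whole byte quota before the interval ends."""
    # How long we expect to need to exhaust the quota at the estimated rate.
    time_to_exhaust = accounting_max / estimated_rate
    if time_to_exhaust >= interval_length:
        # We can't exhaust the quota within the interval anyway:
        # wake up right at the interval start.
        return interval_start
    # Any wakeup inside this window still leaves room to spend all bytes.
    latest_offset = interval_length - time_to_exhaust
    return interval_start + random.uniform(0, latest_offset)
```

Note that with an accurate bandwidth estimate, wakeup times are already spread randomly across the interval; the thundering-herd pattern appears only when the estimate degenerates and the wakeup falls back to the interval start.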

Is the calculation not working correctly for these relays?

comment:3 Changed 4 years ago by teor

Since the consensus is only updated every hour, it looks as though relays that woke up in the last hour all woke up at HH:00.

comment:4 Changed 4 years ago by cypherpunks

> AccountingStart is the time at which the period starts, not the time at which relays wake up.

Let's make sure we mean the same thing when saying 'wake up':
wake up = the relay starts to relay traffic again / publishes a new descriptor where the hibernate flag is not set

I understood AccountingStart as the time when the relay starts to relay traffic based on onionoo data.
Maybe I'm wrong, but I'll explain how I came to that conclusion.

Let's have a look at yesterday's data.

There were 59 relays restarting at 2015-08-03 22:00:00 UTC.

By 2015-08-04 02:00:00 (last_seen, hibernate=1), 12 relays were already hibernating (I assume they had used up their AccountingMax by that time: 2 relays somewhere between 0:00-1:00 and 10 relays somewhere between 1:00-2:00).

data:
https://raw.githubusercontent.com/nusenu/tor-network-observations/master/understanding_accountingstart_by_example.txt

@nick: If I understood you correctly, a relay should wake up around ~20:00 UTC if its calculations say that it will take 4 hours to eat up the AccountingMax traffic (and its AccountingStart is at 00:00 UTC).
Onionoo data says otherwise or am I misinterpreting it?

@teor: onionoo's 'last_restarted' field has second granularity (unlike first_seen and last_seen, which are consensus timestamps with one-hour granularity).

comment:5 Changed 4 years ago by tyseom

Cc: tyseom added

comment:6 Changed 4 years ago by cypherpunks

Does the observed behavior mean these relays have no bandwidth estimates?

I just stumbled upon this log entry in the code:
https://gitweb.torproject.org/tor.git/tree/src/or/hibernate.c#n570

Since I was already in contact with two relay operators, I'll ask whether they can share some of the hibernate log entries.

comment:8 Changed 4 years ago by cypherpunks

I wanted to know the ratio between hibernating relays where
interval_wakeup_time == interval_start_time (identified by MM:SS = 00:00) and relays where
the start_time is not the wakeup_time.

For 2015-08-04 it is:
43 (wakeup_time != start_time) vs. 71 (wakeup_time == start_time)

comment:9 Changed 4 years ago by cypherpunks

A relay operator running 4 of these hibernating relays provided me with the output of
grep 'Configured hibernation.' /var/log/tor/log

All 4 relays say:

Aug 04 06:25:09.000 [notice] Configured hibernation.  This interval began at 2015-08-04 00:00:00; the scheduled wake-up time was 2015-08-04 00:00:00; we expect to exhaust our quota for this interval around 2015-08-05 00:00:00; the next interval begins at 2015-08-05 00:00:00 (all times local)

In reality it took them between 9 and 15 hours to exhaust the quota.

So the approach to prevent them all from starting at the same time would be to randomize the interval start time? In that case we would distribute wakeups more evenly even if relays are bad at estimating their bandwidth usage.

comment:10 Changed 4 years ago by teor

If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?
Have we confirmed that the estimate is inaccurate on a consistent basis?

Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.

I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.

So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.

When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.

Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.

(For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

comment:11 in reply to:  10 ; Changed 4 years ago by cypherpunks

Replying to teor:

> If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?

I just assumed that having accurate estimates is harder.

> Have we confirmed that the estimate is inaccurate on a consistent basis?

Depends on what accuracy you are aiming at.
Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.

> Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.
>
> I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.
>
> So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.
>
> When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.
>
> Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.
>
> (For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

I've no strong opinion on how and whether that behavior gets changed; we could also simply send an email to tor-relays asking operators to change their AccountingStart time if they wish to distribute restarts more evenly.

Last edited 4 years ago by cypherpunks

comment:12 in reply to:  11 ; Changed 4 years ago by teor

Replying to cypherpunks:

Replying to teor:

> > Have we confirmed that the estimate is inaccurate on a consistent basis?
>
> Depends on what accuracy you are aiming at.
> Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.

This is expected behaviour - most relays with hibernation set should exhaust their bandwidth part-way through the period.

Unless you're focused on the 9 that didn't?

comment:13 in reply to:  12 Changed 4 years ago by cypherpunks

> > Depends on what accuracy you are aiming at.
> > Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.
>
> This is expected behaviour - most relays with hibernation set should exhaust their bandwidth part-way through the period.

Ok.

comment:14 in reply to:  10 ; Changed 4 years ago by cypherpunks

Replying to teor:

> If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?
> Have we confirmed that the estimate is inaccurate on a consistent basis?
>
> Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.
>
> I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.
>
> So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.
>
> When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.
>
> Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.
>
> (For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

When using the suggested method (a random value generated by the relay once), this problem would not occur, right?

Anyway I'll just write a short email to tor-relays and we can close this ticket.

comment:15 in reply to:  14 Changed 4 years ago by arma

Replying to cypherpunks:

> When using the suggested method (random value generated by the relay once) then this problem does not occur, right?

No, I think the problem can still occur. If my provider charges me for going over my quota on a given day (midnight to midnight), but from Tor's perspective there are two intervals that overlap with that day, then I could end up spending most of my bandwidth in the later half of the first interval, and most of it in the early half of the second interval, and now I spent twice as much as I wanted to on the day.
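arma's overlap scenario can be made concrete with a toy calculation (a hypothetical illustration; the helper function and the numbers are invented, not taken from Tor):

```python
def usage_in_billing_day(spend_windows, day_start, day_end):
    """Sum bytes spent inside one billing day, given (start, end, bytes)
    windows in hours, where bytes are spent uniformly over each window."""
    total = 0.0
    for start, end, spent in spend_windows:
        # Portion of this spending window that overlaps the billing day.
        overlap = max(0.0, min(end, day_end) - max(start, day_start))
        total += spent * overlap / (end - start)
    return total

# Quota of 100 units per accounting interval; billing day runs 0:00-24:00.
# The first interval's spending lands late (0:00-4:00 of the billing day),
# the second interval's spending lands early (20:00-24:00 of the same day):
double_spend = usage_in_billing_day([(0, 4, 100), (20, 24, 100)], 0, 24)
# → 200.0: two full quotas charged against a single billing day.
```

This is the failure mode being described: even though each accounting interval stays within its quota, both intervals' usage can land inside the same provider billing day.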

In sum, letting operators know that they can change the 00:00 is a fine thought, if they do it with knowledge of what's going on inside Tor.

Seems to me that the better answer is to make Tor better at predicting how much bandwidth it will take on a day, so it can start up at more random times.

Do we think there are bugs in the current prediction algorithm, or is it just the case that relays often don't have any data from the previous day?

comment:16 Changed 4 years ago by dgoulet

Milestone: Tor: 0.2.???

comment:17 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:18 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:19 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:20 Changed 2 years ago by nickm

Resolution: wontfix
Severity: Normal
Status: needs_information → closed

per comments above, closing: this isn't actually a good idea.
