Opened 4 years ago

Closed 2 years ago

#16723 closed enhancement (wontfix)

randomize HH:MM in AccountingStart for a more even distribution of hibernating relay resources

Reported by: cypherpunks
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords:
Cc: tyseom
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

AccountingStart defaults to 0:00 local time. This results in many relays waking up from hibernation at the same second.

What about randomizing the default value on first start when an AccountingMax config is present?
(Write the randomized time to disk and read that file the next time tor starts.)

Here are some numbers (from relays probably using a daily quota):

| timestamp (second granularity) | #relays |
|--------------------------------|---------|
| 2015-08-03 04:00:00            |      25 |
| 2015-08-03 00:00:00            |      26 |
| 2015-08-02 22:00:00            |      30 |

(first column shows the timestamp when relays awake, second column shows how many)

from the tor manual:

AccountingStart day|week|month [day] HH:MM

All times are local, and given in 24-hour time. (Default: "month 1 0:00")

Child Tickets

Change History (20)

comment:1 Changed 4 years ago by cypherpunks

Changing the default won't change much if these relays explicitly set their start time to '0:00'.

Last edited 4 years ago by cypherpunks

comment:2 Changed 4 years ago by nickm

Status: new → needs_information

AccountingStart is the time at which the period starts, not the time at which relays wake up. The wakeup time is determined by estimating our bandwidth, and trying to pick a random start point that will still allow us to consume all our accountingbytes.

Is there a place where the documentation explains this badly?

The calculation is done in accounting_set_wakeup_time(). For more information, see the big comment in hibernate.c , near the start.
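The idea described here might be sketched like this (an illustrative simplification in Python, not the actual C code in accounting_set_wakeup_time(); the function name and parameters are invented):

```python
import random

def pick_wakeup_time(interval_start, interval_length, accounting_max, estimated_rate):
    """Pick a random wakeup time (seconds since epoch) early enough that
    the relay can still spend its whole byte quota before the interval ends."""
    # How long we expect to need to exhaust the quota at the estimated rate.
    time_to_exhaust = accounting_max / estimated_rate
    if time_to_exhaust >= interval_length:
        # We can't exhaust the quota within the interval anyway:
        # wake up right at the interval start.
        return interval_start
    # Any wakeup inside this window still leaves room to spend all bytes.
    latest_offset = interval_length - time_to_exhaust
    return interval_start + random.uniform(0, latest_offset)
```

Note that with an accurate bandwidth estimate, wakeup times are already spread randomly across the interval; the thundering-herd pattern appears only when the estimate degenerates and the wakeup falls back to the interval start.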

Is the calculation not working correctly for these relays?

comment:3 Changed 4 years ago by teor

Since the consensus is only updated every hour, it looks as though relays that woke up in the last hour all woke up at HH:00.

comment:4 Changed 4 years ago by cypherpunks

> AccountingStart is the time at which the period starts, not the time at which relays wake up.

Let's make sure we mean the same thing when saying 'wake up':
wake up = the relay starts to relay traffic again / publishes a new descriptor where the hibernate flag is not set

I understood AccountingStart as the time when the relay starts to relay traffic based on onionoo data.
Maybe I'm wrong, but I'll explain how I came to that conclusion.

Let's have a look at yesterday's data.

There were 59 relays restarting at 2015-08-03 22:00:00 UTC.

By 2015-08-04 02:00:00 (last_seen, hibernate=1), 12 relays were already hibernating (I assume they had used up their AccountingMax by that time: 2 relays somewhere between 0:00-1:00 and 10 relays somewhere between 1:00-2:00).

data:
https://raw.githubusercontent.com/nusenu/tor-network-observations/master/understanding_accountingstart_by_example.txt

@nick: If I understood you correctly, a relay should wake up around ~20:00 UTC if its calculations say that it will take 4 hours to eat up the AccountingMax traffic (and its AccountingStart is at 00:00 UTC).
Onionoo data says otherwise or am I misinterpreting it?

@teor: onionoo's 'last_restarted' field has second granularity (unlike first_seen and last_seen, which are consensus timestamps with one-hour granularity).

comment:5 Changed 4 years ago by tyseom

Cc: tyseom added

comment:6 Changed 4 years ago by cypherpunks

Does the observed behavior mean these relays have no bandwidth estimates?

I just stumbled upon this log entry in the code:
https://gitweb.torproject.org/tor.git/tree/src/or/hibernate.c#n570

Since I was already in contact with two relay operators, I'll ask whether they can share some of the hibernate log entries.

comment:8 Changed 4 years ago by cypherpunks

I wanted to know the ratio between hibernating relays where
interval_wakeup_time == interval_start_time (identified by MM:SS = 00:00) and relays where
the start_time is not the wakeup_time.

For 2015-08-04 it is:
43 (wakeup_time != start_time) vs. 71 (wakeup_time == start_time)

comment:9 Changed 4 years ago by cypherpunks

A relay operator running 4 of these hibernating relays provided me with the output of
grep 'Configured hibernation.' /var/log/tor/log

All 4 relays say:

Aug 04 06:25:09.000 [notice] Configured hibernation.  This interval began at 2015-08-04 00:00:00; the scheduled wake-up time was 2015-08-04 00:00:00; we expect to exhaust our quota for this interval around 2015-08-05 00:00:00; the next interval begins at 2015-08-05 00:00:00 (all times local)

In reality it took them between 9 and 15 hours to exhaust the quota.

So the approach to prevent them all from starting at the same time would be to randomize the interval start time? In that case we would distribute wakeups more evenly even if relays are bad at estimating their bandwidth usage.

comment:10 Changed 4 years ago by teor

If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?
Have we confirmed that the estimate is inaccurate on a consistent basis?

Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.

I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.

So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.

When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.

Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.

(For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

comment:11 in reply to:  10 ; Changed 4 years ago by cypherpunks

Replying to teor:

> If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?

I just assumed that having accurate estimates is harder.

> Have we confirmed that the estimate is inaccurate on a consistent basis?

Depends on what accuracy you are aiming at.
Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.

> Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.
>
> I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.
>
> So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.
>
> When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.
>
> Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.
>
> (For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

I've no strong opinion on how and whether that behavior gets changed; we could also simply send an email to tor-relays asking operators to change their AccountingStart time if they wish to distribute restarts more evenly.

Last edited 4 years ago by cypherpunks

comment:12 in reply to:  11 ; Changed 4 years ago by teor

Replying to cypherpunks:

Replying to teor:

> > Have we confirmed that the estimate is inaccurate on a consistent basis?
>
> Depends on what accuracy you are aiming at.
> Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.

This is expected behaviour - most relays with hibernation set should exhaust their bandwidth part-way through the period.

Unless you're focused on the 9 that didn't?

comment:13 in reply to:  12 Changed 4 years ago by cypherpunks

> > Depends on what accuracy you are aiming at.
> > Currently, 50 out of 59 relays exhausted their quota >3 hours before the interval start time.
>
> This is expected behaviour - most relays with hibernation set should exhaust their bandwidth part-way through the period.

Ok.

comment:14 in reply to:  10 ; Changed 4 years ago by cypherpunks

Replying to teor:

> If the estimate is inaccurate, why not try to fix the estimate, at least as a first step?
> Have we confirmed that the estimate is inaccurate on a consistent basis?
>
> Given that the bandwidth authorities are currently thrashing about, that could be causing the inaccuracy at the moment.
>
> I agree that randomising the lower-order components of the period would mitigate the thundering herd wake issue, but 100/5000 relays is not really a herd.
>
> So we'd have to decide whether the unpredictable behaviour would be worthwhile, and outweigh the existing assumption of a 00:00 interval start time.
>
> When I configured hibernation, I depended on the fact that the changeover time was 00:00, as that was the time that the VPS' free quota was reset.
>
> Changing the behaviour for existing configs would be a really bad idea, if it led to people exceeding their quotas due to unpredictable interval start times, where those start times overlapped poorly with the charging intervals on the VPS.
>
> (For example, if 11:39 was chosen at random, I could have had almost two periods' worth of usage in the one charging period, if the wake time was late one day, and early the next. This would have been expensive for me.)

When using the suggested method (a random value generated by the relay once), this problem would not occur, right?

Anyway I'll just write a short email to tor-relays and we can close this ticket.

comment:15 in reply to:  14 Changed 4 years ago by arma

Replying to cypherpunks:

> When using the suggested method (random value generated by the relay once) then this problem does not occur, right?

No, I think the problem can still occur. If my provider charges me for going over my quota on a given day (midnight to midnight), but from Tor's perspective there are two intervals that overlap with that day, then I could end up spending most of my bandwidth in the later half of the first interval, and most of it in the early half of the second interval, and now I spent twice as much as I wanted to on the day.
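arma's overlap scenario can be made concrete with a toy calculation (a hypothetical illustration; the helper function and the numbers are invented, not taken from Tor):

```python
def usage_in_billing_day(spend_windows, day_start, day_end):
    """Sum bytes spent inside one billing day, given (start, end, bytes)
    windows in hours, where bytes are spent uniformly over each window."""
    total = 0.0
    for start, end, spent in spend_windows:
        # Portion of this spending window that overlaps the billing day.
        overlap = max(0.0, min(end, day_end) - max(start, day_start))
        total += spent * overlap / (end - start)
    return total

# Quota of 100 units per accounting interval; billing day runs 0:00-24:00.
# The first interval's spending lands late (0:00-4:00 of the billing day),
# the second interval's spending lands early (20:00-24:00 of the same day):
double_spend = usage_in_billing_day([(0, 4, 100), (20, 24, 100)], 0, 24)
# → 200.0: two full quotas charged against a single billing day.
```

This is the failure mode being described: even though each accounting interval stays within its quota, both intervals' usage can land inside the same provider billing day.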

In sum, letting operators know that they can change the 00:00 is a fine thought, if they do it with knowledge of what's going on inside Tor.

Seems to me that the better answer is to make Tor better at predicting how much bandwidth it will take on a day, so it can start up at more random times.

Do we think there are bugs in the current prediction algorithm, or is it just the case that relays often don't have any data from the previous day?

comment:16 Changed 4 years ago by dgoulet

Milestone: Tor: 0.2.???

comment:17 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:18 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:19 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:20 Changed 2 years ago by nickm

Resolution: wontfix
Severity: Normal
Status: needs_information → closed

per comments above, closing: this isn't actually a good idea.
