Opened 4 years ago

Closed 2 years ago

#16667 closed defect (fixed)

BWauth / scanner 'longclaw' measurements not updating since 7/22

Reported by: starlight Owned by: aagbsn
Priority: High Milestone:
Component: Core Tor/Torflow Version:
Severity: Blocker Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

At least some of the relay bandwidth measurements taken by 'longclaw' have not updated since the first hour of 7/22/15, or for about 3.7 days. Reviewed values within +/-60 lines of Binnacle (about a dozen total), which would seem to be a random sample.

Child Tickets

Attachments (2)

unchanged_bwmeasure.txt (213.5 KB) - added by starlight 4 years ago.
changed_bwmeasure.txt (181.0 KB) - added by starlight 4 years ago.


Change History (15)

Changed 4 years ago by starlight

Attachment: unchanged_bwmeasure.txt added

Changed 4 years ago by starlight

Attachment: changed_bwmeasure.txt added

comment:1 Changed 4 years ago by starlight

Priority: normal → major

After running for a couple of days 'longclaw' BWauth is broken again.

For forty hours between

2015-07-27-09-00-00-vote-23D15D965B. . .

and

2015-07-28-13-00-00-vote-23D15D965B. . .

3416 out of 6251 relay measurements have not updated.
Fewer than half (2835) of the measurements have updated.

See files attached to ticket for details.

comment:2 Changed 4 years ago by starlight

In conjunction with this failure/outage, some bizarre measurements were taken and then frozen in place for the duration. Ticket #16675.

comment:3 Changed 4 years ago by micah

appears to have a stale bwauth process:

WARN[Sun Jul 26 17:45:07 2015]:Bandwidth scanner scanner.1 stale. Possible dead bwauthority.py. Timestamp: Wed Jul 22 07:00:16 2015
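As a rough illustration of how stale the scanner in that warning was, the two timestamps can be subtracted. This is only a sketch: the regular expression and the function name `scanner_staleness` are made up for this example and are not part of TorFlow, and the timestamp format is assumed from the single log line quoted above.

```python
import re
from datetime import datetime

# Timestamp layout assumed from the quoted line, e.g. "Sun Jul 26 17:45:07 2015".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"

# Hypothetical pattern: the time the warning was logged, then the scanner's last timestamp.
WARN_PATTERN = re.compile(r"WARN\[(?P<logged>[^\]]+)\]:.*Timestamp: (?P<last>.+)$")


def scanner_staleness(line):
    """Return the gap between the log time and the scanner's last timestamp."""
    match = WARN_PATTERN.search(line.strip())
    if match is None:
        raise ValueError("not a stale-scanner warning")
    logged = datetime.strptime(match.group("logged"), STAMP_FORMAT)
    last = datetime.strptime(match.group("last"), STAMP_FORMAT)
    return logged - last


if __name__ == "__main__":
    warning = (
        "WARN[Sun Jul 26 17:45:07 2015]:Bandwidth scanner scanner.1 stale. "
        "Possible dead bwauthority.py. Timestamp: Wed Jul 22 07:00:16 2015"
    )
    gap = scanner_staleness(warning)
    print(f"scanner.1 last updated {gap.total_seconds() / 86400:.1f} days before the warning")
```

For the line quoted above, the gap works out to roughly 4.4 days.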

comment:4 Changed 4 years ago by micah

The bw.logs on longclaw filled the disk, causing this outage a few days ago. It was resolved and the bwauth restarted, but it seems like it has not yet caught back up. It usually takes a few days before a restarted bwauth works again.

comment:5 Changed 4 years ago by starlight

You are certain it's not broken? Looking at the "changed"
relays one sees the values moving about once every four
or five hours. All the relays on the "unchanged" list
have been frozen for 42 hours. I was under the
impression that each BWauth scans all the relays
at least once per day.

comment:6 Changed 4 years ago by starlight

Note also that the "unchanged" relays were
updated between the time of the restart and the
time that the bizarre measurement was taken
and updating stopped.

I say it's broken.

comment:7 Changed 4 years ago by micah

Am I certain it is not broken? No. It is not trivial to determine whether bwauth measurement is working properly. When first starting bwauth measurements, it can take days, usually around 3 or so, before you reach the appropriate thresholds for reporting.

I'm not quite following what you are pointing out... where is this "changed" and "unchanged" list generated from, and what is it showing? You say you look at the "changed" list and see values moving about once every four or five hours, can you clarify how you see this? If I look at the file you attached I don't see how I can see values moving.

comment:8 Changed 4 years ago by starlight

The lists are attached to this ticket above and
were produced as follows:

For each of the interval start and interval end vote files

1) run an awk that combines the vanity name and fingerprint
for each relay and remembers it, then outputs one line
for each measurement with the name and measurement.

2) sort by the name

3) awk exclude relays with null measurement fields

4) join the two

5) awk out lines that have the same and different measurements
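
The steps above could be sketched in Python roughly as follows. This is not the original awk/sort/join pipeline, only an equivalent illustration. It assumes each relay entry in a vote has an "r" line carrying the nickname and identity, followed by a "w" line that may carry a Measured= value. The function names `parse_vote` and `compare_votes` are made up for this example.

```python
def parse_vote(text):
    """Map 'nickname identity' to the Measured value, skipping relays without one."""
    measured = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r" and len(fields) >= 3:
            # Steps 1: combine the vanity name and identity into one key.
            current = f"{fields[1]} {fields[2]}"
        elif fields[0] == "w" and current is not None:
            for item in fields[1:]:
                key, _, value = item.partition("=")
                # Step 3: relays with no Measured= field are simply never recorded.
                if key == "Measured" and value.isdigit():
                    measured[current] = int(value)
    return measured


def compare_votes(start_text, end_text):
    """Return (changed, unchanged) relay keys present with a measurement in both votes."""
    start = parse_vote(start_text)
    end = parse_vote(end_text)
    # Steps 2 and 4: sort by name and join the two votes on the relay key.
    common = sorted(start.keys() & end.keys())
    # Step 5: split into relays whose measurement differs and relays whose does not.
    changed = [key for key in common if start[key] != end[key]]
    unchanged = [key for key in common if start[key] == end[key]]
    return changed, unchanged
```

Calling `compare_votes` on the text of the interval-start and interval-end vote files would give two lists comparable to the "changed" and "unchanged" attachments, assuming the vote layout described above.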

Then spot check a few "unchanged" relays to be
certain that they did not change during the 40 hours
and then change back, even though 55% of relays doing
so is insanely improbable.

Look at the pre-freeze votes for a few unchanged
relays and observe that all of them were updating
about every four or five hours, as is typical.

Good enough?

comment:9 Changed 4 years ago by starlight

Perhaps you are unaware of

https://collector.torproject.org/recent/relay-descriptors/votes/

?

This is where the data for the above was obtained;
it is conclusive.

comment:10 Changed 4 years ago by micah

This bwauth was running what is affectionately known as "the old code". Rather than trying to debug this and figure out what is going on, I opted to update to the new code. This means that there will be no useful measurements coming from longclaw for 2-3 days at minimum. So let's let this sit for a few days and then look again to see if things have improved.

comment:11 Changed 4 years ago by starlight

excellent!

I have noticed the other BWauths go offline and
return relatively sane and rational. Did not realize
one last one remained to be updated.

[tor-relays] BWauth's / TorFlow seem better
https://lists.torproject.org/pipermail/tor-relays/2015-July/007422.html

comment:12 Changed 4 years ago by starlight

When the other BWauths restarted from scratch,
they dropped out of (or were manually removed from)
the BW consensus (as seen on the consensus-health
page) and the Measured= entries were
cleared. Not seeing this for 'longclaw'
at present.

Is the work yet to be done (past tense was applied
in comment 10), does it happen later in the restart
logic, or is this a manual operation that's been
overlooked?

comment:13 Changed 2 years ago by teor

Resolution: fixed
Severity: Blocker
Status: new → closed

This appears to be fixed.
