Opened 7 months ago

Closed 5 months ago

#24628 closed defect (fixed)

bwauth= bug in consensus health

Reported by: tom Owned by: tom
Priority: Medium Milestone:
Component: Metrics/Consensus Health Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The values we show seem to be off from the correct value by a small, maybe-consistent amount, and that makes the bwauth match not work.

Change History (5)

comment:1 Changed 7 months ago by tom

I don't see this anymore. I added https://gitweb.torproject.org/depictor.git/commit/?id=4d0e8a74c88b9c43eed0d12e79b1c3ffcf3d6faf to make it easy to find if it happens again.

I updated stem, maybe that was the problem? My early review of the code made me think it might be.

comment:2 Changed 6 months ago by tom

#24877 was a dupe and has a specific time to investigate.

comment:3 Changed 6 months ago by tom

Started digging into this using the example in #24877. Really weird.

      bastet  maatuska  moria  gabelmoo  fara
Pg    4800    6080      5100   5270      4160
21    4770    6080      5110   5240      4160
20    4760    6070      5070   5230      4160
19    4750    6060      5060   5190      4160
18    4750    6050      5050   5190      4150

Pg is the page value (for 2018-01-11-20-00); the numbered rows are the vote values (from collector) for the surrounding hours. The consensus document says 5070, the page says 5070 also. So the votes are wrong.

I parsed the moria vote with stem, and it gave me 5070.

I searched for any vote made by moria in January that had a Measured value of 5100. I got the following:

  • ./11/2018-01-11-23-00-00-vote-D586D18309DED4CD6D57C18FDB97EFA96D330566-A15ABFB2A6F993F16E8645C9C3AF16E13EA7934A-r ForEdSnowden AaHRX5/GftBfopZv+IMGftUCzwg F+hQq2QhOd7rM7N7z+K+cBw/HCA 2018-01-11 14:52:14 51.15.133.16 9001 0
  • ./24/2018-01-24-17-00-00-vote-D586D18309DED4CD6D57C18FDB97EFA96D330566-5391D5738EF960227FDBA4776D914150BFEA1EDF-r ForEdSnowden AaHRX5/GftBfopZv+IMGftUCzwg lmIBHy2xp+nAsZ9PuZMBPDLm660 2018-01-24 14:58:17 51.15.133.16 9001 0
  • ./24/2018-01-24-16-00-00-vote-D586D18309DED4CD6D57C18FDB97EFA96D330566-50F9CBB890A5BF2E4919DDB8FB5577FE48C34517-r ForEdSnowden AaHRX5/GftBfopZv+IMGftUCzwg lmIBHy2xp+nAsZ9PuZMBPDLm660 2018-01-24 14:58:17 51.15.133.16 9001 0

Okay so 23:00 is nearby.
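A rough sketch of how a search like that could be scripted, in case it needs repeating. This is not the command I actually ran; the function names and the `votes/2018-01` directory are hypothetical, and it assumes the collector vote files are plain text with `r <nickname> ...` lines followed by a `w ... Measured=N` line for the same relay.

```python
import re
from pathlib import Path


def find_measured(lines, nickname, measured):
    """Return True if `nickname` has a w line with Measured=`measured`."""
    current = None
    for line in lines:
        if line.startswith("r "):
            # "r <nickname> <identity> <digest> <date> <time> <ip> ..."
            parts = line.split()
            current = parts[1] if len(parts) > 1 else None
        elif line.startswith("w ") and current == nickname:
            match = re.search(r"\bMeasured=(\d+)\b", line)
            if match and int(match.group(1)) == measured:
                return True
    return False


def matching_votes(root, authority_fp, nickname, measured):
    """Yield vote files under `root` from one authority that match."""
    # Assumes file names look like <timestamp>-vote-<authority_fp>-<digest>.
    for path in sorted(Path(root).rglob("*-vote-%s-*" % authority_fp)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            if find_measured(fh, nickname, measured):
                yield path


if __name__ == "__main__":
    # Hypothetical directory holding the January votes.
    moria = "D586D18309DED4CD6D57C18FDB97EFA96D330566"
    for hit in matching_votes("votes/2018-01", moria, "ForEdSnowden", 5100):
        print(hit)
```

The relay nickname and authority fingerprint are the ones from the results above; everything else is illustrative.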

If I expand the table:

      bastet  maatuska  moria  gabelmoo  fara
Pg    4800    6080      5100   5270      4160
0     4800    6110      5130   5270      4160
23    4800    6110      5100   5270      4160
22    4770    6110      5110   5250      4160
21    4770    6080      5110   5240      4160
20    4760    6070      5070   5230      4160
19    4750    6060      5060   5190      4160
18    4750    6050      5050   5190      4150

I checked henryi's timezone, and it's in UTC. The filename is written out based on the consensus's time in the file.

Then I ran ps and found 3 processes running: one that had been running for 30 minutes, one for 2.5 hours, and one for 3.5 hours.

Things are starting to come together. Maybe.

I already know the script sometimes dies due to out of memory errors. Now I think I see why. I call subprocess.call at the end as a convenience. This invokes fork, doubling the amount of memory I've used. (And it's a lot.)

I'm going to replace those calls and hopefully it will resolve ALL of the weird-ass errors we've been seeing with consensus-health.
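A minimal sketch of what replacing such a call could look like, assuming the subprocess.call was a shell convenience like copying the finished page into place. That is an assumption for illustration only; `publish` and the file names are hypothetical, not the actual depictor code. The point is that the work stays in the current process, so nothing has to fork while the big data structures are still in memory.

```python
import os
import shutil
import tempfile


def publish(src, dst):
    """Copy src over dst in-process instead of shelling out to cp/mv."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Write to a temporary file in the same directory, then rename it,
    # so readers never see a half-written page.
    fd, tmp = tempfile.mkstemp(dir=dst_dir)
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


if __name__ == "__main__":
    # Hypothetical file names.
    publish("consensus-health.html.new", "consensus-health.html")
```

If a child process really is needed, spawning it early (before the large data is loaded) would be another way to avoid forking a huge process.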

comment:4 Changed 5 months ago by tom

There were two causes of this:

1) It was taking forever to download data, resulting in mismatched votes.

2) Sometimes a vote cannot be retrieved from an authority, but the other dirauths got it. So it's in the consensus, but I don't have it, and all I can do is flail. https://consensus-health.torproject.org/consensus-health-2018-02-02-03-00.html is an example.

1 is fixed. 2 is handled.... well I guess it could be handled better.

If a vote is missing, I could print a footnote and instead of saying "bwauth=" say "bwauth=<sup>1</sup>".

So I'll leave this open for that purpose, and I'll get to it eventually.
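A sketch of the footnote idea, to pin down what I mean. `bwauth_label` and its parameters are hypothetical names, not existing depictor code; it assumes the caller already knows whether the vote could be fetched.

```python
def bwauth_label(matched_bwauth, vote_missing, footnote=1):
    """Build the bwauth= text for one relay row.

    matched_bwauth: name of the matching bwauth, or "" if none matched.
    vote_missing:   True if a vote could not be retrieved this hour.
    """
    if not matched_bwauth and vote_missing:
        # Point at a footnote explaining that a vote is missing.
        return "bwauth=<sup>%d</sup>" % footnote
    return "bwauth=%s" % matched_bwauth


print(bwauth_label("moria1", vote_missing=False))  # bwauth=moria1
print(bwauth_label("", vote_missing=True))         # bwauth=<sup>1</sup>
```

The footnote text itself (for example, which authority's vote is missing) would be printed once at the bottom of the page.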
