Opened 9 years ago

Closed 9 years ago

#2570 closed enhancement (implemented)

Try harder to detect when Tonga's bridge snapshots are stale

Reported by: karsten
Owned by: karsten
Priority: Medium
Component: Metrics/CollecTor

Description

When sanitizing bridge snapshots from Tonga, we make sure that the tarballs are not older than a couple of hours. If they are, we log a warning, so that we know early when Tonga is broken.

We overlooked the case where Tor crashes on Tonga but the cron job keeps creating new tarballs with stale content every 30 minutes. We did not detect this because we have to infer the status publication time from the tarball name (there is no publication time in bridge network statuses). This is why Tonga's breakage went undetected for two weeks.

We should check the descriptor publication times for plausibility. If a status does not have a descriptor published in the last, say, 3 hours before the status was published, there's probably something wrong. We should print out a warning in this case, too, and investigate the problem.

We might also look at the last-modified times of files contained in the tarballs. In theory, the time difference between writing these files and writing the tarball should not be higher than 1 hour.
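A rough sketch of what that last-modified check could look like, in Python rather than in the actual sanitizer code. The function name `stale_members`, the idea of passing the tarball's write time in as a Unix timestamp, and the 1-hour default are all assumptions made for illustration; nothing like this exists in the code base yet.

```python
import tarfile
from datetime import timedelta


def stale_members(tar, tarball_written_ts, max_lag=timedelta(hours=1)):
    """Return names of regular files written more than max_lag before the tarball.

    tar is an open tarfile.TarFile; tarball_written_ts is the time the tarball
    itself was written, as seconds since the epoch (hypothetical input, e.g.
    taken from the tarball's own mtime or inferred from its name).
    """
    limit = max_lag.total_seconds()
    return [
        member.name
        for member in tar.getmembers()
        # Only regular files carry a meaningful write time for this check.
        if member.isfile() and tarball_written_ts - member.mtime > limit
    ]
```

A non-empty result would be a reason to log a warning, in the same way as the existing tarball age check.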

Change History (1)

comment:1 in reply to:  description Changed 9 years ago by karsten

Resolution: implemented
Status: new → closed

Replying to karsten:

> We should check the descriptor publication times for plausibility. If a status does not have a descriptor published in the last, say, 3 hours before the status was published, there's probably something wrong. We should print out a warning in this case, too, and investigate the problem.

I went with this approach, but with a maximum slack time of 1 hour. There shouldn't be many false positives with 1 hour instead of 3. A test with the January and February 2011 tarballs shows that we'd have learned about Tonga's tor process dying 60-90 minutes after the fact.
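For illustration only, the plausibility check boils down to something like the following Python sketch. This is not the code that was committed; the function name `is_status_stale` and its parameters are hypothetical, and it assumes the status publication time (inferred from the tarball name) and the descriptor publication times have already been parsed into `datetime` objects.

```python
from datetime import datetime, timedelta

# Maximum slack between the newest descriptor and the status, per this comment.
MAX_SLACK = timedelta(hours=1)


def is_status_stale(status_published, descriptor_published_times,
                    max_slack=MAX_SLACK):
    """Return True if no descriptor was published within max_slack
    before the status publication time."""
    if not descriptor_published_times:
        # A status without any descriptors is treated as suspicious, too
        # (an assumption of this sketch).
        return True
    newest = max(descriptor_published_times)
    return status_published - newest > max_slack
```

If this returns True for a status, the sanitizer would log a warning so that someone can look at Tonga.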

> We might also look at the last-modified times of files contained in the tarballs. In theory, the time difference between writing these files and writing the tarball should not be higher than 1 hour.

This is a fine idea if it turns out that the current fix doesn't work well enough. Closing anyway.
