Opened 2 years ago

Closed 17 months ago

#25144 closed defect (duplicate)

op-us onionperf instance spends much of its time at 100% timeout failure: why?

Reported by: arma Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Onionperf Version:
Severity: Normal Keywords:
Cc: hiro Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


Now that we've fixed #24996, we can see that there's an OnionPerf instance that spent much of January just totally failing.

What was it trying to do, and was its task actually hard enough to merit such failure?

The op-nl server didn't seem to have this level of difficulty.

op-hk did have problems, but in different weeks than op-us.

The results look similar when I click 'public', so I guess it isn't just the onion destination.

Child Tickets

Change History (8)

comment:1 Changed 2 years ago by karsten

Cc: hiro added

hiro, can you take a look at the op-* instances? Context: At one of the last team meetings we said we'd do that once #24996 is fixed, and indeed, it's fixed now. Happy to discuss this more at Thursday's meeting. Thanks in advance!

comment:2 Changed 2 years ago by cypherpunks

op-us is on a streak of timeouts again, starting April 19, 2018 and continuing through the latest data point: a running total of 21 days and counting.

op-hk and op-nl are experiencing spikes at the same time but not constantly like op-us is.

Days before that started, the number of bridges fell quickly, on April 15 and 16.

The loss of bridges might be explained by Google stopping domain fronting on April 13 (#25804) and Amazon announcing the same the week of April 23. The working theory is that their decisions resulted from a Russian court banning the Telegram messenger on April 13, and the collateral damage from the ensuing cat-and-mouse game between Telegram's servers and Roskomnadzor, the Russian internet censorship body, which blocked entire IP subnets.

However, the sudden loss of bridges might not be related to the timeouts of op-us starting April 19. The wiki:doc/MetricsTimeline hasn't been updated with clues either.

According to #21653 and wiki:org/operations/services/onionperf, op-us is in Washington, DC, hosted at Allied Telecom Group, LLC (Radio Free Asia), not at Google, Amazon, or Azure.

comment:3 Changed 2 years ago by irl

hiro has restarted this OnionPerf instance, which hopefully should set it going again.

comment:4 Changed 2 years ago by arma

What went wrong with it? Was it off? Or does it have a bug where it stops working sometimes? Should we set up some sort of better monitoring?

comment:5 Changed 2 years ago by irl

It appeared to be running normally, so I suspect it has a bug where it stops working. The obvious explanation, that the tgen server went away and there was nothing to talk to, didn't seem to be the answer.

Improving monitoring of metrics services is on the roadmap. As a temporary measure I can keep an eye on these CSV files to ensure that the service looks like it is running. A medium-term goal may be to write a Nagios plugin (relatively simple task) to check the latest CSV files for 100% failure rates and alert on that. A longer-term goal would be to both track down the cause of the failure, and to instrument OnionPerf to be able to notify of such failures before the 24h cycle is complete.
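The proposed Nagios check could be sketched roughly as below. This is a hypothetical illustration, not the actual plugin: the CSV column name (`error_code`) and the failure thresholds are assumptions, since the real OnionPerf CSV schema isn't specified in this ticket.

```python
#!/usr/bin/env python3
"""Hypothetical Nagios-style check for an OnionPerf results CSV.

Alerts when the latest measurements show a 100% failure rate.
The 'error_code' column name is an assumed placeholder for however
the real CSV marks a failed/timed-out measurement.
"""
import csv
import io

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_failure_rate(csv_text, error_column="error_code"):
    """Return (nagios_status, message) for one batch of measurements.

    A row counts as failed when its error column is non-empty.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        # No data at all is itself an alert condition.
        return CRITICAL, "no measurements found"
    failed = sum(1 for r in rows if (r.get(error_column) or "").strip())
    rate = failed / len(rows)
    msg = f"{failed}/{len(rows)} measurements failed ({rate:.0%})"
    if rate >= 1.0:
        return CRITICAL, msg
    if rate >= 0.5:
        return WARNING, msg
    return OK, msg
```

A wrapper script would fetch the instance's latest CSV, call `check_failure_rate`, print the message, and exit with the returned status so Nagios can alert on it.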

comment:6 Changed 2 years ago by irl

Component: Metrics/Statistics → Metrics/Onionperf
Status: new → assigned

This is about the Onionperf deployment, not the graphs produced from the data.

comment:7 Changed 2 years ago by hiro

The medium-term goal, i.e. writing a Nagios plugin, is on my list. I am also thinking of reviewing the deployment to see if there is something that could be improved.

Last edited 2 years ago by hiro (previous) (diff)

comment:8 Changed 17 months ago by irl

Resolution: duplicate
Status: assigned → closed

Duplicate of #28271.
