Opened 4 years ago

Closed 4 years ago

#16907 closed enhancement (fixed)

Stop returning a 500 Internal Server Error when the hourly updater has not run for 6+ hours

Reported by: karsten Owned by:
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords:
Cc: phw, isis, fmap, teor Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When the hourly updater has not run for 6 hours or more, the server responds to all queries with a 500 Internal Server Error. This seemed to be a good idea, because users would tell us when the hourly updater was broken and we could fix it. But actually, it's a really bad idea, because there are valid cases when we need to stop the hourly updater, for example, for making backups. Last night's backup took 10 hours, which is not really sane, but which is not a good reason to break the service for all clients. I also have a Nagios check in place that warns me when the hourly updater stops running, so that I don't rely on poor users to find out when things are broken and tell me.

The suggested change is to just serve whatever data is available, regardless of when it was written. Usually, it would be less than two hours old. But we might also serve 5 hours or 25 hours or 100 hours old data. It would be up to clients like Atlas and Globe to warn their users that the displayed data is outdated.

I'm not implementing this right now, because I want to give the Atlas and Globe maintainers the chance to add such warnings to their applications. I'd want to implement this in a month from now, if possible.

Child Tickets

Change History (18)

comment:1 Changed 4 years ago by virgil

Severity: Normal

As far as I know fmap's patch at:

https://github.com/fmapfmapfmap/onionoo-dev/commit/7e1e340e7b18c07e63788138d1b892239d3b4043

implements the fix proposed above. Onionoo spits out whatever stale content it has but warns the downstream software with a 504 error. But if the user wants to use the content even though she sees a 504, that's her choice.

Last edited 4 years ago by virgil (previous) (diff)

comment:2 Changed 4 years ago by virgil

Status: new → needs_review

comment:3 Changed 4 years ago by iwakeh

This seems to be a reasonable solution.
Patch looks fine.

comment:4 Changed 4 years ago by karsten

Are there any other services using 504 status codes in this way? To be honest, I'm not yet convinced that sending a 504 error and including a body is the right thing to do here. Why not just send a 200 status code, include the body, and let the application decide whether to display a warning? That's what I suggested in the description.

comment:5 Changed 4 years ago by iwakeh

I think there should be a status code differing from 200, and 504 is a good match if we view the updater as an auxiliary server (quoting w3.org):

10.5.5 504 Gateway Timeout

The server, while acting as a gateway or proxy, did not receive a timely response from the upstream 
server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed 
to access in attempting to complete the request. 

Sending 200 will surely trigger complaints about stale data, while a 504 signals up front that Onionoo is aware the data is stale.
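For illustration, the behavior described here could be sketched as follows. This is a hand-written illustration in Python, not fmap's actual Java patch; the function and variable names are made up, but the six-hour threshold is the one discussed in this ticket:

```python
from datetime import datetime, timedelta

# Threshold after which data is considered stale (per this ticket's discussion).
STALE_AFTER = timedelta(hours=6)

def build_response(last_updated: datetime, body: str, now: datetime):
    """Return (status_code, body). The body is always included; only the
    status code changes: 200 when the updater ran recently, 504 when it
    has not run within the threshold."""
    if now - last_updated > STALE_AFTER:
        return 504, body  # stale data still served, but flagged via status code
    return 200, body
```

The key point of the proposal is that the body is returned in both cases; only the status code tells the client the data is stale.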

comment:6 Changed 4 years ago by fmap

Cc: fmap added

iwakeh correctly unpacked my reasoning. Note that clients should still be notified prior to deployment of these changes; e.g., reading Roster, I see that under some conditions it will unpack any valid JSON response independent of its response code:

https://github.com/seansaito/Roster/blob/7268361a0d0ab416e804303716a11908b65e292e/app/models/onionoo_connector.py#L37-L45

I've rebased my changes against master and modified the protocol document to describe the new usage:

https://github.com/fmapfmapfmap/onionoo-dev/commit/7831f9d36f41e2ed7c4c02d8d0bbfff4f01359dd

Last edited 4 years ago by fmap (previous) (diff)

comment:7 Changed 4 years ago by virgil

Karsten: is this ticket an accept?

comment:8 Changed 4 years ago by teor

Cc: teor added

comment:9 Changed 4 years ago by karsten

Hmmmm. I need to think more about this, because I'm not yet convinced that introducing that new status code is the right solution here. However, it's a bad time of the year for sitting down and thinking about such things, so I'll have to revisit in early 2016. This ticket has been sitting here for 4 months, so what's another two weeks. Sorry for the delay.

comment:10 Changed 4 years ago by virgil

As I understand it, the desired behavior is to:

  1. Spit out the stale results.
  2. Somehow signify the results are stale.

Respectfully, a different status code seems the easiest way of doing this. The alternative is to have something like "isStale = True" somewhere in the content itself. And sure, this alternative would work just fine. But IMHO it seems easier to change the metadata for the content (i.e., the returned status code) than the content itself. However, I wholly defer to your (Karsten's) judgement.

comment:11 Changed 4 years ago by fmap

I observed another severe downtime event on the evening of Silvester (New Year's Eve):

http://hack.rs/~vi/onionoo/outage-2015-12-31.svg
sha256: ead6b6f11406cc89f12bd25a7a7217bf7d9a61433b87ae91260f5f8fa93e62f6
http://www.webcitation.org/6eFD7EJo0

Though I'm uncertain whether it was attributable to this bug or something else. In hopes of better attributing future downtime events, I've extended my monitoring script to record the difference between the current time and relays_published, the idea being that downtimes associated with this bug will be preceded by a sequence of (successful) responses with a delta climbing toward six hours:

http://hack.rs/~vi/onionoo/OnionooMonitor.hs
sha256: 34c441f8aa08a7f445cd2e8088f0285d5735f74a47ce112f4facf15d7029d202
http://www.webcitation.org/6eFDa1ta1

comment:12 Changed 4 years ago by fmap

Oh, derp, we actually want to diff Date and Last-Modified (relays_published timestamps the most recently seen consensus describing any relay(?), the Last-Modified header when the indexer last finished):

http://hack.rs/~vi/onionoo/OnionooMonitor.hs
sha256: 4e83bb7fddc05256b81a15f36f702603c55204ff9c93c389ee921d3bf4125535
http://www.webcitation.org/6eFR8RTXB
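In Python terms (fmap's actual monitor is the Haskell script linked above; this is just an illustrative re-sketch with made-up names), the corrected delta is computed from the two response headers like so:

```python
from email.utils import parsedate_to_datetime

def staleness_delta(headers):
    """Difference between the Date and Last-Modified response headers,
    i.e. how long before the server's current clock the indexer last
    finished a run."""
    date = parsedate_to_datetime(headers["Date"])
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    return date - last_modified
```

For example, a response with Date "Fri, 08 Jan 2016 08:33:08 GMT" and Last-Modified "Fri, 08 Jan 2016 06:33:08 GMT" yields a two-hour delta.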

Last edited 4 years ago by fmap (previous) (diff)

comment:13 Changed 4 years ago by karsten

I'm still investigating the December 31 issue. Apparently, there was a host issue with fetching data from CollecTor at 01:19 UTC ("No route to host"), possibly followed by a termination of the back-end process (which I don't see in the logs, though), followed by a (manual?) host restart at 15:30 UTC. I'm still looking into all that, but I think it's unrelated to this ticket, so I'll open another ticket once I know whether it's an actual issue in Onionoo.

So, after some more thinking about the patch above I'm inclined to reject it in favor of just returning a 200 status code regardless of how old the data is. Some reasons:

  • Returning a 5xx status code and including (stale) data seems very non-standard to me. I'd want to see an example of another web application using HTTP that way before implementing such a thing.
  • I fear that a non-standard use of HTTP status codes might make it more difficult to replicate Onionoo server parts in the future. For example, if we had a separate layer of web caches that get their data from two or three Onionoo front-ends, we shouldn't confuse those web caches by returning unusual HTTP status codes.
  • It should be up to the application developer of the Onionoo client to decide whether data received from the Onionoo server is still recent enough. I can imagine that different applications have different requirements, and it shouldn't be decided for them by the server, not even indicated by using different status codes. The maximum age of six hours for content was picked rather arbitrarily in the early stages of developing Onionoo, and it hasn't become less arbitrary since then. (Onionoo clients can look at the Last-Modified header to learn how old the contained data is, and they can look at the relays_published or bridges_published fields of the content to see how old the data is that was used to build the response.)

Regarding possible complaints about stale data, I think it's up to the applications to inform their users about that. I commented on #15178 for changing this in Atlas and created #16908 for Globe four months ago, but I'm not aware of any changes to those two applications. Usually, I'd say let's make this a major protocol change and give a one-month heads-up, but this behavior of sending a 500 status code after six hours is not even specified. That's why I'm inclined not to change the protocol version at all and instead just give a one-week heads-up on tor-dev@ before removing the six-hour check.

Please let me know if I overlooked something or whether my reasoning doesn't make sense, and I'll reconsider. Otherwise I'll move forward in a few days. Thanks, everyone!

comment:14 Changed 4 years ago by iwakeh

If the six hour limit was just arbitrarily picked, I also think that returning data with status 200 - no matter how old - is fine.
It should indeed be up to the client to decide what to do.
Somehow I thought there was more behind the six hour rule.

comment:15 Changed 4 years ago by teor

Sounds good, karsten. The fallback directory search script uses Onionoo, and it will work if we deliver old data with a status code of 200 (see #17887).

In fact, given that fallback directories are all stable relays, it can use data that's a day old.

comment:16 Changed 4 years ago by virgil

I propose adding a new query parameter, allow_stale=1, which will always return an HTTP 200 if there's any content, and spit out whatever is the most recent content, regardless of how stale it is.

Alternatively, if allow_stale is not specified (or set to 0), then the current behavior is kept as-is, i.e., it spits out a big nasty error if content is more than 6h stale. As a core Tor service, 6h is a long time for content to be stale. Things breaking after 6h make people upset enough to complain, which motivates us to fix it. If instead they don't care much about freshness, we simply spit out the stale content.

Last edited 4 years ago by virgil (previous) (diff)

comment:17 Changed 4 years ago by karsten

iwakeh, teor, thanks for the comments. Agreed!

virgil, if you really want to keep the current behavior in your application, maybe there's a way to achieve that: just include an If-Modified-Since header with a timestamp six hours ago, and you'll either get a 200 status code if content was modified since then or a 304 status code if it hasn't. Example (with a timestamp six minutes ago):

$ date -u
Fri Jan  8 08:32:30 UTC 2016
$ curl -H "If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT" -v https://onionoo.torproject.org/summary?limit=0
*   Trying 154.35.132.148...
* Connected to onionoo.torproject.org (154.35.132.148) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate: *.torproject.org
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /summary?limit=0 HTTP/1.1
> Host: onionoo.torproject.org
> User-Agent: curl/7.43.0
> Accept: */*
> If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT
> 
< HTTP/1.1 304 Not Modified
< Date: Fri, 08 Jan 2016 08:33:08 GMT
< Server: Apache
< Cache-Control: public, max-age=300
< 
* Connection #0 to host onionoo.torproject.org left intact

And if you want to receive content in either case, just leave out that header and decide whether to display a warning based on the server's Last-Modified header. I think we don't have to change the Onionoo protocol for anything here.
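The client-side logic karsten suggests amounts to something like this sketch (names and the six-hour tolerance are placeholders; per comment:13, each application would pick its own tolerance):

```python
from datetime import datetime, timezone, timedelta
from email.utils import parsedate_to_datetime

# Each client picks its own freshness requirement; six hours is just an example.
MAX_AGE = timedelta(hours=6)

def should_warn(last_modified_header, now=None):
    """Return True when the response data is older than this client's own
    freshness requirement, judged from the Last-Modified header."""
    if now is None:
        now = datetime.now(timezone.utc)
    last_modified = parsedate_to_datetime(last_modified_header)
    return now - last_modified > MAX_AGE
```

An application like Atlas or Globe could call this after every request and display a "data may be outdated" banner instead of relying on the server to fail.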

comment:18 Changed 4 years ago by karsten

Resolution: fixed
Status: needs_review → closed

Merged to master and deployed on the server. Closing. Thanks again, everyone!

Note: See TracTickets for help on using tickets.