Opened 4 years ago

Closed 4 years ago

#16907 closed enhancement (fixed)

Stop returning a 500 Internal Server Error when the hourly updater has not run for 6+ hours

Reported by: karsten Owned by:
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords:
Cc: phw, isis, fmap, teor Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When the hourly updater has not run for 6 hours or more, the server responds to all queries with a 500 Internal Server Error. This seemed to be a good idea, because users would tell us when the hourly updater was broken and we could fix it. But actually, it's a really bad idea, because there are valid cases when we need to stop the hourly updater, for example, for making backups. Last night's backup took 10 hours, which is not really sane, but which is not a good reason to break the service for all clients. I also have a Nagios check in place that warns me when the hourly updater stops running, so that I don't rely on poor users to find out when things are broken and tell me.

The suggested change is to just serve whatever data is available, regardless of when it was written. Usually, it would be less than two hours old. But we might also serve 5 hours or 25 hours or 100 hours old data. It would be up to clients like Atlas and Globe to warn their users that the displayed data is outdated.

I'm not implementing this right now, because I want to give the Atlas and Globe maintainers the chance to add such warnings to their applications. I'd want to implement this in a month from now, if possible.

Child Tickets

Change History (18)

comment:1 Changed 4 years ago by virgil

Severity: Normal

As far as I know fmap's patch at:

https://github.com/fmapfmapfmap/onionoo-dev/commit/7e1e340e7b18c07e63788138d1b892239d3b4043

implements the fix proposed above. Onionoo spits out whatever stale content it has but warns the downstream software with a 504 error. But if the user wants to use the content even though she sees a 504, that's her choice.

Last edited 4 years ago by virgil (previous) (diff)

comment:2 Changed 4 years ago by virgil

Status: new → needs_review

comment:3 Changed 4 years ago by iwakeh

This seems to be a reasonable solution.
Patch looks fine.

comment:4 Changed 4 years ago by karsten

Are there any other services using 504 status codes in this way? To be honest, I'm not yet convinced that sending a 504 error and including a body is the right thing to do here. Why not just send a 200 status code, include the body, and let the application decide whether to display a warning? That's what I suggested in the description.

comment:5 Changed 4 years ago by iwakeh

I think there should be a status code differing from 200, and 504 is a good match if we view the updater as an auxiliary server (quoting w3.org):

10.5.5 504 Gateway Timeout

The server, while acting as a gateway or proxy, did not receive a timely response from the upstream 
server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed 
to access in attempting to complete the request. 

Sending 200 will surely trigger complaints about stale data, while a 504 signals up front that Onionoo is aware the data is stale.
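For illustration, the behavior described here could be sketched as follows. This is a hand-written illustration in Python, not fmap's actual Java patch; the function and variable names are made up, but the six-hour threshold is the one discussed in this ticket:

```python
from datetime import datetime, timedelta

# Threshold after which data is considered stale (per this ticket's discussion).
STALE_AFTER = timedelta(hours=6)

def build_response(last_updated: datetime, body: str, now: datetime):
    """Return (status_code, body). The body is always included; only the
    status code changes: 200 when the updater ran recently, 504 when it
    has not run within the threshold."""
    if now - last_updated > STALE_AFTER:
        return 504, body  # stale data still served, but flagged via status code
    return 200, body
```

The key point of the proposal is that the body is returned in both cases; only the status code tells the client the data is stale.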

comment:6 Changed 4 years ago by fmap

Cc: fmap added

iwakeh correctly unpacked my reasoning. Note that clients should still be notified prior to deployment of these changes; e.g., reading Roster, I see that under some conditions it will unpack any valid JSON response independent of its response code:

https://github.com/seansaito/Roster/blob/7268361a0d0ab416e804303716a11908b65e292e/app/models/onionoo_connector.py#L37-L45

I've rebased my changes against master and modified the protocol document to describe the new usage:

https://github.com/fmapfmapfmap/onionoo-dev/commit/7831f9d36f41e2ed7c4c02d8d0bbfff4f01359dd

Last edited 4 years ago by fmap (previous) (diff)

comment:7 Changed 4 years ago by virgil

Karsten: is this ticket an accept?

comment:8 Changed 4 years ago by teor

Cc: teor added

comment:9 Changed 4 years ago by karsten

Hmmmm. I need to think more about this, because I'm not yet convinced that introducing that new status code is the right solution here. However, it's a bad time of the year for sitting down and thinking about such things, so I'll have to revisit in early 2016. This ticket has been sitting here for 4 months, so what's another two weeks. Sorry for the delay.

comment:10 Changed 4 years ago by virgil

As I understand it, the desired behavior is to:

  1. Spit out the stale results.
  2. Somehow signify the results are stale.

Respectfully, a different status code seems the easiest way of doing this. The alternative is to have something like "isStale = True" somewhere in the content itself. And sure, this alternative would work just fine. But IMHO it seems easier to change the metadata for the content (i.e., the returned status code) than the content itself. However, I wholly defer to your (Karsten's) judgement.

comment:11 Changed 4 years ago by fmap

I observed another severe downtime event on the evening of Silvester (New Year's Eve):

http://hack.rs/~vi/onionoo/outage-2015-12-31.svg
sha256: ead6b6f11406cc89f12bd25a7a7217bf7d9a61433b87ae91260f5f8fa93e62f6
http://www.webcitation.org/6eFD7EJo0

Though I'm uncertain whether it was attributable to this bug or something else. In hopes of better attributing future downtime events, I've extended my monitoring script to record the difference between the current time and relays_published, the idea being that downtimes associated with this bug will be preceded by a sequence of (successful) responses with a delta climbing toward six hours:

http://hack.rs/~vi/onionoo/OnionooMonitor.hs
sha256: 34c441f8aa08a7f445cd2e8088f0285d5735f74a47ce112f4facf15d7029d202
http://www.webcitation.org/6eFDa1ta1

comment:12 Changed 4 years ago by fmap

Oh, derp, we actually want to diff Date and Last-Modified (relays_published timestamps the most recently seen consensus describing any relay(?), the Last-Modified header when the indexer last finished):

http://hack.rs/~vi/onionoo/OnionooMonitor.hs
sha256: 4e83bb7fddc05256b81a15f36f702603c55204ff9c93c389ee921d3bf4125535
http://www.webcitation.org/6eFR8RTXB
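In Python terms (fmap's actual monitor is the Haskell script linked above; this is just an illustrative re-sketch with made-up names), the corrected delta is computed from the two response headers like so:

```python
from email.utils import parsedate_to_datetime

def staleness_delta(headers):
    """Difference between the Date and Last-Modified response headers,
    i.e. how long before the server's current clock the indexer last
    finished a run."""
    date = parsedate_to_datetime(headers["Date"])
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    return date - last_modified
```

For example, a response with Date "Fri, 08 Jan 2016 08:33:08 GMT" and Last-Modified "Fri, 08 Jan 2016 06:33:08 GMT" yields a two-hour delta.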

Last edited 4 years ago by fmap (previous) (diff)

comment:13 Changed 4 years ago by karsten

I'm still investigating the December 31 issue. Apparently, there was a host issue with fetching data from CollecTor at 01:19 UTC ("No route to host"), possibly followed by a termination of the back-end process (which I don't see in the logs, though), followed by a (manual?) host restart at 15:30 UTC. I'm still looking into all that, but I think it's unrelated to this ticket, so I'll open another ticket once I know whether it's an actual issue in Onionoo.

So, after some more thinking about the patch above I'm inclined to reject it in favor of just returning a 200 status code regardless of how old the data is. Some reasons:

  • Returning a 5xx status code and including (stale) data seems very non-standard to me. I'd want to see an example of another web application using HTTP that way before implementing such a thing.
  • I fear that a non-standard use of HTTP status codes might make it more difficult to replicate Onionoo server parts in the future. For example, if we had a separate layer of web caches that get their data from two or three Onionoo front-ends, we shouldn't confuse those web caches by returning unusual HTTP status codes.
  • It should be up to the application developer of the Onionoo client to decide whether data received from the Onionoo server is still recent enough. I can imagine that different applications have different requirements, and it shouldn't be decided for them by the server, not even indicated by using different status codes. The maximum age of six hours for content was picked rather arbitrarily in the early stages of developing Onionoo, and it hasn't become less arbitrary since then. (Onionoo clients can look at the Last-Modified header to learn how old the contained data is, and they can look at the relays_published or bridges_published fields of the content to see how old the data is that was used to build the response.)

Regarding possible complaints about stale data, I think it's up to the applications to inform their users about that. I commented on #15178 for changing this in Atlas and created #16908 for Globe four months ago, but I'm not aware of any changes to those two applications. Usually, I'd say let's make this a major protocol change and give a one-month heads-up, but this behavior of sending a 500 status code after six hours is not even specified. That's why I'm inclined not to change the protocol version at all and instead just give a one-week heads-up on tor-dev@ before removing the six-hour check.

Please let me know if I overlooked something or whether my reasoning doesn't make sense, and I'll reconsider. Otherwise I'll move forward in a few days. Thanks, everyone!

comment:14 Changed 4 years ago by iwakeh

If the six hour limit was just arbitrarily picked, I also think that returning data with status 200 - no matter how old - is fine.
It should indeed be up to the client to decide what to do.
Somehow I thought there was more behind the six hour rule.

comment:15 Changed 4 years ago by teor

Sounds good, karsten. The fallback directory search script uses Onionoo, and it will work if we deliver old data with a status code of 200 (see #17887).

In fact, given that fallback directories are all stable relays, it can use data that's a day old.

comment:16 Changed 4 years ago by virgil

I propose adding a new query parameter, allow_stale=1, which will always return an HTTP 200 if there's any content, and spit out whatever is the most recent content, regardless of how stale it is.

Alternatively, if allow_stale is not specified (or set to 0), then the current behavior is kept as-is, i.e., it spits out a big nasty error if content is more than 6h stale. As a core Tor service, 6h is a long time for content to be stale. Things breaking after 6h make people upset enough to complain, which motivates us to fix it. If instead they don't care much about freshness, we simply spit out the stale content.

Last edited 4 years ago by virgil (previous) (diff)

comment:17 Changed 4 years ago by karsten

iwakeh, teor, thanks for the comments. Agreed!

virgil, if you really want to keep the current behavior in your application, maybe there's a way to achieve that: just include an If-Modified-Since header with a timestamp six hours ago, and you'll either get a 200 status code if content was modified since then or a 304 status code if it hasn't. Example (with a timestamp six minutes ago):

$ date -u
Fri Jan  8 08:32:30 UTC 2016
$ curl -H "If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT" -v https://onionoo.torproject.org/summary?limit=0
*   Trying 154.35.132.148...
* Connected to onionoo.torproject.org (154.35.132.148) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate: *.torproject.org
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /summary?limit=0 HTTP/1.1
> Host: onionoo.torproject.org
> User-Agent: curl/7.43.0
> Accept: */*
> If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT
> 
< HTTP/1.1 304 Not Modified
< Date: Fri, 08 Jan 2016 08:33:08 GMT
< Server: Apache
< Cache-Control: public, max-age=300
< 
* Connection #0 to host onionoo.torproject.org left intact

And if you want to receive content in either case, just leave out that header and decide whether to display a warning based on the server's Last-Modified header. I think we don't have to change the Onionoo protocol for anything here.
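The client-side logic karsten suggests amounts to something like this sketch (names and the six-hour tolerance are placeholders; per comment:13, each application would pick its own tolerance):

```python
from datetime import datetime, timezone, timedelta
from email.utils import parsedate_to_datetime

# Each client picks its own freshness requirement; six hours is just an example.
MAX_AGE = timedelta(hours=6)

def should_warn(last_modified_header, now=None):
    """Return True when the response data is older than this client's own
    freshness requirement, judged from the Last-Modified header."""
    if now is None:
        now = datetime.now(timezone.utc)
    last_modified = parsedate_to_datetime(last_modified_header)
    return now - last_modified > MAX_AGE
```

An application like Atlas or Globe could call this after every request and display a "data may be outdated" banner instead of relying on the server to fail.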

comment:18 Changed 4 years ago by karsten

Resolution: fixed
Status: needs_review → closed

Merged to master and deployed on the server. Closing. Thanks again, everyone!

Note: See TracTickets for help on using tickets.