When the hourly updater has not run for 6 hours or more, the server responds to all queries with a 500 Internal Server Error. That seemed like a good idea at the time: users would tell us when the hourly updater was broken, and we could fix it. In practice it's a really bad idea, because there are valid reasons to stop the hourly updater, for example to make backups. Last night's backup took 10 hours, which is not exactly sane, but which is also not a good reason to break the service for all clients. I also have a Nagios check in place that warns me when the hourly updater stops running, so I don't need to rely on poor users to notice breakage and tell me.
The suggested change is to simply serve whatever data is available, regardless of when it was written. Usually it would be less than two hours old, but we might also serve data that is 5, 25, or 100 hours old. It would be up to clients like Atlas and Globe to warn their users that the displayed data is outdated.
I'm not implementing this right now, because I want to give the Atlas and Globe maintainers a chance to add such warnings to their applications. I'd like to implement this a month from now, if possible.
The patch implements the fix proposed above: Onionoo spits out whatever stale content it has, but warns the downstream software with a 504 error. And if the user wants to use the content even though she sees a 504, that's her choice.
Are there any other services using 504 status codes in this way? To be honest, I'm not yet convinced that sending a 504 error and including a body is the right thing to do here. Why not just send a 200 status code, include the body, and let the application decide whether to display a warning? That's what I suggested in the description.
I think there should be a status code differing from 200, and 504 is a good match if we view the updater as the auxiliary server. Quoting w3.org:
10.5.5 504 Gateway Timeout

The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.
Sending 200 will surely trigger complaints about stale data, whereas the 504 signals up front that Onionoo is aware of the problem.
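For illustration, here is a minimal sketch of what a client would have to do under this proposed 504-with-body behavior, assuming the Python requests package; the endpoint URL is real, but the handling logic is a sketch of the proposal, not of the actual server.

```python
# Hypothetical client handling for the proposed 504-with-body behavior.
import requests

resp = requests.get("https://onionoo.torproject.org/summary?limit=4")
if resp.status_code == 200:
    data = resp.json()            # fresh data
elif resp.status_code == 504 and resp.content:
    data = resp.json()            # stale data served anyway; warn the user
    print("Warning: Onionoo data is stale.")
else:
    resp.raise_for_status()       # genuine error, no usable body
```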
iwakeh correctly unpacked my reasoning. Note that clients should still be notified prior to deployment of these changes; e.g., reading Roster's code, I see that under some conditions it will parse any valid JSON response regardless of its status code.
Hmmmm. I need to think more about this, because I'm not yet convinced that introducing that new status code is the right solution here. However, it's a bad time of the year for sitting down and thinking about such things, so I'll have to revisit this in early 2016. This ticket has been sitting here for four months, so what's another two weeks? Sorry for the delay.
Respectfully, a different status code seems the easiest way of doing this. The alternative is to put something like "isStale = True" somewhere in the content itself. And sure, this alternative would work just fine. But IMHO it seems easier to change the metadata for the content (i.e., the returned status code) than the content itself. However, I wholly defer to your (Karsten's) judgement.
Though I'm uncertain whether it was attributable to this bug or something else. In hopes of better attributing future downtime events, I've extended my monitoring script to record the difference between the current time and relays_published, the idea being that downtimes associated with this bug will be preceded by a sequence of (successful) responses with a delta growing toward six hours.
Oh, derp, we actually want to diff Date and Last-Modified instead (relays_published timestamps the most recently seen consensus describing any relay(?), while the Last-Modified header records when the indexer last finished).
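For reference, a minimal sketch of that monitoring idea, assuming the Python requests package; the six-hour threshold mirrors the server's cutoff, and everything else is illustrative rather than the actual monitoring script.

```python
# Record how far the indexer (Last-Modified) lags the server clock (Date).
from datetime import timedelta
from email.utils import parsedate_to_datetime
import requests

resp = requests.get("https://onionoo.torproject.org/summary?limit=0")
date = parsedate_to_datetime(resp.headers["Date"])
last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])

lag = date - last_modified
print("indexer lag:", lag)
if lag >= timedelta(hours=6):
    print("Warning: approaching or exceeding the six-hour cutoff")
```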
I'm still investigating the December 31 issue. Apparently, there was a host issue with fetching data from CollecTor at 01:19 UTC ("No route to host"), possibly followed by a termination of the back-end process (which I don't see in the logs, though), followed by a (manual?) host restart at 15:30 UTC. I'm still looking into all that, but I think it's unrelated to this ticket, so I'll open another ticket once I know whether it's an actual issue in Onionoo.
So, after some more thinking about the patch above I'm inclined to reject it in favor of just returning a 200 status code regardless of how old the data is. Some reasons:
Returning a 5xx status code and including (stale) data seems very non-standard to me. I'd want to see an example of another web application using HTTP that way before implementing such a thing.
I fear that a non-standard use of HTTP status codes might make it more difficult to replicate Onionoo server parts in the future. For example, if we had a separate layer of web caches that get their data from two or three Onionoo front-ends, we shouldn't confuse those web caches by returning unusual HTTP status codes.
It should be up to the application developer of the Onionoo client to decide whether data received from the Onionoo server is still recent enough. I can imagine that different applications have different requirements, and this shouldn't be decided for them by the server, not even indicated by using different status codes. The maximum age of six hours for content was picked rather arbitrarily in the early stages of developing Onionoo, and it hasn't become less arbitrary since then. (Onionoo clients can look at the Last-Modified header to learn how old the contained data is, and at the relays_published or bridges_published fields of the content to see how old the data is that was used to build the response; see the sketch after this list.)
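To make the last point concrete, here is a minimal client-side sketch of such a freshness check, assuming the Python requests package; the six-hour threshold is purely illustrative, since under this reasoning each application would pick its own maximum age.

```python
# Sketch of a per-application freshness check (threshold is illustrative).
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import requests

MAX_AGE = timedelta(hours=6)  # each application picks its own limit

resp = requests.get("https://onionoo.torproject.org/summary?limit=4")
doc = resp.json()

# When the response was built, from the HTTP header...
served = parsedate_to_datetime(resp.headers["Last-Modified"])
# ...and how old the underlying consensus data is, from the body.
published = datetime.strptime(
    doc["relays_published"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

now = datetime.now(timezone.utc)
if now - served > MAX_AGE or now - published > MAX_AGE:
    print("Warning: Onionoo data may be outdated.")
```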
Regarding possible complaints about stale data, I think it's up to the applications to inform their users about that. I commented on #15178 (moved) about changing this in Atlas and created #16908 (moved) for Globe four months ago, but I'm not aware of any changes to those two applications. Usually, I'd say let's make this a major protocol change and give a one-month heads-up, but this behavior of sending a 500 status code after six hours isn't even specified. That's why I'm inclined not to change the protocol version at all and instead just give a one-week heads-up on tor-dev@ before removing the six-hour check.
Please let me know if I overlooked something or whether my reasoning doesn't make sense, and I'll reconsider. Otherwise I'll move forward in a few days. Thanks, everyone!
If the six-hour limit was just arbitrarily picked, I also think that returning data with status 200, no matter how old, is fine.
It should indeed be up to the client to decide what to do.
Somehow I thought there was more behind the six hour rule.
Sounds good, karsten. The Fallback Directory search script uses Onionoo, and it will work if we deliver old data with a status code of 200 (see #17887 (moved)).
In fact, given that fallback directories are all stable relays, it can use data that's a day old.
I propose adding a new query parameter, allow_stale=1, which will always return an HTTP 200 if there's any content, and spit out whatever the most recent content is, regardless of how stale it is.
Alternatively, if allow_stale is not specified (or set to 0), then the current behavior is kept as-is, i.e., it spits out a big nasty error if content is more than 6 hours stale. As a core Tor service, 6 hours is a long time for content to be stale. Things breaking after 6 hours makes people upset enough to complain, and that motivates us to fix it. If instead they don't care much about freshness, we simply spit out the stale content. (A sketch of this dispatch follows below.)
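For clarity, a purely hypothetical sketch of the proposed dispatch; the real Onionoo server is written in Java and this parameter was never implemented, so all names here are made up for illustration.

```python
# Hypothetical allow_stale dispatch (never implemented in Onionoo).
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=6)

def respond(params, last_written, body):
    """Return (status, body) for a request with query parameters `params`."""
    age = datetime.now(timezone.utc) - last_written
    if params.get("allow_stale") == "1" or age < STALE_AFTER:
        return 200, body               # serve whatever content we have
    return 500, "content is stale"     # current behavior: big nasty error
```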
virgil, if you really want to keep the current behavior in your application, maybe there's a way to achieve that: just include an If-Modified-Since header with a timestamp six hours ago, and you'll either get a 200 status code if content was modified since then or a 304 status code if it hasn't. Example (with a timestamp six minutes ago):
```
$ date -u
Fri Jan  8 08:32:30 UTC 2016
$ curl -H "If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT" -v https://onionoo.torproject.org/summary?limit=0
*   Trying 154.35.132.148...
* Connected to onionoo.torproject.org (154.35.132.148) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate: *.torproject.org
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /summary?limit=0 HTTP/1.1
> Host: onionoo.torproject.org
> User-Agent: curl/7.43.0
> Accept: */*
> If-Modified-Since: Fri, 08 Jan 2016 08:26:30 GMT
>
< HTTP/1.1 304 Not Modified
< Date: Fri, 08 Jan 2016 08:33:08 GMT
< Server: Apache
< Cache-Control: public, max-age=300
<
* Connection #0 to host onionoo.torproject.org left intact
```
And if you want to receive content in either case, just leave out that header and decide whether to display a warning based on the server's Last-Modified header. I think we don't have to change the Onionoo protocol for anything here.
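The same trick from a client library, as a minimal sketch assuming the Python requests package (the six-hour window mirrors the old server-side cutoff):

```python
# Ask for content only if it was modified in the last six hours.
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
import requests

cutoff = datetime.now(timezone.utc) - timedelta(hours=6)
resp = requests.get(
    "https://onionoo.torproject.org/summary?limit=0",
    headers={"If-Modified-Since": format_datetime(cutoff, usegmt=True)},
)
if resp.status_code == 304:
    print("Nothing newer than six hours; treat as stale.")
else:
    print("Fresh enough; served:", resp.headers.get("Last-Modified"))
```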