"Crosscert is expired" warnings: RSA->Ed25519 identity cross-certificate apparently made in 1970?

changed milestone to %Tor: 0.3.0.x-final

added 030-backport 1970 certificate component::core tor/tor expired milestone::Tor: 0.3.0.x-final priority::medium resolution::fixed severity::normal status::closed tor-relay type::defect labels

My branch bug22466_diagnostic_030 adds some assertions to try to catch this.

jvoisin: your relay 'jafar' is one of the relays suffering from this bug. Do you know if anything unusual happened to this relay in the past weeks? Thanks!

Trac:
Cc: N/A to jvoisin

Another "fix" we could do here is to check whether the crosscert is expired when we're making new keys. We don't currently do that, since the cert is regenerated on startup and lives for 10 years ... but we could afford to shorten the lifetime if we make that change.

Also, my branch bug22466_regenerate_030 regenerates crosscerts when they are close to expiring (and lowers the lifetime to 6 months).
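To make the "regenerate when close to expiring" idea concrete, here is a minimal sketch of the kind of check involved. It is not the code from the branch: the function name, the 30-day window, and the "6 months" constant are all hypothetical values chosen for illustration.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical values for illustration only; not Tor's actual constants. */
#define SKETCH_CROSSCERT_LIFETIME (6 * 30 * 24 * 60 * 60) /* roughly 6 months */
#define SKETCH_REGEN_WINDOW       (30 * 24 * 60 * 60)     /* roughly 30 days */

/* Return true if a certificate that is valid until `valid_until` has already
 * expired, or will expire within `window` seconds of `now`. */
bool
sketch_cert_needs_regen(time_t now, time_t valid_until, time_t window)
{
  return valid_until <= now + window;
}
```

With a check like this run whenever keys are loaded or rotated, a certificate that was stamped with a bogus 1970-era time would be replaced the next time the check runs, instead of lingering for its whole lifetime.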

Trac:
Status: new to needs_review

There appear to be 3 ways an error like this could happen:

1. do_hup() is called before the first call to update_approx_time() in main(). (Since cached_approx_time is always set to time(NULL), all subsequent calls to do_hup() are safe, unless...)
2. time(NULL) returns 0 or -1 (or some small value), or
3. RAM is corrupted, most likely by writing outside the bounds of a static array stored somewhere near cached_approx_time.

It seems to me that 1. is the most likely, particularly if obtaining the log lock hangs.

Eliminating 2. requires a close reading of the OS documentation and source code (see below).

Eliminating 3. requires checking all the operations on static arrays in tor (I checked the ones in util.c, they seem fine). And I think we'd notice if we were overwriting a lot of static variables.

The man page for my system (macOS) describes time()'s return value this way:

     The time() function returns the value of time in seconds since 0 hours, 0
     minutes, 0 seconds, January 1, 1970, Coordinated Universal Time, without
     including leap seconds.  If an error occurs, time() returns the value
     (time_t)-1.
...
     The time() function may fail for any of the reasons described in
     gettimeofday(2).
...
     Neither ISO/IEC 9899:1999 (``ISO C99'') nor IEEE Std 1003.1-2001
     (``POSIX.1'') requires time() to set errno on failure; thus, it is impos-
     sible for an application to distinguish the valid time value -1 (repre-
     senting the last UTC second of 1969) from the error return value.

But gettimeofday says:

     The following error codes may be set in errno:

     [EFAULT]  An argument address referenced invalid memory.

     [EPERM]   A user other than the super-user attempted to set the time.

Neither of which applies in this case.

Linux is much more explicit:

     When tloc is NULL, the call cannot fail.

So I think we should fix case 1 (initialise approx_time in our signal handler if it is invalid, or, if time() is not signal safe, defer any actions that depend on time() to the main loop), and see if the issue keeps on happening.
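As a rough illustration of the first option (initialising the cached time when it is invalid), something along these lines would do. This is only a sketch: the sketch_* names are hypothetical stand-ins, not tor's real approx_time() machinery, and it assumes a cached value of 0 or less means "never set".

```c
#include <time.h>

/* Hypothetical stand-in for a cached clock value; 0 means "never updated". */
static time_t sketch_cached_time = 0;

/* Record the current time, as the main loop would do periodically. */
void
sketch_update_cached_time(time_t now)
{
  sketch_cached_time = now;
}

/* Return the cached time. If the cache was never set (or holds an invalid
 * value), fall back to time(NULL) and remember the result, unless time()
 * itself reports an error. */
time_t
sketch_cached_time_get(void)
{
  if (sketch_cached_time <= 0) {
    time_t now = time(NULL);
    if (now != (time_t)-1)
      sketch_cached_time = now;
  }
  return sketch_cached_time;
}
```

Note that this only guards against an unset cache. If the OS itself reports a time near the epoch, the fallback will return that bogus value.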

We should also pay off the technical debt we incurred by calling so many things from our signal handlers. Because if this is the cause, there is likely to be a cluster of race-condition bugs here, not just one.
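For reference, the usual way to keep work out of a signal handler is to have the handler set a flag and let the main loop act on it. The sketch below shows that general pattern only; the names are hypothetical and it makes no claim about how tor actually dispatches signals.

```c
#include <signal.h>
#include <stdbool.h>

/* Set from the signal handler, read and cleared from the main loop. */
static volatile sig_atomic_t sketch_got_sighup = 0;

/* Signal handler: do nothing except record that the signal arrived. */
void
sketch_on_sighup(int signo)
{
  (void)signo;
  sketch_got_sighup = 1;
}

/* Called from the main loop. Returns true if a SIGHUP arrived since the
 * last call, in which case the caller should do the reload work now.
 * The flag is cleared before returning, so a signal that arrives while
 * the caller is working is kept for the next pass. */
bool
sketch_take_pending_sighup(void)
{
  if (!sketch_got_sighup)
    return false;
  sketch_got_sighup = 0;
  return true;
}
```

Anything that depends on the clock, the log lock, or other shared state then runs from the main loop, after the usual per-iteration setup has happened.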

The other reason that time(NULL) could return 0 (or -1, or a small integer) is if tor starts on a machine which thinks the time is 1970 (call this case 4). This can happen when the clock battery fails. If the machine then updates its time using NTP or similar, tor could bootstrap, but would have an old certificate.
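To spell out the arithmetic behind this scenario: a certificate signed while the clock reads roughly 0, with the 10-year lifetime mentioned earlier, expires around the end of 1979, so it already looks expired once the clock is corrected. A small sketch, where the helper names are hypothetical and "10 years" is assumed to mean 10 x 365 days:

```c
#include <stdbool.h>
#include <time.h>

/* Assumed lifetime for illustration: 10 years of 365 days, in seconds. */
#define SKETCH_TEN_YEARS ((time_t)10 * 365 * 24 * 60 * 60)

/* Expiry time of a certificate signed at `signed_at` with the given lifetime. */
time_t
sketch_cert_expiry(time_t signed_at, time_t lifetime)
{
  return signed_at + lifetime;
}

/* True if a certificate with the given expiry is no longer valid at `now`. */
bool
sketch_cert_is_expired(time_t now, time_t expiry)
{
  return now >= expiry;
}
```

With signed_at = 0 the expiry is 315360000 seconds after the epoch, while 2017-06-01 00:00:00 UTC is 1496275200, so sketch_cert_is_expired() returns true. The same certificate signed at the correct 2017 time would remain valid for another decade.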

This seems like another possible cluster of bugs: do we renew everything else when it expires, or do we do other long-term things once at startup and expect them to be right forever?

Hi! I think case 1 above is impossible, since do_hup() is a signal handler, and the signal handlers aren't even installed until later in tor_main() than the first update_approx_time() call. I agree that case 2 is unlikely, given OS behavior.

Based on our earlier discussion with jvoisin, we found that the relay "jafar" had crashed unexpectedly. Also, it's running on an ODROID-C1. Together, those make your case 4 ("The OS believes it's 1970") seem like the likeliest explanation to me.

The machine had a brief power outage, but apart from that, nothing interesting happened to it.

root@jafar:~# uptime 
 18:02:20 up 7 days,  9:54,  2 users,  load average: 0.45, 0.45, 0.48
root@jafar:~# tor --version
Tor version 0.3.0.7 (git-4e55cb9db769b11c).
root@jafar:~# date
Thu Jun  1 18:02:35 UTC 2017
root@jafar:~#

Both branches LGTM.

Also deployed on my dirauth on the testnet.

Both branches also look good to me. I have no easy way to test the "OS thinks it's 1970" aspect of this issue, though.

Merged the mitigation branch to maint-0.3.0 and forward; merged the diagnostic branch to master only.

Trac:
Milestone: Tor: 0.3.1.x-final to Tor: 0.3.0.x-final
Resolution: N/A to fixed
Status: needs_review to closed

Replying to catalyst:

> Both branches also look good to me. I have no easy way to test the "OS thinks it's 1970" aspect of this issue, though.

For future reference, libfaketime is your friend here :-)

closed

moved to tpo/core/tor#22466 (closed)