On #22460 (moved), we found one mysterious case of link handshake failure: An RSA->Ed25519 identity cross-certificate expiring at Dec 30, 1979. This could likely be caused by passing "0" as the now value in load_ed_keys() : see comment:19:ticket:22460 .
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
jvoisin: your relay 'jafar' is one of the relays suffering from this bug. Do you know if anything unusual happened to this relay in the past weeks? Thanks!
Another "fix" we could do here is to check whether the crosscert is expired when we're making new keys. We don't currently do that, since the cert is regenerated on startup and lives for 10 years ... but we could afford to shorten the lifetime if we make that change.
There appear to be 3 ways an error like this could happen:
do_hup() is called before the first call to update_approx_time() in main().
(Since cached_approx_time is always set to time(NULL), all subsequent calls to do_hup() are safe, unless...)
time(NULL) returns 0 or -1 (or some small value), or
RAM is corrupted, most likely by writing outside the bounds of a static array stored somewhere near cached_approx_time.
It seems to me that 1. is the most likely, particularly if obtaining the log lock hangs.
Eliminating 2. requires a close reading of the OS documentation and source code (see below).
Eliminating 3. requires checking all the operations on static arrays in tor (I checked the ones in util.c, they seem fine). And I think we'd notice if we were overwriting a lot of static variables.
The man page for my system (macOS) describes time()'s return value this way:
The time() function returns the value of time in seconds since 0 hours, 0 minutes, 0 seconds, January 1, 1970, Coordinated Universal Time, without including leap seconds. If an error occurs, time() returns the value (time_t)-1.... The time() function may fail for any of the reasons described in gettimeofday(2).... Neither ISO/IEC 9899:1999 (``ISO C99'') nor IEEE Std 1003.1-2001 (``POSIX.1'') requires time() to set errno on failure; thus, it is impos- sible for an application to distinguish the valid time value -1 (repre- senting the last UTC second of 1969) from the error return value.
But gettimeofday says:
The following error codes may be set in errno: [EFAULT] An argument address referenced invalid memory. [EPERM] A user other than the super-user attempted to set the time.
Neither of which apply in this case.
Linux is much more explicit:
When tloc is NULL, the call cannot fail.
So I think we should fix case 1 (initialise approx_time in our signal handler if it is invalid, or, if time() is not signal safe, defer any actions that depend on time() to the main loop), and see if the issue keeps on happening.
We should also payoff the technical debt we incurred by calling so many things from our signal handlers. Because if this is the cause, there is likely to be a cluster of race-condition bugs here, not just one.
The other reason that time(NULL) could return 0 (or -1, or a small integer) is if tor starts on a machine which thinks the time is 1970. This can happen when the clock battery fails. If the machine then updates its time using ntp or similar, tor could bootstrap, but would have an old certificate.
This seems like another possible cluster of bugs: or do we renew everything else when it expires? Or do we do other long-term things once at startup, and expect them to be right forever?
Hi! I think case 1 above is impossible, since do_hup() is a signal handler, and the signal handlers aren't even installed until later in tor_main() than first update_approx_time() call. I agree that case 2 is unlikely, given OS behavior.
Based on our earlier discussion with jvoisin, we found that the relay "jafar" had gotten an unexpected crash. Also, it's running on an ODROID-C1. Those together make it seem likelier to me that your case 4 ("The OS believes it's 1970") is the likeliest explanation.
The machine had a small power outage, but except this, nothing interesting happened to it.
root@jafar:~# uptime 18:02:20 up 7 days, 9:54, 2 users, load average: 0.45, 0.45, 0.48root@jafar:~# tor --versionTor version 0.3.0.7 (git-4e55cb9db769b11c).root@jafar:~# dateThu Jun 1 18:02:35 UTC 2017root@jafar:~#