Opened 4 weeks ago

Closed 3 weeks ago

Last modified 2 weeks ago

#22466 closed defect (fixed)

"Crosscert is expired" warnings: RSA->Ed25519 identity crosscertifice apparently made in 1970?

Reported by: nickm Owned by:
Priority: Medium Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 030-backport tor-relay certificate expired 1970
Cc: jvoisin Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

On #22460, we found one mysterious case of link handshake failure: An RSA->Ed25519 identity cross-certificate expiring at Dec 30, 1979. This could likely be caused by passing "0" as the now value in load_ed_keys() : see comment:19:ticket:22460 .

Child Tickets

Change History (13)

comment:1 Changed 4 weeks ago by nickm

My branch bug22466_diagnostic_030 adds some assertions to try to catch this.

comment:2 Changed 4 weeks ago by arma

  • Cc jvoisin added

jvoisin: your relay 'jafar' is one of the relays suffering from this bug. Do you know if anything unusual happened to this relay in the past weeks? Thanks!

comment:3 Changed 4 weeks ago by nickm

Another "fix" we could do here is to check whether the crosscert is expired when we're making new keys. We don't currently do that, since the cert is regenerated on startup and lives for 10 years ... but we could afford to shorten the lifetime if we make that change.

comment:4 Changed 4 weeks ago by nickm

And also my branch bug22466_regenerate_030 regenerates crosscerts when they are close to expiring (and lowers the lifetime to 6 months)

comment:5 Changed 4 weeks ago by nickm

  • Status changed from new to needs_review

comment:6 Changed 4 weeks ago by teor

There appear to be 3 ways an error like this could happen:

  1. do_hup() is called before the first call to update_approx_time() in main(). (Since cached_approx_time is always set to time(NULL), all subsequent calls to do_hup() are safe, unless...)
  2. time(NULL) returns 0 or -1 (or some small value), or
  3. RAM is corrupted, most likely by writing outside the bounds of a static array stored somewhere near cached_approx_time.

It seems to me that 1. is the most likely, particularly if obtaining the log lock hangs.

Eliminating 2. requires a close reading of the OS documentation and source code (see below).

Eliminating 3. requires checking all the operations on static arrays in tor (I checked the ones in util.c, they seem fine). And I think we'd notice if we were overwriting a lot of static variables.

The man page for my system (macOS) describes time()'s return value this way:

     The time() function returns the value of time in seconds since 0 hours, 0
     minutes, 0 seconds, January 1, 1970, Coordinated Universal Time, without
     including leap seconds.  If an error occurs, time() returns the value
     (time_t)-1.
...
     The time() function may fail for any of the reasons described in
     gettimeofday(2).
...
     Neither ISO/IEC 9899:1999 (``ISO C99'') nor IEEE Std 1003.1-2001
     (``POSIX.1'') requires time() to set errno on failure; thus, it is impos-
     sible for an application to distinguish the valid time value -1 (repre-
     senting the last UTC second of 1969) from the error return value.

But gettimeofday says:

     The following error codes may be set in errno:

     [EFAULT]  An argument address referenced invalid memory.

     [EPERM]   A user other than the super-user attempted to set the time.

Neither of which apply in this case.

Linux is much more explicit:

     When tloc is NULL, the call cannot fail.

So I think we should fix case 1 (initialise approx_time in our signal handler if it is invalid, or, if time() is not signal safe, defer any actions that depend on time() to the main loop), and see if the issue keeps on happening.

We should also payoff the technical debt we incurred by calling so many things from our signal handlers. Because if this is the cause, there is likely to be a cluster of race-condition bugs here, not just one.

Last edited 4 weeks ago by teor (previous) (diff)

comment:7 Changed 4 weeks ago by teor

The other reason that time(NULL) could return 0 (or -1, or a small integer) is if tor starts on a machine which thinks the time is 1970. This can happen when the clock battery fails. If the machine then updates its time using ntp or similar, tor could bootstrap, but would have an old certificate.

This seems like another possible cluster of bugs: or do we renew everything else when it expires? Or do we do other long-term things once at startup, and expect them to be right forever?

comment:8 Changed 4 weeks ago by nickm

Hi! I think case 1 above is impossible, since do_hup() is a signal handler, and the signal handlers aren't even installed until later in tor_main() than first update_approx_time() call. I agree that case 2 is unlikely, given OS behavior.

Based on our earlier discussion with jvoisin, we found that the relay "jafar" had gotten an unexpected crash. Also, it's running on an ODROID-C1. Those together make it seem likelier to me that your case 4 ("The OS believes it's 1970") is the likeliest explanation.

comment:9 Changed 4 weeks ago by jvoisin

The machine had a small power outage, but except this, nothing interesting happened to it.

root@jafar:~# uptime 
 18:02:20 up 7 days,  9:54,  2 users,  load average: 0.45, 0.45, 0.48
root@jafar:~# tor --version
Tor version 0.3.0.7 (git-4e55cb9db769b11c).
root@jafar:~# date
Thu Jun  1 18:02:35 UTC 2017
root@jafar:~# 

comment:10 Changed 4 weeks ago by dgoulet

Both branches lgtm;

Also deployed on my dirauth on the testnet.

comment:11 follow-up: Changed 4 weeks ago by catalyst

Both branches also look good to me. I have no easy way to test the "OS thinks it's 1970" aspect of this issue, though.

comment:12 Changed 3 weeks ago by nickm

  • Milestone changed from Tor: 0.3.1.x-final to Tor: 0.3.0.x-final
  • Resolution set to fixed
  • Status changed from needs_review to closed

Merged the mitigation branch to maint-0.3.0 and forward; merged the diagnostic branch to master only.

comment:13 in reply to: ↑ 11 Changed 2 weeks ago by teor

Replying to catalyst:

Both branches also look good to me. I have no easy way to test the "OS thinks it's 1970" aspect of this issue, though.

For future reference, libfaketime is your friend here :-)

Note: See TracTickets for help on using tickets.