Opened 9 months ago

Closed 8 weeks ago

#9299 closed enhancement (implemented)

Dump stack traces on assertion, crash, or general trouble

Reported by: nickm Owned by:
Priority: normal Milestone: Tor: 0.2.5.x-final
Component: Tor Version:
Keywords: tor-relay debugging Cc:
Actual Points: Parent ID: #11046
Points:

Description (last modified by nickm)

It's so easy to dump stack traces these days!

I have a "backtrace" branch right now that an dump stack traces on assertion failures. It works on glibc/ELF, and on OSX. We should expand it to work on Windows too, and BSD if we can.

Other fixes to make before it's ready:

  • It should be able to log a stack trace too.
    • It should log the stack trace on an assertion.
  • There should be an option to tell it not to log to the stack_dumps file, perhaps.
  • Perhaps the logfile should be pid-controlled?
  • It should support Windows.
  • It should handle deadly signals (SEGV, etc) as well.
  • It should indicate to the user somehow (if it can) that stuff might be saved to the stack_dumps file.
  • It should have tests.

Child Tickets

Change History (18)

comment:1 Changed 9 months ago by nickm

  • Description modified (diff)

comment:2 Changed 9 months ago by nickm

Updated my branch. Fun stuff! I cut the gordian knot of "where to store traces on crash" by saying "just put them in the log fd and on stderr."

For FreeBSD (and others?) we should look at linking libexecinfo, which is supposed to provide the same information as the backtrace() functions in <execinfo.h> on Linux/OSX.

For windows, we will need to look into StackWalk64 for stack dumping. I believe that the function to catch segfaults and whatnot is SetUnhandledExceptionFilter, but I'm less sure about that.

comment:3 Changed 9 months ago by nickm

  • Description modified (diff)

All that's left to do now is Windows support.

comment:4 Changed 9 months ago by nickm

  • Status changed from new to needs_review

Actually, I'm going to suggest we review and merge now, and leave Windows support for a bit later, since that's a whole new project, and apparently a big one, since the windows API is not trivial.

comment:5 follow-up: Changed 6 months ago by lunar

Ok, here's a review. I hope I'm not wasting anyone's time.

Interesting how the new format_dec_number_sigsafe() and format_hex_number_sigsafe() are slightly different given they do almost the same thing.

+++ b/configure.ac
         crt_externs.h \
+       execinfo.h \
         grp.h \

Small whitespace issue.

+    tor_assert(0 * 0 == 1);

Leaving that in the history is probably bad for future bisection, no?

tor_log_err_sigsafe_helper() might be better named tor_log_err_sigsafe_write()

+  if (log_time_granularity >= 2000) {
+    int g = log_time_granularity / 1000;

A comment for the magic numbers? You previously had 15 minutes. Does this mean that the
stack_dump file will have second granularity by default? Is that a problem?

+tor_log_get_sigsafe_err_fds(const int **out)
+{
+  *out = sigsafe_log_fds;
+  return n_sigsafe_log_fds;

Can't we have a race here if tor_log_get_sigsafe_err_fds() is called and then tor_log_update_sigsafe_err_fds()? Or do the usage of the first mandate LOCK_LOGS? Or are we sure there's no way this can happen?

fd7aa1af commit message does not say anything about the found_real_stderr thing.

+    assert(n_sigsafe_log_fds >= 2);

Using assert() because tor_assert() will not be happy with an unconfigured crash handler, right? Might worth a comment.

+#ifdef PC_FROM_UCONTEXT
+#if defined(__linux__)
+  const int n = 1;
+#elif defined(__darwin__) || defined(__APPLE__) || defined(__OpenBSD__) \
+  || defined(__FreeBSD__)
+  const int n = 2;
+#endif

Maybe add an #else #error here ? The compiler should be unhappy because n will be uninitialized, but it might save a little bit of time for a future port.

+    puts("Argument should be \"assert\" or \"crash\"");

“or none”

+  return crash(x) + crash(x*2);

What's the point of testing the reader's knowledge about C evaluation order? :)

Running the test program with assert has the logging function in the trace:

$ ./test-bt-cl assert
Oct 11 22:55:39.705 [err] tor_assertion_failed_(): Bug: src/test/test_bt_cl.c:36: crash: Assertion 1 == 0 failed; aborting.
Oct 11 22:55:39.705 [err] Bug: Assertion 1 == 0 failed in crash at src/test/test_bt_cl.c:36. Stack trace:
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(log_backtrace+0x35) [0x7f3236a5e095]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(tor_assertion_failed_+0x9f) [0x7f3236a68a8f]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(crash+0x79) [0x7f3236a5de69]
[…]

Is that something we want to keep?

For the other one:

$ ./test-bt-cl crash
============================================================ T=1381524943
Tor  died: Caught signal 11
./test-bt-cl(+0x8fc7)[0x7f156871afc7]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(oh_what+0x25)[0x7f156871ae95]

There's no symbol for the first function and crash() is there twice. Is that expected?

comment:6 in reply to: ↑ 5 ; follow-up: Changed 6 months ago by nickm

Replying to lunar:

Ok, here's a review. I hope I'm not wasting anyone's time.

Interesting how the new format_dec_number_sigsafe() and format_hex_number_sigsafe() are slightly different given they do almost the same thing.

Yeah; perhaps they should share more code.

+++ b/configure.ac
         crt_externs.h \
+       execinfo.h \
         grp.h \

Small whitespace issue.

Fixed

+    tor_assert(0 * 0 == 1);

Leaving that in the history is probably bad for future bisection, no?

That's what the later "fixup!" in the commit message for 62645b5e18a8994c9257078f571d4227037a894c is for. See the "--autosquash" argument for git rebase.

tor_log_err_sigsafe_helper() might be better named tor_log_err_sigsafe_write()

Renaming.

+  if (log_time_granularity >= 2000) {
+    int g = log_time_granularity / 1000;

A comment for the magic numbers? You previously had 15 minutes. Does this mean that the
stack_dump file will have second granularity by default? Is that a problem?

It means that the file will have the same granularity as the rest of the system logs.

+tor_log_get_sigsafe_err_fds(const int **out)
+{
+  *out = sigsafe_log_fds;
+  return n_sigsafe_log_fds;

Can't we have a race here if tor_log_get_sigsafe_err_fds() is called and then tor_log_update_sigsafe_err_fds()? Or do the usage of the first mandate LOCK_LOGS? Or are we sure there's no way this can happen?

Hm. During a signal handler, it's not safe to grab a lock, so we might need to tolerate that, or use some kind of atomic set/fetch.

fd7aa1af commit message does not say anything about the found_real_stderr thing.

Added in a squash! commit.

+    assert(n_sigsafe_log_fds >= 2);

Using assert() because tor_assert() will not be happy with an unconfigured crash handler, right? Might worth a comment.

Added a comment. The other reason is that tor_assert logs, and logging inside log functions is potentially a recursive nightmare.

+#ifdef PC_FROM_UCONTEXT
+#if defined(__linux__)
+  const int n = 1;
+#elif defined(__darwin__) || defined(__APPLE__) || defined(__OpenBSD__) \
+  || defined(__FreeBSD__)
+  const int n = 2;
+#endif

Maybe add an #else #error here ? The compiler should be unhappy because n will be uninitialized, but it might save a little bit of time for a future port.

Added

+    puts("Argument should be \"assert\" or \"crash\"");

“or none”

Added

+  return crash(x) + crash(x*2);

What's the point of testing the reader's knowledge about C evaluation order? :)

Added a comment. The actual point is to prevent the compiler from adding a tail-call optimization during the first call to crash().

Running the test program with assert has the logging function in the trace:

$ ./test-bt-cl assert
Oct 11 22:55:39.705 [err] tor_assertion_failed_(): Bug: src/test/test_bt_cl.c:36: crash: Assertion 1 == 0 failed; aborting.
Oct 11 22:55:39.705 [err] Bug: Assertion 1 == 0 failed in crash at src/test/test_bt_cl.c:36. Stack trace:
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(log_backtrace+0x35) [0x7f3236a5e095]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(tor_assertion_failed_+0x9f) [0x7f3236a68a8f]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(crash+0x79) [0x7f3236a5de69]
[…]

Is that something we want to keep?

I'd say, "yes".

For the other one:

$ ./test-bt-cl crash
============================================================ T=1381524943
Tor  died: Caught signal 11
./test-bt-cl(+0x8fc7)[0x7f156871afc7]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(oh_what+0x25)[0x7f156871ae95]

There's no symbol for the first function and crash() is there twice. Is that expected?

Odd. This seems Linux-specific. Will investigate.

comment:7 in reply to: ↑ 6 Changed 6 months ago by nickm

Replying to nickm:

[...]

Odd. This seems Linux-specific. Will investigate.

This investigation turned up nothing helpful; it seems to be happening because of the "n = 1" in backtrace.c, but I am sure that in testing, n = 1" was required under some circumstances. I'm going to call this a harmless wart for now.

comment:8 follow-up: Changed 5 months ago by andrea

Athena's code review follows:

  • c8ca168d0cfe58b8f86badbc17300f85adfedb31
    • Is not our code, and I sure hope Google didn't break it too badly. OpenBSD support seems to be missing and probably shouldn't be - and wouldn't be too hard to find out and add. Linux/sparc{,64} is rather more obscure but I happen to have it at hand, so we could add that too.

comment:9 in reply to: ↑ 8 Changed 5 months ago by nickm

Replying to andrea:

Athena's code review follows:

Indeed it isn't, but it's one of the things Tor requires. (Specifically, we require that your character set contain ASCII as a subset.) We check for this in compat.h:

#if 'a'!=97 || 'z'!=122 || 'A'!=65 || ' '!=32
#error "It seems that you encode characters in something other than ASCII."
#endif

Yes, it does add it to LDFLAGS. If the check fails, then presumably the platform is one where the -rdynamic flag doesn't exist, and therefore isn't necessary.

We might need more sophisticated checks, but the ones I had were sufficient to make it work everywhere I tried.

Yes, though that's not sufficient. The failure mode when invoking non-signal-safe stuff from signal handlers is not "the code fails" but rather "if the signal arrives while exactly the wrong thing is happening, the signal handler code fails, or the other code fails once the signal handler is done" As such, the only real safe way to do this is to enumerate all the system and/or library functions you're calling and make sure they're all on the POSIX safe list.

The ones we call from tor_log_err_sigsafe() are: time(), write(), va_start(), va_ag(), va_end(), strlen(), tor_assert()... whoops. tor_assert() shouldn't be on that list.

Fixing in a fixup commit.

  • c8ca168d0cfe58b8f86badbc17300f85adfedb31
    • Is not our code, and I sure hope Google didn't break it too badly. OpenBSD support seems to be missing and probably shouldn't be - and wouldn't be too hard to find out and add. Linux/sparc{,64} is rather more obscure but I happen to have it at hand, so we could add that too.

Indeed. I'm assuming that this will be easier for us to fix up and submit upstream fixes for than it would be for us to build one of these from scratch. (For a while, I was trying the latter, and getting quite frustrated at the amount of research needed.)

They mention OpenBSD support in the comments and later in the file; supposedly it works?

It does in the "./test/test-bt-cl crash" test, I believe?

Yes; it's only to add another missing bit of commit log message to the commit it's squashing on to.

I think I #errored in the case where PC_FROM_UCONTEXT is defined, but the platfrom *isn't* one of linux, darwin, apple, openbsd, freebsd. I think instead going for '1' as a sane default makes more sense. Doing that.

Maybe? It's nice to have confirmation that running with -O2 doesn't prevent the backtrace code from working.

Based on the above, I'm adding a few fixup commits, squashing as 'backtrace_squashed', and merging. Thanks!

comment:10 Changed 5 months ago by nickm

  • Resolution set to implemented
  • Status changed from needs_review to closed

#10186 is the ticket to add windows support. This branch is now merged.

comment:11 Changed 2 months ago by cypherpunks

Stack trace unreadable if FPO enabled in general or used optimization options (then FPO applied selectively by compiler depends function's code) for x86 platform. It's not problem for x86_64 because GCC generates unwind tables there by default.
If to enable unwind tables by "-fasynchronous-unwind-tables" then stack trace readable with any optimizations.

Last edited 2 months ago by cypherpunks (previous) (diff)

comment:12 Changed 2 months ago by cypherpunks

-fomit-frame-pointer
    Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.

    On some machines, such as the VAX, this flag has no effect, because the standard calling sequence automatically handles the frame pointer and nothing is saved by pretending it doesn't exist. The machine-description macro FRAME_POINTER_REQUIRED controls whether a target machine supports this flag. See Register Usage.

    Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit GNU/Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.

    Enabled at levels -O, -O2, -O3, -Os. 

Even better. Lets assume FPO was enabled for Tor.

comment:13 Changed 2 months ago by nickm

  • Resolution implemented deleted
  • Status changed from closed to reopened

Reopening this until we can copy those ideas into a new ticket.

comment:14 Changed 2 months ago by cypherpunks

log_backtrace() is thread unsafe. If threads going to assert for the same time then calling of backtrace() from every thread going to fill the same cb_buf array.

comment:15 Changed 8 weeks ago by nickm

  • Parent ID set to #11046

comment:16 Changed 8 weeks ago by nickm

The -fomit-frame-pointer issue is now #11047.

comment:17 Changed 8 weeks ago by nickm

The threadsafety issue is now #11048

comment:18 Changed 8 weeks ago by nickm

  • Resolution set to implemented
  • Status changed from reopened to closed

Closing this ticket again; new issues should ideally be children of #11046 (or comments on #11046 if you don't have time to open a new ticket). Thanks!

Note: See TracTickets for help on using tickets.