Dump stack traces on assertion, crash, or general trouble

changed milestone to %Tor: 0.2.5.x-final

added component::core tor/tor debugging milestone::Tor: 0.2.5.x-final parent::11046 priority::medium resolution::implemented status::closed tor-relay type::enhancement labels

Trac:
Description: It's so easy to dump stack traces these days!

I have a "backtrace" branch right now that an dump stack traces on assertion failures. It works on glibc/ELF, and on OSX. We should expand it to work on Windows too, and BSD if we can.

Other fixes to make before it's ready:

It should be able to log a stack trace too.
- It should log the stack trace on an assertion.
There should be an option to tell it not to log to the stack_dumps file, perhaps.
Perhaps the logfile should be pid-controlled?
It should support Windows.
It should handle deadly signals (SEGV, etc) as well.
It should indicate to the user somehow (if it can) that stuff might be saved to the stack_dumps file.
It should have tests.

to

It's so easy to dump stack traces these days!

I have a "backtrace" branch right now that an dump stack traces on assertion failures. It works on glibc/ELF, and on OSX. We should expand it to work on Windows too, and BSD if we can.

Other fixes to make before it's ready:

~~It should be able to log a stack trace too.~~
- ~~It should log the stack trace on an assertion.~~
~~There should be an option to tell it not to log to the stack_dumps file, perhaps.~~
~~Perhaps the logfile should be pid-controlled?~~
It should support Windows.
~~It should handle deadly signals (SEGV, etc) as well.~~
~~It should indicate to the user somehow (if it can) that stuff might be saved to the stack_dumps file.~~
It should have tests.

Updated my branch. Fun stuff! I cut the gordian knot of "where to store traces on crash" by saying "just put them in the log fd and on stderr."

For FreeBSD (and others?) we should look at linking libexecinfo, which is supposed to provide the same information as the backtrace() functions in <execinfo.h> on Linux/OSX.

For windows, we will need to look into StackWalk64 for stack dumping. I believe that the function to catch segfaults and whatnot is SetUnhandledExceptionFilter, but I'm less sure about that.

All that's left to do now is Windows support.

Trac:
Description: It's so easy to dump stack traces these days!

I have a "backtrace" branch right now that an dump stack traces on assertion failures. It works on glibc/ELF, and on OSX. We should expand it to work on Windows too, and BSD if we can.

Other fixes to make before it's ready:

~~It should be able to log a stack trace too.~~
- ~~It should log the stack trace on an assertion.~~
~~There should be an option to tell it not to log to the stack_dumps file, perhaps.~~
~~Perhaps the logfile should be pid-controlled?~~
It should support Windows.
~~It should handle deadly signals (SEGV, etc) as well.~~
~~It should indicate to the user somehow (if it can) that stuff might be saved to the stack_dumps file.~~
It should have tests.

to

It's so easy to dump stack traces these days!

I have a "backtrace" branch right now that an dump stack traces on assertion failures. It works on glibc/ELF, and on OSX. We should expand it to work on Windows too, and BSD if we can.

Other fixes to make before it's ready:

~~It should be able to log a stack trace too.~~
- ~~It should log the stack trace on an assertion.~~
~~There should be an option to tell it not to log to the stack_dumps file, perhaps.~~
~~Perhaps the logfile should be pid-controlled?~~
It should support Windows.
~~It should handle deadly signals (SEGV, etc) as well.~~
~~It should indicate to the user somehow (if it can) that stuff might be saved to the stack_dumps file.~~
~~It should have tests.~~

Actually, I'm going to suggest we review and merge now, and leave Windows support for a bit later, since that's a whole new project, and apparently a big one, since the windows API is not trivial.

Trac:
Status: new to needs_review

Ok, here's a review. I hope I'm not wasting anyone's time.

Interesting how the new format_dec_number_sigsafe() and format_hex_number_sigsafe() are slightly different given they do almost the same thing.

+++ b/configure.ac
         crt_externs.h \
+       execinfo.h \
         grp.h \

Small whitespace issue.

+    tor_assert(0 * 0 == 1);

Leaving that in the history is probably bad for future bisection, no?

tor_log_err_sigsafe_helper() might be better named tor_log_err_sigsafe_write()

+  if (log_time_granularity >= 2000) {
+    int g = log_time_granularity / 1000;

A comment for the magic numbers? You previously had 15 minutes. Does this mean that the stack_dump file will have second granularity by default? Is that a problem?

+tor_log_get_sigsafe_err_fds(const int **out)
+{
+  *out = sigsafe_log_fds;
+  return n_sigsafe_log_fds;

Can't we have a race here if tor_log_get_sigsafe_err_fds() is called and then tor_log_update_sigsafe_err_fds()? Or do the usage of the first mandate LOCK_LOGS? Or are we sure there's no way this can happen?

fd7aa1af commit message does not say anything about the found_real_stderr thing.

+    assert(n_sigsafe_log_fds >= 2);

Using assert() because tor_assert() will not be happy with an unconfigured crash handler, right? Might worth a comment.

+#ifdef PC_FROM_UCONTEXT
+#if defined(__linux__)
+  const int n = 1;
+#elif defined(__darwin__) || defined(__APPLE__) || defined(__OpenBSD__) \
+  || defined(__FreeBSD__)
+  const int n = 2;
+#endif

Maybe add an #else #error here ? The compiler should be unhappy because n will be uninitialized, but it might save a little bit of time for a future port.

+    puts("Argument should be \"assert\" or \"crash\"");

“or none”

+  return crash(x) + crash(x*2);

What's the point of testing the reader's knowledge about C evaluation order? :)

Running the test program with assert has the logging function in the trace:

$ ./test-bt-cl assert
Oct 11 22:55:39.705 [err] tor_assertion_failed_(): Bug: src/test/test_bt_cl.c:36: crash: Assertion 1 == 0 failed; aborting.
Oct 11 22:55:39.705 [err] Bug: Assertion 1 == 0 failed in crash at src/test/test_bt_cl.c:36. Stack trace:
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(log_backtrace+0x35) [0x7f3236a5e095]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(tor_assertion_failed_+0x9f) [0x7f3236a68a8f]
Oct 11 22:55:39.705 [err] Bug:     ./test-bt-cl(crash+0x79) [0x7f3236a5de69]
[…]

Is that something we want to keep?

For the other one:

$ ./test-bt-cl crash
============================================================ T=1381524943
Tor  died: Caught signal 11
./test-bt-cl(+0x8fc7)[0x7f156871afc7]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(crash+0x48)[0x7f156871ae38]
./test-bt-cl(oh_what+0x25)[0x7f156871ae95]

There's no symbol for the first function and crash() is there twice. Is that expected?

Replying to lunar:

Ok, here's a review. I hope I'm not wasting anyone's time.

Interesting how the new format_dec_number_sigsafe() and format_hex_number_sigsafe() are slightly different given they do almost the same thing.

Yeah; perhaps they should share more code.

{{{ +++ b/configure.ac crt_externs.h \
  execinfo.h \
   grp.h \
}}}

Small whitespace issue.

Fixed

{{{

tor_assert(0 * 0 == 1); }}}

Leaving that in the history is probably bad for future bisection, no?

That's what the later "fixup!" in the commit message for 62645b5e18a8994c9257078f571d4227037a894c is for. See the "--autosquash" argument for git rebase.

tor_log_err_sigsafe_helper() might be better named tor_log_err_sigsafe_write()

Renaming.

{{{

if (log_time_granularity >= 2000) {

int g = log_time_granularity / 1000; }}}

A comment for the magic numbers? You previously had 15 minutes. Does this mean that the stack_dump file will have second granularity by default? Is that a problem?

It means that the file will have the same granularity as the rest of the system logs.

{{{ +tor_log_get_sigsafe_err_fds(const int **out) +{

*out = sigsafe_log_fds;

return n_sigsafe_log_fds; }}}

Can't we have a race here if tor_log_get_sigsafe_err_fds() is called and then tor_log_update_sigsafe_err_fds()? Or do the usage of the first mandate LOCK_LOGS? Or are we sure there's no way this can happen?

Hm. During a signal handler, it's not safe to grab a lock, so we might need to tolerate that, or use some kind of atomic set/fetch.

fd7aa1af commit message does not say anything about the found_real_stderr thing.

Added in a squash! commit.

{{{

assert(n_sigsafe_log_fds >= 2); }}}

Using assert() because tor_assert() will not be happy with an unconfigured crash handler, right? Might worth a comment.

Added a comment. The other reason is that tor_assert logs, and logging inside log functions is potentially a recursive nightmare.

{{{ +#ifdef PC_FROM_UCONTEXT +#if defined(linux)

const int n = 1; +#elif defined(darwin) || defined(APPLE) || defined(OpenBSD) \

|| defined(FreeBSD)

const int n = 2; +#endif }}}

Maybe add an #else #error here ? The compiler should be unhappy because n will be uninitialized, but it might save a little bit of time for a future port.

Added

{{{

puts("Argument should be "assert" or "crash""); }}}

“or none”

Added

{{{

return crash(x) + crash(x*2); }}}

What's the point of testing the reader's knowledge about C evaluation order? :)

Added a comment. The actual point is to prevent the compiler from adding a tail-call optimization during the first call to crash().

Running the test program with assert has the logging function in the trace:

{{{ $ ./test-bt-cl assert Oct 11 22:55:39.705 [err] tor_assertion_failed_(): Bug: src/test/test_bt_cl.c:36: crash: Assertion 1 == 0 failed; aborting. Oct 11 22:55:39.705 [err] Bug: Assertion 1 == 0 failed in crash at src/test/test_bt_cl.c:36. Stack trace: Oct 11 22:55:39.705 [err] Bug: ./test-bt-cl(log_backtrace+0x35) [0x7f3236a5e095] Oct 11 22:55:39.705 [err] Bug: ./test-bt-cl(tor_assertion_failed_+0x9f) [0x7f3236a68a8f] Oct 11 22:55:39.705 [err] Bug: ./test-bt-cl(crash+0x79) [0x7f3236a5de69] […] }}}

Is that something we want to keep?

I'd say, "yes".

For the other one:

{{{ $ ./test-bt-cl crash ============================================================ T=1381524943 Tor died: Caught signal 11 ./test-bt-cl(+0x8fc7)[0x7f156871afc7] ./test-bt-cl(crash+0x48)[0x7f156871ae38] ./test-bt-cl(crash+0x48)[0x7f156871ae38] ./test-bt-cl(oh_what+0x25)[0x7f156871ae95] }}}

There's no symbol for the first function and crash() is there twice. Is that expected?

Odd. This seems Linux-specific. Will investigate.

Replying to nickm: [...]

Odd. This seems Linux-specific. Will investigate.

This investigation turned up nothing helpful; it seems to be happening because of the "n = 1" in backtrace.c, but I am sure that in testing, n = 1" was required under some circumstances. I'm going to call this a harmless wart for now.

Athena's code review follows:

5343ee1a06ebb959fc77753898015186b94a5daa
- Looks okay to me, but (nitpick) I don't think '0' + digit is quite perfectly portable.
f6d8bc9389db72dc654560d0f86fd3edd3b14934
- Looks fine
308ffbcfdd99424a1750b09f90721d9da38f761b
- Why round to 15 minutes?
ff187fd1335968468f307d48157e5dcf4ad88b1f
- Looks fine
82743d6e560bbbbcbd22e09c3ec26f54d0656b75
- Does TOR_CHECK_LDFLAGS actually add it to LDFLAGS for us? what if it fails - shouldn't we set USE_BACKTRACE then?
62645b5e18a8994c9257078f571d4227037a894c
- Yeah, good move there :)
ab2a00ac4378e85f67da6e681a7c8609e3153db1
- Looks okay to me, but yeesh signal handlers can get hairy. I presume you've tested all this by actually invoking it from a signal handler?
fd7aa1af256b11cad9875fec875a570205aea0e0
- Looks fine
c8ca168d0cfe58b8f86badbc17300f85adfedb31
- Is not our code, and I sure hope Google didn't break it too badly. OpenBSD support seems to be missing and probably shouldn't be - and wouldn't be too hard to find out and add. Linux/sparc{,64} is rather more obscure but I happen to have it at hand, so we could add that too.
10411ee8c13bcf8de65a225a0cee5b0844fb5b9f
- Looks fine
3591d2079dabc7450b1a973045e000212a90b67d
- Looks fine
1d6af8099d9f3dafefdf75bb1411ec2bd669087f
- Yay, unit tests! Shouldn't tor_log_err_sigsafe() actually get invoked from a signal handler at some point though?
e1e638a169dd0a34455057b3f7cc895defd1f3cf
- Yay, more unit tests!
abdfd8367b326e918ee49b4f35aa899859c6fca7
- Looks fine
74ac03f54b6d640e97330d4207cbb943337fa71d
- Looks fine
8b926717e601578f7f83aee9c7384d2ffb44d985
- Looks fine
85a520682f832314ffe5615ad2544558092bee52
- Looks fine
e9f95a67ae4b35326293865034fe92e1df59ae43
- Is an empty diff?
f0b9dd03abfd18a037f4ae0bbacecd34b35cad1a
- Ouch! Am I reading that wrong, or did you just #error on any Linux that isn't one of the archs pc_from_ucontext.m4 supports? :(
4871da62cd814a5ca50f03924b3492e8d3b2f89a
- Looks fine
511a8af458bbf56edaf7d1110dfe28c11274b7fb
- Looks okay, but wouldn't it be better to just compile stuff like this with -O0?
6fe53dc7ccd0e84316584a3352e598f66858d46c
- Looks fine

Replying to andrea:

Athena's code review follows:

5343ee1a06ebb959fc77753898015186b94a5daa

Looks okay to me, but (nitpick) I don't think '0' + digit is quite perfectly portable.

Indeed it isn't, but it's one of the things Tor requires. (Specifically, we require that your character set contain ASCII as a subset.) We check for this in compat.h:

#if 'a'!=97 || 'z'!=122 || 'A'!=65 || ' '!=32
#error "It seems that you encode characters in something other than ASCII."
#endif

82743d6e560bbbbcbd22e09c3ec26f54d0656b75

Does TOR_CHECK_LDFLAGS actually add it to LDFLAGS for us? what if it fails - shouldn't we set USE_BACKTRACE then?

Yes, it does add it to LDFLAGS. If the check fails, then presumably the platform is one where the -rdynamic flag doesn't exist, and therefore isn't necessary.

We might need more sophisticated checks, but the ones I had were sufficient to make it work everywhere I tried.

ab2a00ac4378e85f67da6e681a7c8609e3153db1

Looks okay to me, but yeesh signal handlers can get hairy. I presume you've tested all this by actually invoking it from a signal handler?

Yes, though that's not sufficient. The failure mode when invoking non-signal-safe stuff from signal handlers is not "the code fails" but rather "if the signal arrives while exactly the wrong thing is happening, the signal handler code fails, or the other code fails once the signal handler is done" As such, the only real safe way to do this is to enumerate all the system and/or library functions you're calling and make sure they're all on the POSIX safe list.

The ones we call from tor_log_err_sigsafe() are: time(), write(), va_start(), va_ag(), va_end(), strlen(), tor_assert()... whoops. tor_assert() shouldn't be on that list.

Fixing in a fixup commit.

c8ca168d0cfe58b8f86badbc17300f85adfedb31

Is not our code, and I sure hope Google didn't break it too badly. OpenBSD support seems to be missing and probably shouldn't be - and wouldn't be too hard to find out and add. Linux/sparc{,64} is rather more obscure but I happen to have it at hand, so we could add that too.

Indeed. I'm assuming that this will be easier for us to fix up and submit upstream fixes for than it would be for us to build one of these from scratch. (For a while, I was trying the latter, and getting quite frustrated at the amount of research needed.)

They mention OpenBSD support in the comments and later in the file; supposedly it works?

1d6af8099d9f3dafefdf75bb1411ec2bd669087f

Yay, unit tests! Shouldn't tor_log_err_sigsafe() actually get invoked from a signal handler at some point though?

It does in the "./test/test-bt-cl crash" test, I believe?

e9f95a67ae4b35326293865034fe92e1df59ae43

Is an empty diff?

Yes; it's only to add another missing bit of commit log message to the commit it's squashing on to.

f0b9dd03abfd18a037f4ae0bbacecd34b35cad1a

Ouch! Am I reading that wrong, or did you just #error on any Linux that isn't one of the archs pc_from_ucontext.m4 supports? :(

I think I #errored in the case where PC_FROM_UCONTEXT is defined, but the platfrom isn't one of linux, darwin, apple, openbsd, freebsd. I think instead going for '1' as a sane default makes more sense. Doing that.

511a8af458bbf56edaf7d1110dfe28c11274b7fb

Looks okay, but wouldn't it be better to just compile stuff like this with -O0?

Maybe? It's nice to have confirmation that running with -O2 doesn't prevent the backtrace code from working.

Based on the above, I'm adding a few fixup commits, squashing as 'backtrace_squashed', and merging. Thanks!

#10186 (moved) is the ticket to add windows support. This branch is now merged.

Trac:
Status: needs_review to closed
Resolution: N/A to implemented

Stack trace unreadable if FPO enabled in general or used optimization options (then FPO applied selectively by compiler depends function's code) for x86 platform. It's not problem for x86_64 because GCC generates unwind tables there by default. If to enable unwind tables by "-fasynchronous-unwind-tables" then stack trace readable with any optimizations.

-fomit-frame-pointer
    Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.

    On some machines, such as the VAX, this flag has no effect, because the standard calling sequence automatically handles the frame pointer and nothing is saved by pretending it doesn't exist. The machine-description macro FRAME_POINTER_REQUIRED controls whether a target machine supports this flag. See Register Usage.

    Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit GNU/Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.

    Enabled at levels -O, -O2, -O3, -Os.

Even better. Lets assume FPO was enabled for Tor.

Reopening this until we can copy those ideas into a new ticket.

Trac:
Resolution: implemented to N/A
Status: closed to reopened

log_backtrace() is thread unsafe. If threads going to assert for the same time then calling of backtrace() from every thread going to fill the same cb_buf array.

Trac:
Parent: N/A to #11046 (moved)

The -fomit-frame-pointer issue is now #11047 (moved).

The threadsafety issue is now #11048 (moved)

Closing this ticket again; new issues should ideally be children of #11046 (moved) (or comments on #11046 (moved) if you don't have time to open a new ticket). Thanks!

Trac:
Resolution: N/A to implemented
Status: reopened to closed

Dump stack traces on assertion, crash, or general trouble

Child items 0

Activity