Opened 10 days ago

Closed 10 days ago

#33732 closed defect (wontfix)

bacula sometimes fails with a gdb error

Reported by: anarcat Owned by: anarcat
Priority: High Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Minor Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

typical:

root@forrestii.torproject.org (1 mins. ago) (rapports root tor)
Subject: Bacula GDB traceback of bacula-fd on forrestii
To: root@forrestii.torproject.org
Date: Thu, 26 Mar 2020 00:31:44 +0000

/usr/sbin/btraceback: 60: /usr/sbin/btraceback: gdb: not found

normally it goes away after a while, but just now the copy of forrestii created as part of #33729 is sending one of those every other minute.

Child Tickets

Change History (1)

comment:1 Changed 10 days ago by anarcat

Resolution: wontfix
Status: assignedclosed

so i tried to debug this problem on forrestii. i found the following:

  1. bacula is unhappy because it can't bind on the IPv4 port. i fix this by fixing the /etc/hosts so it points to the correct IP. good.
  2. bacula, then, starts, but is *still* unhappy because it can't bind to the IPv6 port. i fixed this by adding ipv6 to /etc/hosts, but it's not clear to me why this is a problem on this specific host and not others

regarding bug 2, the "fun" thing with that is that it doesn't immediately crash. at first, it gives us this rather cryptic warning:

Mar 26 01:07:06 forrestii bacula-fd[2813]: forrestii.torproject.org-fd: Warning: Cannot bind port 9102: ERR=Cannot assign requested address: Retrying ...

and then *eventually* it will *crash* (with a SIGSEGV, no less), like so:

Mar 26 01:08:06 forrestii bacula-fd[2813]: forrestii.torproject.org-fd: ABORTING due to ERROR in bnet_server.c:132
Mar 26 01:08:06 forrestii bacula-fd[2813]: Cannot bind port 9102: ERR=Cannot assign requested address.
Mar 26 01:08:06 forrestii bacula-fd[2813]: 26-Mar 01:08 forrestii.torproject.org-fd: ABORTING due to ERROR in bnet_server.c:132
Mar 26 01:08:06 forrestii bacula-fd[2813]: Cannot bind port 9102: ERR=Cannot assign requested address.
Mar 26 01:08:06 forrestii bacula-fd[2813]: forrestii.torproject.org-fd: ABORTING due to ERROR in bnet_server.c:132
                                           Cannot bind port 9102: ERR=Cannot assign requested address.

this error comes from:

https://sources.debian.org/src/bacula/9.6.3-1/src/lib/bnet_server.c/#L131

which is basically:

         if (--tmax <= 0) {
            Emsg2(M_ABORT, 0, _("Cannot bind port %d: ERR=%s.\n"), ntohs(fd_ptr->port),
                  be.bstrerror());
            Pmsg2(000, "Aborting cannot bind port %d: ERR=%s.\n", ntohs(fd_ptr->port),
                  be.bstrerror());
         }

from the error message, the seemingly intuitive thing to do is to install gdb. but that doesn't give us anything. instead of the reported error, by email, we get:

root@forrestii.torproject.org (4 mins. ago) (rapports root tor)
Subject: Bacula GDB traceback of bacula-fd on forrestii
To: root@forrestii.torproject.org
Date: Thu, 26 Mar 2020 01:14:42 +0000

ptrace: Operation not permitted.
/var/lib/bacula/2813: No such file or directory.
/etc/bacula/scripts/btraceback.gdb:1: Error in sourced command file:
No symbol table is loaded.  Use the "file" command.

... because of course, bacula can't gdb itself because it dropped privileges. (this, by the way, is a regression of bug 556207 in debian.)

and besides, it segfaulted, so the process is possibly gone already. but even if it *could* have ran, this is what it gets:

Thread 1 "bacula-fd" received signal SIGSEGV, Segmentation fault.
0x00007f7857199185 in e_msg(char const*, int, int, int, char const*, ...) () from /usr/lib/bacula/libbac-9.4.2.so
(gdb) bt                    
#0  0x00007f7857199185 in e_msg(char const*, int, int, int, char const*, ...) () from /usr/lib/bacula/libbac-9.4.2.so
#1  0x00007f785717d5c6 in bnet_thread_server(dlist*, int, workq_tag*, void* (*)(void*)) () from /usr/lib/bacula/libbac-9.4.2.so
#2  0x0000557966d9f62c in ?? ()
#3  0x00007f7856bfc09b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x0000557966d9fb6a in ?? ()
(gdb) up                    
#1  0x00007f785717d5c6 in bnet_thread_server(dlist*, int, workq_tag*, void* (*)(void*)) () from /usr/lib/bacula/libbac-9.4.2.so
(gdb) list                  
No symbol table is loaded.  Use the "file" command.

because of course there's no symbols table, because of course we're in production. why we would fire up a debugger, *by default*, in production code is just beyond me. i looked at the source code:

https://sources.debian.org/src/bacula/9.6.3-1/src/lib/signal.c/#L165-L181

and it looks like this behavior is hardcoded in the the upstream source code, at compile time:

https://sources.debian.org/src/bacula/9.6.3-1/src/version.h/?hl=46#L46

there are no flags to disable it, even at configure/build time.

this is just a mess.

here are the problems with the above:

  1. bacula should not crash when it can't bind to a port, especially if it *can* bind to *another* port (e.g. in dual stack). this has already been reported in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=949366
  2. bacula shouldn't fail to debug itself when it tries. this is regression https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556207
  3. bacula shouldn't unconditionnally try to send a backtrace on crashes. i just reported this as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=954971

I'm not sure what else to do with this. I mostly opened this bug report to document this odyssey, since I remember dealing with this before, but not what the outcome was.

TL;DR: look into systemctl status bacula-fd, the error message should be there, and is more informative than the backtrace could ever be. Consider trashing /usr/sbin/btraceback completely to stop it from sending us such warnings ever again.

Note: See TracTickets for help on using tickets.