"
My very stable Tor relay 16102E458460349EE45C0901DAA6C30094A9BBEA has been
rebooting every two days since May 31, 2018.
Running Tor 0.3.3.5-rc (git-4fe9670741478750) on Linux
with Libevent 2.0.21-stable, OpenSSL 1.1.0f, Zlib 1.2.8
You will find a short extract of the Tor log below:
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Main process
exited, code=killed, status=7/BUS
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Unit entered
failed state.
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Failed with
result 'signal'.
May 31 23:12:28 mkuktra systemd[1]: tor@default.service: Service
hold-off time over, scheduling restart.
Jun 05 03:17:59 mkuktra tor[23419]: *** Error in `/usr/bin/tor': free():
invalid next size (normal): 0x000055ed468598d0 ***
Jun 05 03:17:59 mkuktra tor[23419]: ======= Backtrace: =========
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7fbb01af9bfb]
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7fbb01afffc6]
"
Jun 04 01:33:20 mkuktra Tor[28659]: descriptor at 0x564bc61006b0 begins with unexpected string "rnuter N0nam3 78.120.51.57 30556 0 30555\nidentity-ed25519\n-----B". Is anothe
Jun 04 01:33:42 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBMIC KEY
Jun 04 01:33:42 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBLIC KEY
Jun 04 01:33:58 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: couldn't find start of hashed material "router "
Jun 04 01:33:58 mkuktra Tor[14966]: No router keyword found.
Try using smartctl to check raw hard-drive temperatures, and remote server management (if you have access) to check system temperatures and fan speeds; perhaps a fan or two seized, the filters are clogged with dust, or the data center has a chiller problem.
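A minimal sketch of those checks, assuming smartmontools (and ipmitool for the board sensors) is installed and that the disk shows up as /dev/sda; on network-backed virtual disks SMART data may simply be unavailable:
# raw drive temperature and SMART health
smartctl -a /dev/sda | grep -i -E 'temperature|overall-health'
# system temperatures and fan speeds, if the box has IPMI and you have access
ipmitool sensor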
The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. So I have no access to the bare-metal disk.
Tor 0.3.4.1-alpha (git-679302e532232d11) also crashes.
I am unable to add the new log file to the ticket.
Trac complains with this message:
"Soumission rejetée en tant que spam potentiel
Le contenu répond aux types suivants de contenu indésirable : '1[ -.]?8[ -.]?[045678][ -.]?[045678][ -.]?None..None.?None..None.?None..None.?None..None?None..None.?None..None.?None..None'
Le nombre maximal de liens externes par soumission est dépassé"
In comment:13, that's a simple math captcha put up by Trac; it should be easy to solve. If it's not in your language, you could use Google Translate. If the log is very large (5 MB+), it will help to compress it with xz.
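For example (the log filename is hypothetical):
# -9 trades CPU time for the smallest attachment; produces tor-debug.log.xz
xz -9 tor-debug.log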
I looked at the log from earlier and it blew up at 17 minutes after the hour, possibly related to obtaining a new consensus diff. I recommend the following:
Take down the relay and delete all cached-* files and the entire diff-cache sub-directory, in case one of the files is corrupt; eyeball the state file for anything that does not look kosher and perhaps delete garbage that may appear (make a copy first); then start the relay again to see if it fails. A shell sketch of this cleanup follows the logging notes below.
If it continues to fail, please run a debug trace so the exact location and the call arc to it can be unraveled. Torrc lines for medium-detail logging, less many of the highest-volume categories:
# retain normal logging to syslog
Log notice syslog
# high-volume debug to a file
LogMessageDomains 1
LogTimeGranularity 1
Log [*]info [channel,dir,dirserv,guard,heartbeat,consdiff]debug file debuglog
If the category containing the failure seems to be missing, please try to figure out which one it is and add it from the list below:
general, crypto, net, config, fs, protocol, mm, http, app, control, circ, rend, bug, dir, dirserv, edge, acct, hist, handshake, heartbeat, channel, sched, guard, consdiff, dos
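A rough sketch of the cache cleanup recommended above, assuming a Debian-style install with the default /var/lib/tor DataDirectory and the tor@default systemd unit from your logs; it moves the files aside rather than deleting them outright, and the backup path is arbitrary:
systemctl stop tor@default
mkdir -p /root/tor-cache-backup
# set the cached descriptors/consensus and the consensus-diff cache aside
mv /var/lib/tor/cached-* /var/lib/tor/diff-cache /root/tor-cache-backup/
# keep a copy of the state file before eyeballing or editing it
cp -a /var/lib/tor/state /root/tor-cache-backup/state.copy
systemctl start tor@default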
If you have a technical background, it would help if the trace had line numbers; obtaining them can get a bit involved. Possibly Debian has "debug" packages companion to the various library packages (similar to RH/CentOS), and installing them will magically cause line numbers to appear in the call trace. Possibly tor has a debug package as well; I always build from source and don't know offhand where to look.
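A hedged sketch for Debian stretch; the debug-archive line and the -dbgsym package names are assumptions to check against your release and against where your tor package actually comes from:
# enable Debian's automatic debug-symbol archive
echo 'deb http://deb.debian.org/debian-debug/ stretch-debug main' >> /etc/apt/sources.list
apt-get update
# libc6-dbg covers glibc; tor-dbgsym (if it exists for your package) covers the tor binary
apt-get install libc6-dbg tor-dbgsym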
Another diagnostic avenue (if you feel up to hacking on it--not for the faint-of-heart) is to trace glibc memory. This will produce insanely huge log files so only attempt it if you have at least 100GB available. . .1TB is more like it.
Referring to the documentation at http://man7.org/linux/man-pages/man3/mtrace.3.html, compile a memory-trace tor executable and add MALLOC_TRACE=/???/mtrace.out to the environment of the daemon.
Once the daemon is up, grab a copy of the memory map from /proc/<tor_pid>/maps.
When it fails, analyze the log using the process memory map and the mtrace.pl Perl script (or an alternative).
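A hedged sketch of that setup under systemd; the trace directory, the debian-tor user, and a tor binary rebuilt to call mtrace(3) at startup (which the man page requires) are all assumptions:
mkdir -p /var/lib/tor-mtrace && chown debian-tor: /var/lib/tor-mtrace
systemctl edit tor@default
# add to the drop-in:
#   [Service]
#   Environment="MALLOC_TRACE=/var/lib/tor-mtrace/mtrace.out"
systemctl restart tor@default
# grab the memory map while the daemon is up
cp /proc/"$(pidof tor)"/maps /var/lib/tor-mtrace/maps.copy
# after a crash, interpret the trace with glibc's mtrace(1) script (or mtrace.pl)
mtrace /usr/bin/tor /var/lib/tor-mtrace/mtrace.out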
Not worth the trouble unless demonstrably usable line-number debug information is in place for the tor images and the system libraries.
WARNING: the above constitutes a lot of trouble and work -- major effort
Another difficult road to travel is building an ASAN+UBSAN (address sanitizer) image in the hope of nailing down corruption occurring "somewhere else" in the code. The problem I've found is that ASAN is exorbitantly expensive and will slow the daemon to near-uselessness. I've had success building without ASAN, then removing the .o files for modules of interest not on the hot path and re-making with ASAN options turned on. It might be cleaner to enhance the Makefile to mix and match ASAN/non-ASAN in one pass, but at the time I was in a demon-hacker frame of mind. The preceding is a royal PitA but could yield results. I'm not subscribed to this bug, but you can email my address as seen on the tor-relays list for my (out-of-date) notes on how I did it. The "hardened build" support in more recent release Makefiles is likely a better starting point.
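For reference, a hedged sketch of a plain from-source sanitizer build; --enable-fragile-hardening is what newer configure scripts call the "hardened build" support and may be named differently (or be absent) in the tree you have:
git clone https://git.torproject.org/tor.git && cd tor
./autogen.sh
./configure --enable-fragile-hardening
# or pass the sanitizer flags by hand:
# ./configure CFLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer -g" \
#             LDFLAGS="-fsanitize=address,undefined"
make -j"$(nproc)"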
Take down the relay and delete all cached-* files and the entire diff-cache sub-directory in case one of the files is corrupt
If you're going this route, definitely move the files somewhere, rather than deleting them. We want to find this bug, not have one person stop experiencing it. :)
The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. . .
It crossed my mind that the "network block device" (iSCSI?) could be corrupting mmap'ed cached-* or diff-cache files somehow. I distrust network storage for critical applications and have not seen a network solution half as reliable as a set of locally attached drives in RAID1/5/6. I'd take a single bare SSD over network storage.
If not already configured, try adding
DisableAllSwap 1
to torrc, which should eliminate paging and lower the probability of the aforesaid. Verify it's working by issuing
cat /proc/<tor_pid>/status
and checking that VmLck is the same as VmSize less 4 or something like that.
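A quick way to eyeball both values at once, assuming a single tor process:
grep -E '^(VmSize|VmLck)' /proc/"$(pidof tor)"/status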
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=7/BUS
Jun 04 01:33:21 mkuktra systemd[1]: tor@default.service: Main process exited, code=exited, status=1/FAILURE
Jun 05 02:07:08 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 02:50:06 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 03:17:59 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=6/ABRT
Four crashes (two SEGV, one BUS, one ABRT) and one exit(1). Only the ABRT case produced a backtrace.
Allowing for the possibility that network storage is not directly corrupting memory, the slowness of paging over the network may be exposing a race-condition bug where an unprotected critical section results in corruption. This, of course, is the nastiest class of bug.
My understanding is that much of the work processing consensus documents was recently moved from the main event-loop thread to worker threads and this might have led to the introduction of an unprotected race.
The issue may have arrived suddenly due to increasing memory pressure on the shared container or VM from other instances: paging that was previously absent may now be occurring. If locking memory with DisableAllSwap reduces or eliminates the traps, the theory is further validated.
The best way to find such bugs, in my experience, is with the Valgrind component Helgrind. Helgrind shows where the problem resides without necessarily triggering it. Slow as Hell though. . . only practical for test runs.
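A hedged sketch of such a run against a disposable test instance, since it is far too slow for a production relay; the torrc path is an assumption:
valgrind --tool=helgrind --log-file=helgrind.%p.log /usr/bin/tor -f /etc/tor/torrc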