Opened 18 months ago

Last modified 6 months ago

#26311 needs_information defect

Error in `/usr/bin/tor': free(): invalid next size (normal): 0x000055ed468598d0

Reported by: cypherpunks Owned by:
Priority: Medium Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.3.5-rc
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

https://lists.torproject.org/pipermail/tor-relays/2018-June/015353.html

"
My very stable Tor relay 16102E458460349EE45C0901DAA6C30094A9BBEA is now
rebooting every two days since May 31, 2018.

Running Tor 0.3.3.5-rc (git-4fe9670741478750) on Linux

with Libevent 2.0.21-stable, OpenSSL 1.1.0f, Zlib 1.2.8

You will find a short extract of the Tor log below :

May 31 23:12:27 mkuktra systemd[1]: tor@…: Main process
exited, code=killed, status=7/BUS
May 31 23:12:27 mkuktra systemd[1]: tor@…: Unit entered
failed state.
May 31 23:12:27 mkuktra systemd[1]: tor@…: Failed with
result 'signal'.
May 31 23:12:28 mkuktra systemd[1]: tor@…: Service
hold-off time over, scheduling restart.

Jun 05 03:17:59 mkuktra tor[23419]: * Error in `/usr/bin/tor': free():
invalid next size (normal): 0x000055ed468598d0
*
Jun 05 03:17:59 mkuktra tor[23419]: ======= Backtrace: =========
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7fbb01af9bfb]
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7fbb01afffc6]
"

trac search found also #15776

Child Tickets

Attachments (1)

Reboot_Tor (76.0 KB) - added by pascal.levasseur 18 months ago.
Detail Tor log

Download all attachments as: .zip

Change History (27)

comment:1 Changed 18 months ago by cypherpunks

Version: Tor: 0.3.3.6Tor: 0.3.3.5-rc

comment:2 Changed 18 months ago by cypherpunks

the heap corruption is out there
#25957

comment:3 Changed 18 months ago by cypherpunks

Both serving DirPort

Zlib 1.2.8

What about libs for lzma, zstd, .. ?

comment:4 Changed 18 months ago by cypherpunks

Both serving DirPort

IPv6 OrPort too

Changed 18 months ago by pascal.levasseur

Attachment: Reboot_Tor added

Detail Tor log

comment:5 Changed 18 months ago by pascal.levasseur

The file Reboot_Tor contain a more detail log.

comment:6 Changed 18 months ago by pascal.levasseur

I have moved to 0.3.4.1.

comment:7 Changed 18 months ago by cypherpunks

The file Reboot_Tor contain a more detail log.

Jun 04 01:33:20 mkuktra Tor[28659]: descriptor at 0x564bc61006b0 begins with unexpected string "rnuter N0nam3 78.120.51.57 30556 0 30555\nidentity-ed25519\n-----B". Is anothe

Jun 04 01:33:42 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBMIC KEY
Jun 04 01:33:42 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBLIC KEY
Jun 04 01:33:58 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: couldn't find start of hashed material "router "
Jun 04 01:33:58 mkuktra Tor[14966]: No router keyword found.

comment:8 Changed 18 months ago by cypherpunks

rnuter

'n' \x6E
'o' \x6F <- valid

PUBMIC

'L' \x4C <- valid
'M' \x4D

LSB flips
Hardware failure?

comment:9 Changed 18 months ago by starlight

try using smartctl to check raw hard-drive temperatures, remote server management if you have access to check system temperatures and fan speeds; perhaps a fan or two seized, filters clogged with dust or the data center has a chiller problem

Last edited 18 months ago by starlight (previous) (diff)

comment:10 Changed 18 months ago by starlight

check for library file corruption: rpm -Va or debsums

comment:11 in reply to:  9 ; Changed 18 months ago by pascal.levasseur

Replying to starlight:

try using smartctl to check raw hard-drive temperatures, remote server management if you have access to check system temperatures and fan speeds; perhaps a fan or two seized, filters clogged with dust or the data center has a chiller problem

The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. So I have no acces to the disk bare metal.

comment:12 in reply to:  10 Changed 18 months ago by pascal.levasseur

Replying to starlight:

check for library file corruption: rpm -Va or debsums

root@mkuktra:~# debsums -a -s
debsums: changed file /etc/ntp.conf (from ntp package)
debsums: changed file /etc/tor/torrc (from tor package)

root@mkuktra:~# echo $?
2

comment:13 Changed 18 months ago by pascal.levasseur

Tor 0.3.4.1-alpha (git-679302e532232d11) also crash.

I am unable to add the new log file to the ticket.

Trac complains with that message :

"Soumission rejetée en tant que spam potentiel
Le contenu répond aux types suivants de contenu indésirable : '1[ -\.]?8[ -\.]?[045678][ -\.]?[045678][ -\.]?[0-9].?[0-9].?[0-9].?[0-9]?[0-9].?[0-9].?[0-9]'
Le nombre maximal de liens externes par soumission est dépassé"

Some help will be appreciate !

comment:14 Changed 18 months ago by cypherpunks

You can to paste log to somewhere else, then post link here.

comment:15 Changed 18 months ago by starlight

In comment:13 that's a simple math captcha put up by TRAC, should be easy to solve. If it's not your language you could use Google Translate. If the log is very large (5MB+) it will help to compress it with xz.

I looked at the log from earlier and it blew at 17 minutes after the hour. Possibly related to obtaining a new consensus diff. Recommend

1) take down the relay and delete all cached-* files the entire diff-cache sub-directory in case one of the files is corrupt; eyeball state for anything that does not look kosher and perhaps delete garbage that may appear (make a copy first); the start the relay again to see if it fails

2) if it continues to fail, please run a debug trace so the exact location and call-arc to it can be unraveled; torrc lines for medium detailed logging less many highest-volume categories:

# retain normal logging to syslog
Log="notice syslog"
# high-volume debug to a file
LogMessageDomains=1
LogTimeGranularity=1
Log="[*]info [channel,dir,dirserv,guard,heartbeat,consdiff]debug file debuglog

if category with the fail seems to be missing, please
try to figure out which one it is and add it from the below

general, crypto, net, config, fs, protocol, mm, http, app, control, circ,
rend, bug, dir, dirserv, edge, acct, hist, handshake, heartbeat, channel,
sched, guard, consdiff, dos

If you have a technical background it would help if the trace had line numbers. Can get a bit involved obtaining them. Possily Debian has "debug" packages companion to the various library packages (similar to RH/Centos) and installing them will magically cause line numbers to appear in the call trace. Possibly tor has a debug package as well--I always build from source and I don't know offhand where to look.

Last edited 18 months ago by starlight (previous) (diff)

comment:16 Changed 18 months ago by starlight

Another diagnostic avenue (if you feel up to hacking on it--not for the faint-of-heart) is to trace glibc memory. This will produce insanely huge log files so only attempt it if you have at least 100GB available. . .1TB is more like it.

Referring to the documentation at http://man7.org/linux/man-pages/man3/mtrace.3.html compile a memory-trace tor executable and add MALLOC_TRACE=/???/mtrace.out to the environment of the daemon.
Once the daemon is up grab a copy of the memory map from /proc/<tor_pid>/maps.

When it fails, one analyzes the log using the process memory map and the mtrace.pl perl script or an alternate.

Not worth the trouble unless demonstrably usable line-number debug information is in place for for the tor images and the system libraries.

WARNING: the above constitutes a lot of trouble and work -- major effort

comment:17 Changed 18 months ago by starlight

Another difficult road to travel is building an ASAN+UBSAN (address sanitizer) image in the hope of nailing down corruption occurring "somewhere else" in the code. Problem I've found is ASAN is exorbitant and will slow the daemon to near-useless. I've had success building without ASAN, then removing the .o files for modules of interest not on the hot path and re-making with ASAN options turned on. Might be cleaner to enhance Makefile to mix-and-match ASAN/non-ASAN in one pass, but at the time I was in a demon-hacker frame of mind. Preceding is a royal PitA but could yield results. Not subscribed to this bug but you can email my address as seen in tor-relays list for my (out of date) notes on how I did it. The "hardened build" support in more recent release Makefiles is likely a better starting point.

Last edited 18 months ago by starlight (previous) (diff)

comment:18 Changed 18 months ago by starlight

OR start with a memory-debug library. Lot easier than comment:16 or comment:17 and could cut straight to it. A survey of more popular ones: https://www.computerworld.com/article/3003957/linux/review-5-memory-debuggers-for-linux-coding.html . Probably deb packages exist for one or more of them. If it doesn't work perhaps try ASAN as plan-B.

comment:19 in reply to:  15 Changed 18 months ago by arma

Replying to starlight:

1) take down the relay and delete all cached-* files the entire diff-cache sub-directory in case one of the files is corrupt

If you're going this route, definitely *move* the files somewhere, rather than deleting them. We want to find this bug, not have one person stop experiencing it. :)

comment:20 in reply to:  11 Changed 18 months ago by starlight

Replying to pascal.levasseur:

The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. . .

Crossed my mind the "network block device" (iSCSI?) could be corrupting MMAP'ed cached-* or diff-cache files somehow. I distrust network storage for critical applications and have not seen a network solution half as reliable a set of locally attached drives in RAID1/5/6. I'd take a single bare SSD over network storage.

If not already configured, try adding

DisableAllSwap 1

to torrc, which should eliminate paging and lower the probabality of the aforesaid. Verify it's working by issuing

cat /proc/<tor_pid>/status

and checking that VmLck is the same as VmSize less 4 or something like that.

comment:21 Changed 18 months ago by dgoulet

Milestone: Tor: 0.3.4.x-final

comment:22 Changed 18 months ago by cypherpunks

May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=7/BUS
Jun 04 01:33:21 mkuktra systemd[1]: tor@default.service: Main process exited, code=exited, status=1/FAILURE
Jun 05 02:07:08 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 02:50:06 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 03:17:59 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=6/ABRT

Four crashes (two SEGV, one BUS, one ABRT). One exit(1). Only backtrace for ABRT case.

comment:23 Changed 18 months ago by starlight

Speculative yet plausible theory:

Allowing the possibility network storage is not directly corrupting memory, the slowness of paging over network may be exposing a race-condition bug where an unprotected critical-section results in corruption. This of course is the nastiest class of bug.

My understanding is that much of the work processing consensus documents was recently moved from the main event-loop thread to worker threads and this might have led to the introduction of an unprotected race.

Issue may have arrived suddenly due to increasing memory pressure on the shared container or VM from other instances; where previously paging may have not been present, but occurs now. If successful locking of memory with DisableAllSwap reduces or eliminates the traps, theory is further validated.

Best way to find such bugs in my experience with the Valgrind component Helgrind. Helgrind shows where the problem resides without necessarily triggering it. Slow as Hell though. . .only runs test.

Last edited 18 months ago by starlight (previous) (diff)

comment:24 Changed 18 months ago by starlight

Lately though, TSAN is probably a better bet. Still slow but not quite so painfully.

comment:25 Changed 16 months ago by nickm

Status: newneeds_information

comment:26 Changed 6 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final

0.3.4 is EOL; if we're going to get the info we need for these, it will be under 0.3.5.

Note: See TracTickets for help on using tickets.