"
My very stable Tor relay 16102E458460349EE45C0901DAA6C30094A9BBEA has been
rebooting every two days since May 31, 2018.
Running Tor 0.3.3.5-rc (git-4fe9670741478750) on Linux
with Libevent 2.0.21-stable, OpenSSL 1.1.0f, Zlib 1.2.8
You will find a short extract of the Tor log below:
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Main process
exited, code=killed, status=7/BUS
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Unit entered
failed state.
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Failed with
result 'signal'.
May 31 23:12:28 mkuktra systemd[1]: tor@default.service: Service
hold-off time over, scheduling restart.
Jun 05 03:17:59 mkuktra tor[23419]: *** Error in `/usr/bin/tor': free():
invalid next size (normal): 0x000055ed468598d0 ***
Jun 05 03:17:59 mkuktra tor[23419]: ======= Backtrace: =========
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7fbb01af9bfb]
Jun 05 03:17:59 mkuktra tor[23419]:
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7fbb01afffc6]
"
Jun 04 01:33:20 mkuktra Tor[28659]: descriptor at 0x564bc61006b0 begins with unexpected string "rnuter N0nam3 78.120.51.57 30556 0 30555\nidentity-ed25519\n-----B". Is anothe
Jun 04 01:33:42 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBMIC KEY
Jun 04 01:33:42 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: parse error: Malformed object: mismatched end tag RSA PUBLIC KEY
Jun 04 01:33:58 mkuktra Tor[14966]: Error tokenizing router descriptor.
Jun 04 01:33:58 mkuktra Tor[14966]: couldn't find start of hashed material "router "
Jun 04 01:33:58 mkuktra Tor[14966]: No router keyword found.
Try using smartctl to check raw hard-drive temperatures, and remote server management (if you have access) to check system temperatures and fan speeds; perhaps a fan or two seized, the filters are clogged with dust, or the data center has a chiller problem.
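A minimal sketch of those checks, assuming smartmontools (and ipmitool for the board sensors) is installed and that the disk shows up as /dev/sda; on network-backed virtual disks SMART data may simply be unavailable:
# raw drive temperature and SMART health
smartctl -a /dev/sda | grep -i -E 'temperature|overall-health'
# system temperatures and fan speeds, if the box has IPMI and you have access
ipmitool sensor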
The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. So I have no access to the bare-metal disk.
Tor 0.3.4.1-alpha (git-679302e532232d11) also crashes.
I am unable to add the new log file to the ticket.
Trac complains with this message:
"Soumission rejetée en tant que spam potentiel
Le contenu répond aux types suivants de contenu indésirable : '1[ -.]?8[ -.]?[045678][ -.]?[045678][ -.]?None..None.?None..None.?None..None.?None..None?None..None.?None..None.?None..None'
Le nombre maximal de liens externes par soumission est dépassé"
In comment:13, that's a simple math captcha put up by Trac; it should be easy to solve. If it's not in your language, you could use Google Translate. If the log is very large (5 MB+), it will help to compress it with xz.
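For example (the log filename is hypothetical):
# -9 trades CPU time for the smallest attachment; produces tor-debug.log.xz
xz -9 tor-debug.log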
I looked at the log from earlier and it blew up at 17 minutes after the hour, possibly related to obtaining a new consensus diff. I recommend the following:
Take down the relay and delete all cached-* files and the entire diff-cache sub-directory, in case one of the files is corrupt; eyeball the state file for anything that does not look kosher and perhaps delete garbage that may appear (make a copy first); then start the relay again to see if it fails. A shell sketch of this cleanup follows the logging notes below.
If it continues to fail, please run a debug trace so the exact location and the call arc to it can be unraveled. Torrc lines for medium-detail logging, less many of the highest-volume categories:
# retain normal logging to syslog
Log notice syslog
# high-volume debug to a file
LogMessageDomains 1
LogTimeGranularity 1
Log [*]info [channel,dir,dirserv,guard,heartbeat,consdiff]debug file debuglog
If the category containing the failure seems to be missing, please try to figure out which one it is and add it from the list below:
general, crypto, net, config, fs, protocol, mm, http, app, control, circ, rend, bug, dir, dirserv, edge, acct, hist, handshake, heartbeat, channel, sched, guard, consdiff, dos
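A rough sketch of the cache cleanup recommended above, assuming a Debian-style install with the default /var/lib/tor DataDirectory and the tor@default systemd unit from your logs; it moves the files aside rather than deleting them outright, and the backup path is arbitrary:
systemctl stop tor@default
mkdir -p /root/tor-cache-backup
# set the cached descriptors/consensus and the consensus-diff cache aside
mv /var/lib/tor/cached-* /var/lib/tor/diff-cache /root/tor-cache-backup/
# keep a copy of the state file before eyeballing or editing it
cp -a /var/lib/tor/state /root/tor-cache-backup/state.copy
systemctl start tor@default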
If you have a technical background, it would help if the trace had line numbers; obtaining them can get a bit involved. Possibly Debian has "debug" packages companion to the various library packages (similar to RH/CentOS), and installing them will magically cause line numbers to appear in the call trace. Possibly tor has a debug package as well; I always build from source and don't know offhand where to look.
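A hedged sketch for Debian stretch; the debug-archive line and the -dbgsym package names are assumptions to check against your release and against where your tor package actually comes from:
# enable Debian's automatic debug-symbol archive
echo 'deb http://deb.debian.org/debian-debug/ stretch-debug main' >> /etc/apt/sources.list
apt-get update
# libc6-dbg covers glibc; tor-dbgsym (if it exists for your package) covers the tor binary
apt-get install libc6-dbg tor-dbgsym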
Another diagnostic avenue (if you feel up to hacking on it--not for the faint-of-heart) is to trace glibc memory. This will produce insanely huge log files so only attempt it if you have at least 100GB available. . .1TB is more like it.
Referring to the documentation at http://man7.org/linux/man-pages/man3/mtrace.3.html, compile a memory-trace tor executable and add MALLOC_TRACE=/???/mtrace.out to the environment of the daemon.
Once the daemon is up, grab a copy of the memory map from /proc/<tor_pid>/maps.
When it fails, analyze the log using the process memory map and the mtrace.pl Perl script (or an alternative).
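A hedged sketch of that setup under systemd; the trace directory, the debian-tor user, and a tor binary rebuilt to call mtrace(3) at startup (which the man page requires) are all assumptions:
mkdir -p /var/lib/tor-mtrace && chown debian-tor: /var/lib/tor-mtrace
systemctl edit tor@default
# add to the drop-in:
#   [Service]
#   Environment="MALLOC_TRACE=/var/lib/tor-mtrace/mtrace.out"
systemctl restart tor@default
# grab the memory map while the daemon is up
cp /proc/"$(pidof tor)"/maps /var/lib/tor-mtrace/maps.copy
# after a crash, interpret the trace with glibc's mtrace(1) script (or mtrace.pl)
mtrace /usr/bin/tor /var/lib/tor-mtrace/mtrace.out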
Not worth the trouble unless demonstrably usable line-number debug information is in place for the tor images and the system libraries.
WARNING: the above constitutes a lot of trouble and work -- major effort
Another difficult road to travel is building an ASAN+UBSAN (address sanitizer) image in the hope of nailing down corruption occurring "somewhere else" in the code. The problem I've found is that ASAN is exorbitantly expensive and will slow the daemon to near-uselessness. I've had success building without ASAN, then removing the .o files for modules of interest not on the hot path and re-making with ASAN options turned on. It might be cleaner to enhance the Makefile to mix and match ASAN/non-ASAN in one pass, but at the time I was in a demon-hacker frame of mind. The preceding is a royal PitA but could yield results. I'm not subscribed to this bug, but you can email my address as seen on the tor-relays list for my (out-of-date) notes on how I did it. The "hardened build" support in more recent release Makefiles is likely a better starting point.
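For reference, a hedged sketch of a plain from-source sanitizer build; --enable-fragile-hardening is what newer configure scripts call the "hardened build" support and may be named differently (or be absent) in the tree you have:
git clone https://git.torproject.org/tor.git && cd tor
./autogen.sh
./configure --enable-fragile-hardening
# or pass the sanitizer flags by hand:
# ./configure CFLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer -g" \
#             LDFLAGS="-fsanitize=address,undefined"
make -j"$(nproc)"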
Take down the relay and delete all cached-* files and the entire diff-cache sub-directory in case one of the files is corrupt
If you're going this route, definitely move the files somewhere, rather than deleting them. We want to find this bug, not have one person stop experiencing it. :)
The disk system is a "Virtual SSD" somewhere in the infrastructure using nbd. . .
It crossed my mind that the "network block device" (iSCSI?) could be corrupting mmap'ed cached-* or diff-cache files somehow. I distrust network storage for critical applications and have not seen a network solution half as reliable as a set of locally attached drives in RAID1/5/6. I'd take a single bare SSD over network storage.
If not already configured, try adding
DisableAllSwap 1
to torrc, which should eliminate paging and lower the probability of the aforesaid. Verify it's working by issuing
cat /proc/<tor_pid>/status
and checking that VmLck is the same as VmSize less 4 or something like that.
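A quick way to eyeball both values at once, assuming a single tor process:
grep -E '^(VmSize|VmLck)' /proc/"$(pidof tor)"/status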
May 31 23:12:27 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=7/BUS
Jun 04 01:33:21 mkuktra systemd[1]: tor@default.service: Main process exited, code=exited, status=1/FAILURE
Jun 05 02:07:08 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 02:50:06 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=11/SEGV
Jun 05 03:17:59 mkuktra systemd[1]: tor@default.service: Main process exited, code=killed, status=6/ABRT
Four crashes (two SEGV, one BUS, one ABRT) and one exit(1). Only the ABRT case produced a backtrace.
Allowing for the possibility that network storage is not directly corrupting memory, the slowness of paging over the network may be exposing a race-condition bug where an unprotected critical section results in corruption. This, of course, is the nastiest class of bug.
My understanding is that much of the work processing consensus documents was recently moved from the main event-loop thread to worker threads and this might have led to the introduction of an unprotected race.
The issue may have arrived suddenly due to increasing memory pressure on the shared container or VM from other instances: paging that was previously absent may now be occurring. If locking memory with DisableAllSwap reduces or eliminates the traps, the theory is further validated.
The best way to find such bugs, in my experience, is with the Valgrind component Helgrind. Helgrind shows where the problem resides without necessarily triggering it. Slow as Hell though. . . only practical for test runs.
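A hedged sketch of such a run against a disposable test instance, since it is far too slow for a production relay; the torrc path is an assumption:
valgrind --tool=helgrind --log-file=helgrind.%p.log /usr/bin/tor -f /etc/tor/torrc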