#27808 closed defect (fixed)

tor 0.3.5.early-alpha: Raw assertion failed at smartlist_core.c::191191 : idx>=0

Reported by: jchevali Owned by: rl1987
Priority: Very High Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.5.1-alpha
Severity: Normal Keywords: regression 035-must
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I'm seeing this error in 0.3.5.2-alpha which crashes tor soon after starting and was seeing it earlier in 0.3.5.1-alpha.

My build (on Debian 64bit) identifies as Tor 0.3.5.2-alpha (git-1a85ac8c2de77f40) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.0.1t, Zlib 1.2.8, Liblzma N/A, and Libzstd 1.3.2.

Do let me know if you'd like to see some of the config context.

INTERNAL ERROR: Raw assertion failed at 
src/lib/smartlist_core/smartlist_core.c::191191: : idx>=0
./tor(dump_stack_symbols_to_error_fds+0x31)[0x55b21eaab571]
./tor(tor_raw_assertion_failed_msg_+0x86)[0x55b21eaabbf6]
./tor(smartlist_del+0xbf)[0x55b21eaaa36f]
./tor(retry_all_listeners+0x441)[0x55b21ea37911]
./tor(+0x533db)[0x55b21e90e3db]
./tor(+0x5bed1)[0x55b21e916ed1]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05)[0x7f5bd5b5dc05]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7)[0x7f5bd5b5e537]
./tor(do_main_loop+0x203)[0x55b21e913843]
./tor(tor_run_main+0x11f5)[0x55b21e915cc5]
./tor(tor_main+0x3a)[0x55b21e90e07a]
./tor(main+0x19)[0x55b21e90dbd9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5bd4aa1b45]
./tor(+0x52c29)[0x55b21e90dc29]
./tor(dump_stack_symbols_to_error_fds+0x31)[0x55b21eaab571]
./tor(tor_raw_assertion_failed_msg_+0x86)[0x55b21eaabbf6]
./tor(smartlist_del+0xbf)[0x55b21eaaa36f]
./tor(retry_all_listeners+0x441)[0x55b21ea37911]
./tor(+0x533db)[0x55b21e90e3db]
./tor(+0x5bed1)[0x55b21e916ed1]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05)[0x7f5bd5b5dc05]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7)[0x7f5bd5b5e537]
./tor(do_main_loop+0x203)[0x55b21e913843]
./tor(tor_run_main+0x11f5)[0x55b21e915cc5]
./tor(tor_main+0x3a)[0x55b21e90e07a]
./tor(main+0x19)[0x55b21e90dbd9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f5bd4aa1b45]
./tor(+0x52c29)[0x55b21e90dc29]

Child Tickets

TicketStatusOwnerSummaryComponent
#27928closedRefrain from listener rebinding when address families differCore Tor/Tor

Change History (38)

comment:1 Changed 13 months ago by nickm

Milestone: Tor: 0.3.5.x-final
Priority: MediumHigh

comment:2 Changed 13 months ago by nickm

My guess would be #17873 has something to do with this.

A configuration to reproduce this would be helpful.

comment:3 Changed 13 months ago by rl1987

Owner: set to rl1987
Status: newaccepted

Will look into this today.

comment:4 Changed 13 months ago by nickm

Keywords: regression 035-must added

comment:5 Changed 13 months ago by jchevali

I'm not opening up a port for DNS, though my ORPort settings in torrc look somewhat peculiar, like so:

ORPort 9999
ORPort [1234:5678::1]:9999

If I comment out the second line in torrc and start tor 0.3.5.2-alpha, it does not crash.

My configuration is supported up to 0.3.4.8 however. Though perhaps that's the reason why after a very long time this bridge hasn't still got the guard flag. I had thought this config would earn me an IPv6-certified flag but that never happened either.

comment:6 Changed 13 months ago by rl1987

It explodes when retry_all_listeners calls smartlist_del:

2900     tor_free(r);
2901     SMARTLIST_DEL_CURRENT(replacements, r);
2902   } SMARTLIST_FOREACH_END(r);

I suppose we could remove the above two statements from the loop, as replacements are cleaned up further in the function:

2915   /* Cleanup any remaining listener replacement. */
2916   SMARTLIST_FOREACH(replacements, listener_replacement_t *, r, tor_free(r));
2917   smartlist_free(replacements);

However, I suspect multithreading issue here - from the stack trace it seems that retry_all_listeners is called multiple times, before the first invocation was able to finish.

comment:7 Changed 13 months ago by nickm

It really shouldn't be multithreading -- the main thread is the only one that should ever call that function.

comment:8 Changed 13 months ago by rl1987

Status: acceptedneeds_information

https://github.com/torproject/tor/pull/369

Is this enough? I was unable to reproduce the crashbug within Docker container with IPv6 enabled. jchevali can you check if this fixes the issue?

comment:9 Changed 13 months ago by jchevali

I've applied this 3-line patch on top of ae0cb94, currently heading the development branch.

Unfortunately the error still occurs.

comment:10 Changed 13 months ago by rl1987

What error/stracktrace it prints after you applied the patch?

comment:11 Changed 13 months ago by jchevali

INTERNAL ERROR: Raw assertion failed at
 src/lib/smartlist_core/smartlist_core.c:191: idx>=0
./tor(dump_stack_symbols_to_error_fds+0x31)[0x55ed593789f1]
./tor(tor_raw_assertion_failed_msg_+0x86)[0x55ed59379076]
./tor(smartlist_del+0xbf)[0x55ed593777ef]
./tor(retry_all_listeners+0x459)[0x55ed591e7939]
./tor(+0x6354b)[0x55ed591eb54b]
./tor(+0x68231)[0x55ed591f0231]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05)[0x7f24a506ac05]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7)[0x7f24a506b537]
./tor(do_main_loop+0x90)[0x55ed591efc50]
./tor(tor_run_main+0x1373)[0x55ed591ddd53]
./tor(tor_main+0x3a)[0x55ed591db1aa]
./tor(main+0x19)[0x55ed591dad09]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f24a3faeb45]
./tor(+0x52d59)[0x55ed591dad59]

comment:12 Changed 13 months ago by jchevali

If you can't reproduce, don't give up, post here your torrc. You may include any settings you like, I only need to have in there the two ORPort entries. I'll replace the IPv6 in square brackets with my own and fire it up. If it won't show the bug, it means some other setting in my torrc was a factor in this bug. If it does show the bug to me but when you ran it it didn't, it means that some other OS configuration of mine could have been a factor in this bug, maybe some shared libraries version. We could start narrowing it down.

comment:13 Changed 13 months ago by nickm

Priority: HighVery High

Mark all 035-must tickets as "very high"

comment:14 Changed 13 months ago by rl1987

I'm using the following Docker setup: https://github.com/rl1987/tor-docker-images/tree/master/debian

I edited my Docker configuration JSON to make it is as follows:

{
  "fixed-cidr-v6" : "2001:db8:1::/64",
  "debug" : true,
  "ipv6" : true,
  "experimental" : true
}

Then I did docker build . in the debian folder of my tor-docker-images repo (linked above). Docker container is now built.

Shelled into the container by running docker run -it --entrypoint /bin/bash 9da6b7fb8b7b and verified that it has IPv6 network interface:

tor@128312c91195:/src/tor$ ip -6 addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
37: eth0@if38: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP 
    inet6 2001:db8:1::242:ac11:2/64 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link 
       valid_lft forever preferred_lft forever

Now let's run tor with above torrc entries:

tor@128312c91195:/src/tor$ printf "ORPort 9999\nORPort [2001:db8:1::242:ac11:2]:9999\n" | tor -f -
Oct 01 18:02:01.027 [notice] Tor 0.3.5.2-alpha-dev (git-43211c3a0c4b56d4) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.0.1t, Zlib 1.2.8, Liblzma N/A, and Libzstd N/A.
Oct 01 18:02:01.028 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Oct 01 18:02:01.028 [notice] This version is not a stable Tor release. Expect more bugs than usual.
Oct 01 18:02:01.031 [notice] Your ContactInfo config option is not set. Please consider setting it, so we can contact you if your server is misconfigured or something else goes wrong.
Oct 01 18:02:01.032 [notice] Based on detected system memory, MaxMemInQueues is set to 1499 MB. You can override this by setting MaxMemInQueues by hand.
Oct 01 18:02:01.032 [notice] By default, Tor does not run as an exit relay. If you want to be an exit relay, set ExitRelay to 1. To suppress this message in the future, set ExitRelay to 0.
Oct 01 18:02:01.032 [notice] Opening Socks listener on 127.0.0.1:9050
Oct 01 18:02:01.032 [notice] Opened Socks listener on 127.0.0.1:9050
Oct 01 18:02:01.033 [notice] Opening OR listener on 0.0.0.0:9999
Oct 01 18:02:01.033 [notice] Opened OR listener on 0.0.0.0:9999
Oct 01 18:02:01.033 [notice] Opening OR listener on [2001:db8:1::242:ac11:2]:9999
Oct 01 18:02:01.033 [notice] Opened OR listener on [2001:db8:1::242:ac11:2]:9999

It does not crash and proceeds to download network information from directory server.

Last edited 13 months ago by rl1987 (previous) (diff)

comment:15 Changed 13 months ago by rl1987

However how come it still has assert failure when SMARTLIST_DEL_CURRENT(replacements, r); is removed? Are we missing something?

comment:16 Changed 13 months ago by nickm

What if this is actually happening inside "retry_listener_ports", but we can't tell from the stack trace, because the function got inlined?

In particular, I'm looking at how retry_listener_ports can call SMARTLIST_DEL on a few different things. Is it possible that it is calling SMARTLIST_DEL_CURRENT(old_conns, conn); twice for the same value of conn? That would explain how the index could become negative.

comment:17 Changed 13 months ago by jchevali

If it helps, the last time I tested that (8 days ago), I got the impression that the crash started to happen with a delay of 45 seconds following the start of the server, while before the patch it used to crash pretty much instantly.

Anyway I should be starting another round of testing pretty soon.

comment:18 Changed 13 months ago by jchevali

When I say 45 seconds it could have been a minute, as I'm not counting from launch but from bootstrapped 100%.

comment:19 Changed 13 months ago by jchevali

I've set the log level to info and I'm comparing a successful run with a failed run.

For both runs, there's two entries logged by routerlist_remove_old_routers thirteen seconds after launch. The two entries are separated by a pageful of other entries, but they seem to have the same log time (as do those other entries); or at least they appear to have the same time up to the second since there's no millisecond resolution (there is but it's 000).

One minute after that, there's a further entry of routerlist_remove_old_routers in either log. The successful run continues logging that every minute seemingly, while also doing other stuff. The bad run crashes right after this last entry, with the following message:

INTERNAL ERROR: Raw assertion failed at
 src/lib/smartlist_core/smartlist_core.c:191: idx>=0
./tor(dump_stack_symbols_to_error_fds+0x31)[0x561de403a571]
./tor(tor_raw_assertion_failed_msg_+0x86)[0x561de403abf6]
./tor(smartlist_del+0xbf)[0x561de403936f]
./tor(retry_all_listeners+0x459)[0x561de3fc6939]
./tor(+0x533db)[0x561de3e9d3db]
./tor(+0x5bed1)[0x561de3ea5ed1]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05)[0x7f4acb72ac05]
/usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7)[0x7f4acb72b537]
./tor(do_main_loop+0x203)[0x561de3ea2843]
./tor(tor_run_main+0x11f5)[0x561de3ea4cc5]
./tor(tor_main+0x3a)[0x561de3e9d07a]
./tor(main+0x19)[0x561de3e9cbd9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f4aca66eb45]
./tor(+0x52c29)[0x561de3e9cc29]

Here's the gist of my torrc besides the two entries discussed:

SOCKSPort 9050

Log info file /var/log/tor/log

RunAsDaemon 1

ControlPort 9051
HashedControlPassword ...

HiddenServiceDir ...
HiddenServiceVersion 3
HiddenServicePort ...

HiddenServiceDir ...
HiddenServiceAuthorizeClient ...
HiddenServiceVersion 2
HiddenServicePort ...

ORPort 9999
ORPort [1234:5678::1]:9999

Nickname ...

RelayBandwidthRate ...
RelayBandwidthBurst ...

AccountingMax ...
AccountingStart ...

ContactInfo ...

ExitPolicy reject *:*

ExitRelay 0

BridgeRelay 1

MaxMemInQueues 256 MB
AvoidDiskWrites 1
PidFile /var/lib/tor/pid

ServerTransportPlugin ...

ExtORPort auto
DisableDebuggerAttachment 0
LongLivedPorts ...
GeoIPFile /usr/share/tor/geoip
GeoIPv6File /usr/share/tor/geoip6
ExcludeNodes ...
MaxConsensusAgeForDiffs 6 hours

I'm not using docker, though I could set one up with a little effort.

comment:20 Changed 13 months ago by rl1987

New patch for diagnostics: https://github.com/torproject/tor/pull/382

Just to confirm: you're not seeing Closed no-longer-configured ... in the logs, do you?

retry_all_listeners is called from two places in the codebase: 1) options_act_reversible (when config is (re-)loaded and 2) retry_listener_callback (every minute). I suspect a race condition here, which would explain why it's timing sensitive and doesn't happen every time. Above patch should show us if that's the case. Look for new stacktraces and log entries that mention retry_all_listeners.

comment:21 Changed 13 months ago by rl1987

Opened #27928 and #27929 for possibly related small issues I noticed.

comment:22 Changed 13 months ago by jchevali

No, I was not seeing 'no-longer-configured' in the logs.

I've applied PR-382 on top of the latest commit to the dev branch. Here's the result just before the crash:

tor_bug_occurred_(): 
 Bug: src/core/mainloop/connection.c:2821: retry_listener_ports: 
 Non-fatal assertion smartlist_contains(old_conns, conn) failed. 
  (on Tor 0.3.5.2-alpha-dev fe6365055bd4ec4e)
Bug: Non-fatal assertion smartlist_contains(old_conns, conn)
 failed in retry_listener_ports at src/core/mainloop/connection.c:2821. 
 Stack trace: (on Tor 0.3.5.2-alpha-dev fe6365055bd4ec4e)
Bug:     ./tor(log_backtrace_impl+0x47) [0x55a6c3e71247]
Bug:     ./tor(tor_bug_occurred_+0xbe) [0x55a6c3e6c9be]
Bug:     ./tor(retry_all_listeners+0x69e) [0x55a6c3cdfe4e]
Bug:     ./tor(+0x6396b) [0x55a6c3ce396b]
Bug:     ./tor(+0x68751) [0x55a6c3ce8751]
Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05) [0x7f2d5c792c05]
Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7) [0x7f2d5c793537]
Bug:     ./tor(do_main_loop+0x91) [0x55a6c3ce81b1]
Bug:     ./tor(tor_run_main+0x1373) [0x55a6c3cd6023]
Bug:     ./tor(tor_main+0x3a) [0x55a6c3cd347a]
Bug:     ./tor(main+0x19) [0x55a6c3cd2fd9]
Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f2d5b6d6b45]
Bug:     ./tor(+0x53029) [0x55a6c3cd3029]

comment:23 Changed 13 months ago by rl1987

Did it say anything about retry_all_listeners being called (at debug loglevel)?

comment:24 Changed 13 months ago by jchevali

No entries in log containing this string before the crash, even at debug level.

Only the new non-fatal assert and subsequent stack trace contain this string.

comment:25 Changed 13 months ago by rl1987

Okay. I think for your usercase it should be fixed by my patch for #27928.

The problem is the following. In retry_listener_ports there are two smartlist_del calls that remove conn (loop variable in outer loop) from old_conns smartlist. We may remove it in the inner loop (if we determine that rebinding may be needed (this logic wasn't exactly right - see #27928)) or in the outer loop (if we found exact match to one of the wanted ports). Or both, which happened in this case. The code may determine that rebinding may be needed, remove the conn in the inner loop, keep iterating the inner loop, find exact match and then try to remove it in outer loop, which makes tor explode.

For now, please try applying: https://github.com/torproject/tor/pull/383

I'll look into possibly making further changes later this week.

comment:26 Changed 13 months ago by jchevali

With the new line added to connection.c, tor is holding up without crashing.

It did print the following to the log 16 seconds after launch though,

Oct 02 16:47:23.000 [info] pathbias_should_count():
 Bug: Circuit 4 is now being counted despite being
  ignored in the past. Purpose is Measuring circuit timeout,
  path state is new (on Tor 0.3.5.2-alpha-dev fe6365055bd4ec4e)
Oct 02 16:47:23.000 [info] pathbias_count_build_success():
 Bug: Succeeded circuit is in strange path state new.
  Circuit is a Measuring circuit timeout currently open.
  (on Tor 0.3.5.2-alpha-dev fe6365055bd4ec4e)

And one minute later,

Oct 02 16:48:23.000 [info] pathbias_should_count():
 Bug: Circuit 47 is now being counted despite being
  ignored in the past. Purpose is Measuring circuit timeout,
  path state is new (on Tor 0.3.5.2-alpha-dev fe6365055bd4ec4e)

comment:27 Changed 13 months ago by nickm

Okay -- that issue is probably not related to this bug. Thanks for testing this out, jchevali, and for working on the fix, rl1987!

rl1987 -- if you think this is ready for merge, please put it in needs_review? I'd like to get this merged.

comment:28 Changed 13 months ago by rl1987

#27928 is ready for merging, and (mostly) addresses the root problem here.

However pull requests !369 and !382 are purely experimental and I do not propose them to get into upstream. I will try making some further patch soonish.

Last edited 13 months ago by rl1987 (previous) (diff)

comment:29 Changed 13 months ago by rl1987

Status: needs_informationneeds_review

comment:30 Changed 13 months ago by nickm

Resolution: fixed
Status: needs_reviewclosed

This fix looks plausible to me. I'll merge it to master now.

comment:31 Changed 12 months ago by jchevali

Resolution: fixed
Status: closedreopened

I've tested dev at commit de86a70 which is where they were last merged. For reference, connection.c's md5sum at this point is 8a928954d14790ed8f2a9d494d14ea58 (please verify).

It falls down as follows:

[info] routerlist_remove_old_routers(): We have 6432 live routers and 15253 old router descriptors.
[notice] Opening OR listener on 0.0.0.0:9999
[warn] Could not bind to 0.0.0.0:9999: Address already in use. Is Tor already running?
[notice] Opening OR listener on 0.0.0.0:9999
[warn] Could not bind to 0.0.0.0:9999: Address already in use. Is Tor already running?
[err] tor_assertion_failed_(): Bug: src/core/mainloop/connection.c:2901: retry_all_listeners: Assertion new_conn failed; aborting.
[err] Bug: Assertion new_conn failed in retry_all_listeners at src/core/mainloop/connection.c:2901. Stack trace:
[err] Bug:     ./tor(log_backtrace_impl+0x47) [0x55ce6e0bd117]
[err] Bug:     ./tor(tor_assertion_failed_+0x93) [0x55ce6e0b87a3]
[err] Bug:     ./tor(retry_all_listeners+0x6e9) [0x55ce6df2be99]
[err] Bug:     ./tor(+0x6383b) [0x55ce6df2f83b]
[err] Bug:     ./tor(+0x68621) [0x55ce6df34621]
[err] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(+0x22c05) [0x7fb7a7442c05]
[err] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.1.so.6(event_base_loop+0x5a7) [0x7fb7a7443537]
[err] Bug:     ./tor(do_main_loop+0x91) [0x55ce6df34081]
[err] Bug:     ./tor(tor_run_main+0x1373) [0x55ce6df22023]
[err] Bug:     ./tor(tor_main+0x3a) [0x55ce6df1f47a]
[err] Bug:     ./tor(main+0x19) [0x55ce6df1efd9]
[err] Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fb7a6386b45]
[err] Bug:     ./tor(+0x53029) [0x55ce6df1f029]

If I uncomment in torrc the second ORPort line containing the IPv6 it does not fall down.

comment:32 Changed 12 months ago by jchevali

da86a70

comment:33 Changed 12 months ago by jchevali

That last line should have been: 'If I comment out...'

comment:34 Changed 12 months ago by jchevali

Alternatively, the line added to connection.c at PR383 (which turns this connection.c's md5sum into 78fa6d6ac9acb8d324b0fc0826343c46 - please verify) can also keep tor up and running, even with the dual ORPort clause.

I noticed that as this bug was closed PR383 was not merged. Was that intentional?

comment:35 Changed 12 months ago by jchevali

I'm sorry, perhaps you were meant to merge that for #27928. Maybe I should have waited till you merged that onto the dev line to test. If so, feel free to close once more. My apologies.

comment:36 Changed 12 months ago by rl1987

#27928 has to be merged as well, for this bug to be fixed completely. Added that ticket as child.

comment:37 Changed 12 months ago by jchevali

I've retested at tag tor-0.3.5.3-alpha after merge of PR-404 and it's a pass. Thank you.

comment:38 Changed 12 months ago by jchevali

Resolution: fixed
Status: reopenedclosed
Note: See TracTickets for help on using tickets.