Opened 13 years ago

Last modified 8 years ago

#526 closed defect (Fixed)

Eventdns.c crash after closing and reopening ORPort

Reported by: nickm
Owned by: nickm
Priority: Low
Milestone: 0.2.1.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords:
Cc: nickm, arma
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:


[Originally reported by Mike Gersten; moving here so I don't forget about it.]

See or-dev thread from September-October titled "Tor crash", especially: (initial post) (stack trace)

The circumstances were:

"I first shut down the Or-port, to try to let all connections close.

When it was time to actually say "Time to stop", I re-enabled the
Or-port, and then sent a sigint. (If I send sigint without first
re-enabling the or-port, Tor assumes that it should stop immediately,
without notifying the clients).

Tor crashed about a minute later. I don't know if it was related or not.
This is 1.2.17."

The stack trace was:
0 tor 0x00087c04 event_del + 44 (event.c:697)
1 tor 0x0006cdd4 nameserver_up + 84 (eventdns.c:533)
2 tor 0x0006cee0 reply_callback + 112 (eventdns.c:648)
3 tor 0x0006fc6c reply_handle + 684 (eventdns.c:740)
4 tor 0x000714cc nameserver_ready_callback + 1564
5 tor 0x00087364 event_process_active + 240 (event.c:332)
6 tor 0x00087634 event_base_loop + 340 (event.c:448)
7 tor 0x000874cc event_loop + 40 (event.c:382)
8 tor 0x000873d0 event_dispatch + 20 (event.c:346)
9 tor 0x0004b8b0 tor_main + 656 (main.c:1270)
10 tor 0x0000277c _start + 760
11 tor 0x00002480 start + 48

[Automatically added by flyspray2trac: Operating System: All]


Change History (7)

comment:1 Changed 13 years ago by nickm

My initial analysis:

Okay. This part [the stack trace] is _profoundly_ useful; it says where the crash
happened. It looks like the failing operation is event_del (which is
called as evtimer_del) from eventdns.c around line 533. The function
in question gets called when a name server which we believed was down
gives us a reply anyway. eventdns.c says "oh, wonderful!".. and
deletes the timeout event that was going to tell us to test the
nameserver later on.

But this is after shutting down and restarting ORPort! My guess is
that somewhere along the line we freed or removed the timeout event,
but did not remove the request that's making this code get called.

comment:2 Changed 13 years ago by arma

Nick, is this still an 0.2.0.x timeframe item, or are we deferring it?

comment:3 Changed 13 years ago by nickm

Deferring: this is hard to trigger, tricky to fix, and amenable to a "don't do that then" solution for now.

comment:4 Changed 12 years ago by nickm


If reply_callback() is calling nameserver_up, it's actually because of nameserver_send_probe...

...which keeps a pointer to a possibly old-and-freed nameserver structure!

comment:5 Changed 12 years ago by nickm

Fixed in r18306. Log message:

This resolves bug 526, wherein we would crash if the following
events occurred in this order:

A: We're an OR, and one of our nameservers goes down.
B: We launch a probe to it to see if it's up again. (We do this hourly in steady-state.)
C: Before the probe finishes, we reconfigure our nameservers, usually because we got a SIGHUP and the resolv.conf file changed.
D: The probe reply comes back, or times out. (There is a five-second window for this, after B happens.)

IOW, if one of our nameservers is down and our nameserver
configuration has changed, there were 5 seconds per hour where HUPing
the server was unsafe.

Bugfix on Too obscure to backport.
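The sequence A through D can be replayed with a small sketch. It shows one common way to make such a sequence safe: detach an in-flight probe from its nameserver before the nameserver is freed. This is an assumption about the general technique, not a description of the actual change in r18306. All names (toy_ns, toy_probe, toy_send_probe, toy_ns_free, toy_probe_done) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_ns;

/* An in-flight probe that remembers which server it is testing. */
struct toy_probe {
    struct toy_ns *ns;   /* set to NULL once the server has gone away */
};

/* Toy nameserver record (hypothetical layout). */
struct toy_ns {
    bool is_up;
    struct toy_probe *probe;  /* in-flight probe, or NULL */
};

/* Steps A and B: the server is down, so launch a probe that points at it. */
static struct toy_probe *toy_send_probe(struct toy_ns *ns) {
    struct toy_probe *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->ns = ns;
    ns->probe = p;
    return p;
}

/* Step C: reconfiguration frees the server. Detach any in-flight probe
 * first, so the probe never dereferences freed memory. */
static void toy_ns_free(struct toy_ns *ns) {
    if (ns->probe != NULL) {
        ns->probe->ns = NULL;
        ns->probe = NULL;
    }
    free(ns);
}

/* Step D: the probe's reply arrives or it times out. Returns true only
 * if a still-live server was marked up. Frees the probe either way. */
static bool toy_probe_done(struct toy_probe *p) {
    bool marked_up = false;
    if (p->ns != NULL) {
        p->ns->is_up = true;
        p->ns->probe = NULL;
        marked_up = true;
    }
    free(p);
    return marked_up;
}
```

Without the detach step in toy_ns_free, step D would follow p->ns into freed memory, which is the dangling-pointer hazard identified in comment:4. With it, a probe that outlives its server simply completes as a no-op.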

comment:6 Changed 12 years ago by nickm

flyspray2trac: bug closed.

comment:7 Changed 8 years ago by nickm

Component changed from Tor Relay to Tor