Opened 12 years ago

Last modified 7 years ago

#426 closed defect (Fixed)

infinite directory information retrieval loop

Reported by: erikbosman Owned by:
Priority: High Milestone:
Component: Core Tor/Tor Version: 0.1.2.13
Severity: Keywords:
Cc: erikbosman, arma, nickm Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

After upgrading from version 0.1.1.26 to version 0.1.2.13
(Debian package version 0.1.2.13-3 from mirror.noreply.org),
tor fails to initialise. It seems to be stuck in an infinite loop:

May 12 00:59:19.239 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 00:59:19.272 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 00:59:19.305 [notice] I learned some more directory information, but not enough to build a circuit.
...and so on

It seems to bombard the directory servers with new connections at
the same rate.

...
tcp 0 0 192.168.0.164:37919 86.59.21.38:80 ESTABLISHED103 6329904 27340/tor
tcp 0 0 192.168.0.164:37914 86.59.21.38:80 ESTABLISHED103 6329896 27340/tor
tcp 199 0 192.168.0.164:37893 86.59.21.38:80 CLOSE_WAIT 103 6329866 27340/tor
...

My tor router runs behind a NAT-firewall, with the listener ports
9001 and 9030 forwarded to the right host and 'Address' in torrc
set to a FQDN pointing to the external IP address.

The daemon runs in a Xen-domU, but I don't think this problem is
related to that.

[Automatically added by flyspray2trac: Operating System: Other Linux]

Child Tickets

Change History (14)

comment:1 Changed 12 years ago by arma

Make a debug-level log to see where it's happening?
http://wiki.noreply.org/noreply/TheOnionRouter/TorFAQ#Logs

Does it happen every time you restart Tor?

Back up your datadirectory, downgrade to 0.1.1.26 and see if it
still happens?

comment:2 Changed 12 years ago by erikbosman

Tor 0.1.1.26 does not get into trouble:

May 12 17:04:41.287 [notice] Tor 0.1.1.26 opening log file.
May 12 17:04:42.090 [notice] Your Tor server's identity key fingerprint is 'MiaBonkers 4422 4785 5E38 48B9 0F96 C105 02D4 B607 8198 65ED'
May 12 17:04:43.278 [notice] We now have enough directory information to build circuits.
May 12 17:04:45.616 [notice] router_orport_found_reachable(): Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
May 12 17:04:57.679 [notice] Our directory information is no longer up-to-date enough to build circuits.
May 12 17:04:57.680 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:04:58.082 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:04:58.130 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:04:58.170 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:04:58.209 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:04:59.099 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:05:01.195 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:05:12.485 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:05:14.547 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 17:05:15.553 [notice] We now have enough directory information to build circuits.

Removing the /var/lib/tor/cached-status/ directory fixes the problem.
The newly created cached-status/* files do not cause problems.

May 12 17:07:46.419 [notice] Tor 0.1.2.13 opening log file.
May 12 17:07:47.518 [notice] Your Tor server's identity key fingerprint is 'MiaBonkers 4422 4785 5E38 48B9 0F96 C105 02D4 B607 8198 65ED'
May 12 17:07:52.354 [notice] We now have enough directory information to build circuits.
May 12 17:07:53.129 [notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
May 12 17:07:54.664 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
May 12 17:08:25.276 [notice] Performing bandwidth self-test...done.
May 12 17:08:36.126 [notice] Self-testing indicates your DirPort is reachable from the outside. Excellent.

With the old cache-status/* files present, the problem occurs every time.
According to the debug-level log, my server made 3037 connections to the four cached
directory-servers in 132 seconds, the servers don't seem to return any data:

[info] connection_dir_client_reached_eof(): Received server info (size 0) from server '86.59.21.38:80'

Here is an excerpt from the log:

May 12 12:11:00.923 [info] connection_dir_client_reached_eof(): HTTP body from server '86.59.21.38:80' was labeled as deflated, but it seems to be uncompressed. Trying both.
May 12 12:11:00.924 [info] tor_gzip_uncompress(): possible truncated or corrupt zlib data
May 12 12:11:00.924 [info] connection_dir_client_reached_eof(): Received server info (size 0) from server '86.59.21.38:80'
May 12 12:11:00.924 [info] router_load_routers_from_string(): 0 elements to add
May 12 12:11:00.927 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 12:11:00.951 [info] update_router_descriptor_cache_downloads(): Requesting 2 descriptors from authority "moria1"
May 12 12:11:00.951 [debug] directory_initiate_command(): private 0, want_to_tunnel 0.
May 12 12:11:00.951 [debug] directory_initiate_command(): initiating server descriptor fetch
May 12 12:11:00.951 [debug] connection_connect(): Connecting to [scrubbed]:9031.
May 12 12:11:00.952 [debug] connection_connect(): Connection to [scrubbed]:9031 in progress (sock 65).
May 12 12:11:00.952 [debug] connection_add(): new conn type Directory, socket 65, n_conns 430.
May 12 12:11:00.952 [debug] write_to_buf(): added 4 bytes to buf (now 4 total).
May 12 12:11:00.952 [debug] write_to_buf(): added 97 bytes to buf (now 101 total).
May 12 12:11:00.952 [debug] write_to_buf(): added 38 bytes to buf (now 139 total).
May 12 12:11:00.952 [info] update_router_descriptor_cache_downloads(): Requesting 2 descriptors from authority "tor26"
May 12 12:11:00.952 [debug] directory_initiate_command(): private 0, want_to_tunnel 0.
May 12 12:11:00.952 [debug] directory_initiate_command(): initiating server descriptor fetch
May 12 12:11:00.952 [debug] connection_connect(): Connecting to [scrubbed]:80.
May 12 12:11:00.953 [debug] connection_connect(): Connection to [scrubbed]:80 in progress (sock 442).
May 12 12:11:00.953 [debug] connection_add(): new conn type Directory, socket 442, n_conns 431.
May 12 12:11:00.953 [debug] write_to_buf(): added 4 bytes to buf (now 4 total).
May 12 12:11:00.953 [debug] write_to_buf(): added 97 bytes to buf (now 101 total).
May 12 12:11:00.954 [debug] write_to_buf(): added 32 bytes to buf (now 133 total).
May 12 12:11:00.954 [info] update_router_descriptor_cache_downloads(): Requesting 2 descriptors from authority "moria2"
May 12 12:11:00.955 [debug] directory_initiate_command(): private 0, want_to_tunnel 0.
May 12 12:11:00.955 [debug] directory_initiate_command(): initiating server descriptor fetch
May 12 12:11:00.956 [debug] connection_connect(): Connecting to [scrubbed]:80.
May 12 12:11:00.956 [debug] connection_connect(): Connection to [scrubbed]:80 in progress (sock 443).
May 12 12:11:00.956 [debug] connection_add(): new conn type Directory, socket 443, n_conns 432.
May 12 12:11:00.956 [debug] write_to_buf(): added 4 bytes to buf (now 4 total).
May 12 12:11:00.956 [debug] write_to_buf(): added 97 bytes to buf (now 101 total).
May 12 12:11:00.956 [debug] write_to_buf(): added 33 bytes to buf (now 134 total).
May 12 12:11:00.958 [info] connection_dir_client_reached_eof(): Received 0/5 routers requested from 86.59.21.38:80
May 12 12:11:00.958 [debug] conn_close_if_marked(): Cleaning up connection (fd 223).
May 12 12:11:00.958 [debug] connection_remove(): removing socket 223 (type Directory), n_conns now 431
May 12 12:11:00.958 [debug] _connection_free(): closing fd 223.
May 12 12:11:00.959 [debug] conn_read_callback(): socket 219 wants to read.
May 12 12:11:00.959 [debug] read_to_buf_impl(): Encountered eof
May 12 12:11:00.959 [debug] fetch_from_buf_http(): headerlen 197, bodylen 0.
May 12 12:11:00.959 [info] parse_http_response(): Unrecognized content encoding: "x-compress". Trying to deal.
May 12 12:11:00.959 [debug] connection_dir_client_reached_eof(): Received response from directory server '140.247.60.64:80': 200 "OK"
May 12 12:11:00.959 [debug] connection_dir_client_reached_eof(): Time on received directory is within tolerance; we are 2 seconds skewed. (That's okay.)
May 12 12:11:00.959 [info] connection_dir_client_reached_eof(): HTTP body from server '140.247.60.64:80' was labeled with an unknown Content-Encoding, but it seems to be uncompressed. Trying both.
May 12 12:11:00.959 [info] connection_dir_client_reached_eof(): Received server info (size 0) from server '140.247.60.64:80'
May 12 12:11:00.959 [info] router_load_routers_from_string(): 0 elements to add
May 12 12:11:00.965 [notice] I learned some more directory information, but not enough to build a circuit.
May 12 12:11:00.996 [info] update_router_descriptor_cache_downloads(): Requesting 1 descriptors from authority "moria1"
May 12 12:11:00.996 [debug] directory_initiate_command(): private 0, want_to_tunnel 0.
May 12 12:11:00.997 [debug] directory_initiate_command(): initiating server descriptor fetch
May 12 12:11:00.997 [debug] connection_connect(): Connecting to [scrubbed]:9031.

It goes on and on like this.

I still have the tor datadir that's causing problems.
Can I safely attach it to this bug report if I
delete the keys/ subdir?

comment:3 Changed 12 years ago by arma

Neat. Yes, I think you can publish your datadirectory if you leave out the
keys directory.

I'd also like to see as much of the debug log of it failing as I can.

And lastly, if you know how to build Tor yourself from SVN, can you try
the latest SVN on your current datadirectory and see if it breaks too? I
recently fixed a bug that might be related.

Thanks!

comment:4 Changed 12 years ago by erikbosman

I tried to attach the log and the datadir to this bug report,
but I guess they are too large. So here their urls:

http://pi.homeunix.net/tor/debug.log-failing_0.1.2.13.bz2
http://pi.homeunix.net/tor/failing-datadir.tar.bz2

The problem is not fixed in the tor-0_1_2-patches branch
(Revision 10184.)

I tried building SVN-trunk, but it didn't compile
(an error in connection_edge.c.)

comment:5 Changed 12 years ago by arma

Whoops. SVN-trunk ought to compile again now. Sorry about that.

comment:6 Changed 12 years ago by erikbosman

SVN-trunk has the same problem :-(, here's a log:

http://pi.homeunix.net/tor/debug.log-0.2.0.0-rev10186.bz2

comment:7 Changed 12 years ago by arma

Ok, I can replicate this. Thanks!

Actually, svn-trunk should bootstrap eventually. Leave it alone for 10 or 20
minutes and see what happens. :)

comment:8 Changed 12 years ago by nickm

The bug where it requests 2 descriptors, then requests immediately again after failing without backing off, might
be the same as the bug I fixed in r10210. If so, it should be backported into 0.1.2.14.

EDIT: Backported; appears in 0.1.2.14.

comment:9 Changed 12 years ago by nickm

Roger: can you characterize the bad behavior from the remaining part of the bug? What's going wrong here?

comment:10 Changed 12 years ago by arma

I think I fixed most of this in 0.1.2.14:

  • If a directory authority is down, skip it when deciding where to get networkstatus objects or descriptors. Otherwise we keep asking every 10 seconds forever. Fixes bug 384.
  • Count it as a failure if we fetch a valid network-status but we don't want to keep it. Otherwise we'll keep fetching it and keep not wanting to keep it. Fixes part of bug 422.
  • If all of our dirservers have given us bad or no networkstatuses lately, then stop hammering them once per minute even when we think they're failed. Fixes another part of bug 422.

Erik, can you trigger the problems in 0.1.2.14 still?

comment:11 Changed 12 years ago by erikbosman

I can confirm that the problem does not occur with 0.1.2.14,
and the datadir that caused problems earlier.

comment:12 Changed 12 years ago by arma

Unfortunately, I think that's because dizum is running again -- so you
can bootstrap, but it would have worked before too. :)

But none the less, I think we should close this bug, and if people find
similar problems in the future, with 0.1.2.14 and later, it will probably
be for a new reason.

Thanks!

comment:13 Changed 12 years ago by arma

flyspray2trac: bug closed.

comment:14 Changed 7 years ago by nickm

Component: Tor RelayTor
Note: See TracTickets for help on using tickets.