Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#2183 closed defect (fixed)

router_rebuild_descriptor(): Bug: Couldn't generate extra-info descriptor

Reported by: Falo Owned by: karsten
Priority: Medium Milestone: Tor: 0.2.2.x-final
Component: Core Tor/Tor Version: Tor: 0.2.2.17-alpha
Severity: Keywords: tor-relay
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Some days ago Karsten L. asked me to enable statistic generation again. I did and now one Tor process is logging errors. It appears Bug #1208 is still present in 0.2.2.17 on Debian:

Nov 12 22:52:06.340 [warn] router_rebuild_descriptor(): Bug: Couldn't generate extra-info descriptor.
Nov 12 22:52:06.341 [warn] couldn't find end of hashed material "
router-signature"

Child Tickets

Attachments (1)

all-blutmagie-statsdir.tar (180.0 KB) - added by Falo 9 years ago.
dated Sat Nov 13 21:20 CET 2010

Download all attachments as: .zip

Change History (30)

comment:1 Changed 9 years ago by Falo

affected node is blutmagie1

comment:2 Changed 9 years ago by karsten

Owner: set to karsten
Status: newaccepted

Before investigating this in detail, can you make backups of your stats/ directory, or even better, of your complete data directory? It looks like we have 10 hours to do this before the files will be overwritten.

The next step would be to upload a tarball of the stats/ directory. Everything in there would be included in the extra-info descriptors anyway and is not secret. If you're unsure, mail the tarball to me, and I'll check it and put it online.

comment:3 Changed 9 years ago by karsten

I looked up blutmagie's published server descriptors and extra-info descriptors of the past few days. I added two newlines for time intervals when I'd expect further descriptor publications:

      published      | descriptor | extrainfo | uptime |  dirreq-stats-end
---------------------+------------+-----------+--------+--------------------
 2010-11-06 08:37:00 |  94194858  |  1935209c | 209963 |
 2010-11-07 02:37:43 |  df049947  |  4f4e64c0 | 274806 |
 2010-11-07 20:38:26 |  96e74d8f  |  86f0409c | 339649 |
 2010-11-08 08:30:02 |  f01ab7f0  |  9fa08e1f |      0 |
 2010-11-08 08:32:13 |  b08a4146  |  ef2bab73 |     61 |
 2010-11-08 08:34:15 |  c3d259df  |  97300d82 |    183 |
 2010-11-08 08:54:35 |  c094e070  |  31714386 |   1403 |
 2010-11-08 14:35:20 |  81303aa6  |  9dad28ed |  21848 |
          ???
 2010-11-10 08:31:07 |  2c8b1095  |  65741633 | 172794 | 2010-11-10 08:31:04
 2010-11-11 02:31:10 |  93796a4c  |  493d8bfb | 237598 | 2010-11-10 08:31:04
          ???
 2010-11-13 08:31:12 |  cc259d41  |  7698faff | 431999 | 2010-11-13 08:31:04
 2010-11-13 10:04:07 |  903dbfa7  |  d2a671fd | 437575 | 2010-11-13 08:31:04
 2010-11-13 10:49:52 |  de60d392  |  e771cb58 | 440320 | 2010-11-13 08:31:04

Here's what I think happened:

  • 2010-11-08 08:30:02: You restarted blutmagie (09:30 CET).
  • 2010-11-09 08:30:02: In theory, the first stats should have been written to disk 24 hours after restarting. But blutmagie didn't publish a descriptor between 2010-11-08 14:35:20 and 2010-11-10 08:31:07. It should have published a descriptor at around 2010-11-09 08:35:20, 18 hours after the last descriptor. I guess it tried to publish a descriptor containing the first stats, but failed.
  • 2010-11-10 08:31:04: blutmagie wrote the second 24 hours of stats to disk and successfully included them only 3 seconds later in a descriptor.
  • 2010-11-11 08:31:04: blutmagie should have written its third day of stats to disk. But again, no descriptor.
  • 2010-11-12 08:31:04: blutmagie should have written its fourth day of stats to disk. But again, no descriptor.
  • 2010-11-12 21:52:06: blutmagie printed out the warning that you pasted above. It still tried to include the fourth day of stats, but fails.
  • 2010-11-13 08:31:04: blutmagie finishes its fifth day of stats, writes them to disk, and successfully includes them in the descriptor published at 2010-11-13 08:31:12.

Can you look at your logs to see if there are other warnings than the one you reported?

comment:4 Changed 9 years ago by karsten

Here's a theory: With all the dirreq-stats, entry-stats, and exit-stats turned on, blutmagie generates an extra-info descriptor that is larger than 50 KB, which is the maximum descriptor size that the directory authorities accept and therefore the maximum size that relays generate (MAX_EXTRAINFO_UPLOAD_SIZE 50000).

The quick fix would be to turn off ExitStatistics which takes most of the space.

But before doing that, I'd like to confirm this theory (or find out what else could be the bug) using blutmagie's stats files.

comment:5 Changed 9 years ago by Falo

the same problem currently occurs on blutmagie3. It vanished without manual intervention on blutmagie1 Nov 13 09:31:05. Blutmagie1 and blutmagie3 are directory mirrors, blutmagie2 and blutmagie4 aren't. I'll attach all four stats data directories to this ticket.

So far I didn't burn the corpse meaning no process has been restarted and no stats files have been deleted.

Changed 9 years ago by Falo

Attachment: all-blutmagie-statsdir.tar added

dated Sat Nov 13 21:20 CET 2010

comment:6 Changed 9 years ago by Sebastian

If the theory is right another bug would be that there isn't a good warning that we trigger when an extrainfo desc we produced is too large (and/or that we take action to make it smaller and thus acceptable)

comment:7 Changed 9 years ago by Falo

the error logs from this issue are to large for upload to the bug tracker. They are here: http://selke.de/pics/all-tor-logs-pipe-grep-Bug.gz

I just noticed the bug stroke on blutmagie2, too. Thus it doesn't appear to be directory mirror related.

comment:8 Changed 9 years ago by karsten

Okay, I'm pretty certain that it's the 50 KB limit that we're triggering here. The exit-stats files of blutmagie2-4 are 41K, 45K, and 41K large. (blutmagie's exit-stats file is only 1K large, but that's because it didn't publish a descriptor and was therefore not online in the 24 hours measurement period.) Add a few KB from the other stats, including the byte histories, and we have more than 50 KB.

Olaf, as a quick fix, please change ExitStatistics to 0 on all blutmagies and send the Tor processes a HUP signal. Really sorry for the inconvenience!

We can still improve the log message and shrink the stats later on.

comment:9 Changed 9 years ago by karsten

Status: acceptedneeds_review

See branch fix2183 in my public repository for a possible fix.

comment:10 Changed 9 years ago by Falo

ExitPortStatistics is set to 0 again and processes are sighuped.

comment:11 Changed 9 years ago by Falo

after turning off ExitPortStatistics yesterday the problem didn't occur any more.

comment:12 Changed 9 years ago by rransom

See mitigate2183 ( git://repo.or.cz/tor/rransom.git mitigate2183 ) for patches to make relays generate (and publish) router descriptors even if they cannot generate their extra-info descriptors.

comment:13 Changed 9 years ago by karsten

See updated branch fix2183 in my public repository. This patch takes the (hopefully more robust) approach of joining smartlist elements to generate the extra-info descriptor.

comment:14 in reply to:  12 ; Changed 9 years ago by karsten

Replying to rransom:

See mitigate2183 ( git://repo.or.cz/tor/rransom.git mitigate2183 ) for patches to make relays generate (and publish) router descriptors even if they cannot generate their extra-info descriptors.

I think the approach to publish a router descriptor even if we cannot generate an extra-info descriptor is good.

A quick comment to your patch: You could add a parameter has_extra_info_digest to router_dump_router_to_string() instead of iterating over all bytes of extra_info_digest to find out if it contains a non-zero value.

Also, how about using a new variable for the extra-info-digest line instead of the "%s%s%s%s" part? Other than that, looks good.

comment:15 in reply to:  13 ; Changed 9 years ago by rransom

Replying to karsten:

See updated branch fix2183 in my public repository. This patch takes the (hopefully more robust) approach of joining smartlist elements to generate the extra-info descriptor.

Your patch uses log_err to report conditions that will not crash Tor.

From src/common/torlog.h:

/** Error-level severity: for messages that only appear when something has gone
 * very wrong, and the Tor process can no longer proceed. */
#define LOG_ERR     3

We are considering handling messages with level LOG_ERR specially in response to the current fix for #1125 -- in particular, the process of logging error messages may itself destabilize or crash the Tor process, in an attempt to report a useful error message to Windows Tor users who can only read log messages through Vidalia.

comment:16 Changed 9 years ago by karsten

As a solution to avoid exceeding the 50 KB limit, how about we stop writing ExitPortStatistics to extra-info descriptors? We could instead append them to a local file only, so that they are still available if someone wants them.

I'd like to hear Steven's opinion here, because he was the reason why we implemented these stats a year ago.

comment:17 in reply to:  15 Changed 9 years ago by karsten

Replying to rransom:

Your patch uses log_err to report conditions that will not crash Tor.

You're right. Fixed. Thanks!

comment:18 in reply to:  14 Changed 9 years ago by rransom

Replying to karsten:

Replying to rransom:

See mitigate2183 ( git://repo.or.cz/tor/rransom.git mitigate2183 ) for patches to make relays generate (and publish) router descriptors even if they cannot generate their extra-info descriptors.

I think the approach to publish a router descriptor even if we cannot generate an extra-info descriptor is good.

A quick comment to your patch: You could add a parameter has_extra_info_digest to router_dump_router_to_string() instead of iterating over all bytes of extra_info_digest to find out if it contains a non-zero value.

That flag belongs in the signed_descriptor_t record. I wanted to get a patch out quickly to publish the router descriptor even if no extra-info descriptor can be published. Adding a has_extra_info_digest field to signed_descriptor_t will take more time both to write and to review.

Also, how about using a new variable for the extra-info-digest line instead of the "%s%s%s%s" part? Other than that, looks good.

Using three conditional expressions in a row in the arguments to a printf-like function:

  • takes less programmer time,
  • takes less source-code space, and
  • is harder to screw up in a very harmful way (i.e. buffer overflow)

than allocating a temporary string variable, whether on the stack (as a local char-array variable) or in the heap (using tor_malloc_zero and the associated free function), filling the temporary string variable, and feeding the temporary string to printf.

comment:19 Changed 9 years ago by Sebastian

We should definitely fix this; but we don't have to rush a solution here.

Specifying a more resilient format for extra-info documents making sure that we can't generate them too large seems to be preferable instead of just not having the statistics available

comment:20 Changed 9 years ago by karsten

To recap, we have 3 issues and 3 (partial) solutions here:

  • Relays don't realize when they are generating extra-info descriptors with statistics that exceed the 50 KB limit. This problem is fixed by my branch fix2183. Needs review.
  • Relays that fail to generate an extra-info descriptor for whatever reason also fail to generate a server descriptor. They just print out a warning and retry a couple of milliseconds later. rransom wrote a patch in his mitigate2183 branch that at least generates a server descriptor in this case and that, IMO, needs more work. Sebastian mentioned that failing may also be an option if we can't generate an extra-info descriptor, or we'll never learn about this bug. Needs discussion.
  • Very fast relays generate exit-port statistics that require 40-45 KB of extra-info descriptors. We don't have a fix yet, but I proposed to exclude exit-port statistics from extra-info descriptors. We're waiting for a word from Steven who wanted these statistics a year ago.

I'm opening new tickets for issues 2 and 3. From now on, please only comment on issue 1 in this task.

comment:21 Changed 9 years ago by nickm

reviewing "fix2183":

  • Probably, we should esc_for_log any descriptors before logging them.... but logging them is probably not the best idea anyway, since our log buffer truncates after 10024 bytes.
  • No need to change this, but in the future, stuff like taking out that "opt " should really get its own commit.
  • I prefer having bridge_stats declared in the only block where it's used; there's no real reason to have it at the head of the function.
  • The definition of SIG_LEN really belongs near the declaration of router_append_dirobj_signature(), and wants to be called something more like DIROBJ_MAX_SIG_LEN .
  • Rather than dropping all stats when we exceed 50k, would it be at all useful to remove elements from the end of the chunks smartlist, one by one, until we are under 50k again?

comment:22 in reply to:  21 Changed 9 years ago by karsten

Replying to nickm:

reviewing "fix2183":

  • Probably, we should esc_for_log any descriptors before logging them.... but logging them is probably not the best idea anyway, since our log buffer truncates after 10024 bytes.

I took them out.

  • No need to change this, but in the future, stuff like taking out that "opt " should really get its own commit.

Ooops. You're right. I changed even more stuff, but then decided it was a bad idea and changed it back. This change somehow slipped through.

  • I prefer having bridge_stats declared in the only block where it's used; there's no real reason to have it at the head of the function.

Changed. As a Java programmer I prefer to declare variables where I use them, but my understanding was that it's good C style to declare variables at the top of functions. In general, should I try to move declarations to blocks, if possible?

  • The definition of SIG_LEN really belongs near the declaration of router_append_dirobj_signature(), and wants to be called something more like DIROBJ_MAX_SIG_LEN .

Moved and renamed. I also found another place where 256 bytes were left free for a signature, so I changed the value from 250 to 256.

  • Rather than dropping all stats when we exceed 50k, would it be at all useful to remove elements from the end of the chunks smartlist, one by one, until we are under 50k again?

Done.

Thanks for your comments! Please see my updated branch fix2183.

comment:23 Changed 9 years ago by nickm

In general, should I try to move declarations to blocks, if possible?

As you write new code, yes, if it helps readability and doesn't mess up control flow.

Done.

Hm. The - 2 logic threw me for a bit, but now I get it. Looks good.

comment:24 Changed 9 years ago by nickm

Milestone: Tor: 0.2.2.x-final

comment:25 Changed 9 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

ok, still looks good. merged to 0.2.2 and master.

comment:26 Changed 9 years ago by mikeperry

So what is the effect of removing things from the end of the smartlist? What types of elements can we expect to see removed if the size is over 50KB?

I also really needed to see the exit port byte totals. This is especially important to determining if we should weight certain exits differently than others based on exit policy. The only way we can tell how we should do this weighting is if we see these stats from very fast exits with the default exit policy, which are precisely the exits that are going to have very large extra info descriptors.

Why can't we just be smarter about grouping more of the long tail of ports with low bytes counts into other? This should reduce the size of the exit-kibibyes-* lines.

comment:27 in reply to:  26 Changed 9 years ago by karsten

Replying to mikeperry:

So what is the effect of removing things from the end of the smartlist? What types of elements can we expect to see removed if the size is over 50KB?

The current order in the smartlist is dirreq-stats, entry-stats, cell-stats, and then exit-stats. That means exit-stats will be removed first. This order is purely implementation-specific and not specified anywhere. This is a bug case and a fix to prevent relays from breaking. We shouldn't be in a situation where we want to write 50 KB of stats.

I also really needed to see the exit port byte totals. This is especially important to determining if we should weight certain exits differently than others based on exit policy. The only way we can tell how we should do this weighting is if we see these stats from very fast exits with the default exit policy, which are precisely the exits that are going to have very large extra info descriptors.

You should be able to sum up all ports plus the 'other' port to learn about totals.

Why can't we just be smarter about grouping more of the long tail of ports with low bytes counts into other? This should reduce the size of the exit-kibibyes-* lines.

We can do that. See the discussion in #2196 for ideas to reduce the exit-stats size.

comment:28 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:29 Changed 7 years ago by nickm

Component: Tor RelayTor
Note: See TracTickets for help on using tickets.