Opened 6 years ago

Last modified 6 months ago

#7164 needs_revision defect

microdesc.c:378: Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1

Reported by: jaj123 Owned by: nickm
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.4.19
Severity: Normal Keywords: tor-client, 025-backport, nickm-should-review, review-group-24, 033-triage-20180320, 033-removed-20180320
Cc: asn Actual Points:
Parent ID: Points: 2
Reviewer: asn Sponsor:

Description

Oct 20 21:14:18.594 [Warning] microdesc_free(): Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1

Child Tickets

Attachments (1)

ticket7164.tails.17.1.log (3.5 KB) - added by cypherpunks 5 years ago.
complete log of seeing this bug in tails 17.1 after setting the clock


Change History (107)

comment:1 Changed 6 years ago by arma

Component: - Select a component → Tor
Milestone: Tor: 0.2.3.x-final

Exciting!

Which Tor version?

What bundle were you using? What OS?

Were you a client or a relay or something else?

Any other configuration changes or hints?

comment:2 Changed 6 years ago by arma

See #6429, #6494, #3153, #4118 for related tickets.

comment:3 in reply to:  1 Changed 6 years ago by jaj123

Replying to arma:

Exciting!

Which Tor version?

What bundle were you using? What OS?

Were you a client or a relay or something else?

Any other configuration changes or hints?

Tor version 0.2.3.22, Vidalia 0.2.20. No idea about the bundle; all the Tor and Vidalia packages were installed via the Synaptic package manager. OS: Ubuntu 64-bit 12.04 LTS, updated regularly, with all possible Unity software removed. I use the OS's default proxies (perhaps also related to Firefox somehow).

I'm really a Tor newbie: I have now been running a relay for about two months. On my previous main system (also a 64-bit Linux machine) I tried to run Tor, but it just kept crashing, so I gave up.

comment:4 Changed 6 years ago by arma

Priority: normal → major
Summary: Tor's request to send a bug report → Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1

On the previous system, when it 'kept crashing', do you have any other hints for us?

comment:5 Changed 6 years ago by jaj123

When my old Tor crashed, it seemed random: sometimes it took 5 minutes, sometimes an hour. That was about 1½ years ago. I kept no records; I just thought I'd give Tor another try later.

I installed Ubuntu's updates suggested by the Update Manager some 8 hours ago. Tor and Vidalia went completely down, as did Firefox (it had Tor enabled). I had to remove Firefox completely and re-install it to get it up. Vidalia and Tor are now up and running, but I had to start them as root (sudo was not enough). While trying, the whole system crashed several times, but recovered without any tricks, creating automatic bug reports for the Ubuntu team. I think the latter is Ubuntu's bug: it just would not let me start anything with a Gnome GUI as root, and as a result it even told me: "...root does not exist, cant' do anything without it" (sic!). Please keep me informed, there might be some serious bugs spreading around.

comment:6 Changed 6 years ago by nickm

Keywords: tor-client added

comment:7 Changed 6 years ago by jaj123

Update: latest message from the Vidalia message log: "Oct 24 13:57:27.610 [Notice] Heartbeat: Tor's uptime is 2 days 17:53 hours, with 36 circuits open. I've sent 10.95 GB and received 2.09 GB." Quite a lot sent compared to received, maybe even a security failure. Only the relay has been running after the crash. I generally use the client services very little; basically I have only tested whether they work. Please keep me informed if there's anything new I should do in order to keep my relay running as it should. I've been away for a couple of days, so I could not report the current status earlier.

comment:8 Changed 6 years ago by nickm

It would be great to have a stack trace from the line in microdesc.c that's giving this message. Do you know how to use gdb?

comment:9 Changed 6 years ago by jaj123

Sorry, I don't know how to use gdb. With your help, I'm willing to try.

comment:10 Changed 6 years ago by jaj123

Update: Tor went down, cleanly by itself, on 27.10, reporting (roughly): Interrupt, not accepting new connections, shutting down in 30 seconds. Any advice?

comment:11 Changed 6 years ago by nickm

The "Interrupt, not accepting new connections, shutting down in 30 seconds" thing is what happens if Tor received a SIGINT signal, or a control-c at the command line. It usually doesn't indicate a bug.

comment:12 Changed 6 years ago by jaj123

Thanks nickm. To my surprise, Tor and Vidalia are now working as I expect: they started from the "onion" button in Gnome when I was logged in as a normal user. Not a single warning or error message; the latest: "Oct 31 13:22:33.228 [Notice] Heartbeat: Tor's uptime is 22:54 hours, with 15 circuits open. I've sent 810.66 MB and received 790.18 MB." There has been an update by the Ubuntu Update Manager: the current version is 0.2.3.24-rc. The one that I had problems with was 0.2.3.22-rc. Problem solved?

comment:13 Changed 6 years ago by jaj123

It happened again: Nov 06 20:00:46.279 [Warning] microdesc_free(): Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1. Latest normal log: Nov 06 19:22:33.228 [Notice] Heartbeat: Tor's uptime is 7 days 4:54 hours, with 58 circuits open. I've sent 16.31 GB and received 16.02 GB. I have no idea what's happening. Please help, I want to run Tor without causing any trouble to other users.

comment:14 in reply to:  13 Changed 6 years ago by arma

Replying to jaj123:

I want to run Tor without causing any trouble to other users.

For what it's worth, seeing this message is probably harmless. That is, we should track it down, but it's probably not causing any trouble to other users.

comment:15 Changed 6 years ago by kencf0618

Tor 0.2.3.25; I don't recall offhand which bundle I'm using, but I keep them up to date; Ubuntu Linux 12.04.1 LTS; running as a non-exit relay. This is the first time I've encountered this particular error in my years of running Tor. The only thing I've been doing differently lately is running EtherApe.

comment:16 Changed 6 years ago by cypherpunks

I've also caught this in the wild. Running Trisquel GNU/Linux, Tor version 0.2.3-1~oneric+1, installed from the Tor PPA.

comment:17 Changed 6 years ago by nickm

Milestone: Tor: 0.2.3.x-final → Tor: 0.2.4.x-final

comment:18 Changed 6 years ago by nickm

Status: new → needs_review

I looked over all the microdesc_free() calls again, and none seems super likely here. Maaaybe the one in microdesc_cache_clean? But how would a currently live node reference a microdescriptor that was last listed over 7 days ago? And could something else be going on?

In an attempt to track this down, I did a quick patch to log the fname:lineno that's invoking microdesc_free(). See branch "bug7164_diagnostic" in my public repo. The branch is against 0.2.4.
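For readers unfamiliar with the technique nickm describes, the usual way in C to report the caller's fname:lineno is to wrap the free function in a macro that forwards __FILE__ and __LINE__ to an implementation function. The following standalone sketch shows the pattern only; the struct fields and function names are simplified stand-ins, not the code in the bug7164_diagnostic branch.

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for microdesc_t; the real struct lives in microdesc.h. */
typedef struct microdesc_t {
  unsigned int held_by_nodes;
  char *body;
} microdesc_t;

/* The implementation takes an extra file/line pair so the warning can say
 * which call site tried to free a still-referenced descriptor. */
static void
microdesc_free_impl(microdesc_t *md, const char *fname, int lineno)
{
  if (!md)
    return;
  if (md->held_by_nodes) {
    fprintf(stderr,
            "%s:%d: Bug: microdesc_free() called, but md was still "
            "referenced by %u node(s)\n",
            fname, lineno, md->held_by_nodes);
    return; /* refuse to free while something still points at it */
  }
  free(md->body);
  free(md);
}

/* Callers keep writing microdesc_free(md); the macro adds the location. */
#define microdesc_free(md) microdesc_free_impl((md), __FILE__, __LINE__)

int
main(void)
{
  microdesc_t *md = calloc(1, sizeof(*md));
  md->held_by_nodes = 1;      /* simulate a node still holding a reference */
  microdesc_free(md);         /* prints the file:line of this call site */
  md->held_by_nodes = 0;
  microdesc_free(md);         /* now actually freed */
  return 0;
}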

comment:19 in reply to:  18 Changed 6 years ago by andrea

Replying to nickm:

I looked over all the microdesc_free() calls again, and none seems super likely here. Maaaybe the one in microdesc_cache_clean? But how would a currently live node reference a microdescriptor that was last listed over 7 days ago? And could something else be going on?

In an attempt to track this down, I did a quick patch to log the fname:lineno that's invoking microdesc_free(). See branch "bug7164_diagnostic" in my public repo. The branch is against 0.2.4.

I think it's fine to merge this diagnostic branch.

comment:20 Changed 6 years ago by nickm

Status: needs_review → needs_information

Okay, merged the diagnostic branch into 0.2.4. If somebody hits this, we can figure out the real fix, I hope.

comment:21 Changed 6 years ago by nickm

Closed #8556 as a duplicate of this.

Changed 5 years ago by cypherpunks

Attachment: ticket7164.tails.17.1.log added

complete log of seeing this bug in tails 17.1 after setting the clock

comment:22 Changed 5 years ago by arma

Closed #8871 as a duplicate of this too.

comment:23 Changed 5 years ago by tryy3

I got the same error.
Tor version: 0.2.35
Vidalia version: 0.2.21
OS: Windows 7
Bundle: I installed the "Vidalia Relay Bundle" but I changed it in the settings to an exit relay!
I am a client and a relay, mostly a relay; I use the client once in a while. I got it running but don't use it as much as the relay!

I get the error about 10 times every hour!
Gonna update to 0.2.4 to test!

comment:24 Changed 5 years ago by tryy3

0.2.4 fixed it!

comment:25 Changed 5 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final

comment:26 in reply to:  20 Changed 5 years ago by arma

Replying to nickm:

Okay, merged the diagnostic branch into 0.2.4. If somebody hits this, we can figure out the real fix, I hope.

See #10638 and #9813 for examples of 0.2.4 clients that hit it.

src/or/microdesc.c:378 and src/or/microdesc.c:379 respectively.

comment:27 Changed 5 years ago by arma

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final
Status: needs_information → new
Summary: Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1 → microdesc.c:378: Bug: microdesc_free() called, but md was still referenced 1 node(s); held_by_nodes == 1
Version: Tor: 0.2.3.22-rc → Tor: 0.2.4.19

#10630 also saw that line

comment:28 Changed 5 years ago by arma

Nick: just to make it explicit, it looks like 0.2.4.x clients *are* seeing this, and the new line numbers in the debug logs should give us our hints.

comment:29 Changed 5 years ago by Envite

I have just found this bug again in RT#22370

I have asked the user for exact version information and OS version information.

[sex 14. mar 19:32:52 2014] Tor Software Error - The Tor software
encountered an internal bug. Please report the following error message to
the Tor developers at bugs.torproject.org: "microdesc_free(): Bug:
microdesc_free() called, but md was still referenced 1 node(s);
held_by_nodes == 1
"

comment:30 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final
Status: new → needs_review

The line number suggests that this is happening in microdesc_cache_clean():

  for (mdp = HT_START(microdesc_map, &cache->map); mdp != NULL; ) {
    if ((*mdp)->last_listed < cutoff) {
      ++dropped;
      victim = *mdp;
      mdp = HT_NEXT_RMV(microdesc_map, &cache->map, mdp);
      victim->held_in_map = 0;
      bytes_dropped += victim->bodylen;
      microdesc_free(victim);
    } else {
      ++kept;
      mdp = HT_NEXT(microdesc_map, &cache->map, mdp);
    }
  }

So, there's a microdesc that is (probably) held by a node, but its last-listed is more than one week ago. Interesting!

In theory:

  • A node should not exist unless it has a routerstatus or a routerinfo.
  • A node should not have a microdescriptor unless it has a routerstatus.
  • Whenever a networkstatus is loaded, we should be updating the last_listed field of the microdescriptors.

So something has gone wrong with the theory.

I'm not too sure what -- if somebody has ideas, that would be great. I've tried to write an improved diagnostic branch. Please review "bug7164_diagnose_harder" in my public repository. It's more logs, not a fix.
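The cross-check that the diagnostic performs ("held_by_nodes says N; how many nodes actually point at this md?") can be modelled in a few lines of standalone C. The structs below are deliberately simplified stand-ins for Tor's node_t and microdesc_t, and count_nodes_holding() is an illustrative analogue of the node-list walk, not the real nodelist_find_nodes_with_microdesc().

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins; Tor's real node_t and microdesc_t carry much more. */
typedef struct microdesc_t {
  unsigned int held_by_nodes; /* reference count maintained by the nodelist */
  long last_listed;           /* when a consensus last listed this md */
} microdesc_t;

typedef struct node_t {
  const microdesc_t *md;      /* NULL if the node has no microdescriptor */
} node_t;

/* Count how many nodes actually point at the given md, the way the
 * "I found %d node(s) holding it" diagnostic does. */
static unsigned int
count_nodes_holding(const node_t *nodes, size_t n_nodes, const microdesc_t *md)
{
  unsigned int found = 0;
  for (size_t i = 0; i < n_nodes; ++i) {
    if (nodes[i].md == md)
      ++found;
  }
  return found;
}

int
main(void)
{
  microdesc_t md = { .held_by_nodes = 1, .last_listed = 0 };
  node_t nodes[3] = { { NULL }, { &md }, { NULL } };

  unsigned int found = count_nodes_holding(nodes, 3, &md);
  if (found != md.held_by_nodes)
    printf("refcount mismatch: counter says %u, walk found %u\n",
           md.held_by_nodes, found);
  else
    printf("counter and node walk agree: %u holder(s)\n", found);
  return 0;
}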

comment:31 Changed 5 years ago by Envite

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.4.x-final

RT#22370 happens in Tor 0.2.3.25

comment:32 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-final → Tor: 0.2.5.x-final

Please don't mess with the milestones. The "024-backport" tag is what means "backport this to 0.2.4 when it's done". All bugfixes for 0.2.4 or 0.2.3 need to get tested in 0.2.4 (the latest branch) before they can get backported.

comment:33 in reply to:  30 Changed 4 years ago by andrea

Replying to nickm:

The line number suggests that this is happening in microdesc_cache_clean():

  for (mdp = HT_START(microdesc_map, &cache->map); mdp != NULL; ) {
    if ((*mdp)->last_listed < cutoff) {
      ++dropped;
      victim = *mdp;
      mdp = HT_NEXT_RMV(microdesc_map, &cache->map, mdp);
      victim->held_in_map = 0;
      bytes_dropped += victim->bodylen;
      microdesc_free(victim);
    } else {
      ++kept;
      mdp = HT_NEXT(microdesc_map, &cache->map, mdp);
    }
  }

So, there's a microdesc that is (probably) held by a node, but its last-listed is more than one week ago. Interesting!

In theory:

  • A node should not exist unless it has a routerstatus or a routerinfo.
  • A node should not have a microdescriptor unless it has a routerstatus.
  • Whenever a networkstatus is loaded, we should be updating the last_listed field of the microdescriptors.

So something has gone wrong with the theory.

I'm not too sure what -- if somebody has ideas, that would be great. I've tried to write an improved diagnostic branch. Please review "bug7164_diagnose_harder" in my public repository. It's more logs, not a fix.

Doesn't this also change the behavior?

-    if ((*mdp)->last_listed < cutoff) {
+    const int is_old = (*mdp)->last_listed < cutoff;
+    const unsigned held_by_nodes = (*mdp)->held_by_nodes;
+    if (is_old && !held_by_nodes) {

The test that would match the old behavior would be just if (is_old), surely?

comment:34 Changed 4 years ago by andrea

Also, the wording of the log messages ("Microdescriptor seemed very old (last listed %d hours ago vs %d hour cutoff), but is still marked as being held by %d node(s). I found %d node(s) holding it.") suggests you only want to emit for old microdescriptors, but the enclosing test is just if (held_by_nodes) rather than if (is_old && held_by_nodes). Am I missing something here?

comment:35 Changed 4 years ago by andrea

I think the actual logging code is okay as long as we're sure the tests are right; making them over-eager means a lot of searching the whole list of nodes in nodelist_find_nodes_with_microdesc() whenever we run microdesc_cache_clean()

comment:36 Changed 4 years ago by nickm

The test that would match the old behavior would be just if (is_old), surely?

The old behavior was to call "microdesc_free()" if the microdescriptor was old... and then the warning would be produced because held_by_nodes was nonzero, and we wouldn't free the thing. The change here has the old behavior in the non-warning case, and the new behavior in the case where we would have given a warning.

Also, the wording of the log messages ("Microdescriptor seemed very old (last listed %d hours ago vs %d hour cutoff), but is still marked as being held by %d node(s). I found %d node(s) holding it.") suggests you only want to emit for old microdescriptors, but the enclosing test is just if (held_by_nodes) rather than if (is_old && held_by_nodes). Am I missing something here?

Oops. Yeah, adding a fixup commit. Better now?
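Restated as a standalone toy model (not the actual diff under review): under the old test the free is attempted for every old md, and the refusal inside the free function is what produces the warning; under the new test an old-but-still-referenced md is skipped and merely diagnosed.

#include <stdio.h>

typedef struct { unsigned held_by_nodes; int is_old; } md_t;

/* Like microdesc_free(): warns and refuses if the md is still referenced. */
static void toy_free(md_t *md) {
  if (md->held_by_nodes) {
    printf("Bug: free called, but md still referenced by %u node(s)\n",
           md->held_by_nodes);
    return;
  }
  printf("freed\n");
}

int main(void) {
  md_t md = { .held_by_nodes = 1, .is_old = 1 };

  /* Old behavior: try to free whenever the md is old -> warning fires. */
  if (md.is_old)
    toy_free(&md);

  /* New behavior: free only when old and unreferenced; otherwise just
   * log diagnostics about who is still holding it. */
  if (md.is_old && !md.held_by_nodes)
    toy_free(&md);
  else if (md.is_old)
    printf("diagnostic: old md held by %u node(s), not freeing\n",
           md.held_by_nodes);
  return 0;
}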

comment:37 Changed 4 years ago by nickm

Keywords: 025-triaged added
22:12 < nickm> athena: okay to merge now IYO?
22:12 < athena> yeah

I should squash and merge RSN then.

comment:38 Changed 4 years ago by nickm

Resolution: fixed
Status: needs_review → closed

Squashed and merged; thanks, all!

comment:39 Changed 4 years ago by nickm

Resolution: fixed
Status: closed → reopened

comment:40 Changed 4 years ago by nickm

Status: reopened → needs_information

Oops, that should have gone into needs_information. That was a diagnostic message improvement, not a total fix.

comment:41 Changed 4 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.???

Putting this in 0.2.???; they should get moved back into a more timed milestone once they're no longer in needs_information.

comment:42 Changed 4 years ago by arma

Should we sneak in a log severity downgrade in 0.2.4.21?

It's my understanding that the more useful log message change went into 0.2.5, leaving 0.2.4 users with logs that aren't useful to us but that still yell at them.

I was reminded of this idea from #11668.

comment:43 Changed 4 years ago by nickm

Should we sneak in a log severity downgrade in 0.2.4.21?

(ITYM 0.2.4.22.) Sure, that's fine with me.

comment:44 in reply to:  43 Changed 4 years ago by nickm

Replying to nickm:

Should we sneak in a log severity downgrade in 0.2.4.21?

(ITYM 0.2.4.22.) Sure, that's fine with me.

Done as 6a4f5d9b4df1d05d1b158417cfcc85a547ad1549 in 0.2.3 and 0.2.4

comment:45 Changed 4 years ago by cypherpunks

It takes 7 days to trigger the bug:
day 0: the user launches Tor, fetches an md (which later fails), and caches it to disk with last_listed = 0;
day 1: user online; restarting, shutting down, etc.;
day 2: user online; the relay is still announcing the same keys as the md from the cache, so the cached md does not change;
day 3: the same;
day 4: the same;
day 5: user offline;
day 6: user offline;

day 7: the user starts Tor; the cached consensus is several days old; the md's last_listed stored on disk is zero; update_microdescs_from_networkstatus() can't update the last_listed field in memory anymore; if microdesc_cache_clean() runs before a new consensus arrives and "last_listed < today - 7 days", the bug triggers.
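The arithmetic behind this theory is easy to show in isolation. A minimal sketch, assuming a 7-day tolerance like the one microdesc_cache_clean() applies; the constant name below is made up for illustration.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for Tor's microdescriptor-age tolerance (7 days). */
#define MD_TOLERATED_AGE_SECS (7 * 24 * 60 * 60)

int
main(void)
{
  time_t now = time(NULL);
  time_t cutoff = now - MD_TOLERATED_AGE_SECS;

  /* If last_listed was never refreshed after loading from disk, it can be 0
   * (or simply stale); any such value is far below the cutoff, so the
   * cleaner tries to free the md even though a node still references it. */
  time_t last_listed = 0;

  if (last_listed < cutoff)
    printf("md looks %ld hours past the cutoff; the cleaner will try to free it\n",
           (long)((cutoff - last_listed) / 3600));
  return 0;
}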

comment:46 in reply to:  45 Changed 4 years ago by cypherpunks

Replying to cypherpunks:

It takes 7 days to trigger the bug:

That was a wrong idea. It's impossible to clean the md cache without a reasonably live consensus:

  if (! force &&
      ! networkstatus_get_reasonably_live_consensus(now, FLAV_MICRODESC))
      return;

No live consensus, no call to microdesc_free().

comment:47 Changed 4 years ago by cypherpunks

Corruption of nodelist_map could explain this bug.

  for (iter = HT_START(nodelist_map, &the_nodelist->nodes_by_id); iter; ) {

The loop probably ends before the affected node->md gets fixed.
That corruption probably doesn't affect HT_FIND or HT_INSERT, because the values of
(head)->hth_table[HT_ELT_HASH_(elm,field,hashfn) % head->hth_table_length] are not broken.
The corruption probably happens during HT_GROW or something.

It's only a theory, but it could be confirmed by the new logs: node->rs will be NULL if nodelist_map was corrupted.

comment:48 Changed 4 years ago by cypherpunks

Corruption of nodelist_map

Just for clarity: all the hashtables used by Tor use the general memory-allocation functions (malloc, realloc). The result of malloc or realloc is checked, but only at the low level (by HT_GROW), and the result of HT_GROW is never checked (it returns -1 if something went wrong, and 0 both when everything is OK and when the table is over its fill limit, which is also wrong).

Last edited 4 years ago by cypherpunks

comment:49 Changed 4 years ago by cypherpunks

Just for clarity.
the result of HT_GROW is never checked

Wrong-ish: it can't hurt the hashtable.

comment:50 Changed 4 years ago by cypherpunks

OK, corruption could explain everything except how that corruption happens.

comment:51 Changed 4 years ago by nickm

Also, if corruption is happening, we need to figure out why corruption only seems to be happening to this hashtable: the other ones seem to be working okay.

comment:52 Changed 4 years ago by cypherpunks

the other ones seem to be working okay.

Or maybe not. Perhaps corruption of this hashtable leads to visible effects, while corruption of the other hashtables triggers nothing visible.

comment:53 Changed 4 years ago by nickm

New instance with 0.2.5.8-rc reported in #13481:

microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 168 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0.

microdesc_cache_clean(): Bug: [0]: ID=99C6B1CB2E5E1030F40FE4AA55A1AA3566037507. md=0x7ff53e51ce50, rs=0x7ff53e5e4d20, ri=0x7ff53fc19c20. Microdesc digest in RS does match. RS okay in networkstatus.

comment:54 Changed 4 years ago by nickm

To me, these things are suggestive, or at least interesting:

  • There is an 'ri' value set.
  • This is a server.
  • There are many, many of these messages occurring.

And it seems that there are a couple of problems with our diagnostic:

  • If we're a relay, looking at node->rs is pointless here, since it's from the FLAV_NS consensus. We need to look at the rs from our latest microdescriptor consensus.
  • Also, there's a typo: "Microdesc digest in RS does match" should be "...does not match". Fortunately, we give a different message when it does.

comment:55 Changed 4 years ago by AndrewBuck

I'm not a technical enough user to understand the details in the discussion above, but I will give a couple more data points here in case they help in diagnosing the issue. I run a middle relay on version 2.5.10 (but I started seeing these about a week ago on 2.5.9). I had never seen them before, but I kept getting them every hour (about 20 at a time). After a few days of this I restarted my relay, both to update to 2.5.10 and to see if that cleared up the issue. After the restart the relay worked normally for about a week, and now I am seeing them again. Here is one pair of lines (again, I'm getting about 20 of these every hour).

Nov 12 07:25:07.334 [Warning] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 168 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 0 hours old. Hashtable badness is 0.

Nov 12 07:25:07.338 [Warning] microdesc_cache_clean(): Bug: [0]: ID=18E0BC920307C0B79F576EDB3473D74D4D1BCA94. md=0x7f8a1bcdaf60, rs=0x7f8a1d09c290, ri=0x7f8a1d86c200. Microdesc digest in RS does match. RS okay in networkstatus.

Here is my node on atlas if that helps as well.
https://atlas.torproject.org/#details/55E80F21D6D05CD5FAE52DAEF19EB9E62B8E1F6B

Everything in the logs looks normal before these errors start cropping up; the only thing even a bit out of place is that in the last heartbeat before the errors start I am seeing about 8000 TAP handshakes, compared to 3 or 4 thousand normally (the number of circuits and NTor handshakes are normal though).

Anyway, hope this information helps diagnose the problem. I will keep my relay up like this for a couple more days before restarting in case there is any other kind of information you would like gathered from it.

P.S. I copied the logs into a text file and did a bit of sed/grep work on them and compiled a list of relay ID numbers from the ID= field. There are a total of 414 of them listed and not a single one appears twice in the list. Out of 6000 possible relays this represents a sizeable fraction of the network.

P.P.S. Now, about 2 days later, I am up to 1800 relays showing a bad entry, and still exactly 0 repeats.

Last update: I am now up to about 4200 relays listed in the error logs; however, it has started to "loop back around" and I now see about 1000 or so with 2 occurrences each, and the rest with only one; none with 3 or more.

Last edited 4 years ago by AndrewBuck

comment:56 in reply to:  53 Changed 4 years ago by cypherpunks

Replying to nickm:

New instance with 0.2.5.8-rc reported in #13481:

#13481 is probably related, and not a full duplicate of this bug.
A bunch of "md was still referenced" warnings should be triggerable by launching Tor as a client, changing it to relay mode after all the directory material has been fetched, and then waiting for a week or so.
nodelist_set_consensus() is called only for usable_consensus_flavor() (FLAV_MICRODESC for a client and FLAV_NS for a relay). While client mode is enabled, nodelist_set_consensus() updates node->md; it skips that when relay mode is enabled, and then, as long as node->rs exists in the given ns consensus, such a node can't be purged.
Probably a valid explanation, if nothing is missed.
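To make the client-to-relay scenario concrete, here is a tiny self-contained model of the flavor-dependent handling described above. The structs and functions are invented stand-ins for illustration, not Tor's real nodelist code.

#include <stdio.h>
#include <time.h>

#define CUTOFF_SECS (7 * 24 * 60 * 60)

typedef struct { unsigned held_by_nodes; time_t last_listed; } md_t;
typedef struct { md_t *md; int has_rs; } node_t;

/* Toy nodelist_set_consensus(): only the microdesc flavor touches node->md. */
static void set_consensus(node_t *node, md_t *md, int use_microdescs, time_t now) {
  node->has_rs = 1;
  if (use_microdescs) {
    if (!node->md) { node->md = md; md->held_by_nodes++; }
    md->last_listed = now;
  }
  /* relay mode: node->md and md->last_listed are left untouched */
}

static void cache_clean(const md_t *md, time_t now) {
  if (md->last_listed < now - CUTOFF_SECS && md->held_by_nodes)
    printf("warning: old md still referenced by %u node(s)\n", md->held_by_nodes);
}

int main(void) {
  time_t now = time(NULL);
  md_t md = {0, 0};
  node_t node = {NULL, 0};

  set_consensus(&node, &md, 1, now);               /* running as a client */
  /* ... operator reconfigures the process as a relay ... */
  for (int day = 1; day <= 8; ++day)
    set_consensus(&node, &md, 0, now + day * 86400); /* md never refreshed */

  cache_clean(&md, now + 8 * 86400);               /* spurious warning fires */
  return 0;
}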

comment:57 Changed 4 years ago by cypherpunks

Probably a valid explanation

Except maybe nodelist_add_microdesc(), which should update everything in question.

comment:58 Changed 4 years ago by cypherpunks

Except maybe nodelist_add_microdesc()

That is called by:

    networkstatus_t *ns = networkstatus_get_latest_consensus();
    if (ns && ns->flavor == FLAV_MICRODESC)
      SMARTLIST_FOREACH(added, microdesc_t *, md, nodelist_add_microdesc(md));

Here networkstatus_get_latest_consensus() depends on we_use_microdescriptors_for_circuits() and would return current_ns_consensus if relay mode is enabled (in general it depends on other options as well).

comment:59 Changed 4 years ago by cypherpunks

Bunch of "md was still referenced" should be able to trigger if to launch tor as client and to change it to relay mode after all stuff fetched then wait for a week or so.

Yet way to trigger bug is to use bridges, if one bridge supports md and another bridge have no md support. Client switches to ns-consensus after bridge that supports md fails circuit. Then client leaving pointers (node->md) and md unchanged, so bug depends cached stuff (@last-listed values) and client uptime. However, this scenario, if it was confirmed yet, shouldn't be actual today, md supported since 0.2.3.1-alpha version and 0.2.3.x is outdated.

comment:60 Changed 4 years ago by cypherpunks

Status: needs_information → needs_review

#13481, and probably some of the other reported cases, should be fixed by the following patch:

--- nodelist.c.original	2014-10-10 06:06:24.000000000 -0700
+++ nodelist.c	2014-12-18 06:29:32.696802404 -0800
@@ -229,6 +229,10 @@
         if (node->md)
           node->md->held_by_nodes++;
       }
+    } else { /* No md consensus in use; release the md held by this node */
+      if (node->md)
+        node->md->held_by_nodes--;
+      node->md = NULL;
     }
 
     node_set_country(node);

comment:61 Changed 4 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.6.x-final

comment:62 Changed 4 years ago by cypherpunks

It still stays buggy for the case of switching between client-only and relay modes while both current_md_consensus and current_ns_consensus are cached. Everything then depends on the timing until the next consensus update, and on the mds' last-listed values.

comment:63 Changed 4 years ago by cypherpunks

The next patch tries to resolve the last edge case:

--- microdesc.c.original	2014-12-16 10:16:08.393137000 -0800
+++ microdesc.c	2014-12-18 08:30:27.637096984 -0800
@@ -847,6 +847,8 @@
 we_use_microdescriptors_for_circuits(const or_options_t *options)
 {
   int ret = options->UseMicrodescriptors;
+  static int prev_ret_we_use_md = -1;
+  networkstatus_t *ns = NULL;
   if (ret == -1) {
     /* UseMicrodescriptors is "auto"; we need to decide: */
     /* If we are configured to use bridges and none of our bridges
@@ -859,6 +861,24 @@
      * a partitioning issue here where bridges differ from clients. */
     ret = !server_mode(options) && !options->FetchUselessDescriptors;
   }
+  /* Detect whether the preferred consensus flavor changed;
+   * if so, update the nodelist according to the chosen consensus. */
+  if (prev_ret_we_use_md != -1 && prev_ret_we_use_md != ret) {
+
+     /* We can't call networkstatus_get_latest_consensus();
+      * it returns current_consensus, which depends on
+      * the result of we_use_microdescriptors_for_circuits() */
+    if (ret) /* we use microdescriptors, we need md-consensus */
+      ns = networkstatus_get_latest_consensus_by_flavor(FLAV_MICRODESC);
+    else 
+      ns = networkstatus_get_latest_consensus_by_flavor(FLAV_NS);
+
+    /* update nodelist with any latest consensus */
+    if (ns)
+      nodelist_set_consensus(ns);
+  }
+
+  prev_ret_we_use_md = ret;
   return ret;
 }
--- nodelist.c.original	2014-10-10 06:06:24.000000000 -0700
+++ nodelist.c	2014-12-18 08:23:30.969080066 -0800
@@ -229,6 +229,11 @@
         if (node->md)
           node->md->held_by_nodes++;
       }
+    } else { /* No md consensus in use; release the md held by this node if needed */
+      if (node->md) {
+        node->md->held_by_nodes--;
+        node->md = NULL;
+      }
     }
 
     node_set_country(node);

comment:64 Changed 4 years ago by cypherpunks

The next patch tries to resolve the last edge case

A preferable and non-kludgy way would be to drop the check of any_bridge_supports_microdescriptors() in we_use_microdescriptors_for_circuits(), assuming every bridge supports mds, and then to update the nodelist from options_act() whenever config options affecting the preferred consensus flavor change.

comment:65 Changed 4 years ago by nickm

Keywords: 025-backport nickm-should-review added; 024-backport 025-triaged removed

comment:66 Changed 4 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.???
Status: needs_review → needs_revision

comment:67 Changed 3 years ago by nickm

#16373 was another report of this; so was #13945.

comment:68 Changed 2 years ago by teor

Severity: Normal

#19619 was another report of this.

comment:69 Changed 2 years ago by nickm

whoo, just hit this in 0.2.8, when starting up with a very old data directory. It said:

Jul 13 11:46:43.000 [notice] Bootstrapped 100%: Done
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=3EFB929FA15E084A5F53C9C7A058C593A97DFD57. md=0x55b7d50ae050, rs=0x55b7d5443610, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=179B10784BF8955C73313CCB195904AE133E5F53. md=0x55b7d50adc20, rs=0x55b7d5412f20, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=1EA3292F7D9EB3A9CEBA0DE4C0C19895C7C449C5. md=0x55b7d5152820, rs=0x55b7d541af00, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D. md=0x55b7d4f91a30, rs=0x55b7d548cde0, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=E79699F226A6ED3B1D13B0F6B983D40779B8693E. md=0x55b7d50a9e90, rs=0x55b7d5513e50, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=F378A4DD858B6B7BBB79C6EAB1CE6912AC0FF8BC. md=0x55b7d51523b0, rs=0x55b7d5522a20, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=2679B51C906158F3DF4C59AFD73E2B1FDA6535E1. md=0x55b7d50aa890, rs=0x55b7d540d340, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 1056 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.5-rc-dev )
Jul 13 12:16:38.000 [warn] microdesc_cache_clean(): Bug:   [0]: ID=B870A9F8085D9D63541CA1C73C82C5D2827919F1. md=0x55b7d50ac210, rs=0x55b7d54daeb0, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.2.8.5-rc-dev )
Jul 13 17:46:38.000 [notice] Heartbeat: Tor's uptime is 5:59 hours, with 0 circuits open. I've sent 527 kB and received 4.90 MB.

comment:70 Changed 2 years ago by nickm

That was with:

Jul 13 11:46:37.359 [notice] Tor v0.2.8.5-rc-dev running on Linux with Libevent 2.0.21-stable, OpenSSL 1.1.0-pre6-dev and Zlib 1.2.8.

comment:71 Changed 2 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.9.x-final
Status: needs_revision → new

Huh. Those microdescriptors supposedly have corresponding routerstatus objects with matching md digests. They shouldn't be "very old" now; we should consider them "last listed" one hour ago.

comment:72 Changed 2 years ago by nickm

Okay. A consensus arrives, and in directory.c, in connection_dir_client_reached_eof, we update the consensus and call "update_microdescs_from_networkstatus()", which will eventually update the last-listed times. But before that function updates the last-listed times, it calls get_microdesc_cache().

If get_microdesc_cache() is being called for the first time, it loads the microdescs from disk, and then calls microdesc_cache_clean(). We have a new consensus, but we have not yet finished update_microdescs_from_networkstatus.
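A compressed model of that ordering, with all of Tor's machinery replaced by toy functions (every name below is illustrative): the first call to the cache getter both loads stale last_listed values from disk and cleans, before the caller has had a chance to refresh last_listed from the newly arrived consensus.

#include <stdio.h>
#include <time.h>

#define CUTOFF_SECS (7 * 24 * 60 * 60)

/* Toy model: one cached microdescriptor with an on-disk last_listed value. */
static time_t md_last_listed;
static int md_held_by_nodes = 1;
static int cache_loaded = 0;

static void cache_clean(time_t now) {
  if (md_last_listed < now - CUTOFF_SECS && md_held_by_nodes)
    printf("Bug-style warning: md looks too old but is still referenced\n");
}

/* Models get_microdesc_cache(): the first call loads from disk and cleans. */
static void get_cache(time_t now) {
  if (!cache_loaded) {
    md_last_listed = now - 8 * 24 * 60 * 60; /* stale value read from disk */
    cache_loaded = 1;
    cache_clean(now); /* cleaning happens here, too early */
  }
}

/* Models update_microdescs_from_networkstatus(): refreshes last_listed,
 * but only after it has already fetched the cache. */
static void update_from_consensus(time_t now) {
  get_cache(now);        /* first call => load + clean with stale times */
  md_last_listed = now;  /* the refresh arrives only after the clean */
}

int main(void) {
  update_from_consensus(time(NULL)); /* prints the spurious warning */
  return 0;
}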

comment:73 Changed 2 years ago by nickm

Milestone: Tor: 0.2.9.x-final → Tor: 0.2.8.x-final

If my diagnosis is right, then to see this bug, you need to restart tor after exactly 7 days, give or take a couple of hours.

I've done a fix as bug7164_028. Needs review. :)

comment:74 in reply to:  72 ; Changed 2 years ago by cypherpunks

Replying to nickm:

If get_microdesc_cache() is being called for the first time, it loads the microdescs from disk, and then calls microdesc_cache_clean(). We have a new consensus, but we have not yet finished update_microdescs_from_networkstatus.

U2FsdGVkX19xU5DZ+k76OvKVNZbPzyTLeVU3rQslY+Bz8xq+mXT59freOTt4KDRQ
ScTaE9PCTdMED5CkBlXJuTJfOCJNEO69AskcLeDTRRFT1w9W3QWgxZshpgc8T5qD
GwweQ02CWm/bdq0NW070xUHuno10nLbIBAO8a/bzcfSbyquibM+FIZlQTA0n0Fda
Qey6pv3d9y7v/Lc8OHfZUD1Vvbe9v9P5xqrpwcwkdT0klI5B5Vgaike7m25P1hDh
MBzO6gX3NNnA1ST5pP0qpuFgZMK79uJmbXUj0iECLs6sGD/HgY19p9Vw/T6Pt7P4
8+mh5lUOzcLAX2Lc1yKZEBYIJmSnYwgWYensAepobKH785iuij/a6X4YLdJ39BCz
nvkKCfm4cLCLGEG+4MkrRClBN7YtpZ57K9LSyWnxTHKJinFYLpGMc9VYViScYq3E
Qm9xVYisuET5bD7tBjgbXCJdF67++7w=

(openssl enc -aes-256-xts -d -base64
sha256(password) = 62493873306edef2c14c567d252b5b029d67fe3119c703486209a93bb9e4117e)

EDIT: function name copy/paste typo

Last edited 2 years ago by cypherpunks

comment:75 Changed 2 years ago by nickm

If you want me to see that, you should probably just GPG it to me. I don't know what password you're using.

(And if you don't want me to see that, I don't understand what's going on.)

comment:76 Changed 2 years ago by nickm

Status: new → needs_review

comment:77 Changed 2 years ago by cypherpunks

I got this when starting tor 0.2.8.6:

[warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 3762 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 1 hours old. Hashtable badness is 0. (on Tor 0.2.8.6 )

comment:78 Changed 2 years ago by nickm

Keywords: review-group-7 added

comment:79 Changed 2 years ago by nickm

Owner: set to nickm
Status: needs_review → accepted

comment:80 Changed 2 years ago by nickm

Status: accepted → needs_review

comment:81 Changed 2 years ago by dgoulet

Status: needs_review → merge_ready

lgtm;

Note that I can't really test this; I do not have an old Tor data set :S and using CollecTor files makes it a bit complicated, since the archive of microdescriptors is ordered by ID and not kept in one single fat file.

comment:82 in reply to:  74 Changed 2 years ago by cypherpunks

Replying to cypherpunks:

U2FsdGVkX19xU5DZ+k76OvKVNZbPzyTLeVU3rQslY+Bz8xq+mXT59freOTt4KDRQ
ScTaE9PCTdMED5CkBlXJuTJfOCJNEO69AskcLeDTRRFT1w9W3QWgxZshpgc8T5qD
GwweQ02CWm/bdq0NW070xUHuno10nLbIBAO8a/bzcfSbyquibM+FIZlQTA0n0Fda
Qey6pv3d9y7v/Lc8OHfZUD1Vvbe9v9P5xqrpwcwkdT0klI5B5Vgaike7m25P1hDh
MBzO6gX3NNnA1ST5pP0qpuFgZMK79uJmbXUj0iECLs6sGD/HgY19p9Vw/T6Pt7P4
8+mh5lUOzcLAX2Lc1yKZEBYIJmSnYwgWYensAepobKH785iuij/a6X4YLdJ39BCz
nvkKCfm4cLCLGEG+4MkrRClBN7YtpZ57K9LSyWnxTHKJinFYLpGMc9VYViScYq3E
Qm9xVYisuET5bD7tBjgbXCJdF67++7w=

(openssl enc -aes-256-xts -d -base64
sha256(password) = 62493873306edef2c14c567d252b5b029d67fe3119c703486209a93bb9e4117e)

EDIT: function name copy/paste typo

Found password

echo "the report of my death was an exaggeration" | openssl sha256

Deciphered

get_microdesc_cache() is called for the first time by nodelist_set_consensus(). md->held_by_nodes is incremented by nodelist_set_consensus() too (nodelist_add_microdesc() skips that part, since no nodes have been created yet). So microdesc_cache_clean() is called for the first time before held_by_nodes is set for any md, and with no held_by_nodes there is no bug.

comment:83 Changed 2 years ago by cypherpunks

Okay. A consensus arrives

Your case happens if the consensus was set by networkstatus_note_certs_arrived(), which misses all the update_*() functions.

Alas, that's not the last edge case that triggers these warnings. nodelist_set_consensus() must clear every node->md, like it does for node->rs, to fix all the remaining cases.

comment:84 Changed 2 years ago by cypherpunks

Another problem: there are dozens of relays (around 40 the last time I checked) that never rotate their onion keys for some reason.

comment:85 in reply to:  84 Changed 2 years ago by teor

Status: merge_ready → needs_revision

Replying to cypherpunks:

Another problem: there are dozens of relays (around 40 the last time I checked) that never rotate their onion keys for some reason.

Split off #20055 for this issue.

The rest of this ticket seems to need a revision, based on the cypherpunks comments.
(Next time, please PGP encrypt an email for vulnerabilities, or make plain-text comments otherwise. It helps us fix all the issues at the same time.)

comment:86 Changed 2 years ago by nickm

Keywords: review-group-8 added; review-group-7 removed

comment:87 Changed 2 years ago by nickm

Keywords: TorCoreTeam201609 added

comment:88 Changed 2 years ago by nickm

Keywords: review-group-9 added; review-group-8 removed

comment:89 Changed 2 years ago by nickm

Keywords: review-group-10 added; review-group-9 removed

Moving not-reviewed-by-me tickets in review-group-9, and for-0.2.9/0.2.8 tickets, to review-group-10.

comment:90 Changed 2 years ago by nickm

Keywords: review-group-10 removed
Milestone: Tor: 0.2.8.x-final → Tor: 0.3.0.x-final

If this is so pervasive, we should expect the fix to be destabilizing, and aim it for 0.3.0. :/

comment:91 Changed 21 months ago by nickm

Points: 2

comment:92 Changed 21 months ago by nickm

Milestone: Tor: 0.3.0.x-final → Tor: unspecified

comment:93 Changed 19 months ago by teor

Milestone: Tor: unspecified → Tor: 0.3.2.x-final

Duplicate in #21640; aspirationally moving this to 0.3.2, because it would be great to fix it.

comment:94 Changed 16 months ago by nickm

Keywords: TorCoreTeam201609 removed

comment:95 Changed 12 months ago by nickm

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final
Status: needs_revision → needs_review

Branch bug7164_032 is a rebase of the older attempted fix (possibly irrelevant) with the other fix suggested above.

I don't see how clearing all the md values is necessary in nodelist_set_consensus() -- any microdesc value not reached in the main loop in that function should get removed later, by nodelist_purge().

comment:96 Changed 11 months ago by nickm

Keywords: review-group-24 added

review-group-24 is now open.

comment:97 Changed 11 months ago by asn

Hm, took a look at this ticket given that I have spent some time with mds lately. I don't entirely understand the dc60e87d6 fix here. I think some more docs would be appreciated.

Why do we call update_microdescs_from_networkstatus() in networkstatus_note_certs_arrived() and not in networkstatus_set_current_consensus()? Would we ever want to update our microdescs if we didn't actually successfully set a consensus, and we just fetched some certs?

comment:98 Changed 11 months ago by asn

Cc: asn added
Reviewer: asn

comment:99 Changed 11 months ago by nickm

Status: needs_review → needs_revision

comment:100 Changed 11 months ago by arma

moria1 hits this once a week or so:

Nov 01 15:18:22.682 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 168 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 0 hours old. Hashtable badness is 0. (on Tor 0.3.3.0-alpha-dev 853bbb9112a16055)
Nov 01 15:18:22.682 [warn] microdesc_cache_clean(): Bug:   [0]: ID=BB119A5A4D5DA2BBB7B796ECC50E3C0F1D4FD910. md=0x7fdea11000b0, rs=0x7fdea6c395c0, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.3.3.0-alpha-dev 853bbb9112a16055)
Nov 01 15:18:22.690 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 168 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 0 hours old. Hashtable badness is 0. (on Tor 0.3.3.0-alpha-dev 853bbb9112a16055)
Nov 01 15:18:22.690 [warn] microdesc_cache_clean(): Bug:   [0]: ID=B62A3F23955AF323A03F4ADB355C7D00594CFBD1. md=0x7fdea194a650, rs=0x7fdec70afbf0, ri=(nil). Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.3.3.0-alpha-dev 853bbb9112a16055)
Nov 01 15:18:22.690 [info] microdesc_cache_clean(): Removed 126/55533 microdescriptors as old.

comment:101 Changed 11 months ago by teor

I wonder if this can happen when a relay wakes up from hibernation after a week or two?

168 hours isn't very old for a microdesc any more: Tor relays used to change their onion keys every 7 days, forcing a microdesc hash change every week. But now, the expired microdesc can be listed in the new consensus after hibernation, because the onion keys only change every 30 days.

Maybe this isn't the issue, but it seems like it might be worth looking into.

comment:102 Changed 10 months ago by arma

Nov 30 08:53:25.913 [warn] microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 174 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 2 hours old. Hashtable badness is 0. (on Tor 0.3.3.0-alpha-dev 8ff60b52889ecd40)
Nov 30 08:53:25.913 [warn] microdesc_cache_clean(): Bug:   [0]: ID=9695DFC35FFEB861329B9F1AB04C46397020CE31. md=0x7f2386b4d9a0, rs=(nil), ri=0x7f238222a380. No RS. (on Tor 0.3.3.0-alpha-dev 8ff60b52889ecd40)

comment:103 Changed 6 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:104 Changed 6 months ago by nickm

Keywords: 033-removed-20180320 added

Mark all not-already-included tickets as pending review for removal from 0.3.3 milestone.

comment:105 Changed 6 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: unspecified

These tickets were marked as removed, and nobody has said that they can fix them. Let's remember to look at 033-removed-20180320 as we re-evaluate our triage process, to see whether we're triaging out unnecessarily, and to evaluate whether we're deferring anything unnecessarily. But for now, we can't do these: we need to fix the 033-must stuff now.

comment:106 Changed 6 months ago by cypherpunks

After syncing time in ticket:19460#comment:6 and hibernation, Tor shows

Tor NOTICE: Heartbeat: Tor's uptime is 13 days 0:41 hours, with 3 circuits open. I've sent 110.30 MB and received 728.68 MB. 

but then

Tor WARN: microdesc_cache_clean(): Bug: Microdescriptor seemed very old (last listed 168 hours ago vs 168 hour cutoff), but is still marked as being held by 1 node(s). I found 1 node(s) holding it. Current networkstatus is 11 hours old. Hashtable badness is 0. (on Tor 0.3.2.10 ) 
Tor WARN: microdesc_cache_clean(): Bug:   [0]: ID=0E71008524180D47796858A79EEC0F2233568B59. md=027FF798, rs=020E2B68, ri=00000000. Microdesc digest in RS matches. RS okay in networkstatus. (on Tor 0.3.2.10 ) 
Tor NOTICE: Your system clock just jumped 38542 seconds forward; assuming established circuits no longer work. 
Tor NOTICE: Heartbeat: Tor's uptime is 13 days 3:58 hours, with 0 circuits open. I've sent 110.95 MB and received 732.77 MB. 
Tor NOTICE: Average packaged cell fullness: 61.425%. TLS write overhead: 5% 
Tor NOTICE: Tor has successfully opened a circuit. Looks like client functionality is working. 

So Tor apparently thinks that 3:17 hours equals 38542 seconds, and it also shows this bug.
