Opened 3 years ago

Last modified 4 months ago

#18320 new enhancement

Clear old entries from the key-pinning journal file

Reported by: teor Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dos, 034-triage-20180328, 034-removed-20180328
Cc: Actual Points:
Parent ID: Points: 3
Reviewer: Sponsor: SponsorV-can

Description

In the key-pinning-journal file, we accumulate duplicates over time as relays switch back and forth between two different ed25519 keys with the same rsa key. This has both the issue of exhausting dirauth disk space as well as making the involved data structure slow, if it happens a lot. We should prune duplicates when updating our latest view on a relay's identity mapping.

Child Tickets

Change History (56)

comment:1 Changed 3 years ago by teor

Keywords: TorCoreTeam201602 added

comment:2 Changed 3 years ago by teor

Milestone: Tor: 0.2.7.x-finalTor: 0.2.9.x-final

comment:3 Changed 3 years ago by nickm

Instead, if we are allowing conflicts, maybe we should remember the conflict, and not re-add it over and over.

comment:4 Changed 3 years ago by isabela

Sponsor: SponsorU-can

comment:5 Changed 3 years ago by nickm

Parent ID: #17668
Points: medium

comment:6 Changed 3 years ago by nickm

Keywords: TorCoreTeam201602 removed

comment:7 Changed 3 years ago by nickm

Keywords: tor-dos added

comment:8 Changed 3 years ago by isabela

Points: medium3

comment:9 Changed 3 years ago by nickm

Parent ID: #17293

comment:10 Changed 3 years ago by andrea

Owner: set to andrea
Status: newassigned

Taking ownership for 0.2.9 triage

comment:11 Changed 3 years ago by nickm

Keywords: TorCoreTeam201606 added

comment:12 Changed 3 years ago by andrea

Just why are relays switching back between multiple ed25519 keys with the same RSA key? I'm going to go ahead and implement periodic pruning of duplicates from the journal file, but I think nickm's "maybe we should remember the conflict" comment hinges on whether this is a sign of a bug somewhere else we should fix, or something we should accommodate here.

comment:13 Changed 3 years ago by andrea

To prune the journal, we should remove duplicates and lines that introduce mappings
that will subsequently be removed as conflicts. The easiest approach will be to
parse the journal, then re-generate it from scratch based on the in-memory structure,
but this will fail to preserve the comment/reserved lines and the ordering of non-
conflicting entries. Do we care?

If we do, then the next-easiest approach is probably an in-memory list of
lines, and hash tables indexing it to query all lines matching particular
keys, so when we spot a conflicting line during pruning we can find the
conflicting previous lines and remove them, then re-emit the file from
the list of remaining lines.

Since there are no more than a few thousand active routers, a fully pruned
list cannot become unreasonably large, but this ticket is concerned with
running out of disk space due to excessive size. I think if we remove
earlier conflicting lines immediately while parsing to prune we'll keep it
down to a reasonable size though.

Has using tor_mmap_file() in keypin_load_journal() become a problem for
space reasons too?

comment:14 in reply to:  12 Changed 3 years ago by nickm

Replying to andrea:

Just why are relays switching back between multiple ed25519 keys with the same RSA key? I'm going to go ahead and implement periodic pruning of duplicates from the journal file, but I think nickm's "maybe we should remember the conflict" comment hinges on whether this is a sign of a bug somewhere else we should fix, or something we should accommodate here.

I think we should investigate whether it's happening for just a few nodes or a bunch.

If just a few, we can try to investigate that more, or just forbid it entirely.

If a lot, we need to investigate.

comment:15 Changed 3 years ago by andrea

Observed in passing: the existing code here counts conflicts incorrectly; it only counts one conflicting line when a line conflicts with two previously seen keypin journal entries, even though it correctly removes two hash table entries in that case. Will fix.

comment:16 in reply to:  13 Changed 3 years ago by nickm

Replying to andrea:

To prune the journal, we should remove duplicates and lines that introduce mappings
that will subsequently be removed as conflicts. The easiest approach will be to
parse the journal, then re-generate it from scratch based on the in-memory structure,
but this will fail to preserve the comment/reserved lines and the ordering of non-
conflicting entries. Do we care?

I don't care about the ordering of the nonconflicting entries, but the comments and reserved lines probably should get preserved; especially since we're storing info in the comment.

If we do, then the next-easiest approach is probably an in-memory list of
lines, and hash tables indexing it to query all lines matching particular
keys, so when we spot a conflicting line during pruning we can find the
conflicting previous lines and remove them, then re-emit the file from
the list of remaining lines.

Since there are no more than a few thousand active routers, a fully pruned
list cannot become unreasonably large, but this ticket is concerned with
running out of disk space due to excessive size. I think if we remove
earlier conflicting lines immediately while parsing to prune we'll keep it
down to a reasonable size though.

Has using tor_mmap_file() in keypin_load_journal() become a problem for
space reasons too?

I don't _think_ so -- on every place where we'd consider running an authority, we have mmap. So we'd only need to worry if we were at risk of exhausting our address space. I think that if we have any authorities that are running on a 32-bit system (where address space exhaustion can happen), limiting the keypin file to 1G is probably safe.

comment:17 Changed 3 years ago by andrea

Okay; the version I'm implementing preserves all the comments and reserved lines by building a linked list of lines indexed by hash tables for the ones which actually parse to anything, so we get order preservation for free.

comment:18 Changed 3 years ago by andrea

Status: assignedneeds_review

Proposed fix in my ticket18320_squashed branch, including new unit test. There are two relevant things worth considering whether to adjust here:

  • How often should we prune? Right now we invoke the pruner each time a dirauth starts up, before loading the keypinning journal.
  • What should we do if we encounter problems while writing the pruned keypinning journal? See new comment from keypin.c:
676     /*
677      * Loading into the pruner succeeded; truncate the old file and emit
678      *
679      * The output should always be smaller than the original file was, but
680      * under really pessimal circumstances, something else might gobble up
681      * a lot of disk space before we're done emitting and we might fail to
682      * emit and end up losing the journal.
683      *
684      * The alternative would be to emit to a temp file and then only delete
685      * the old journal when we've written the new one, but then we might fail
686      * to prune when disk space is tight.
687      */

This has been tested using a small alternate dirauth set as well as the unit test for the pruner.

comment:19 Changed 3 years ago by nickm

Keywords: review-group-3 added

Move some tickets into review-group-3: they are in 0.2.9, and they are needs_review.

comment:20 Changed 3 years ago by dgoulet

Reviewer: dgoulet
Status: needs_reviewneeds_revision

General comment.

I'm not so cheerful about the keypin code directly using HT_* instead of the tested and robust interface we have that are digestmap and digest256map _especially_ when we use RSA and ed25519 digest as keys. Maybe there is a reason? I don't see one for now nor see one in the comment. So I understand why this patch is using the HT_* interface but it adds quite a bit of complexity for review... Re-implementing the add/remove/iteration is really not ideal as we duplicate lots of things.

Why the double linked list? Isn't a smartlist_t ordered and when removing we can make sure to keep it with smartlist_del_keeporder()? This would simplify things since implementing a double linked list in this file is not ideal I would say.

Some code comment:

DG1: keypin_add_or_replace_entry_in_map() can return -2 I believe but it's not documented.

DG2: This if(pruner) is not useful there: if (pruner) keypin_free_pruner(pruner);

DG3: In keypin_prune_journal(), the pruned journal is written to disk only if r == 0 that is we parsed it successfully. However, on error the log_info(...) indicating that we've pruned the journal is given. Maybe if r is an error, we should goto err? (might be confusing with strerror(errno) though).

DG4: keypin_add_line_to_pruner(), I think len should be size_t here.

Ok, done for now.

comment:21 Changed 3 years ago by andrea

It ended up using HT_* because the existing keypin code used it; I can look at changing that if you like.

The double linked list is there because this table potentially has thousands of entries and smartlist_t is linear time to add/remove. Not changing that.

comment:22 in reply to:  21 Changed 3 years ago by dgoulet

Replying to andrea:

It ended up using HT_* because the existing keypin code used it; I can look at changing that if you like.

Oh, I think that would be much better for both review and maintaining it in the long run. I do understand it's a bit of work though so I would say do it if you feel like it won't take you 2 days of work :). Else, we can create a new ticket about it and flag it "lorax" but I fear it will never be addressed... :S

The double linked list is there because this table potentially has thousands of entries and smartlist_t is linear time to add/remove. Not changing that.

Very good point! Could we add this reason to the comment of the double linked list. Knowing why we use a certain type of data structure is _extremely_ useful (at least for my two-years-in-the-future self). Would be nice to see one day a linked list in the famous container.h :).

comment:23 Changed 3 years ago by nickm

I'm in favor of using ht_* stuff more, actually. It is a typed interface, whereas foomap_t is untyped.

Also, we have a set of linked-list types: see tor_queue.h .

comment:24 Changed 3 years ago by andrea

Status: needs_revisionneeds_review

See updated ticket18320_v2; I agree with nickm's arguments for HT_* on reflection. Nickm: would you prefer this converted to use TOR_LIST_* ?

comment:25 Changed 3 years ago by nickm

If it wouldn't be too hard, then yeah, tor_list_* would be great.

comment:26 Changed 3 years ago by andrea

See updated ticket18320_v2

comment:27 Changed 3 years ago by nickm

dgoulet -- is this still on your review queue, or is it on mine now?

comment:28 Changed 3 years ago by dgoulet

Keywords: TorCoreTeam201607 added; TorCoreTeam201606 removed
Status: needs_reviewneeds_information

Only one thing. The rest lgtm;

DG1: In keypin_load_journal_impl(), the last "add line to pruner" call below assumes no duplicate.

if (pruner) keypin_add_line_to_pruner(pruner, ent, cp, len);

Now because there is an entry ent, we get to keypin_add_or_replace_entry_in_map() which if duplicate, returns 0 then setting the ent pointer to NULL. Then back in keypin_load_journal_impl(), if also_add_to_main_map is set, then we'll try to add a NULL entry which I doubt it will go well :).

Here is the good news though, I don't see any call to keypin_load_journal_impl() with both a pruner and also_add_to_main_map set to 1 (even in the test). So in the end, why is also_add_to_main_map useful at all? If it is useful for let's say future use maybe?, we should handle the duplicate issue above then. If the assumption is that there shouldn't be duplicates in the pruner when loading it, we should make sure of it because this data comes from disk thus "untrusted" source.

comment:29 Changed 3 years ago by andrea

...returns 0 then setting the ent pointer to NULL.

You mean this in keypin_add_line_to_pruner(), where ent is a param and not the same as the one we're calling keypin_add_or_replace_entry_in_map() on back in keypin_load_journal_impl()? I don't think this is actually a bug, unless you had something else in mind.

457     if (r == 0) {
458       /*
459        * This was a duplicate; we just freed it and won't be adding
460        * anything to the list, but should bump the duplicate counter.
461        */
462       adding = 0;
463       ent = NULL;
464       ++(p->nlines_pruned_duplicate);

comment:30 in reply to:  29 Changed 3 years ago by dgoulet

Status: needs_informationmerge_ready

Replying to andrea:

...returns 0 then setting the ent pointer to NULL.

My mistake here. ent = NULL for some reason computed for me that we nullify the given pointer but it's wrong :).

lgtm;

comment:31 Changed 3 years ago by nickm

Keywords: review-group-6 added

comment:32 Changed 3 years ago by nickm

Status: merge_readyneeds_revision

Quick comments:

  • I think that we should use a temporary file instead of truncate-and-rewrite. In nearly every case where we haven't used a temporary file in the past, somebody *did* crash at just the wrong time and lose all their data. :p
  • Can the unit test verify that the correct lines get removed? FWICT it only verifies that the correct number of lines is removed, which is a little scary.
  • Suggestion: can we use the NULL-object pattern to do fewer "if (pruner)" checks? In other words, how about we just define all these functions to be no-ops or whatever when pruner is NULL? (I'm thinking of keypin_remove_entry and keypin_add_line_to_pruner.)

otherwise looks fine

comment:33 Changed 3 years ago by nickm

Keywords: review-group-7 added; review-group-6 removed

All review-group-6 tickets have had at least one review; moving them to 7.

comment:34 Changed 3 years ago by nickm

Keywords: review-group-8 added; review-group-7 removed

comment:35 Changed 3 years ago by nickm

Keywords: review-group-9 added; review-group-8 removed

comment:36 Changed 2 years ago by nickm

Parent ID: #17293

Leaving these in 0.2.9.x, but un-parenting . They are -can items, and their -must parent is closing.

comment:37 Changed 2 years ago by nickm

Keywords: nickm-deferred-20161005 added
Milestone: Tor: 0.2.9.x-finalTor: 0.3.0.x-final

Deferring big/risky-feature things (even the ones I really love!) to 0.3.0. Please argue if I'm wrong.

comment:38 Changed 2 years ago by nickm

These have sat in needs_revision for a few weeks at least. Removing from review-group-9, not adding to review-group-10.

comment:39 Changed 2 years ago by nickm

Keywords: review-group-9 removed

comment:40 Changed 2 years ago by nickm

Type: defectenhancement

batch modify: I think these are "enhancement", though I could be wrong about some.

comment:41 Changed 2 years ago by dgoulet

Owner: andrea deleted
Status: needs_revisionassigned

comment:42 Changed 2 years ago by dgoulet

Status: assignedneeds_revision

comment:43 Changed 2 years ago by nickm

Milestone: Tor: 0.3.0.x-finalTor: 0.3.1.x-final

comment:44 Changed 2 years ago by nickm

Sponsor: SponsorU-can

Clear sponsorS and sponsorU from open tickets in 0.3.1

comment:45 Changed 22 months ago by arma

I don't see a rush to get this one into 0.3.1.

But it might also be wise to grab onto it pretty soon in 0.3.2, before the code diverges too much from the patches here.

comment:46 Changed 22 months ago by nickm

Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final

0.3.1 feature freeze today: these don't seem like they will be all sorted out in time. Let's try to make progress in 0.3.2.

comment:47 Changed 22 months ago by nickm

Keywords: nickm-deferred-20161005 removed

comment:48 Changed 22 months ago by nickm

Keywords: TorCoreTeam201607 review-group-3 removed

comment:49 Changed 21 months ago by nickm

Sponsor: SponsorV-can

comment:50 Changed 18 months ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

Defer all needs_revision non-spec enhancements to 0.3.3.

comment:51 Changed 14 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

Mark a lot of assigned/needs_revision tickets as 0.3.4. If you think this should happen in 0.3.3 instead, just let me know?

comment:52 Changed 12 months ago by nickm

Keywords: 034-triage-20180328 added

comment:53 Changed 12 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:54 Changed 12 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These needs_revision, tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if somebody does the necessary revision.

comment:55 Changed 4 months ago by dgoulet

Reviewer: dgoulet

comment:56 Changed 4 months ago by dgoulet

Status: needs_revisionnew

Going back in "New" state since this probably has to be picked off from the start and onto the new refactored code structure.

Note: See TracTickets for help on using tickets.