Opened 4 years ago

Closed 3 years ago

Last modified 13 months ago

#9645 closed defect (fixed)

Tor client goes berserk downloading microdescriptors forever, if you rm -rf its datadir

Reported by: arma Owned by:
Priority: Medium Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-client, 024-backport, 2016-bug-retrospective
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Run your Tor client, with a datadirectory pointed to a new (empty or nonexistent) directory.

Once the Tor client has made the datadir and satisfied itself about its permissions, but while it's still bootstrapping, rm -rf the datadirectory.

It will produce an unending stream of

Sep 02 00:38:26.849 [warn] Couldn't open "/tmp/tordatadir/cached-microdescs.new" (/tmp/tordatadir/cached-microdescs.new) for writing: No such file or directory
Sep 02 00:38:26.849 [warn] Couldn't append to journal in /tmp/tordatadir/cached-microdescs.new: No such file or directory

which is not surprising since the directory isn't there anymore. The sad part though is:

Sep 02 00:28:59.198 [debug] count_usable_descriptors(): 4183 usable, 0 present (microdescs).
Sep 02 00:28:59.207 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:28:59.215 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:29:00.812 [debug] count_usable_descriptors(): 4183 usable, 0 present (microdescs).
Sep 02 00:29:00.819 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:29:00.825 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:29:12.586 [debug] count_usable_descriptors(): 4183 usable, 0 present (microdescs).
Sep 02 00:29:12.593 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:29:12.600 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
[...]
Sep 02 00:35:29.524 [debug] count_usable_descriptors(): 4183 usable, 0 present (microdescs).
Sep 02 00:35:29.532 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).
Sep 02 00:35:29.539 [debug] count_usable_descriptors(): 917 usable, 0 present (microdesc exits).

It seems like it's stuck in a loop realizing it doesn't have them, fetching them, failing to write them, realizing it doesn't have them, repeat.

Found while trying to reproduce #9639.

Child Tickets

Change History (19)

comment:1 Changed 4 years ago by arma

  • Milestone changed from Tor: unspecified to Tor: 0.2.4.x-final

See comment at https://trac.torproject.org/projects/tor/ticket/9639#comment:3 where I speculate that this can happen when you run out of disk space in your datadir too.

That argues for an 0.2.4 fix if we can find a clean one.

comment:2 Changed 4 years ago by arma

  • Keywords tor-client added

comment:3 follow-up: Changed 4 years ago by nickm

One option would be just to hold them in RAM when writing to disk fails, and not save them to disk at all.

Another (simpler) option would be to exit Tor in this case.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 4 years ago by nickm

  • Status changed from new to needs_review

Replying to nickm:

One option would be just to hold them in RAM when writing to disk fails, and not save them to disk at all.

This is now "bug9645" in my public repository.

Another (simpler) option would be to exit Tor in this case.

This is now "bug9645_big_hammer" in my public repository.

These are incompatible with one another, and they both make the #9639 fix unneeded.

comment:5 follow-up: Changed 4 years ago by mr-4

I am not so keen on either solutions to be honest. Writing to RAM, particularly in resource-constrained systems would be very undesirable (I use 7 instances of tor, all deployed on such systems).

The "big hammer" fix isn't a solution to me either, though I don't know what the implications are if tor doesn't save this new information (descriptors?), so I can't really judge what the impact is going to be if things are left as they are - without any fix.

OK, tor will loop endlessly until it can write its descriptors back, but what happens if:

  1. tor tries limited amount of time (say 60 times) and then doesn't try any more (stays with the old descriptors)? or
  2. tor does what it is currently programmed to do - loop indefinitely until it can write those descriptors (I still can't see why is that bad, provided there are no memory leaks)?

comment:6 in reply to: ↑ 5 ; follow-up: Changed 4 years ago by nickm

Replying to mr-4:

I am not so keen on either solutions to be honest. Writing to RAM, particularly in resource-constrained systems would be very undesirable (I use 7 instances of tor, all deployed on such systems).

Right now, they're left in RAM anyway when the disk write is successful.

The "big hammer" fix isn't a solution to me either, though I don't know what the implications are if tor doesn't save this new information (descriptors?), so I can't really judge what the impact is going to be if things are left as they are - without any fix.

Right now, it's going to hammer on the directory caches. IMO, if a user doesn't have enough disk to save microdescriptors to disk, and they don't have RAM to keep them in memory, then they don't have enough resources to run Tor.

OK, tor will loop endlessly until it can write its descriptors back, but what happens if:

  1. tor tries limited amount of time (say 60 times) and then doesn't try any more (stays with the old descriptors)? or
  2. tor does what it is currently programmed to do - loop indefinitely until it can write those descriptors (I still can't see why is that bad, provided there are no memory leaks)?

It doesn't do that. It throws them out, then re-downloads them, which is not good for the network. That's why Roger thinks this needs a fix in 0.2.4.

comment:7 in reply to: ↑ 6 Changed 4 years ago by mr-4

Replying to nickm:

Right now, they're left in RAM anyway when the disk write is successful.

Ah, OK. So tor has them in memory anyway, but occasionally tries to write them to disk. I take it the information written to disk is only usable/needed when tor starts the next time, is that correct?

If so, then it makes perfect sense for tor to hold the "write" bit until it is capable of doing it and issue warnings every 6 hours or so (together with the HEARTBEAT message).

OK, tor will loop endlessly until it can write its descriptors back, but what happens if:

  1. tor tries limited amount of time (say 60 times) and then doesn't try any more (stays with the old descriptors)? or
  2. tor does what it is currently programmed to do - loop indefinitely until it can write those descriptors (I still can't see why is that bad, provided there are no memory leaks)?

It doesn't do that. It throws them out, then re-downloads them, which is not good for the network. That's why Roger thinks this needs a fix in 0.2.4.

Got it! You are quite right, that is undesirable. Given the above, I think tor should keep them in RAM (as it does that anyway), try to write them at regular intervals to disk and issue warnings at regular (but not very frequent!) intervals.

comment:8 Changed 4 years ago by mr-4

I am going to try the "bug9645" as well when I test it on Thursday/Friday.

comment:9 in reply to: ↑ 4 ; follow-ups: Changed 4 years ago by arma

Replying to nickm:

Replying to nickm:

One option would be just to hold them in RAM when writing to disk fails, and not save them to disk at all.

This is now "bug9645" in my public repository.

Another (simpler) option would be to exit Tor in this case.

This is now "bug9645_big_hammer" in my public repository.

These are incompatible with one another, and they both make the #9639 fix unneeded.

I don't think we can afford to exit here, so big_hammer isn't a good option.

Are we likely to try writing them again later when we reform our journal file? Are there bugs there? I was about to merge bug9645 into maint-0.2.4 for the new release when I realized it's the sort of thing that will produce weird edge cases. Maybe it's best to let it bake in master for a while before backporting it (or deciding not to).

comment:10 in reply to: ↑ 9 ; follow-up: Changed 4 years ago by mr-4

Replying to arma:

I don't think we can afford to exit here, so big_hammer isn't a good option.

+1. To me, "bug9645" is the best solution of all 3 proposals.

Are we likely to try writing them again later when we reform our journal file?

This is what I was going to suggest as well (see below).

Are there bugs there? I was about to merge bug9645 into maint-0.2.4 for the new release when I realized it's the sort of thing that will produce weird edge cases. Maybe it's best to let it bake in master for a while before backporting it (or deciding not to).

OK, I had a chance to run "bug9645" (although using 2.4.16-rc instead of .14) for the last 14 hours or so and here are my findings:

  1. The memory leak is now completely gone (so #9639 is fixed as Nick suggested earlier) - tor's memory footprint is extremely stable at ~17 to 19MB.
  1. When I delete all files in datadir, *but* keep datadir (i.e. "rm -rf /var/lib/tor/*"), tor recreates them all, except the "cached-certs" file. This is valid for all cases of executing the file removal command: pre-, during and post- tor boot up.
  1. Things are a bit different when I delete all files *and* datadir (i.e. "rm -rdf /var/lib/tor"):
  • in "pre" tor boot up, tor just recreates everything, including datadir.
  • in "post" tor boot up (i.e. after I see that tor has booted up 100%), the situation is the same as described in the initial report above, minus the memory leaks. In other words, I still get the error messages, but have no memory leaks. Interestingly, the tor memory footprint doesn't seem to increase that much - it is very stable between 17 and 19MB.
  • in "during" tor boot up, the situation is similar as in "post" above, with the exception that the error messages are not that frequent (I had about a dozen such messages when running it for a couple of hours or so, compared to about 3 times that amount in running it in just over an hour in the "post" scenario).

So, I think the tor code should be changed slightly so that the behaviour in "post" and "during" is consistent with "pre" boot scenario - if datadir doesn't exist, recreate it and if a specific file doesn't exist, do the same.

Display error message only when that is not possible, but don't flood the logs (as is the case now). Maybe issue an error at the HEARTBEAT interval (6 hours in my case). Just a thought!

comment:11 in reply to: ↑ 9 Changed 4 years ago by nickm

  • Keywords 024-backport added

Replying to arma:

Are we likely to try writing them again later when we reform our journal file? Are there bugs there?

I *believe* that's the behavior, but I'm far from certain. There could indeed be bugs, but since this is the same logic we use for microdescriptors in the journal, I don't think it's super likely.

We'll only try once, since if we attempt to write a microdescriptor to the cache when rebuilding the cache, but the write fails, we then set the no_save =1 flag on the md.

I was about to merge bug9645 into maint-0.2.4 for the new release when I realized it's the sort of thing that will produce weird edge cases. Maybe it's best to let it bake in master for a while before backporting it (or deciding not to).

Sounds good, especially given mr-4's "It didn't explode" report. Merging this branch (my bug9645) to master and marking for 0.2.4 backport.

comment:12 in reply to: ↑ 10 ; follow-ups: Changed 4 years ago by nickm

Replying to mr-4:

So, I think the tor code should be changed slightly so that the behaviour in "post" and "during" is consistent with "pre" boot scenario - if datadir doesn't exist, recreate it and if a specific file doesn't exist, do the same.

Hm. I think that generally, it's a bit questionable to expect that Tor should always recover from having parts of its data directory deleted from beneath it: this is not a supported operation. Still, that wouldn't be out of the question: opening a new ticket for that.

comment:13 in reply to: ↑ 12 Changed 4 years ago by nickm

Replying to nickm:

Still, that wouldn't be out of the question: opening a new ticket for that.

Created as #9680.

comment:14 in reply to: ↑ 12 Changed 4 years ago by mr-4

Replying to nickm:

Hm. I think that generally, it's a bit questionable to expect that Tor should always recover from having parts of its data directory deleted from beneath it: this is not a supported operation. Still, that wouldn't be out of the question: opening a new ticket for that.

My thinking is that there needs to be a consistency - if files are recreated in "pre", "post" and "during", the same should apply for datadir as well. Do one or the other, not something which is neither here nor there (as is the case now).

comment:15 Changed 3 years ago by arma

I just tested it in 0.2.5, and it looks like it works ok. Backport sounds fine.

(I'm not worried about consistency in behavior -- you're not supposed to delete your datadir while Tor is running, and the resulting behavior is undefined (but we'd like it to not kill all your pets).

I think people are more likely to run across this behavior in practice when their disk goes full. And I think we behave ok in that case now.

comment:16 Changed 3 years ago by nickm

  • Resolution set to fixed
  • Status changed from needs_review to closed

Okay. Merged it into 0.2.4, and hoping for the best.

comment:17 Changed 13 months ago by nickm

  • Keywords 2016-bug-retrospective added

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

comment:18 Changed 13 months ago by nickm

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

comment:19 Changed 13 months ago by nickm

Mark bugs for 2016 bug retrospective based on hand-examination of changelogs for 0.2.5 onwards.

Note: See TracTickets for help on using tickets.