Opened 9 years ago

Closed 8 years ago

#2334 closed defect (wontfix)

Torouter on Buffalo breaks with large cached-descriptors[.new] files

Reported by: karsten
Owned by: ioerror
Priority: Medium
Milestone:
Component: Archived/Torouter
Version: Tor: 0.2.1.26
Severity:
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Torouter breaks on devices with only 64MB of flash memory because of large cached-descriptors[.new] files. Jake reports that he tried to start a bridge with these files at 12M and 9M, which made his box die badly.

The question is whether we can throw out cached descriptors more quickly on a bridge. That leads to the question of whether bridge clients request descriptors from us, or whether they only establish a directory connection to a directory mirror via us.

Child Tickets

Attachments (1)

descriptor-age-2011-01-03.png (34.2 KB) - added by karsten 9 years ago.


Change History (14)

comment:1 Changed 9 years ago by Sebastian

Bridges act as directory caches for their clients; the descriptors are served from that cache.

comment:2 Changed 9 years ago by karsten

Priority: major → blocker

My Torouter bridge breaks after 2.5 days without a single client connecting. Here's the timeline:

  • Dec 29 14:06:59: started Tor
  • Dec 31 13:30:15: 17.5M left in /var/lib/tor/, 17704 bytes RAM left, Tor still running
  • Jan 1 06:28:43: Tor complains for the first time that there's no space left to write cached-descriptors.tmp, but it keeps running
  • Jan 1 18:38:13: 9.4M left in /var/lib/tor/, 11072 bytes RAM left, Tor still running
  • Jan 3 07:47:55: 0 bytes left in /var/lib/tor/, 1220 bytes RAM left, Tor still running, killed

As a next step we should investigate how many descriptors we need to cache as a bridge (or as a relay). We can probably throw out descriptors more quickly.

I'm going to set up a new directory mirror to a) take hourly snapshots of the cached-descriptors* files and b) log the requested descriptor digests (using log granularity of 15 minutes). With these data we can answer 1) for what consensuses we're keeping descriptors and 2) which consensuses clients use to decide which descriptors to download. Once we know that clients don't download old descriptors, we can stop caching them.

Another question is whether we can avoid writing the cached-descriptors.tmp file and instead delete cached-descriptors and recreate it. In the unlikely case that we crash during this operation we'd have to download them once again. Maybe we can make this an option for devices with limited disk space.

The next problem we'll have to solve is that we're running out of RAM.

comment:3 in reply to:  2 Changed 9 years ago by karsten

Replying to karsten:

As a next step we should investigate how many descriptors we need to cache as a bridge (or as a relay). We can probably throw out descriptors more quickly.

I'm going to set up a new directory mirror to a) take hourly snapshots of the cached-descriptors* files and b) log the requested descriptor digests (using log granularity of 15 minutes). With these data we can answer 1) for what consensuses we're keeping descriptors and 2) which consensuses clients use to decide which descriptors to download. Once we know that clients don't download old descriptors, we can stop caching them.

I ran a similar experiment to the one described above. I modified a moderately fast directory mirror to log which descriptors are requested and collected a few hours of those logs. But instead of hourly snapshots I took the cached-descriptors* files of a relay that was running for a couple of days. No reason to wait a couple of days if we can have results right now.

See the attached graph descriptor-age-2011-01-03.png.

You'll see that 97.12% of requested descriptors, but only 19.61% of cached descriptors, were last referenced in a consensus at most 3 hours old. Also note that 15.21% of cached descriptors are not referenced in any consensus at all.

That means if we threw out all descriptors that are not referenced in one of the last 3 consensuses, we wouldn't be able to answer 2.88% of the requests, but would reduce the cached-descriptors* files to 1/5 of the current size.

What about these 2.88% of requests? Can the requesting clients even make use of descriptors of relays that are very likely not running anymore?

Changed 9 years ago by karsten

comment:4 Changed 9 years ago by nickm

It would be neat to have a feature that says, "Don't use more than X bytes of disk space."

Do we already store last-time-mentioned information for descriptors? If not, we could maybe start, and implement a "remember as many as we have room for, from most recently mentioned to least recently mentioned" policy.
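
A rough sketch of what such a policy could look like is below. This is only an illustration, not existing Tor code; the cached_desc_t type, its fields, and the max_store_bytes parameter are hypothetical names.

  /* Illustrative sketch: keep the most recently listed descriptors until a
   * byte budget is exhausted.  Types and names are hypothetical. */
  #include <stdlib.h>
  #include <time.h>

  typedef struct cached_desc_t {
    time_t last_listed;  /* when a consensus last referenced this descriptor */
    size_t body_len;     /* size of the descriptor on disk */
    int keep;            /* set to 1 if we decide to retain it */
  } cached_desc_t;

  static int
  compare_by_last_listed(const void *a_, const void *b_)
  {
    const cached_desc_t *a = *(const cached_desc_t *const *)a_;
    const cached_desc_t *b = *(const cached_desc_t *const *)b_;
    /* Most recently listed first. */
    if (a->last_listed > b->last_listed) return -1;
    if (a->last_listed < b->last_listed) return 1;
    return 0;
  }

  /* Mark descriptors to keep, newest-listed first, until max_store_bytes
   * would be exceeded; everything left unmarked can be thrown out. */
  static void
  mark_descriptors_to_keep(cached_desc_t **descs, size_t n,
                           size_t max_store_bytes)
  {
    size_t used = 0, i;
    qsort(descs, n, sizeof(cached_desc_t *), compare_by_last_listed);
    for (i = 0; i < n; ++i) {
      if (used + descs[i]->body_len > max_store_bytes)
        break;
      descs[i]->keep = 1;
      used += descs[i]->body_len;
    }
  }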

The problem with eliminating the .tmp file is that the regular cached-descriptors file is pulled into RAM via mmap, and that mapping is where the older descriptors live. If we started replacing cached-descriptors, it wouldn't actually get removed from the disk until we closed the file... and if we closed the file, we wouldn't have the older descriptors in RAM any more. So if we want to avoid using the .tmp file, we'll need to be able to fit all of the descriptors into RAM while we rebuild cached-descriptors.

OTOH, we might be able to save disk space while rebuilding if we started by deleting the cached-descriptors.new file: all of _those_ descriptors are in heap memory.

Also, if we felt particularly nutty, we could split cached-descriptors into a few separate files (say, cached-descs/0 .. cached-descs/F) so that rebuilding any particular file wouldn't need much temporary space.
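
As an illustration of that last idea, the bucket for a descriptor could be keyed off the first nibble of its digest, so each of the 16 files holds roughly 1/16 of the store and rebuilding any one of them needs little temporary space. The helper below is hypothetical, not part of Tor.

  /* Hypothetical sketch: pick one of 16 store files (cached-descs/0 ..
   * cached-descs/F) based on the first nibble of a descriptor's digest. */
  #include <stdio.h>

  #define DIGEST_LEN 20

  static void
  cached_descs_filename(const char *datadir,
                        const unsigned char digest[DIGEST_LEN],
                        char *out, size_t outlen)
  {
    static const char hex[] = "0123456789ABCDEF";
    /* The high nibble of the first digest byte selects the bucket. */
    char bucket = hex[(digest[0] >> 4) & 0x0f];
    snprintf(out, outlen, "%s/cached-descs/%c", datadir, bucket);
  }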

---

A danger to consider: flash memory wears out faster with frequent writes, so we do not want to rebuild more often than necessary.

---

Another way to approach this is to look at our current logic for rebuilding the store. If the store is big enough (over 64K), we rebuild it whenever the journal length is more than half the store length, OR the number of bytes we know have been dropped from the store is at least half the store length.

First off, I don't know whether the "bytes dropped" count is accurate, and whether it includes bytes dropped from the journal. I think the answer is "yes" on both counts, but if it isn't, we should fix that.

Assuming that the totals are accurate, we might do well to have the logic take into account our maximum disk space.
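
For reference, a rough sketch of that rebuild heuristic, extended with a hypothetical disk-space cap, is below. The struct fields and the max_store_bytes parameter are illustrative and may not match the actual routerlist code in detail.

  /* Sketch of the rebuild heuristic described above, plus a hypothetical
   * cap on total disk usage.  Field and parameter names are illustrative. */
  #include <stddef.h>

  struct desc_store {
    size_t store_len;      /* bytes in cached-descriptors */
    size_t journal_len;    /* bytes in cached-descriptors.new */
    size_t bytes_dropped;  /* bytes known to be obsolete in the store */
  };

  static int
  should_rebuild_store(const struct desc_store *st, size_t max_store_bytes)
  {
    if (st->store_len <= (1 << 16))             /* "big enough" means > 64K */
      return 0;
    if (st->journal_len > st->store_len / 2)    /* journal grew too large */
      return 1;
    if (st->bytes_dropped >= st->store_len / 2) /* too much dead weight */
      return 1;
    /* Hypothetical extension: rebuild early when close to the disk budget. */
    if (max_store_bytes &&
        st->store_len + st->journal_len > max_store_bytes)
      return 1;
    return 0;
  }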

comment:5 in reply to:  4 Changed 9 years ago by karsten

Replying to nickm:

Do we already store last-time-mentioned information for descriptors? If not, we could maybe start, and implement a "remember as many as we have room for, from most recently mentioned to least recently mentioned" policy.

I spent quite some time reading Tor sources and learned a lot of interesting stuff. :) But I can't seem to find where we would learn last-time-mentioned information for descriptors. Do you have a hint?

Another way to approach this is looking at our current logic for rebuilding the store. If it is big enough (over 64K), we rebuild it whenever the journal length is more than half the store length, OR the number of bytes we know have dropped from the store is at least half the store length.

I don't understand the 64K constant here. Is that bytes? That's pretty low and presumably meant as a safeguard so that we don't rebuild the store after the first few descriptors, right? With cached-descriptors[.new] files of 17M and 5M, we always end up in the if-case, correct?

From the logs of my directory cache it looks like we're rebuilding our router store 2 or 3 times a day.

Also, I found that we're storing descriptors for up to 2 days as clients and up to 5 days as directory caches. Maybe we can reduce this to 2 days for bridges?
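
To make concrete what that change would amount to, here is a sketch; the constants and the function are illustrative names, not the actual identifiers in Tor's source.

  /* Illustrative sketch of choosing the descriptor retention cutoff.
   * Names are hypothetical; the real constants live in Tor's routerlist code. */
  #include <time.h>

  #define CLIENT_DESC_MAX_AGE   (2 * 24 * 60 * 60)  /* 2 days */
  #define DIRCACHE_DESC_MAX_AGE (5 * 24 * 60 * 60)  /* 5 days */

  static int
  descriptor_is_too_old(time_t published, time_t now,
                        int is_dir_cache, int is_bridge)
  {
    /* Proposed tweak: bridges keep the shorter, client-style cutoff even
     * though they act as directory caches for their clients. */
    time_t max_age = (is_dir_cache && !is_bridge)
                       ? DIRCACHE_DESC_MAX_AGE
                       : CLIENT_DESC_MAX_AGE;
    return now - published > max_age;
  }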

comment:6 Changed 8 years ago by phobos

From talking to people who work with these embedded devices, getting more space is the easier solution. The two Excito boxes have been running as very busy bridges for the past month without an issue. One runs 0.2.1.29, the other 0.2.2.22-alpha.

comment:7 Changed 8 years ago by runa

Yeah, the quick & easy solution here is to attach a USB stick to the Buffalo router and change the DataDirectory in the torrc file to point to the USB. Not sure how well that would work over time, but I think it's worth trying.

comment:8 Changed 8 years ago by phobos

Summary: Torouter breaks with large cached-descriptors[.new] files → Torouter on Buffalo breaks with large cached-descriptors[.new] files

comment:9 in reply to:  7 Changed 8 years ago by fermenthor

Replying to runa:

Yeah, the quick & easy solution here is to attach a USB stick

It's working fine with a 1GB USB flash drive after about one week of use. I don't know how long it will take before the drive wears out, so it would be nice to be able to fit the cached-* files in tmpfs.

comment:10 Changed 8 years ago by runa

fermenthor: is that as a bridge or a client? If it's a bridge, do you have any numbers for how many clients it can handle and at what rate?

comment:11 in reply to:  10 Changed 8 years ago by fermenthor

Replying to runa:

is that as a bridge or a client? If it's a bridge, do you have any numbers for how many clients it can handle and at what rate?

Both bridge and client, though I've not been actively using the client functionality much in recent weeks. I don't know what it's capable of handling, but I've just gathered traffic stats over a 24h period while not using the client at all. I have a cable connection with down/up speeds of 9Mbps/500Kbps.
Total In: 24.4 Mb
Total Out: 20.3 Mb
Max In: 238 Kb/s
Max Out: 85 Kb/s
Distinct clients seen: 31

This was soon after a restart, so I'll redo it later.

As a note for this specific ticket, Tor on my Buffalo uses 75% of RAM, which doesn't leave much for tmpfs. No matter how these data files are organized, not much can be done without adding extra flash storage, unless there's a big change in how Tor deals with this data.

comment:12 Changed 8 years ago by phobos

Priority: blocker → normal

comment:13 Changed 8 years ago by ioerror

Resolution: wontfix
Status: new → closed

It turns out that we don't care about this bug anymore: we won't run on a system without enough disk space unless microdescriptors (MD) are in use.
