Opened 3 months ago

Last modified 6 days ago

#29697 new defect

archive.tpo will soon run out of space

Reported by: boklm Owned by:
Priority: Medium Milestone:
Component: Internal Services Version:
Severity: Normal Keywords:
Cc: helix, karsten, lunar, mikeperry, nickm, gk Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The machine hosting archive.tpo will soon be running out of space.

Possible solutions include:

  • adding more resources to it. According to weasel, this would require a new host, which would be quite expensive.
  • removing some of the less useful files from archive.tpo. For example, we could remove Tor Browser versions that are more than 4 years old, or alpha versions that are more than 2 years old.

Child Tickets

Change History (6)

comment:1 Changed 3 months ago by nickm

I just ran du -sm on /srv/archive.torproject.org/htdocs/*/*. The largest directories were (sizes in megabytes):

472	/srv/archive.torproject.org/htdocs/tor-package-archive/technology-preview
509	/srv/archive.torproject.org/htdocs/tor-package-archive/vidalia
568	/srv/archive.torproject.org/htdocs/tor-package-archive/obfsproxy
585	/srv/archive.torproject.org/htdocs/tor-package-archive/onionoo
1269	/srv/archive.torproject.org/htdocs/tor-package-archive/rpm
4810	/srv/archive.torproject.org/htdocs/tor-package-archive/tormessenger
7326	/srv/archive.torproject.org/htdocs/tor-package-archive/vidalia-bundles
11847	/srv/archive.torproject.org/htdocs/amnesia.boum.org/tails
286565	/srv/archive.torproject.org/htdocs/tor-package-archive/debs
1515387	/srv/archive.torproject.org/htdocs/tor-package-archive/torbrowser

It seems to me that if there's anything in there that we don't need to keep forever, or that we're comfortable letting somebody else host, those directories would be the place to start looking.
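For reference, a one-liner along these lines reproduces that sorted listing (a sketch, assuming GNU coreutils):

  # Summarize each archive subdirectory in megabytes, smallest first.
  du -sm /srv/archive.torproject.org/htdocs/*/* | sort -n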

comment:2 Changed 3 months ago by nickm

(Tor sources themselves are 1173 megabytes in total, which would put them between onionoo and rpm in the list above.)

(edited to add 'mega')

Last edited 3 months ago by nickm

comment:3 Changed 3 months ago by gk

Cc: gk added

FWIW: if we go for deleting older Tor Browser releases on a schedule (e.g. the one boklm mentions in the description), then we should keep at least one bundle per platform, both for historical reasons and to help answer questions about older Tor Browser releases (e.g. figuring out in which release we first shipped bug XYZ).

comment:4 Changed 9 days ago by arma

It seems like the right answer is to get a bigger archive. Maybe we should team up with the folks from archive.org? Deleting past Tor releases from the world would be a shame, especially if we have the option not to.

comment:5 Changed 6 days ago by anarcat

TL;DR: possible paths:

  1. Internet Archive (IA)
  2. Software Heritage
  3. commercial storage (e.g. Amazon Glacier)
  4. host our own
  5. spend more time deciding on archival policies
  6. mix of the above

One way to manage stuff like this is to break it up into smaller pieces and distribute it around. A typical way I manage such archives is with git-annex, which reliably tracks N copies of each file (say, "3 redundant copies") and supports *many* different remotes, including Amazon Glacier, the Internet Archive (IA), and so on. It's what I used in the Brazil archival project, and it mostly worked. Unfortunately it's hard to use, which may be a big blocker for adoption.
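To make the git-annex idea concrete, here is a rough sketch of such a setup (the remote name and parameters are illustrative, not a tested configuration; the glacier remote also needs glacier-cli installed):

  # Track the archive in git-annex, requiring 3 copies of every file.
  git init tor-archive && cd tor-archive
  git annex init "archive.tpo"
  git annex numcopies 3
  # Register an Amazon Glacier special remote (parameters illustrative).
  git annex initremote glacier type=glacier encryption=none datacenter=us-east-1
  # Add files and push copies out; git-annex refuses to drop content
  # if that would leave fewer than the configured number of copies.
  git annex add .
  git annex copy --to glacier
  git annex whereis    # list where each file's copies live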

If git-annex is too complicated, we can talk to IA directly. I would recommend against using their web-based upload interface, however, which even they acknowledge is terrible and barely usable. I packaged the internetarchive Python client in Debian to work around that problem, and it works much better.
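For illustration, an upload through that client's ia command-line tool looks roughly like this (the item identifier, file name, and metadata are made up):

  # One-time: store IA credentials, then upload a bundle to an item.
  ia configure
  ia upload tor-browser-archive tor-browser-linux64-8.0.8_en-US.tar.xz \
      --metadata="mediatype:software" --metadata="title:Tor Browser archive"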

Moving files to IA only shifts the problem, in my opinion: then we have a single copy, elsewhere. While we would no longer need to manage that space, we also would not manage the backups, and we would never know if they drop stuff on us (which they do, sometimes, either deliberately or by mistake). I would propose that if stuff moves out of our backed-up infrastructure, it should be stored in at least two administratively distinct locations.

Another such location we could use, apart from commercial providers like Amazon, is the Software Heritage project, which is *designed* to store copies of source code and software artifacts of all breeds. It might even already have something for Tor.

Otherwise, assuming we can solve this problem ourselves, I think the question boils down to: how big an archive do we actually need, and how fast does it grow? With the limited Grafana history I had available a week ago, I calculated that we dump roughly ~10GB of new stuff on there per week, but naturally the sample size is too small to take that number seriously. To give you another metric, over the last two weeks (one week later now) we have gone from 254GB to 207GB of free space, eating a whopping 47GB in 14 days, which puts the rate at ~3.4GB a day or ~24GB a week. When I looked a week ago, we had 220GB left, which gives a rate of 13GB over the past week alone. So I would estimate the burn rate at somewhere between 10 and 20GB/week, which gives us about 10 to 20 weeks to act on this problem.
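Spelled out with shell arithmetic (same numbers as above):

  # 207GB free; burn rate estimated at 10-20GB/week.
  free_gb=207
  echo "$((free_gb / 20)) to $((free_gb / 10)) weeks left"   # -> 10 to 20 weeks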

Assuming 10GB/week, we need ~500GB of *new* storage every year. With our current setup, that translates into roughly 2x1TB of raw storage per year once RAID and backups are accounted for.

So if we want this problem to go away for ~10 years (assuming the current rate, which is probably inaccurate at best), we could throw hardware at it and pay Hetzner another ~200EUR/month specifically for an archival server. We might save some costs by *not* backing up that server and instead treating IA/Software Heritage (possibly via git-annex) as the fallback.

Fundamentally, this is a cost problem: do you want us to spend time figuring out a proper archival policy and cheap/free storage locations, or to pay for an archival server?

In any case, I'd be happy to dig deeper into this to figure out the various options beyond the above napkin calculations.

comment:6 Changed 6 days ago by irl

I believe we have similar issues with CollecTor: although we're not going to hit problems very soon, we are seriously lacking any redundancy other than the TPA backups. TPA backups are not guaranteed to be consistent because they are not CollecTor-aware, so I've been thinking about building in some replication mechanism using S3, probably with Glacier and IA as the targets. (S3 has the nice property that operations are atomic.)
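As a rough sketch of the kind of replication I have in mind, assuming the aws CLI and a hypothetical bucket name and local path (and note that this plain mirror is not yet CollecTor-aware):

  # Mirror the CollecTor output directory to S3, stored in the Glacier class.
  aws s3 sync /srv/collector.torproject.org/out \
      s3://tor-metrics-archive/collector --storage-class GLACIER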

As far as I know, the Software Heritage stack is built on Azure, so it would be independent of IA/AWS if they were interested in also hosting a copy of Tor Metrics' archive.

I would be happy to talk about options in a session in Stockholm if time permits, although it is starting to look as if it is going to be very busy there. We could also have a call to look at options, or we could treat these as separate problems with separate solutions.

I would be extremely sad if the plan is to delete anything.
