The machine hosting archive.tpo will soon be running out of space.
Possible solutions include:
- adding more resources to it; according to weasel, it seems doing this would require a new host, which would be quite expensive
- removing some of the less useful files from archive.tpo, for example Tor Browser versions that are more than 4 years old, or alpha versions that are more than 2 years old
It seems to me that if there's anything in there we don't need to save forever, or that we're comfortable letting somebody else host, that would be the place to start looking.
FWIW: If we go for deleting older Tor Browser releases on a scheduled basis (e.g. like the one boklm mentions in the description), then we should keep at least one bundle for each platform, for historical reasons and as a means to better understand questions that affect older Tor Browser releases (like when bug XYZ first became visible in a release we shipped, etc.).
It seems like the right answer is to get a bigger archive. Maybe we should team up with the folks from archive.org? Deleting past Tor releases from the world would be a shame, especially if we have the option not to.
One way to manage stuff like this is to break it up into smaller pieces and distribute it around. A typical way I manage such archives is with git-annex, which allows for reliable tracking of N copies (say "3 redundant copies") and supports many different "remotes", including Amazon Glacier, the Internet Archive (IA) and so on. It's what I used in the Brazil archival project and it mostly worked. It's hard to use, unfortunately, which may be a big blocker for adoption.
If git-annex is too complicated, we can talk to IA directly. I would recommend, however, against using their web-based upload interface which, even they acknowledge, is terrible and barely usable. I packaged the internetarchive Python client in Debian to work around that problem and it works much better.
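To give an idea, here is a minimal sketch of what an upload through the internetarchive Python client could look like; the item identifier, file names and metadata below are made up for illustration, and credentials are assumed to have been set up beforehand with `ia configure`.

```python
from internetarchive import upload

# Hypothetical item identifier and file names, purely for illustration.
item_id = "tor-archive-example-item"
files = ["tor-0.4.1.5.tar.gz", "tor-0.4.1.5.tar.gz.asc"]

# Upload the files and attach some basic metadata; credentials are
# assumed to have been configured beforehand with `ia configure`.
responses = upload(
    item_id,
    files=files,
    metadata={
        "title": "Example Tor source tarball",
        "mediatype": "software",
        "collection": "test_collection",
    },
)
for r in responses:
    print(r.status_code, r.request.url)
```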
Moving files to IA only shifts the problem, in my opinion: then we have only a single copy, elsewhere, and while we don't need to manage that space anymore, we also don't manage backups and will never know if they drop stuff on us (and they do, sometimes, either deliberately or by mistake). I would propose that if stuff moves out of our "backed-up" infrastructure, it should be stored in at least two administratively distinct locations.
Another such location we could use, apart from commercial providers like Amazon, is the Software Heritage project (WP), which is designed to store copies of source code and software artifacts of all breeds. It might even already have something for Tor.
Otherwise, assuming we can solve this problem ourselves, I think this question boils down to "How big of an archive do we actually need and how fast does it grow?" With the limited Grafana history I had available a week ago, I calculated that we dump roughly ~10GB per week of new stuff on there, but naturally the sample size is too small to take that number seriously. To give you another metric: in the last two weeks (one week later), we have gone from 254GB to 207GB free, eating a whopping 47GB in 15 days, which clocks the rate at ~3GB a day or ~24GB a week. When I looked at it a week ago, we had 220GB left, which gives a rate of 13GB/week over the last week alone, so I would estimate the burn rate is between 10 and 20GB/week, which gives us about 10 to 20 weeks to act on this problem.
Assuming 10GB/week, this means we need ~500GB of new storage every year. In our current setup, that translates into roughly 2x1TB of storage per year because of RAID and backups.
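To make that napkin math easy to redo as new Grafana numbers come in, here is the same arithmetic as a trivial Python sketch; the inputs are just the figures quoted above.

```python
# Burn-rate arithmetic using the figures quoted above; adjust the inputs
# as new Grafana numbers come in.
free_gb = 207                     # free space left on archive.tpo right now
burn_low, burn_high = 10, 20      # estimated growth, in GB per week

weeks_left_worst = free_gb // burn_high   # ~10 weeks at the high estimate
weeks_left_best = free_gb // burn_low     # ~20 weeks at the low estimate

yearly_growth_gb = burn_low * 52          # ~500GB of new data per year at 10GB/week
# double for the RAID mirror, double again for backups: roughly 2x1TB/year
raw_storage_gb = yearly_growth_gb * 2 * 2

print(f"{weeks_left_worst}-{weeks_left_best} weeks before the disk fills up")
print(f"~{yearly_growth_gb}GB of new data per year, "
      f"~{raw_storage_gb / 1000:.0f}TB of raw storage once RAID and backups are counted")
```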
So if we want this problem to go away for ~10 years (assuming the current rate, which is probably inaccurate, at best), we could throw hardware at the problem and give Hetzner another ~200EUR/mth specifically for an archival server. We might be able to save some costs by not backing up the server and using IA/Software Heritage as a fallback, with git-annex as well.
Fundamentally, this is a cost problem. Do you want us to spend time to figure out a proper archival policy and cheap/free storage locations or pay for an archival server?
In any case, I'd be happy to dig deeper into this to figure out the various options beyond the above napkin calculations.
I believe we have similar issues with CollecTor; although we're not going to hit problems very soon, we are seriously lacking any redundancy other than the TPA backups. TPA backups are not guaranteed to be consistent because they are not CollecTor-aware, so I've been thinking about building in some replication mechanisms using S3, probably looking at Glacier and IA as the targets. (S3 has the nice property that operations are atomic.)
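As a rough sketch (not a worked-out design), pushing a CollecTor file to an S3-compatible target could look something like the following; the bucket name and object key are hypothetical, and boto3 is just one possible client.

```python
import boto3

# Hypothetical bucket and object key, purely illustrative.
BUCKET = "tor-metrics-collector-replica"
key = "recent/relay-descriptors/consensuses/2019-07-01-00-00-00-consensus"

s3 = boto3.client("s3")

# Each completed upload appears atomically under its key: readers either
# see the whole object or nothing, which is the property mentioned above.
s3.upload_file(key, BUCKET, key)

# Confirm the object landed, and note its size and ETag for an audit trail.
head = s3.head_object(Bucket=BUCKET, Key=key)
print(head["ContentLength"], head["ETag"])
```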
As far as I know the Software Heritage stack is built on Azure, so that would be independent of IA/AWS if they were interested in also hosting a copy of Tor Metrics' archive.
I would be happy to talk about options in a session in Stockholm if time permits, although it is starting to look as if it is going to be very busy there. We could also have a call to look at options, or we could treat these as different problems and have different solutions to them.
I would be extremely sad if the plan is to delete anything.
I would propose doing both 1 and 2 -- find two places to host all our huge archival stuff. Then hopefully they won't both fail at once. Then we can delete huge obsolete things from our archive, comfortable in the hope that they won't disappear from the world.
This approach has the twin advantages that (a) it helps us reconnect to our partner orgs in this space, rather than trying to do everything ourselves monolith style, and (b) these other orgs are actually supposed to be good at archiving things, whereas it isn't really our specialty.
And then I guess we do some variant of 5 after that, to pick a policy for which old obsolete things we don't need to keep on our hot archive.
I've added OSUOSL and Backblaze to the list as well.
Also, over IRC, roger said:
15:46:30 <+arma1> yep. am leaving that to anarcat for now
not sure if that was referring to IA or the entire process, but in any case, it seems I'm the one bottomlining this anyways, so I'll just assign this to myself.
I can try to reach out to the non-commercial providers. I would definitely need help to figure out step 5 (better archival policies), i.e. what to keep and so on. As for "host our own", I honestly think it's the simplest option, but maybe it's something we can talk about at the July meeting as well...
Trac: Status: new to assigned; Owner: N/A to anarcat
About reaching out to non-commercial providers: could you reach out to the Infrared network and see if providers there would be interested in hosting part of this archive?
Recently I had the idea to archive some early flash proxy/pyobfsproxy browser bundles from circa 2013--some of them were only ever present under https://people.torproject.org/~dcf/ and so what I have locally is a superset of what's at archive.torproject.org (for this specific group of packages). The problem I'm encountering with IA is the automatic malware scan--as soon as I upload a self-extracting Windows .exe package, the virus scan returns positive and automatically darks (hides) the entire item. Here are some attempted uploads that got darked:
ZoneAlarm by Check Point: Not-a-virus:NetTool.Win32.Tor.k
It seems that I can avoid the virus check by structuring the uploads: upload all files except the .exe, let them be virus scanned, then upload the .exe. The upload log says "item already had a curatenote indicating it had been checked, no need to update" and the item remains undarked. But this is no solution; besides being an apparent bug in the malware scanning system, it'll only work until the next time someone runs a batch scan or something, and then the items will disappear. For the sake of example, here are items I managed to upload in that way:
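To make the ordering concrete, here is a hedged sketch of that two-step upload with the internetarchive Python client; the item identifier and file names are invented for illustration.

```python
from internetarchive import upload

# Invented item identifier and file names, only to illustrate the ordering.
item_id = "flashproxy-bundle-2013-example"
non_exe = ["tor-browser-linux32.tar.gz", "sha256sums.txt", "sha256sums.txt.asc"]
exe = ["torbrowser-install.exe"]

# Step 1: upload everything except the .exe and let the malware scan run.
upload(item_id, files=non_exe, metadata={"mediatype": "software"})

# Step 2, later, once the item has been scanned: add the .exe. In the
# behaviour observed above the item keeps its "checked" note and stays
# undarked, but this relies on a quirk, not a guarantee.
upload(item_id, files=exe)
```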
we could hook up git.torproject.org in there if it isn't already. for tarballs, they don't have a user-visible API but could make one if we need to.
i feel this wouldn't fix the problem entirely, in other words, so we might need to look for other solutions. next step is to reach out to the IA people to create a collection or something for our stuff?
i reached out to my contact at IA and wrote to the infrared mailing list.
Another Grafana update: we went from 163GB to 100GB free in the last 14 days, getting closer to the 20 or even 30GB per week mark. At this rate, we'll run out in 3 weeks. In case of emergency, we can tap into 800GB more in the parent host, but we shouldn't burn through all that space because that will create other problems elsewhere.
So, TL;DR: we might be in real trouble here even before Stockholm if we don't act quickly. In two weeks, i'll see if we can get ourselves an extra 100GB here to work around the immediate crash.
Good news: my contacts at the internet archive tell me we can have a collection. We just need to create a user for Tor, and that mostly means figuring out an email address to associate with the account. Which email should we use? It would preferably be a role alias, not an individual user...
I should also mention that an entry-level SX62 at hetzner is 75EUR/mth for 4x10TB (20TB in RAID-1) and would solve this problem for anywhere between 8 to 38 years depending on the burn rate. :p
fwiw, i regained control of https://archive.org/details/@torproject and set the email address to the archive group. next step is to get the IA people to give us a collection to upload our stuff.
awesome, moving the ticket over to you for the archival part. i'm still going to look at making sure we have enough disk space on the box, but we still have some days to spare on that.
what I'd like is to make sure we test the account to see if we can upload stuff without it getting "darked". i can put you in touch with our contact at IA so that you can follow up there as well... would that be alright with you?
Disk space is overwhelmingly used by the TBB team, but the network team is next. Together, they account for 98% of the disk space in use.
Everything else is less than 1GB, so it can be ignored (but someone should still take care of archiving it). That's tricky: it's a wide variety of "stuff" with various responsible parties (or none at all). It would be great if someone from one of the larger teams volunteered to just sync that stuff up, or to find the responsible parties to do so.
Do you mean https://people.torproject.org/~dcf/? Most of what I have to archive is no longer there--admins make us delete large files from that host after a while. But I kept a local copy of everything I had to delete, which is what I want to archive now. But anyway thanks, there's other stuff under there that it's good to have a copy of.
> Most of what I have to archive is no longer there--admins make us delete large files from that host after a while.
Those damn admins heh? ;) Good thinking to keep a copy!
> But I kept a local copy of everything I had to delete, which is what I want to archive now. But anyway thanks, there's other stuff under there that it's good to have a copy of.
Excellent. I see you have successfully uploaded this item:
This makes me think: as we grant other people access to those resources, it would be great if we could keep an audit trail of what goes where and how.
For example, if we know we have a copy of the source code of X on IA and archive.tpo, but the latter goes down, how do we know it's safe to restore from IA, or even that the copy is there at all? There's historical knowledge in people's brains, but that's not good enough.
That's exactly the kind of stuff i use git-annex for. I know it's not the easiest software to use, but by storing everything in git, we keep an audit trail of who does what. It also keeps track of where stuff is, and keeps a checksum of everything. By wrapping this into (signed) git tags, we also get verifiability.
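git-annex gives us that for free, but even without it, the minimum we'd want is something like the following hypothetical sketch: a small manifest, kept in git, that records a checksum and the known locations for each artifact. The file name and helper here are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large tarballs don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record(manifest_path: str, artifact: str, locations: list) -> dict:
    """Append one "what lives where" entry to a JSON manifest kept in git."""
    entry = {
        "file": artifact,
        "sha256": sha256_of(Path(artifact)),
        "locations": locations,  # e.g. ["archive.torproject.org", "archive.org:torproject"]
        "recorded": datetime.now(timezone.utc).isoformat(),
    }
    manifest = Path(manifest_path)
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append(entry)
    manifest.write_text(json.dumps(entries, indent=2) + "\n")
    return entry

# Example: note that a tarball exists both on our archive and on IA.
# record("archive-manifest.json", "tor-0.4.1.5.tar.gz",
#        ["archive.torproject.org", "archive.org:torproject"])
```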
Hi! Anarcat asked me to comment here explaining what we (the network team) need.
As far as I know, the only things that we make that get uploaded to archive.tpo are our source distribution tarballs and their accompanying signatures. We should make sure that none of these ever gets lost. We produce 1-4 of these per month and upload them to dist-master.torproject.org, which then syncs them to dist.torproject.org. Previously they have been synced to archive.torproject.org automatically. We remove them from dist-master when they are sufficiently obsolete.
The easiest solution for us would be to leave all of our packages in place on archive.tpo and do nothing at all. This may not be feasible for reasons of disk space.
The next easiest solution would be to have some automatic process that uploads these tarballs (and their signature files) to archive.org whenever they are uploaded to dist.
If neither of those is possible, we need permissions and instructions for archiving these tarballs manually. These instructions should get folded into doc/HACKING/ReleasingTor.md in our git repository, and it would be great if they were so simple that a C developer could do them without messing up. :)
awesome nickm, thanks for the clarification. if someone could figure out that sync process, it shouldn't be too hard to add a step that uploads the goods to IA as well. they support S3 and there's also a command-line Python thing that can talk to it which we could use; that way we would replicate our stuff there. we could simply iterate through the existing archive to upload the existing content.
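as a sketch of what that extra step could look like (using the Python client mentioned above): walk the dist tree and upload any tarball+signature pair that doesn't already have an item on IA. the dist path, item naming scheme and collection name below are all made up and would need to be confirmed against the real sync process.

```python
from pathlib import Path
from internetarchive import get_item, upload

# Assumed dist layout, item naming scheme and collection name; all of
# these would need to be confirmed against the real sync process.
DIST_ROOT = Path("/srv/dist-master.torproject.org/htdocs")
COLLECTION = "torproject"

for tarball in sorted(DIST_ROOT.rglob("tor-*.tar.gz")):
    sig = tarball.with_name(tarball.name + ".asc")
    item_id = "torproject-" + tarball.name.replace(".tar.gz", "")
    if get_item(item_id).exists:
        continue  # already replicated on IA, skip it
    files = [str(tarball)] + ([str(sig)] if sig.exists() else [])
    upload(item_id, files=files,
           metadata={"mediatype": "software", "collection": COLLECTION})
    print("uploaded", item_id)
```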
i'm not familiar with the sync process so, if i could avoid it, i would prefer to not have to implement this myself. i'd be happy to delegate the IA account credentials to whoever takes on that task, however, and have already done so to dcf who did some tests.
that said, i agree that the IA process is taking slightly too long for us to fix the problem in that way right now, so i'll probably go ahead and just allocate new hardware for this service so we can move on. it would still be nice, however, to hook IA up into the release process somehow.
who knows how the dist / archive.tpo stuff works anyways?
i built a new machine from scratch at hetzner, as planned: an SX62 with 4x10TB that should give us around 20TB of storage in the long term. the install process was documented through the creation of scripts in tsa-misc that were used almost systematically to install the system and will hopefully be useful to install the next physical boxes.
i have created a 5TB physical volume on the new box, which more than doubles our available space and should give us at least a year more. i'm leaving a lot of free space in the volume group in case we need to do snapshots or other funky things.
the rsync is now in progress and I will add the box, named archive-01.torproject.org, into our usual infrastructure. once the rsync is complete, i will hook the box into the mirroring system...
rsync is complete, and the machine survived a reboot: it returned automatically (thanks to mandos), probably because the raid array had sync'd. so all seems good, but i think some bits might still be missing, because the backups of staticforme and arlgirdense are twice as big as archive-01. this could be just because of incrementals, however.
weasel was saying we might be missing some scripts from the archive team, but I am not sure what those are. i need to email the archive team but will delegate the followup of this to weasel for now...