The machine hosting archive.tpo will soon be running out of space.
Possible solutions include:
- adding more resources to it; according to weasel, it seems doing this would require a new host, which would be quite expensive
- removing some of the less useful files from archive.tpo, for example Tor Browser versions that are more than 4 years old, or alpha versions that are more than 2 years old
It seems to me that if there's anything in there we don't need to save forever, or that we're comfortable letting somebody else host, that would be the place to start looking.
FWIW: If we go for deleting older Tor Browser releases on a scheduled basis (e.g. like the one boklm mentions in the description), then we should keep at least one bundle for each platform, for historical reasons and as a means to better understand questions that affect older Tor Browser releases (like when bug XYZ first became visible in a release we shipped, etc.).
It seems like the right answer is to get a bigger archive. Maybe we should team up with the folks from archive.org? Deleting past Tor releases from the world would be a shame, especially if we have the option not to.
One way to manage stuff like this is to break it up into smaller pieces and distribute it around. A typical way I manage such archives is with git-annex, which allows for reliable tracking of N copies (say "3 redundant copies") and supports many different "remotes", including Amazon Glacier, the Internet Archive (IA) and so on. It's what I used in the Brazil archival project and it mostly worked. It's hard to use, unfortunately, which may be a big blocker for adoption.
If git-annex is too complicated, we can talk to IA directly. I would recommend, however, against using their web-based upload interface which, even they acknowledge, is terrible and barely usable. I packaged the internetarchive Python client in Debian to work around that problem and it works much better.
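To give an idea, here is a minimal sketch of what an upload through the internetarchive Python client could look like; the item identifier, file names and metadata below are made up for illustration, and credentials are assumed to have been set up beforehand with `ia configure`.

```python
from internetarchive import upload

# Hypothetical item identifier and file names, purely for illustration.
item_id = "tor-archive-example-item"
files = ["tor-0.4.1.5.tar.gz", "tor-0.4.1.5.tar.gz.asc"]

# Upload the files and attach some basic metadata; credentials are
# assumed to have been configured beforehand with `ia configure`.
responses = upload(
    item_id,
    files=files,
    metadata={
        "title": "Example Tor source tarball",
        "mediatype": "software",
        "collection": "test_collection",
    },
)
for r in responses:
    print(r.status_code, r.request.url)
```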
Moving files to IA only shifts the problem, in my opinion: then we have only a single copy, elsewhere, and while we don't need to manage that space anymore, we also don't manage backups and will never know if they drop stuff on us (and they do, sometimes, either deliberately or by mistake). I would propose that if stuff moves out of our "backed-up" infrastructure, it should be stored in at least two administratively distinct locations.
Another such location we could use, apart from commercial providers like Amazon, is the Software Heritage project (WP), which is designed to store copies of source code and software artifacts of all breeds. It might even already have something for Tor.
Otherwise, assuming we can solve this problem ourselves, I think this question boils down to "How big of an archive do we actually need and how fast does it grow?" With the limited Grafana history I had available a week ago, I calculated that we dump roughly ~10GB per week of new stuff on there, but naturally the sample size is too small to take that number seriously. To give you another metric: in the last two weeks (one week later), we have gone from 254GB to 207GB free, eating a whopping 47GB in 15 days, which clocks the rate at ~3GB a day or ~24GB a week. When I looked at it a week ago, we had 220GB left, which gives a rate of 13GB/week over the last week alone, so I would estimate the burn rate is between 10 and 20GB/week, which gives us about 10 to 20 weeks to act on this problem.
Assuming 10GB/week, this means we need ~500GB of new storage every year. In our current setup, that translates into roughly 2x1TB of storage per year because of RAID and backups.
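To make that napkin math easy to redo as new Grafana numbers come in, here is the same arithmetic as a trivial Python sketch; the inputs are just the figures quoted above.

```python
# Burn-rate arithmetic using the figures quoted above; adjust the inputs
# as new Grafana numbers come in.
free_gb = 207                     # free space left on archive.tpo right now
burn_low, burn_high = 10, 20      # estimated growth, in GB per week

weeks_left_worst = free_gb // burn_high   # ~10 weeks at the high estimate
weeks_left_best = free_gb // burn_low     # ~20 weeks at the low estimate

yearly_growth_gb = burn_low * 52          # ~500GB of new data per year at 10GB/week
# double for the RAID mirror, double again for backups: roughly 2x1TB/year
raw_storage_gb = yearly_growth_gb * 2 * 2

print(f"{weeks_left_worst}-{weeks_left_best} weeks before the disk fills up")
print(f"~{yearly_growth_gb}GB of new data per year, "
      f"~{raw_storage_gb / 1000:.0f}TB of raw storage once RAID and backups are counted")
```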
So if we want this problem to go away for ~10 years (assuming the current rate, which is probably inaccurate, at best), we could throw hardware at the problem and give Hetzner another ~200EUR/mth specifically for an archival server. We might be able to save some costs by not backing up the server and using IA/Software Heritage as a fallback, with git-annex as well.
Fundamentally, this is a cost problem. Do you want us to spend time to figure out a proper archival policy and cheap/free storage locations or pay for an archival server?
In any case, I'd be happy to dig deeper into this to figure out the various options beyond the above napkin calculations.
I believe we have similar issues with CollecTor; although we're not going to hit problems very soon, we are seriously lacking any redundancy other than the TPA backups. TPA backups are not guaranteed to be consistent because they are not CollecTor-aware, so I've been thinking about building in some replication mechanisms using S3, probably looking at Glacier and IA as the targets. (S3 has the nice property that operations are atomic.)
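As a rough sketch (not a worked-out design), pushing a CollecTor file to an S3-compatible target could look something like the following; the bucket name and object key are hypothetical, and boto3 is just one possible client.

```python
import boto3

# Hypothetical bucket and object key, purely illustrative.
BUCKET = "tor-metrics-collector-replica"
key = "recent/relay-descriptors/consensuses/2019-07-01-00-00-00-consensus"

s3 = boto3.client("s3")

# Each completed upload appears atomically under its key: readers either
# see the whole object or nothing, which is the property mentioned above.
s3.upload_file(key, BUCKET, key)

# Confirm the object landed, and note its size and ETag for an audit trail.
head = s3.head_object(Bucket=BUCKET, Key=key)
print(head["ContentLength"], head["ETag"])
```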
As far as I know the Software Heritage stack is built on Azure, so that would be independent of IA/AWS if they were interested in also hosting a copy of Tor Metrics' archive.
I would be happy to talk about options in a session in Stockholm if time permits, although it is starting to look as if it is going to be very busy there. We could also have a call to look at options, or we could treat these as different problems and have different solutions to them.
I would be extremely sad if the plan is to delete anything.
I would propose doing both 1 and 2 -- find two places to host all our huge archival stuff. Then hopefully they won't both fail at once. Then we can delete huge obsolete things from our archive, comfortable in the hope that they won't disappear from the world.
This approach has the twin advantages that (a) it helps us reconnect to our partner orgs in this space, rather than trying to do everything ourselves monolith style, and (b) these other orgs are actually supposed to be good at archiving things, whereas it isn't really our specialty.
And then I guess we do some variant of 5 after that, to pick a policy for which old obsolete things we don't need to keep on our hot archive.
I've added OSUOSL and Backblaze to the list as well.
Also, over IRC, roger said:
15:46:30 <+arma1> yep. am leaving that to anarcat for now
not sure if that was referring to IA or the entire process, but in any case, it seems I'm the one bottomlining this anyways, so I'll just assign this to myself.
I can try to reach out to the non-commercial providers. I would definitely need help to figure out step 5 (better archival policies), i.e. what to keep and so on. As for "host our own", I honestly think it's the simplest option, but maybe it's something we can talk about at the July meeting as well...
Trac: Status: new to assigned; Owner: N/A to anarcat
About reaching out to non-commercial providers: could you reach out to the Infrared network and see if providers there would be interested in hosting part of this archive?
Recently I had the idea to archive some early flash proxy/pyobfsproxy browser bundles from circa 2013--some of them were only ever present under https://people.torproject.org/~dcf/ and so what I have locally is a superset of what's at archive.torproject.org (for this specific group of packages). The problem I'm encountering with IA is the automatic malware scan--as soon as I upload a self-extracting Windows .exe package, the virus scan returns positive and automatically darks (hides) the entire item. Here are some attempted uploads that got darked:
ZoneAlarm by Check Point: Not-a-virus:NetTool.Win32.Tor.k
It seems that I can avoid the virus check by structuring the uploads: upload all files except the .exe, let them be virus scanned, then upload the .exe. The upload log says "item already had a curatenote indicating it had been checked, no need to update" and the item remains undarked. But this is no solution; besides being an apparent bug in the malware scanning system, it'll only work until the next time someone runs a batch scan or something, and then the items will disappear. For the sake of example, here are items I managed to upload in that way:
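To make the ordering concrete, here is a hedged sketch of that two-step upload with the internetarchive Python client; the item identifier and file names are invented for illustration.

```python
from internetarchive import upload

# Invented item identifier and file names, only to illustrate the ordering.
item_id = "flashproxy-bundle-2013-example"
non_exe = ["tor-browser-linux32.tar.gz", "sha256sums.txt", "sha256sums.txt.asc"]
exe = ["torbrowser-install.exe"]

# Step 1: upload everything except the .exe and let the malware scan run.
upload(item_id, files=non_exe, metadata={"mediatype": "software"})

# Step 2, later, once the item has been scanned: add the .exe. In the
# behaviour observed above the item keeps its "checked" note and stays
# undarked, but this relies on a quirk, not a guarantee.
upload(item_id, files=exe)
```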
we could hook up git.torproject.org in there if it isn't already. for tarballs, they don't have a user-visible API but could make one if we need to.
i feel this wouldn't fix the problem entirely, in other words, so we might need to look for other solutions. next step is to reach out to the IA people to create a collection or something for our stuff?
i reached out to my contact at IA and wrote to the infrared mailing list.
Another Grafana update: we went from 163GB to 100GB free in the last 14 days, getting closer to the 20 or even 30GB per week mark. At this rate, we'll run out in 3 weeks. In case of emergency, we can tap into 800GB more in the parent host, but we shouldn't burn through all that space because that will create other problems elsewhere.
So, TL;DR: we might be in real trouble here even before Stockholm if we don't act quickly. In two weeks, i'll see if we can get ourselves an extra 100GB here to work around the immediate crash.
Good news: my contacts at the internet archive tell me we can have a collection. We just need to create a user for Tor, and that mostly means figuring out an email address to associate with the account. Which email should we use? It would preferably be a role alias, not an individual user...
I should also mention that an entry-level SX62 at hetzner is 75EUR/mth for 4x10TB (20TB in RAID-1) and would solve this problem for anywhere between 8 to 38 years depending on the burn rate. :p
fwiw, i regained control of https://archive.org/details/@torproject and set the email address to the archive group. next step is to get the IA people to give us a collection to upload our stuff.
awesome, moving the ticket over to you for the archival part. i'm still going to look at making sure we have enough disk space on the box, but we still have some days to spare on that.
what I'd like is to make sure we test the account to see if we can upload stuff without it getting "darked". i can put you in touch with our contact at IA so that you can follow up there as well... would that be alright with you?
Disk space is overwhelmingly used by the TBB team, but the network team is next. Together, they account for 98% of the disk space in use.
Everything else is less than 1GB, so it can be ignored (but someone should still take care of archiving it). That's tricky: it's a wide variety of "stuff" with various responsible parties (or none at all). It would be great if someone from one of the larger teams volunteered to just sync that stuff up, or to find the responsible parties to do so.
Do you mean https://people.torproject.org/~dcf/? Most of what I have to archive is no longer there--admins make us delete large files from that host after a while. But I kept a local copy of everything I had to delete, which is what I want to archive now. But anyway thanks, there's other stuff under there that it's good to have a copy of.
> Most of what I have to archive is no longer there--admins make us delete large files from that host after a while.
Those damn admins heh? ;) Good thinking to keep a copy!
> But I kept a local copy of everything I had to delete, which is what I want to archive now. But anyway thanks, there's other stuff under there that it's good to have a copy of.
Excellent. I see you have successfully uploaded this item:
This makes me think: as we grant other people access to those resources, it would be great if we could keep an audit trail of what goes where and how.
For example, if we know we have a copy of the source code of X on IA and archive.tpo, but the latter goes down, how do we know it's safe to restore from IA, or even that the copy is there at all? There's historical knowledge in people's brains, but that's not good enough.
That's exactly the kind of stuff i use git-annex for. I know it's not the easiest software to use, but by storing everything in git, we keep an audit trail of who does what. It also keeps track of where stuff is, and keeps a checksum of everything. By wrapping this into (signed) git tags, we also get verifiability.
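git-annex gives us that for free, but even without it, the minimum we'd want is something like the following hypothetical sketch: a small manifest, kept in git, that records a checksum and the known locations for each artifact. The file name and helper here are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large tarballs don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record(manifest_path: str, artifact: str, locations: list) -> dict:
    """Append one "what lives where" entry to a JSON manifest kept in git."""
    entry = {
        "file": artifact,
        "sha256": sha256_of(Path(artifact)),
        "locations": locations,  # e.g. ["archive.torproject.org", "archive.org:torproject"]
        "recorded": datetime.now(timezone.utc).isoformat(),
    }
    manifest = Path(manifest_path)
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append(entry)
    manifest.write_text(json.dumps(entries, indent=2) + "\n")
    return entry

# Example: note that a tarball exists both on our archive and on IA.
# record("archive-manifest.json", "tor-0.4.1.5.tar.gz",
#        ["archive.torproject.org", "archive.org:torproject"])
```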
Hi! Anarcat asked me to comment here explaining what we (the network team) need.
As far as I know, the only things that we make that get uploaded to archive.tpo are our source distribution tarballs and their accompanying signatures. We should make sure that none of these ever gets lost. We produce 1-4 of these per month and upload them to dist-master.torproject.org, which then syncs them to dist.torproject.org. Previously they have been synced to archive.torproject.org automatically. We remove them from dist-master when they are sufficiently obsolete.
The easiest solution for us would be to leave all of our packages in place on archive.tpo and do nothing at all. This may not be feasible for reasons of disk space.
The next easiest solution would be to have some automatic process that uploads these tarballs (and their signature files) to archive.org whenever they are uploaded to dist.
If neither of those is possible, we need permissions and instructions for archiving these tarballs manually. These instructions should get folded into doc/HACKING/ReleasingTor.md in our git repository, and it would be great if they were so simple that a C developer could do them without messing up. :)
awesome nickm, thanks for the clarification. if someone could figure out that sync process, it shouldn't be too hard to add a step that uploads the goods to IA as well. they support S3 and there's also a command-line Python thing that can talk to it which we could use; that way we would replicate our stuff there. we could simply iterate through the existing archive to upload the existing content.
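as a sketch of what that extra step could look like (using the Python client mentioned above): walk the dist tree and upload any tarball+signature pair that doesn't already have an item on IA. the dist path, item naming scheme and collection name below are all made up and would need to be confirmed against the real sync process.

```python
from pathlib import Path
from internetarchive import get_item, upload

# Assumed dist layout, item naming scheme and collection name; all of
# these would need to be confirmed against the real sync process.
DIST_ROOT = Path("/srv/dist-master.torproject.org/htdocs")
COLLECTION = "torproject"

for tarball in sorted(DIST_ROOT.rglob("tor-*.tar.gz")):
    sig = tarball.with_name(tarball.name + ".asc")
    item_id = "torproject-" + tarball.name.replace(".tar.gz", "")
    if get_item(item_id).exists:
        continue  # already replicated on IA, skip it
    files = [str(tarball)] + ([str(sig)] if sig.exists() else [])
    upload(item_id, files=files,
           metadata={"mediatype": "software", "collection": COLLECTION})
    print("uploaded", item_id)
```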
i'm not familiar with the sync process so, if i could avoid it, i would prefer to not have to implement this myself. i'd be happy to delegate the IA account credentials to whoever takes on that task, however, and have already done so to dcf who did some tests.
that said, i agree that the IA process is taking slightly too long for us to fix the problem in that way right now, so i'll probably go ahead and just allocate new hardware for this service so we can move on. it would still be nice, however, to hook IA up into the release process somehow.
who knows how the dist / archive.tpo stuff works anyways?
i built a new machine from scratch at hetzner, as planned: an SX62 with 4x10TB that should give us around 20TB of storage in the long term. the install process was documented through the creation of scripts in tsa-misc that were used almost systematically to install the system and will hopefully be useful to install the next physical boxes.
i have created a 5TB physical volume on the new box, which more than doubles our available space and should give us at least a year more. i'm leaving a lot of free space in the volume group in case we need to do snapshots or other funky things.
the rsync is now in progress and I will add the box, named archive-01.torproject.org, into our usual infrastructure. once the rsync is complete, i will hook the box into the mirroring system...
rsync is complete, and the machine survived a reboot: it returned automatically (thanks to mandos), probably because the raid array had sync'd. so all seems good, but i think some bits might still be missing, because the backups of staticforme and arlgirdense are twice as big as archive-01. this could be just because of incrementals, however.
weasel was saying we might be missing some scripts from the archive team, but I am not sure what those are. i need to email the archive team but will delegate the followup of this to weasel for now...