Tor sysadmin team meeting Brussels 2019 minutes

System lifecycle

Service owners going away/services becoming irrelevant.

Maybe systems (like $host.torproject.org) should have an expiration date, of say a year or two in the future. After that, the system would either shut down automatically or not run any services. So, each host should also have a list of unix-groups that are stakeholders of (services on) that host. They would be mailed a week or a month before the expiration date and would be expected to reply to torproject-admin@ to extend the expiration date of that host.
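
The reminder logic above could be sketched roughly as follows. This is only an illustration: the `Host` record, its field names and the 30-day notice period are assumptions made for the example, not the LDAP fields that will actually be added.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Host:
    # Hypothetical record; the real LDAP attribute names are not decided here.
    name: str
    expires: date
    stakeholder_groups: list[str] = field(default_factory=list)


def hosts_needing_reminder(hosts, today, notice=timedelta(days=30)):
    """Return hosts that have not expired yet but will expire within `notice`."""
    return [h for h in hosts if today <= h.expires <= today + notice]


def expired_hosts(hosts, today):
    """Return hosts whose expiration date has already passed."""
    return [h for h in hosts if h.expires < today]
```

A mail script would then look up the members of each host's `stakeholder_groups` and ask them to reply to torproject-admin@ if the host should be kept.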

ACTION ITEMS

  • Make ticket(s) (qbi) (#29304)
  • Adapt ldap to have the extra fields (weasel) (#29305)
  • Write script to mail people (weasel) (#29306)
  • Figure out how to shut down VMs cleanly or prevent services from being run (manually for now, with a nagios tie-in?)
  • Remove the expiration columns from infrastructure page (done)
  • Link db.tpo's machines page on infrastructure page and have the info there.
  • Go over ldap, add information about stakeholders and add an expiration date

User account lifecycle/group memberships

  • Set expiration dates on accounts (maybe just shadowExpire) and auto-lock accounts.
  • Account holders to be mailed beforehand, and if they reply we extend the account.
  • E-mail should keep working for another x days/weeks/months even after an account has been locked.
  • Yearly group membership pings. Ask people to re-confirm which groups they want to be in.
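
A minimal sketch of the account states described above, assuming shadowExpire follows the usual shadow(5) convention of counting days since 1970-01-01. The 90-day mail grace period and the state names are placeholders for the example; the minutes leave the actual duration open ("x days/weeks/months").

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def shadow_expire_to_date(days):
    """Convert a shadowExpire value (days since 1970-01-01) to a date."""
    return EPOCH + timedelta(days=days)


def account_state(shadow_expire, today, mail_grace=timedelta(days=90)):
    """Classify an account; mail_grace is an assumed value, not a decision."""
    expiry = shadow_expire_to_date(shadow_expire)
    if today < expiry:
        return "active"
    if today < expiry + mail_grace:
        # Login is locked, but e-mail should keep working for a while.
        return "locked, mail still delivered"
    return "locked"
```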

ACTION ITEMS

  • Set expiration dates on accounts (weasel)
  • Set expiration dates on groups (weasel)
  • Adapt ldap scripts to honor them (weasel)
  • Write spam-script to run weekly (weasel)

Puppetize more?

I.e., everything a service needs from TSA.

DECISION

  • Do it for new stuff, but do not go and mess with existing things that work.

Packaging services / maintaining ansible for services

DECISION

Experience from Debian says that packaging services does not really work out well; let's have service owners use ansible, salt or whatever and make sure we set permissions and similar correctly for them.

Making more of our repositories public

... in particular the puppet repo.

  • Strategy: Start a fresh repo, and move files there individually.
  • Leave out virtual/aliases thing.
  • Primary host is pauli, with mirror to git-rw.tpo and git.tpo.
  • Keep account-keyring read-all for tor people

ACTION ITEMS

  • Get it done (ln5)

Trending/monitoring

... now that munin is gone. Prometheus?

ACTION ITEMS

  • Investigate disk storage and latency requirements, as well as RAM requirements, for prometheus (hiro)
  • Set up a VM for prometheus as soon as we know the prometheus requirements (ln5)

Domain portfolio (like "should we drop *.is")

We serve DNS for a bunch of domains we do not "own" (fr, is, nl, se). We should stop doing that. If/when we own them, we can still park them ourselves. We should not do the mail and web redirect game with all the TLDs.

ACTION ITEMS

  • Ask mgmt if they want to cover costs for All The Domains; ANSWER: no, let's drop them but wait for the new sysadmin to do this
  • Reach out to owners. Ask for either a transfer (if yes was the answer to (1)) or that they stop pointing to our nameservers

Office solution (nextcloud/owncloud)

... so that people don't use google docs all the time.

If we stay on google, things should move to a Tor group account.

PUNT TO

  • Saturday meeting with leadership.

What services are missing on infrastructure page

CONCLUSION

Identified two missing entries.

ACTION ITEMS

  • Reconcile with what's in DNS and LDAP

Dedicated sudo passwords

  • We have allowed dedicated sudo passwords since 2016, but we have not enforced their use; the LDAP password is still accepted as well.
  • We should stop accepting LDAP passwords for sudo authentication.

ACTION ITEMS

  • Send mail to -project (weasel)
  • Configure pam on all but the CRM hosts to only accept the sudo passwords (weasel)
  • Talk to giantrabbit if they can set sudo passwords, depending on the answer, make sudo to crm for them passwordless. (ln5)
  • Make PAM on the CRM hosts match the rest of tpo (ln5)

Set up a loghost

ACTION ITEMS

  • jfdi

DNS providers

Right now, we do our own authoritative DNS. We would like to move away from that. We added dnsnode in-zone. We should add at least a second provider, and then retire our hosts. Then, we should update the delegation(s) in the parent(s).

ACTION ITEMS

  • Shop around and figure out prices (zero would be nice) (ln5)

Should we have regular team meetings?

DECISION

Monthly on IRC, first Monday of the month at 17:00 local time for Central European people, starting March 2019.
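
For scheduling reminders, the meeting date can be computed as in this small sketch. The function name is made up for the example, and time zone handling is deliberately left out.

```python
from datetime import date, timedelta


def first_monday(year, month):
    """Return the date of the first Monday of the given month."""
    first = date(year, month, 1)
    # weekday() is 0 for Monday, so this is the number of days to the next Monday.
    offset = (7 - first.weekday()) % 7
    return first + timedelta(days=offset)
```

Under this rule the first meeting, in March 2019, falls on `first_monday(2019, 3)`, which is 2019-03-04.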

Mail system

We are not running a full stack today because of the following issues:

  1. User support is painful
  2. Don't want to store other people's emails (subpoenas, other legal issues)

We could separate .com from .org and let "company people" have their email on servers supplied by TPI for an email service. We could:

  • Run it ourselves, using TREES to avoid issue 2
  • Buy the service from someone like riseup, mailfence, mailbox.org, google

Ties into the upcoming Saturday discussions.

Adding hiro to groups for sysadmin tasks

  • [done: ldap group updated, ldap gid updated, tpa-passwords reencrypted, mail alias updated --weasel]

ACTION ITEMS

  • document these steps (ln5)

help.torproject.org sysadmin documentation

ACTION ITEMS

  • ln5 to add his things (ln5)

TB update over onions (#17216)

DECISION

Communicate to TB team that we prefer they not switch to the .onion service for updates, for the reasons listed below. [GeKo: fwiw that would be just for the update.xml files, not the actual update files (just to make that point clear); however, I am fine with the arguments against doing that now even for the xml files]

  • onionbalance is unmaintained.
  • onionbalance does not support v3 onion services.
  • Additionally, there should be a way to run more than one onionbalance for each service, such that the onionbalance host is not a SPOF.

Once these issues are addressed, we can reconsider the issue.

ACTION ITEMS

  • Reply to ticket (weasel) (DONE)

Cymru hardware

ACTION ITEMS

  • ping sina about power usage, form factor and more (ln5)
  • collect the above info ourselves if sina doesn't come back during Friday
  • ask micah about seattle hosting opportunities (ln5)
  • connect with EFF sysadmins re hosting opportunities (ln5)

Help users improve when requesting resources

When requesting resources, provide the following to enable us to plan better, as well as providing some cost transparency.

  • disk space requirements (size, speed)
  • ram
  • cpu
  • the above three estimates should be given for t, t+6m, t+[123]y
  • expected lifetime
  • project/team to "bill" to
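
One way to picture the request "form" above is as a small data structure. This is only a sketch; all class and field names are invented for the example and do not describe an existing tool.

```python
from dataclasses import dataclass

# The points in time for which estimates are requested: t, t+6m, t+[123]y.
HORIZONS = ("t", "t+6m", "t+1y", "t+2y", "t+3y")


@dataclass
class Estimate:
    disk_gb: float
    ram_gb: float
    cpu_cores: int


@dataclass
class ResourceRequest:
    estimates: dict          # maps a horizon from HORIZONS to an Estimate
    disk_speed: str          # free text, e.g. "fast" or "bulk"
    expected_lifetime: str   # free text, e.g. "2 years"
    bill_to: str             # project/team to "bill" to

    def missing_horizons(self):
        """Return the horizons for which no estimate was given."""
        return [h for h in HORIZONS if h not in self.estimates]
```

A request is complete once `missing_horizons()` returns an empty list.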

ACTION ITEMS

  • Communicate the above "form" by putting it on help.tpo and send an email about it to -team or -internal

check/dnsel: host and existing service should be retired

ACTION ITEMS

  • re-implement things and have a deployment ready by end-of-March (karsten, irl)

gitlab service

2G RAM, CPUS, some disk (on kvm5)

Gitlab evaluation

  • Some team (snowflake?) to use gitlab exclusively. move (copy + add link to gl) existing tickets to gitlab service (not by tsa but by gitlab team)
  • Runners could be provided by anyone, so that could be done outside of tpa/tpo for the evaluation; if we like it in the end, we can add some runners later.

ACTION ITEMS

  • create new group with ahf, dgoulet, hiro
  • create a test VM. similar to godard.debian.org (the host running salsa)
  • tpa provides postgres, postfix(?)+dovecot(?), apache
  • tpa to ping gitlab team once VM is ready in week 6 or 7.
  • gitlab team to use the debian ansible setup for salsa also for this thing

documentation for things

sysadmin hiring process.

  • what are we trying to assess anyhow?
  • finish 2nd round within 2 weeks.

ACTION ITEMS

  • Talk to Erin Monday or Tuesday about next steps (ln5)

Strategic hosting plan

VMs vs. individual rented metal vs. owned machines.

  • We like to run VMs on our own infrastructure, so we can ensure that disks only ever see encrypted data.
  • Let's start building a system to "manage" encryption keys for disks, so that when a host boots it can connect to the service and request its own disk encryption keys. This should also work from initrds.
  • Then, once we have that, running on 3rd party cloud would enable us to encrypt the VMs there too, and run things more easily there. (Assuming we can set up disk-encrypted VMs, which might be a blocking issue too.)
  • (The service will also be useful as it enables us to reboot the KVM hosts we do run in a more reasonable way.)
  • Own rented "racks" with more machines and our own networking stuff would enable us to run things like ganeti.

Access to resources to non-members?

Answer: Yes, easy -- give them an LDAP account, just don't put them in group 'torproject'.

weasel@draghi:~$ ud-useradd -g     # -g is for guest

Probably wants a dedicated keyring [nope, should not be needed --weasel], and a new mail template [even that is kinda optional --weasel], and probably a new unix group. All easy.

Requested services/collecting thoughts about S3 object storage

irl/metrics requirements: 200GB-1TB in the next 3yrs (50GB within the first 6 months)

Backup storage/size

brulloi is running out of disk space.

ACTION ITEMS

  • Ask leadership for a bigger box (eg. SX132 [H]) (ln5, done)
  • Once approved: order, set up, configure in bacula, wait for new full backups, retire old box.

REFERENCES

  • Do it on dedicated domain names.
  • Either do it on our own, or outsource it (starting at 60 to 600 EUR/month)
  • On our own HW/VMs
    • Want at least 2 instances (since it's a user-facing service, and we need the redundancy)
    • Probably looking at hosting costs of 20-30 EUR/month, not including person-hours of our work (setup costs, ongoing maintenance, ...)

ACTION ITEMS

  • Refresh the email thread from November -18 with the above options (hiro)

CDN use vs. own static rotation network

Should we invest more in the latter or start moving more on the former?

  • Categories of data
    • TB download + upgrade (about 99% of the traffic)
    • Static web content
    • If we want to keep serving things ourselves, we will need more resources (like a bunch of 10GigE connected machines).

ACTION ITEMS

  • Talk to TB team about moving TB downloads onto CDN (upgrades are already there) (ln5)

IPv6 monitoring

ACTION ITEMS

  • Implement host-alive (ping) checks on v6 (weasel)
  • Investigate what prometheus can do for us wrt multiple (i.e. 2) checks turning into one single alarm

Services that Hiro runs alone

  • media
  • Sandstorm
  • Survey

ACTION ITEMS

  • Find out who should be on tormedia and who could be a co-maintainer
  • Find out who could help maintain storm.tp.o
  • Find out who could help maintain survey.tp.o

NextCloud (NC) as a Sandstorm and SVN replacement?

  • Several of the team members administer their own NC instance, mostly happy about it
  • NC provides file storage and sharing (a la Dropbox), shareable calendar + tasks, contacts, a Kanban board thing ("Decks") and more
  • Can be self hosted or bought from f.ex. Hetzner
  • Upsides of hosting it ourselves include being more in control of stored data, both in terms of migrating it and certainty of data being encrypted on disk. The latter part could be considered a non-issue if the client-side encryption of folders works well enough.
  • The upside of buying the service is avoiding having to maintain another critical service ourselves

ACTION ITEMS

  • Ask for 10 EUR/month for a test period of 6 months for buying NC from Hetzner (or another provider, perhaps US-based for a better user experience in the Seattle office?)
  • Enroll ~10 Tor people to test it out, possibly copying or moving data from Sandstorm and/or SVN to NC

Things that need doing but aren't ours

  • Should be noted on a list and brought to the attention of gaba or pili
  • Maybe file tickets, and tag them somehow with some keyword.

Non-TSA issues

Helping out.

Cleaning unused packages on dist.tpo

ACTION ITEMS

  • Clean up once (hiro)
  • Work (possibly with irl) on a solution like what debian has for uploading (hiro)

Trac

... policy regarding cypherpunks, having a pseudonymous account, etc.

Account cleanup: We have tens of thousands of users, most of them created by bots and never used.

Anonymous user (ie not logged in) issue: Crawling puts too much pressure on pgsql.

ACTION ITEMS

  • Create a script that queries the db for the last login time of all users and removes users that have not logged in for a year (qbi)
  • Install the qos module for apache2 and configure it like on geyeri (ln5)
  • Enable search for !loggedin users again (qbi)
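
The selection step of the cleanup script could look roughly like this. It is a sketch that assumes the last login times have already been read from the database into a dict; the database query itself and the actual removal are left out, and the function name is made up. Accounts with no recorded login at all are not covered here.

```python
from datetime import datetime, timedelta


def stale_users(last_login, now, max_age=timedelta(days=365)):
    """Return, sorted, the users whose last login is older than max_age.

    last_login maps a username to the datetime of that user's last login.
    """
    cutoff = now - max_age
    return sorted(user for user, seen in last_login.items() if seen < cutoff)
```

Printing the resulting list for review before deleting anything would make a dry run cheap.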

META

ACTION ITEMS

  • Turn this into a minutes mail/trac-page and send/post it (ln5)
  • Turn all action items into tickets where applicable or do other smart things (ln5)
Last modified on Feb 7, 2019, 12:25:34 PM