Opened 13 months ago

Closed 9 months ago

Last modified 8 months ago

#30881 closed task (fixed)

answer the opsreportcard questionnaire, AKA the "limoncelli test"

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Tom Limoncelli is the renowned author of Time Management for System Administrators and The Practice of System and Network Administration, two excellent books I recommend every sysadmin read attentively.

He made up a 32-question test (PDF, website version on opsreportcard.com, or the previous one-page HTML version) that covers the basics of a well-rounded setup. I believe we will get a good score, but going through the list will make sure we don't miss anything.

Child Tickets

Change History (10)

comment:1 Changed 12 months ago by anarcat

Section A: Public Facing Practices

Question 1. Are user requests tracked via a ticket system?

http://opsreportcard.com/section/1

The answer is: mostly yes. Most requests can be tracked here, in Trac, but some requests *do* come by email, on torproject-admin@tpo, and some of those *are* a little more difficult to track. Furthermore, that alias gets a lot of noise from servers, as root@ aliases are redirected there.

Because Trac is public, we also don't have a good way of tracking requests that should remain private.

Recommendation, as discussed in Stockholm: start experimenting with triaging root@ emails to RT, and possibly the rest of torproject-admin to RT as well. See #31242.

2. Are "the 3 empowering policies" defined and published?

http://opsreportcard.com/section/2

Specifically, this is three questions:

  • How do users get help?: Right now, this is unofficially "open a ticket in Trac", "ping us over IRC for small stuff", or "write us an email". This could be made more official somewhere. Update: I made that official in https://help.torproject.org/tsa/doc/how-to-get-help/.
  • What is an emergency? I am not sure this is formally defined.
  • What is supported? We have the distinction between systems and service admins. We did talk in Stockholm about clarifying that item, so this is worth expanding further.

See #31243 for followup.

3. Does the team record monthly metrics?

http://opsreportcard.com/section/3

Somewhat. We now have a Prometheus server that records lots of information on the TPA machines, but it doesn't store information beyond one month. It also doesn't record higher-level metrics like:

  • how many machines do we have
  • how many support tickets we deal with
  • how many people on staff
  • etc

The monitoring systems also collect a *lot* of metrics and it might be worth creating a dashboard with the most important ones for our purposes, to get a bird's-eye view of everything.

A cute dashboard doesn't seem like a high priority, but I've created a ticket for long-term Prometheus storage in #31244 at least, so that we can eventually build a dashboard that looks further back in time.

Update: I wrote a little script to post metrics every month, and I made a Grafana dashboard out of it that's now the "home" dashboard for the Grafana instance.
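For illustration, here is a minimal sketch of what such a monthly metrics script could look like, assuming a Prometheus server reachable over HTTP and using `count(up)` as a stand-in for "how many machines do we have"; the URL, job label and metric choice are assumptions, not a description of the actual script:

```python
#!/usr/bin/env python3
# Minimal sketch: pull a few high-level numbers out of Prometheus once a
# month. The server URL and the choice of count(up{job="node"}) as a
# machine count are assumptions for illustration only.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "https://prometheus.example.torproject.org"  # hypothetical URL

def instant_query(expr):
    """Run an instant query against the Prometheus HTTP API."""
    url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # each result is {"metric": {...}, "value": [timestamp, "value"]}
    return [float(r["value"][1]) for r in data["data"]["result"]]

if __name__ == "__main__":
    machines = instant_query('count(up{job="node"})')
    print("machines monitored:", int(machines[0]) if machines else "unknown")
```

Numbers like these could then be posted to a ticket or fed into the Grafana "home" dashboard mentioned above.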

Score: 1.5/3

Last edited 10 months ago by anarcat (previous) (diff)

comment:2 Changed 12 months ago by anarcat

Section B: Modern Team Practices

4. Do you have a "policy and procedure" wiki?

http://opsreportcard.com/section/4

Yes, in help.tpo. It might become a GitLab wiki in the future, but that's kind of an implementation detail at this point. It's good enough for now, but it *is* lacking some documentation. In particular:

Consider separating the documentation in four categories:

  1. tutorials - simple, brainless step-by-step instructions requiring little or no technical background
  2. howtos - more in-depth procedures that may require interpretation
  3. reference - how things are built, explaining the complex aspects of the setup without going into "how to do things", policy decisions and so on
  4. discussion - *why* things are set up this way and *how else* they could have been built

That separation comes from what nobody tells you about documentation.

See also question 11, below.

5. Do you have a password safe?

http://opsreportcard.com/section/5

Yes, we do.

6. Is your team's code kept in a source code control system?

http://opsreportcard.com/section/6

Mostly. There are some ad-hoc scripts here and there, but everything is being committed into git and/or Puppet as much as possible.

7. Does your team use a bug-tracking system for their own code?

http://opsreportcard.com/section/7

Yes, this bug tracker, although it's not directly connected with the Puppet source code. See #29387 for a discussion on how to publish that source.

8. In your bugs/tickets, does stability have a higher priority than new features?

http://opsreportcard.com/section/8

No. We don't really do a good job of prioritizing issues in Trac, actually.

9. Does your team write "design docs?"

http://opsreportcard.com/section/9

Not generally. I try to create issues for projects that are underway, but it's generally "just do it".

It would be important to have a "design" section in the documentation, at least in the wiki. I have tried to retrofit that into projects as I discover them. The report card suggests the following sections:

  • Overview
  • Goals
  • Non-Goals
  • Background
  • Proposed Solution
  • Alternatives Considered
  • Security
  • Disaster Recovery
  • Cost

10. Do you have a "post-mortem" process?

http://opsreportcard.com/section/10

Not really, although we haven't had a major outage since I started here, so maybe it's just something we'll do when we feel there's a significant problem?

Score: 3.5/7

Last edited 10 months ago by anarcat (previous) (diff)

comment:3 Changed 12 months ago by anarcat

Section C: Operational Practices

11. Does each service have an OpsDoc?

http://opsreportcard.com/section/11

Definitely not. Many services are documented in the ikiwiki, but those docs are mostly for the "system" side of things. Most services are not documented at all, or their documentation is spread between Trac and the ikiwiki.

Needs to be improved, definitely. Each service should have an OpsDoc like the ops report card suggests:

  1. Overview: Overview of the service: what is it, why do we have it, who are the primary contacts, how to report bugs, links to design docs and other relevant information.
  2. Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanisms. If it is software that you modify in any way (open source project you contribute to or a local project) include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
  3. Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
  4. Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
  5. Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
  6. DR: Disaster recovery plans and procedures. If a service machine died, how would you fail over to the hot/cold spare?
  7. SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

12. Does each service have appropriate monitoring?

http://opsreportcard.com/section/12

Most hosts have Nagios monitoring, and Prometheus was deployed recently to (re-)introduce long term graphing and trending.

Most services, however, are not directly monitored themselves - only the underlying machines are. This is part of the distinction between "systems" and "services", the latter being outside the scope of TPA work.

We might want to re-evaluate this. In particular, we should focus on creating "functional testing" kinds of monitoring, where we monitor the service's endpoint (e.g. does gettor send a response when it receives an email?) instead of its underlying resources (e.g. is postfix running? is there enough memory?). The latter is checked everywhere, but the former has only recently been introduced, into Nagios.
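To make that distinction concrete, here is a minimal sketch of a functional check written in the Nagios plugin convention (exit 0 for OK, 2 for CRITICAL). It probes an HTTP endpoint rather than the email round-trip a real gettor check would need; the URL and timeout are assumptions for illustration:

```python
#!/usr/bin/env python3
# Minimal sketch of a "functional" check in the Nagios plugin convention
# (exit 0 = OK, 2 = CRITICAL): probe the service's public endpoint rather
# than its underlying resources. URL and timeout are illustrative only.
import sys
import urllib.request

URL = "https://www.torproject.org/"  # hypothetical endpoint to probe
TIMEOUT = 10  # seconds

try:
    # urlopen raises on HTTP error codes, so reaching this point means OK
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
        print(f"OK: {URL} answered with HTTP {resp.status}")
        sys.exit(0)
except Exception as exc:
    print(f"CRITICAL: {URL} did not answer properly: {exc}")
    sys.exit(2)
```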

13. Do you have a pager rotation schedule?

http://opsreportcard.com/section/13

No: we're all on call, all the time, although you could also say we just don't have any on-call rotation at all.

14. Do you have separate development, QA, and production systems?

http://opsreportcard.com/section/14

No.

15. Do roll-outs to many machines have a "canary process?"

http://opsreportcard.com/section/15

No.

Score: 0.5/5

Last edited 10 months ago by anarcat (previous) (diff)

comment:4 Changed 12 months ago by anarcat

Section D: Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?

http://opsreportcard.com/section/16

Yes, but not everything is in Puppet. Some hosts are not configured through Puppet at all, apart from the basic stuff. Some configuration is done manually on servers, even at the "systems" level.

Also, because only TPA has access to Puppet, some services are deployed with Ansible instead, leading to a "two systems" problem where it's more difficult for new people to join in.

17. Do automated administration tasks run under role accounts?

http://opsreportcard.com/section/17

Mostly, although I suspect some services might die horribly if/when we remove users. We generally try to create role accounts, however, so that should not be a problem.

18. Do automated processes that generate e-mail only do so when they have something to say?

http://opsreportcard.com/section/18

Definitely not. There's a lot of noise in the "root" email, which ends up in the shared sysadmin mailing list. Most sysadmins have filters to sort that stuff out, but it's still noisy to look through those every morning and figure out whether there are real emergencies.

One stopgap measure proposed is to send those emails to RT and away from the admins' inboxes. Then those emails become "pull-only", instead of being "pushed" to the inboxes. See #31242 for a followup on that.

Otherwise, the struggle to silence cron jobs is never-ending, as any Linux sysadmin is aware. The rules, according to the report card, are (see the sketch after this list):

  • If it needs human action now: Send a page/SMS.
  • If it needs action in 24 hours: Create a ticket.
  • If it is informational: Log to a file.
  • Output nothing if there is no information.
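As a sketch of the "output nothing" rule, here is a minimal cron wrapper in the spirit of moreutils' chronic(1): it swallows the job's output on success and only lets it through (and therefore into cron's email) when the job fails. This is an illustration, not something we currently deploy:

```python
#!/usr/bin/env python3
# Minimal "quiet cron" wrapper, in the spirit of chronic(1) from moreutils:
# run the wrapped command and only emit its output (which cron then mails)
# when the command actually fails.
import subprocess
import sys

if len(sys.argv) < 2:
    sys.exit("usage: quiet-cron command [args...]")

result = subprocess.run(sys.argv[1:], capture_output=True, text=True)
if result.returncode != 0:
    # something went wrong: pass everything along so cron sends an email
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    sys.exit(result.returncode)
# success: stay silent, so cron sends nothing
```

It would be used in a crontab entry like `0 4 * * * quiet-cron /usr/local/sbin/some-maintenance-job` (both paths hypothetical).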

Score: 1.5/3

Last edited 10 months ago by anarcat (previous) (diff)

comment:5 Changed 10 months ago by anarcat

Section E: Fleet Management Processes

19. Is there a database of all machines?

http://opsreportcard.com/section/19

Yes, but it's somewhat spread around LDAP, Puppet and a spreadsheet. There's a ticket open to "improve the inventory" (#30273) which aims at solving the problem, possibly with the hope of merging everything in a single source of truth (most likely Puppet). There's also a ticket to have a dashboard to display that information (#31969).
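As an illustration of what a "single source of truth" could look like in practice, here is a minimal sketch that lists every node known to PuppetDB through its query API; the PuppetDB host is hypothetical, and a real setup would likely require TLS client certificates rather than plain HTTP:

```python
#!/usr/bin/env python3
# Minimal sketch: list every node PuppetDB knows about, as one possible
# single source of truth for the machine inventory. The PuppetDB host is
# hypothetical and a real setup may require TLS client certificates.
import json
import urllib.request

PUPPETDB = "http://puppetdb.example.torproject.org:8080"  # hypothetical

with urllib.request.urlopen(PUPPETDB + "/pdb/query/v4/nodes") as resp:
    nodes = json.load(resp)

for name in sorted(node["certname"] for node in nodes):
    print(name)
print(f"total: {len(nodes)} machines")
```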

20. Is OS installation automated?

http://opsreportcard.com/section/20

Somewhat. New installer scripts have been introduced for our various platforms and documentation has been established, but there's some work to be done to standardize the process further. See #31239.

21. Can you automatically patch software across your entire fleet?

http://opsreportcard.com/section/21

We have a semi-automated process: there's a magic command that can be launched manually to perform upgrades on all affected machines, which requires approving each similar change manually.
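To make "semi-automated" concrete, here is a rough sketch of that kind of workflow (not our actual tooling): loop over hosts over SSH, show the pending upgrades, and ask for confirmation before applying them on each host. The host list and apt invocations are illustrative only.

```python
#!/usr/bin/env python3
# Rough sketch of a semi-automated fleet upgrade: show pending upgrades on
# each host over SSH and ask for confirmation before applying them. This is
# not our actual tooling; hosts and commands are for illustration only.
import subprocess

HOSTS = ["host1.torproject.org", "host2.torproject.org"]  # hypothetical list

def ssh(host, command):
    """Run a command on a remote host over SSH, without aborting on failure."""
    return subprocess.run(["ssh", host, command], check=False)

for host in HOSTS:
    print(f"=== {host} ===")
    ssh(host, "apt-get update -qq && apt list --upgradable")
    if input(f"apply upgrades on {host}? [y/N] ").strip().lower() == "y":
        ssh(host, "apt-get -y upgrade")
    else:
        print(f"skipping {host}")
```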

As for this recommendation:

When possible, updates should happen silently. If they require a reboot or other interruptions, users should have the ability to delay the update. However, there should be a limit; maybe 2 weeks. However the deadline should be adjustable so that emergency security fixes can happen sooner.

... it's not currently done. See #31957 for followup.

22. Do you have a PC refresh policy?

http://opsreportcard.com/section/22

If you don't have a policy about when PCs will be replaced, they'll never be replaced. [By "PC" I mean the laptops and desktops that people use, not the servers.]

Strangely, I believe this should also apply to servers, which the report card seems to assume are already covered.

In our case, they are not. There was some work in Brussels to establish formal processes to manage the lifetime of systems, see #29304. There is also work underway to decommission old machines and replace them with newer ones. This crosses over with the inventory work (#30272) as well.

Score: 2.5/4

Last edited 9 months ago by anarcat (previous) (diff)

comment:6 Changed 10 months ago by anarcat

Section F: Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?

http://opsreportcard.com/section/23

Yes: we have RAID-1 everywhere, of course, and the new cluster has DRBD on top of *that*. The report card suggests there are possible exceptions to this, but we make none.

24. Is the network core N+1?

http://opsreportcard.com/section/24

We generally do not manage our own network; that is delegated upstream. So yes, in a way.

25. Are your backups automated?

http://opsreportcard.com/section/25

Yes.

26. Are your disaster recovery plans tested periodically?

http://opsreportcard.com/section/26

What's a disaster recovery plan?

27. Do machines in your data center have remote power / console access?

http://opsreportcard.com/section/27

Yes, mostly.

Score: 4/5

comment:7 Changed 10 months ago by anarcat

Section G: Security Practices

28. Do Desktops/laptops/servers run self-updating, silent, anti-malware software?

No.

29. Do you have a written security policy?

No. See http://www.sans.org/security-resources/policies/ for an example.

30. Do you submit to periodic security audits?

No.

31. Can a user's account be disabled on all systems in 1 hour?

Yes, through LDAP, although some services are not directly hooked into LDAP. See #32519 for followup.
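For illustration only, here is a minimal sketch (using the python-ldap package) of one way "disable everywhere through LDAP" can work: lock the account by replacing its login shell. The server URI, bind DN, user DN and attribute choice are assumptions about a generic setup, not a description of our actual LDAP schema:

```python
#!/usr/bin/env python3
# Minimal sketch (requires the python-ldap package): lock an account by
# replacing its login shell. Server URI, bind DN, user DN and attribute
# choice are assumptions about a generic setup, not our actual schema.
import ldap

conn = ldap.initialize("ldaps://ldap.example.torproject.org")  # hypothetical
conn.simple_bind_s("cn=admin,dc=example,dc=org", "secret")     # hypothetical

user_dn = "uid=someuser,ou=users,dc=example,dc=org"            # hypothetical
conn.modify_s(user_dn, [
    (ldap.MOD_REPLACE, "loginShell", [b"/usr/sbin/nologin"]),
])
conn.unbind_s()
```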

32. Can you change all privileged (root) passwords in 1 hour?

No.

Score: 0.5/5

Last edited 8 months ago by anarcat (previous) (diff)

comment:8 Changed 10 months ago by anarcat

Summary

  • Section A: Public Facing Practices: 1.5/3 (50%) tickets: #31242, #31243, #31244
  • Section B: Modern Team Practices: 3.5/7 (50%) tickets: #30880, #29387, missing: post-mortem, total puppetization, design docs, ticket prioritization of stability
  • Section C: Operational Practices: 0.5/5 (10%) tickets: none yet, missing: "ops docs" for each service, pager rotation schedule, dev/stage/prod environments, canary process
  • Section D: Automation Practices: 1.5/3 (50%) tickets: #31242, missing: reduce email noise
  • Section E: Fleet Management Processes: 2.5/4 (63%) tickets: #30273, #31969, #31239, #31957, #29304
  • Section F: Disaster Preparation Practices: 4/5 (80%) tickets: none yet, missing: disaster recovery plan
  • Section G: Security Practices: 0.5/5 (10%) tickets: #32519, missing: malware scanners, security policy, security audits, global root password rotation

Final score: 14/32 (44%)

Last edited 8 months ago by anarcat (previous) (diff)

comment:9 Changed 9 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i think this ticket is done insofar as we've answered the questionnaire. there is still a lot of work to be done to get a 100% score, but that will take a long time to achieve, if ever. for now, let's consider this done; it can be kept as a future reference for quiet times when we want to get started on new projects.

child tickets have been detached from this ticket so it can be closed, but they are linked in the summary.

comment:10 Changed 9 months ago by anarcat

i designed a service docs template in here:

https://help.torproject.org/tsa/howto/template/

it's quite exhaustive and most documentation pages don't have all the fields, but it gives us a good thing to copy-paste from.
