Opened 3 months ago

Last modified 9 days ago

#30881 assigned task

answer the opsreportcard questionnaire, AKA the "limoncelli test"

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Tom Limoncelli is the renowned author of Time Management for System Administrators and The Practice of System and Network Administration, two excellent books I recommend every sysadmin read attentively.

He made up a 32-question test (PDF, website version on opsreportcard.com or the previous one-page HTML version) that covers the basics of a well-rounded setup. I believe we will get a good score, but going through the list will make sure we don't miss anything.

Child Tickets

Ticket | Status | Owner | Summary | Component
#31243 | new | tpa | 2. define how users get support, what's an emergency and what is supported | Internal Services/Tor Sysadmin Team

Change History (4)

comment:1 Changed 2 months ago by anarcat

Question 1. Are user requests tracked via a ticket system?

http://opsreportcard.com/section/1

The answer is: mostly yes. Most requests can be tracked here, in Trac, but some requests *do* come by email, on torproject-admin@tpo, and some of those *are* a little more difficult to track. Furthermore, that alias gets a lot of noise from servers, as root@ aliases are redirected there.

Because Trac is public, we also don't have a good way of tracking requests that should be private.

Recommendation, as discussed in Stockholm: start experimenting with triaging root@ emails to RT, and possibly the rest of torproject-admin to RT as well. See #31242.
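As a rough illustration of what "triaging root@ emails" could mean, here is a minimal sketch that sorts incoming messages by recipient. Everything in it is an assumption: the queue names `root-noise` and `admin-requests`, the helper names, and the idea that the local part of the address is enough to tell server noise from human requests. It does not reflect how RT is or will be configured.

```python
from email.message import EmailMessage
from email.utils import getaddresses

# Hypothetical queue names; the ticket does not say how RT would be organized.
SERVER_QUEUE = "root-noise"
HUMAN_QUEUE = "admin-requests"


def triage(msg):
    """Return the (hypothetical) queue a message should be filed under."""
    headers = [str(v) for v in msg.get_all("To", []) + msg.get_all("Cc", [])]
    recipients = getaddresses(headers)
    local_parts = {
        addr.split("@", 1)[0].lower() for _name, addr in recipients if addr
    }
    # Assumption: mail addressed to root@<host> is machine-generated noise.
    if "root" in local_parts:
        return SERVER_QUEUE
    return HUMAN_QUEUE


def make_message(to, subject):
    """Build a small test message (helper for the example only)."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    return msg
```

For example, `triage(make_message("root@host1.example.org", "cron output"))` lands in the server queue, while a message to any other address lands in the human one. A real setup would probably need more signals than the recipient alone.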

Last edited 2 months ago by anarcat

comment:2 Changed 2 months ago by anarcat

2. Are "the 3 empowering policies" defined and published?

http://opsreportcard.com/section/2

Specifically, this is three questions:

How do users get help?

Right now, this is unofficially "open a ticket in Trac", "ping us over IRC for small stuff", or "write us an email". This could be made more official somewhere.

Update: I made that official in https://help.torproject.org/tsa/doc/how-to-get-help/.

What is an emergency?

I am not sure this is formally defined.

What is supported?

We have the distinction between systems and service admins. We did talk in Stockholm about clarifying that item, so this is worth expanding further.

See #31243 for followup.

Last edited 7 weeks ago by anarcat

comment:3 Changed 2 months ago by anarcat

3. Does the team record monthly metrics?

http://opsreportcard.com/section/3

Somewhat. We now have a Prometheus server that records lots of information on the TPA machines, but it doesn't store information beyond one month. It also doesn't record more high-level metrics like:

  • how many machines do we have
  • how many support tickets we deal with
  • how many people on staff
  • etc

The monitoring systems also collect a *lot* of metrics and it might be worth creating a dashboard with the most important ones for our purposes, to get a bird's-eye view of everything.

A cute dashboard doesn't seem like a high priority, but I've at least created a ticket for long-term Prometheus storage in #31244, so that we can eventually create a dashboard that looks further back in time.

Update: I wrote a little script to post metrics every month, and I made a Grafana dashboard out of this that's now the "home" dashboard for the Grafana instance.
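The script itself is not shown in this ticket. As a minimal sketch of what such a monthly report might involve, the following parses the JSON returned by a Prometheus instant query and formats a few high-level numbers as plain text. The function names, the metric labels and the sample values are all made up for illustration; only the response layout (`status`, `data.result[].value`) follows the Prometheus HTTP API.

```python
import json


def first_value(response_text):
    """Extract the first sample value from a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("empty result")
    # Each vector sample is {"metric": {...}, "value": [timestamp, "string value"]}.
    return float(results[0]["value"][1])


def format_report(month, metrics):
    """Render a dict of metric name -> number as a short plain-text report."""
    lines = [f"TPA metrics for {month}"]
    for name in sorted(metrics):
        lines.append(f"  {name}: {metrics[name]:g}")
    return "\n".join(lines)


# Placeholder response, not real data: what a query such as count(up) might return.
SAMPLE = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1565000000,"10"]}]}}'
)
```

Usage would look like `format_report("2019-08", {"machines": first_value(SAMPLE), "staff": 3})`. Keeping the parsing separate from the formatting means the same report could be fed from ticket or staff counts that do not live in Prometheus.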

Last edited 9 days ago by anarcat

comment:4 Changed 2 months ago by anarcat

4. Do you have a "policy and procedure" wiki?

http://opsreportcard.com/section/4

Yes, in help.tpo. It might become a GitLab wiki in the future, but that's kind of an implementation detail at this point. It's good enough for now, but it *is* lacking some documentation.

Consider separating the documentation into four categories:

  1. tutorials - simple, brainless step-by-step instructions requiring little or no technical background
  2. howtos - more in-depth procedures that may require interpretation
  3. reference - how things are built, explaining the complex aspects of the setup without going into "how to do things", policy decisions and so on
  4. discussion - *why* things are set up this way and *how else* they could have been built

That separation comes from "What nobody tells you about documentation".

The ops report card also suggests documenting this for every service:

  1. Overview: Overview of the service: what is it, why do we have it, who are the primary contacts, how to report bugs, links to design docs and other relevant information.
  2. Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanisms. If it is software that you modify in any way (open source project you contribute to or a local project) include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
  3. Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
  4. Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
  5. Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
  6. DR: Disaster Recovery Plans and procedure. If a service machine died how would you fail-over to the hot/cold spare?
  7. SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
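One way to see how far each service page is from that template is a small checker that lists the sections a page is missing. This is only a sketch: it assumes service pages are text files whose headings start with `#` (Markdown-style), and the function name `missing_sections` is hypothetical. The section names come from the list above.

```python
# Section names taken from the ops report card list above.
REQUIRED_SECTIONS = [
    "Overview",
    "Build",
    "Deploy",
    "Common Tasks",
    "Pager Playbook",
    "DR",
    "SLA",
]


def missing_sections(page_text):
    """Return the required sections that have no matching heading in the page."""
    headings = set()
    for line in page_text.splitlines():
        stripped = line.strip()
        # Assumption: headings are lines starting with one or more '#'.
        if stripped.startswith("#"):
            headings.add(stripped.lstrip("#").strip().lower())
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

For a page containing only "Overview", "Deploy" and "SLA" headings, the function returns the four remaining section names in template order, which could feed a simple "documentation gaps" list per service.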

5. Do you have a password safe?

http://opsreportcard.com/section/5

Yes, we do.

6. Is your team's code kept in a source code control system?

http://opsreportcard.com/section/6

Mostly. There are some ad-hoc scripts here and there, but everything is being committed into git and/or Puppet as much as possible.

7. Does your team use a bug-tracking system for their own code?

Yes, this bug tracker.

Last edited 6 weeks ago by anarcat