Opened 15 months ago

Last modified 5 months ago

#31243 merge_ready task

TPA-RFC-2: define how users get support, what's an emergency and what is supported

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords: tparfc, tpa-roadmap-may
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Extract from parent ticket:

2. Are "the 3 empowering policies" defined and published?

http://opsreportcard.com/section/2

Specifically, this is three questions:

How do users get help?

Right now, this is unofficially "open a ticket in Trac", "ping us over IRC for small stuff", or "write us an email". This could be made more official somewhere.

What is an emergency?

I am not sure this is formally defined.

What is supported?

We have the distinction between systems and service admins. We did talk in Stockholm about clarifying that item, so this is worth expanding further.

Child Tickets

Ticket | Status | Owner | Summary | Component
#33108 | closed | hiro | How and when should the sysadmin team adopt a service | Internal Services/Services Admin Team

Change History (20)

comment:1 Changed 15 months ago by anarcat

i formalized the current support channels in https://help.torproject.org/tsa/doc/how-to-get-help/

comment:2 Changed 12 months ago by anarcat

Owner: changed from tpa to anarcat
Parent ID: #30881
Status: changed from new to assigned

remove from checklist, as i want to close that ticket and it will be open forever if it depends on all the tickets generated from it.

comment:3 Changed 10 months ago by anarcat

we should set up tiers of supported services somehow. it seems a priority short list of services that should stay up is the "donation and main websites, and the incoming email forwards service". but maybe we can have three tiers, with that list being the first one.

maybe it could be:

  • first tier: donation site, main website, incoming email
  • second tier: other sysadmin services (e.g. irc bouncer)
  • third tier: current "service admin" services (e.g. gitlab)?

This doesn't mean service admins shouldn't prioritize their stuff, but it would give sysadmins the opportunity to prioritize work on the sysadmin services there.

of course, maybe we want to change that distinction between service admins and sysadmins as well. As things stand now, i'm keeping that distinction, but I doubt it will be feasible for me, even as a sysadmin, to pretend it's not my responsibility when/if the git/trac/gitlab server crashes. ;)

comment:4 Changed 9 months ago by anarcat

Status: changed from assigned to needs_review

alright, i've reviewed our documentation on this, and we actually had a draft of something we could start with. instead of "tiers" it's based on "code red/yellow". a code "red" is a "drop everything" priority. i still include the same services in that code red, i just change the name and set the boundaries a little more clearly.

i've detailed the policy here:

https://help.torproject.org/tsa/howto/incident-response/#Support_policies

the TL;DR:

  • code red: incoming email, donation, website
  • code yellow: something that might become a code red, but is not urgent yet (e.g. trac performance problem)
  • routine: account creation, etc - everything else
  • a code yellow can be upgraded to a code red after a one week delay with team lead approval
  • we don't have 24/7 support
  • requests are processed during work hours of available staff
  • we try to schedule holidays to avoid multiple "offline" days but those can still occur
  • we support only Debian stable and oldstable (not LTS)
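
as a rough illustration only (not part of the policy text), the levels above could be sketched in Python. this is a minimal sketch under stated assumptions: the service names are just labels taken from the list above, the `may_become_red` flag and the `Request` type are hypothetical, and "after a one week delay" is read as "at least seven days after the request was opened".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Level(Enum):
    CODE_RED = "code red"
    CODE_YELLOW = "code yellow"
    ROUTINE = "routine"


# Labels copied from the summary above; illustrative, not real hostnames.
CODE_RED_SERVICES = {"incoming email", "donation", "website"}

# Assumption: the "one week delay" is counted from the day the request was opened.
UPGRADE_DELAY = timedelta(weeks=1)


@dataclass
class Request:
    """Hypothetical record for a support request."""
    service: str
    opened: date
    # Hypothetical flag: someone judged this might become a code red.
    may_become_red: bool = False


def classify(req: Request) -> Level:
    """Map a request to one of the three support levels."""
    if req.service in CODE_RED_SERVICES:
        return Level.CODE_RED
    if req.may_become_red:
        return Level.CODE_YELLOW
    # Account creation and everything else.
    return Level.ROUTINE


def can_upgrade_to_red(req: Request, today: date, team_lead_approved: bool) -> bool:
    """A code yellow may become a code red after the delay, with team lead approval."""
    return (
        classify(req) is Level.CODE_YELLOW
        and team_lead_approved
        and today - req.opened >= UPGRADE_DELAY
    )
```

for example, a trac performance problem opened on 2020-01-01 with `may_become_red=True` classifies as a code yellow, and under these assumptions could only be upgraded from 2020-01-08 onwards, and only with team lead approval.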

asked hiro for review, thanks! :)

then will push to vegas

Last edited 9 months ago by anarcat

comment:5 Changed 9 months ago by anarcat

Keywords: tparfc added
Summary: changed from "2. define how users get support, what's an emergency and what is supported" to "TPARFC-2: define how users get support, what's an emergency and what is supported"

comment:6 Changed 9 months ago by anarcat

Summary: changed from "TPARFC-2: define how users get support, what's an emergency and what is supported" to "TPA-RFC-2: define how users get support, what's an emergency and what is supported"

comment:7 Changed 9 months ago by teor

Status: changed from needs_review to needs_revision

The link to "gitweb performance problems (ticket 32133)" actually goes to debian's 32133.

You probably meant trac #32133.

See:
https://help.torproject.org/tsa/howto/incident-response/#Code_yellow

comment:8 Changed 9 months ago by hiro

I think the draft is actually good as a start. I would just like to add that, since the sysadmin team is currently small, there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.

Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.

E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think that would become a complex first tier service.

In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.

These are just a few observations.

comment:9 Changed 9 months ago by anarcat

Status: changed from needs_revision to needs_review

> The link to "gitweb performance problems (ticket 32133)" actually goes to debian's 32133.
>
> You probably meant trac #32133.

oh good catch, fixed, thanks!

comment:10 Changed 9 months ago by anarcat

> I think the draft is actually good as a start. I would just like to add that, since the sysadmin team is currently small, there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.

That's what I tried to explain in the first part, with the "work times of available staff" bit. But maybe we could expand and include your sentence above to make that crystal clear. :) I've done just that now, see if it fixed it. :)

> Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.

True! that would be a good procedure to have. But for now I'd like to focus on the "oncall" side of things...

For the record, we discussed this last in stockholm and those are the relevant notes, I think:

> We end up having to keep hosts and services running long after the initial people who wanted them left. We also run some things directly as torproject-admin. We should have some list of requirements for things we (and also others) run on our infra. This list would include that software needs to have proper releases and installation instructions and procedures, a bug tracker, some means to contact upstream, and it needs to run on the latest Debian stable (and when there is a new Debian stable, it needs to run on that within a month or three). There needs to be some commitment of maintainership, not only by individuals but by the project/corp, meaning a promise of recurring money to keep this service working. It's never just about setting up. We really really want at least two people who know and maintain each service. Also, this policy should apply not only to incoming services, but to all the things we run, and we should regularly evaluate whether services meet them.

Extracted from https://trac.torproject.org/projects/tor/wiki/org/meetings/2019Stockholm/Notes/SysadminTeamRoadmapping

So maybe it's just a matter of spelling this out in bullet points and adding it to the support policy?

> E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think that would become a complex first tier service.

For the record, I consider gitlab to be a "service" under the "service admins" umbrella. I have explicitly pushed back on the idea of throwing TPA under that bus for now, and we will need to have a team managing gitlab if we want this thing to work at all. :)

Of course, we have a tendency of falling back to TPA when things fail in the service admins team, but at least we should have that buffer for now, until we redefine those distinctions.

> In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.

Absolutely. Before we close this ticket, let's make a service admission policy, based on your comments here and the Stockholm discussion...

Do you want to draft something? You seem to have good ideas! :) Otherwise i can try to make a summary...

comment:11 Changed 9 months ago by anarcat

this is being drafted in #33108.

comment:12 Changed 9 months ago by anarcat

next steps here are:

  1. move the policy proposal into https://help.torproject.org/tsa/policy/
  2. draft improvements to factor in #33108
  3. send the draft officially to tpa at the end of the TPA-RFC-1 delay, if approved (next friday, 2020-02-14)

comment:13 Changed 8 months ago by anarcat

Keywords: tpa-roadmap-march added

comment:14 Changed 8 months ago by anarcat

Owner: changed from anarcat to hiro
Status: changed from needs_review to assigned

hiro has volunteered to follow up on this process.

comment:15 Changed 8 months ago by hiro

Resolution: fixed
Status: changed from assigned to closed

comment:16 Changed 8 months ago by anarcat

Resolution: fixed
Status: changed from closed to reopened

this should be submitted to a larger group before it's marked as approved, i think. following tpa-rfc-1, i think the rfc is now in the "draft" state and it should be brought up for discussion within tpa, and maybe other teams.

thanks for drafting this! :)

comment:17 Changed 7 months ago by anarcat

Keywords: tpa-roadmap-april added; tpa-roadmap-march removed
Owner: changed from hiro to anarcat
Status: changed from reopened to assigned

i'll bring this around for wider approval, approved by hiro

comment:18 Changed 6 months ago by anarcat

Keywords: tpa-roadmap-may added; tpa-roadmap-april removed

comment:19 Changed 5 months ago by anarcat

Status: changed from assigned to needs_review

i did a significant review of the proposal. it seemed to me that the stuff from #33108 overlapped quite a bit with the existing support levels and policies, so I started merging those. and then I realized that the "service admins" definition belongs there too, along with "how do I get help".

before you know it i had reorganized the entire thing. so I sent an email to TPA for a final approval, and plan to bring this to wider approval (tor-internal, i guess?) next week if no one in tpa objects.

comment:20 Changed 5 months ago by anarcat

Status: changed from needs_review to merge_ready

approved by TPA during today's meeting, waiting another week for approval on tor-internal.

i made a small change during the meeting to include gitlab in the support channels.
