Opened 3 months ago

Last modified 3 months ago

#27416 new defect

Automatc tickets for malformed descriptors

Reported by: teor Owned by: atagar
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/DocTor Version:
Severity: Normal Keywords: tor-dirauth
Cc: atagar Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Continuing from https://trac.torproject.org/projects/tor/ticket/27367?replyto=14#comment:14

Replying to atagar:

Related a few past tickets where this has bitten us.

I moved these tickets to #27414, because relays should do the UTF-8 check before they upload.

But the UTF-8 check won't be enough to catch all parsing errors, so I opened #27415 so that relays parse descriptors before they upload them.

But the parsing check *still* won't be enough to catch all spec-implementation divergences. And that's ok. Sometimes we fix the spec, sometimes with fix tor, and sometimes we fix other parsers.

Just a quick note that dirauths could use Stem as a tool for rejecting malformed content. It does stricter validation than the tor binary that descriptors are conformant with the spec. I've been performing this check through DocTor since 2013, filing tickets each time more bad data makes it into the consensus...

https://gitweb.torproject.org/doctor.git/tree/descriptor_checker.py

Bad data chokes not only Stem, but metrics-lib and anything else that ingests it.

Clearly in an ideal world the tor binary itself would do better validation

See the tickets I opened and listed above. I think that's a reasonable amount of validation for the tor binary to do.

In an ideal world, all implementations and the spec would match. When there's a difference, we need to decide what to fix. That's not something we can automate.

but in the absence of that if we took advantage of Stem's validator I wouldn't need to keep filing tickets every few months. Using Stem on dirauths to reject malformed descriptors would prevent these issues upfront, saving Karsten and I hassle.

Running stem as an essential part of dirauth operation is a significant change in tor's security model. It would require a proposal.

It's also a significant change to the consistency of tor's parsing. At the moment, tor uses the same parser and validator for:

  • descriptors via authorities
  • descriptors direct to bridge clients
  • onion service descriptors

If that's a no-go I could also redirect the DocTor check I mention above to email other folks (Nick? teor? Maybe the network team list?) so I don't need to file tickets each time this comes up.

The network team list is administrative and has a closed membership. It's not designed for automatic mailing.

If the checks are safe to make public, then send them to tor-consensus-health, or create another list. (They should be safe, because the validation code is public.)

Whatever you decide, let's adopt the same strategy for the fallback checks. We should extend them beyond just nick and me.

And if there are any other DocTor checks that would benefit from being extended beyond a few people, let's do them at the same time.

Child Tickets

Change History (5)

comment:1 Changed 3 months ago by atagar

But the parsing check *still* won't be enough to catch all spec-implementation divergences. And that's ok. Sometimes we fix the spec, sometimes with fix tor, and sometimes we fix other parsers.

Yup. Agreed. :P

Running stem as an essential part of dirauth operation is a significant change in tor's security model. It would require a proposal.

Agreed, on reflection we shouldn't. In essence there's two directions we can take: use stem's validators to notify (file tickets) or enforce (block malformed descriptors). We do the former nowadays, and should not do the later without a lot of thought.

If the checks are safe to make public, then send them to tor-consensus-health, or create another list.

Will do! They're safe, the only reason I don't send them to the list is because I'm the only person that would likely act upon them. Happy to give it a shot.

There's a few spots where I've created automated notifications that follow a certain pattern...

  1. X commonly breaks.
  1. I create a monitor to email me when X occurs.
  1. When X occurs I file a ticket with the network team.

Honestly malformed descriptors are pretty minor pain point for me. They're rare (one crops up every couple months?). I'll give tor-consensus-health@ a shot, but I have a sinking suspicion I'll still need to file tickets to get traction.

Another more common annoyance for me are tor regressions. Every couple weeks the following occurs...

  1. Tor pushes a regression.
  1. Stem's Jenkins test failures email me.
  1. I wait a few days to see if it gets fixed.
  1. When it becomes clear a fix isn't coming I pull and compile the new tor codebase.
  1. I reproduce the test failure locally and construct reproduction steps that use telnet to take Stem out of the loop.
  1. I file a ticket with the network team to get it addressed.

To the network team's credit their responsiveness when we get to #6 is fantastic. Only trouble is that I'd like to take myself out of the loop on these both (tor regressions and malformed descriptors).

For malformed descriptors I'll give tor-consensus-health@ a shot, and reopen this if I still need to file a ticket.

As for tor regressions I've asked the Network team to run tor's stem test target prior to pushing but no dice so far. Just dealt with another this morning (#27446). Any thoughts on how we can drop me from the loop on these?

comment:2 Changed 3 months ago by atagar

Component: Core Tor/TorCore Tor/DocTor
Owner: set to atagar

Reassigning this to the DocTor component so I remember to make the address change.

comment:3 Changed 3 months ago by atagar

Bah, in looking back I was wrong about the frequency of tor regressions. Sorry about that.

Did a quick search and the last few are #27003, #25981, and #21300. That's pretty uncommon. None the less though each makes a couple hours of work for me so checking with Nick if we can formalize running our tests prior to pushing.

comment:4 in reply to:  1 Changed 3 months ago by teor

Replying to atagar:

...

If the checks are safe to make public, then send them to tor-consensus-health, or create another list.

Will do! They're safe, the only reason I don't send them to the list is because I'm the only person that would likely act upon them. Happy to give it a shot.

The network team has weekly rotations. Checking for consensus health issues could fit in the CI/Coverity rotation.

*But* consensus-health is really noisy. And people need to specifically subscribe to get the emails. Maybe there's a better place we can put this status?

Can we run a scheduled task in Jenkins or Travis?
(Travis can run scheduled tasks every day.)

There's a few spots where I've created automated notifications that follow a certain pattern...

  1. X commonly breaks.
  1. I create a monitor to email me when X occurs.
  1. When X occurs I file a ticket with the network team.

Honestly malformed descriptors are pretty minor pain point for me. They're rare (one crops up every couple months?). I'll give tor-consensus-health@ a shot, but I have a sinking suspicion I'll still need to file tickets to get traction.

If we can make these failures visible, a team rotation could help.

Another more common annoyance for me are tor regressions. Every couple weeks the following occurs...

  1. Tor pushes a regression.
  1. Stem's Jenkins test failures email me.

The network team member doing the CI rotation should catch these issues. But I bet that we are just checking for tor failures, not stem failures due to tor.

  1. I wait a few days to see if it gets fixed.
  1. When it becomes clear a fix isn't coming I pull and compile the new tor codebase.
  1. I reproduce the test failure locally and construct reproduction steps that use telnet to take Stem out of the loop.
  1. I file a ticket with the network team to get it addressed.

To the network team's credit their responsiveness when we get to #6 is fantastic. Only trouble is that I'd like to take myself out of the loop on these both (tor regressions and malformed descriptors).

For malformed descriptors I'll give tor-consensus-health@ a shot, and reopen this if I still need to file a ticket.

As for tor regressions I've asked the Network team to run tor's stem test target prior to pushing but no dice so far. Just dealt with another this morning (#27446). Any thoughts on how we can drop me from the loop on these?

We only recently set up the team rotations. We're going to talk about them again in Mexico. Let's make sure we put stem and consensus-health on the list.

comment:5 Changed 3 months ago by atagar

Summary: Improve descriptor validation in Tor using StemAutomatc tickets for malformed descriptors

Can we run a scheduled task in Jenkins or Travis?

Hi teor, interesting suggestions. Jenkins/Travis would be great fits for Stem integ tests to catch tor regressions but I'm unsure it would be a good fit for malformed descriptor notices.

Jenkins/Travis are great for 'x regressed, and will remain broken until action is taken', but malformed descriptors tend to be transient. Unless I'm missing something we need an asynchronous notification mechanism like a ticket to capture those.

I've conflated two similar but fundamentally different issues in this ticket. Sorry about that! Gave this more thought and think I have a decent path forward on both the issues discussed here...

  1. Malformed descriptors should result in a ticket to the network team. To streamline this I'll adjust our DocTor descriptor validity check to automatically file tickets rather than email me.

These tickets should be rate limited to 1/day (to avoid potential messes, such as from a botched sybil attack), attach the bad descriptor, and provide an explanation of why it's malformed.

This ticket will track this, and the next step is in my court.

  1. Tor regressions should be addressed by proactively running our tests prior to pushing. We already have a Jenkins job that tests after each tor and stem push, but if the network team would care to adjust it or integrate with Travis I'm game.

I've asked Nick and David to test prior to pushing in the future so for me the next step on this front is to see if I stop getting notified of tor regressions. If so, great. If not then I'll file a separate ticket to discuss improving our workflow.

We only recently set up the team rotations. We're going to talk about them
again in Mexico. Let's make sure we put stem and consensus-health on the
list.

Gotcha! Hat's off to the network team by the way. I won't be in Mexico, but you guys are doing a phenomenal job at integrating bests practices and improving our processes.

Note: See TracTickets for help on using tickets.