wiki:org/teams/NetworkTeam/CoreTorReleases/042Retrospective

# 042 bug retrospective

POSSIBLE LESSONS TO TAKE FROM THIS RETROSPECTIVE:

  • We should be writing tests for everything. "Having a test for the error case would have caught this" comes up way too much.
    • teor: +1, we should add extra time for unit testing to every ticket and proposal
  • Our lack of integation testing for certain pieces of functionality is a long-term liability. Even carefully unit testing our code would not have caught all these bugs.
    • teor: +1, we should add extra time for integration testing to every ticket and proposal
    • we should also change our PR reviews so we ask about integration testing every time
  • In many cases we have introduced testing in response to these bugs, but probably not enough.
  • Some of our bugs seem to sit around longer than we should like. There are bugs that were first fixed in 0.4.2 dating all the way back to 0.3.1.
  • Some parts of our code base are identifiably under-tested or badly held together. They include:
    • Accounting isn't integration-tested.
    • Our Windows testing and QA is severely underpowered.
    • Our seccomp2 sandbox, as it relates to the file system, is held together with twine and chewing gum.
    • We should find a way to fuzz our configuration handling code more reasonably.
      • +1, note that we have considered configs out of scope for fuzzing in the past
    • We should find a way to run even the stem tests that use the tor network -- would pointing them at a chutney network work? In any case we should be running them as part of our dev process whenever we're touching controller code.
  • Noticing what changes in a patch is essential. I think we should make pull-all start saying what files changed again.

[dgoulet] +1 ... this is something I'm missing from the normal "git pull" that tells me what changed and where. It is useful to me to know that.

[teor] +1, I check the files when I merge, but the reviewer should also be checking what files *and functions* have changed

Best practices:

  • We should clearly distinguish "state" from "configuration" or "messages". We shouldn't mix state into configuration or parsed objects.

[dgoulet] +1.

  • When writing warning messages and BUG() messages, we should think about their potential to get loud if they are hit too often, and whether a rate-limit/IF_BUG_ONCE() is appropriate.

LESSONS LEARNED/QUESTIONS ASKED ABOUT THIS RETROSPECTIVE PROCESS:

  • Maybe we selected too many tickets to look at?
  • Did we select the wrong tickets?
  • Should we ignore unreleased bugs completely?
  • Did we ignore any bugs that were important.
    • teor says: we fixed some microdesc bugs in 0.4.0, 0.4.1, or 0.4.2. Some of them didn't get backported to 0.3.5, and they're causing some issues with chutney

#Methodological notes

places to look for bugs:
    * tickets marked 'regression'
       https://trac.torproject.org/projects/tor/query?keywords=~regression+042-must&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=component&order=priority
       https://trac.torproject.org/projects/tor/query?keywords=~regression&milestone=%5ETor%3A+0.4.2.x-final&col=id&col=summary&col=type&col=status&col=priority&col=milestone&col=component&order=priority
        -- DONE.
 
    * any bugs fixed after the .1-alpha release
       [see changelog]
       -- maybe we should skip the minor ones here?  if so, it's DONE

    * any bugs marked major
       [see changelog]
        -- DONE.

Let's focus on 0.4.2.

things to look for in bugs:
    * what kind of bug it was
    * how "disruptive" was it?  
    * How did we actually find it?
    * how it might be prevented in the future
    * if it was backported to 0.2.9 or 0.3.5
    * if that backport was avoidable
    * if that backport made it into a release
      ("no" for all 0.2.9 backports, there hasn't been a release for 15 months)
    * if we think any 0.2.9 backports should have triggered a release

# Bug notes


#31375         hs: Crash in token_bucket_ctr_refill() of the INTRO2 DoS defense
 (regression)

Type: Crash from uninitialized token bucket -- division by zero?
Impact: Low: Found on master before release. Only triggers with DoS defense. And v2 HS?
Found: By dgoulet while running a relay on master
Bug backported: no.
Bug introduced in: 9f738be8937d675929b43a149d706160641a089d
Could be prevented by:
    * making token_bucket detect and work around a zero rate.
    * unit testing of init function followd by refill function.


#31793         Bug: tor_addr_is_internal() called from src/feature/dirauth/process_descs.c:447 with a non-IP address of type 0
  (regression)
Type: Nonfatal bug warning.
Impact: Low: found on master before release. Authority only.
Found: By arma while running a relay on master
Bug backported: no
Bug introduced in: d9a7d4779887dbd2cba082c2a5daa535fe0d36ce
Could be prevented by:
    * Better unit testing for dir_allow_private_addresses()
    * Integration testing with some private-network tests disabled?
    * Making tor_addr_is_internal() detect null addresses?
    * Maybe mitigated by using IF_BUG_ONCE() by default, or making BUG rate-limit by default?

#32841         sandbox error on 0.4.2.x
  (regression)
Type: seccomp2 sandbox failuer
Impact: High. Seccomp2 failed in 0.4.2.x and in in 0.4.1.7, users complained.
Found: by weasel while running with the sandbox, by users running on relays.
Bug backported: yes
Bug introduced in: a22fbab98690f802ae3bda276078cc7fc767feba
Backport avoidable: I think so, yeah.
Could be be prevented by:
    * Running integration tests with sandbox 1, always (blocked by #32817, #32722)
    * some integration tests were running on old Ubuntu versions, the sandbox failed when we upgraded to newer versions, so we turned it off (see #32721 and related tickets)
    * Getting more users to test our releases.
    * Noticing when we start using any syscall that we haven't used before.
    * Writing a list of the syscalls that we don't yet use, and grepping the code for them as part of CI
    * Rewriting sandbox to use an improved architecture

#30146         Fix coverity failures as of 04-11-2019
  (regression)
Type: memory leaks and false positives
Impact: Very low. The only true bug was a memory leak when failing to make a keys directory.
Found: By nick when going on coverity rotation
Bug backported: possibly/various
Bug introduced in: various
Could be prevented by:
    * (To the extent that coverity found these issues, they _were_ prevented.)
    * opening tickets for new coverity warnings immediately.
    * checking for coverity backlog more often.


#31002         circpadding: Middle node did not accept our padding request
  (regression)
Type: Log spam
Impact: low: log messages only.
Found: by dgoulet when running a client on alpha
Bug backported: no 


NOTE: I don't know how to classify this one: did we ever actually fix the underlying issue? It looks like we deferred it.


#31352         Jenkins failure on windows: ENETUNREACH undeclared.
  (regression)
Type: compilation failure on some but not all windows builds
Impact: very low. Nobody noticed this but us.
Found: by looking at Jenkins
Bug backported: no.
Bug introduced in: c46e99c43c4ee03
Could be prevented by:
    * Maintaining jenkins better and looking at it more frequently
    * Trying more configurations in appveyor?
    * Trying cross-compilation for windows in travis, maybe?


#31354         Compiler "note" in test_addr.c: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’
  (regression)
Type: compilation notes (less verbose than warnings (!))
Impact: Low. Only an annoyance to developers, except to the extent that their verbosity made some CI sad.
Found: by using GCC 9.1
Bug backported: no
Could be prevented by:
    * (To the extent that jenkins found it, this bug _was_ prevented.)
    * Treat merely annoying verbosity as a problem?
    * Trying more compilers in CI
    * Keeping unit test functions smaller?
    * activating practracker on unit tests, perhaps with higer limits?
    * creating helpers for repeated unit tests, but with different test data

#31495         cannot configure bridges
  (regression)
Type: Logic error due to NULL/empty confusion.
Impact: Medium; the TB team ran into this one. It would have made bridges unusable if it had shipped.
Found: by mcs when using tor-browser nightly builds
Bug backported: no.
Bug introduced in: e16b90b88a76fb82702ae26c54834aca6c591d64
Could be prevented by:
    * More integration testing, as we introduced after this with test_parseconf.
    * Never designing a type where NULL and "" have different interpretations
    * When refactoring, somehow fuzzing before- and after- to find differences?

#31578         practracker scans build directories inside the tor/ directory
  (regression)
#31796         practracker git hook warning: Unusual pattern permitted.h in testdata
  (regression.)
Type: Practracker logic error
Impact: Low; only developers noticed or could notice
Found: by teor after a failed 'make distcheck'
Bug backported: no
Bug introduced in: 17dd3167494d69b
Could be prevented by:
    * More integration on our cod-testing tools, so we're not replicationg the same functionality so much?


#31734         Add accessor functions for cb_buf, which enforce locking and unlocking
  (regression. Mislabeled? This is bugfix on 0.2.5.3)
Type: Not using locks enough;
Impact: Low. would lead to a race if two threads had backtraces at once
Found: By teor, looking at the code.
Bug backported: no
Bug introduced in: a3ab31f5dc55b2 in 0.2.5.3-alpha
Could be prevented by: 
    * Always using this accessor pattern or something similar
    * preferring thread-local storage to shared buffers
    * Avoiding static buffers

#31958         connection_dir_is_anonymous: Non-fatal assertion !(CONST_TO_OR_CIRCUIT(circ)->p_chan == NULL) failed
  (regression)
Type: BUG() warning due to circuit briefly outliving its channel; use-after-mark error.
Impact: Low. Warning only.
Found: By dgoulet, running a test relay
Bug backported: no.
Bug introduced in: ef2123c7c7 in 0.4.2.1-alpha? Or did the bug exist previously?
Could be prevented by: 
    * more disciplined rules about when to check for marked conns/circuits?

#32060         CID 1454761: wrong type passed to unlock_cb_buf()?
  (regression)
Type: Type error in handling of backtrace buffers and locking.
Impact: Low. if this had lasted it would probably have given us some crashes on backtrace cleanup
Found: by coverity scan
Bug backported: no
Bug introduced in: c23986246b970b in 0.4.2.3-alpha
Could be prevented by:
    * Prefer structs to void * whenever possible, even for small arrays.
    * Avoid substantial machinery in bugfix patches after the first alpha
    * Technically, coverity _did_ prevent this from reaching production.

#32835         regression: log subsystem not closing files on switch back to syslog
  (regression) [NOT ACTUALLY CLOSED!!!]
Type: 
Impact:
Found:
Bug backported:
Bug introduced in:
Could be prevented by:

#32961         Backport the diagnostic logs for is_possible_guard crash
  (regression) [but not actually a regression]
Type: 
Impact:
Found:
Bug backported:
Bug introduced in:
Could be prevented by:

#31408         torrc : ClientOnionAuthDir after include directives breaks client to v2 services
  (regression, during 0.4.1 or earlier but not found till 042) [not actually a regrssion; bugfix on 0.3.1.1-alpha]
Type: Logic error.
Impact: Medium. Running %include on a directory ending with an empty file would result in some config options getting ignored.
Found: by a user
Bug backported: no.
Bug introduced in: 0.3.1.1-alpha, when we added %include directories
Could be prevented by:
    * Integration testing, possibly? We probably would have missed this case.
    * Use of SLIST rather than hand-rolled pointer manipulation? Not definite.
    * Fuzzing of config file handling?


#31810         Bug: ../src/lib/process/process_unix.c:265: process_unix_exec: Assertion line should be unreached failed; aborting.
  (regression, during 0.4.1 or earlier but not found till 042)
Type: assertion failure.
Impact: Medium. Useless message when execve fails in child process.
Found: by users.
Bug backported: no.
Bug introduced in: 2e957027e28449 in 0.4.0.1-alpha
Could be prevented by:
    * Testing more failing cases
    * Avoinding tor_assert()
    * Insisting on coverage more.

#31696         Assertion failure in map-anon.c:218
  (regression, during 0.4.1 or earlier but not found till 042)
Type: Assertion crash; Failure to try a fallback or check reason for failure.
Impact: medium: crash when built with one Linux kernel configuration but run on another.
Found: by a slackware user
Bug backported: no
Bug introduced in: 8ca808f81d28bb in 0.4.0.x, which wasn't used till 0.4.1.x
Could be prevented by:
    * Testing on unusual linux kernel configurations
    * Introducing code review assumption that detected-in-headers does not mean present-in-kernel.
    * ???

#31897         util/map_anon_nofork test fails on SunOS
  (regression, during 0.4.1 or earlier but not found till 042)
Type: unit test failure, due to "char" being unsigned char.
Impact: low. unit test failure on an infrequent platform
Found: by a user.
Bug backported: no
Bug introduced in: 0.4.0.x
Could be prevented by:
    * Unit testing on SunOS
    * Unit testing on a platform where char is unsigned.
    * Not using char for buffers.

#31772         MAPADDRESS control command
  (regression, during 0.4.1 or earlier but not found till 042)
Type: Incorrect parsing table for a command.
Impact: MAPADDRESS didn't work
Found: by a user.
Bug backported: no
Bug introduced in: 0c0b869ba450363e36 in 0.4.1.1-alpha, which was about extendcircuit and should never have touched this
Could be prevented by:
    * Noticing extraneous change to mapaddress_syntax in a patch that wasn't about that
      * Smaller pull requests?
    * Running stem integration tests for the mapaddress command.

#31898         Occasional crash during shutdown when using Tor API
  (regression, during 0.4.1 or earlier but not found till 042)
Type: Logic error wrt C null pointers.
Impact: Restart-in-process would fail when certain config changes were made.
Found: by sysrqb when working with jni
Bug backported: no
Bug introduced in: 47de9c7 in 0.4.1.1-alpha.
Could be prevented by:
    * Using a language without NULL
    * Writing unit tests based on explicit description of function, not expected arguments.
    * More testing of restart-in-process with configuration changes?
    * Fuzzing configs with restart-in-process?


#29819         sandbox crash on rt_sigaction with libseccomp 0.2.4
  (marked as major)
Type: Library compatibility, sandbox crash.
Impact: Medium. unable to run with sandbox using newer versions of libseccomp.
Found: by users.
Bug backported: n/a
Bug introduced in: This was caused by a change in libseccomp semantics. Previously rules were considerd to be ordered: now they aren't.
Could be prevented by:
    * Libseccomp should probably not have implemented ordered rules semantics unintentionally, and possibly should have documented its guarantees better.
    * A simpler sandbox design might have avoided this problem.
    * ???

#32778         Initialise pubsub in Windows NT service mode  
  (regression, NT service failure)
Type: Failure to initialize new component.
Impact: Medium. NT services wouldn't start.
Found: by users.
Bug backported: no
Bug introduced in: 0.4.1.1-alpha with the introduction of pubsub.
Could be prevented by:
    * Integration-testing NT services.
    * QA on NT services
    * Dropping support for NT services
    * Reducing number of entry paths to Tor so that tor_api.h is "the only way in" (but see #32883)

#32108         tor can overrun its accountingmax if it enters soft hibernation first  
(regression during 0.4.0.1-alpha, fixed in 0.4.2.3-alpha. marked as major)
Type: Bandwidth accounting. Logic error in flags applied to second_elapsed_callback
Impact: High. Relays didn't obey accountingmax under some circumstances.
Found: Not sure who noticed it; Roger figured it out.
Bug backported:
Bug introduced in: 4bf79fa4fa99fe6 in 0.4.0.1-alpha
Could be prevented by:
    * Eliminating second_elapsed_callback?
    * Integration tests for accounting
    * More internal checks about bandwidth accounting
    * Never refactoring working code (bad idea tho)

#31548         hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints intro points  
 (marked as major)
Type: Logic error; race condition.
Impact: Medium. HS was less reliable because it was encoding intro points in descriptor before they were attempted.
Found: by dgoulet testing another patch.
Bug backported: no
Bug introduced in: HSv3 code in 0.3.2.1-alpha
Could be prevented by:
    * Clear separation of "state" and "descriptor" for HS?
    * ???

Last modified 6 months ago Last modified on Apr 23, 2020, 1:23:25 PM