Update the bridgedescs module
I have a long list of pending changes to the bridgedescs module, and I'd like to discuss how to line them up and apply them with as few disruptions as possible. Here's the list:
-
The reprocessing of archived bridge descriptor tarballs to sanitize TCP ports (#19317 (moved)) is moving forward. All tarballs up to 2016-05 have been reprocessed, and I compared a sample of about 5% of the newly sanitized descriptors to the previously sanitized descriptors to ensure that the results are correct. I'm currently tar'ing them up and xz-compressing them, which will take another week or so. When this is done, I'll have to reprocess 2016-06 to 2016-09, which would take another week. And I'll have to deploy this new code on the main CollecTor instance, ideally as a second instance running in parallel on the same host until all archives are reprocessed.
-
We should take this opportunity of reprocessing bridge descriptors to also repackage them into one tarball per month and descriptor type. For example, bridge-descriptors-2016-09.tar.xz would be split up into bridge-statuses-2016-09.tar.xz, bridge-server-descriptors-2016-09.tar.xz, and bridge-extra-infos-2016-09.tar.xz. This may require some changes to paths, as well as changes to the create-tarballs script. Blocking on reprocessed archives.
-
I still have a branch with unfinished unit tests, some of which uncover unfixed minor bugs that won't be triggered as long as input from the bridge authority is trusted (like #20044 (moved)). Should happen before making any non-hotfix changes to the code.
-
The module seriously needs to be refactored into a more reasonable class structure and smaller, more testable methods (like #19755 (moved), but also #19621 (moved)). Not urgent, but should happen before we need to make the next non-hotfix change.
-
At some point we should rethink how we handle issues while sanitizing bridge descriptors (#19834 (moved)). Not urgent.
-
I made a few tweaks to the bridgedescs module to make it possible to reprocess large batches of tarballs without running out of memory (#19778 (moved)) or unnecessarily wasting processing time. These changes would be harmful for regular operation, so I wonder if we should add a "batch processing" configuration option to enable them. Not urgent.
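To make the per-type repackaging item above concrete, here is a minimal Python sketch of the naming scheme. It is not the create-tarballs script. The function name split_tarball_names is hypothetical, and the sketch assumes that every combined tarball is named bridge-descriptors-YYYY-MM.tar.xz, as in the example given in this ticket.

```python
import re

# Per-type prefixes taken from the example file names in this ticket.
TYPE_PREFIXES = (
    "bridge-statuses",
    "bridge-server-descriptors",
    "bridge-extra-infos",
)

# Assumed naming pattern of the existing combined monthly tarballs.
COMBINED = re.compile(r"^bridge-descriptors-(\d{4}-\d{2})\.tar\.xz$")


def split_tarball_names(combined_name):
    """Return the per-type tarball names for one combined monthly tarball."""
    match = COMBINED.match(combined_name)
    if match is None:
        raise ValueError("unexpected tarball name: %r" % combined_name)
    month = match.group(1)
    return ["%s-%s.tar.xz" % (prefix, month) for prefix in TYPE_PREFIXES]
```

Calling split_tarball_names("bridge-descriptors-2016-09.tar.xz") yields the three file names listed in the example. Any path changes would still have to be worked out separately.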
Can we make a plan for when to make/apply these changes, either on Trac or this weekend in Berlin?
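For the sample check mentioned in the first item, one way to pick a reproducible sample of about 5% of descriptors is to hash each file's relative path. This ticket does not say how the sample was actually drawn or compared, so the sketch below is only an assumption. The names in_sample and select_sample are hypothetical, and the comparison itself is left to the caller, since descriptors with re-sanitized ports are expected to differ from the old ones.

```python
import hashlib


def in_sample(relative_path, percent=5):
    """Deterministically select roughly `percent`% of paths by hashing them."""
    digest = hashlib.sha256(relative_path.encode("utf-8")).digest()
    # Map the first four digest bytes onto 0..99 and keep the lowest buckets.
    return int.from_bytes(digest[:4], "big") % 100 < percent


def select_sample(relative_paths, percent=5):
    """Return the sorted subset of paths that falls into the sample."""
    return sorted(p for p in relative_paths if in_sample(p, percent))
```

Because the selection depends only on the path, rerunning it over the same archive picks the same descriptors, so a check can be repeated after a code change.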