CollecTor Development

This is a living document that accompanies the current project to improve CollecTor.

Areas of Work

Over the course of this project, the following sections will gradually turn into descriptions and documentation. Currently, they are a mixture of well-defined improvements, sketches, wishes, and open questions.

Analyze Descriptor Completeness

The analysis will be based on log files and the downloaded files, and will address the following questions:

How many descriptors are missing?

How could this loss be avoided?

  • Actively monitor resources such as available storage space (discussion in ticket #18865).
  • Verify and improve runtime statistics to get a clearer picture (discussion in ticket #19169).
  • Extra-info descriptors that are dropped because of parsing problems are currently counted as missing; this should be avoided (ticket #19170).
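The first point, monitoring available storage space, could be sketched roughly as follows. This is a minimal illustration, not CollecTor's implementation: the class name `StorageCheck`, its methods, and the 200 MiB threshold are assumptions made for this example; only `Files.getFileStore` and `FileStore.getUsableSpace` are standard Java calls.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of CollecTor. */
final class StorageCheck {

  private StorageCheck() {
  }

  /** Pure comparison, kept separate so it can be tested without a disk. */
  static boolean isBelowLimit(long usableBytes, long limitBytes) {
    return usableBytes < limitBytes;
  }

  /**
   * Returns the bytes usable by this JVM on the file store holding dir.
   * Throws an IOException if dir does not exist or cannot be accessed.
   */
  static long usableBytes(Path dir) throws IOException {
    FileStore store = Files.getFileStore(dir);
    return store.getUsableSpace();
  }

  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    long limit = 200L * 1024L * 1024L; // assumed threshold: 200 MiB
    long usable = usableBytes(dir);
    if (isBelowLimit(usable, limit)) {
      System.err.printf("WARN: only %d MiB usable for %s%n",
          usable >> 20, dir);
    } else {
      System.out.printf("OK: %d MiB usable for %s%n", usable >> 20, dir);
    }
  }
}
```

A check like this could run before each download so that a full disk is reported as a warning instead of showing up later as missing descriptors.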

Next Steps

Continue the analysis once the sync process is deployed.

Provide Guide Documents

These guides should be based on the previous work in Onionoo and metrics-lib. In detail:

  • Contributor's Guide: create as detailed in #18733 and place the new guide in a central location, which still needs to be identified. One option is a large document in the central location and a small document in CollecTor that references it (detailed discussion in #18730).
  • Release Process (defined in #18732)
  • Installation Guide for Operators (adapt the existing document), ticket #18734

Implement the Release Process

(according to the guide above)

Design Changes

This section describes improvements intended to make CollecTor more maintainable, more testable, and more efficient.

  1. Run CollecTor with an internal scheduler instead of external scheduling (e.g., crontab), #19018.
  2. Add a shutdown hook to provide a controlled way of stopping (discussion in #19016).
  3. Some parts of CollecTor's data processing are provided by bash scripts run via crontab. These should be integrated into the Java application.
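Items 1 and 2 above can be illustrated together with a small sketch built on the standard `ScheduledExecutorService` and `Runtime.addShutdownHook`. The class `ModuleScheduler`, its method names, and the timeouts are hypothetical and chosen for this example; the sketch does not claim to mirror the scheduler actually implemented in CollecTor.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; class and method names are not CollecTor's. */
final class ModuleScheduler {

  private final ScheduledExecutorService executor =
      Executors.newScheduledThreadPool(1);

  /** Runs the module periodically, taking the place of a crontab entry. */
  void schedule(final Runnable module, long initialDelay, long period,
      TimeUnit unit) {
    Runnable guarded = new Runnable() {
      @Override
      public void run() {
        try {
          module.run();
        } catch (RuntimeException e) {
          // scheduleAtFixedRate suppresses all later runs of a task that
          // throws, so failures are reported here and not rethrown.
          System.err.println("Module run failed: " + e);
        }
      }
    };
    executor.scheduleAtFixedRate(guarded, initialDelay, period, unit);
  }

  /**
   * Stops scheduling new runs and waits for a run in progress.
   * Returns true if everything finished within the timeout.
   */
  boolean shutdown(long timeout, TimeUnit unit) {
    executor.shutdown();
    try {
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Registers a JVM shutdown hook that triggers the controlled stop. */
  void registerShutdownHook(final long timeout, final TimeUnit unit) {
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        shutdown(timeout, unit);
      }
    });
  }

  public static void main(String[] args) throws InterruptedException {
    ModuleScheduler scheduler = new ModuleScheduler();
    scheduler.registerShutdownHook(10L, TimeUnit.SECONDS);
    scheduler.schedule(new Runnable() {
      @Override
      public void run() {
        System.out.println("module run");
      }
    }, 0L, 1L, TimeUnit.SECONDS);
    Thread.sleep(3500L); // let the demo module run a few times
    scheduler.shutdown(10L, TimeUnit.SECONDS);
  }
}
```

With the hook registered, a normal termination signal lets a run in progress finish (up to the timeout) before the JVM exits, which is the kind of controlled stop that item 2 asks for.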

Improve CollecTor Operation and Setup

Once the executable jar including the shutdown hook implementation is available, CollecTor should be started as a Linux service, i.e., an appropriate shell script needs to be provided.

Further Sketches of Areas for Improvements

  • store unparsable descriptors rather than discarding them
    • add local storage for descriptors that cannot be parsed, so that the service operator can review them and reprocess them later
  • synchronization between CollecTor instances, see #18910 and DescriptorDistribution
  • improve the process of creating tarballs
    • reduce memory consumption throughout
  • consider using an embedded HTTP server in order to reduce operating complexity
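As an illustration of the first sketch in this list (keeping unparsable descriptors for later review), the following minimal example writes the raw bytes to a separate directory. Everything here is an assumption for the sake of the example: the class `UnparsableStore`, the idea of naming files by their SHA-256 digest so that repeated downloads of the same bytes are stored only once, and the directory layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch; not part of CollecTor. */
final class UnparsableStore {

  private final Path dir;

  UnparsableStore(Path dir) {
    this.dir = dir;
  }

  /**
   * Writes the raw bytes to a file named after their SHA-256 digest and
   * returns that file; identical input is stored only once.
   */
  Path store(byte[] rawDescriptor) throws IOException {
    Files.createDirectories(dir);
    Path target = dir.resolve(sha256Hex(rawDescriptor));
    if (!Files.exists(target)) {
      Files.write(target, rawDescriptor);
    }
    return target;
  }

  /** Returns the lower-case hex encoding of the SHA-256 digest of data. */
  static String sha256Hex(byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform implementation is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }
}
```

A parser wrapper could call `store` in its error path instead of dropping the bytes, which would leave the operator a directory of files to inspect and feed back into processing once the parsing problem is fixed.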

Releases

Release 1.1.0

Release date: tbd

| Ticket | Summary | Status |
| --- | --- | --- |
| #18910 | distributing descriptors accross CollecTor instances | closed |
| #19822 | set up a CollecTor mirror for synchronization with the main CollecTor | closed |
| #19831 | Change default for compressing descriptors to true | closed |
| #20162 | reduce configuration parameters in collector.properties | closed |
| #20179 | Require absolute path for `$TARBALLTARGETDIR` in `src/main/resources/create-tarballs.sh` | closed |
| #20380 | Expand INSTALL.md to a more complete operator's guide | closed |
| #20408 | Move index.json* to index/ subdirectory | closed |

Release 1.2.0

Release date: tbd

| Ticket | Summary | Status |
| --- | --- | --- |
| #8799 | collector's downloads: avoid httpurl-connection | closed |
| #19755 | improve code quality of bridgedescs module | closed |
| #19778 | Bridge descriptor sanitizer runs out of memory after 13.5 days | closed |
| #19934 | CollecTor should use new metrics-lib json classes | closed |
| #20514 | CollecTor' torperf module: replace HttpURLConnection | closed |
| #20515 | CollecTor's relaydescs module should avoid httpurlconnection | closed |
| #20516 | CollecTor's exitlists module should avoid httpurlconnection | closed |
| #21443 | CollecTor does not delete exit lists after three days anymore | closed |
| #22216 | Decide whether to sanitize padding-counts lines | closed |
| #22247 | Remove deprecation warnings as soon as metrics-lib 1.7.0 is released | closed |
| #22652 | Adapt CollecTor to metrics-lib 1.9.0 | closed |
| #22754 | Reference checker should only read relay descriptors | closed |
| #22833 | Either include or retain "fingerprint" line in bridge network statuses with @type bridge-network-status 1.2 | closed |

Release 2.0.0

Release date: tbd

| Ticket | Summary | Status |
| --- | --- | --- |
| #20350 | Replace create-tarball.sh shell script with Java module | assigned |

Past Releases

Release 1.0.2, October 7, 2016

| Ticket | Summary | Severity |
| --- | --- | --- |
| #19016 | add shutdown hook | Normal |
| #19317 | Sanitize TCP ports in bridge descriptors | Normal |
| #19894 | print message when no module is activated | Minor |
| #19895 | make CollecTor stop after RunOnce | Normal |
| #19924 | collector.properties: base url should not be in quotes | Minor |
| #20079 | Change log thresholds from TRACE to INFO | Normal |

Bugfix Release 1.0.1, August 22, 2016

Prevent out-of-memory error, cf. #19913.

First Release 1.0.0, August 11, 2016

| Ticket | Summary | Severity |
| --- | --- | --- |
| #18707 | use java 7 | Minor |
| #18719 | provide executable jar | Normal |
| #18727 | refactor ernie before very first metrics-db release | Normal |
| #18734 | Installation Guide for Operators | Normal |
| #18792 | tweak build.xml for new tasks and java 7 | Normal |
| #18793 | add checkstyle task | Normal |
| #18794 | add cobertura task | Normal |
| #18818 | Stop using deprecated parts of metrics-lib. | Normal |
| #18865 | actively monitor resources like available storage space | Normal |
| #18922 | configure logging via properties file | Normal |
| #18931 | coding style polishing | Normal |
| #18955 | javadoc coverage checkstyle warnings | Normal |
| #19005 | make all data directories configurable | Normal |
| #19015 | use logging framework other than java.util.logging | Normal |
| #19018 | run CollecTor modules without crontab | Normal |
| #19021 | improve configuration process | Normal |
| #19170 | make parsing more robust (extra-info) | Normal |
| #19373 | write test that checks the default collector.properties | Normal |
| #19424 | remove hard coded paths and set default properties to values used on the main CollecTor instance | Minor |
| #19615 | CollecTor should confirm to style guide | Normal |
| #19641 | investigate and fix MainTest | Normal |
| #19651 | add missing scripts to collector.git | Normal |
| #19720 | CollecTor should be re-configurable without restart | Normal |
| #19727 | correct exitlist | Normal |
| #19771 | investigate halt of scheduling for one of many tasks in collector's scheduler | Normal |
| #19776 | Make minor improvements to scheduler | Normal |
| #19813 | define release process and do release of milestone 1.0.0 | Normal |
| #19829 | Update directory authority addresses to recent tor.git | Normal |
| #19830 | Check if recent directory exists before checking available space | Normal |
| #19840 | Change path defaults to match those of main CollecTor instance | Normal |

All Tasks in Trac

Active Tasks

Results (1 - 10 of 37)

| Ticket | Summary | Status | Priority | Severity | Reporter | Modified |
| --- | --- | --- | --- | --- | --- | --- |
| #26162 | Replace Gson with Jackson in CollecTor | needs_revision | High | Normal | karsten | 2 hours ago |
| #20350 | Replace create-tarball.sh shell script with Java module | assigned | Medium | Normal | iwakeh | 4 days ago |
| #20351 | Turn the updateindex module into a function that runs after each module run | assigned | Medium | Normal | iwakeh | 4 days ago |
| #26089 | collect and archive DNS resolver data of tor exits | new | Medium | Normal | cypherpunks | 9 days ago |
| #20224 | Fix `BridgeDescriptorMappingsLimit` config option | needs_revision | Low | Normal | karsten | 5 weeks ago |
| #20546 | Implement CleanUtils class for common file system operations | assigned | Medium | Normal | iwakeh | 7 weeks ago |
| #23421 | Use persistence functionality throughout all modules | accepted | Medium | Normal | iwakeh | 7 weeks ago |
| #25644 | Write white paper about CollecTor's data processing (Sponsor13, 1) | accepted | Medium | Normal | iwakeh | 8 weeks ago |
| #25307 | Use `persist` package for writing bridge descriptors in sanitization module | accepted | Medium | Normal | iwakeh | 8 weeks ago |
| #20421 | Investigate invalid descriptors in out/ and recent/ subdirectories | assigned | Medium | Normal | karsten | 3 months ago |

Completed Tasks

Results (1 - 10 of 198)

| Ticket | Summary | Priority | Severity | Reporter | Modified |
| --- | --- | --- | --- | --- | --- |
| #25827 | Adapt CollecTor to changes in metrics-lib 2.3.0 | Medium | Normal | iwakeh | 5 weeks ago |
| #25624 | Index 'contrib' directory | Medium | Normal | iwakeh | 2 months ago |
| #25547 | Put out CollecTor 1.5.1 | High | Normal | karsten | 2 months ago |
| #25525 | Fix either spec or code regarding full path of sanitized webstats files | High | Normal | karsten | 2 months ago |
| #25522 | Avoid missing identical lines during sanitization of logs | High | Normal | iwakeh | 2 months ago |
| #25317 | Enable webstats to process large (> 2G) logfiles | High | Normal | iwakeh | 3 months ago |
| #25161 | Fix any memory problem caused by number of log files to be imported by the webstats module | Medium | Normal | karsten | 3 months ago |
| #20549 | Make bridge descriptor sanitization more maintainable and more modular | High | Normal | iwakeh | 3 months ago |
| #25100 | Make CollecTor's webstats module use less RAM and wall time | High | Normal | karsten | 4 months ago |
| #24983 | Inaccessible semi-recent consensus files | Medium | Normal | robgjansen | 4 months ago |

Last modified on Oct 8, 2016, 10:35:23 AM