Opened 3 months ago

Closed 3 months ago

#30369 closed defect (fixed)

Fix regular expression in descriptor parser to correctly recognize bandwidth files

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We're using a regular expression on the first 100 characters of a descriptor to recognize bandwidth files. More specifically, if a descriptor starts with ten digits followed by a newline, we parse it as a bandwidth file. (This is ugly, but the legacy bandwidth file format doesn't give us much of a choice.)

This regular expression is broken. The regular expression we want is one that matches the first 100 characters of a descriptor, which ours didn't do.

Suggested fix:

diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index 119fe09..08ac909 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
           sourceFile);
     } else if (fileName.contains(LogDescriptorImpl.MARKER)) {
       return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
-    } else if (firstLines.matches("^[0-9]{10}\\n")) {
+    } else if (firstLines.matches("(?s)[0-9]{10}\\n.*")) {
       /* Identifying bandwidth files by a 10-digit timestamp in the first line
        * breaks with files generated before 2002 or after 2286 and when the next
        * descriptor identifier starts with just a timestamp in the first line

Explanation:

  • We don't need to start the pattern with ^, because the regular expression needs to match the whole string anyway.
  • The (?s) part enables the dotall mode: "In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
  • We need to end the pattern with .* to match any characters following the first newline, which also includes newlines due to the previously enabled dotall mode.

I'll create a branch for this in a minute.

Child Tickets

Change History (3)

comment:1 Changed 3 months ago by karsten

Status: assignedneeds_review

comment:2 Changed 3 months ago by irl

Status: needs_reviewmerge_ready

This looks good to me.

comment:3 Changed 3 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Thanks for looking! Merged. Closing.

Note: See TracTickets for help on using tickets.