Fix regular expression in descriptor parser to correctly recognize bandwidth files
We're using a regular expression on the first 100 characters of a descriptor to recognize bandwidth files. More specifically, if a descriptor starts with ten digits followed by a newline, we parse it as a bandwidth file. (This is ugly, but the legacy bandwidth file format doesn't give us much of a choice.)
This regular expression is broken. String.matches() only returns true if the pattern matches the entire string, so the pattern has to match all of the first 100 characters of a descriptor, not just the first line. Ours didn't do that.
Suggested fix:
diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index 119fe09..08ac909 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
sourceFile);
} else if (fileName.contains(LogDescriptorImpl.MARKER)) {
return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
- } else if (firstLines.matches("^[0-9]{10}\\n")) {
+ } else if (firstLines.matches("(?s)[0-9]{10}\\n.*")) {
/* Identifying bandwidth files by a 10-digit timestamp in the first line
* breaks with files generated before 2002 or after 2286 and when the next
* descriptor identifier starts with just a timestamp in the first line
Explanation:
- We don't need to start the pattern with `^`, because the regular expression needs to match the whole string anyway.
- The `(?s)` part enables the dotall mode: "In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
- We need to end the pattern with `.*` to match any characters following the first newline, which also includes newlines due to the previously enabled dotall mode.
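The difference between the two patterns can be checked in isolation. The following is a minimal sketch, not the parser code itself: the class name, the helper methods, and the sample input are made up for illustration, and it assumes the digit class in both patterns is `[0-9]`.

```java
public class BandwidthFileRegexDemo {

  /* Pattern before the fix. Because String.matches() must match the whole
   * string, this only accepts input that is exactly ten digits followed by
   * a single newline and nothing else. */
  static final String OLD_PATTERN = "^[0-9]{10}\\n";

  /* Pattern after the fix. (?s) enables dotall mode so that the trailing .*
   * also consumes newlines in the remainder of the input. */
  static final String NEW_PATTERN = "(?s)[0-9]{10}\\n.*";

  static boolean matchesOld(String firstLines) {
    return firstLines.matches(OLD_PATTERN);
  }

  static boolean matchesNew(String firstLines) {
    return firstLines.matches(NEW_PATTERN);
  }

  public static void main(String[] args) {
    /* Hypothetical start of a bandwidth file: a 10-digit timestamp line
     * followed by more lines. The content after the first line is invented. */
    String firstLines = "1523911758\nnode_id=$ABC bw=760\nnode_id=$DEF bw=189\n";

    System.out.println(matchesOld(firstLines)); // false
    System.out.println(matchesNew(firstLines)); // true

    /* Input that does not start with ten digits is still rejected. */
    System.out.println(matchesNew("@type server-descriptor 1.0\nrouter\n")); // false
  }
}
```

Dropping `(?s)` from the new pattern would make it fail on this input again, because `.` would then stop at the second newline.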
I'll create a branch for this in a minute.