Ticket #18797: BridgeNetworkStatus.g4

File BridgeNetworkStatus.g4, 9.6 KB (added by karsten, 2 years ago)

An ANTLR 4 grammar for Tor bridge network statuses

/*************************************************************************
 *          An ANTLR 4 grammar for Tor bridge network statuses
 *************************************************************************
 *
 * Copyright 2015, The Tor Project
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *
 * * Redistributions of source code must retain the above copyright
 *   notice, this list of conditions and the following disclaimer.
 *
 * * Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following disclaimer
 *   in the documentation and/or other materials provided with the
 *   distribution.
 *
 * * Neither the names of the copyright owners nor the names of its
 *   contributors may be used to endorse or promote products derived from
 *   this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *************************************************************************
 *
 * There are multiple goals behind having a grammar for Tor descriptors
 * available on CollecTor:
 *
 * 1. Translate descriptors to JSON for statistical analysis: Some tools
 *    and databases require Tor descriptors in a standard format like
 *    JSON.  This grammar and a parser generated from it can make that
 *    translation as easy as possible and keep future maintenance effort
 *    low.
 *
 * 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
 *    there are three libraries for parsing Tor descriptors: metrics-lib
 *    for Java, Stem for Python, and Zoossh for Go.  It would be
 *    beneficial to place as much knowledge about the descriptor format
 *    as possible into a grammar shared by all those libraries and then
 *    generate parsers for different languages from that grammar.
 *
 * 3. Serve as documentation for the Tor directory protocol specification:
 *    Tor descriptors are already documented using a hand-written grammar,
 *    but that may contain slight inaccuracies because it's not verified.
 *    This grammar could fix that, either by exposing inaccuracies while
 *    the hand-written grammar is rewritten in executable form or by
 *    replacing the grammar in the specification with this executable
 *    grammar.
 *
 * 4. Use as basis for parsing and encoding descriptors in tor: Using some
 *    kind of machine-readable grammar/schema for all our data formats and
 *    having actual parsing/encoding code generated from it can reduce
 *    crash/assertion bugs in the tor daemon.
 *
 * Here's how you run it:
 *
 * - Download this file to a local working directory.
 *
 * - Make sure you have Java 1.6 or higher installed, including the
 *   compiler:
 *   java -version
 *   javac -version
 *
 * - Get the ANTLR 4 JAR file (version 4.5.1 is used below):
 *   wget http://www.antlr.org/download/antlr-4.5.1-complete.jar
 *
 * - Run ANTLR 4 to generate Java source files from the grammar file and
 *   compile those files:
 *   java -jar antlr-4.5.1-complete.jar BridgeNetworkStatus.g4
 *   javac -cp antlr-4.5.1-complete.jar BridgeNetworkStatus*.java
 *
 * - Get an example bridge network status from
 *   https://collector.torproject.org/recent/bridge-descriptors/statuses/
 *   and save it to a file called `bridge-network-status`.
 *
 * - Possibly truncate that file after the first couple of router entries
 *   (lines starting with "r") to make the graphical parse tree more
 *   useful.
 *
 * - Open a Java application that displays the parse tree:
 *   java -cp .:antlr-4.5.1-complete.jar org.antlr.v4.gui.TestRig \
 *           BridgeNetworkStatus document -gui bridge-network-status
 *
 * Open issues and questions:
 *
 * - Was it smart to explicitly include all those SP tokens in the rules,
 *   or should those be discarded right away by the lexer?  The main
 *   reason for keeping them was to stay as close to the specification as
 *   possible, but maybe that has downsides for the other goals.
 *
 * - If a bridge uses a nickname (or another token that's supposed to be
 *   a STRING) that is also a keyword like "r" or "published", things get
 *   confusing.  Try editing the input bridge network status and observe
 *   the result.  But those are perfectly valid nicknames, so what can we
 *   do?  Maybe we can change the lexing rules so that keywords are only
 *   recognized at position 0 on the line, outside of a base64 block, but
 *   how do we convince ANTLR 4 to generate a lexer like that?
 *
 * - It would be really nice to use regular expressions in the grammar to
 *   match input more thoroughly than just ~[ \n]+, if only we could fix
 *   the lexer troubles.  Otherwise, all that verification work would
 *   need to be duplicated in each of the language-dependent parsers,
 *   which kinda defeats the purpose.
 *
 * - Is it easy to walk the parse tree and output a JSON format *without*
 *   having to write code for each of the rules?  Ideally, the translator
 *   would be 20 lines of code and not grow at all if we add 10 more
 *   descriptor types.  Do we need to change the grammar for that?
 *
 * - The following may turn out to be a non-issue, but some descriptors
 *   require lines to be ordered, e.g., "accept" and "reject" lines in
 *   server descriptors, and we'll have to retain that order in the parse
 *   tree.  This should be similar to how we parse entries, starting with
 *   "r" lines, but who knows.
 *
 * - Consider moving the list of accepted flags to the parser and just
 *   matching any string as flag.
 *
 * - It's unclear whether it's even possible to use a context-free grammar
 *   to match the Tor directory specification.  For example, the order of
 *   "s", "w", "p", and "a" lines following an "r" line is not specified,
 *   but some of them must not occur more than once.  Are there good
 *   alternatives?
 */
grammar BridgeNetworkStatus;

/* Starting with the lexer rules, because they seem to be the most
 * limiting factor in using this grammar for Tor descriptors.
 *
 * In particular, we're using a single token type for content, STRING,
 * rather than matching permitted input for nickname ([A-Za-z0-9]{1,19}),
 * base64-encoded 160-bit strings ([A-Za-z0-9+/]{27}), ports ([0-9]{1,5}),
 * etc.  This is because we don't want the lexer to recognize a string as
 * one type (e.g., "123" as port) when it's really supposed to be another
 * type (e.g., nickname).
 *
 * Note that we might totally be doing things wrong here. */
SP: ' ';
NL: '\n';
STRING: ~[ \n]+;

/* Each document, whether a bridge network status or another type,
 * consists of annotation lines whose keywords start with @, a header
 * with zero or more lines, and a body with zero or more router entries.
 * Most documents also contain a footer, but bridge network statuses
 * don't, so that's left out here. */
document: annotations header body;

/* The only annotation line that's supported right now is @type. */
annotations: (type)?;
type: '@type' SP STRING SP STRING NL;

/* There are two possible lines in the header, published and
 * flag-thresholds, and empty lines are permitted, too. */
header: (published | flagThresholds | NL)*;
flagThresholds: 'flag-thresholds' (SP param)* NL;
param: STRING;
published: 'published' SP date SP time NL;
date: STRING;
time: STRING;

/* The body contains zero or more entries, each of which starts with an
 * "r" line followed by "s", "w", "p", etc. lines, possibly with empty
 * lines mixed in. */
body: (entry | NL)*;
entry: r (s | w | p | a | NL)*;

/* The "r" line contains a lot of space-separated positional arguments.
 * All those are matched as STRING, because the lexer (which doesn't care
 * about this parsing rule) cannot distinguish reliably between the
 * various allowed input strings.  For example, if the lexer sees "1234",
 * it can't say whether that's a nickname or a port.  (It's quite possible
 * that we're doing something wrong here and it's all our fault.) */
r: 'r' SP nickname SP identity SP digest SP publicationDate SP
       publicationTime SP ip SP orPort SP dirPort NL;
nickname: STRING;
identity: STRING;
digest: STRING;
publicationDate: STRING;
publicationTime: STRING;
ip: STRING;
orPort: STRING;
dirPort: STRING;

/* The "s" line contains a space-separated list of previously known flag
 * names. */
s: 's' (SP flag)* NL;
flag: 'Exit' | 'Fast' | 'Guard' | 'HSDir' | 'Running' | 'Stable' |
      'Valid' | 'V2Dir';

/* The "w" line contains zero or more bandwidth parameters, which can
 * start with "Bandwidth=" or other keys, which will be left to the parser
 * to recognize. */
w: 'w' (SP bandwidth)* NL;
bandwidth: STRING;

/* The "p" line contains a policy ("accept" or "reject") and a
 * comma-separated list of ports, both of which will be handled by the
 * parser.  Note that STRING also matches commas, so a port list without
 * spaces such as "80,443" reaches the parser as a single STRING token. */
p: 'p' SP policy SP portList NL;
policy: STRING;
portList: STRING (',' STRING)*;

/* The "a" line contains an additional IP address and port, which will be
 * left to the parser to recognize. */
a: 'a' SP STRING NL;

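Two of the open questions in the header comment ask whether keywords can be recognized only at position 0 of a line, and whether a descriptor can be turned into JSON without writing code per rule. The Python sketch below illustrates both ideas with a hand-written line splitter. It is not generated by ANTLR and is not part of this grammar; it only shows the intended behavior under simplifying assumptions. The function name `parse_status`, the output layout, and the sample input are all hypothetical.

```python
import json


def parse_status(text):
    """Split a bridge network status into annotations, header, and entries.

    Only the first token on a line is treated as a keyword, so a nickname
    such as "r" or "published" in a later position stays ordinary data.
    Lists are used throughout, so the original line order is retained.
    """
    doc = {"annotations": [], "header": [], "entries": []}
    for line in text.split("\n"):
        if not line:
            continue  # the grammar permits empty lines; skip them here
        keyword, *args = line.split(" ")
        item = {"keyword": keyword, "args": args}
        if keyword.startswith("@"):
            doc["annotations"].append(item)
        elif keyword == "r":
            # An "r" line opens a new entry; later lines attach to it.
            doc["entries"].append({"r": args, "lines": []})
        elif doc["entries"]:
            doc["entries"][-1]["lines"].append(item)
        else:
            doc["header"].append(item)
    return doc


# Made-up sample input: the nickname "r" and the base64-looking values are
# placeholders for illustration, not real bridge data.
SAMPLE = (
    "@type bridge-network-status 1.0\n"
    "published 2015-12-01 00:00:00\n"
    "r r AAAAAAAAAAAAAAAAAAAAAAAAAAA BBBBBBBBBBBBBBBBBBBBBBBBBBB "
    "2015-11-30 23:00:00 10.0.0.1 443 0\n"
    "s Fast Running Valid\n"
    "p accept 80,443\n"
)

if __name__ == "__main__":
    print(json.dumps(parse_status(SAMPLE), indent=2))
```

Because the splitter never inspects anything but the first token for keywords, the bridge nicknamed "r" in the sample parses like any other. The sketch does no validation at all (argument counts, flag names, port ranges), which is exactly the work the grammar would ideally take over.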