Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#5733 closed defect (fixed)

default.rulesets is obfuscated by not having line terminators

Reported by: cypherpunks Owned by: git@…
Priority: Low Milestone:
Component: HTTPS Everywhere/EFF-HTTPS Everywhere Version:
Severity: Keywords: lf cr crlf linebreak newline
Cc: sunspyre@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

HTTPS Everywhere 2012.5.1_1, running in Chromium on linux. I wanted to alter a rule without making a wholly new one.

default.rulesets is impossible to parse with human eyes. The lines have no line terminators so it's all in just one incredibly long string; *literally* a wall of text.

As an exception a small number of lines end in the long-deprecated CR terminator (carriage return), as for instance Macs used pre-OSX before changing to the Unix standard LF (linefeed). DOS/Windows use CRLF, which is why files saved under Unix-like OSs don't work well with notepad. Please see the Wikipedia page on newlines for more information.

https://en.wikipedia.org/wiki/Newline


File description;

2012.5.1_1/rules$ file default.rulesets
default.rulesets: UTF-8 Unicode text, with very long lines, with no line terminators

Count of lines and characters;

2012.5.1_1/rules$ wc -lc < default.rulesets
      1 1107900    # ergo 1 line of more than a million charactes

Count of occurences of CR and LF respectively;

2012.5.1_1/rules$ od -a < default.rulesets | grep -o cr | wc -l
145                # 145 CR occurences *total*
2012.5.1_1/rules$ od -a < default.rulesets | grep -o lf | wc -l
0                  # 0 LF



Regardless of which scheme to follow (Unix LF, DOS CRLF, old-old CR) there should naturally be one linebreak per line.


(Incidentally I used BR ? to break lines in this post, so if it ends up looking really weird let's all chuckle at the irony!)

Child Tickets

Attachments (1)

merge-rulesets.sh (4.8 KB) - added by cypherpunks 7 years ago.
minor improvements but really just cosmetics at this point

Download all attachments as: .zip

Change History (9)

comment:1 Changed 8 years ago by pde

Component: HTTPS Everywhere: ChromeEFF-HTTPS Everywhere

comment:2 Changed 8 years ago by pde

This is a hilarious bug report :).

The reason there are some CRs in there is because ruleset authors committed rules containing them, but the script that strips comments and whitespace from default.rulesets didn't anticipate that.

One solution to your problem is to edit rulesets by a different method:

git clone git://gitweb.torproject.org/httpseverywhere.git

<edit the ruleset in src/chrome/content/rules>

./makecrx.sh today

<uninstall the upstream .crx file; install the one you just built>

The other solution would be to send a patch against the ruleset library compression script to make it preserve a small number of judiciously selected whitespaces.  Maybe one per ruleset?  It's can be found here:

https://gitweb.torproject.org/https-everywhere.git/blob/b50b89f917497996409f87f2469237c9016237e0:/merge-rulesets.sh

comment:3 Changed 8 years ago by pde

Priority: normalminor

s/it's/it/

comment:4 Changed 7 years ago by cypherpunks

Something like that?

It flattens the entire thing into some unholy cosmic superstring, then adds appropriate linebreaks and some basic indentation on a per-tag setting.

Builds for me, at least.

Example: http://paste.ubuntu.com/1316689 (unsure of how long Ubuntu pastes stay alive)

comment:5 Changed 7 years ago by pde

This looks great. We should give it a review for possible bugs and the commit to master.

comment:6 Changed 7 years ago by pde

Owner: changed from pde to git@…
Status: newassigned

FauxFaux, any chance you could review this fancier upgrade of your earlier script?

Changed 7 years ago by cypherpunks

Attachment: merge-rulesets.sh added

minor improvements but really just cosmetics at this point

comment:7 Changed 7 years ago by pde

Resolution: fixed
Status: assignedclosed

Okay, I have merged this in git master. I also had an ugly hack translation of the old script into Python lying around, which is a lot faster and therefore useful for rapid-iteration debugging of HTTPS Everywhere things. I tweaked that to do one line per ruleset, but the shell version is much more elegant ATM.

comment:8 Changed 7 years ago by pde

Just a followup note: the pretty merge-rulesets.sh script is actually buggy, so we can't use it, but our current merge-rulesets.py script still addresses the underlying problem (not as prettily, mind) and is 5x faster.

Note: See TracTickets for help on using tickets.