Opened 6 years ago

Closed 5 years ago

#11630 closed defect (fixed)

Creating HTTPS-Everywhere's rules.sqlite is non-deterministic

Reported by: gk Owned by: erinn
Priority: Medium Milestone:
Component: Applications/Tor bundles/installation Version:
Severity: Keywords:
Cc: pde, zyan, jsha@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

While looking at the coming bundles it turned out that creating the rules.sqlite is non-deterministic. Looking closer it seems that sqlite itself is the culprit here.

Child Tickets

Attachments (3)

rulesdiff.bz2 (2.2 MB) - added by gk 6 years ago.
diff of two rules.sqlite files
sqlitehashes (1.6 KB) - added by gk 6 years ago.
Hashes of rules.sqlite files
rulesdiff_vacuum.bz2 (2.1 MB) - added by gk 5 years ago.
rules.sqlite diff of a local and a remote rules.sqlite built with VACUUMing

Change History (34)

comment:1 Changed 6 years ago by gk

Doing

sqlite3 rulesets.sqlite .dump > rules.txt

and

sqlite3 rulesets2.sqlite .dump > rules2.txt

and looking at the diff of both files shows nothing. So, there is probably no issue with rules being in the wrong order. Getting diffs of the sqlite files with

diff -u <(xxd rulesets.sqlite) <(xxd rulesets2.sqlite) > rulesdiff

shows a ton of differences though.

Changed 6 years ago by gk

Attachment: rulesdiff.bz2 added

diff of two rules.sqlite files

comment:2 Changed 6 years ago by zyan

Cc: jsha@… added

CCing Jacob in case he has time to dig into this.

comment:3 Changed 6 years ago by zyan

I managed to get rid of most of the non-determinism in rulesets.sqlite by executing VACUUM before closing the connection, yay. There's still a few bytes different near the start of the file: https://gist.github.com/diracdeltas/5af9bb793ce164b20c78

Anyone who wants to take a stab at this should look at https://www.sqlite.org/fileformat2.html for nonces/timestamps.

comment:4 Changed 6 years ago by zyan

Ok, I ended up looking at this myself. The bytes that differ in the diff above are at:

offset 27 - part of "File Change Counter"
offset 43 - part of "Schema cookie"
offset 95 - part of "version valid-for number" (which should usually be the same as the value of "File Change Counter" according to the docs)

I'm guessing (based on the name) that the Schema cookie is initialized with a random value, so it ends up being non-deterministic. It's unclear to me whether the File Change Counter values are different because the sqlite writes are non-deterministic or because they're also being initialized to non-deterministic values. Either way, it seems that it would be fine to hard-code these values in the generated database in makexpi.sh.

comment:5 in reply to:  4 Changed 6 years ago by gk

Replying to zyan:

Ok, I ended up looking at this myself. The bytes that differ in the diff above are at:

Does this still happen if you make sure that the rules.sqlite file is deleted in src/defaults *each time* after you invoked makexpi.sh and created a new .xpi? Doing that gives me exactly the same rules.sqlite file each time locally.

comment:6 Changed 6 years ago by gk

Resolution: fixed
Status: newclosed

This seems to be fixed with the execution of VACUUM before closing the connection. I generated Linux bundles several times with our Gitian setup and we get the same sha256sums each time. We are using the old HTTPS-E creation procedure with commit 9d18e71a070bdfd4c3927a7546bc18e201fa3a00 (maint-3.6) and commit d7dc1e0781aa2735e6ed98d68fed9f54f1c2a374 master.

comment:7 Changed 6 years ago by zyan

Ah, that solved it for me too. Thanks!

comment:8 Changed 6 years ago by gk

Resolution: fixed
Status: closedreopened

Doing something like

sed 's/conn.commit()/&\nconn.execute("VACUUM")/' -i utils/make-sqlite.py

in our Gitian setup does not seem to help. So, we need to look at it again.

comment:9 Changed 6 years ago by zyan

hmm, note that the latest development release already includes VACUUMing, but that shouldn't be causing problems.

comment:10 in reply to:  9 Changed 6 years ago by gk

Replying to zyan:

hmm, note that the latest development release already includes VACUUMing, but that shouldn't be causing problems.

Yes, we are using 3.5.1. I made some tests and the results are attached. "+ patch" means 3.5.1 + VACUUMing. The first of both hashes in a section is from the 32bit build the latter from the 64bit one. VACUUMing does not do anything to solve the problem. (I missed that when testing as I only compared the 32bit output of build1 against the 32bit output of build2 done locally which is identical) To make matters even more interesting I included a hash of a rules.sqlite built in a 32bit Gitian environment on a different machine. It is different compared to the hashes of my local build machine.

Changed 6 years ago by gk

Attachment: sqlitehashes added

Hashes of rules.sqlite files

comment:11 Changed 6 years ago by jsha

gk, you say that deleting the rulesets.sqlite file before rebuilding results in a reproducible build. I get the same results locally. Is there any reason not to make that the default behavior? Reusing the existing DB doesn't save any build time, since each build rewrites all of the tables.

comment:12 Changed 6 years ago by zyan

I would expect builds (after deleting rulesets.sqlite) with VACUUM to be the same across different 32 or 64 bit machines running Linux. Here's the shasum I get when running ./makexpi.sh on 3.5.1 after cherry-picking commit 37082ccf6274e46d83ad247e0d2c9236f7f338ef (to VACUUM the db) on a 64-bit debian machine:

8585325d850ed9f10de6a52f70dd2269df7ebd43  pkg/https-everywhere-3.5.1~pre.xpi

I did it a few times and got the same result, deleting ./pkg and ./src/defaults/rulesets.sqlite in between.

comment:13 in reply to:  12 Changed 6 years ago by gk

Replying to zyan:

I did it a few times and got the same result, deleting ./pkg and ./src/defaults/rulesets.sqlite in between.

Yeah, it seemed stable on the respective machine/VM. FWIW: I just did the same on my Ubuntu Precise build machines and I got 4ce2ab333ac08970a897fbb0c7b602bd07e8c91f.

comment:14 in reply to:  11 ; Changed 6 years ago by gk

Replying to jsha:

gk, you say that deleting the rulesets.sqlite file before rebuilding results in a reproducible build.

Yes, it is reprocducible on the same machine but gives different results on different machines.

I get the same results locally. Is there any reason not to make that the default behavior? Reusing the existing DB doesn't save any build time, since each build rewrites all of the tables.

Yes, we do that. There is no reuse happening, just the https-e repo and one ./makexpi.sh.

comment:15 in reply to:  14 Changed 6 years ago by gk

Replying to gk:

Replying to jsha:

gk, you say that deleting the rulesets.sqlite file before rebuilding results in a reproducible build.

Yes, it is reprocducible on the same machine but gives different results on different machines.

And I might add: this holds as well if you omit VACUUMing.

comment:16 Changed 6 years ago by jsha

Ah, now I get it. Could you post another hex diff, this time between the builds from different machines?

comment:17 in reply to:  16 Changed 5 years ago by gk

Replying to jsha:

Ah, now I get it. Could you post another hex diff, this time between the builds from different machines?

Sure, it is attached. What I did was using the 3.5.1 tag + the VACUUM patch running on a local Ubuntu 12.04 and in a remote Ubuntu 12.04 chroot. I made sure that no /pkg and no old rules.sqlite were around.

Changed 5 years ago by gk

Attachment: rulesdiff_vacuum.bz2 added

rules.sqlite diff of a local and a remote rules.sqlite built with VACUUMing

comment:18 Changed 5 years ago by erinn

Keywords: needs-triage added

comment:19 Changed 5 years ago by gk

Keywords: needs-triage removed

comment:20 Changed 5 years ago by jsha

Sorry it took me so long to look at this. The first diff that jumps out at me is offset 28-31 decimal (0x16-0x19), whose value is 1424 in the second file and 1428 (dec) in the first file. Per the docs at https://www.sqlite.org/fileformat.html, that's the in-header database size in pages. You probably don't still have the two file around, but if you do: were they different sizes?

The other thing I noticed in this diff is that there is a large chunk of the rulesets that show up as a diff, where they are in a different order. Looking at the code, that makes sense. At the time of filing this bug, we used os.walk, which does not have an order guarantee (http://stackoverflow.com/a/5667552). Now we use glob.glob, which also does not have an order guarantee (http://stackoverflow.com/a/6774404). I will write a change to sort the files before loading them, which should hopefully make at least that part deterministic. I suspect it may have mysterious add-on effects that could fix the other diffs. Let's try it and see.

comment:21 in reply to:  20 Changed 5 years ago by gk

Replying to jsha:

Sorry it took me so long to look at this. The first diff that jumps out at me is offset 28-31 decimal (0x16-0x19), whose value is 1424 in the second file and 1428 (dec) in the first file. Per the docs at https://www.sqlite.org/fileformat.html, that's the in-header database size in pages. You probably don't still have the two file around, but if you do: were they different sizes?

I only still have the local one it seems which gives me:

-rwxr-xr-x  1 thomas thomas 1462272 Jun 11 10:26 rules_local.sqlite

You can easily compare that one to your own rules.sqlite if you check out the 3.5.1 tag, add

conn.execute("VACUUM")

before the conn.close(), make sure you don't have an old rules.sqlite file in your src/defaults directory and build https-e.

The other thing I noticed in this diff is that there is a large chunk of the rulesets that show up as a diff, where they are in a different order. Looking at the code, that makes sense. At the time of filing this bug, we used os.walk, which does not have an order guarantee (http://stackoverflow.com/a/5667552). Now we use glob.glob, which also does not have an order guarantee (http://stackoverflow.com/a/6774404). I will write a change to sort the files before loading them, which should hopefully make at least that part deterministic. I suspect it may have mysterious add-on effects that could fix the other diffs. Let's try it and see.

Let me know when you are done and I'll give it a try.

comment:22 Changed 5 years ago by jsha

I've got a branch at https://github.com/EFForg/https-everywhere/pull/534/commits with the sorting change.

When I build locally on the 3.5.1 tag with the VACUUM change you specified and making sure to remove any existing rulesets.sqlite, I do get a different size than you do:

$ ls -l pkg/https-everywhere-3.5.1~pre.xpi ; sha256sum pkg/https-everywhere-3.5.1~pre.xpi
-rw-rw-r-- 1 jsha jsha 1472300 Sep 4 10:11 pkg/https-everywhere-3.5.1~pre.xpi
a1c892700452e4d710323648cec4739bb01475199e4c320ddf7f7d6962f0b2e6 pkg/https-everywhere-3.5.1~pre.xpi

It seems unlikely the size difference is caused by the ordering distance, but it's one variable to rule out. Let me know what you find with my branch merged.

comment:23 Changed 5 years ago by jsha

I was able to reproduce the non-determinism between my own 32-bit machine and a 64-bit remote machine (haven't confirmed 32 vs 64 is important. I suspect it's not). When I used the copy of make-sqlite from the branch I linked, the non-determinism goes away. The branch is committed and I suspect this is resolved on latest master. I'll wait for gk's confirmation that it is fixed, and merge the fix into the 4.0 release series branch.

comment:24 Changed 5 years ago by gk

Finally I got around testing the fix. I used 4.0.1 as it apparently contains it already. One of my build machines broke with it:

Generating sqlite DB
Traceback (most recent call last):
  File "./utils/make-sqlite.py", line 19, in <module>
    locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
  File "/home/ubuntu/install/python/lib/python2.7/locale.py", line 547, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Using C instead of en_US.UTF8 works fine, though, and makes building the rules.sqlite reproducible again. Could you modify your patch accordingly and get that into master and a new development and stable release with a signed tag respectively (4.0.2 is, alas, not signed)? Thanks for the work here.
I am leaving this ticket open until the fix lands.

comment:25 Changed 5 years ago by jsha

We've got a patch in master that switches to en_US.UTF-8 (note the dash). Does that change also work on your build machine?

We'll do another signed stable release soon. Trying to get a development release out the door that works in e10s Firefox first.

comment:26 in reply to:  25 Changed 5 years ago by gk

Replying to jsha:

We've got a patch in master that switches to en_US.UTF-8 (note the dash). Does that change also work on your build machine?

Nope. locale -a just gives me C and POSIX back.

comment:27 Changed 5 years ago by jsha

Is it practical to install the UTF-8 locale on your build box? I could try changing to the C locale but we do have some non-ASCII characters in ruleset names, so I would need to double check that the encoding comes through unscathed.

comment:28 in reply to:  27 Changed 5 years ago by gk

Replying to jsha:

Is it practical to install the UTF-8 locale on your build box? I could try changing to the C locale but we do have some non-ASCII characters in ruleset names, so I would need to double check that the encoding comes through unscathed.

I could but that would not solve our problem as the build should be reproducible by anyone (and thus on any box where the deterministic build setup is running). Not sure when the non-ASCII characters showed up but we had no issue with that before HTTPS-Everywhere 3.5.1 where we built the .xpi out of a git tag with LC_ALL=C set (we need that for HTTPS-Everywhere unrelated things but for the same reasons).

comment:29 Changed 5 years ago by jsha

My idea was that we could specify UTF8 locale as one of the prerequisites of building, e.g. in install-dev-dependencies.sh.

That said, I seem to remember I added the locale statement to ensure we get consistent sorting, so it's probably safe to change to LC_ALL=C (i.e. probably doesn't affect encoding of output).

comment:30 Changed 5 years ago by jsha

Okay, I've got a branch up. All looks dandy: https://github.com/EFForg/https-everywhere/pull/784

comment:31 Changed 5 years ago by jsha

Resolution: fixed
Status: reopenedclosed

This is fixed. Also as an FYI it turns out the sqlite version has to match in order to be reproducible. Currently I echo the sqlite version from our test script. Open to suggestions of where it would be useful to embed the sqlite version in the built .xpi.

Note: See TracTickets for help on using tickets.