Opened 3 years ago

Last modified 10 months ago

#17939 assigned enhancement

Optimize the construction of details documents with field constraints

Reported by: fmap
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Minor
Keywords:
Cc: karsten, virgil
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In a recent post to metrics-team@, Karsten pointed toward an expensive operation within the response builder:

Once per hour, the updater fetches new data and in the end produces JSON-formatted strings that it writes to disk. The servlet reads a (comparatively) small index to memory that it uses to handle requests, and when it builds responses, it tries hard to avoid (de-)serializing JSON.

The only situation where this fails is when [a] request [to the /details endpoint] contains the fields parameter. Only in that case we'll have to deserialize, pick the fields we want, and serialize again. I could imagine that this shows up in profiles pretty badly, and I'd love to fix this, I just don't know how.

I think we can exploit a few properties of the updater to handle this case in a more efficient manner.

It seems safe to assume: (1) that the produced response is always a concatenation of substrings of the written document [1]; (2) that the documents on disk are legal JSON and correctly typed (having been written by the updater, which we trust and control); and (3) that the contents of the file are trivially parsed (belonging to a restriction of JSON with known and non-redundant keys, the grammar is at most context-free).

I believe these conditions admit a relatively efficient parser/generator pair, one that avoids request-time de-serialisation. Given a request, the result of the parser would be a sequence of pairs of indices marking the boundaries of each field. The generator would reproduce the input, omitting the text regions corresponding to fields excluded by the request.

No patch yet, but I've hacked together a small (inefficient mess of a...) proof of concept that hopefully illustrates the basic idea:

http://hack.rs/~vi/onionoo/IndexJSON.hs
sha256: 14a09f26fadab8d989263dc76d368e41e63ba6c5279d37443878d6c1d0c87834
http://www.webcitation.org/6e3NEOLJg

% jq . 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE 
{
  "nickname": "Unnamed",
  "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
  "or_addresses": [
    "10.103.224.131:443"
  ],
  "last_seen": "2015-11-23 03:40:44",
  "first_seen": "2015-11-20 04:38:22",
  "running": false,
  "flags": [
    "Valid"
  ],
  "last_restarted": "2015-11-22 01:23:06",
  "advertised_bandwidth": 49168,
  "platform": "Tor 0.2.4.22 on Windows 8"
}
% index-json 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE 
("nickname",(2,21,22))
("hashed_fingerprint",(23,85,86))
("or_addresses",(87,123,124))
("last_seen",(125,157,158))
("first_seen",(159,192,193))
("running",(194,208,209))
("flags",(210,226,227))
("last_restarted",(228,265,266))
("advertised_bandwidth",(267,294,295))
("platform",(296,333,333))
% cut -c1 -c23-158 -c194- 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE  | jq .
{
  "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
  "or_addresses": [
    "10.103.224.131:443"
  ],
  "last_seen": "2015-11-23 03:40:44",
  "running": false,
  "flags": [
    "Valid"
  ],
  "last_restarted": "2015-11-22 01:23:06",
  "advertised_bandwidth": 49168,
  "platform": "Tor 0.2.4.22 on Windows 8"
}
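
In Java terms, the generator half might look roughly like this (a minimal sketch; the class, method names, and index format are made up for illustration, and offsets are half-open):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FieldSlicer {
  /* Rebuilds a details document from a precomputed boundary index,
     keeping only the requested fields. Assumes each [start, end)
     range covers one "name":value region without its trailing
     separator. Illustrative only. */
  static String slice(String document,
      LinkedHashMap<String, int[]> boundaries, Set<String> requested) {
    StringBuilder sb = new StringBuilder("{");
    String separator = "";
    for (Map.Entry<String, int[]> field : boundaries.entrySet()) {
      if (requested.contains(field.getKey())) {
        sb.append(separator)
          .append(document, field.getValue()[0], field.getValue()[1]);
        separator = ",";
      }
    }
    return sb.append('}').toString();
  }
}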

What do you think?


[1] There's an element of surprise in the treatment of nullable properties, but it turns out that the existing behaviour works in our favour. Gson omits null fields when writing documents to disk; e.g. note the absence of an AS number here:

% pwd
/srv/onionoo.torproject.org/onionoo/out/details
% jq . $(ls | shuf -n1)
{
  "nickname": "Unnamed",
  "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE",
  "or_addresses": [
    "10.190.9.13:443"
  ],
  "last_seen": "2015-12-16 22:41:56",
  "first_seen": "2015-11-11 21:01:43",
  "running": true,
  "flags": [
    "Fast",
    "Valid"
  ],
  "last_restarted": "2015-12-16 02:13:40",
  "advertised_bandwidth": 59392,
  "platform": "Tor 0.2.4.23 on Windows 8"
}


But it *also* excludes them from /details responses, even when specified by name using the 'fields' parameter:

% curl -s 'http://onionoo.local/details?lookup=CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE&fields=hashed_fingerprint,as_number' | jq .bridges[]
{
  "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE"
}


So it doesn't seem necessary to add any text atop the persisted serialisation, even in this case.

Child Tickets

Change History (9)

comment:1 Changed 3 years ago by virgil

More efficient response construction would improve the load times for Roster.  Ergo I am in favor.

comment:2 Changed 3 years ago by fmap

(I should note that these indices could reasonably be held in memory with the node index, further unburdening the response builder. Also, when I said "avoids request-time de-serialisation", I meant "avoids request-time serialisation", though moving the parsing away from request time will help both.)

comment:3 Changed 3 years ago by karsten

Thanks for starting this discussion. I'm in favor of taking Gson out of the loop for two reasons: it's a potential performance bottleneck (though I never measured that), and it's a maintenance nightmare because it's just too easy to miss a new details document field in that hacked part of the code.

Regarding the approach, I'd favor one that doesn't require keeping anything new in memory but instead processes details document contents on the fly. We'll have to read a details document from disk if we want to include part of it in a response anyway, and once it's in memory it's cheap to create an index of where fields start and end and pick only the ones we want. It could just be that Gson adds some overhead that we could avoid here. And of course the current approach has the downside of being hard to maintain, which we could fix. Maybe we can try out different approaches and compare them with respect to performance and robustness?

Bonus points: we could use this new approach to allow the fields parameter for documents other than details documents.

comment:4 in reply to:  3 Changed 3 years ago by fmap

I'm in favor of taking Gson out of the loop for two reasons: it's a potential performance bottleneck (though I never measured that), and it's a maintenance nightmare because it's just too easy to miss a new details document field in that hacked part of the code.

Regarding a performance bottleneck: an eyeball of the flame graph we've been discussing on the list suggests 'formatNodeStatus' spends on average ten times more time producing 'details' than any other document type. It looks like about five percent of total CPU time over the sample, but there are a few too many divorced frames to be sure (and I've lost the raw data somewhere). I'll make another recording later and report back with more precise details.

Regarding the approach, I'd favor one that doesn't require keeping anything new in memory but instead process details document contents on the fly. We'll have to read a details document from disk if we want to include part of it in a response anyway, and once it's in memory it's cheap to create such an index where fields start and end and only pick the ones we want.

That sounds reasonable.

It could just be that Gson adds some overhead that we could avoid here. And of course the current approach has the downside of being hard to maintain, which we could fix. Maybe we can try out different approaches and compare them with respect to performance and robustness?

Do you mean avoiding Gson in producing a boundary index? I think there's more to it than the performance overhead of a redundant parse. In populating its result, the parser I referenced is sensitive to structure that JSON parsers typically aren't: the length of what the JSON spec calls 'structural characters' (/[:,\[\]{}]/), as well as that of the (variable-length) whitespace allowed to surround them. I don't see anything in the Gson user guide that would admit intelligent interpretation of those tokens, and they're critical (in the general case at least) to the precise determination of boundaries. That said, given that written documents don't presently include whitespace around structural tokens, it should be possible (assuming Gson can retain the initial field ordering) to derive the right coordinates from a serialisation into a JSON ADT. But that approach strikes me as frail and indirect.
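
(A trivial illustration of that sensitivity, with inclusive character offsets counted by hand: pretty-printing the same document shifts every boundary.

{"nickname":"Unnamed"}          the field text occupies characters 1-20
{ "nickname" : "Unnamed" }      the same field occupies characters 2-23

A byte-offset index is only valid against the exact serialisation it was computed from.)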

Though I worry I might've misread your message. Do you have other approaches in mind to produce a boundary index? Or perhaps you meant only to benchmark the proposed implementation against the existing one?

Bonus points: we could use this new approach to allow the fields parameter for other documents than details documents.

Sounds good.

comment:5 Changed 3 years ago by karsten

I'm afraid I don't fully understand your last comment. My earlier suggestion was to read the entire JSON string to memory and only keep the parts with matching field names, but without using Gson to deserialize the string, copying the fields we care about into a new object, and serializing that object using Gson again. My hope was that we could write a simple text-based parser that only returns indexes where fields start and end in the string, which we could then use to feed the parts we care about into a StringBuilder, basically skipping the serializing step. We might be able to build this using some lower-level Gson stuff or maybe even entirely without Gson. I'm not sure if you're saying above that such an approach would be too fragile, but I could see that being the case. Worth investigating, maybe.
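
For illustration, such a scanner might look roughly like this (a minimal sketch, assuming compact single-line documents exactly as the updater currently writes them; class and method names are made up, and this is not hardened against arbitrary JSON):

import java.util.LinkedHashMap;
import java.util.Map;

public class FieldIndexer {
  /* Maps each top-level field name to the [start, end) offsets of its
     "name":value region, excluding the trailing separator. Assumes a
     compact, valid JSON object with no whitespace around structural
     characters. */
  static Map<String, int[]> index(String json) {
    Map<String, int[]> offsets = new LinkedHashMap<String, int[]>();
    int depth = 0, fieldStart = -1;
    boolean inString = false;
    String fieldName = null;
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') i++;                       /* skip escaped character */
        else if (c == '"') inString = false;
        continue;
      }
      if (c == '"') {
        inString = true;
      } else if (c == '{' || c == '[') {
        depth++;
      } else if (c == '}' || c == ']') {
        if (--depth == 0 && fieldName != null) {  /* last field ends here */
          offsets.put(fieldName, new int[] { fieldStart, i });
        }
      } else if (c == ',' && depth == 1) {        /* a top-level field ends */
        offsets.put(fieldName, new int[] { fieldStart, i });
        fieldName = null;
      } else if (c == ':' && depth == 1 && fieldName == null) {
        int nameEnd = i - 1;                      /* closing quote of name */
        int nameStart = json.lastIndexOf('"', nameEnd - 1);
        fieldName = json.substring(nameStart + 1, nameEnd);
        fieldStart = nameStart;
      }
    }
    return offsets;
  }
}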

But here's another approach that still uses Gson, but possibly in a more efficient (and less hacky) way:

import java.util.HashSet;
import java.util.Set;

import com.google.gson.ExclusionStrategy;
import com.google.gson.FieldAttributes;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Deserializer {
  static class Inner {
    String innerField;
  }
  static class Outer {
    String outerField;
    Inner innerObject;
  }
  /* Skips all fields of the given class except those named in
     includedFieldNames; fields of other classes are untouched. */
  static class FieldsExclusionStrategy implements ExclusionStrategy {
    Class<?> declaringClass;
    Set<String> includedFieldNames;
    public FieldsExclusionStrategy(Class<?> clazz,
        Set<String> includedFieldNames) {
      this.declaringClass = clazz;
      this.includedFieldNames = includedFieldNames;
    }
    @Override
    public boolean shouldSkipField(FieldAttributes arg0) {
      return this.declaringClass.equals(arg0.getDeclaringClass()) &&
          !includedFieldNames.contains(arg0.getName());
    }
    @Override
    public boolean shouldSkipClass(Class<?> arg0) {
      return false;
    }
  }
  public static void main(String[] args) {
    Outer t = new Outer();
    t.outerField = "some text";
    t.innerObject = new Inner();
    t.innerObject.innerField = "some inner text";
    Gson gson = new GsonBuilder().create();
    String json = gson.toJson(t);
    System.out.println(json);

    // Round-trip through Gson, keeping only the requested fields.
    Set<String> keepFieldNames = new HashSet<String>();
    keepFieldNames.add("innerObject");
    gson = new GsonBuilder().setExclusionStrategies(
        new FieldsExclusionStrategy(Outer.class,
            keepFieldNames)).create();
    System.out.println(gson.toJson(gson.fromJson(json, Outer.class)));
  }
}
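
For reference, with default Gson behavior this should print something like:

{"outerField":"some text","innerObject":{"innerField":"some inner text"}}
{"innerObject":{"innerField":"some inner text"}}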

I could imagine that this approach is more efficient, because Gson can skip lots of fields and because we're not creating a separate object and copying over fields. And it's less ugly code, because we don't have to add new code for each field we're adding. Oh, and it works for all document types. But it's still using Gson for deserializing and serializing. If this looks promising, let's benchmark it.
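
A crude first measurement could be as simple as the following, appended to main() above (iteration count is arbitrary; a proper benchmark would use a harness like JMH to avoid warm-up and dead-code artifacts):

    // Times only the exclusion-strategy round trip; for a real comparison,
    // run the current copy-fields code in an identical loop.
    int iterations = 100000;
    String filtered = null;
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      filtered = gson.toJson(gson.fromJson(json, Outer.class));
    }
    long elapsed = System.nanoTime() - start;
    System.out.println(filtered);  // keep the result live
    System.out.printf("%.1f us/op%n", elapsed / iterations / 1000.0);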

Thanks!

comment:6 Changed 2 years ago by virgil

This ticket might be dead. But the proposal was to remove Gson entirely from the serializing. You have:

(1) the full document

(2) a pre-computed list of the start and end-points of each type of entry.

Then when you receive a request, you look up the start and end points of the fields we want in the result, pull out all data between those offsets, and send that to the user. At request time, no Gson is involved.

comment:7 Changed 2 years ago by karsten

Not dead, just dormant.

How would you precompute that list for (2)?

And how would you store it in memory, given that there are 13k relays and bridges in Onionoo's current index and that this approach should still work with reasonable memory requirements when there are twice as many relays and bridges?
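
(For a rough sense of scale, with assumed numbers: at 30 indexable fields per document and two 4-byte offsets per field, that is about 240 bytes of raw index data per document, or roughly 6 MB for 26k documents, though a naive Java Map-of-arrays representation could easily multiply that several-fold.)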

If it's infeasible to store that list in memory, is there a way to compute this list on the fly that is faster than one deserialization and one serialization with Gson?

comment:8 Changed 10 months ago by karsten

Summary: Optimise the construction of details documents with field constraints → Optimize the construction of details documents with field constraints

Tweak summary.

comment:9 Changed 10 months ago by karsten

Owner: set to metrics-team
Status: new → assigned