Opened 3 years ago

Last modified 10 months ago

#17939 assigned enhancement

Optimize the construction of details documents with field constraints

Reported by: fmap
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Minor
Keywords:
Cc: karsten, virgil
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In a recent post to metrics-team@, Karsten pointed toward an expensive operation within the response builder:

Once per hour, the updater fetches new data and in the end produces JSON-formatted strings that it writes to disk. The servlet reads a (comparatively) small index to memory that it uses to handle requests, and when it builds responses, it tries hard to avoid (de-)serializing JSON.

The only situation where this fails is when [a] request [to the /details endpoint] contains the fields parameter. Only in that case we'll have to deserialize, pick the fields we want, and serialize again. I could imagine that this shows up in profiles pretty badly, and I'd love to fix this, I just don't know how.

I think we can exploit a few properties of the updater to handle this case in a more efficient manner.

It seems safe to assume: (1) that the produced response is always a concatenation of substrings of the written document [1]; (2) that the documents on disk are legal JSON and correctly typed (having been written by the updater, which we trust and control); and (3) that the contents of the file are trivially parsed (belonging to a restriction of JSON with known and non-redundant keys, the grammar is at most context-free).

I believe these conditions admit a relatively efficient parser/generator pair, one that avoids request-time de-serialisation. Given a request, the result of the parser would be a sequence of pairs of indices marking the boundaries of each field. The generator would reproduce the input, omitting the text regions corresponding to fields excluded by the request.

No patch yet, but I've hacked together a small (inefficient mess of a...) proof of concept that hopefully illustrates the basic idea:

http://hack.rs/~vi/onionoo/IndexJSON.hs
sha256: 14a09f26fadab8d989263dc76d368e41e63ba6c5279d37443878d6c1d0c87834
http://www.webcitation.org/6e3NEOLJg

% jq . 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE 
{
  "nickname": "Unnamed",
  "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
  "or_addresses": [
    "10.103.224.131:443"
  ],
  "last_seen": "2015-11-23 03:40:44",
  "first_seen": "2015-11-20 04:38:22",
  "running": false,
  "flags": [
    "Valid"
  ],
  "last_restarted": "2015-11-22 01:23:06",
  "advertised_bandwidth": 49168,
  "platform": "Tor 0.2.4.22 on Windows 8"
}
% index-json 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE 
("nickname",(2,21,22))
("hashed_fingerprint",(23,85,86))
("or_addresses",(87,123,124))
("last_seen",(125,157,158))
("first_seen",(159,192,193))
("running",(194,208,209))
("flags",(210,226,227))
("last_restarted",(228,265,266))
("advertised_bandwidth",(267,294,295))
("platform",(296,333,333))
% cut -c1 -c23-158 -c194- 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE  | jq .
{
  "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
  "or_addresses": [
    "10.103.224.131:443"
  ],
  "last_seen": "2015-11-23 03:40:44",
  "running": false,
  "flags": [
    "Valid"
  ],
  "last_restarted": "2015-11-22 01:23:06",
  "advertised_bandwidth": 49168,
  "platform": "Tor 0.2.4.22 on Windows 8"
}
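
In Java terms, the generator half might look roughly like this (a minimal sketch; the class, method names, and index format are made up for illustration, and offsets are half-open):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FieldSlicer {
  /* Rebuilds a details document from a precomputed boundary index,
     keeping only the requested fields. Assumes each [start, end)
     range covers one "name":value region without its trailing
     separator. Illustrative only. */
  static String slice(String document,
      LinkedHashMap<String, int[]> boundaries, Set<String> requested) {
    StringBuilder sb = new StringBuilder("{");
    String separator = "";
    for (Map.Entry<String, int[]> field : boundaries.entrySet()) {
      if (requested.contains(field.getKey())) {
        sb.append(separator)
          .append(document, field.getValue()[0], field.getValue()[1]);
        separator = ",";
      }
    }
    return sb.append('}').toString();
  }
}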

What do you think?


[1] There's an element of surprise in the treatment of nullable properties, but it turns out that the existing behaviour works in our favour. Gson omits null fields when writing documents to disk; e.g. note the absence of an AS number here:

% pwd
/srv/onionoo.torproject.org/onionoo/out/details
% jq . $(ls | shuf -n1)
{
  "nickname": "Unnamed",
  "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE",
  "or_addresses": [
    "10.190.9.13:443"
  ],
  "last_seen": "2015-12-16 22:41:56",
  "first_seen": "2015-11-11 21:01:43",
  "running": true,
  "flags": [
    "Fast",
    "Valid"
  ],
  "last_restarted": "2015-12-16 02:13:40",
  "advertised_bandwidth": 59392,
  "platform": "Tor 0.2.4.23 on Windows 8"
}


But it *also* excludes them from /details responses, even when specified by name using the 'fields' parameter:

% curl -s 'http://onionoo.local/details?lookup=CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE&fields=hashed_fingerprint,as_number' | jq .bridges[]
{
  "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE"
}


So it doesn't seem necessary to add any text atop the persisted serialisation, even in this case.

Child Tickets

Change History (9)

comment:1 Changed 3 years ago by virgil

More efficient response construction would improve the load times for Roster.  Ergo I am in favor.

comment:2 Changed 3 years ago by fmap

(I should note that these indices could reasonably be held in memory with the node index, further unburdening the response builder. Also, when I said "avoids request-time de-serialisation", I meant "avoids request-time serialisation", though moving the parsing away from request time will help both.)

comment:3 Changed 3 years ago by karsten

Thanks for starting this discussion. I'm in favor of taking Gson out of the loop for two reasons: it's a potential performance bottleneck (though I never measured that), and it's a maintenance nightmare because it's just too easy to miss a new details document field in that hacked part of the code.

Regarding the approach, I'd favor one that doesn't require keeping anything new in memory but instead processes details document contents on the fly. We'll have to read a details document from disk if we want to include part of it in a response anyway, and once it's in memory it's cheap to create an index of where fields start and end and pick only the ones we want. It could just be that Gson adds some overhead that we could avoid here. And of course the current approach has the downside of being hard to maintain, which we could fix. Maybe we can try out different approaches and compare them with respect to performance and robustness?

Bonus points: we could use this new approach to allow the fields parameter for documents other than details documents.

comment:4 in reply to:  3 Changed 3 years ago by fmap

I'm in favor of taking Gson out of the loop for two reasons: it's a potential performance bottleneck (though I never measured that), and it's a maintenance nightmare because it's just too easy to miss a new details document field in that hacked part of the code.

Regarding a performance bottleneck: an eyeball of the flame graph we've been discussing on the list suggests 'formatNodeStatus' spends on average ten times more time producing 'details' than any other document type. It looks like about five percent of total CPU time over the sample, but there are a few too many divorced frames to be sure (and I've lost the raw data somewhere). I'll make another recording later and report back with more precise details.

Regarding the approach, I'd favor one that doesn't require keeping anything new in memory but instead process details document contents on the fly. We'll have to read a details document from disk if we want to include part of it in a response anyway, and once it's in memory it's cheap to create such an index where fields start and end and only pick the ones we want.

That sounds reasonable.

It could just be that Gson adds some overhead that we could avoid here. And of course the current approach has the downside of being hard to maintain, which we could fix. Maybe we can try out different approaches and compare them with respect to performance and robustness?

Do you mean avoiding Gson in producing a boundary index? I think there's more to it than the performance overhead of a redundant parse. In populating its result, the parser I referenced is sensitive to structure that JSON parsers typically aren't: the length of what the JSON spec calls 'structural characters' (/[:,\[\]{}]/), as well as that of the (variable-length) whitespace allowed to surround them. I don't see anything in the Gson user guide that would admit intelligent interpretation of those tokens, and they're critical (in the general case at least) to the precise determination of boundaries. That said, given that written documents don't presently include whitespace around structural tokens, it should be possible (assuming Gson can retain the initial field ordering) to derive the right coordinates from a serialisation into a JSON ADT. But that approach strikes me as frail and indirect.
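
(A trivial illustration of that sensitivity, with inclusive character offsets counted by hand: pretty-printing the same document shifts every boundary.

{"nickname":"Unnamed"}          the field text occupies characters 1-20
{ "nickname" : "Unnamed" }      the same field occupies characters 2-23

A byte-offset index is only valid against the exact serialisation it was computed from.)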

Though I worry I might've misread your message. Do you have other approaches in mind to produce a boundary index? Or perhaps you meant only to benchmark the proposed implementation against the existing one?

Bonus points: we could use this new approach to allow the fields parameter for other documents than details documents.

Sounds good.

comment:5 Changed 3 years ago by karsten

I'm afraid I don't fully understand your last comment. My earlier suggestion was to read the entire JSON string to memory and only keep the parts with matching field names, but without using Gson to deserialize the string, copying the fields we care about into a new object, and serializing that object using Gson again. My hope was that we could write a simple text-based parser that only returns indexes where fields start and end in the string, which we could then use to feed the parts we care about into a StringBuilder, basically skipping the serializing step. We might be able to build this using some lower-level Gson stuff or maybe even entirely without Gson. I'm not sure if you're saying above that such an approach would be too fragile, but I could see that being the case. Worth investigating, maybe.
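
For illustration, such a scanner might look roughly like this (a minimal sketch, assuming compact single-line documents exactly as the updater currently writes them; class and method names are made up, and this is not hardened against arbitrary JSON):

import java.util.LinkedHashMap;
import java.util.Map;

public class FieldIndexer {
  /* Maps each top-level field name to the [start, end) offsets of its
     "name":value region, excluding the trailing separator. Assumes a
     compact, valid JSON object with no whitespace around structural
     characters. */
  static Map<String, int[]> index(String json) {
    Map<String, int[]> offsets = new LinkedHashMap<String, int[]>();
    int depth = 0, fieldStart = -1;
    boolean inString = false;
    String fieldName = null;
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') i++;                       /* skip escaped character */
        else if (c == '"') inString = false;
        continue;
      }
      if (c == '"') {
        inString = true;
      } else if (c == '{' || c == '[') {
        depth++;
      } else if (c == '}' || c == ']') {
        if (--depth == 0 && fieldName != null) {  /* last field ends here */
          offsets.put(fieldName, new int[] { fieldStart, i });
        }
      } else if (c == ',' && depth == 1) {        /* a top-level field ends */
        offsets.put(fieldName, new int[] { fieldStart, i });
        fieldName = null;
      } else if (c == ':' && depth == 1 && fieldName == null) {
        int nameEnd = i - 1;                      /* closing quote of name */
        int nameStart = json.lastIndexOf('"', nameEnd - 1);
        fieldName = json.substring(nameStart + 1, nameEnd);
        fieldStart = nameStart;
      }
    }
    return offsets;
  }
}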

But here's another approach that still uses Gson, but possibly in a more efficient (and less hacky) way:

import java.util.HashSet;
import java.util.Set;

import com.google.gson.ExclusionStrategy;
import com.google.gson.FieldAttributes;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class Deserializer {
  static class Inner {
    String innerField;
  }
  static class Outer {
    String outerField;
    Inner innerObject;
  }
  /* Skips all fields of the given class except those named in
     includedFieldNames; fields of other classes are untouched. */
  static class FieldsExclusionStrategy implements ExclusionStrategy {
    Class<?> declaringClass;
    Set<String> includedFieldNames;
    public FieldsExclusionStrategy(Class<?> clazz,
        Set<String> includedFieldNames) {
      this.declaringClass = clazz;
      this.includedFieldNames = includedFieldNames;
    }
    @Override
    public boolean shouldSkipField(FieldAttributes arg0) {
      return this.declaringClass.equals(arg0.getDeclaringClass()) &&
          !includedFieldNames.contains(arg0.getName());
    }
    @Override
    public boolean shouldSkipClass(Class<?> arg0) {
      return false;
    }
  }
  public static void main(String[] args) {
    Outer t = new Outer();
    t.outerField = "some text";
    t.innerObject = new Inner();
    t.innerObject.innerField = "some inner text";
    Gson gson = new GsonBuilder().create();
    String json = gson.toJson(t);
    System.out.println(json);

    // Round-trip through Gson, keeping only the requested fields.
    Set<String> keepFieldNames = new HashSet<String>();
    keepFieldNames.add("innerObject");
    gson = new GsonBuilder().setExclusionStrategies(
        new FieldsExclusionStrategy(Outer.class,
            keepFieldNames)).create();
    System.out.println(gson.toJson(gson.fromJson(json, Outer.class)));
  }
}
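
For reference, with default Gson behavior this should print something like:

{"outerField":"some text","innerObject":{"innerField":"some inner text"}}
{"innerObject":{"innerField":"some inner text"}}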

I could imagine that this approach is more efficient, because Gson can skip lots of fields and because we're not creating a separate object and copying over fields. And it's less ugly code, because we don't have to add new code for each field we're adding. Oh, and it works for all document types. But it's still using Gson for deserializing and serializing. If this looks promising, let's benchmark it.
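
A crude first measurement could be as simple as the following, appended to main() above (iteration count is arbitrary; a proper benchmark would use a harness like JMH to avoid warm-up and dead-code artifacts):

    // Times only the exclusion-strategy round trip; for a real comparison,
    // run the current copy-fields code in an identical loop.
    int iterations = 100000;
    String filtered = null;
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      filtered = gson.toJson(gson.fromJson(json, Outer.class));
    }
    long elapsed = System.nanoTime() - start;
    System.out.println(filtered);  // keep the result live
    System.out.printf("%.1f us/op%n", elapsed / iterations / 1000.0);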

Thanks!

comment:6 Changed 2 years ago by virgil

This ticket might be dead. But the proposal was to remove Gson entirely from the serializing. You have:

(1) the full document

(2) a pre-computed list of the start and end-points of each type of entry.

Then when you receive a request, you look up the start and end points of the fields we want in the result, pull out all data between those offsets, and send that to the user. At request time, no Gson is involved.

comment:7 Changed 2 years ago by karsten

Not dead, just dormant.

How would you precompute that list for (2)?

And how would you store it in memory, given that there are 13k relays and bridges in Onionoo's current index and that this approach should still work with reasonable memory requirements when there are twice as many relays and bridges?
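
(For a rough sense of scale, with assumed numbers: at 30 indexable fields per document and two 4-byte offsets per field, that is about 240 bytes of raw index data per document, or roughly 6 MB for 26k documents, though a naive Java Map-of-arrays representation could easily multiply that several-fold.)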

If it's infeasible to store that list in memory, is there a way to compute this list on the fly that is faster than one deserialization and one serialization with Gson?

comment:8 Changed 10 months ago by karsten

Summary: Optimise the construction of details documents with field constraints → Optimize the construction of details documents with field constraints

Tweak summary.

comment:9 Changed 10 months ago by karsten

Owner: set to metrics-team
Status: new → assigned