Changes between Version 7 and Version 8 of Ticket #15844, comment 8

May 10, 2015, 11:09:13 PM


  • Ticket #15844, comment 8

 * spawn n helpers in a pool, dynamically sized based on load and available resources
 * helpers are distributed across requests to hide the latency of big requests
 * receive a request on the front-end
 * associate a request with n helpers
 * each helper is associated with a request and a range of data
 * helpers are scheduled to prevent overloading the server
 * for each helper, apply a function to its range of data
 * a helper returns its results and then re-joins the pool
 * repeat for the n helpers associated with a request until all results are gathered
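The flow above can be sketched with a thread pool. The class and method names, the fixed pool size, and the summing "function" are illustrative assumptions; a real pool would be sized dynamically based on load and available resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HelperPool {
    // Split a request's data across n helpers, apply a function to each
    // range, and gather the results. Illustrative sketch, not Onionoo code.
    static List<Integer> handleRequest(int[] data, int nHelpers) {
        ExecutorService pool = Executors.newFixedThreadPool(nHelpers);
        try {
            int chunk = (data.length + nHelpers - 1) / nHelpers;
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                // each helper is associated with this request and a range of data
                pending.add(pool.submit(() -> {
                    int sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];          // apply function on range of data
                    }
                    return sum;
                }));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) {
                results.add(f.get());            // gather results from each helper
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();                     // helpers re-join / pool winds down
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(handleRequest(data, 4)); // prints [3, 7, 11, 15]
    }
}
```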
c) A thought about introducing descriptive notation in the search pattern itself: a marker of sorts to encode the search type. In the absence of any marker, perform the most general search. `@foo@`, or something like it, could transform into a contact-specific search. This would avoid changes to parametrization and keep backwards compatibility. There wouldn't be much difference in processing a request; either way you have to parse the parameters. The thinking here is that the current parameterization will become more painful as Onionoo is extended.
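A minimal sketch of the marker idea, assuming `@foo@` selects a contact-specific search and an unmarked pattern falls back to the most general search (marker syntax and category names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchMarker {
    private static final Pattern CONTACT = Pattern.compile("^@(.+)@$");

    // Classify a search pattern by its (hypothetical) marker. No marker
    // means the most general search, keeping backwards compatibility.
    static String classify(String pattern) {
        Matcher m = CONTACT.matcher(pattern.trim());
        if (m.matches()) {
            return "contact:" + m.group(1);   // @foo@ -> contact-specific search
        }
        return "general:" + pattern.trim();   // unmarked -> most general search
    }

    public static void main(String[] args) {
        System.out.println(classify("@foo@"));  // prints contact:foo
        System.out.println(classify("moria1")); // prints general:moria1
    }
}
```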
 * A database makes efficient storage and management of a large data set possible.
 * It will enable writes and reads to overlap: importing new data while serving existing data becomes possible.
 * Built-in replication means creating copies of the current state is less of a hassle.
 * It enables a highly extensible protocol, effectively turning Onionoo into a sort of ORM for Tor data.
 * Onionoo can prepare the response while waiting on the data.

The Java deployment is a pain to scale; PostgreSQL is designed to scale independently from the Java-powered engine behind the scenes. It's a compromise.
 * Slow client connections (slowloris), big requests.
 * Long-running requests tie up resources and reduce availability for other requests.
 * If the record that matches a search is at the end of a returned data set, there might be a lot of data in between. A mitigation, in the case of having the match at the beginning, is to cancel the rest of the scan.
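One way to keep a long-running search from tying up a helper is to give it a time budget and cancel it when the budget is exceeded. This is a generic sketch with hypothetical names; with PostgreSQL the same effect could come from `Statement.setQueryTimeout` or a server-side `statement_timeout`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedQuery {
    // Run a potentially long search under a time budget; cancel it on
    // timeout instead of letting it reduce availability for other requests.
    static String searchWithBudget(Callable<String> search, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(search);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);            // budget exceeded: cancel the scan
            return "cancelled";
        } catch (Exception e) {
            return "error";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(searchWithBudget(() -> "match", 1000)); // prints match
        System.out.println(searchWithBudget(() -> {
            Thread.sleep(5000);        // simulated scan to the end of the data set
            return "match";
        }, 100));                      // prints cancelled
    }
}
```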
The reality, in an extensible Onionoo protocol, is that some queries will take longer. It's impossible to make every database query conform. What impact will this have on the current Onionoo deployment? Whatever concerns this may raise need to be addressed '''before''' the database transition.
 * Improvements to substring search
 * Decouple response creation from the data
 * Indexing parts of a field
 * Increasing relevance of results
 * Worst-case analysis of the current deployment

And in addition to the above:
 * Limiting the searchable period of time per request. You might not want a single request to search the past several years, at least not all at once. On the other hand, several searches, each covering n months, might be more acceptable.
 * Complex queries involving multiple joins, conditions, and predicates to simulate extending the protocol. For example, a search involving a range of first-seen or last-seen.
 * HSTORE-like fields, for example adding whois data. It can be added to the database as it is retrieved: it doesn't need to be retrieved immediately, doesn't need any sort of search, and can be compressed. You know it's going to happen eventually, because this kind of data can change over time.
 * Bin the data, like !CollecTor (not exactly). This implicitly restricts the size of the data to recent and longer intervals.
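The period-limiting idea could look like a query bounded to the last n months. Table and column names (`status_entry`, `valid_after`, `fingerprint`) are assumptions, not the actual schema; this only sketches how the restriction would be expressed.

```java
public class BoundedPeriodQuery {
    // Build a query restricted to the last n months, so a single request
    // cannot search the past several years at once. Illustrative names only.
    static String lastMonthsQuery(int months) {
        return "SELECT fingerprint FROM status_entry "
             + "WHERE valid_after >= now() - interval '" + months + " months'";
    }

    public static void main(String[] args) {
        // A client wanting several years would issue several such searches,
        // each covering n months.
        System.out.println(lastMonthsQuery(3));
    }
}
```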
What other tests will help with the transition?
 * An id is included as a primary key for generality. It's there to make relationships clear between tables that may be added later (for other documents). It's much easier to index a long integer than, say, an entire fingerprint (or its base64 form). At least that's the premise. That being said, the router fingerprint will end up indexed by the query optimizer anyway.
 * (v7) This form doesn't consider the possibility of storing changes to the flags for a given router (determined by a fingerprint) over time; it only stores the current flags for a router fingerprint. Maybe a change such as having a foreign key in Router point to an id of Router_flag, which encodes the current flags together in a single record: a record for each instance of current flags for a router, updating the foreign key in Router on changes.
 * (v8) This form doesn't consider the possibility of storing changes to the flags for a given router (determined by a fingerprint) over time; it only stores the current flags for a router fingerprint. Maybe a change such as having an HSTORE in Router map an id (of Router_flag) to a hash of the flags. Router_flag would have the id and encoded flags in a single record. Upon detecting changes in flags, create a new record in Router_flag and update the HSTORE in Router. A date field in Router_flag might also be used to keep track of a last-seen.
 * A realistic schema would mean the router family needs to be filled in during a second pass.
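The HSTORE variant described above might be sketched as DDL, held here as a Java constant to stay in the deployment's language. All table and column names are illustrative assumptions, not the actual Onionoo schema.

```java
public class SchemaSketch {
    // Hypothetical DDL: Router keeps an HSTORE mapping a Router_flag id to a
    // hash of the flags; Router_flag stores one record per distinct flag
    // combination, with a date to keep track of a last-seen.
    static final String DDL =
          "CREATE TABLE router_flag (\n"
        + "  id BIGSERIAL PRIMARY KEY,        -- cheaper to index than a fingerprint\n"
        + "  flags TEXT NOT NULL,             -- encoded flag combination\n"
        + "  last_seen TIMESTAMP\n"
        + ");\n"
        + "CREATE TABLE router (\n"
        + "  id BIGSERIAL PRIMARY KEY,\n"
        + "  fingerprint CHAR(40) UNIQUE NOT NULL,\n"
        + "  flag_history HSTORE              -- router_flag id -> hash of flags\n"
        + ");\n";

    public static void main(String[] args) {
        System.out.println(DDL);
    }
}
```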
So how does this look for the search query? I'll run some `EXPLAIN (ANALYZE, BUFFERS)` to add to the discussion after filling in the tables with some data. While I'm at it I'll also start some indexing and substring tests. I'm not sure how to set the bar for performance. What should I look for when performing a query with `servers` (router), `addresses` (ip), and `fingerprints`? As fast as possible comes to mind :D