Changes between Version 7 and Version 8 of Ticket #15844, comment 8


Timestamp: May 10, 2015, 11:09:13 PM
Author: leeroy
[…]

 * spawn n helpers in a pool, dynamically sized based on load and the resources available

 * helpers are distributed across requests to hide the latency of big requests

 * receive a request on the front-end

 * associate a request with n helpers

 * each helper is associated with a request and a range of data

 * helpers are scheduled to prevent overloading the server

 * for each helper, apply the function to its range of data

 * a helper returns its results, then re-joins the pool

 * repeat for the n helpers associated with a request until all results are gathered
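To make the flow above concrete, here's a minimal sketch using a fixed Java thread pool. None of this is existing Onionoo code; `searchRange`, the pool size, and the result type are all placeholder assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the helper pool described above. */
public class HelperPool {

  // One pool shared across requests; a real version would size it
  // dynamically based on load and available resources.
  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  // Placeholder for "apply function on range of data".
  private List<String> searchRange(String query, int start, int end) {
    return new ArrayList<>();
  }

  /** Associate a request with n helpers, one per data range, gather results. */
  public List<String> handleRequest(String query, int dataSize, int n)
      throws InterruptedException, ExecutionException {
    int chunk = (dataSize + n - 1) / n;
    List<Future<List<String>>> partials = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      final int start = i * chunk;
      final int end = Math.min(dataSize, start + chunk);
      // Each helper works its own range; when done it re-joins the pool.
      partials.add(pool.submit(() -> searchRange(query, start, end)));
    }
    List<String> results = new ArrayList<>();
    for (Future<List<String>> f : partials) {
      results.addAll(f.get()); // repeat until all results are gathered
    }
    return results;
  }
}
```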
1c) A thought about introducing descriptive notation in the search pattern itself. A marker of sorts to encode the search type. In the absence of any marker, perform the most general search. `@foo@`, or something like it, could transform into a contact-specific search. This would avoid changes to parametrization and keep backwards compatibility. There wouldn't be much difference in processing a request; either way you have to parse the parameters. The thinking here is that the current parameterization will become more painful as Onionoo is extended.
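As a rough illustration only (no such notation exists in Onionoo today; the class name and the "contact" mapping are assumptions), the marker could be stripped before dispatching the search:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical marker parsing: "@foo@" selects a contact-specific search. */
final class SearchMarker {
  private static final Pattern CONTACT = Pattern.compile("^@(.+)@$");

  /** Returns {searchType, term}; without a marker, the most general search. */
  static String[] classify(String input) {
    Matcher m = CONTACT.matcher(input);
    if (m.matches()) {
      return new String[] { "contact", m.group(1) }; // @foo@ -> contact search
    }
    return new String[] { "general", input }; // no marker: most general search
  }
}
```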
     
[…]

 * A database makes efficient storage and management of a large data set possible.

 * It will enable writes and reads to overlap: importing new data while serving existing data becomes possible.

 * Built-in replication means creating copies of the current state is less of a hassle.

 * Enables a highly extensible protocol, effectively turning Onionoo into a sort of ORM for Tor data.

 * Onionoo can prepare the response while waiting on the data.

The Java deployment is a pain to scale; PostgreSQL is designed to scale independently of the Java-powered engine behind the scenes. It's a compromise.
     
[…]

 * Slow client connections (slowloris), big requests.

 * Long-duration requests tie up resources and reduce availability for other requests (a per-statement timeout is one mitigation; see the sketch below).

 * If the record that matches a search is at the end of a returned data set, there might be a lot of data in between. A mitigation, in the case of having the match at the beginning, is to cancel the query early.

The reality, in an extensible Onionoo protocol, is that some queries will take longer. It's impossible to make every database query conform. What impact will this have on the current Onionoo deployment? Whatever concerns this may raise need to be addressed '''before''' the database transition.
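One way to bound long-duration requests, assuming PostgreSQL behind JDBC (the DSN, table, and column names here are placeholders, not the actual schema): set a per-statement timeout so a slow query fails fast instead of tying up a worker. PostgreSQL's server-side `statement_timeout` setting is the equivalent knob on the database end.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLTimeoutException;

public class BoundedQuery {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/onionoo"); // placeholder DSN
         PreparedStatement ps = conn.prepareStatement(
             "SELECT fingerprint FROM router WHERE nickname LIKE ?")) {
      ps.setQueryTimeout(5); // driver cancels the statement after ~5 seconds
      ps.setString(1, "moria%");
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      } catch (SQLTimeoutException e) {
        // Long-duration request: report and release resources early.
        System.err.println("query cancelled: " + e.getMessage());
      }
    }
  }
}
```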
     
[…]

 * Improvements to substring search (an index sketch for this and the partial-field case follows this list)

 * Decouple response creation from the data

 * Indexing parts of a field

 * Increasing relevance of results

 * Worst-case analysis of the currently deployed code
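For the substring and partial-field cases, two candidate indexes worth measuring, assuming the pg_trgm extension is available; the table and column names are the hypothetical ones used throughout this discussion:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical index experiments for substring and partial-field search. */
public class IndexExperiments {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/onionoo"); // placeholder DSN
         Statement st = conn.createStatement()) {
      // Substring search: a trigram index lets LIKE '%term%' use the index.
      st.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm");
      st.execute("CREATE INDEX router_nickname_trgm ON router"
          + " USING gin (nickname gin_trgm_ops)");
      // Indexing part of a field: an expression index on a fingerprint prefix.
      st.execute("CREATE INDEX router_fp_prefix ON router"
          + " (substr(fingerprint, 1, 8))");
    }
  }
}
```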
And in addition to the above:
     
 * Limiting the searchable period of time per request. You might not want a single request to search the past several years, at least not all at once. On the other hand, several searches, each covering n months, might be more acceptable. (Sketched after this list.)

 * Complex queries involving multiple joins, conditions, and predicates to simulate extending the protocol. For example, a search involving a range on first-seen or last-seen.

 * HSTORE-like fields, for example adding whois data. It can be added to the database as it is retrieved; it doesn't need to be retrieved immediately, doesn't need any sort of search, and can be compressed. You know it's going to happen eventually, because this kind of data can change over time. (Also in the sketch below.)

 * Bin the data, like CollecTor (not exactly). This implicitly restricts the size of the data to recent and longer-interval bins.
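Two of these tests sketched together, again with placeholder names, and under the assumptions that there is a `last_seen` timestamp column and that whois data lands in an HSTORE column:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical tests: a time-limited search and an HSTORE whois field. */
public class TransitionTests {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/onionoo")) { // placeholder DSN

      // Limit the searchable period: one n-month window per request.
      try (PreparedStatement ps = conn.prepareStatement(
           "SELECT fingerprint FROM router"
           + " WHERE last_seen >= now() - interval '3 months'"
           + " AND nickname LIKE ?")) {
        ps.setString(1, "moria%");
        ps.executeQuery().close();
      }

      // Whois data as an HSTORE field: added as retrieved, never searched.
      try (Statement st = conn.createStatement()) {
        st.execute("CREATE EXTENSION IF NOT EXISTS hstore");
        st.execute("ALTER TABLE router ADD COLUMN whois hstore");
      }
      try (PreparedStatement ps = conn.prepareStatement(
           "UPDATE router SET whois = coalesce(whois, '') || hstore(?, ?)"
           + " WHERE fingerprint = ?")) {
        ps.setString(1, "netname");
        ps.setString(2, "EXAMPLE-NET"); // fabricated sample value
        ps.setString(3, "0123456789ABCDEF0123456789ABCDEF01234567");
        ps.executeUpdate();
      }
    }
  }
}
```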
What other tests will help with the transition?

[…]
     
 * An id is included as a preliminary key for generality. It's there to make relationships clear between tables that may be added later (for other documents). It's much easier to index a long integer than, say, an entire fingerprint (or its base64 form). At least that's the premise. That being said, the router fingerprint will end up indexed by the query optimizer anyway.

 * (v7, removed) This form doesn't consider the possibility of storing changes to the flags for a given router (determined by a fingerprint) over time. It only stores the current flags for a router fingerprint. Maybe a change such as having a foreign key in Router point to an id of Router_flag, which encodes the current flags together in a single record: a record for each instance of current flags for a router. Update the foreign key in Router on changes.

 * (v8, added) This form doesn't consider the possibility of storing changes to the flags for a given router (determined by a fingerprint) over time. It only stores the current flags for a router fingerprint. Maybe a change such as having an HSTORE in Router map an id (of Router_flag) to a hash of the flags. Router_flag would have the id and encoded flags in a single record. Upon detecting changes in flags, create a new record in Router_flag and update the HSTORE in Router. A date field in Router_flag might also be used to keep track of a last-seen. (A rough DDL sketch follows this list.)

 * A realistic schema would mean the router family needs to be filled in during a second pass.
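To ground the v8 variant above, here is one rough DDL reading of it. Every name and type is a guess for discussion, not the proposed schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** One hypothetical reading of the Router / Router_flag sketch above. */
public class SchemaSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/onionoo"); // placeholder DSN
         Statement st = conn.createStatement()) {
      st.execute("CREATE EXTENSION IF NOT EXISTS hstore");
      // Each distinct set of flags is encoded once, in a single record.
      st.execute("CREATE TABLE router_flag ("
          + " id BIGSERIAL PRIMARY KEY,"
          + " flags TEXT NOT NULL," // encoded flags, e.g. 'Fast,Running,Stable'
          + " last_seen TIMESTAMP)"); // optional last-seen, per the note above
      // The long id is the cheap preliminary key; the fingerprint ends up
      // indexed anyway via the UNIQUE constraint.
      st.execute("CREATE TABLE router ("
          + " id BIGSERIAL PRIMARY KEY,"
          + " fingerprint CHAR(40) UNIQUE NOT NULL,"
          + " nickname TEXT,"
          + " flag_history HSTORE," // maps router_flag id -> hash of the flags
          + " family TEXT)"); // filled in during a second pass
    }
  }
}
```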
So how does this look for the search query? I'll run some `EXPLAIN (ANALYZE, BUFFERS)` to add to the discussion after filling in the tables with some data. While I'm at it, I'll also start some indexing and substring tests. I'm not sure how to set the bar for performance. What should I look for when performing a query with `servers` (router), `addresses` (ip), and `fingerprints`? As fast as possible comes to mind :D
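And one way the measurement itself might look, assuming the sketched schema plus an inet-typed `address` column; the query literals are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical: print the plan for a servers+addresses+fingerprints query. */
public class ExplainSearch {
  public static void main(String[] args) throws SQLException {
    String sql = "EXPLAIN (ANALYZE, BUFFERS)"
        + " SELECT fingerprint, nickname FROM router"
        + " WHERE nickname LIKE 'moria%'" // servers (router)
        + " AND fingerprint LIKE '9695%'" // fingerprints
        + " AND address << inet '128.31.0.0/16'"; // addresses (ip)
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/onionoo"); // placeholder DSN
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // one plan line per row
      }
    }
  }
}
```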