Opened 5 years ago

Closed 15 months ago

#13803 closed defect (wontfix)

Better and more efficient database schema

Reported by: otr Owned by: hellais
Priority: Medium Milestone:
Component: Archived/Ooni Version:
Severity: Normal Keywords: archived-closed-2018-07-04
Cc: kudrom, hellais, otr Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The current mongo db database schema has some problems that do not allow for efficient querying in certain cases.

Curently the main two collections in the DB are "reports" and "measurements" and each measurement references the report by id.

This makes analysis on a per country basis not so easy, e.g.:

  1. query for all measurements that have inconsistencies
  2. iterate over measurements and query report_id (and probe_cc)

This results in many unecessary queries.

Ideally we would like to have a schema that allows use to one query in order to get all measurements based on report AND measurement attributes ;
the result would be just a list of measurements.

Child Tickets

Change History (3)

comment:1 Changed 5 years ago by otr

kudrom already suggested the use of subdocuments. Basicually this would merge the reports and measurements collections. The measurements become just an array in each report's entry.

An implementation of this for the publish module of ooni pipeline is here:

https://github.com/olliwolli/ooni-pipeline/blob/feature/nested_mongodb/ooni/pipeline/task/publish.py#L35

The export step would also need to be adapted. Ideally by using find() only to get subdocuments (not the whole report which would contain ALL measurements of the report).

comment:2 Changed 4 years ago by tyler

Severity: Normal

Hello,
I will be working with @hellais to improve the database schema used by ooni-api and ooni-pipeline to allow for more efficient queries - to be more specific, we're going to be moving to a relational model which uses PostgreSQL, and materialized views to allow for analytic queries to be performed.

To be short, and sweet, I've proposed the use of a star schema to store individual test metrics with materialized views to store test-specific metrics (i.e. a materialized view for the dns_consistency test results, and one for bridge_reachability and so on, and so forth).

The suggested means of maintaining these materialized views is to create an indexed view which rebuilds when the underlying tables change. Feedback is welcome, and always appreciated.

comment:3 Changed 15 months ago by teor

Keywords: archived-closed-2018-07-04 added
Resolution: wontfix
Status: newclosed

Close all tickets in archived components

Note: See TracTickets for help on using tickets.