Beaconplus Data / Query Model

The Progenetix / Beaconplus query model uses the GA4GH core data model for genomic as well as biomedical and procedural queries, and for data delivery.

The GA4GH data model for genomics recommends the use of a principal object hierarchy, consisting of variants, callsets, biosamples and individuals.

In the Progenetix backend we mirror the GA4GH data model in the storage system, which consists of the corresponding collections of MongoDB databases. These collections are addressed by scoped queries. Since the current Beacon query model only supports variant queries (“BeaconAlleleRequest”) and filters, we apply pre-parsing steps for mapping the filter values to the correct attributes and collections (see below).

Variant query (VarQ)

The Variant Query is a standard Beacon v1.1 BeaconAlleleRequest, including support for ranges, wildcards and structural variants (DUP, DEL, BND).
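
As an illustration, a range query for a deletion could be assembled as shown here. This is a minimal sketch: the parameter names follow the Beacon v1 BeaconAlleleRequest, while the endpoint URL, the helper name build_variant_query and the coordinates are placeholder assumptions and are not taken from the Progenetix service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the address of the Beacon being queried.
BEACON_URL = "https://beacon.example.org/query"


def build_variant_query(params: dict) -> str:
    """Serialise BeaconAlleleRequest parameters into a GET query URL."""
    return f"{BEACON_URL}?{urlencode(params)}"


# Range query for a structural variant (deletion).
# All coordinates are illustrative; "N" is used as a wildcard reference base.
del_request = {
    "assemblyId": "GRCh38",
    "referenceName": "9",
    "referenceBases": "N",
    "variantType": "DEL",
    "startMin": 21000000,
    "startMax": 21975098,
    "endMin": 21967753,
    "endMax": 23000000,
}

url = build_variant_query(del_request)
```

The startMin/startMax and endMin/endMax pairs bracket the positions in which the variant's start and end may fall, which is how range matches for structural variants are expressed.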

Callset Query (CsQ)

Callsets are queried only indirectly, e.g. as a data aggregation target, using either the variants.callset_id values or biosamples.id => callsets.biosample_id matches.
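
The two indirect routes to callsets can be sketched with plain Python records standing in for MongoDB documents. Only the callset_id, biosample_id and id attributes come from the description above; the function names and sample records are hypothetical.

```python
def callset_ids_from_variants(variants):
    """Collect the distinct callset ids referenced by matched variants."""
    return {v["callset_id"] for v in variants}


def callsets_for_biosamples(callsets, biosample_ids):
    """Select callsets whose biosample_id matches one of the biosample ids."""
    wanted = set(biosample_ids)
    return [cs for cs in callsets if cs["biosample_id"] in wanted]


# Illustrative in-memory records (not real Progenetix data).
variants = [
    {"callset_id": "cs-1"},
    {"callset_id": "cs-1"},
    {"callset_id": "cs-2"},
]
callsets = [
    {"id": "cs-1", "biosample_id": "bs-A"},
    {"id": "cs-2", "biosample_id": "bs-B"},
    {"id": "cs-3", "biosample_id": "bs-C"},
]

# Route 1: variants.callset_id values of the matched variants.
from_variants = callset_ids_from_variants(variants)

# Route 2: biosamples.id => callsets.biosample_id matches.
from_biosamples = callsets_for_biosamples(callsets, ["bs-B", "bs-C"])
```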

Biosample Query (BiosQ)

Biosamples contain information about biological parameters (e.g. histology, organ site), procedural parameters (e.g. external identifiers, geographic origin) and clinical data specific to the sample (e.g. age at sample collection).

Individual Query (IndQ)

In the GA4GH core data model, Individuals (or Subjects) are data objects that contain information pertaining to the whole organism. Typical attributes for use in Beaconplus queries are, for example, genotypic sex and phenotypic information.

Filters

Filters allow the resource provider to direct “self-scoped” query values to the corresponding attributes in their backend resource. In the Progenetix implementation, a lookup table followed by scope assignment is used to map prefixed filter values to the correct attributes and collections:

  1. Use the prefix to determine the full attribute:
       * filters=ncit:C4033 - query attribute biocharacteristics.type.id for value ncit:C4033
       * filters=pubmed:28966033 - query attribute external_references.type.id for value pubmed:28966033
  2. Match the full attribute to the correct scope (i.e. collection, query domain)

The list below shows a selection from the configuration file (YAML):

filter_prefix_mappings:
  ncit:
    parameter: 'biocharacteristics.type.id'
  HPO:
    parameter: 'biocharacteristics.type.id'
  pubmed:
    parameter: 'external_references.type.id'
  cellosaurus:
    parameter: 'external_references.type.id'
  EFO:
    parameter: 'provenance.material.type.id'
  city:
    parameter: 'provenance.geo.city'
    remove_prefix: true
  wes:
    parameter: 'counts.wes'
    remove_prefix: true
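
The prefix lookup (step 1) could be sketched as below. This is not the Progenetix implementation: the function name resolve_filter is hypothetical, only a subset of the mappings is copied, and the sketch assumes that the prefix is everything before the first colon and that remove_prefix means the bare value is used in the query.

```python
# Subset of the filter_prefix_mappings configuration shown above.
FILTER_PREFIX_MAPPINGS = {
    "ncit": {"parameter": "biocharacteristics.type.id"},
    "pubmed": {"parameter": "external_references.type.id"},
    "city": {"parameter": "provenance.geo.city", "remove_prefix": True},
}


def resolve_filter(filter_value: str):
    """Map a prefixed filter value to an (attribute, value) pair.

    Assumption: the prefix is the text before the first ':'.
    """
    prefix, sep, rest = filter_value.partition(":")
    mapping = FILTER_PREFIX_MAPPINGS.get(prefix)
    if not sep or mapping is None:
        raise ValueError(f"no mapping for filter {filter_value!r}")
    # Keep the prefixed value unless the mapping asks for prefix removal.
    value = rest if mapping.get("remove_prefix") else filter_value
    return mapping["parameter"], value
```

With these assumptions, resolve_filter("ncit:C4033") keeps the prefixed value, while resolve_filter("city:Zurich") yields the bare city name for provenance.geo.city.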

The different scopes (i.e. collections) have pre-defined attributes that can be queried. For example, the filter filters=ncit:C4033 is first resolved to biocharacteristics.type.id=ncit:C4033. The biocharacteristics.type.id attribute then matches a parameter (or one of its aliases) in the biosamples scope, which generates the query { "biocharacteristics.type.id" : "ncit:C4033" } against the biosamples collection.

scopes:
  biosamples:
    scope: biosamples
    parameters:
      id:
        paramkey: 'biosamples.id'
        dbkey: 'id'
        alias:
          - 'biosamples-id'
          - 'id'
        pattern: '^.+?\w+?.+?$'
        type: array
      biocharacteristics.type.id:
        paramkey: 'biosamples.biocharacteristics.type.id'
        dbkey: 'biocharacteristics.type.id'
        alias:
          - 'biosamples-biocharacteristics-type-id'
          - 'biocharacteristics.type.id'
          - 'biocharacteristics-type-id'
        pattern: '^(\w+[\:\-$])?\w*?\d(?:[\w\-\.]+?)?'
        type: array
      external_references.type.id:
        paramkey: 'biosamples.external_references.type.id'
        alias:
          - 'biosamples-external_references-type-id'
          - 'external_references.type.id'
          - 'external_references-type-id'
        pattern: '^(\w+[\:\-$])?\w.?(?:[\w\-\.]+?)?'
        type: array
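
The scope assignment (step 2) could be sketched as follows, using a trimmed copy of the biosamples scope above. The function name build_scoped_query and the decision to raise on a pattern mismatch are assumptions made for illustration; the actual backend may handle aliases, array values and validation differently.

```python
import re

# Trimmed copy of the biosamples scope from the configuration above.
SCOPES = {
    "biosamples": {
        "parameters": {
            "biocharacteristics.type.id": {
                "dbkey": "biocharacteristics.type.id",
                "alias": [
                    "biosamples-biocharacteristics-type-id",
                    "biocharacteristics.type.id",
                    "biocharacteristics-type-id",
                ],
                "pattern": r"^(\w+[\:\-$])?\w*?\d(?:[\w\-\.]+?)?",
            },
        },
    },
}


def build_scoped_query(attribute, value):
    """Return (collection, query) for the first scope parameter that matches."""
    for scope, conf in SCOPES.items():
        for name, par in conf["parameters"].items():
            # Accept the parameter name itself or any of its aliases.
            if attribute != name and attribute not in par.get("alias", []):
                continue
            # Validate the value against the configured pattern.
            if not re.match(par["pattern"], value):
                raise ValueError(f"{value!r} does not match pattern for {name}")
            return scope, {par["dbkey"]: value}
    raise KeyError(f"no scope defines attribute {attribute!r}")


collection, query = build_scoped_query("biocharacteristics.type.id", "ncit:C4033")
```

The returned dictionary has the shape of a MongoDB query document, so it could be passed to a find call on the collection named by the first return value.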

Pre-selection for Aggregation

2019-06-26