Every dataset in the Attester catalog is scored on a transparent, published rubric. The Attester Quality Score (AQS) is a weighted composite of four internal dimensions and is cross-referenced against external gold-standard databases to produce a concordance signal. Scores and rationales are shown in full on every dataset's provenance tab.
All dimensions are scored 0–100. The composite is rounded to the nearest integer. Scores are recomputed when datasets are updated or re-audited.
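As a rough illustration of how a weighted composite of 0–100 dimension scores could be rounded to an integer, here is a minimal sketch. The dimension labels and the equal weights are placeholders chosen for the example; the actual AQS weights are not stated here, and the half-up rounding rule is an assumption.

```python
import math


def composite_score(scores, weights):
    """Weighted composite of 0-100 dimension scores, rounded to the nearest integer."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score {value} is outside 0-100")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[d] * weights[d] for d in scores) / total_weight
    # Assumption: halves round up. Python's built-in round() would instead
    # round halves to the nearest even integer.
    return math.floor(weighted + 0.5)


# Hypothetical dimension labels and equal weights, for illustration only.
example_scores = {
    "completeness": 92.0,
    "standardization": 80.0,
    "provenance": 100.0,
    "curation": 70.0,
}
example_weights = {name: 0.25 for name in example_scores}
print(composite_score(example_scores, example_weights))
```

Normalizing by the sum of the weights keeps the result on the 0–100 scale even if the weights passed in do not sum to exactly 1.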
Average non-null percentage across all schema fields in the dataset.
Computed as: mean(100 − null_rate) across all columns, where null_rate is each column's percentage of null values. Fields with intentional sparsity (e.g. an adverse_event field that is null when no AE occurred) are noted in the rationale, but the raw null rate is still used so buyers see the true data density.
Incomplete data is a common reason AI training pipelines fail silently. A dataset with 67% null on a key field can produce a biased model without an obvious error signal.
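The completeness formula above can be sketched in a few lines. This is an illustrative version only: it assumes rows are plain dictionaries and that `None` is the sole null marker, and the function name and sample records are invented for the example.

```python
def completeness_score(rows, columns):
    """Return (mean non-null percentage, per-column non-null percentage)."""
    if not rows or not columns:
        raise ValueError("need at least one row and one column")
    per_column = {}
    for col in columns:
        # A missing key and an explicit None both count as null.
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_rate = 100.0 * nulls / len(rows)
        per_column[col] = 100.0 - null_rate
    return sum(per_column.values()) / len(columns), per_column


# Hypothetical records; adverse_event is intentionally sparse.
sample_rows = [
    {"patient_id": "P1", "drug": "warfarin", "adverse_event": None},
    {"patient_id": "P2", "drug": "warfarin", "adverse_event": "bleeding"},
    {"patient_id": "P3", "drug": None, "adverse_event": None},
    {"patient_id": "P4", "drug": "clopidogrel", "adverse_event": None},
]
score, detail = completeness_score(
    sample_rows, ["patient_id", "drug", "adverse_event"]
)
print(round(score, 1), detail)
```

Returning the per-column figures alongside the mean makes it easy to see when a single sparse field, such as `adverse_event` here, is pulling the score down.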
Degree to which fields use community-standard ontology identifiers and normalized types.
Scored on a rubric. Each community-standard identifier type present earns points: rsIDs, HGNC gene symbols, RxNorm drug names, ICD-10-CM codes, UniProt ACs, ChEMBL IDs, PDB IDs, and canonical SMILES. Custom institutional IDs, free-text fields where a standard exists, and type inconsistencies deduct points.
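One ingredient of a rubric like this is checking whether a column's values look like standard identifiers at all. The sketch below measures format conformance for three identifier types. It is not the Attester rubric: the point values are not stated here, the function name is hypothetical, and a format match does not prove that an identifier exists in the upstream database.

```python
import re

# Format-only patterns. They check shape, not existence in the source database.
ID_PATTERNS = {
    "rsid": re.compile(r"rs[0-9]+"),
    "chembl_id": re.compile(r"CHEMBL[0-9]+"),
    # Accession number format as documented by UniProt.
    "uniprot_ac": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
}


def conformance_rate(values, id_type):
    """Percentage of non-null values that fully match the pattern for id_type."""
    pattern = ID_PATTERNS[id_type]
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    matches = sum(1 for v in present if pattern.fullmatch(v))
    return 100.0 * matches / len(present)


# The last non-null value is a custom ID, so it does not count as an rsID.
print(conformance_rate(["rs429358", "rs7412", "variant_17", None], "rsid"))
# An entry name such as TP53_HUMAN is not an accession number.
print(conformance_rate(["P04637", "Q8N158", "A0A024R161", "TP53_HUMAN"], "uniprot_ac"))
```

Using `fullmatch` rather than `search` avoids counting free-text values that merely contain an identifier somewhere inside them.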
Proprietary identifiers prevent dataset joining, break downstream pipelines, and create silent mapping errors. Standard IDs mean your data works out of the box with public databases.
Provenance tier of the upstream data source.
Source provenance is the foundation of legal risk assessment. Buyers' legal teams will ask where the data originated — this dimension makes that auditable.
Rigor of the curation pipeline from raw source to delivered schema.
Raw data dumps require 3–6 months of engineering before they're usable for model training. Curation depth predicts how much preprocessing your team won't have to do.
For each relevant dataset, we run concordance checks against authoritative external databases. Where we do not redistribute a database's content for licensing reasons (OncoKB, PharmGKB, OMIM; see the License / Usage column below), we use it exclusively as a validation oracle, that is, an internal quality gate. The output we publish is a concordance percentage, not their data.
This creates a feedback loop: synthetic or curated data is generated → validated against the gold standard → discordant records are flagged and corrected → data quality improves → the concordance score updates. As gold standards release new versions, we re-validate and update scores transparently.
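The validate-and-flag step of this loop can be sketched as a comparison of labels keyed by a shared identifier. This is a minimal illustration under stated assumptions: both sides are plain dictionaries, concordance means exact label equality, and the identifiers and labels below are invented rather than taken from any gold-standard database.

```python
def concordance(dataset_labels, gold_labels):
    """Return (percent agreement over shared keys, list of discordant keys)."""
    shared = sorted(set(dataset_labels) & set(gold_labels))
    if not shared:
        # Nothing to compare, so no concordance figure can be reported.
        return None, []
    discordant = [key for key in shared if dataset_labels[key] != gold_labels[key]]
    percent = 100.0 * (len(shared) - len(discordant)) / len(shared)
    return percent, discordant


# Hypothetical records: keys and labels are placeholders.
dataset = {
    "var_001": "pathogenic",
    "var_002": "benign",
    "var_003": "benign",
    "var_004": "pathogenic",
    "var_005": "uncertain",  # absent from the gold standard, so not scored
}
gold = {
    "var_001": "pathogenic",
    "var_002": "benign",
    "var_003": "pathogenic",
    "var_004": "pathogenic",
}
percent, flagged = concordance(dataset, gold)
print(percent, flagged)
```

Only the percentage would be published; the list of flagged keys stays internal and feeds the correction step, after which the check is rerun.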
| Source | Organization | Scope | License / Usage | Status |
|---|---|---|---|---|
| OncoKB | Memorial Sloan Kettering | Somatic variant oncogenicity and therapeutic implications for oncology drug datasets | Restrictive commercial — used as validation oracle only | Scheduled |
| PharmGKB | Stanford University | Pharmacogenomics variant-drug phenotype annotations for PGx datasets | CC BY-SA 4.0 — used as validation oracle only | Scheduled |
| OMIM | Johns Hopkins University | Gene-disease associations for rare disease variant datasets | Restrictive redistribution — used as validation oracle only | Scheduled |
| CPIC Guidelines | CPIC Consortium | Pharmacogenomics drug-gene pair annotations | CC0 public domain — used as both source and validator | Active |
| ClinVar | NCBI / NIH | Variant pathogenicity classifications across genomics datasets | Public domain — used as both source and validator | Active |
| FDA Drug Labels | U.S. FDA | Drug name normalization and PGx label concordance | Public domain (17 U.S.C. § 105) — used as source and validator | Active |
| ChEMBL | EMBL-EBI | Bioactivity, ADMET, and molecular property concordance | CC BY-SA 3.0 — used as source and validator | Active |
| UniProt/Swiss-Prot | UniProt Consortium | Protein target identity and function annotation | CC BY 4.0 — used as source and validator | Active |