Attester Quality Score

Every dataset in the Attester catalog is scored on a transparent, published rubric. The Attester Quality Score (AQS) is a weighted composite of four internal dimensions and is cross-referenced against external gold-standard databases to produce a concordance signal. Scores and rationales are shown in full on every dataset's provenance tab.

Composite formula

AQS = 25% × Completeness + 20% × Schema Standardization + 30% × Source Trust + 25% × Curation Depth

All dimensions are scored 0–100. The composite is rounded to the nearest integer. Scores are recomputed when datasets are updated or re-audited.
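As a worked illustration of the formula above, here is a minimal Python sketch. The function name `composite_aqs`, the dictionary keys, and the example scores are hypothetical and are not taken from any Attester dataset. The page says only "nearest integer", so the half-up tie-breaking used here is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights as published in the composite formula (they sum to 100%).
WEIGHTS = {
    "completeness": Decimal("0.25"),
    "schema_standardization": Decimal("0.20"),
    "source_trust": Decimal("0.30"),
    "curation_depth": Decimal("0.25"),
}


def composite_aqs(scores):
    """Return the weighted composite of four 0-100 dimension scores as an integer."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly the four AQS dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    total = sum(WEIGHTS[name] * Decimal(str(scores[name])) for name in WEIGHTS)
    # Assumption: ties round half up; the page only says "nearest integer".
    return int(total.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Hypothetical dimension scores, for illustration only.
example = {
    "completeness": 92,
    "schema_standardization": 85,
    "source_trust": 96,
    "curation_depth": 88,
}
print(composite_aqs(example))  # 23 + 17 + 28.8 + 22 = 90.8, which rounds to 91
```

Decimal arithmetic is used so that a composite landing exactly on .5 is not affected by binary floating-point error or by the round-half-to-even behavior of Python's built-in `round`.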

Dimensions

Completeness

25% weight

Average non-null percentage across all schema fields in the dataset.

How it's computed

mean(100 − null_rate) across all columns, where null_rate is the percentage of null values in a column. Fields with intentional sparsity (e.g. an adverse_event field that is null when no AE occurred) are noted in the rationale, but the raw null rate is still used so buyers see the true data density.
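The calculation can be sketched in a few lines of Python. This is an illustration only: the function name, the row format (a list of dictionaries), and the sample rows are hypothetical, and it assumes that only `None` counts as null (empty strings are treated as present).

```python
def completeness_score(rows, columns):
    """Return (mean of 100 - null_rate over columns, per-column non-null percentages)."""
    if not rows or not columns:
        raise ValueError("need at least one row and one column")
    per_column = {}
    for col in columns:
        # Assumption: a missing key or a None value counts as null.
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_rate = 100.0 * nulls / len(rows)
        per_column[col] = 100.0 - null_rate
    return sum(per_column.values()) / len(columns), per_column


# Made-up rows; adverse_event is intentionally sparse but still counted at its raw rate.
rows = [
    {"patient_id": "p1", "drug": "warfarin", "adverse_event": None},
    {"patient_id": "p2", "drug": "warfarin", "adverse_event": "bleeding"},
    {"patient_id": "p3", "drug": None, "adverse_event": None},
    {"patient_id": "p4", "drug": "clopidogrel", "adverse_event": None},
]
score, detail = completeness_score(rows, ["patient_id", "drug", "adverse_event"])
print(detail)            # {'patient_id': 100.0, 'drug': 75.0, 'adverse_event': 25.0}
print(round(score, 2))   # 66.67
```

Returning the per-column breakdown alongside the mean keeps the sparse field visible instead of letting the average hide it.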

Why it matters

Incomplete data is a common reason AI training pipelines fail silently. A dataset with 67% null on a key field can produce a biased model without an obvious error signal.

Schema Standardization

20% weight

Degree to which fields use community-standard ontology identifiers and normalized types.

How it's computed

Scored on a rubric: rsIDs, HGNC gene symbols, RxNorm drug names, ICD-10-CM codes, UniProt ACs, ChEMBL IDs, PDB IDs, canonical SMILES — each present earns points. Custom institutional IDs, free-text fields where a standard exists, and type inconsistencies deduct points.
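The point values of the rubric are not listed here, so the sketch below covers only one plausible building block: measuring what share of a column's values match the format of a standard identifier. The names `ID_PATTERNS` and `conformance_rate` are hypothetical. The patterns check syntax only and do not confirm that an identifier exists in the upstream database.

```python
import re

# Format-only patterns. Assumption: a syntactic match is a reasonable proxy for
# "this field uses the standard identifier"; no lookup against the real database.
ID_PATTERNS = {
    "rsid": re.compile(r"rs\d+"),
    "chembl_id": re.compile(r"CHEMBL\d+"),
    "uniprot_ac": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
}


def conformance_rate(values, id_type):
    """Percentage of non-null values whose format matches the named identifier type."""
    pattern = ID_PATTERNS[id_type]
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    matches = sum(1 for v in present if pattern.fullmatch(v))
    return 100.0 * matches / len(present)


# A column mixing rsIDs with a star-allele label, which is not an rsID.
variant_column = ["rs1799853", "rs1057910", "CYP2C9*2", None, "rs9923231"]
print(conformance_rate(variant_column, "rsid"))  # 75.0
```

A low rate on a column that is supposed to hold a standard identifier is the kind of type inconsistency the rubric deducts points for.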

Why it matters

Proprietary identifiers prevent dataset joining, break downstream pipelines, and create silent mapping errors. Standard IDs mean your data works out of the box with public databases.

Source Trust

30% weight

Provenance tier of the upstream data source.

How it's computed

95–100 — Public domain or CC0 (FDA, NIH, ClinVar, CPIC, PubChem, US Government works)
88–94 — Licensed institutional or peer-reviewed with commercial rights (GTEx, ChEMBL, MIMIC-IV DUA)
80–87 — Curated derivative or research-license source used as validator only
65–79 — Synthetic data without external validation
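The tiers above can be read as a simple lookup table. The sketch below is illustrative rather than the catalog's implementation: the names are hypothetical and it assumes integer scores. The tiers start at 65, so the function returns `None` for anything lower rather than guessing a label.

```python
# Score ranges and labels copied from the tier list above.
SOURCE_TRUST_TIERS = [
    (95, 100, "Public domain or CC0"),
    (88, 94, "Licensed institutional or peer-reviewed with commercial rights"),
    (80, 87, "Curated derivative or research-license source used as validator only"),
    (65, 79, "Synthetic data without external validation"),
]


def source_trust_tier(score):
    """Return the tier label for an integer Source Trust score, or None if no tier covers it."""
    for low, high, label in SOURCE_TRUST_TIERS:
        if low <= score <= high:
            return label
    return None  # scores below 65 have no published tier


print(source_trust_tier(97))  # Public domain or CC0
print(source_trust_tier(60))  # None
```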

Why it matters

Source provenance is the foundation of legal risk assessment. Buyers' legal teams will ask where the data originated — this dimension makes that auditable.

Curation Depth

25% weight

Rigor of the curation pipeline from raw source to delivered schema.

How it's computed

90–100 — Expert-validated sample (n ≥ 1,000) + automated QC pipeline
80–89 — Automated pipeline with domain-specific QC (deduplication, normalization, outlier detection)
70–79 — Automated extraction with basic normalization
50–69 — Minimal transformation of raw source data

Why it matters

Raw data dumps require 3–6 months of engineering before they're usable for model training. Curation depth predicts how much preprocessing your team won't have to do.

Gold-standard validation loop

For each relevant dataset, we run concordance checks against authoritative external databases. Where those databases carry licenses that restrict commercial use or impose redistribution conditions (OncoKB, PharmGKB, OMIM), we use them exclusively as validation oracles, that is, as internal quality gates, and never redistribute their content. The output we publish is a concordance percentage, not their data.

This creates a feedback loop: synthetic or curated data is generated → validated against the gold standard → discordant records are flagged and corrected → data quality improves → the concordance score updates. As gold standards release new versions, we re-validate and update scores transparently.
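One pass of this loop can be sketched as follows. The exact concordance definition is not spelled out on this page, so this sketch assumes it is the share of records present in the reference whose label agrees with it, and that records absent from the reference are excluded from the denominator. The function name, keys, and labels are placeholders, not real variants or real results.

```python
def concordance_report(records, gold_standard):
    """Compare record labels with a reference mapping.

    Returns (concordance percentage or None, list of discordant keys).
    """
    compared = 0
    discordant = []
    for key, label in records.items():
        if key not in gold_standard:
            continue  # not covered by the reference, so left out of the denominator
        compared += 1
        if label != gold_standard[key]:
            discordant.append(key)  # flagged for review and correction
    if compared == 0:
        return None, discordant
    concordance = 100.0 * (compared - len(discordant)) / compared
    return concordance, discordant


# Placeholder data for illustration only.
records = {
    "var_a": "pathogenic",
    "var_b": "benign",
    "var_c": "pathogenic",
    "var_d": "benign",
    "var_e": "uncertain",
}
gold = {"var_a": "pathogenic", "var_b": "benign", "var_c": "benign", "var_d": "benign"}

pct, flagged = concordance_report(records, gold)
print(pct)      # 75.0
print(flagged)  # ['var_c']
```

Only the percentage and the internal flag list leave this function. The reference labels themselves are never copied into the output, which matches the validation-oracle constraint described above.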

Source	Organization	Scope	License / Usage	Status
OncoKB	Memorial Sloan Kettering	Somatic variant oncogenicity and therapeutic implications for oncology drug datasets	Restrictive commercial — used as validation oracle only	Scheduled
PharmGKB	Stanford University	Pharmacogenomics variant-drug phenotype annotations for PGx datasets	CC BY-SA 4.0 — used as validation oracle only	Scheduled
OMIM	Johns Hopkins University	Gene-disease associations for rare disease variant datasets	Restrictive redistribution — used as validation oracle only	Scheduled
CPIC Guidelines	CPIC Consortium	Pharmacogenomics drug-gene pair annotations	CC0 public domain — used as both source and validator	Active
ClinVar	NCBI / NIH	Variant pathogenicity classifications across genomics datasets	Public domain — used as both source and validator	Active
FDA Drug Labels	U.S. FDA	Drug name normalization and PGx label concordance	Public domain (17 U.S.C. § 105) — used as source and validator	Active
ChEMBL	EMBL-EBI	Bioactivity, ADMET, and molecular property concordance	CC BY-SA 3.0 — used as source and validator	Active
UniProt/Swiss-Prot	UniProt Consortium	Protein target identity and function annotation	CC BY 4.0 — used as source and validator	Active

Score interpretation

90–100

Production-ready

Suitable for commercial AI training with minimal preprocessing. High confidence in completeness, provenance, and curation rigor.

80–89

Research-grade

Strong dataset with one or two documented limitations. Review the dimension breakdown to understand where to focus preprocessing effort.

70–79

Baseline

Usable for research and experimentation. Expect field gaps or source trust caveats. Rationale explains the specific tradeoffs.

< 70

Specialist use

Significant known limitations that affect general-purpose use. Included for specialist applications where the limitations are understood.
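For teams that filter the catalog programmatically, the bands above map to a short lookup. This is a minimal sketch with a hypothetical function name, using the thresholds and labels exactly as listed.

```python
def interpret_aqs(score):
    """Map a composite AQS (0-100) to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("AQS must be between 0 and 100")
    if score >= 90:
        return "Production-ready"
    if score >= 80:
        return "Research-grade"
    if score >= 70:
        return "Baseline"
    return "Specialist use"


print(interpret_aqs(91))  # Production-ready
print(interpret_aqs(69))  # Specialist use
```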

Transparency commitments

Scores and dimension rationales are published in full on every dataset's provenance tab — no black-box numbers.
Gold-standard concordance checks use restricted databases as validation oracles only. We publish which ones apply to each dataset and their concordance percentage once validated.
Null rates are never smoothed or hidden. If a field has 41% null, that is shown and explained.
Scores update when datasets are re-curated, re-audited, or when upstream gold standards release new versions.
The weighting formula on this page is the exact formula used to compute every score in the catalog.

Questions about methodology? Contact [email protected]