
Attester Quality Score

Every dataset in the Attester catalog is scored on a transparent, published rubric. The Attester Quality Score (AQS) is a weighted composite of four internal dimensions and is cross-referenced against external gold-standard databases to produce a concordance signal. Scores and rationales are shown in full on every dataset's provenance tab.

Composite formula

AQS = 25% × Completeness + 20% × Schema Standardization + 30% × Source Trust + 25% × Curation Depth

All dimensions are scored 0–100. The composite is rounded to the nearest integer. Scores are recomputed when datasets are updated or re-audited.
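As a rough illustration, the composite could be computed as in the sketch below. The function name, the dictionary keys, and the example scores are hypothetical. The tie-breaking rule (exact halves round up) is an assumption, since the text only says "nearest integer".

```python
import math

# Weights in percent, taken from the composite formula above.
WEIGHTS = {
    "completeness": 25,
    "schema_standardization": 20,
    "source_trust": 30,
    "curation_depth": 25,
}


def aqs(scores):
    """Return the composite AQS for a dict of 0-100 dimension scores."""
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be in 0-100, got {scores[name]}")
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    # Assumption: exact halves round up; the source does not specify this.
    return math.floor(weighted + 0.5)


# Hypothetical dimension scores for one dataset.
example = {
    "completeness": 92,
    "schema_standardization": 85,
    "source_trust": 96,
    "curation_depth": 80,
}
print(aqs(example))  # weighted sum is 88.8, which rounds to 89
```

Because Source Trust carries the largest weight, a 10-point change in that dimension moves the composite by 3 points, compared with 2 points for Schema Standardization.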

Dimensions

Completeness
25% weight

Average non-null percentage across all schema fields in the dataset.

How it's computed

Computed as mean(100 − null_rate) across all columns, where null_rate is the percentage of null values in a column. Fields with intentional sparsity (e.g. an adverse_event field that is null when no AE occurred) are noted in the rationale, but the raw null rate is still used so buyers see the true data density.
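A minimal sketch of this calculation, assuming rows are plain dictionaries and that a null is represented by None or a missing key. The function name, column names, and sample rows are made up for illustration.

```python
def completeness(rows, columns):
    """Mean of (100 - null_rate) over columns; null_rate is a percentage."""
    if not rows or not columns:
        raise ValueError("need at least one row and one column")
    per_column = []
    for col in columns:
        # A missing key is treated the same as an explicit None.
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_rate = 100 * nulls / len(rows)
        per_column.append(100 - null_rate)
    return sum(per_column) / len(per_column)


# Hypothetical rows: adverse_event is intentionally sparse but still counted.
rows = [
    {"patient_id": "P1", "drug": "warfarin", "adverse_event": None},
    {"patient_id": "P2", "drug": "codeine", "adverse_event": "nausea"},
    {"patient_id": "P3", "drug": None, "adverse_event": None},
    {"patient_id": "P4", "drug": "clopidogrel", "adverse_event": "rash"},
]
score = completeness(rows, ["patient_id", "drug", "adverse_event"])
print(score)  # columns score 100, 75 and 50, so the mean is 75.0
```

The intentionally sparse adverse_event column pulls the mean down here, which is the behavior the rationale note is meant to explain to buyers.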

Why it matters

Incomplete data is a common reason AI training pipelines fail silently. A dataset with 67% nulls in a key field can produce a biased model without any obvious error signal.

Schema Standardization
20% weight

Degree to which fields use community-standard ontology identifiers and normalized types.

How it's computed

Scored on a rubric: rsIDs, HGNC gene symbols, RxNorm drug names, ICD-10-CM codes, UniProt ACs, ChEMBL IDs, PDB IDs, canonical SMILES — each present earns points. Custom institutional IDs, free-text fields where a standard exists, and type inconsistencies deduct points.
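The rubric's point values are not published in this section, so the sketch below does not attempt to reproduce the score. It only illustrates one possible input to such a rubric: the share of values in a column that look like a given standard identifier. The pattern table and function name are hypothetical, and the regular expressions check format only, not whether an identifier exists in the upstream registry.

```python
import re

# Illustrative format checks only; they do not confirm that an ID is registered.
PATTERNS = {
    "rsID": re.compile(r"rs\d+"),
    "ChEMBL ID": re.compile(r"CHEMBL\d+"),
    "UniProt AC": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
}


def match_rate(values, kind):
    """Percentage of non-null values that fully match the pattern for `kind`."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    hits = sum(1 for v in present if PATTERNS[kind].fullmatch(v))
    return 100 * hits / len(present)


# Hypothetical column mixing rsIDs with one custom institutional ID.
variant_column = ["rs1799853", "rs1057910", "rs4244285", "VAR_00017", None]
print(match_rate(variant_column, "rsID"))  # 3 of 4 non-null values match: 75.0
```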

Why it matters

Proprietary identifiers prevent dataset joining, break downstream pipelines, and create silent mapping errors. Standard IDs mean your data works out of the box with public databases.

Source Trust
30% weight

Provenance tier of the upstream data source.

How it's computed
  • 95–100 — Public domain or CC0 (FDA, NIH, ClinVar, CPIC, PubChem, US Government works)
  • 88–94 — Licensed institutional or peer-reviewed with commercial rights (GTEx, ChEMBL, MIMIC-IV DUA)
  • 80–87 — Curated derivative or research-license source used as validator only
  • 65–79 — Synthetic data without external validation
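The tier bands above can be encoded as a small lookup, for example to check that a published Source Trust score is consistent with the tier named in its rationale. This is a sketch only: the short tier keys and the function name are hypothetical labels, not identifiers used by the catalog.

```python
# Bands copied from the list above; the tier keys are hypothetical labels.
SOURCE_TRUST_BANDS = {
    "public_domain": (95, 100),
    "licensed_institutional": (88, 94),
    "curated_derivative": (80, 87),
    "synthetic_unvalidated": (65, 79),
}


def check_source_trust(tier, score):
    """Return True if an integer score lies inside the band for its tier."""
    low, high = SOURCE_TRUST_BANDS[tier]
    return low <= score <= high


print(check_source_trust("public_domain", 97))          # True
print(check_source_trust("synthetic_unvalidated", 85))  # False
```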
Why it matters

Source provenance is the foundation of legal risk assessment. Buyers' legal teams will ask where the data originated — this dimension makes that auditable.

Curation Depth
25% weight

Rigor of the curation pipeline from raw source to delivered schema.

How it's computed
  • 90–100 — Expert-validated sample (n ≥ 1,000) + automated QC pipeline
  • 80–89 — Automated pipeline with domain-specific QC (deduplication, normalization, outlier detection)
  • 70–79 — Automated extraction with basic normalization
  • 50–69 — Minimal transformation of raw source data
Why it matters

Raw data dumps can require 3–6 months of engineering before they're usable for model training. Curation depth indicates how much of that preprocessing your team can skip.

Gold-standard validation loop

For each relevant dataset, we run concordance checks against authoritative external databases. Some of these databases (OncoKB, PharmGKB, OMIM) are used exclusively as validation oracles, that is, as internal quality gates, because of their license terms, and we never redistribute their content. The output we publish is a concordance percentage, not their data.

This creates a feedback loop: synthetic or curated data is generated → validated against the gold standard → discordant records are flagged and corrected → data quality improves → the concordance score updates. As gold standards release new versions, we re-validate and update scores transparently.
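One way to picture the validation step is the sketch below, which compares a dataset's values against a gold-standard mapping and reports a concordance percentage plus the records to flag. The exact matching rules are not described here, so exact string equality on shared keys is an assumption, and the function name, record IDs, and labels are invented for illustration.

```python
def concordance(records, gold):
    """Compare values keyed by record ID against a gold-standard mapping.

    Returns (percentage agreeing, sorted list of discordant keys). Only keys
    present in both mappings are compared.
    """
    shared = sorted(set(records) & set(gold))
    if not shared:
        return None, []
    discordant = [key for key in shared if records[key] != gold[key]]
    pct = 100 * (len(shared) - len(discordant)) / len(shared)
    return pct, discordant


# Made-up example values, not taken from any real database.
dataset = {
    "var1": "pathogenic",
    "var2": "benign",
    "var3": "benign",
    "var4": "uncertain",
    "var5": "benign",  # absent from the oracle, so it is not compared
}
oracle = {
    "var1": "pathogenic",
    "var2": "benign",
    "var3": "pathogenic",
    "var4": "uncertain",
}
pct, flagged = concordance(dataset, oracle)
print(pct, flagged)  # 75.0 ['var3']
```

Only the percentage and the internal list of flagged keys leave this function, which mirrors the commitment that oracle content itself is never published.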

Source | Organization | Scope | License / Usage | Status
--- | --- | --- | --- | ---
OncoKB | Memorial Sloan Kettering | Somatic variant oncogenicity and therapeutic implications for oncology drug datasets | Restrictive commercial; used as validation oracle only | Scheduled
PharmGKB | Stanford University | Pharmacogenomics variant-drug phenotype annotations for PGx datasets | CC BY-SA 4.0; used as validation oracle only | Scheduled
OMIM | Johns Hopkins University | Gene-disease associations for rare disease variant datasets | Restrictive redistribution; used as validation oracle only | Scheduled
CPIC Guidelines | CPIC Consortium | Pharmacogenomics drug-gene pair annotations | CC0 public domain; used as both source and validator | Active
ClinVar | NCBI / NIH | Variant pathogenicity classifications across genomics datasets | Public domain; used as both source and validator | Active
FDA Drug Labels | U.S. FDA | Drug name normalization and PGx label concordance | Public domain (17 U.S.C. § 105); used as source and validator | Active
ChEMBL | EMBL-EBI | Bioactivity, ADMET, and molecular property concordance | CC BY-SA 4.0; used as source and validator | Active
UniProt/Swiss-Prot | UniProt Consortium | Protein target identity and function annotation | CC BY 4.0; used as source and validator | Active

Score interpretation

90–100
Production-ready
Suitable for commercial AI training with minimal preprocessing. High confidence in completeness, provenance, and curation rigor.
80–89
Research-grade
Strong dataset with one or two documented limitations. Review the dimension breakdown to understand where to focus preprocessing effort.
70–79
Baseline
Usable for research and experimentation. Expect field gaps or source trust caveats. Rationale explains the specific tradeoffs.
< 70
Specialist use
Significant known limitations that affect general-purpose use. Included for specialist applications where the limitations are understood.
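
The bands above translate directly into a threshold check. The sketch below assumes the AQS is already an integer in 0–100, as stated for the composite; the function name is hypothetical.

```python
def interpret(aqs):
    """Map an integer AQS to the band label listed above."""
    if not 0 <= aqs <= 100:
        raise ValueError("AQS must be in 0-100")
    if aqs >= 90:
        return "Production-ready"
    if aqs >= 80:
        return "Research-grade"
    if aqs >= 70:
        return "Baseline"
    return "Specialist use"


print(interpret(89))  # Research-grade
```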

Transparency commitments

Questions about methodology? Contact [email protected]