alphagenome.models.variant_scorers.tidy_scores
- alphagenome.models.variant_scorers.tidy_scores(scores, match_gene_strand=True, include_extended_metadata=True)
Formats scores into a tidy (long) pandas DataFrame.
This function reformats scores into a more readable DataFrame with one score per row. It supports scores generated by both variant scorers (e.g., the output of score_variant or score_variants) and interval scorers (e.g., the output of score_interval or score_intervals). It handles both variant/interval-centric scoring (producing one score row per variant/interval-track pair) and gene-centric scoring (one score row per variant/interval-gene-track combination).
The function accepts these score input types:
A sequence of AnnData objects (e.g., output of score_variant with one or more scorers).
A nested sequence of AnnData objects (e.g., output of score_variants with multiple variants).
Scores from multiple scorers or multiple variants/intervals are concatenated together into a single pandas DataFrame, containing the union of all applicable columns across scorers.
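To make "the union of all applicable columns" concrete, here is a minimal sketch using plain pandas. It is not the library's implementation: the two toy tables and their score values are made up, and only the column names come from this page. Row-wise concatenation keeps every column that appears in any input and leaves gaps where a column does not apply (pandas uses NaN here; the placeholder used by tidy_scores may differ per column, as listed under Returns).

```python
import pandas as pd

# Toy stand-ins for two per-scorer tables. Column names follow this page;
# the score values are invented for illustration only.
variant_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "output_type": ["DNASE"],
    "raw_score": [0.12],
})
gene_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "gene_name": ["APOL1"],
    "output_type": ["RNA_SEQ"],
    "raw_score": [-0.34],
})

# Row-wise concatenation keeps the union of columns; cells for columns that
# a given input lacks are filled with NaN.
combined = pd.concat([variant_centric, gene_centric], ignore_index=True)
print(combined)
```

The variant-centric row ends up with a missing gene_name, which is why downstream code should expect missing values in gene-level columns.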
- Parameters:
scores (Sequence[AnnData] | Sequence[Sequence[AnnData]]) – Scoring output as either a sequence of AnnData objects or a nested sequence of AnnData objects (for example, the outputs of score_variant and score_variants, respectively).
match_gene_strand (bool (default: True)) – If True (and using gene-centric scoring), rows with mismatched gene and track strands are removed.
include_extended_metadata (bool (default: True)) – Argument passed to tidy_anndata to include additional metadata columns where available.
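The effect of match_gene_strand can be pictured as a row filter on the gene_strand and track_strand columns. The sketch below is an illustration on a toy table, not the library's code. The gene name and scores are hypothetical, and keeping unstranded tracks ('.') is an assumption, since this page does not say how they are treated.

```python
import pandas as pd

# Hypothetical gene-centric rows: one gene on the '+' strand scored against
# three tracks with different strands. Values are invented.
tidy = pd.DataFrame({
    "gene_name": ["GENE_A", "GENE_A", "GENE_A"],
    "gene_strand": ["+", "+", "+"],
    "track_strand": ["+", "-", "."],
    "raw_score": [0.5, 0.1, 0.2],
})

# ASSUMPTION: unstranded tracks ('.') are compatible with any gene strand.
# The reference only states that mismatched rows are removed.
strand_ok = (tidy["track_strand"] == ".") | (
    tidy["gene_strand"] == tidy["track_strand"]
)
matched = tidy[strand_ok].reset_index(drop=True)
print(matched)
```

Under that assumption, only the row pairing the '+' gene with the '-' track is dropped.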
- Returns:
A pandas DataFrame with one score per row and the following columns:
variant_id (when applicable): Variant of interest (e.g. chr22:36201698:A>C).
scored_interval: Genomic interval scored (e.g. chr22:36100000-36300000).
gene_id: Ensembl gene identifier without version number (e.g. ENSG00000100342), or None if not applicable.
gene_name: HGNC gene symbol (e.g. APOL1), or None if not applicable.
gene_type: Gene biotype (e.g. protein_coding, lncRNA), or None if not applicable.
gene_strand: Strand of the gene (‘+’, ‘-’, ‘.’), or None if not applicable.
output_type: Type of the output from the model (e.g. RNA_SEQ, DNASE).
interval_scorer (when applicable): Name of the interval scorer used.
variant_scorer (when applicable): Name of the variant scorer used.
track_name: Name of the output track (e.g. UBERON:0036149 total RNA-seq).
track_strand: Strand of the track (‘+’, ‘-’, or ‘.’).
ontology_curie: Ontology term for the cell type or tissue of the track (e.g. UBERON:0036149), or NaN if not applicable.
gtex_tissue: Name of the GTEx tissue (e.g. Liver), or NaN if not applicable.
Assay title: Subtype of the assay (e.g., total RNA-seq), or NaN if not applicable.
biosample_name: Name of the biosample (e.g. liver), or NaN if not applicable.
biosample_type: Type of biosample (e.g. ‘tissue’ or ‘primary cell’), or NaN if not applicable.
transcription_factor: Name of the transcription factor (e.g. ‘CTCF’), or NaN if not applicable.
histone_mark: Name of the histone mark (e.g. ‘H3K4ME3’), or NaN if not applicable.
raw_score: Raw variant score.
quantile_score (when applicable): Quantile score.
- Return type:
pd.DataFrame
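Because the result is an ordinary long-format DataFrame, standard pandas operations apply to it. The sketch below builds a small table by hand (track names and scores are hypothetical; only the column names are taken from the list above) and ranks RNA_SEQ rows by absolute raw_score. With real data, the table would come from tidy_scores instead.

```python
import pandas as pd

# Hand-built stand-in for a tidy score table. Track names and scores are
# hypothetical placeholders.
tidy = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"] * 3,
    "output_type": ["RNA_SEQ", "RNA_SEQ", "DNASE"],
    "track_name": ["track_a", "track_b", "track_c"],
    "raw_score": [0.8, -1.5, 0.3],
})

# Keep one output type, then rank tracks by effect size regardless of sign.
rna = tidy[tidy["output_type"] == "RNA_SEQ"]
top = rna.sort_values("raw_score", key=lambda s: s.abs(), ascending=False)
print(top[["track_name", "raw_score"]])
```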
- Raises:
ValueError – If the input is not a valid type (a sequence of AnnData objects or a nested sequence of AnnData objects).
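One way to picture the input check is a helper that accepts either shape, flattens it, and raises ValueError otherwise. This is a sketch under stated assumptions, not the library's code: FakeAnnData is a hypothetical stand-in for anndata.AnnData so the example runs without extra dependencies, and flatten_scores is an illustrative name.

```python
from collections.abc import Sequence


class FakeAnnData:
    """Hypothetical stand-in for anndata.AnnData (illustration only)."""


def _is_plain_sequence(obj):
    # Strings are sequences too, but never valid score containers.
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))


def flatten_scores(scores, leaf_type=FakeAnnData):
    """Return a flat list of score objects from a flat or nested sequence."""
    if not _is_plain_sequence(scores):
        raise ValueError("scores must be a sequence or nested sequence")
    # Flat case: every element is a score object.
    if all(isinstance(item, leaf_type) for item in scores):
        return list(scores)
    # Nested case: every element is itself a sequence of score objects.
    if all(
        _is_plain_sequence(item)
        and all(isinstance(leaf, leaf_type) for leaf in item)
        for item in scores
    ):
        return [leaf for item in scores for leaf in item]
    raise ValueError("scores must be a sequence or nested sequence")


a, b, c = FakeAnnData(), FakeAnnData(), FakeAnnData()
flat = flatten_scores([a, b])            # shape of a single-variant result
nested = flatten_scores([[a, b], [c]])   # shape of a multi-variant result
```

Mixed inputs such as a list containing both score objects and lists fall through both checks and raise ValueError.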