alphagenome.models.variant_scorers.tidy_scores


alphagenome.models.variant_scorers.tidy_scores(scores, match_gene_strand=True, include_extended_metadata=True)

Formats scores into a tidy (long) pandas DataFrame.

This function reformats scorer output into a more readable DataFrame with one score per row. It supports scores generated by variant scorers (e.g., the output of score_variant or score_variants) as well as by interval scorers (e.g., the output of score_interval or score_intervals).

It handles both variant/interval-centric scoring (producing one score row per variant/interval-track pair) and gene-centric scoring (one score row per variant/interval-gene-track combination).

The function accepts these score input types:

A sequence of AnnData objects (e.g., output of score_variant with one or more scorers).
A nested sequence of AnnData objects (e.g., output of score_variants with multiple variants).

Scores from multiple scorers or multiple variants/intervals are concatenated together into a single pandas DataFrame, containing the union of all applicable columns across scorers.
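The "union of all applicable columns" behaviour can be pictured with plain pandas. The sketch below is not the library's implementation: it builds two small hand-written frames (every value is an illustrative placeholder, not a real score) that stand in for the per-scorer tables, and concatenates them.

```python
import pandas as pd

# Illustrative stand-ins for two per-scorer tables; all values are made up.
variant_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "output_type": ["DNASE"],
    "raw_score": [0.12],
})
gene_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "gene_name": ["APOL1"],
    "output_type": ["RNA_SEQ"],
    "raw_score": [-0.40],
})

# Concatenation keeps every column seen in either frame; cells that a
# scorer does not provide are filled with missing values.
combined = pd.concat([variant_centric, gene_centric], ignore_index=True)
print(combined)
```

In this sketch the variant-centric row has no gene_name, so that cell is missing in the combined frame.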

Parameters:

scores (Sequence[AnnData] | Sequence[Sequence[AnnData]]) – Scoring output as either a sequence of AnnData objects, or a nested sequence of AnnData objects (for example, the outputs of score_variant and score_variants, respectively).
match_gene_strand (bool (default: True)) – If True (and using gene-centric scoring), rows with mismatched gene and track strands are removed.
include_extended_metadata (bool (default: True)) – Argument passed to tidy_anndata to include additional metadata columns where available.
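As a rough illustration of what match_gene_strand=True means for gene-centric rows, the following sketch filters a hand-built frame with pandas. It is not the library's code, the rows are made up, and it assumes that unstranded tracks (‘.’) are kept for any gene strand; check the library source if that detail matters.

```python
import pandas as pd

# Made-up gene-centric rows: one '+' strand gene scored against tracks on
# the '+' strand, the '-' strand, and an unstranded ('.') track.
rows = pd.DataFrame({
    "gene_name": ["APOL1", "APOL1", "APOL1"],
    "gene_strand": ["+", "+", "+"],
    "track_strand": ["+", "-", "."],
    "raw_score": [0.8, 0.1, 0.3],
})

# Assumption: unstranded tracks are compatible with any gene strand.
keep = (rows["track_strand"] == ".") | (
    rows["track_strand"] == rows["gene_strand"]
)
matched = rows[keep].reset_index(drop=True)
print(matched)
```

Only the row pairing the ‘+’ gene with the ‘-’ track is dropped here.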

Returns:

A pandas DataFrame with the following columns:

variant_id (when applicable): Variant of interest (e.g. chr22:36201698:A>C).
scored_interval: Genomic interval scored (e.g. chr22:36100000-36300000).
gene_id: Ensembl gene identifier without version number (e.g. ENSG00000100342), or None if not applicable.
gene_name: HGNC gene symbol (e.g. APOL1), or None if not applicable.
gene_type: Gene biotype (e.g. protein_coding, lncRNA), or None if not applicable.
gene_strand: Strand of the gene (‘+’, ‘-’, or ‘.’), or None if not applicable.
output_type: Type of the output from the model (e.g. RNA_SEQ, DNASE).
interval_scorer (when applicable): Name of the interval scorer used.
variant_scorer (when applicable): Name of the variant scorer used.
track_name: Name of the output track (e.g. UBERON:0036149 total RNA-seq).
track_strand: Strand of the track (‘+’, ‘-’, or ‘.’).
ontology_curie: Ontology term for the cell type or tissue of the track (e.g. UBERON:0036149), or NaN if not applicable.
gtex_tissue: Name of the GTEx tissue (e.g. Liver), or NaN if not applicable.
Assay title: Subtype of the assay (e.g., total RNA-seq), or NaN if not applicable.
biosample_name: Name of the biosample (e.g. liver), or NaN if not applicable.
biosample_type: Type of biosample (e.g. ‘tissue’ or ‘primary cell’), or NaN if not applicable.
transcription_factor: Name of the transcription factor (e.g. ‘CTCF’), or NaN if not applicable.
histone_mark: Name of the histone mark (e.g. ‘H3K4ME3’), or NaN if not applicable.
raw_score: Raw score produced by the variant or interval scorer.
quantile_score (when applicable): Quantile score.
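Because each row holds a single score, ordinary pandas filtering and sorting apply directly to the result. The sketch below works on a hand-built frame that uses a few of the columns listed above; the track names and scores are placeholders, not real model output.

```python
import pandas as pd

# Hand-built stand-in for a tidy scores frame; values are placeholders.
tidy = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"] * 3,
    "output_type": ["RNA_SEQ", "DNASE", "RNA_SEQ"],
    "track_name": ["track_a", "track_b", "track_c"],
    "raw_score": [0.05, 0.90, -0.70],
})

# Keep one output type, then rank tracks by effect magnitude regardless of
# direction.
rna = tidy[tidy["output_type"] == "RNA_SEQ"]
top = rna.sort_values("raw_score", key=lambda s: s.abs(), ascending=False)
print(top[["track_name", "raw_score"]])
```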

Return type:

pd.DataFrame

Raises:

ValueError – If the input is not a valid type (a sequence of AnnData objects or a nested sequence of AnnData objects).
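To make the two accepted input shapes concrete, here is a minimal sketch of how a flat or one-level-nested sequence could be normalised, raising ValueError for anything else. FakeAnnData and flatten_scores are hypothetical names introduced for this illustration only; the library's actual validation logic may differ.

```python
from collections.abc import Sequence


class FakeAnnData:
    """Hypothetical stand-in for anndata.AnnData, used only in this sketch."""


def _is_sequence(obj):
    # Strings are sequences too, but never a valid container of scores.
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))


def flatten_scores(scores, leaf_type=FakeAnnData):
    """Returns a flat list from a flat or one-level-nested sequence."""
    if not _is_sequence(scores):
        raise ValueError("scores must be a sequence")
    # Flat case: every element is already a score object.
    if all(isinstance(s, leaf_type) for s in scores):
        return list(scores)
    # Nested case: every element is itself a sequence of score objects.
    if all(
        _is_sequence(s) and all(isinstance(x, leaf_type) for x in s)
        for s in scores
    ):
        return [x for s in scores for x in s]
    raise ValueError("scores must be a (nested) sequence of AnnData objects")
```

With this sketch, a list of score objects and a list of lists of score objects both flatten to the same list, while a mixed or malformed input raises ValueError.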
