alphagenome.models.variant_scorers.tidy_scores


alphagenome.models.variant_scorers.tidy_scores(scores, match_gene_strand=True, include_extended_metadata=True)

Formats scores into a tidy (long) pandas DataFrame.

This function reformats scores into a more readable DataFrame with one score per row. It supports scores generated by variant scorers (e.g., the output of score_variant or score_variants) as well as by interval scorers (e.g., the output of score_interval or score_intervals).

It handles both variant/interval-centric scoring (producing one score row per variant/interval-track pair) and gene-centric scoring (one score row per variant/interval-gene-track combination).

The function accepts these score input types:

  • A sequence of AnnData objects (e.g., output of score_variant with one or more scorers).

  • A nested sequence of AnnData objects (e.g., output of score_variants with multiple variants).

Scores from multiple scorers or multiple variants/intervals are concatenated together into a single pandas DataFrame, containing the union of all applicable columns across scorers.
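The "union of all applicable columns" behaviour can be pictured with plain pandas. The sketch below does not call tidy_scores; it concatenates two toy per-scorer tables (made-up scorer names and values) to show how a column that only one scorer produces ends up in the combined table, with missing entries for the other scorer. How tidy_scores fills those missing entries internally is not shown here.

```python
import pandas as pd

# Toy per-scorer tables; scorer names and score values are made up.
gene_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "gene_name": ["GENE_A"],
    "variant_scorer": ["scorer_a"],
    "raw_score": [0.12],
})
variant_centric = pd.DataFrame({
    "variant_id": ["chr22:36201698:A>C"],
    "variant_scorer": ["scorer_b"],
    "raw_score": [-0.03],
    "quantile_score": [-0.40],
})

# pd.concat performs an outer join on columns by default, so the result
# carries every column seen in either input.
combined = pd.concat([gene_centric, variant_centric], ignore_index=True)
print(combined)
```

Rows from the variant-centric table have no gene_name, and rows from the gene-centric table have no quantile_score, so those cells are missing in the combined table.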

Parameters:
  • scores (Sequence[AnnData] | Sequence[Sequence[AnnData]]) – Scoring output as either a sequence of AnnData objects, or a nested sequence of AnnData objects (for example, the outputs of score_variant and score_variants, respectively).

  • match_gene_strand (bool (default: True)) – If True (and using gene-centric scoring), rows with mismatched gene and track strands are removed.

  • include_extended_metadata (bool (default: True)) – If True, additional metadata columns are included where available. This argument is passed through to tidy_anndata.
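To make the effect of match_gene_strand concrete, here is a minimal sketch of a strand filter applied to a toy table. It is not the library's implementation. The helper name drop_strand_mismatches is hypothetical, and the choice to keep unstranded tracks ('.') is an assumption that this page does not confirm.

```python
import pandas as pd


def drop_strand_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Keeps rows whose track strand agrees with the gene strand.

    Assumption (not stated in the documentation): unstranded tracks,
    marked '.', are never treated as a mismatch and are kept.
    """
    same_strand = df["gene_strand"] == df["track_strand"]
    unstranded_track = df["track_strand"] == "."
    return df[same_strand | unstranded_track].reset_index(drop=True)


# Toy gene-centric rows for a hypothetical gene on the '+' strand.
toy = pd.DataFrame({
    "gene_name": ["GENE_A", "GENE_A", "GENE_A"],
    "gene_strand": ["+", "+", "+"],
    "track_strand": ["+", "-", "."],
})

filtered = drop_strand_mismatches(toy)
print(filtered)
```

Under these assumptions the '-' track row is dropped, because it reports signal from the strand opposite to the gene.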

Returns:

A pandas DataFrame with the following columns:

  • variant_id (when applicable): Variant of interest (e.g. chr22:36201698:A>C).

  • scored_interval: Genomic interval scored (e.g. chr22:36100000-36300000).

  • gene_id: Ensembl gene identifier without version number (e.g. ENSG00000100342), or None if not applicable.

  • gene_name: HGNC gene symbol (e.g. APOL1), or None if not applicable.

  • gene_type: Gene biotype (e.g. protein_coding, lncRNA), or None if not applicable.

  • gene_strand: Strand of the gene (‘+’, ‘-’, or ‘.’), or None if not applicable.

  • output_type: Type of the output from the model (e.g. RNA_SEQ, DNASE).

  • interval_scorer (when applicable): Name of the interval scorer used.

  • variant_scorer (when applicable): Name of the variant scorer used.

  • track_name: Name of the output track (e.g. UBERON:0036149 total RNA-seq).

  • track_strand: Strand of the track (‘+’, ‘-’, or ‘.’).

  • ontology_curie: Ontology term for the cell type or tissue of the track (e.g. UBERON:0036149), or NaN if not applicable.

  • gtex_tissue: Name of the GTEx tissue (e.g. Liver), or NaN if not applicable.

  • Assay title: Subtype of the assay (e.g., total RNA-seq), or NaN if not applicable.

  • biosample_name: Name of the biosample (e.g. liver), or NaN if not applicable.

  • biosample_type: Type of biosample (e.g. ‘tissue’ or ‘primary cell’), or NaN if not applicable.

  • transcription_factor: Name of the transcription factor (e.g. ‘CTCF’), or NaN if not applicable.

  • histone_mark: Name of the histone mark (e.g. ‘H3K4ME3’), or NaN if not applicable.

  • raw_score: Raw score produced by the variant or interval scorer.

  • quantile_score (when applicable): Quantile score.

Return type:

pd.DataFrame
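Because the result is in tidy (long) form, with one score per row, it can be reshaped with standard pandas operations. The sketch below uses a toy table with made-up identifiers and scores, restricted to three of the documented columns, and pivots it into a wide variant-by-track matrix.

```python
import pandas as pd

# Toy long-format table; identifiers and scores are made up.
tidy = pd.DataFrame({
    "variant_id": ["v1", "v1", "v2", "v2"],
    "track_name": ["track_a", "track_b", "track_a", "track_b"],
    "raw_score": [0.5, -0.2, 0.1, 0.3],
})

# One row per variant, one column per track.
wide = tidy.pivot(index="variant_id", columns="track_name", values="raw_score")
print(wide)
```

DataFrame.pivot raises an error if a (variant_id, track_name) pair occurs more than once, which can happen when several scorers or genes are present. In that case, filter to a single scorer or gene first, or add those columns to the index.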

Raises:

ValueError – If the input is not a valid type (a sequence of AnnData objects or a nested sequence of AnnData objects).
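The two accepted input shapes and the ValueError for anything else can be sketched as a small normalisation step. This is an illustration only, not the library's code. flatten_scores and FakeAnnData are hypothetical names, and FakeAnnData stands in for anndata.AnnData so the example runs without extra dependencies.

```python
from collections.abc import Sequence


class FakeAnnData:
    """Hypothetical stand-in for anndata.AnnData."""


def _is_sequence(obj) -> bool:
    # Strings are sequences too, but are never valid score containers.
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))


def flatten_scores(scores, leaf_type=FakeAnnData):
    """Returns a flat list of score objects from a flat or nested sequence."""
    if not _is_sequence(scores):
        raise ValueError("scores must be a sequence or a nested sequence.")
    # Flat case: every element is a score object.
    if all(isinstance(item, leaf_type) for item in scores):
        return list(scores)
    # Nested case: every element is itself a sequence of score objects.
    if all(
        _is_sequence(item) and all(isinstance(x, leaf_type) for x in item)
        for item in scores
    ):
        return [x for item in scores for x in item]
    raise ValueError("scores must contain only score objects.")


flat = flatten_scores([FakeAnnData(), FakeAnnData()])
nested = flatten_scores([[FakeAnnData()], [FakeAnnData(), FakeAnnData()]])
print(len(flat), len(nested))
```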