FAQ#

Frequently asked questions.

Model inputs#

How do I make predictions for a specific genomic region?#

You can define any region in either the human or mouse genome, and use the API to predict various outputs. See the quick start colab for a demonstration.

How do I specify a genomic region?#

Using the genome.Interval class, which is initialized with a chromosome, a start, and an end position.

Note

AlphaGenome classes such as genome.Interval use 0-based indexing, consistent with the underlying Python implementations.

This means a genome.Interval spans the base pair at the start position through the base pair at position end - 1; the end position itself is excluded.

For example, to specify the first base pair of chromosome 1, use genome.Interval('chr1', 0, 1). This interval has a width of 1, and contains only the base pair at the first position of chromosome 1.

To interpret interval overlaps, remember that the base pair at the end position itself is excluded, so adjacent intervals do not overlap: genome.Interval('chr1', 0, 1).overlaps(genome.Interval('chr1', 1, 2)) returns False.
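The half-open convention can be illustrated without the library. The sketch below is a minimal stand-in written for this FAQ, not the genome.Interval implementation; the class name HalfOpenInterval and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfOpenInterval:
    """Hypothetical stand-in for a 0-based, half-open interval [start, end)."""

    chromosome: str
    start: int
    end: int

    @property
    def width(self) -> int:
        # The end position is excluded, so the width is simply end - start.
        return self.end - self.start

    def overlaps(self, other: 'HalfOpenInterval') -> bool:
        # Two half-open intervals overlap only if they share at least one base.
        return (
            self.chromosome == other.chromosome
            and self.start < other.end
            and other.start < self.end
        )


first_base = HalfOpenInterval('chr1', 0, 1)
second_base = HalfOpenInterval('chr1', 1, 2)
```

Here first_base has a width of 1, and first_base.overlaps(second_base) is False because the two intervals only touch at position 1, which the first interval excludes.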

What are the reference genome versions used by the model?#

We use human genome assembly hg38 (GRCh38.p13.genome.fa) and mouse assembly mm10 (GRCm38.p6.genome.fa). For other genome builds (such as hg19), the LiftOver tool can be used to convert from hg38 coordinates to the desired assembly.

Can I make a prediction for any arbitrary DNA sequence?#

Yes, you can make predictions for any sequence, provided it is within the range of sequence lengths supported by the model. Note that model predictions have only been evaluated using sequences that vary by a relatively small amount from the reference genome (SNPs and indels), so very large differences from the human reference genome (for example, structural variants, sequences with a large amount of padding, synthetic sequences, or artificial DNA constructs) may result in predictions that are not as reliable.

Can I make predictions for DNA from other species?#

Yes, with the caveat that the model has only been trained on mouse and human DNA. Prediction quality is likely to degrade as evolutionary distance from these two species increases, but note that this has not been formally benchmarked.

What is the longest sequence the model can take as input?#

1MB (precisely 2^20 base-pairs long). Other sequence lengths are also supported: ~16KB, ~100KB, ~500KB.

How do I request predictions for a sequence with a length that is not in the list of supported lengths?#

You can use genome.Interval.resize to crop or expand your sequence length to the nearest supported length.

Note that .resize expands sequences using the actual surrounding genomic data, not by adding padding.
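To make the arithmetic concrete, here is a minimal sketch of choosing a target length and resizing around the interval center. It does not call the library. The exact values behind ~16KB, ~100KB and ~500KB are assumed here to be powers of two (only 2^20 is stated in this FAQ), the sketch picks the smallest supported length that fits the region, and centered resizing is an assumption about how .resize behaves. The function names are hypothetical.

```python
# Assumed exact lengths: only 2**20 is stated in this FAQ.
SUPPORTED_LENGTHS = (2**14, 2**17, 2**19, 2**20)


def smallest_supported_length(width: int) -> int:
    """Return the smallest supported length that can hold `width` bases."""
    for length in SUPPORTED_LENGTHS:
        if length >= width:
            return length
    raise ValueError(f'{width} exceeds the longest supported length.')


def resize_centered(start: int, end: int, new_width: int) -> tuple[int, int]:
    """Return a 0-based, half-open span of `new_width` around the center."""
    center = (start + end) // 2
    new_start = center - new_width // 2
    # Near a chromosome start, new_start can become negative and would need
    # separate handling; this sketch does not cover that case.
    return new_start, new_start + new_width


start, end = 1_000_000, 1_050_000  # a 50,000 bp region
target = smallest_supported_length(end - start)
resized = resize_centered(start, end, target)
```

With these inputs, target is 131072 and the resized span keeps the same center as the original region.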

Model outputs#

How many tracks are there per output type and what do they represent?#

This varies from 5 to over 600. Each of the tracks refers to a particular cell-type or tissue, as well as other properties, such as strand or a specific transcription factor (for the CHIP_TF output type). See the output metadata documentation for a full list of the output types.

How do I find out what tissue or cell-type an output ‘track’ refers to?#

Using the navigating data ontologies notebook, you can look at the output metadata where biosample names and ontology CURIEs (IDs) for each track are described.

What is an ontology CURIE?#

CURIEs (Compact Uniform Resource Identifiers) are standardized, abbreviated codes (e.g., ‘UBERON:0001114’ for liver) that uniquely identify specific ontology terms.

Where are your ontology CURIEs sourced from?#

We source these from the IDs provided in the source training data. We also restricted the ontology types to UBERON, CL, CLO and EFO, following ENCODE practices. We recommend using EBI’s Ontology Lookup Service to understand relationships between the ontology IDs for different tracks.
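A CURIE is a prefix and a local identifier joined by a colon, so it can be split with plain string handling. The sketch below is illustrative only; the helper name split_curie is hypothetical, and the prefix set simply restates the ontology types listed above.

```python
# Ontology types listed in this FAQ.
ALLOWED_PREFIXES = {'UBERON', 'CL', 'CLO', 'EFO'}


def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE such as 'UBERON:0001114' into (prefix, local_id)."""
    prefix, separator, local_id = curie.partition(':')
    if not separator or not prefix or not local_id:
        raise ValueError(f'Not a CURIE: {curie!r}')
    return prefix, local_id


prefix, local_id = split_curie('UBERON:0001114')
is_expected_ontology = prefix in ALLOWED_PREFIXES
```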

What is strandedness?#

DNA is double-stranded, meaning that there are two nucleotide strands that form the double helix. By convention, one of those molecules is designated the forward, or positive strand (5’->3’), and the other is designated the reverse, or negative strand (3’->5’).

Genomic assays can either be unstranded or stranded (also called strand-specific).

Unstranded assays return results that do not distinguish whether a measurement came from the positive or negative strand. Certain assays do not generate stranded information – for example, ATAC-seq generates unstranded accessibility information.
Stranded (or strand-specific) assays annotate each measurement as coming from the positive or negative strand. This is important for transcriptional assays to distinguish between strand-specific transcripts (for example, two transcripts that share a transcriptional start site but are on different strands).

Note

Not all RNA-seq samples will be stranded, especially those that are from older experiments. For example, GTEx RNA-seq data is unstranded.

For more general information about the difference between non-stranded and stranded protocols and how to interpret them, there is a helpful tutorial here.

How is strandedness handled in model outputs?#

In the model output metadata, we use the following symbols to designate the strand of a track:

positive: +
negative: -
unstranded: .

For assays that were performed in a stranded (or strand-specific) manner, the assay will have two tracks per cell or tissue type: one for the positive (+) and another for the negative (-) strand.

For unstranded assays, there will be a single track per cell or tissue type, annotated as unstranded (.).

We provide convenience operations for manipulating TrackData based on strand information, such as filter_to_negative_strand().
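The idea behind strand filtering can be sketched on plain metadata records. This example does not use TrackData; the record layout (dictionaries with 'name' and 'strand' keys), the track names, and the helper filter_by_strand are all hypothetical.

```python
# Hypothetical metadata: a stranded assay has two tracks, an unstranded one has one.
tracks = [
    {'name': 'tissue_a RNA-seq', 'strand': '+'},
    {'name': 'tissue_a RNA-seq', 'strand': '-'},
    {'name': 'tissue_a ATAC-seq', 'strand': '.'},
]


def filter_by_strand(records: list[dict], strand: str) -> list[dict]:
    """Keep only the records annotated with the given strand symbol."""
    if strand not in {'+', '-', '.'}:
        raise ValueError(f'Unknown strand symbol: {strand!r}')
    return [record for record in records if record['strand'] == strand]


negative_tracks = filter_by_strand(tracks, '-')
unstranded_tracks = filter_by_strand(tracks, '.')
```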

How can I save the model outputs?#

For variant effect predictions: We recommend converting the scores into a pandas DataFrame. This DataFrame can then be easily exported to a common file format, such as a CSV file, for use with other tools or for record-keeping. Specific instructions and examples for this process are provided in our ‘Variant Scoring UI’ tutorial.

For genome track predictions (e.g., RNA-seq levels): The predicted track data is provided as NumPy arrays within TrackData objects. These arrays can be directly saved to disk using standard NumPy functions, such as numpy.save (for saving a single array to a .npy file) or numpy.savez_compressed (for saving multiple arrays into a single compressed .npz file).

What are some of the limitations of the model?#

AlphaGenome has several key limitations:

Tissue-specificity and long-range interactions: While AlphaGenome shows improvements in these areas compared to previous models, accurately capturing tissue-specific effects and long-range genomic interactions remains challenging for deep learning models in genomics, requiring further research.
Species scope: The model is trained and evaluated on human and mouse DNA. Its performance on DNA from other species has not been determined.
Personal genomes: The model has not yet been benchmarked for predicting individual (personal) human genomes.
Molecular scope: AlphaGenome predicts the molecular consequences of genetic variations. Its direct applicability to complex trait analysis is limited, as these traits also involve broader biological processes (e.g., gene function, development, environmental factors) beyond the model’s primary focus.
Unphased training and single sequence input: The model processes a single DNA sequence at a time and is therefore not inherently ‘diploid-aware’. It was trained using unphased data, meaning it could not learn to distinguish between alleles inherited from the mother versus the father. Consequently, its variant effect predictions do not inherently model heterozygous states (i.e., the presence of both a reference and a variant allele at a site simultaneously).

Visualizing predictions#

How do I visualize the predicted output?#

You can use any tool to visualize the numerical output, but we provide a Python visualization library so you can easily visualize the output immediately. You can use our visualization basics guide and see examples of how to plot different modalities in our visualizing predictions tutorial.

Can I design my own visualizations to work with this library?#

Yes. The returned figures are based on matplotlib, so they should be extensible. Additionally, you can choose to work with the raw output data and design your own visualizations.

Where are the plotted transcript annotations from?#

Transcript annotations are sourced from standard Gene Transfer Format (GTF) files from GENCODE: the hg38 reference assembly (release 46) for human and the mm10 reference assembly (release M23) for mouse.

Am I limited to only plotting protein-coding genes, and only the longest transcript?#

No. If you wish to include other gene types or all transcripts (not just the longest), you can remove the respective calls to gene_annotation.filter_protein_coding(gtf) and gene_annotation.filter_to_longest_transcript(gtf) in your code. Note that including more transcripts can make the plot appear busy; you can adjust the fig_height parameter of the TranscriptAnnotation plot component to improve legibility.

Variant scoring#

How do I define a variant?#

By creating a Variant object.

Note

As mentioned above, AlphaGenome classes such as Variant use 0-indexing, and Variant’s start() and end() contain 0-indexed values.

However, most variants in public databases, such as dbSNP, are provided as 1-indexed.

To enable compatibility with these annotations, the Variant object is initialized with a 1-indexed position attribute, which is then converted to 0-indexing internally (i.e., start() returns position - 1).

See the Variant docstring for more details.
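The coordinate conversion can be written out as a small sketch. It does not use the Variant class; the helper name is hypothetical, and computing the end as start plus the length of the reference allele is an assumption that is not stated in this FAQ.

```python
def variant_zero_based_span(position: int, reference_bases: str) -> tuple[int, int]:
    """Convert a 1-indexed variant position to a 0-based, half-open span."""
    if position < 1:
        raise ValueError('A 1-indexed position must be at least 1.')
    start = position - 1  # Stated in the FAQ: start is position - 1.
    end = start + len(reference_bases)  # Assumption: the span covers the REF allele.
    return start, end


snv_span = variant_zero_based_span(100, 'A')
deletion_span = variant_zero_based_span(100, 'ACG')
```

For a single-base reference allele at 1-indexed position 100, the 0-based span is (99, 100), which has a width of 1.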

Are there tools to help me define variants, and run inference for them?#

See the scoring and visualizing a single variant notebook, which walks through how to define a Variant object and perform inference. Batch inference over many variants can be performed using the batch variant scoring notebook, which takes a Variant Call Format (VCF) file as input.

Can I pass any sequence to reference_bases or does it have to match the reference genome sequence at the variant location?#

You can pass any sequence to reference_bases. Note that predict_variant() is agnostic to the alleles in the reference genome and instead uses the REF/ALT alleles specified by the user.

Are variant predictions for insertions and deletions (indels) supported?#

Yes. We use left-alignment to specify indels. See Variant for more details. For scoring indels, we adopt SpliceAI’s [Jaganathan et al., 2019] indel alignment strategy: inserted bases are summarized by taking the maximum value over the inserted segment, while deleted bases are treated as having zero signal in the ALT context, thereby enabling consistent positional comparisons.
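The alignment idea can be sketched on plain Python lists. This is one reading of the strategy described above, not the library's code: it handles only simple left-aligned indels where the shorter allele is a single anchor base, it includes the anchor base in the maximum for insertions, and the function name and arguments are hypothetical.

```python
def align_alt_to_ref(alt_track: list[float], var_start: int,
                     ref_len: int, alt_len: int) -> list[float]:
    """Map an ALT track onto REF coordinates for a simple indel.

    var_start is the 0-based index of the anchor base within the track.
    """
    if alt_len > ref_len:
        # Insertion: collapse the anchor plus inserted bases into one value
        # by taking the maximum (including the anchor is an assumption).
        if ref_len != 1:
            raise ValueError('This sketch expects a single-base REF allele.')
        collapsed = max(alt_track[var_start:var_start + alt_len])
        return (alt_track[:var_start] + [collapsed]
                + alt_track[var_start + alt_len:])
    if ref_len > alt_len:
        # Deletion: deleted bases have zero signal in the ALT context.
        if alt_len != 1:
            raise ValueError('This sketch expects a single-base ALT allele.')
        gap = [0.0] * (ref_len - alt_len)
        return (alt_track[:var_start + alt_len] + gap
                + alt_track[var_start + alt_len:])
    return list(alt_track)  # Same length: already aligned.


# Insertion of two bases after the anchor at index 1 (REF length 1, ALT length 3).
insertion_aligned = align_alt_to_ref([1.0, 2.0, 5.0, 3.0, 4.0], 1, 1, 3)
# Deletion of two bases after the anchor at index 1 (REF length 3, ALT length 1).
deletion_aligned = align_alt_to_ref([1.0, 2.0, 4.0], 1, 3, 1)
```

After alignment, both results have the length of the corresponding REF track (3 and 5 positions respectively), so REF and ALT values can be compared position by position.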

Which variant scorer should I use for a given modality?#

In practice, you can use most variant scoring strategies for any modality. However, we provide a recommendation for the best strategies based on our evaluations in the variant scoring documentation.

Can I write my own variant scoring strategy?#

We do not currently support users writing their own variant scoring strategy. However, since variant scoring is simply aggregating REF and ALT track predictions, you can write your own methods for handling these values.
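As an illustration of handling REF and ALT values yourself, the sketch below aggregates two tracks into a single number. The log2 fold change of summed signal with a pseudocount is a hypothetical choice made for this example, not one of the library's scorers, and the track values are made up.

```python
import math


def log2_fold_change(ref_track: list[float], alt_track: list[float],
                     pseudocount: float = 1.0) -> float:
    """Hypothetical aggregation: log2 ratio of total ALT to total REF signal."""
    ref_total = sum(ref_track)
    alt_total = sum(alt_track)
    # The pseudocount avoids division by zero for silent tracks.
    return math.log2((alt_total + pseudocount) / (ref_total + pseudocount))


ref_values = [1.0, 2.0, 4.0]   # Made-up REF predictions over a window.
alt_values = [3.0, 4.0, 8.0]   # Made-up ALT predictions over the same window.
score = log2_fold_change(ref_values, alt_values)
```

A positive value indicates more predicted signal for the ALT allele than for the REF allele within the window, and a negative value indicates less.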

What is the difference between a ‘quantile_score’ and ‘raw_score’?#

The ‘raw_score’ is the direct output of a particular variant scoring strategy. However, different tracks and modalities yield scores on different scales. For instance, the Splice Sites Usage scorer returns values between 0 and 1, whereas the Gene Expression (RNA-seq) scorer returns unbounded negative or positive values.

To facilitate comparisons across tracks and variant scoring strategies, we use an empirical quantiles approach (see [Avsec et al., 2025] for full details). Briefly, we estimate a background distribution for each variant scorer and track using scores for common variants (MAF > 0.01 in any gnomAD v3 population). We can then convert any ‘raw_score’ into a ‘quantile_score’ representing its rank within this background distribution. For example, a variant with a quantile score of 0.99 has a raw score equivalent to the 99th percentile of common variants. This provides a measure of predicted impact that is standardized to the same scale across different variant scorers and tracks.

Quantile scores never exceed 0.999990 in magnitude (that is, they lie between -0.999990 and 0.999990), because of the number of variants used to compute the quantiles (~300K). We therefore recommend using quantile scores as an indicator of whether a raw score is unusually large, and raw scores as a measure of the magnitude of the effect for a given scorer and track.

For signed variant scores (which indicate effect direction like up-regulation or down-regulation), their [0,1] quantile probabilities – derived directly from the rank order of the original signed raw scores – are linearly transformed to a [-1,1] range. This rescaling ensures the quantile score reflects the directionality of the raw score. For instance, the 0th percentile (representing the most negative raw scores) maps to -1, the 50th percentile (raw scores around zero) to 0, and the 100th percentile (most positive raw scores) to +1.

Note that quantile scores are only available for the suite of recommended scorers.
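The rank-based conversion can be sketched in a few lines. This is a simplified illustration, not the procedure from [Avsec et al., 2025]: the tiny background list is made up, ties are handled with a mid-rank convention chosen for this example, and no clipping is applied, so unlike the library's quantile scores this sketch can reach exactly -1 or +1.

```python
from bisect import bisect_left, bisect_right


def empirical_quantile(sorted_background: list[float], raw_score: float) -> float:
    """Rank of raw_score within a sorted background, as a value in [0, 1]."""
    n = len(sorted_background)
    below = bisect_left(sorted_background, raw_score)
    below_or_equal = bisect_right(sorted_background, raw_score)
    # Mid-rank convention for ties (an assumption made for this sketch).
    return (below + below_or_equal) / (2 * n)


def signed_quantile(sorted_background: list[float], raw_score: float) -> float:
    """Linearly rescale a [0, 1] quantile to [-1, 1] for signed scores."""
    return 2.0 * empirical_quantile(sorted_background, raw_score) - 1.0


# Made-up background of signed raw scores, sorted in ascending order.
background = [-2.0, -1.0, 0.0, 1.0, 2.0]
median_score = signed_quantile(background, 0.0)
extreme_score = signed_quantile(background, 10.0)
```

A raw score at the middle of the background maps to 0, while a raw score larger than every background value maps to the top of the range.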

Other#

What terms of use apply to AlphaGenome outputs?#

The AlphaGenome API is provided for non-commercial use only and is subject to the AlphaGenome Terms of Service. Outputs generated by AlphaGenome should not be used for the training of other machine learning models.

How should I cite AlphaGenome?#

If you use AlphaGenome in your research, please cite using:

@article{alphagenome,
  title={{AlphaGenome}: advancing regulatory variant effect prediction with a unified {DNA} sequence model},
  author={Avsec, {\v Z}iga and Latysheva, Natasha and Cheng, Jun and Novati, Guido and Taylor, Kyle R. and Ward, Tom and Bycroft, Clare and Nicolaisen, Lauren and Arvaniti, Eirini and Pan, Joshua and Thomas, Raina and Dutordoir, Vincent and Perino, Matteo and De, Soham and Karollus, Alexander and Gayoso, Adam and Sargeant, Toby and Mottram, Anne and Wong, Lai Hong and Drot{\'a}r, Pavol and Kosiorek, Adam and Senior, Andrew and Tanburn, Richard and Applebaum, Taylor and Basu, Souradeep and Hassabis, Demis and Kohli, Pushmeet},
  year={2025},
  doi={10.1101/2025.06.25.661532},
  publisher={Cold Spring Harbor Laboratory},
  journal={bioRxiv}
}

Who should I contact with issues, enquiries and feedback?#

Submit bugs and any code-related issues on GitHub. For general feedback, questions about usage, and/or feature requests, please use the community forum – it’s actively monitored by our team so you’re likely to find answers and insights faster. If you can’t find what you’re looking for, please get in touch with the AlphaGenome team at alphagenome@google.com and we will be happy to assist you with questions. We’re working hard to answer all inquiries but there may be a short delay in our response due to the high volume we are receiving.
