FAQ#
Frequently asked questions.
Model inputs#
How do I make predictions for a specific genomic region?#
You can define any region in either the human or mouse genome, and use the API to predict various outputs. See the quick start colab for a demonstration.
How do I specify a genomic region?#
Using the genome.Interval class, which is initialized with a chromosome, a start, and an end position.
Note
AlphaGenome classes such as genome.Interval use 0-based indexing, consistent with the underlying Python implementations. This means a genome.Interval includes the base pair at the start position up to the base pair at the end-1 position.
For example, to specify the first base pair of chromosome 1, use genome.Interval('chr1', 0, 1). This interval has a width of 1, and contains only the base pair at the first position of chromosome 1.
To interpret interval overlaps, remember that intervals are half-open: the base pair at the end position itself is excluded, such that genome.Interval('chr1', 0, 1).overlaps(genome.Interval('chr1', 1, 2)) returns False.
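The half-open convention can be illustrated without the library. The following is a minimal sketch in plain Python. HalfOpenInterval is a hypothetical stand-in written for this illustration, not the genome.Interval implementation, and it only mirrors the width and overlap behaviour described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfOpenInterval:
    """Hypothetical stand-in for a 0-based, half-open interval [start, end)."""

    chromosome: str
    start: int  # 0-based, included
    end: int  # 0-based, excluded

    @property
    def width(self) -> int:
        return self.end - self.start

    def overlaps(self, other: "HalfOpenInterval") -> bool:
        # Two half-open intervals overlap only if each starts before the other ends.
        return (
            self.chromosome == other.chromosome
            and self.start < other.end
            and other.start < self.end
        )


first_base = HalfOpenInterval("chr1", 0, 1)
second_base = HalfOpenInterval("chr1", 1, 2)
```

Here first_base has a width of 1, and first_base.overlaps(second_base) is False because position 1 belongs only to the second interval.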
What are the reference genome versions used by the model?#
We use human genome assembly hg38 (GRCh38.p13.genome.fa) and mouse assembly mm10 (GRCm38.p6.genome.fa). For other genome builds (such as hg19), the LiftOver tool can be used to convert from hg38 coordinates to the desired assembly.
Can I make a prediction for any arbitrary DNA sequence?#
Yes, you can make predictions for any sequence, provided it is within the range of sequence lengths supported by the model. Note that model predictions have only been evaluated using sequences that vary by a relatively small amount from the reference genome (SNPs and indels), so very large differences from the human reference genome (for example, structural variants, sequences with a large amount of padding, synthetic sequences, or artificial DNA constructs) may result in predictions that are not as reliable.
Can I make predictions for DNA from other species?#
Yes, with the caveat that the model has only been trained on mouse and human DNA. Prediction quality is likely to degrade as evolutionary distance from these two species increases, but note that this has not been formally benchmarked.
What is the longest sequence the model can take as input?#
1MB (precisely 2^20 base-pairs long). Other sequence lengths are also supported: ~2KB, ~16KB, ~100KB, ~500KB.
How do I request predictions for a sequence with a length that is not in the list of supported lengths?#
You can use genome.Interval.resize to crop or expand your sequence length to the nearest supported length. Note that .resize expands sequences using the actual surrounding genomic data, not by adding padding.
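To see what picking a nearby supported length involves, here is a minimal sketch in plain Python. It does not call the library. It assumes the approximate lengths listed in this FAQ are the powers of two shown below (only the 2^20 value is stated exactly here) and that resizing keeps the interval centred on its midpoint. Both helper functions are hypothetical names introduced for this illustration.

```python
# Assumption: the approximate lengths listed in this FAQ correspond to these
# powers of two. Only the 1MB value (2**20) is stated exactly in the FAQ.
SUPPORTED_LENGTHS = (2**11, 2**14, 2**17, 2**19, 2**20)


def nearest_supported_length(width: int) -> int:
    """Return the supported length closest to `width` (ties pick the shorter)."""
    return min(SUPPORTED_LENGTHS, key=lambda length: (abs(length - width), length))


def resize_centered(start: int, end: int, new_width: int) -> tuple[int, int]:
    """Crop or expand a 0-based half-open interval around its midpoint."""
    center = (start + end) // 2
    new_start = center - new_width // 2
    return new_start, new_start + new_width


# A 10,000 bp region is closer to 16,384 bp than to 2,048 bp.
target = nearest_supported_length(10_000)
resized = resize_centered(1_000_000, 1_010_000, target)
```

Near the start of a chromosome, a centred expansion like this can produce a negative start, so a real workflow would need to check the result against the chromosome bounds.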
Model outputs#
How many tracks are there per output type and what do they represent?#
This varies from 5 to over 600. Each of the tracks refers to a particular cell-type or tissue, as well as other properties, such as strand or a specific transcription factor (for the CHIP_TF output type). See the output metadata documentation for a full list of the output types.
How do I find out what tissue or cell-type an output ‘track’ refers to?#
Using the navigating data ontologies notebook, you can look at the output metadata where biosample names and ontology CURIEs (IDs) for each track are described.
What is an ontology CURIE?#
CURIEs (Compact Uniform Resource Identifiers) are standardized, abbreviated codes (e.g., ‘UBERON:0001114’ for liver) that uniquely identify specific ontology terms.
Where are your ontology CURIEs sourced from?#
We source these from the IDs provided in the source training data. We also restricted the ontology types to UBERON, CL, CLO and EFO, following ENCODE practices. We recommend using EBI’s Ontology Lookup Service to understand relationships between the ontology IDs for different tracks.
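As a small illustration of the CURIE format, the sketch below splits an identifier into its ontology prefix and local ID, and checks the prefix against the four ontology types named above. The helper names are hypothetical and are not part of the AlphaGenome API.

```python
# Ontology prefixes named in this FAQ.
ALLOWED_PREFIXES = {"UBERON", "CL", "CLO", "EFO"}


def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE such as 'UBERON:0001114' into (prefix, local_id)."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    return prefix, local_id


def is_supported_ontology(curie: str) -> bool:
    """Check whether the CURIE prefix is one of the ontology types listed above."""
    prefix, _ = split_curie(curie)
    return prefix in ALLOWED_PREFIXES


prefix, local_id = split_curie("UBERON:0001114")
```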
What is strandedness?#
DNA is double-stranded, meaning that there are two nucleotide strands that form the double helix. By convention, one of those molecules is designated the forward, or positive strand (5’->3’), and the other is designated the reverse, or negative strand (3’->5’).
Genomic assays can either be unstranded or stranded (also called strand-specific).
Unstranded assays return results that do not distinguish whether a measurement came from the positive or negative strand. Certain assays do not generate stranded information – for example, ATAC-seq generates unstranded accessibility information.
Stranded (or strand-specific) assays annotate each measurement as coming from the positive or negative strand. This is important for transcriptional assays to distinguish between strand-specific transcripts (for example, two transcripts that share a transcriptional start site but are on different strands).
Note
Not all RNA-seq samples will be stranded, especially those that are from older experiments. For example, GTEx RNA-seq data is unstranded.
For more general information about the difference between non-stranded and stranded protocols and how to interpret them, there is a helpful tutorial here.
How is strandedness handled in model outputs?#
In the model output metadata, we use the following symbols to designate the strand of a track:
positive: +
negative: -
unstranded: .
For assays that were performed in a stranded (or strand-specific) manner, the assay will have two tracks per cell or tissue type: one for the positive (+) and another for the negative (-) strand.
For unstranded assays, there will be a single track per cell or tissue type, annotated as unstranded (.).
We provide convenience operations for manipulating TrackData based on strand information, such as filter_to_negative_strand(), etc.
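The strand symbols can be used directly to select tracks from metadata. The sketch below is plain Python over a hypothetical list of metadata rows. The field names and track names are assumptions made for illustration, and the function is not the TrackData API.

```python
# Hypothetical track metadata; the field names are illustrative assumptions.
tracks = [
    {"name": "tissue_a_rna", "strand": "+"},
    {"name": "tissue_a_rna", "strand": "-"},
    {"name": "tissue_b_atac", "strand": "."},
]


def filter_by_strand(rows: list[dict], strand: str) -> list[dict]:
    """Keep only rows whose strand symbol matches '+', '-' or '.'."""
    if strand not in {"+", "-", "."}:
        raise ValueError(f"Unknown strand symbol: {strand!r}")
    return [row for row in rows if row["strand"] == strand]


negative_tracks = filter_by_strand(tracks, "-")
unstranded_tracks = filter_by_strand(tracks, ".")
```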
How can I save the model outputs?#
For variant effect predictions: We recommend converting the scores into a pandas DataFrame. This DataFrame can then be easily exported to a common file format, such as a CSV file, for use with other tools or for record-keeping. Specific instructions and examples for this process are provided in our ‘Variant Scoring UI’ tutorial.
For genome track predictions (e.g., RNA-seq levels): The predicted track data is provided as NumPy arrays within TrackData objects. These arrays can be directly saved to disk using standard NumPy functions, such as numpy.save (for saving a single array to a .npy file) or numpy.savez_compressed (for saving multiple arrays into a single compressed .npz file).
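A minimal sketch of the two NumPy calls mentioned above follows. The arrays are stand-ins created for the example, and their shapes, names, and file names are assumptions. How the array is obtained from a TrackData object is not shown here.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays; in practice these would come from TrackData objects.
# The shapes (positions x tracks) are assumptions for illustration.
rna_seq_values = np.zeros((2048, 3), dtype=np.float32)
dnase_values = np.ones((2048, 2), dtype=np.float32)

output_dir = tempfile.mkdtemp()

# Save a single array to a .npy file.
single_path = os.path.join(output_dir, "rna_seq.npy")
np.save(single_path, rna_seq_values)

# Save several arrays into one compressed .npz file, keyed by name.
multi_path = os.path.join(output_dir, "predictions.npz")
np.savez_compressed(multi_path, rna_seq=rna_seq_values, dnase=dnase_values)

# Load the arrays back to confirm the round trip.
reloaded_single = np.load(single_path)
with np.load(multi_path) as archive:
    reloaded_dnase = archive["dnase"]
```

Saving track metadata alongside the arrays (for example as a CSV file) helps keep track names and strands associated with the array columns.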
What are some of the limitations of the model?#
AlphaGenome has several key limitations:
Tissue-specificity and long-range interactions: While AlphaGenome shows improvements in these areas compared to previous models, accurately capturing tissue-specific effects and long-range genomic interactions remains challenging for deep learning models in genomics, requiring further research.
Species scope: The model is trained and evaluated on human and mouse DNA. Its performance on DNA from other species has not been determined.
Personal genomes: The model has not yet been benchmarked for predicting individual (personal) human genomes.
Molecular scope: AlphaGenome predicts the molecular consequences of genetic variations. Its direct applicability to complex trait analysis is limited, as these traits also involve broader biological processes (e.g., gene function, development, environmental factors) beyond the model’s primary focus.
Unphased training and single sequence input: The model processes a single DNA sequence at a time and is therefore not inherently ‘diploid-aware’. It was trained using unphased data, meaning it could not learn to distinguish between alleles inherited from the mother versus the father. Consequently, its variant effect predictions do not inherently model heterozygous states (i.e., the presence of both a reference and a variant allele at a site simultaneously).
Visualizing predictions#
How do I visualize the predicted output?#
You can use any tool to visualize the numerical output, but we provide a Python visualization library so you can easily visualize the output immediately. You can use our visualization basics guide and see examples of how to plot different modalities in our visualizing predictions tutorial.
Can I design my own visualizations to work with this library?#
Yes. The returned figures are based on matplotlib, so they should be extensible. Additionally, you can choose to work with the raw output data and design your own visualizations.
Where are the plotted transcript annotations from?#
Transcript annotations are sourced from standard Gene Transfer Format (GTF) files from GENCODE: the hg38 reference assembly (release 46) for human and the mm10 reference assembly (release M23) for mouse.
Am I limited to only plotting protein-coding genes, and only the longest transcript?#
No. If you wish to include other gene types or all transcripts (not just the longest), you can remove the respective calls to gene_annotation.filter_protein_coding(gtf) and gene_annotation.filter_to_longest_transcript(gtf) in your code. Note that including more transcripts can make the plot appear busy; you can adjust the fig_height parameter of the TranscriptAnnotation plot component to improve legibility.
Variant scoring#
How do I define a variant?#
By creating a Variant object.
Note
As mentioned above, AlphaGenome classes such as Variant use 0-indexing, and Variant’s start() and end() return 0-indexed values.
However, most variants in public databases, such as dbSNP, are provided as 1-indexed. To enable compatibility with these annotations, the Variant object is initialized with a 1-indexed position attribute, which is then converted to 0-indexing internally (i.e., start() returns position - 1).
See the Variant docstring for more details.
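The conversion between the two conventions is simple arithmetic. The sketch below is plain Python with a hypothetical helper, not the Variant class. The start follows the rule stated in the note, while the half-open end covering the reference allele is an assumption of this sketch.

```python
def zero_based_span(position: int, reference_bases: str) -> tuple[int, int]:
    """Convert a 1-indexed variant position to a 0-indexed half-open span."""
    if position < 1:
        raise ValueError("1-indexed positions start at 1")
    start = position - 1  # matches 'start() returns position - 1' above
    # Assumption: the end is exclusive and covers the reference allele.
    end = start + len(reference_bases)
    return start, end


# A single-nucleotide variant reported at 1-indexed position 1000.
snv_span = zero_based_span(1000, "A")
# A variant whose reference allele spans three bases from the same position.
deletion_span = zero_based_span(1000, "ACG")
```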
Are there tools to help me define variants, and run inference for them?#
See the scoring and visualizing a single variant notebook which walks through how to define a Variant object and perform inference. Batch inference over many variants can be performed using the batch variant scoring notebook which takes a variant call file (VCF) as input.
Can I pass any sequence to reference_bases or does it have to match the reference genome sequence at the variant location?#
You can pass any sequence to reference_bases. Note that predict_variant() is agnostic to the alleles in the reference genome and instead uses the REF/ALT alleles specified by the user.
Are variant predictions for insertions and deletions (indels) supported?#
Yes. We use left-alignment to specify indels. See Variant for more details. For scoring indels, we adopt SpliceAI’s [Jaganathan et al., 2019] indel alignment strategy: inserted bases are summarized by taking the maximum value over the inserted segment, while deleted bases are treated as having zero signal in the ALT context, thereby enabling consistent positional comparisons.
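The alignment idea can be sketched on toy per-base tracks. The function below is a hypothetical illustration in plain Python, not the library's implementation. It assumes a simple left-aligned indel whose shorter allele is a single anchor base, and it includes the anchor base when taking the maximum for insertions, which is also an assumption of this sketch.

```python
def align_alt_to_ref(
    alt_track: list[float], offset: int, ref_len: int, alt_len: int
) -> list[float]:
    """Map a per-base ALT track onto REF coordinates for a simple indel.

    `offset` is the 0-based index of the anchor base within the track.
    Assumes a left-aligned indel whose shorter allele is the single anchor base.
    """
    if ref_len == alt_len:
        return list(alt_track)  # substitution: coordinates already match
    if min(ref_len, alt_len) != 1:
        raise ValueError("This sketch only handles simple indels")
    if alt_len > ref_len:
        # Insertion: collapse the ALT allele span into one value via its maximum.
        collapsed = max(alt_track[offset : offset + alt_len])
        return alt_track[:offset] + [collapsed] + alt_track[offset + alt_len :]
    # Deletion: deleted bases get zero signal in the ALT context.
    deleted = ref_len - alt_len
    return alt_track[: offset + 1] + [0.0] * deleted + alt_track[offset + 1 :]


# Insertion of two bases after the anchor at index 2 (REF length 1, ALT length 3).
insertion_aligned = align_alt_to_ref([0.1, 0.2, 0.3, 0.9, 0.4, 0.5, 0.6], 2, 1, 3)
# Deletion of two bases after the anchor at index 1 (REF length 3, ALT length 1).
deletion_aligned = align_alt_to_ref([0.1, 0.2, 0.3], 1, 3, 1)
```

After alignment, both results have the same length as the corresponding REF track, so REF and ALT values can be compared position by position.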
Which variant scorer should I use for a given modality?#
In practice, you can use most variant scoring strategies for any modality. However, we provide a recommendation for the best strategies based on our evaluations in the variant scoring documentation.
Can I write my own variant scoring strategy?#
We do not currently support users writing their own variant scoring strategy. However, since variant scoring is simply aggregating REF and ALT track predictions, you can write your own methods for handling these values.
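As one example of handling REF and ALT values yourself, the sketch below computes a log2 fold change between the summed ALT and REF signal. This aggregation, the pseudocount, and the function name are choices made for illustration. They are not one of the library's scorers, and the input lists stand in for track values you would extract from the predictions.

```python
import math


def log2_fold_change(
    ref_track: list[float], alt_track: list[float], pseudocount: float = 1.0
) -> float:
    """Compare total ALT signal against total REF signal on a log2 scale."""
    ref_total = sum(ref_track) + pseudocount
    alt_total = sum(alt_track) + pseudocount
    return math.log2(alt_total / ref_total)


ref_values = [1.0, 2.0, 4.0]  # total 7, or 8 with the pseudocount
alt_values = [3.0, 4.0, 8.0]  # total 15, or 16 with the pseudocount
score = log2_fold_change(ref_values, alt_values)
```

A positive score indicates more predicted signal for the ALT allele, a negative score indicates less, and identical tracks give a score of zero.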
What is the difference between a ‘quantile_score’ and ‘raw_score’?#
The ‘raw_score’ is the output of a particular variant scoring strategy. However, different tracks and modalities yield scores on different scales. For instance, the Splice Sites Usage scorer returns values between 0 and 1, whereas the Gene Expression (RNA-seq) scorer returns unbounded negative or positive values.
To facilitate comparisons across tracks and variant scoring strategies, we use an empirical quantiles approach (see [Avsec et al., 2025] for full details). Briefly, we estimate a background distribution for each variant scorer and track using scores for common variants (MAF>0.01 in any gnomAD v3 population). We can then convert any ‘raw_score’ into a ‘quantile_score’, representing its rank within this background distribution. For example, a variant with a quantile score of 0.99 has a raw score equivalent to the 99th percentile of common variants. This provides a measure of predicted impact that is standardized to the same scale across different variant scorers and tracks.
The maximum (or minimum) quantile score never exceeds 0.999990 (or -0.999990), due to the number of variants used to compute the quantiles (~300K). Because of this, we recommend using quantile scores as an indicator of whether the raw score is unusually large, and using raw scores as a measure of the magnitude of the effect for a given scorer and track.
For signed variant scores (which indicate effect direction like up-regulation or down-regulation), their [0,1] quantile probabilities – derived directly from the rank order of the original signed raw scores – are linearly transformed to a [-1,1] range. This rescaling ensures the quantile score reflects the directionality of the raw score. For instance, the 0th percentile (representing the most negative raw scores) maps to -1, the 50th percentile (raw scores around zero) to 0, and the 100th percentile (most positive raw scores) to +1.
Note that quantile scores are only available for the suite of recommended scorers.
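The rank-based conversion and the rescaling for signed scores can be sketched as follows. This is plain Python on a toy background distribution. The rank estimator used here (the fraction of background scores less than or equal to the raw score) and both function names are assumptions for illustration; the estimator used by AlphaGenome is described in [Avsec et al., 2025].

```python
from bisect import bisect_right


def quantile_probability(raw_score: float, background: list[float]) -> float:
    """Fraction of sorted background scores that are <= raw_score."""
    return bisect_right(background, raw_score) / len(background)


def signed_quantile(raw_score: float, background: list[float]) -> float:
    """Linearly rescale the [0, 1] quantile probability to [-1, 1]."""
    return 2.0 * quantile_probability(raw_score, background) - 1.0


# Toy background of signed raw scores; real backgrounds use common variants.
background_scores = sorted([-2.0, -1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 2.0])

most_negative = signed_quantile(-3.0, background_scores)
near_zero = signed_quantile(0.0, background_scores)
most_positive = signed_quantile(3.0, background_scores)
```

With this toy background, a raw score below every background value maps to -1, a raw score of zero maps to 0, and a raw score above every background value maps to +1. The toy example does not reproduce the 0.999990 bound described above.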
Other#
What terms of use apply to AlphaGenome outputs?#
The AlphaGenome API is provided for non-commercial use only and is subject to the AlphaGenome Terms of Service. Outputs generated by AlphaGenome should not be used for the training of other machine learning models.
How should I cite AlphaGenome?#
If you use AlphaGenome in your research, please cite using:
@misc{alphagenome,
title={AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model},
author={Avsec, {\v Z}iga and Latysheva, Natasha and Cheng, Jun and Novati, Guido and Taylor, Kyle R. and Ward, Tom and Bycroft, Clare and Nicolaisen, Lauren and Arvaniti, Eirini and Pan, Joshua and Thomas, Raina and Dutordoir, Vincent and Perino, Matteo and De, Soham and Karollus, Alexander and Gayoso, Adam and Sargeant, Toby and Mottram, Anne and Hong Wong, Lai and Drot\'ar, Pavol and Kosiorek, Adam and Senior, Andrew and Tanburn, Richard and Applebaum, Taylor and Basu, Souradeep and Hassabis, Demis and Kohli, Pushmeet},
url={https://storage.googleapis.com/deepmind-media/papers/alphagenome.pdf},
year={2025},
}
Who should I contact with issues, enquiries and feedback?#
Submit bugs and any code-related issues on GitHub. For general feedback, questions about usage, and/or feature requests, please use the community forum – it’s actively monitored by our team so you’re likely to find answers and insights faster. If you can’t find what you’re looking for, please get in touch with the AlphaGenome team at alphagenome@google.com and we will be happy to assist you with questions. We’re working hard to answer all inquiries but there may be a short delay in our response due to the high volume we are receiving.