Model output metadata#
AlphaGenome returns predictions for 11 different output types, covering a variety of modalities. Here we provide details about the human model outputs and associated metadata to help users make informed decisions about the parameters of their API requests (e.g., ontology term; output types).
For further details on dataset processing and precise definitions of each output type, including their respective units and normalization methods, please refer to the Methods section of the AlphaGenome paper.
Table 1: Descriptions of output types predicted by AlphaGenome.
OutputType name |
Description |
Units |
Resolution |
Unique biosamples |
Total tracks |
---|---|---|---|---|---|
RNA expression as measured by RNA-seq. Includes a mixture of PolyA+ RNA and Total RNA assays. Some tracks are also stranded. |
Normalized read signal |
1bp |
285 |
667 |
|
RNA expression at transcription start-sites as measured by Cap Analysis Gene Expression (CAGE) assay. |
Normalized read signal |
1bp |
264 |
546 |
|
RNA expression at transcription start-sites as measured by Precision Run-On sequencing and capping (PROCAP) assay. |
Normalized read signal |
1bp |
6 |
12 |
|
Chromatin accessibility as measured by DNase I hypersensitive sites sequencing (DNase-seq) assay. |
Normalized insertion signal |
1bp |
305 |
305 |
|
Chromatin accessibility as measured by the transposase-accessible chromatin (ATAC-seq) assay. |
Normalized insertion signal |
1bp |
167 |
167 |
|
Relative abundance of histone modification marks as measured by chromatin immunoprecipitation (ChIP-seq) for 24 different markers e.g. H3k27ac (see ENCODE documentation). |
Fold-change over control |
128bp |
219 |
1116 |
|
Relative abundance of DNA-bound transcription factors as measured by ChIP-seq targeting 43 different proteins (see ENCODE documentation). |
Fold-change over control |
128bp |
163 |
1617 |
|
Predicted location of donor or acceptor splice sites, for both the positive and negative strand, expressed as a probability (higher numbers indicate higher probability of the base being a splice site). |
Predicted probability |
1bp |
NA |
4 |
|
Splice junction spliced read counts, as measured by RNA-Seq. Predictions are for all possible pairings of at most 512 donors and 512 acceptors from each strand in the requested interval, where the position of donors and acceptors along the input sequence is given by predictions of splice site positions. |
Normalized junction signal |
1bp |
282 |
734 |
|
Fraction of transcripts using a splice site, as measured by RNA-seq. All reads that span a given splice site are considered, and we predict the fraction of these that use the site (donor or acceptor). |
Fraction |
1bp |
282 |
734 |
|
Relative frequency of physical contact between pairwise positions (symmetric), derived from chromatin contact maps (Micro-C and Hi-C assays). Values are coarse-grained and normalized by removing the off-diagonal power law decay (as also done in Zhou, J. 2022). |
Log-fold over genomic distance-based expectation |
2048bp |
12 |
28 |
Track metadata#
To access the metadata describing each track for human outputs use:
output_metadata = dna_model.output_metadata(
organism=dna_client.Organism.HOMO_SAPIENS
)
Each predicted output type (e.g., RNA_SEQ) contains metadata in a
DataFrame
: output_metadata.rna_seq
Each row of the DataFrame
corresponds to a ‘track’, and each
column contains key information for biological interpretation such as:
name
: Name of the track. Example:CL:0000047 polyA plus RNA-seq
.strand
Strand of the track, either positive (+
), negative (+
), or unstranded (.
).ontology_curie
: A string ID representing the ontology term corresponding to the biosample. Example:CL:0000100
.biosample_name
: Plain text description of the biosample. Example:motor neuron
.
For a full list of metadata columns available for each output type, please see the navigating data ontologies notebook, which demonstrates how to access and browse track metadata.
Note
For SPLICE_JUNCTION
outputs the strand information is a property of
a junction rather than a track, so the metadata for this output type will show
half as many rows as reported in the above table.
Additional track metadata#
Some output types contain additional columns. For example,
OutputMetadata.rna_seq
and
OutputMetadata.splice_sites
also contain a
gtex_tissue
column, which is populated for the tracks that make predictions
for the tissues sampled in the
GTEx project
Consortium [2020].
Note
For one tissue, ’Brain - Cerebellar hemisphere’, we used an alternative Uberon ID to that was provided in the GTEx documentation (‘UBERON:0002037’), to reflect Uberon’s ID for cerebellar hemisphere: ‘UBERON:0002245’.