Model output metadata

AlphaGenome returns predictions for 11 different output types, covering a variety of modalities. Here we provide details about the human model outputs and their associated metadata to help users make informed decisions about the parameters of their API requests (e.g., ontology terms and output types).

For further details on dataset processing and precise definitions of each output type, including their respective units and normalization methods, please refer to the Methods section of the AlphaGenome paper.

Table 1: Descriptions of output types predicted by AlphaGenome.

| OutputType name | Description | Units | Resolution | Unique biosamples | Total tracks |
|---|---|---|---|---|---|
| RNA_SEQ | RNA expression as measured by RNA-seq. Includes a mixture of PolyA+ RNA and Total RNA assays. Some tracks are also stranded. | Normalized read signal | 1 bp | 285 | 667 |
| CAGE | RNA expression at transcription start-sites as measured by Cap Analysis Gene Expression (CAGE) assay. | Normalized read signal | 1 bp | 264 | 546 |
| PROCAP | RNA expression at transcription start-sites as measured by Precision Run-On sequencing and capping (PROCAP) assay. | Normalized read signal | 1 bp | 6 | 12 |
| DNASE | Chromatin accessibility as measured by DNase I hypersensitive sites sequencing (DNase-seq) assay. | Normalized insertion signal | 1 bp | 305 | 305 |
| ATAC | Chromatin accessibility as measured by the transposase-accessible chromatin (ATAC-seq) assay. | Normalized insertion signal | 1 bp | 167 | 167 |
| CHIP_HISTONE | Relative abundance of histone modification marks as measured by chromatin immunoprecipitation (ChIP-seq) for 24 different markers, e.g. H3K27ac (see ENCODE documentation). | Fold-change over control | 128 bp | 219 | 1116 |
| CHIP_TF | Relative abundance of DNA-bound transcription factors as measured by ChIP-seq targeting 43 different proteins (see ENCODE documentation). | Fold-change over control | 128 bp | 163 | 1617 |
| SPLICE_SITES | Predicted location of donor or acceptor splice sites, for both the positive and negative strand, expressed as a probability (higher numbers indicate higher probability of the base being a splice site). | Predicted probability | 1 bp | NA | 4 |
| SPLICE_JUNCTIONS | Splice junction spliced read counts, as measured by RNA-seq. Predictions are for all possible pairings of at most 512 donors and 512 acceptors from each strand in the requested interval, where the position of donors and acceptors along the input sequence is given by predictions of splice site positions. | Normalized junction signal | 1 bp | 282 | 734 |
| SPLICE_SITE_USAGE | Fraction of transcripts using a splice site, as measured by RNA-seq. All reads that span a given splice site are considered, and we predict the fraction of these that use the site (donor or acceptor). | Fraction | 1 bp | 282 | 734 |
| CONTACT_MAPS | Relative frequency of physical contact between pairwise positions (symmetric), derived from chromatin contact maps (Micro-C and Hi-C assays). Values are coarse-grained and normalized by removing the off-diagonal power law decay (as also done in Zhou, J. 2022). | Log-fold over genomic distance-based expectation | 2048 bp | 12 | 28 |
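
As a rough illustration of what the Resolution column implies, the sketch below computes how many positions a prediction spans along an interval for a given output type. The resolutions are copied from Table 1; the helper name `num_bins` and the 2**20 bp interval length are assumptions chosen for illustration, and the sketch makes no claim about the exact array layout returned by the API.

```python
# Resolutions in base pairs, copied from Table 1.
RESOLUTION_BP = {
    "RNA_SEQ": 1,
    "CAGE": 1,
    "PROCAP": 1,
    "DNASE": 1,
    "ATAC": 1,
    "CHIP_HISTONE": 128,
    "CHIP_TF": 128,
    "SPLICE_SITES": 1,
    "SPLICE_JUNCTIONS": 1,
    "SPLICE_SITE_USAGE": 1,
    "CONTACT_MAPS": 2048,
}


def num_bins(interval_length_bp: int, output_type: str) -> int:
    """Return the number of positions along an interval at the given resolution.

    `num_bins` is a hypothetical helper, not part of the AlphaGenome API.
    """
    resolution = RESOLUTION_BP[output_type]
    if interval_length_bp % resolution != 0:
        raise ValueError(
            f"{interval_length_bp} bp is not a multiple of {resolution} bp"
        )
    return interval_length_bp // resolution


# Interval length chosen only for illustration.
interval_length = 2**20

print(num_bins(interval_length, "RNA_SEQ"))       # 1048576
print(num_bins(interval_length, "CHIP_HISTONE"))  # 8192
print(num_bins(interval_length, "CONTACT_MAPS"))  # 512
```

Because CONTACT_MAPS describes pairwise positions, the 512 bins in this example apply along each axis of the map.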

Track metadata

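To access the metadata describing each track for human outputs, use:

from alphagenome.models import dna_client

# dna_model is assumed to be an already-created AlphaGenome model client.
output_metadata = dna_model.output_metadata(
    organism=dna_client.Organism.HOMO_SAPIENS
)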

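The metadata for each predicted output type is available as a DataFrame attribute of the returned object, named after the output type in lower case. For example, the metadata for RNA_SEQ is in output_metadata.rna_seq.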

Each row of the DataFrame corresponds to a ‘track’, and each column contains key information for biological interpretation such as:

  • name: Name of the track. Example: CL:0000047 polyA plus RNA-seq.

  • strand: Strand of the track, either positive (+), negative (-), or unstranded (.).

  • ontology_curie: A string ID representing the ontology term corresponding to the biosample. Example: CL:0000100.

  • biosample_name: Plain text description of the biosample. Example: motor neuron.

For a full list of metadata columns available for each output type, please see the navigating data ontologies notebook, which demonstrates how to access and browse track metadata.
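
The columns described above can be used to pick out the tracks of interest before building a request. The sketch below is a minimal example that assumes the metadata is a pandas DataFrame with the `strand` and `ontology_curie` columns; the rows of `toy_metadata` are made up for illustration and are not real AlphaGenome tracks, and `select_tracks` is a hypothetical helper rather than part of the API.

```python
from typing import Optional

import pandas as pd

# Toy stand-in for a metadata DataFrame such as output_metadata.rna_seq.
# These rows are invented for illustration only.
toy_metadata = pd.DataFrame(
    {
        "name": ["track_a", "track_b", "track_c", "track_d"],
        "strand": ["+", "-", ".", "+"],
        "ontology_curie": [
            "CL:0000100",
            "CL:0000100",
            "UBERON:0002245",
            "UBERON:0002245",
        ],
        "biosample_name": [
            "motor neuron",
            "motor neuron",
            "cerebellar hemisphere",
            "cerebellar hemisphere",
        ],
    }
)


def select_tracks(
    metadata: pd.DataFrame,
    ontology_curie: str,
    strand: Optional[str] = None,
) -> pd.DataFrame:
    """Return the rows for one ontology term, optionally limited to one strand."""
    mask = metadata["ontology_curie"] == ontology_curie
    if strand is not None:
        mask = mask & (metadata["strand"] == strand)
    return metadata[mask]


motor_neuron_tracks = select_tracks(toy_metadata, "CL:0000100")
positive_only = select_tracks(toy_metadata, "CL:0000100", strand="+")

print(len(motor_neuron_tracks))       # 2
print(positive_only["name"].tolist())  # ['track_a']
```

With real metadata, the same function could be called on output_metadata.rna_seq in place of `toy_metadata`.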

Note

For SPLICE_JUNCTIONS outputs, the strand is a property of a junction rather than of a track, so the metadata for this output type has half as many rows as the total number of tracks reported in the table above.

Additional track metadata

Some output types contain additional columns. For example, OutputMetadata.rna_seq and OutputMetadata.splice_sites also contain a gtex_tissue column, which is populated for the tracks that make predictions for tissues sampled in the GTEx project [Consortium, 2020].
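
One way to find the GTEx-derived tracks is to keep only the rows where gtex_tissue is populated. The sketch below assumes that unpopulated entries appear as missing values or empty strings, which the source does not specify; the rows of `toy_rna_seq` and the tissue label format are invented for illustration, and `gtex_tracks` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for a metadata DataFrame with a gtex_tissue column.
# Rows and label formats are assumptions made for illustration.
toy_rna_seq = pd.DataFrame(
    {
        "name": ["track_a", "track_b", "track_c"],
        "ontology_curie": ["UBERON:0002245", "CL:0000100", "CL:0000100"],
        "gtex_tissue": ["Brain - Cerebellar hemisphere", None, ""],
    }
)


def gtex_tracks(metadata: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose gtex_tissue entry is populated."""
    column = metadata["gtex_tissue"]
    # Treat both missing values and blank strings as unpopulated.
    populated = column.notna() & (column.astype(str).str.strip() != "")
    return metadata[populated]


print(gtex_tracks(toy_rna_seq)["name"].tolist())  # ['track_a']
```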

Note

For one tissue, ‘Brain - Cerebellar hemisphere’, we used an alternative Uberon ID to the one provided in the GTEx documentation (‘UBERON:0002037’), so that it reflects Uberon’s ID for cerebellar hemisphere: ‘UBERON:0002245’.
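
When joining this metadata against ontology IDs taken from the GTEx documentation, the substitution described in the note needs to be applied first. The sketch below shows one possible way to do that. The helper `metadata_curie` and the exact tissue label string are assumptions; the override is keyed on the tissue name rather than on the ID so that it touches only this one tissue.

```python
# Override described in the note: tissue name -> Uberon ID used in the metadata.
# The exact tissue label string is an assumption for illustration.
GTEX_TISSUE_CURIE_OVERRIDES = {
    "Brain - Cerebellar hemisphere": "UBERON:0002245",
}


def metadata_curie(gtex_tissue_name: str, gtex_documented_curie: str) -> str:
    """Return the ontology ID to look up in the track metadata.

    Falls back to the GTEx-documented ID when no override applies.
    """
    return GTEX_TISSUE_CURIE_OVERRIDES.get(
        gtex_tissue_name, gtex_documented_curie
    )


print(metadata_curie("Brain - Cerebellar hemisphere", "UBERON:0002037"))
# UBERON:0002245
```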