Visualizing predictions#

AlphaGenome can make predictions for various different modalities. We built a visualization library to aid interpretation of the outputs from the model. In this tutorial notebook you will:

  • Learn how to visualize different data modalities for a specific genomic interval (and any variants) from model output.

  • Understand how to use the primary functionality of the visualization library alphagenome.visualization.

We also recommend visiting “Visualization Basics” for an overview of the library and its primary functionality.

Tip

Open this tutorial in Google colab for interactive viewing.

# @title Install AlphaGenome

# @markdown Run this cell to install AlphaGenome.
from IPython.display import clear_output
! pip install alphagenome
clear_output()

Setup and imports#

from alphagenome.data import gene_annotation, genome, track_data, transcript
from alphagenome.models import dna_client
from alphagenome.visualization import plot_components
from google.colab import userdata
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Load model and auxiliary objects#

dna_model = dna_client.create(userdata.get('ALPHA_GENOME_API_KEY'))
# Load metadata objects for human.
output_metadata = dna_model.output_metadata(
    organism=dna_client.Organism.HOMO_SAPIENS
)
# Load gene annotations (from GENCODE).
gtf = pd.read_feather(
    'https://storage.googleapis.com/alphagenome/reference/gencode/'
    'hg38/gencode.v46.annotation.gtf.gz.feather'
)

# Filter to protein-coding genes and highly supported transcripts.
gtf_transcript = gene_annotation.filter_transcript_support_level(
    gene_annotation.filter_protein_coding(gtf), ['1']
)

# Extractor for identifying transcripts in a region.
transcript_extractor = transcript.TranscriptExtractor(gtf_transcript)

# Also define an extractor that fetches only the longest transcript per gene.
gtf_longest_transcript = gene_annotation.filter_to_longest_transcript(
    gtf_transcript
)
longest_transcript_extractor = transcript.TranscriptExtractor(
    gtf_longest_transcript
)

Gene expression#

Gene expression is primarily captured by the model outputs RNA_SEQ and CAGE. Here is an example of expression predictions for colon tissue in a genomic interval containing the gene APOL4 . Positions of the longest transcript per gene for genes in this interval are shown at the top:

# Define interval to make predictions for (used throughout this tutorial).
# Note that the interval width must be one of the supported sequence lengths.
interval = genome.Interval('chr22', 36_150_498, 36_252_898).resize(
    dna_client.SEQUENCE_LENGTH_1MB
)

# Define the tissues/cell-types to predict expression for.
ontology_terms = [
    'UBERON:0001159',  # Colon - Sigmoid.
    'UBERON:0001155',  # Colon - Transverse.
]

# Make predictions.
output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.CAGE,
    },
    ontology_terms=ontology_terms,
)

# Extract the longest transcripts per gene for this interval.
longest_transcripts = longest_transcript_extractor.extract(interval)

Build the plot by constructing multiple plotting panels or ‘components’ and feeding these into the plot_components.plot() function. Refer to visualization basics guide for a description of available plotting components.

# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.Tracks(
            tdata=output.rna_seq,
            ylabel_template='RNA_SEQ: {biosample_name} ({strand})\n{name}',
        ),
        plot_components.Tracks(
            tdata=output.cage,
            ylabel_template='CAGE: {biosample_name} ({strand})\n{name}',
        ),
    ],
    interval=interval,
    title='Predicted RNA Expression (RNA_SEQ, CAGE) for colon tissue',
)

Note that the positive and negative stranded tracks have clearly different signals. Let’s check which genes are transcribed from the positive strand:

[t.info['gene_name'] for t in longest_transcripts if t.strand == '+']
['APOL5', 'APOL1']

For both of the positive stranded RNA_SEQ tracks, we can see that RNA-seq levels are highest around APOL1. Similarly for CAGE, the peaks on the positive strand fall around the start of this gene. We do not see peaks around APOL5 as this gene is not expressed in colon tissue (GTEx).

Visualize the effect of a variant#

A previously identified variant in this region (chr22_36201698_A_C) affects both the expression and splicing of the APOL4 gene. Specifically, the alternative allele (C) is linked to reduced APOL4 expression.

To visualize what AlphaGenome predicts for this variant, we can:

  • Compute predictions for the REF and ALT sequences using dna_model.predict_variant

  • Remove positive-stranded tracks (as APOL4 is transcribed from the negative DNA strand)

  • Zoom in to the region around the gene APOL4

  • Highlight the location of the variant using plot_components.VariantAnnotation

  • Increase the relative height of the transcripts section to better view the gene structure as annotated by GENCODE

# Define the variant of interest.
# TODO(b/369592335): Upate to chr22_36201698_A_C once we have a better .from_str() method.
variant_string = 'chr22:36201698:A>C'
variant = genome.Variant.from_str(variant_string)
variant
Variant(chromosome='chr22', position=36201698, reference_bases='A', alternate_bases='C', name='')
# Make predictions for sequences containing the REF and ALT alleles.
output = dna_model.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs={
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.CAGE,
    },
    ontology_terms=ontology_terms,
)
# Zoom in on the region around APOL4.
apol4_interval = gene_annotation.get_gene_interval(gtf, gene_symbol='APOL4')

# Add 1KB on either side of the gene body.
apol4_interval.resize_inplace(apol4_interval.width + 1000)
# Define the colors for REF and ALT predictions.
ref_alt_colors = {'REF': 'dimgrey', 'ALT': 'red'}

# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        # RNA-seq tracks.
        plot_components.OverlaidTracks(
            tdata={
                'REF': output.reference.rna_seq.filter_to_nonpositive_strand(),
                'ALT': output.alternate.rna_seq.filter_to_nonpositive_strand(),
            },
            colors=ref_alt_colors,
            ylabel_template='{biosample_name} ({strand})\n{name}',
        ),
        # CAGE track.
        plot_components.OverlaidTracks(
            tdata={
                'REF': output.reference.cage.filter_to_nonpositive_strand(),
                'ALT': output.alternate.cage.filter_to_nonpositive_strand(),
            },
            colors=ref_alt_colors,
            ylabel_template='{biosample_name} ({strand})\n{name}',
        ),
    ],
    annotations=[plot_components.VariantAnnotation([variant])],
    interval=apol4_interval,
    title='Effect of variant on predicted RNA Expression in colon tissue',
)

You can see here that the predicted RNA-seq values for the alternative allele are much lower across the gene (ALT lines are lower than REF lines), which is also reflected in the bottom CAGE track. Additionally, you can see where the variant might be inducing an exon skipping event (in the exon where the ALT line is zero but the REF has a peak).

Plot custom annotation (e.g., polyadenylation sites)#

# Run predictions with an additional ontology to highlight differences across
# tissues.
ontology_terms = [
    'UBERON:0001159',  # Colon - Sigmoid.
    'UBERON:0002048',  # lung.
]

# Make predictions.
output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={
        dna_client.OutputType.RNA_SEQ,
    },
    ontology_terms=ontology_terms,
)

There are 3 annotated pA sites overlapping APOL4 transcripts, as defined by PolyADBv3 (Wang et al. 2018). We can define these locations as 1bp intervals in hg38 coordinates.

# Define pA sites in hg38 coordinates.
apol4_pAs = [genome.Interval('chr22', 36_189_128, 36_189_129, '-'),
             genome.Interval('chr22', 36_190_089, 36_190_090, '-'),
             genome.Interval('chr22', 36_190_144, 36_190_145, '-')]

# Define plotting interval as first/last pA plus some offset distance.

offset = 200
pA_interval = genome.Interval('chr22',
                              36_189_128 - offset,
                              36_190_145 + offset,
                              '-')

# Define intervals annotation.
pA_labels = ['pA_3', 'pA_2', 'pA_1']
# Build plot.
# NOTE: Depending on annotation distance and interval zoom some annotations may # appear as overlapping.

plot = plot_components.plot(
    [plot_components.TranscriptAnnotation(longest_transcripts),
        # RNA-seq tracks.
        plot_components.Tracks(
            tdata = output.rna_seq.filter_to_negative_strand(),
            ylabel_template = 'RNA_SEQ: {biosample_name} ({strand})\n{name}',
            shared_y_scale = True,
            )
    ],
    annotations=[
        plot_components.IntervalAnnotation(apol4_pAs,
                                           alpha = 1,
                                           labels = pA_labels,
                                           label_angle = 90)],
    interval = pA_interval,
    title='APOL4 polyadenylation sites annotation',

)

Chromatin accessibility#

Chromatin accessibility is primarily captured by the model outputs DNASE and ATAC. Here is an example of predictions for the same genomic interval containing the gene APOL4. There are a large variety of tissues and cell-types available, but let’s focus on plotting predictions for intestinal tract tissues:

# List of IDs corresponding to various intestinal tissues.
ontology_terms = [
    'UBERON:0000317',
    'UBERON:0001155',
    'UBERON:0001157',
    'UBERON:0001159',
    'UBERON:0004992',
    'UBERON:0008971',
]
# Make predictions.
output = dna_model.predict_interval(
    interval,
    requested_outputs={
        dna_client.OutputType.DNASE,
        dna_client.OutputType.ATAC,
    },
    ontology_terms=ontology_terms,
)

Chromatin accessibility can indicate regions of regulatory activity. Let’s use plot_components.IntervalAnnotation() to plot additional annotation on putative promoter regions from GENCODE human or mouse to see if they overlap regions predicted to be accessible:

# Two putative promoter intervals from Ensembl.
promoter_intervals = [
    genome.Interval(
        'chr22', 36_201_799, 36_202_681, name='Ensembl_promoter:ENSR00001367790'
    ),
    genome.Interval(
        'chr22', 36_204_705, 36_205_330, name='Ensembl_promoter:ENSR00001367792'
    ),
]
# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.Tracks(
            tdata=output.dnase,
            ylabel_template='DNASE: {biosample_name} ({strand})\n{name}',
        ),
        plot_components.Tracks(
            tdata=output.atac,
            ylabel_template='ATAC: {biosample_name} ({strand})\n{name}',
        ),
    ],
    # Plot an 8kb window around the variant.
    interval=variant.reference_interval.resize(8000),
    annotations=[
        plot_components.VariantAnnotation([variant]),
        plot_components.IntervalAnnotation(promoter_intervals),
    ],
    title='Predicted chromatin accessibility (DNASE, ATAC) for colon tissue',
)

Some observations from this plot:

  • Elevated DNASE and ATAC signals overlap both putative promoter regions highlighted in grey, but additional chromatin accessibility peaks suggest there may be other regulatory elements in this region.

  • DNASE and ATAC signals are similar but not identical, reflecting differences in assay protocol.

  • Accessibility is not especially high around the variant (orange line).

Splicing#

Splicing effects are primarily captured by the model outputs SPLICE_SITES, SPLICE_SITE_USAGE, and SPLICE_JUNCTIONS. Here is an example of splicing predictions in a genomic interval containing the gene APOL4.

Since RNA_SEQ data also captures splicing patterns, we can plot it together with the splicing outputs, allowing us to see how predictions from different modalities are related:

# List of IDs corresponding to various intestinal tissues.
ontology_terms = [
    'UBERON:0001157',
    'UBERON:0001159',
]
# Make predictions for splicing outputs and RNA_SEQ.
output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.SPLICE_SITES,
        dna_client.OutputType.SPLICE_SITE_USAGE,
        dna_client.OutputType.SPLICE_JUNCTIONS,
    },
    ontology_terms=ontology_terms,
)

Note that SPLICE_SITES are tissue-agnostic predictions, so the ontology_terms filter is not applied to it.

output.splice_sites.metadata
name strand
0 donor +
1 acceptor +
2 donor -
3 acceptor -
# Build plot.
# Since APOL4 is on the negative DNA strand, we use `filter_negative_strand` to
# consider only negative stranded splice predictions.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.Tracks(
            tdata=output.splice_sites.filter_to_negative_strand(),
            ylabel_template='SPLICE SITES: {name} ({strand})',
        ),
    ],
    interval=apol4_interval,
    title='Predicted splicing effects for colon tissue',
)

Note how the model predicts acceptor splice sites near the beginnings of exons and donor splice sites near the ends of exons, whereas splice site usage predictions have peaks on both sides of exons.

We can visualize the junction split read predictions as arcs using plot_components.Sashimi(). We can also visualize the effect of the variant by showing both RNA_SEQ and splicing predictions on the same plot:

# Make predictions for REF and ALT alleles.
output = dna_model.predict_variant(
    interval=interval,
    variant=variant,
    requested_outputs={
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.SPLICE_SITES,
        dna_client.OutputType.SPLICE_SITE_USAGE,
        dna_client.OutputType.SPLICE_JUNCTIONS,
    },
    ontology_terms=ontology_terms,
)
# Get all transcripts, not just the longest one per gene.
transcripts = transcript_extractor.extract(interval)

ref_output = output.reference
alt_output = output.alternate

# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(transcripts),
        plot_components.Sashimi(
            ref_output.splice_junctions
            .filter_to_strand('-')
            .filter_by_tissue('Colon_Transverse'),
            ylabel_template='Reference {biosample_name} ({strand})\n{name}',
        ),
        plot_components.Sashimi(
            alt_output.splice_junctions
            .filter_to_strand('-')
            .filter_by_tissue('Colon_Transverse'),
            ylabel_template='Alternate {biosample_name} ({strand})\n{name}',
        ),
        plot_components.OverlaidTracks(
            tdata={
                'REF': ref_output.rna_seq.filter_to_nonpositive_strand(),
                'ALT': alt_output.rna_seq.filter_to_nonpositive_strand(),
            },
            colors=ref_alt_colors,
            ylabel_template='RNA_SEQ: {biosample_name} ({strand})\n{name}',
        ),
        plot_components.OverlaidTracks(
            tdata={
                'REF': ref_output.splice_sites.filter_to_nonpositive_strand(),
                'ALT': alt_output.splice_sites.filter_to_nonpositive_strand(),
            },
            colors=ref_alt_colors,
            ylabel_template='SPLICE SITES: {name} ({strand})',
        ),
        plot_components.OverlaidTracks(
            tdata={
                'REF': (
                    ref_output.splice_site_usage.filter_to_nonpositive_strand()
                ),
                'ALT': (
                    alt_output.splice_site_usage.filter_to_nonpositive_strand()
                ),
            },
            colors=ref_alt_colors,
            ylabel_template=(
                'SPLICE SITE USAGE: {biosample_name} ({strand})\n{name}'
            ),
        ),
    ],
    interval=apol4_interval,
    annotations=[plot_components.VariantAnnotation([variant])],
    title='Predicted REF vs. ALT effects of variant in colon tissue',
)

ChIP-Histone#

Histone modification marks are captured by the CHIP_HISTONE output. Here is an example of predictions in the same interval and tissues for histone marks thought to indicate key regulatory effects (e.g., H3K4me3).

Some additional considerations for visualizing CHIP_HISTONE predictions:

  • ChIP-Histone (and ChIP-TF) predictions are returned at a coarser resolution (128bp), so we need to adjust the interval size plotted to be compatible (i.e., its width must be a multiple of 128). This is done automatically by the plotting library.

  • Not all biosamples have predictions for all histone marks, but four of the major ones (H3K4me3, H3K4me1, H3K27ac, H3K27me3 and H3K36me3) are available for 40% of biosamples. Let’s see which histone marks are covered by biosamples corresponding to colon tissue:

output_metadata.chip_histone[
    output_metadata.chip_histone['biosample_name'].str.contains('colon')
]
name strand Assay title ontology_curie biosample_name biosample_type biosample_life_stage histone_mark data_source endedness genetically_modified
769 UBERON:0000317 Histone ChIP-seq H3K27ac . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K27ac encode single False
770 UBERON:0000317 Histone ChIP-seq H3K27me3 . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K27me3 encode single False
771 UBERON:0000317 Histone ChIP-seq H3K36me3 . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K36me3 encode single False
772 UBERON:0000317 Histone ChIP-seq H3K4me1 . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K4me1 encode single False
773 UBERON:0000317 Histone ChIP-seq H3K4me3 . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K4me3 encode single False
774 UBERON:0000317 Histone ChIP-seq H3K9ac . Histone ChIP-seq UBERON:0000317 colonic mucosa tissue adult H3K9ac encode single False
835 UBERON:0001157 Histone ChIP-seq H3K27ac . Histone ChIP-seq UBERON:0001157 transverse colon tissue adult H3K27ac encode single False
836 UBERON:0001157 Histone ChIP-seq H3K27me3 . Histone ChIP-seq UBERON:0001157 transverse colon tissue adult H3K27me3 encode single False
837 UBERON:0001157 Histone ChIP-seq H3K36me3 . Histone ChIP-seq UBERON:0001157 transverse colon tissue adult H3K36me3 encode single False
838 UBERON:0001157 Histone ChIP-seq H3K4me1 . Histone ChIP-seq UBERON:0001157 transverse colon tissue adult H3K4me1 encode single False
839 UBERON:0001157 Histone ChIP-seq H3K4me3 . Histone ChIP-seq UBERON:0001157 transverse colon tissue adult H3K4me3 encode single False
840 UBERON:0001159 Histone ChIP-seq H3K27ac . Histone ChIP-seq UBERON:0001159 sigmoid colon tissue adult H3K27ac encode single False
841 UBERON:0001159 Histone ChIP-seq H3K27me3 . Histone ChIP-seq UBERON:0001159 sigmoid colon tissue adult H3K27me3 encode single False
842 UBERON:0001159 Histone ChIP-seq H3K36me3 . Histone ChIP-seq UBERON:0001159 sigmoid colon tissue adult H3K36me3 encode single False
843 UBERON:0001159 Histone ChIP-seq H3K4me1 . Histone ChIP-seq UBERON:0001159 sigmoid colon tissue adult H3K4me1 encode single False
844 UBERON:0001159 Histone ChIP-seq H3K4me3 . Histone ChIP-seq UBERON:0001159 sigmoid colon tissue adult H3K4me3 encode single False
1038 UBERON:0004992 Histone ChIP-seq H3K4me1 . Histone ChIP-seq UBERON:0004992 mucosa of descending colon tissue adult H3K4me1 encode single False
1039 UBERON:0004992 Histone ChIP-seq H3K4me3 . Histone ChIP-seq UBERON:0004992 mucosa of descending colon tissue adult H3K4me3 encode single False
1040 UBERON:0004992 Histone ChIP-seq H3K9me3 . Histone ChIP-seq UBERON:0004992 mucosa of descending colon tissue adult H3K9me3 encode single False
1095 UBERON:0012489 Histone ChIP-seq H3K27ac . Histone ChIP-seq UBERON:0012489 muscle layer of colon tissue adult H3K27ac encode single False
1096 UBERON:0012489 Histone ChIP-seq H3K36me3 . Histone ChIP-seq UBERON:0012489 muscle layer of colon tissue adult H3K36me3 encode single False
1097 UBERON:0012489 Histone ChIP-seq H3K4me1 . Histone ChIP-seq UBERON:0012489 muscle layer of colon tissue adult H3K4me1 encode single False
1098 UBERON:0012489 Histone ChIP-seq H3K4me3 . Histone ChIP-seq UBERON:0012489 muscle layer of colon tissue adult H3K4me3 encode single False
1099 UBERON:0012489 Histone ChIP-seq H3K9ac . Histone ChIP-seq UBERON:0012489 muscle layer of colon tissue adult H3K9ac encode single False
# List of IDs corresponding to various colon tissues in `CHIP_HISTONE` output.
ontology_terms = [
    'UBERON:0000317',
    'UBERON:0001155',
    'UBERON:0001157',
    'UBERON:0001159',
]

# Make predictions.
output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={dna_client.OutputType.CHIP_HISTONE},
    ontology_terms=ontology_terms,
)

We can extract the locations of transcription start sites as annotated by GENCODE using gene_annotation.extract_tss(), and plot these using plot_components.IntervalAnnotation():

gtf_tss = gene_annotation.extract_tss(gtf_longest_transcript)

tss_as_intervals = [
    genome.Interval(
        chromosome=row.Chromosome,
        start=row.Start,
        end=row.End + 1000,  # Add extra 1Kb so the TSSs are visible.
        name=row.gene_name,
    )
    for _, row in gtf_tss.iterrows()
]

We can also reorder tracks so they are grouped by histone mark, and also specify colors based on their histone mark.

reordered_chip_histone = output.chip_histone.select_tracks_by_index(
    output.chip_histone.metadata.sort_values('histone_mark').index
)

histone_to_color = {
    'H3K27AC': '#e41a1c',
    'H3K36ME3': '#ff7f00',
    'H3K4ME1': '#377eb8',
    'H3K4ME3': '#984ea3',
    'H3K9AC': '#4daf4a',
    'H3K27ME3': '#ffc0cb',
}

track_colors = (
    reordered_chip_histone.metadata['histone_mark']
    .map(lambda x: histone_to_color.get(x.upper(), '#000000'))
    .values
)
# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.Tracks(
            tdata=reordered_chip_histone,
            ylabel_template=(
                'CHIP HISTONE: {biosample_name} ({strand})\n{histone_mark}'
            ),
            filled=True,
            track_colors=track_colors,
        ),
    ],
    interval=interval,
    annotations=[
        plot_components.IntervalAnnotation(
            tss_as_intervals, alpha=0.5, colors='blue'
        )
    ],
    despine_keep_bottom=True,
    title='Predicted histone modification markers in colon tissue',
)

We can see that many of the predicted histone modification peaks overlap with transcription start sites, especially markers associated with promoters (H3K4me3, H3K27ac, and H3K9ac). Other markers such as H3K36me3 and H3K4me1 are more associated with gene bodies and enhancers, so these signals are less pronounced around the TSSs.

ChIP-TF#

Transcription factor (TF) binding patterns are captured by the model’s CHIP_TF output. We will visualize these in a similar way to CHIP_HISTONE predictions, with one difference: we need to modify the ontology terms we request in order to get good coverage of both biosamples of interest and multiple TFs. Many biosamples have predictions for two major TFs, namely CTCF and POLR2A, but some cell-lines (HepG2 and K562) have coverage over many more TFs (501 and 269 respectively).

Here, we will also demonstrate two helpful utilities: selecting specific tracks from a TrackData object and aggregating predictions across tracks in a TrackData object.

ontology_terms = [
    'UBERON:0001159',  # Sigmoid colon.
    'UBERON:0001157',  # Transverse colon.
    'EFO:0002067',  # K562.
    'EFO:0001187',  # HepG2.
]

output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={dna_client.OutputType.CHIP_TF},
    ontology_terms=ontology_terms,
)

Say we then only want to work with the K562 and HepG2 part of the output. We can accomplish this by calling filter_tracks:

ontology_terms = [
    'EFO:0002067',  # K562.
    'EFO:0001187',  # HepG2.
]

output_chip_tf = output.chip_tf.filter_tracks(
    (output.chip_tf.metadata['ontology_curie'].isin(ontology_terms)).values
)
len(output_chip_tf.metadata)
845

In these 845 tracks across the 2 cell types, the maximum predicted values vary considerably:

max_predictions = output_chip_tf.metadata[
    ['ontology_curie', 'biosample_name', 'transcription_factor']
].copy()

max_predictions.loc[:, 'max_prediction'] = output_chip_tf.values.max(axis=0)
max_predictions.sort_values('max_prediction', ascending=False).reset_index(
    drop=True
)
ontology_curie biosample_name transcription_factor max_prediction
0 EFO:0002067 K562 PKNOX1 20608.0
1 EFO:0002067 K562 POLR2G 20608.0
2 EFO:0001187 HepG2 RBFOX2 19328.0
3 EFO:0002067 K562 GABPB1 18304.0
4 EFO:0001187 HepG2 REST 15488.0
... ... ... ... ...
840 EFO:0001187 HepG2 TCF12 182.0
841 EFO:0001187 HepG2 GPBP1L1 173.0
842 EFO:0002067 K562 ZNF778 159.0
843 EFO:0002067 K562 XRCC4 149.0
844 EFO:0002067 K562 COPS2 149.0

845 rows × 4 columns

We can filter the tracks down to only those with a maximum value of at least 10000:

print(f'Number of tracks before filtering: {len(output_chip_tf.metadata)}')

output_filtered = output_chip_tf.filter_tracks(
    output_chip_tf.values.max(axis=0) > 8000
)
print(f'Number of tracks after filtering: {len(output_filtered.metadata)}')
Number of tracks before filtering: 845
Number of tracks after filtering: 14

This is a more manageable number of tracks to visualize. Let’s build up the plot as before:

# Build plot.
plot_components.plot(
    components=[
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.Tracks(
            tdata=output_filtered,
            ylabel_template=(
                'CHIP TF: {biosample_name} ({strand})\n{transcription_factor}'
            ),
            filled=True,
        ),
    ],
    interval=interval,
    title='Predicted TF-binding in K562 and HepG2 cell-lines.',
    despine_keep_bottom=True,
    annotations=[
        plot_components.IntervalAnnotation(
            tss_as_intervals, alpha=0.3, colors='blue'
        )
    ],
)
plt.show()

In this figure, we see that CTCF binding sites strongly coincide with those of the cohesin complex protein RAD21. Binding locations for the same protein (e.g., RBFOX2, REST) appear largely the same across these two cell types, with some change in magnitude. Peaks in predicted POLR2G binding, a component of RNA Polymerase II, typically occur near TSS sites.

We can use the same process to select tracks with high peaks anywhere within the APOL4 gene interval:

# Compute max predicted values per track in the APOL4 gene interval.
max_predictions = output.chip_tf.slice_by_interval(
    apol4_interval, match_resolution=True
).values.max(axis=0)

# Filter to the 10 tracks with the highest predictions.
output_filtered = output.chip_tf.filter_tracks(
    (max_predictions >= np.sort(max_predictions)[-10])
)

# Build plot.
plot_components.plot(
    [
        plot_components.TranscriptAnnotation(transcripts),
        plot_components.Tracks(
            tdata=output_filtered,
            ylabel_template=(
                'CHIP TF: {biosample_name} ({strand})\n{transcription_factor}'
            ),
            filled=True,
        ),
    ],
    interval=apol4_interval,
    annotations=[plot_components.IntervalAnnotation(promoter_intervals)],
    despine_keep_bottom=True,
    title='Predicted TF-binding in K562, HepG2, and sigmoid colon.',
)
plt.show()

We observe a strong predicted CTCF peak near the second-to-last exon (counting from the right), which is predicted across different tissues. We also see peaks for (phosphorylated) POL2 binding around one of the putative promoters.

We can average the signal for a transcription factor across tissues to simplify the plot. For example, to plot the mean CTCF signal across tissues, we first need to construct a new TrackData object:

mean_ctcf = output_filtered.values[
    :, output_filtered.metadata['transcription_factor'] == 'CTCF'
].mean(axis=1)

# Construct a new TrackData object from the mean values.
tdata_mean_ctcf = track_data.TrackData(
    values=mean_ctcf[:, None],
    metadata=pd.DataFrame(
        {'transcription_factor': ['CTCF'], 'name': ['mean'], 'strand': ['.']}
    ),
    interval=output_filtered.interval,
    resolution=output_filtered.resolution,
)

And then plot as usual:

plot_components.plot(
    [
        plot_components.TranscriptAnnotation(transcripts),
        plot_components.Tracks(
            tdata=tdata_mean_ctcf,
            ylabel_template='{name} {transcription_factor}',
            filled=True,
        ),
    ],
    interval=apol4_interval,
    annotations=[plot_components.IntervalAnnotation(promoter_intervals)],
    despine_keep_bottom=True,
    title='Predicted CTCF binding (mean across cell types)',
)
plt.show()

Contact maps#

Relative frequency of physical contacts between pairwise genetic positions are captured by the model’s CONTACT_MAPS output. This output type is provided at even coarser resolution (2048bp), so we can only consider intervals that span distances larger than this.

Let’s make the contact map predictions for a colon cell line:

ontology_terms = [
    'EFO:0002824',  # HCT116 colon carcinoma cell line.
]

output = dna_model.predict_interval(
    interval=interval,
    requested_outputs={dna_client.OutputType.CONTACT_MAPS},
    ontology_terms=ontology_terms,
)

And plot the predicted DNA-DNA contacts:

plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation(longest_transcripts),
        plot_components.ContactMaps(
            tdata=output.contact_maps,
            ylabel_template='{biosample_name}\n{name}',
            cmap='autumn_r',
            vmax=1.0,
        ),
    ],
    interval=interval,
    title='Predicted contact maps',
)
plt.show()

Three prominent interacting regions, potentially representing topologically-associated domains (TADs), are visible as blocks. This interaction signal appears to be stronger in the first plot.

Conclusion#

Congratulations! You have now completed the tour of the visualization library and learned how to visualize the core modalities predicted by AlphaGenome.