Navigating data ontologies
Tip
Open this tutorial in Google Colab for interactive viewing.
# @title Install AlphaGenome
# @markdown Run this cell to install AlphaGenome.
from IPython.display import clear_output
! pip install alphagenome
clear_output()
Setup and imports
from alphagenome.models import dna_client
from google.colab import data_table
import pandas as pd
from google.colab import userdata
data_table.enable_dataframe_formatter()
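The imports above rely on `google.colab`, which is only importable inside Colab. The following is a minimal sketch of one way to obtain the API key when running elsewhere. It assumes the key is stored in an environment variable with the same name as the Colab secret. The helper `get_api_key` is hypothetical and not part of AlphaGenome.

```python
import os


def get_api_key(name='ALPHA_GENOME_API_KEY'):
    """Reads the key from Colab secrets if available, else from the environment.

    Assumption: outside Colab, the key is exported as an environment variable
    named `name`. Returns None if the variable is not set.
    """
    try:
        from google.colab import userdata  # Only importable inside Colab.
    except ImportError:
        return os.environ.get(name)
    return userdata.get(name)
```

The returned value can then be passed to `dna_client.create(...)` in place of the `userdata.get(...)` call.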
Interactively view output metadata
First, we load the model and fetch the output metadata for human.
dna_model = dna_client.create(userdata.get('ALPHA_GENOME_API_KEY'))
output_metadata = dna_model.output_metadata(
    dna_client.Organism.HOMO_SAPIENS
).concatenate()
Click Filter in the upper right-hand corner of the interactive dataframe and type a cell or tissue name such as “brain” into the Search by all fields box to find the ontology_curie term for a tissue and output type of interest:
output_metadata
| | name | strand | Assay title | ontology_curie | biosample_name | biosample_type | biosample_life_stage | data_source | endedness | genetically_modified | output_type | gtex_tissue | histone_mark | transcription_factor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CL:0000084 ATAC-seq | . | ATAC-seq | CL:0000084 | T-cell | primary_cell | adult | encode | paired | False | OutputType.ATAC | NaN | NaN | NaN |
1 | CL:0000100 ATAC-seq | . | ATAC-seq | CL:0000100 | motor neuron | in_vitro_differentiated_cells | adult | encode | paired | False | OutputType.ATAC | NaN | NaN | NaN |
2 | CL:0000236 ATAC-seq | . | ATAC-seq | CL:0000236 | B cell | primary_cell | adult | encode | paired | False | OutputType.ATAC | NaN | NaN | NaN |
3 | CL:0000623 ATAC-seq | . | ATAC-seq | CL:0000623 | natural killer cell | primary_cell | adult | encode | paired | False | OutputType.ATAC | NaN | NaN | NaN |
4 | CL:0000624 ATAC-seq | . | ATAC-seq | CL:0000624 | CD4-positive, alpha-beta T cell | primary_cell | adult | encode | paired | False | OutputType.ATAC | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7 | ENCSR182QNJ | - | PRO-cap | EFO:0001099 | Caco-2 | cell_line | NaN | encode | NaN | False | OutputType.PROCAP | NaN | NaN | NaN |
8 | ENCSR740IPL | - | PRO-cap | EFO:0002067 | K562 | cell_line | NaN | encode | NaN | False | OutputType.PROCAP | NaN | NaN | NaN |
9 | ENCSR797DEF | - | PRO-cap | EFO:0002819 | Calu3 | cell_line | NaN | encode | NaN | False | OutputType.PROCAP | NaN | NaN | NaN |
10 | ENCSR801ECP | - | PRO-cap | CL:0002618 | endothelial cell of umbilical vein | primary_cell | NaN | encode | NaN | False | OutputType.PROCAP | NaN | NaN | NaN |
11 | ENCSR860TYZ | - | PRO-cap | EFO:0001200 | MCF 10A | cell_line | NaN | encode | NaN | False | OutputType.PROCAP | NaN | NaN | NaN |
5563 rows × 14 columns
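The same lookup can also be done programmatically with a pandas filter instead of the interactive Filter box. The sketch below is self-contained, so it uses a small mock dataframe built from a few rows of the table above rather than the real `output_metadata`. The helper `find_curies` is hypothetical, and the mock stores `output_type` as plain strings, which is an assumption made only to keep the example standalone.

```python
import pandas as pd

# Mock metadata: three rows copied from the table above, restricted to the
# columns used here. In the notebook, pass `output_metadata` instead.
mock_metadata = pd.DataFrame({
    'name': ['CL:0000084 ATAC-seq', 'CL:0000100 ATAC-seq', 'ENCSR740IPL'],
    'ontology_curie': ['CL:0000084', 'CL:0000100', 'EFO:0002067'],
    'biosample_name': ['T-cell', 'motor neuron', 'K562'],
    # Assumption: strings stand in for the real output type values.
    'output_type': ['OutputType.ATAC', 'OutputType.ATAC', 'OutputType.PROCAP'],
})


def find_curies(metadata, query):
    """Returns rows whose biosample_name contains `query`, ignoring case."""
    mask = metadata['biosample_name'].str.contains(
        query, case=False, regex=False, na=False
    )
    return metadata.loc[mask, ['ontology_curie', 'biosample_name', 'output_type']]


matches = find_curies(mock_metadata, 'neuron')
print(matches)
```

Unlike the Search by all fields box, this sketch matches against `biosample_name` only. Searching additional columns would require combining one mask per column.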
How many tracks are there per output type?
# Count human tracks
human_tracks = (
    dna_model.output_metadata(dna_client.Organism.HOMO_SAPIENS)
    .concatenate()
    .groupby('output_type')
    .size()
    .rename('# Human tracks')
)
# Count mouse tracks
mouse_tracks = (
    dna_model.output_metadata(dna_client.Organism.MUS_MUSCULUS)
    .concatenate()
    .groupby('output_type')
    .size()
    .rename('# Mouse tracks')
)
pd.concat([human_tracks, mouse_tracks], axis=1).astype(pd.Int64Dtype())
output_type | # Human tracks | # Mouse tracks |
---|---|---|
OutputType.ATAC | 167 | 18 |
OutputType.CAGE | 546 | 188 |
OutputType.DNASE | 305 | 67 |
OutputType.RNA_SEQ | 667 | 173 |
OutputType.CHIP_HISTONE | 1116 | 183 |
OutputType.CHIP_TF | 1617 | 127 |
OutputType.SPLICE_SITES | 4 | 4 |
OutputType.SPLICE_SITE_USAGE | 734 | 180 |
OutputType.SPLICE_JUNCTIONS | 367 | 90 |
OutputType.CONTACT_MAPS | 28 | 8 |
OutputType.PROCAP | 12 | <NA> |
Note that PROCAP outputs are not available for mouse, which is why that cell shows <NA>.
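The <NA> entry comes from how pandas aligns the two count series. The sketch below illustrates this with two small series whose values are copied from the table above. When a label is missing from one series, `pd.concat` fills it with NaN and the column becomes floating point. Casting to the nullable `Int64` dtype restores integer counts and shows the gap as <NA>.

```python
import pandas as pd

# Two values per organism copied from the table above. Mouse has no PROCAP
# entry, mirroring the real metadata.
human = pd.Series(
    {'OutputType.ATAC': 167, 'OutputType.PROCAP': 12}, name='# Human tracks'
)
mouse = pd.Series({'OutputType.ATAC': 18}, name='# Mouse tracks')

# Aligning on the index leaves the missing mouse PROCAP count as NaN, which
# forces the mouse column to float64.
combined = pd.concat([human, mouse], axis=1)

# The nullable integer dtype keeps whole-number counts and marks the gap as <NA>.
as_nullable = combined.astype(pd.Int64Dtype())
print(as_nullable)
```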