alphagenome.data.transcript.Transcript#

class alphagenome.data.transcript.Transcript(exons, cds=None, start_codon=None, stop_codon=None, transcript_id=None, gene_id=None, protein_id=None, uniprot_id=None, info=<factory>)[source]#

Represents a transcript object containing attributes from a GTF file.

A transcript is a region of DNA that encodes a single RNA molecule. The Transcript dataclass contains attributes that describe the structure and content of a transcript, namely:

exons#

A list of `genome.Interval`s representing exons within a transcript. Each `Transcript` must contain exons.

cds#

An optional list of `genome.Interval`s representing coding sequences (CDS) within a transcript. CDS include the start codon and exclude the stop codon.

start_codon#

An optional list of `genome.Interval`s representing a single start codon. A start codon can be split by introns and may therefore span more than one genomic interval. Some coding transcripts are missing start codons, e.g., ENST00000455638.6.

stop_codon#

An optional list of `genome.Interval`s representing a single stop codon. A stop codon can be split by introns and may therefore span more than one genomic interval. Some coding transcripts are missing stop codons, e.g., ENST00000574051.5.

transcript_id#

An optional string representing a transcript id.

gene_id#

An optional string representing a gene id.

protein_id#

An optional string representing a protein id which is encoded by the transcript.

uniprot_id#

An optional UniprotKB-AC id string.

info#

A dictionary of additional information about the transcript.

chromosome#

The name of the chromosome on which the transcript is present. Must be the same for all genomic intervals within a transcript.

is_mitochondrial#

Whether the transcript is on the mitochondrial chromosome.

strand_int#

The strand on which the transcript is present, as an int: -1 for the negative strand, +1 for the positive strand, 0 for unknown.

strand#

The strand (positive or negative) on which the transcript is present. Must be the same for all genomic intervals within a transcript.

is_negative_strand#

A boolean value indicating whether the transcript is on the negative strand.

is_positive_strand#

A boolean value indicating whether the transcript is on the positive strand.

transcript_interval#

The genomic interval spanned by the transcript.

selenocysteines#

A list of intervals where selenocysteines are present within the transcript.

selenocysteine_pos_in_protein#

A list of 0-based positions of selenocysteines in the protein encoded by the transcript.

is_coding#

A value indicating whether the transcript contains coding sequences (CDS).

cds_including_stop_codon#

A list of CDS and stop_codon intervals, with overlapping intervals merged.

utr5#

A list of genomic intervals representing the 5’ untranslated region (UTR). The 5’ UTR does not include the start codon. A transcript may have no 5’ UTR, and a UTR can be split by introns.

utr3#

A list of genomic intervals representing the 3’ untranslated region (UTR). The 3’ UTR does not include the stop codon. A transcript may have no 3’ UTR, and a UTR can be split by introns.

splice_regions#

A list of splice regions within the transcript.

splice_donor_sites#

A list of splice donor sites. Commonly, the RNA sequence that is removed begins with the dinucleotide GU at its 5′ end.

splice_acceptor_sites#

A list of splice acceptor sites. Commonly, the RNA sequence that is removed ends with AG at its 3′ end.

splice_donors#

A list of splice donors. Each is the first nucleotide of the intron (0-based).

splice_acceptors#

A list of splice acceptors. Each is the last nucleotide of the intron (0-based).
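The relationship between introns and the splice donor and acceptor positions can be illustrated with a small sketch. It does not use the library: `Interval` here is a hypothetical stand-in that assumes 0-based, end-exclusive coordinates, and `splice_positions` is an illustrative helper, not part of `Transcript`. The strand handling reflects the usual convention that "first" and "last" are read in the direction of transcription.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for an interval: 0-based start, exclusive end."""
    start: int
    end: int


def splice_positions(introns, negative_strand=False):
    """Return (donors, acceptors) as 0-based positions, one pair per intron."""
    donors, acceptors = [], []
    for intron in introns:
        first, last = intron.start, intron.end - 1
        if negative_strand:
            # On the negative strand the intron's 5' end is its highest coordinate.
            first, last = last, first
        donors.append(first)
        acceptors.append(last)
    return donors, acceptors


introns = [Interval(100, 200), Interval(300, 450)]
plus_donors, plus_acceptors = splice_positions(introns)
minus_donors, minus_acceptors = splice_positions(introns, negative_strand=True)
```

On the positive strand the donors are the intron starts and the acceptors are the last intron bases; on the negative strand the two roles swap.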

Attributes#

cds
cds_including_stop_codon – Obtains coding sequences including stop codon.
chromosome – Gets the chromosome name on which the transcript is present.
gene_id
introns – Get a list of intron intervals.
is_coding
is_mitochondrial – Gets whether the transcript is on the mitochondrial chromosome.
is_negative_strand
is_positive_strand
protein_id
selenocysteine_pos_in_protein
selenocysteines
splice_acceptor_sites
splice_acceptors
splice_donor_sites
splice_donors
splice_regions – Obtains and returns splice regions of a transcript.
start_codon
stop_codon
strand – Gets the strand on which the transcript is present.
strand_int – Gets the strand as an integer.
transcript_id
transcript_interval – Gets a genomic interval of a transcript.
uniprot_id
utr3
utr5
exons
info

Transcript.cds: list[Interval] | None = None#
Transcript.cds_including_stop_codon#

Obtains coding sequences including stop codon.

By default, GTF files exclude stop codons from CDS, whereas GFF files include stop codons within coding sequences.
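As a rough illustration of what merging CDS and stop codon intervals involves, the sketch below works on plain `(start, end)` tuples rather than library objects. It assumes 0-based, end-exclusive coordinates and, as a further assumption, also merges intervals that are directly adjacent, since a GTF stop codon typically begins where the last CDS ends. `merge_intervals` is a hypothetical helper, not a library function.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


cds = [(100, 160), (300, 357)]
stop_codon = [(357, 360)]
cds_with_stop = merge_intervals(cds + stop_codon)
```

Here the stop codon is absorbed into the last CDS interval, so the result still contains two intervals.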

Transcript.chromosome#

Gets the chromosome name on which the transcript is present.

Returns:

The chromosome name.

Transcript.gene_id: str | None = None#
Transcript.introns#

Get a list of intron intervals.

Returns:

A list of genomic intervals representing introns, where each intron is the interval spanning the gap between two adjacent exons.
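The idea of deriving introns from adjacent exons can be sketched without the library. The example assumes exons are given as non-overlapping `(start, end)` tuples in 0-based, end-exclusive coordinates; `introns_from_exons` is a hypothetical helper name.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as (start, end) pairs."""
    ordered = sorted(exons)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


exons = [(300, 400), (100, 200), (550, 600)]
introns = introns_from_exons(exons)
```

A transcript with a single exon yields an empty list, because there is no adjacent pair to span.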

Transcript.is_coding#
Transcript.is_mitochondrial#

Gets whether the transcript is on the mitochondrial chromosome.

Returns:

True if the transcript is on the mitochondrial chromosome, False otherwise.

Transcript.is_negative_strand#
Transcript.is_positive_strand#
Transcript.protein_id: str | None = None#
Transcript.selenocysteine_pos_in_protein#
Transcript.selenocysteines#
Transcript.splice_acceptor_sites#
Transcript.splice_acceptors#
Transcript.splice_donor_sites#
Transcript.splice_donors#
Transcript.splice_regions#

Obtains and returns splice regions of a transcript.

A splice region (SO:0001630) is “within 1-3 bases of the exon or 3-8 bases of the intron”.
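A literal reading of this definition can be sketched for a single intron. This is only an interpretation of the quoted wording, not necessarily the intervals the library returns. It assumes 0-based, end-exclusive coordinates, and `splice_region_intervals` is a hypothetical helper.

```python
def splice_region_intervals(intron_start, intron_end):
    """Return the four windows around one intron as (start, end) pairs."""
    return [
        (intron_start - 3, intron_start),      # last 3 exon bases before the intron
        (intron_start + 2, intron_start + 8),  # intron bases 3-8 counted from its start
        (intron_end - 8, intron_end - 2),      # intron bases 3-8 counted from its end
        (intron_end, intron_end + 3),          # first 3 exon bases after the intron
    ]


regions = splice_region_intervals(200, 300)
```

The two intron bases directly next to each exon boundary are left out under this reading, because the definition starts counting intronic bases at 3.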

Transcript.start_codon: list[Interval] | None = None#
Transcript.stop_codon: list[Interval] | None = None#
Transcript.strand#

Gets the strand on which the transcript is present.

Returns:

The strand (positive or negative).

Transcript.strand_int#

Gets the strand as an integer.

Returns:

-1 for negative strand, +1 for positive strand, 0 for unknown.
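The mapping can be sketched as a small lookup. The strand symbols `"+"`, `"-"`, and `"."` follow the GTF convention and are an assumption about how the strand is stored; `strand_to_int` is a hypothetical helper.

```python
def strand_to_int(strand):
    """Map a strand symbol to +1, -1, or 0 when the strand is unknown."""
    return {"+": 1, "-": -1}.get(strand, 0)


values = [strand_to_int(symbol) for symbol in ("+", "-", ".")]
```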

Transcript.transcript_id: str | None = None#
Transcript.transcript_interval#

Gets a genomic interval of a transcript.

Returns:

A genomic interval of a transcript where transcript start is equal to the first exon start and transcript end is equal to the last exon end.
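In terms of plain coordinates, this amounts to taking the smallest exon start and the largest exon end. The sketch below assumes exons are `(start, end)` tuples and is not the library's implementation.

```python
exons = [(300, 400), (100, 200), (550, 600)]

# The transcript spans from the earliest exon start to the latest exon end.
transcript_interval = (
    min(start for start, _ in exons),
    max(end for _, end in exons),
)
```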

Transcript.uniprot_id: str | None = None#
Transcript.utr3#
Transcript.utr5#
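One way to picture how `utr5` and `utr3` relate to exons, CDS, and the stop codon is the following sketch for a positive-strand transcript. It assumes 0-based, end-exclusive `(start, end)` tuples and a stop codon that sits directly after the last CDS; `clip` is a hypothetical helper, and the library may compute UTRs differently.

```python
def clip(intervals, lower, upper):
    """Keep the parts of each (start, end) interval that lie inside [lower, upper)."""
    clipped = []
    for start, end in intervals:
        start, end = max(start, lower), min(end, upper)
        if start < end:
            clipped.append((start, end))
    return clipped


exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 527)]
stop_codon = [(527, 530)]

coding_start = min(start for start, _ in cds)
coding_end = max(end for _, end in cds + stop_codon)

# Positive strand: the 5' UTR precedes the CDS, the 3' UTR follows the stop codon.
utr5 = clip(exons, exons[0][0], coding_start)
utr3 = clip(exons, coding_end, exons[-1][1])
```

For a negative-strand transcript the two roles would swap, with the 5’ UTR at the higher coordinates.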
Transcript.exons: list[Interval]#
Transcript.info: dict[str, Any]#

Methods#

fix_truncation(transcript) – Fixes CDS start and stop positions to be within coding frame.
from_gtf_df(transcript_df[, ignore_info, ...]) – Initialises Transcript object from a given transcript dataframe.
offset_in_cds(genome_position) – Return the offset within the set of CDS exons of genome_position.

classmethod Transcript.fix_truncation(transcript)[source]#

Fixes CDS start and stop positions to be within coding frame.

Parameters:

transcript (Transcript) – a transcript to fix.

Return type:

Transcript

Returns:

New transcript with set start/stop codons and fixed CDS if the total length of CDS is > 6. Returns a copy of original transcript otherwise.

classmethod Transcript.from_gtf_df(transcript_df, ignore_info=True, fix_truncation=False)[source]#

Initialises Transcript object from a given transcript dataframe.

Parameters:
  • transcript_df (DataFrame) – Dataframe representing a transcript. The dataframe must contain a single transcript.

  • ignore_info (bool (default: True)) – If True, other columns in transcript_df won’t be added to the info field, except transcript_type and selenocysteines.

  • fix_truncation (bool (default: False)) – Whether to apply the truncation fix to the CDS.

Return type:

Transcript

Returns:

Initialised Transcript object.

Raises:

ValueError – if the dataframe provided is invalid (no or more than one transcript, transcript has inconsistent strand or chromosome, transcript doesn't contain exons, CDS are not within exons, etc.)
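The kinds of checks listed above can be sketched on plain Python rows. The row keys used here (`transcript_id`, `chromosome`, `strand`, `feature`, `start`, `end`) are hypothetical and are not the column names the library expects; the sketch only mirrors the conditions named in the error description.

```python
def validate_transcript_rows(rows):
    """Raise ValueError unless rows describe exactly one well-formed transcript."""
    transcript_ids = {row["transcript_id"] for row in rows}
    if len(transcript_ids) != 1:
        raise ValueError(f"expected one transcript, found {len(transcript_ids)}")
    if len({row["chromosome"] for row in rows}) != 1:
        raise ValueError("inconsistent chromosome")
    if len({row["strand"] for row in rows}) != 1:
        raise ValueError("inconsistent strand")
    exons = [(r["start"], r["end"]) for r in rows if r["feature"] == "exon"]
    if not exons:
        raise ValueError("transcript has no exons")
    for r in rows:
        inside = any(s <= r["start"] and r["end"] <= e for s, e in exons)
        if r["feature"] == "CDS" and not inside:
            raise ValueError("CDS is not within an exon")


def is_valid(rows):
    """Convenience wrapper returning True or False instead of raising."""
    try:
        validate_transcript_rows(rows)
    except ValueError:
        return False
    return True


base = {"transcript_id": "T1", "chromosome": "chr1", "strand": "+"}
good_rows = [
    {**base, "feature": "exon", "start": 100, "end": 200},
    {**base, "feature": "CDS", "start": 120, "end": 200},
]
bad_rows = good_rows + [{**base, "strand": "-", "feature": "exon", "start": 300, "end": 400}]
```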

Transcript.offset_in_cds(genome_position)[source]#

Return the offset within the set of CDS exons of genome_position.

Parameters:

genome_position (int) – A coordinate presumed to be on the same chromosome as this transcript.

Return type:

int | None

Returns:

The offset of genome_position from the start of the CDS, accounting for strand, or None, if genome_position does not overlap the CDS.
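A minimal sketch of such an offset calculation is shown below. It is not the library's code: it assumes CDS intervals are 0-based, end-exclusive `(start, end)` tuples and that `genome_position` is 0-based, and the standalone function with a `negative_strand` flag is a hypothetical stand-in for the method.

```python
def offset_in_cds(cds, genome_position, negative_strand=False):
    """Return the 0-based offset into the concatenated CDS, or None if outside it."""
    # Walk the CDS intervals in transcription order.
    ordered = sorted(cds, reverse=negative_strand)
    offset = 0
    for start, end in ordered:
        if start <= genome_position < end:
            if negative_strand:
                return offset + (end - 1 - genome_position)
            return offset + (genome_position - start)
        offset += end - start
    return None


cds = [(100, 110), (200, 220)]
plus_offset = offset_in_cds(cds, 205)
minus_offset = offset_in_cds(cds, 109, negative_strand=True)
outside = offset_in_cds(cds, 150)
```

On the positive strand, position 205 lies five bases into the second CDS interval, after the ten bases of the first. On the negative strand, counting starts from the highest coordinate instead.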