alphagenome.data.transcript.Transcript#

class alphagenome.data.transcript.Transcript(exons, cds=None, start_codon=None, stop_codon=None, transcript_id=None, gene_id=None, protein_id=None, uniprot_id=None, info=<factory>)[source]#

Represents a transcript object containing attributes from a GTF file.

A transcript is a region of DNA that encodes a single RNA molecule. The Transcript dataclass contains attributes that describe the structure and content of a transcript, namely:

exons#: A list of `genome.Interval`s representing exons within a transcript. Each `Transcript` must contain exons.

cds#: An optional list of `genome.Interval`s representing coding sequences (CDS) within a transcript. CDS include the start codon and exclude the stop codon.

start_codon#: An optional list of `genome.Interval`s representing a single start codon. A start codon can be split by introns and may therefore span more than one genomic interval. Some coding transcripts are missing start codons, e.g., ENST00000455638.6.

stop_codon#: An optional list of `genome.Interval`s representing a single stop codon. A stop codon can be split by introns and may therefore span more than one genomic interval. Some coding transcripts are missing stop codons, e.g., ENST00000574051.5.

transcript_id#: An optional string representing a transcript id.

gene_id#: An optional string representing a gene id.

protein_id#: An optional string representing a protein id which is encoded by the transcript.

uniprot_id#: An optional UniprotKB-AC id string.

info#: a dictionary of additional information on a transcript.

chromosome#: chromosome name on which the transcript is present. Must be the same for all genomic intervals within a transcript.

is_mitochondrial#: whether the transcript is on the mitochondrial chromosome.

strand_int#: strand on which the transcript is present, as an int: -1 for the negative strand, +1 for the positive strand, 0 for unknown.

strand#: strand (positive or negative) on which transcript is present. Must be the same for all genomic intervals within a transcript.

is_negative_strand#: a boolean value indicating whether transcript is on negative strand.

is_positive_strand#: a boolean value indicating whether transcript is on positive strand.

transcript_interval#: a genomic interval of a transcript.

selenocysteines#: a list of intervals where selenocysteines are present within a transcript.

selenocysteine_pos_in_protein#: a list of 0-based positions of selenocysteines in protein encoded by the transcript.

is_coding#: a value indicating whether a Transcript contains coding sequences (CDS) or not.

cds_including_stop_codon#: a list of CDS and stop_codon intervals with overlapping intervals merged.

utr5#: A list of genomic intervals representing 5’ untranslated region. 5’ UTR doesn’t include start codon. There may be no 5’ UTR present in the transcript or UTRs can be split by introns.

utr3#: A list of genomic intervals representing 3’ untranslated region. 3’ UTR doesn’t include stop codon. There may be no 3’ UTR present in the transcript or UTRs can be split by introns.

splice_regions#: a list of splice regions within a transcript.

splice_donor_sites#: a list of splice donor sites. Commonly, the RNA sequence that is removed begins with the dinucleotide GU at its 5′ end.

splice_acceptor_sites#: a list of splice acceptor sites. Commonly, the RNA sequence that is removed ends with AG at its 3′ end.

splice_donors#: a list of splice donors. The first nucleotide of the intron (0-based).

splice_acceptors#: a list of splice acceptors. The last nucleotide of the intron (0-based).

Attributes#

`cds`
`cds_including_stop_codon`	Obtains coding sequences including stop codon.
`chromosome`	Gets the chromosome name on which the transcript is present.
`gene_id`
`introns`	Get a list of intron intervals.
`is_coding`
`is_mitochondrial`	Gets whether the transcript is on the mitochondrial chromosome.
`is_negative_strand`
`is_positive_strand`
`protein_id`
`selenocysteine_pos_in_protein`
`selenocysteines`
`splice_acceptor_sites`
`splice_acceptors`
`splice_donor_sites`
`splice_donors`
`splice_regions`	Obtains and returns splice regions of a transcript.
`start_codon`
`stop_codon`
`strand`	Gets the strand on which the transcript is present.
`strand_int`	Gets the strand as an integer.
`transcript_id`
`transcript_interval`	Gets a genomic interval of a transcript.
`uniprot_id`
`utr3`
`utr5`
`exons`
`info`

Transcript.cds: list[Interval] | None = None#

Transcript.cds_including_stop_codon#

Obtains coding sequences including stop codon.

By default, GTF files exclude stop codons from CDS, while GFF files include stop codons within coding sequences.
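The merging behaviour described for this attribute can be illustrated with a minimal, library-independent sketch. The `Interval` tuple and `merge_intervals` helper below are hypothetical stand-ins rather than part of alphagenome, coordinates are assumed to be 0-based and half-open, and treating touching intervals as mergeable is also an assumption.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for genome.Interval (0-based, half-open)."""
    start: int
    end: int


def merge_intervals(intervals: list[Interval]) -> list[Interval]:
    """Merges overlapping or touching intervals into sorted, disjoint ones."""
    merged: list[Interval] = []
    for interval in sorted(intervals):
        if merged and interval.start <= merged[-1].end:
            # Extend the previous interval instead of starting a new one.
            last = merged[-1]
            merged[-1] = Interval(last.start, max(last.end, interval.end))
        else:
            merged.append(interval)
    return merged


cds = [Interval(100, 150), Interval(200, 260)]
stop_codon = [Interval(260, 263)]  # GTF style: directly follows the last CDS.
cds_with_stop = merge_intervals(cds + stop_codon)
```

In this example the stop codon is absorbed into the last CDS interval, so the result still contains two intervals.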

Transcript.chromosome#

Gets the chromosome name on which the transcript is present.

Returns:: The chromosome name.

Transcript.gene_id: str | None = None#

Transcript.introns#

Get a list of intron intervals.

Returns:: A list of genomic intervals representing introns, where a single intron junction is an interval spanning between two adjacent exons.
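As a rough illustration of the definition above, introns can be derived as the gaps between adjacent exons. This sketch uses plain `(start, end)` tuples with assumed 0-based, half-open coordinates; `introns_from_exons` is a hypothetical helper, not the library's implementation.

```python
def introns_from_exons(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Returns the gaps between adjacent exons as (start, end) tuples."""
    ordered = sorted(exons)
    # Each intron spans from the end of one exon to the start of the next.
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


exons = [(100, 200), (300, 400), (550, 600)]
introns = introns_from_exons(exons)
```

A transcript with a single exon yields an empty list, since there are no adjacent exon pairs.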

Transcript.is_coding#

Transcript.is_mitochondrial#

Gets whether the transcript is on the mitochondrial chromosome.

Returns:: True if the transcript is on the mitochondrial chromosome, False otherwise.

Transcript.is_negative_strand#

Transcript.is_positive_strand#

Transcript.protein_id: str | None = None#

Transcript.selenocysteine_pos_in_protein#

Transcript.selenocysteines#

Transcript.splice_acceptor_sites#

Transcript.splice_acceptors#

Transcript.splice_donor_sites#

Transcript.splice_donors#
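The class description defines splice donors as the first nucleotide of each intron and splice acceptors as the last (0-based). A minimal sketch of that definition follows, assuming half-open `(start, end)` intron tuples and assuming that "first" and "last" are read in the direction of transcription, so the two swap on the negative strand. The helper name is hypothetical.

```python
def splice_donors_and_acceptors(
    introns: list[tuple[int, int]], strand: str
) -> tuple[list[int], list[int]]:
    """Returns (donor positions, acceptor positions) for the given introns."""
    donors, acceptors = [], []
    for start, end in introns:
        first, last = start, end - 1  # Genomic first and last intron bases.
        if strand == '-':
            # On the negative strand transcription runs right to left.
            first, last = last, first
        donors.append(first)
        acceptors.append(last)
    return donors, acceptors


introns = [(200, 300), (400, 550)]
plus_donors, plus_acceptors = splice_donors_and_acceptors(introns, '+')
minus_donors, minus_acceptors = splice_donors_and_acceptors(introns, '-')
```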

Transcript.splice_regions#

Obtains and returns splice regions of a transcript.

A splice region (SO:0001630) is “within 1-3 bases of the exon or 3-8 bases of the intron”.
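To make the SO:0001630 definition concrete, the sketch below lists the four pieces it implies around a single intron, assuming 0-based, half-open coordinates. How the library groups, merges or clips these pieces is not stated here, so `splice_region_pieces` is purely illustrative and performs no bounds checking for short exons or introns.

```python
def splice_region_pieces(
    intron_start: int, intron_end: int
) -> list[tuple[int, int]]:
    """Returns the exon and intron pieces of the splice regions of one intron."""
    return [
        (intron_start - 3, intron_start),      # Last 3 exon bases before the intron.
        (intron_start + 2, intron_start + 8),  # Intron bases 3-8 from its start.
        (intron_end - 8, intron_end - 2),      # Intron bases 3-8 from its end.
        (intron_end, intron_end + 3),          # First 3 exon bases after the intron.
    ]


pieces = splice_region_pieces(200, 300)
```

Intron bases 1-2 at each end are skipped because they correspond to the splice donor and acceptor sites themselves.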

Transcript.start_codon: list[Interval] | None = None#

Transcript.stop_codon: list[Interval] | None = None#

Transcript.strand#

Gets the strand on which the transcript is present.

Returns:: The strand (positive or negative).

Transcript.strand_int#

Gets the strand as an integer.

Returns:: -1 for negative strand, +1 for positive strand, 0 for unknown.

Transcript.transcript_id: str | None = None#

Transcript.transcript_interval#

Gets a genomic interval of a transcript.

Returns:: A genomic interval of a transcript where transcript start is equal to the first exon start and transcript end is equal to the last exon end.

Transcript.uniprot_id: str | None = None#

Transcript.utr3#

Transcript.utr5#
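The `utr5` and `utr3` attributes are described as the untranslated exon portions that exclude the start and stop codons. One way to picture this is to clip exons against the coding span, as in the sketch below. It assumes half-open `(start, end)` tuples and a coding span that already includes the stop codon; `split_utrs` and its parameters are hypothetical names.

```python
def split_utrs(
    exons: list[tuple[int, int]],
    coding_start: int,
    coding_end: int,
    strand: str,
) -> tuple[list[tuple[int, int]], list[tuple[int, int]]]:
    """Returns (utr5, utr3) as exon portions outside [coding_start, coding_end)."""
    left, right = [], []
    for start, end in sorted(exons):
        if start < coding_start:
            # Exon portion lying before the coding span on the genome.
            left.append((start, min(end, coding_start)))
        if end > coding_end:
            # Exon portion lying after the coding span on the genome.
            right.append((max(start, coding_end), end))
    # On the negative strand the genomic left side is the 3' end.
    return (left, right) if strand == '+' else (right, left)


exons = [(0, 50), (100, 200), (300, 400), (500, 600)]
utr5_plus, utr3_plus = split_utrs(exons, 150, 550, '+')
utr5_minus, utr3_minus = split_utrs(exons, 150, 550, '-')
```

Because exons are clipped individually, a UTR that is split by an intron naturally comes back as more than one interval.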

Transcript.exons: list[Interval]#

Transcript.info: dict[str, Any]#

Methods#

`fix_truncation`(transcript)	Fixes CDS start and stop positions to be within coding frame.
`from_gtf_df`(transcript_df[, ignore_info, ...])	Initialises a Transcript object from a given transcript dataframe.
`offset_in_cds`(genome_position)	Return the offset within the set of CDS exons of `genome_position`.

classmethod Transcript.fix_truncation(transcript)[source]#

Fixes CDS start and stop positions to be within coding frame.

Parameters:: transcript (Transcript) – a transcript to fix.
Return type:: Transcript
Returns:: New transcript with set start/stop codons and fixed CDS if the total length of CDS is > 6. Returns a copy of original transcript otherwise.

classmethod Transcript.from_gtf_df(transcript_df, ignore_info=True, fix_truncation=False)[source]#

Initialises a Transcript object from a given transcript dataframe.

Parameters:

transcript_df (DataFrame) – Dataframe representing a transcript. The dataframe must contain a single transcript.
ignore_info (bool (default: True)) – If True, other columns in transcript_df won’t be added to the info field, except transcript_type and selenocysteines.
fix_truncation (bool (default: False)) – Whether to apply the truncation fix to the CDS.

Return type:

Transcript

Returns:

Initialised Transcript object.

Raises:

ValueError – if the dataframe provided is invalid (no or more than one transcript, transcript has inconsistent strand or chromosome, transcript doesn't contain exons, CDS are not within exons, etc.).
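The validity conditions listed under Raises can be sketched in plain Python. The example below works on a list of dictionaries rather than a dataframe, and the keys (`transcript_id`, `chromosome`, `strand`, `feature`, `start`, `end`) are hypothetical; the actual column names expected by `from_gtf_df` are not given on this page.

```python
def validate_transcript_rows(rows: list[dict]) -> None:
    """Raises ValueError if rows do not describe one consistent transcript."""
    if len({row['transcript_id'] for row in rows}) != 1:
        raise ValueError('Expected exactly one transcript.')
    if len({row['chromosome'] for row in rows}) != 1:
        raise ValueError('Inconsistent chromosome.')
    if len({row['strand'] for row in rows}) != 1:
        raise ValueError('Inconsistent strand.')
    exons = [(r['start'], r['end']) for r in rows if r['feature'] == 'exon']
    if not exons:
        raise ValueError('Transcript has no exons.')
    for row in rows:
        inside = any(s <= row['start'] and row['end'] <= e for s, e in exons)
        if row['feature'] == 'CDS' and not inside:
            raise ValueError('CDS is not contained in an exon.')


base = {'transcript_id': 'T1', 'chromosome': 'chr1', 'strand': '+'}
good_rows = [
    {**base, 'feature': 'exon', 'start': 100, 'end': 200},
    {**base, 'feature': 'CDS', 'start': 120, 'end': 200},
]
bad_rows = good_rows + [{**base, 'feature': 'CDS', 'start': 250, 'end': 300}]

validate_transcript_rows(good_rows)  # Passes silently.
try:
    validate_transcript_rows(bad_rows)
    error = None
except ValueError as exc:
    error = str(exc)
```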

Transcript.offset_in_cds(genome_position)[source]#

Return the offset within the set of CDS exons of genome_position.

Parameters:: genome_position (int) – A coordinate presumed to be on the same chromosome as this transcript.
Return type:: int | None
Returns:: The offset of genome_position from the start of the CDS, accounting for strand, or None, if genome_position does not overlap the CDS.
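The strand-aware offset described above can be sketched as follows. This is an illustration under assumptions, not the library's code: CDS intervals are plain half-open `(start, end)` tuples, the strand is passed as `'+'` or `'-'`, and the function is a free-standing hypothetical helper rather than a method.

```python
from typing import Optional


def offset_in_cds(
    cds: list[tuple[int, int]], genome_position: int, strand: str
) -> Optional[int]:
    """Returns the 0-based offset of a position within the concatenated CDS."""
    # Walk the CDS intervals in the direction of transcription.
    ordered = sorted(cds, reverse=(strand == '-'))
    offset = 0
    for start, end in ordered:
        if start <= genome_position < end:
            if strand == '-':
                return offset + (end - 1 - genome_position)
            return offset + (genome_position - start)
        offset += end - start  # Skip over a fully preceding CDS interval.
    return None  # The position does not overlap any CDS interval.


cds = [(100, 150), (200, 260)]
plus_offset = offset_in_cds(cds, 205, '+')
minus_offset = offset_in_cds(cds, 149, '-')
outside = offset_in_cds(cds, 160, '+')
```

Integer-dividing such an offset by 3 gives the 0-based index of the codon that contains the position.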
