How variant scoring works

A genomic variant is a difference identified in an individual’s genome sequence when compared to the reference genome sequence. Many genomic variants likely have no appreciable impact, but it can be challenging to identify those that do have a particular molecular effect. AlphaGenome predictions can be used to score variants and help bridge this gap.

To do so, the variant is treated as a pair of sequences: reference (REF) and alternate (ALT). The variant effect is estimated by comparing AlphaGenome predictions for these two sequences across different modalities returned by the model.

Detailed steps

Variant scoring is implemented as follows:

Make REF and ALT predictions for given modality

Variant scoring begins by generating predictions for both the reference and alternate alleles of a variant, restricted to a given modality of interest (e.g., RNA-seq or ATAC).

The model inputs at this stage are the REF and ALT sequences, both drawn from a sequence interval that contains the variant of interest.
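As an illustration of how the two inputs relate, here is a minimal sketch in plain Python that derives an ALT sequence from a REF sequence by substituting the variant's alleles. The function name `apply_variant`, its parameters, and the coordinate convention (0-based interval start, 1-based variant position) are assumptions made for this example and are not part of the AlphaGenome API. The sketch also ignores any padding or cropping needed to keep the ALT sequence at a fixed input length.

```python
def apply_variant(ref_sequence: str, interval_start: int, position: int,
                  ref: str, alt: str) -> str:
    """Return the ALT sequence obtained by applying a variant to a REF sequence.

    Hypothetical helper for illustration only.
    interval_start: 0-based genomic start of ref_sequence (assumed convention).
    position: 1-based genomic position of the first REF base (assumed convention).
    """
    offset = position - 1 - interval_start  # index of the variant within the sequence
    if offset < 0 or offset + len(ref) > len(ref_sequence):
        raise ValueError("Variant lies outside the sequence interval.")
    if ref_sequence[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence.")
    # Replace the REF allele with the ALT allele; flanking bases are unchanged.
    return ref_sequence[:offset] + alt + ref_sequence[offset + len(ref):]


ref_sequence = "ACGTACGT"  # toy interval starting at genomic offset 100
snv_alt = apply_variant(ref_sequence, 100, 104, "T", "G")       # "ACGGACGT"
deletion_alt = apply_variant(ref_sequence, 100, 104, "TA", "T")  # "ACGTCGT"
```

Both `ref_sequence` and the returned ALT sequence would then be passed to the model to obtain the two prediction profiles being compared.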

Figure: Make `REF` and `ALT` predictions for given modality.

Optional - perform indel alignment

For insertion or deletion (indel) variants, the ALT allele’s prediction profile is aligned to the REF allele’s coordinate space. Inserted bases are summarized by taking the maximum value over the inserted segment, while deleted bases are treated as having zero signal in the ALT context, thereby enabling consistent positional comparisons.
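The alignment described above can be sketched with NumPy as follows. This is an illustrative approximation, not the library's implementation. The function name `align_alt_to_ref` and its parameters are hypothetical. The sketch assumes VCF-style alleles that share a leading anchor base, and it folds the inserted bases into that anchor position when taking the maximum.

```python
import numpy as np


def align_alt_to_ref(alt_track, offset, ref_len, alt_len):
    """Map an ALT prediction profile onto REF coordinates (illustrative sketch).

    alt_track: array of shape (positions,) or (positions, tracks).
    offset: index of the first variant base in the sequence.
    ref_len, alt_len: lengths of the REF and ALT alleles (both >= 1, assumed).
    """
    alt_track = np.asarray(alt_track, dtype=float)
    delta = alt_len - ref_len
    if delta == 0:
        return alt_track.copy()  # substitutions need no alignment
    if delta > 0:
        # Insertion: collapse the inserted segment (plus the last shared base,
        # an assumption of this sketch) into one position using the maximum.
        start = offset + ref_len - 1
        end = offset + alt_len
        collapsed = alt_track[start:end].max(axis=0, keepdims=True)
        return np.concatenate([alt_track[:start], collapsed, alt_track[end:]], axis=0)
    # Deletion: re-insert the deleted positions with zero signal.
    start = offset + alt_len
    zeros = np.zeros((-delta,) + alt_track.shape[1:])
    return np.concatenate([alt_track[:start], zeros, alt_track[start:]], axis=0)


# A 2-base insertion after index 1: positions 1..3 collapse to their maximum.
inserted = align_alt_to_ref([1, 2, 5, 9, 3, 4], offset=1, ref_len=1, alt_len=3)
# A 2-base deletion after index 1: two zero-valued positions are restored.
deleted = align_alt_to_ref([1, 2, 3, 4], offset=1, ref_len=3, alt_len=1)
```

After this step the ALT profile has the same positional layout as the REF profile, so the two can be compared position by position.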

Apply spatial mask

A spatial mask defines regions of interest within the interval containing the variant. This mask can be centered on the variant or encompass a gene (gene body, exons, or transcription start site (TSS), based on annotations from a GTF file).

At this stage, values outside of the mask are discarded.
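To make the masking step concrete, the sketch below builds two kinds of boolean masks with NumPy: one centered on the variant, and one covering annotated regions such as exons. The helper names `center_mask` and `region_mask`, and the half-open coordinate convention, are assumptions made for this example. Reading regions from a GTF file is not shown.

```python
import numpy as np


def center_mask(length, variant_index, width):
    """Boolean mask of `width` positions centered on the variant (sketch)."""
    mask = np.zeros(length, dtype=bool)
    first = variant_index - width // 2
    start = max(0, first)
    end = min(length, first + width)
    mask[start:end] = True
    return mask


def region_mask(length, interval_start, regions):
    """Boolean mask covering genomic regions, e.g. exons (sketch).

    regions: iterable of (start, end) half-open genomic coordinates (assumed).
    """
    mask = np.zeros(length, dtype=bool)
    for start, end in regions:
        lo = max(0, start - interval_start)
        hi = min(length, end - interval_start)
        if lo < hi:
            mask[lo:hi] = True
    return mask


track = np.arange(10, dtype=float)             # toy per-position predictions
variant_window = track[center_mask(10, 5, 4)]  # keeps positions 3..6
exon_values = track[region_mask(10, 100, [(102, 104), (108, 120)])]  # 2, 3, 8, 9
```

Indexing with the boolean mask is what discards values outside the region of interest.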

Figure: Apply spatial mask.

Aggregate spatially and compute ALT - REF

Aggregation occurs at this stage, which includes the following:

  • reduction along the spatial axis, using mean or sum, etc.

  • (optional) scaling, such as a \(\log\) or \(l^2\) transform.

  • difference between ALT and REF (ALT - REF).

The final outcome is a single scalar value per track.
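The following NumPy sketch puts these three steps together for a masked pair of prediction arrays. The function `score_variant`, its arguments, and the pseudocount of 1 inside the log transform are assumptions for illustration. They do not reproduce any particular AggregationType option exactly.

```python
import numpy as np


def score_variant(ref, alt, mask, reduce="sum", log_transform=False):
    """Reduce masked REF/ALT predictions and return ALT - REF per track (sketch).

    ref, alt: arrays of shape (positions, tracks).
    mask: boolean array of shape (positions,).
    """
    reducer = {"sum": np.sum, "mean": np.mean}[reduce]
    # Reduce along the spatial axis, keeping one value per track.
    ref_agg = reducer(ref[mask], axis=0)
    alt_agg = reducer(alt[mask], axis=0)
    if log_transform:
        # Optional scaling; the pseudocount of 1 is an assumption of this sketch.
        ref_agg = np.log2(1.0 + ref_agg)
        alt_agg = np.log2(1.0 + alt_agg)
    return alt_agg - ref_agg


ref = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
alt = np.array([[1.0, 2.0], [4.0, 4.0], [7.0, 6.0]])
mask = np.array([False, True, True])

sum_scores = score_variant(ref, alt, mask, reduce="sum")    # one score per track
mean_scores = score_variant(ref, alt, mask, reduce="mean")
```

Each returned array has one entry per track, matching the single scalar per track described above.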

Figure: Aggregate spatially and compute `ALT - REF`.

Note

Aggregation logic is encapsulated in the options listed in AggregationType.

The naming of the options reflects the order of operations of each of the above steps, with the right-most operation applied first to the model predictions.

For example, DIFF_SUM_LOG2 first applies a log transform to the track data, then sums the result, and finally returns the difference ALT - REF.

Some aggregation options may apply the exact same steps, but in a different order.

Regardless of the order of operations, each aggregation type returns a single scalar value per track.
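The effect of operation order can be seen in a small NumPy sketch. The first function follows the DIFF_SUM_LOG2 reading given above (log, then sum, then difference). The second applies the same operations with the sum and log swapped. Both function names and the pseudocount of 1 are assumptions for illustration, not the library's definitions.

```python
import numpy as np


def diff_sum_log2(ref, alt, pseudocount=1.0):
    """Right-most first: log2, then sum over positions, then ALT - REF (sketch)."""
    return (np.log2(alt + pseudocount).sum(axis=0)
            - np.log2(ref + pseudocount).sum(axis=0))


def diff_log2_sum(ref, alt, pseudocount=1.0):
    """Same steps reordered: sum over positions, then log2, then ALT - REF (sketch)."""
    return (np.log2(alt.sum(axis=0) + pseudocount)
            - np.log2(ref.sum(axis=0) + pseudocount))


ref = np.array([[1.0], [3.0]])  # shape (positions, tracks)
alt = np.array([[3.0], [7.0]])

log_then_sum = diff_sum_log2(ref, alt)  # (2 + 3) - (1 + 2) = 2
sum_then_log = diff_log2_sum(ref, alt)  # log2(11) - log2(5)
```

Both results contain one value per track, but the values differ because the log transform is not linear, which is why the order encoded in the option name matters.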

Optional - aggregate tracks

After variant scoring is completed, optional track selection and additional aggregation can be applied.

Suggestions include additional aggregation (mean, max, sum, etc.) over:

  • All tracks

  • Subsets of tracks

Alternatively, a single track of interest can be chosen, e.g., one from a particular sample.
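A minimal NumPy sketch of these options is shown below. The scores and track names are made-up values for illustration only, and the choice of aggregations (mean, maximum absolute value) is one possibility among many.

```python
import numpy as np

# Hypothetical per-track variant scores and track labels (illustrative values).
scores = np.array([0.5, -1.0, 2.0, 0.1])
track_names = ["liver", "lung", "brain", "heart"]

# Aggregate over all tracks.
mean_all = scores.mean()
max_abs_all = np.abs(scores).max()

# Aggregate over a subset of tracks.
subset = [i for i, name in enumerate(track_names) if name in {"liver", "lung"}]
mean_subset = scores[subset].mean()

# Or pick a single track of interest.
brain_score = scores[track_names.index("brain")]
```

Which aggregation is appropriate depends on the question being asked: a maximum absolute value highlights the strongest effect in any track, while a mean summarizes the overall tendency.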

Figure: Optional - aggregate tracks.

Available variant scorers

For more on the types of variant scorers and how they work, visit the API documentation.