Scoring variants on their splicing effect

Scoring variants on their splicing effect#

This notebook demonstrates how to score genetic variants on their splicing effect using AlphaGenome. This is the approach used in the AlphaGenome paper to score ClinVar variants for missplicing, and is the recommended method to assess whether a variant is causing aberrant splicing.

We use the “merged splicing” approach, which combines three splicing-related variant scorers into a single score:

Splice sites: Quantifies changes in splice site class assignment probabilities (donor, acceptor) between ALT and REF alleles.
Splice site usage: Quantifies changes in the relative usage of splice sites between ALT and REF alleles.
Splice junctions: Quantifies changes in the predicted splice junction usage between ALT and REF alleles.

The merged splicing score is computed as (described in the paper):

alphagenome_splicing = max(splice_sites) + max(splice_site_usage) + max(splice_junctions) / 5.0

This weighting scheme gives full weight to changes in splice site identity and usage, but a smaller weight to junction-level changes because their magnitude is larger.

Tip

Open this tutorial in Google Colab for interactive viewing.

# @title Install AlphaGenome

# @markdown Run this cell to install AlphaGenome.
from IPython.display import clear_output
! pip install alphagenome
clear_output()

Imports#

from alphagenome import colab_utils
from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
import pandas as pd

Load the model#

dna_model = dna_client.create(colab_utils.get_api_key())

Define the splicing variant scorers#

AlphaGenome provides three variant scorers related to splicing:

GeneMaskSplicingScorer with OutputType.SPLICE_SITES: Measures changes in splice site classification probabilities (donor, acceptor, neither).
GeneMaskSplicingScorer with OutputType.SPLICE_SITE_USAGE: Measures changes in splice site usage levels.
SpliceJunctionScorer: Measures changes in splice junction predictions.

We can access these from the recommended scorers dictionary:

splicing_scorers = [
    variant_scorers.RECOMMENDED_VARIANT_SCORERS['SPLICE_SITES'],
    variant_scorers.RECOMMENDED_VARIANT_SCORERS['SPLICE_SITE_USAGE'],
    variant_scorers.RECOMMENDED_VARIANT_SCORERS['SPLICE_JUNCTIONS'],
]

for scorer in splicing_scorers:
  print(f'{scorer.name} (signed={scorer.is_signed})')

Score a single variant#

Let’s score a variant near a known splice site. We use a 1MB sequence length, as this provides the best predictions for gene-level scoring.

# Define a variant near a splice site in the BRCA2 gene.
variant = genome.Variant(
    chromosome='chr13',
    position=32316462,
    reference_bases='T',
    alternate_bases='G',
)

# Create a 1MB interval centered on the variant.
interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)

# Score the variant using the splicing scorers.
scores = dna_model.score_variant(
    interval=interval,
    variant=variant,
    variant_scorers=splicing_scorers,
    organism=dna_client.Organism.HOMO_SAPIENS,
)

# View the tidy scores.
df_scores = variant_scorers.tidy_scores([scores])
df_scores

Compute the merged splicing score#

The merged splicing score combines the three individual splicing scores. For each scorer, we take the maximum absolute score across all genes and tracks, then combine them with the following weights:

Splice sites: weight = 1.0
Splice site usage: weight = 1.0
Splice junctions: weight = 0.2 (i.e. divided by 5)

This combination captures both the disruption of splice site identity and the downstream effect on junction usage.

def compute_merged_splicing_score(
    df: pd.DataFrame,
) -> pd.DataFrame:
  """Computes the merged splicing score from tidy variant scores.

  For each variant, takes the max raw_score across genes and tracks for each
  splicing scorer, then combines them as:
    merged_splicing = max(splice_sites) + max(splice_site_usage)
                      + max(splice_junctions) / 5

  Args:
    df: Tidy scores DataFrame from variant_scorers.tidy_scores().

  Returns:
    DataFrame with one row per variant and columns for each splicing scorer's
    max score plus the merged splicing score.
  """
  # Get the max score per variant per output type.
  df['variant_id'] = df['variant_id'].map(str)
  max_scores = (
      df.groupby(['variant_id', 'output_type'])['raw_score']
      .max()
      .reset_index()
      .pivot(index='variant_id', columns='output_type', values='raw_score')
      .fillna(0.0)
  )

  # Compute the merged splicing score with the appropriate weights.
  max_scores['alphagenome_splicing'] = (
      max_scores.get('SPLICE_SITES', 0.0)
      + max_scores.get('SPLICE_SITE_USAGE', 0.0)
      + max_scores.get('SPLICE_JUNCTIONS', 0.0) / 5.0
  )

  return max_scores.reset_index()


merged_scores = compute_merged_splicing_score(df_scores)
merged_scores

Score a batch of variants#

You can score multiple variants and compute the merged splicing score for each:

# Define a batch of variants to score.
variants = [
    genome.Variant(
        chromosome='chr13',
        position=32316462,
        reference_bases='T',
        alternate_bases='G',
    ),
    genome.Variant(
        chromosome='chr17',
        position=43092919,
        reference_bases='G',
        alternate_bases='A',
    ),
    genome.Variant(
        chromosome='chr7',
        position=117559593,
        reference_bases='T',
        alternate_bases='A',
    ),
]

# Score each variant and collect results.
all_scores = []
for variant in variants:
  interval = variant.reference_interval.resize(dna_client.SEQUENCE_LENGTH_1MB)
  scores = dna_model.score_variant(
      interval=interval,
      variant=variant,
      variant_scorers=splicing_scorers,
      organism=dna_client.Organism.HOMO_SAPIENS,
  )
  all_scores.append(scores)

# Tidy all scores into a single DataFrame.
df_all_scores = variant_scorers.tidy_scores(all_scores)

# Compute merged splicing scores for all variants.
merged_scores = compute_merged_splicing_score(df_all_scores)
merged_scores

Score interpretation#

The score is in theory unbounded from 0 to infinity. Because the splice junction scorer is computing log fold change. However, emperically the majority of variants are in the range [0, 6].

We are working on a recommended variant score cutoff. We see in practice variants with score > 1.0 generally have large effect.