Gene association analysis

Modified on Fri, 16 Feb at 4:47 PM

TABLE OF CONTENTS

Introduction
1 Creating a plot
2 Setting parameters
- 2.1 Design
- 2.2 Parameters
3 Performing the gene association analysis
4 Visualization

Introduction

The gene association analysis allows you to quantify the correlation of a single gene of interest with all other genes in scRNA-seq data. A strong correlation may indicate a possible interaction or shared function that could be worth studying further.

The gene association analysis module computes the correlation coefficient between a specific gene and all other genes, thus producing a ranked list of genes. The highest-ranked genes, those with the largest correlation coefficients, are the genes most associated with the gene of interest.

As with all other analyses that produce a ranked list of genes (e.g. differential gene expression analysis, marker gene analysis), the output of the gene association analysis can be used as input for the gene set enrichment, rank-based and rule-based meta-analyses.

We use the expression "gene association analysis" to describe this analysis in the context of a concrete example. However, this module can be used for estimating associations between any type of omics measurements, including metabolomics, proteomics, etc.

Scenario: Neurological disorders, including Alzheimer's disease, are characterized by intricate molecular interactions that remain incompletely understood. Advances in single-cell RNA sequencing (scRNA-seq) technology offer a promising avenue for unraveling the complexities of genes implicated in these disorders. In this research scenario, our objective is to utilize the gene association module to pinpoint genes exhibiting expression patterns closely mirroring a well-established marker of Alzheimer's disease—the APOE gene. Through this approach, we seek to infer the potential involvement of these associated genes in the underlying disease processes.

1 Creating a plot

As a first step of the analysis, a plot must be created by clicking on the create plot icon in your analysis track.

This will lead to the create plot page. First, enter the plot name and fill in the plot template to provide the proper context for this analysis.

You can then choose the "Gene association" algorithm under "Choose algorithm to run your analysis".

The next step is to choose the data to analyze. This module accepts multiple normalized scRNA-seq datasets from the data pretreatment step.

Using the "Select Cells" button, you can choose the observations to use as input. For more information see the section on Cell/sample selection.

It is strongly advised to use normalized and scaled data, and to ensure that any batch effect has been removed from the data. Using unnormalized or unscaled data makes it impossible to compare association values across genes. Batch effects can create spurious associations as well as hide real ones, in either case undermining the validity of the analysis.
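To make the point about batch effects concrete, here is a minimal sketch using only the Python standard library. The toy values, the `pearson` helper and the per-batch mean-centering in `center_per_batch` are illustrative assumptions; the platform's own batch correction may work differently.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def center_per_batch(values, batches):
    """Subtract each batch's mean (assumed, simplistic batch correction)."""
    totals, counts = {}, {}
    for v, b in zip(values, batches):
        totals[b] = totals.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return [v - totals[b] / counts[b] for v, b in zip(values, batches)]

# Hypothetical toy data: within each batch the two genes are unrelated,
# but batch "B" shifts both genes upward by the same amount.
batches = ["A"] * 4 + ["B"] * 4
gene_x = [-1.0, 1.0, -1.0, 1.0, 9.0, 11.0, 9.0, 11.0]
gene_y = [-1.0, -1.0, 1.0, 1.0, 9.0, 9.0, 11.0, 11.0]

r_before = pearson(gene_x, gene_y)
r_after = pearson(center_per_batch(gene_x, batches),
                  center_per_batch(gene_y, batches))

print(f"with batch effect:    r = {r_before:.3f}")  # strong, but spurious
print(f"after centering:      r = {r_after:.3f}")   # no association left
```

The shared batch shift alone produces a correlation above 0.9 between two genes that are unrelated within each batch. Once the shift is removed, the correlation drops to zero.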

2 Setting parameters

Once all the input has been selected, the "Set parameters" field is displayed with the following tabs: Design and Parameters.

2.1 Design

In this tab you can select the following two options:

Select feature of interest: This is the gene you want to compare against all other genes. The analysis will compute the correlation between your gene of interest and every other gene present in the dataset.
Partition: This dropdown menu lets you set how the data will be split by selecting one of the categorical metadata variables in your dataset. For example, selecting a "Cell type" metadata variable will create a ranked list of associated genes for each cell type in your dataset.
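The effect of the Partition option can be sketched as follows: cells are grouped by a categorical metadata variable, and a separate ranked list is produced for each group. This is a hedged illustration, not the module's implementation. The cell records, the gene names `GENE_A` and `GENE_B`, and the function `ranked_by_partition` are hypothetical.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def ranked_by_partition(cells, gene_of_interest, partition_key, genes):
    """Return {group label: genes sorted from most to least correlated}."""
    groups = defaultdict(list)
    for cell in cells:
        groups[cell[partition_key]].append(cell)

    ranked = {}
    for label, members in groups.items():
        target = [c[gene_of_interest] for c in members]
        scores = {}
        for gene in genes:
            if gene == gene_of_interest:
                continue
            r = pearson(target, [c[gene] for c in members])
            if r is not None:  # skip genes with no defined score
                scores[gene] = r
        ranked[label] = sorted(scores, key=scores.get, reverse=True)
    return ranked

# Made-up expression values, one dict per cell.
cells = [
    {"cell_type": "astrocyte", "APOE": 1.0, "GENE_A": 2.0, "GENE_B": 3.0},
    {"cell_type": "astrocyte", "APOE": 2.0, "GENE_A": 4.0, "GENE_B": 2.0},
    {"cell_type": "astrocyte", "APOE": 3.0, "GENE_A": 6.0, "GENE_B": 1.0},
    {"cell_type": "microglia", "APOE": 1.0, "GENE_A": 5.0, "GENE_B": 1.0},
    {"cell_type": "microglia", "APOE": 2.0, "GENE_A": 3.0, "GENE_B": 2.0},
    {"cell_type": "microglia", "APOE": 3.0, "GENE_A": 1.0, "GENE_B": 4.0},
]

ranked = ranked_by_partition(cells, "APOE", "cell_type",
                             ["APOE", "GENE_A", "GENE_B"])
for label, genes in ranked.items():
    print(label, genes)
```

In this toy data the same gene ranks first in one cell type and last in the other, which is the kind of group-specific association that partitioning is meant to expose.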

2.2 Parameters

In this tab you can select the following three options:

Batch: Selecting a categorical variable from this menu will remove all the variation arising from it. For example, you can select a "Batch" categorical metadata variable to remove the batch effect from your data prior to gene association calculation. Note that you can't select the same metadata variable in this dropdown menu and in the "Partition" dropdown menu.
Scaling: Determines how your data is scaled. For more information about data scaling, see the section on Useful concepts.
Correlation method:
- Pearson: The default correlation method. The classical correlation coefficient, which detects linear association between two variables. It is sensitive to outliers.
- Spearman: Detects monotonic trends in a robust way by applying the Pearson correlation coefficient to ranks rather than to the raw values. It is less prone to being influenced by outliers.

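The univariate correlation is computed with the classical formula of Pearson's correlation coefficient:

$$
r = \frac{\sum_{i=1}^{n} (G_i - \bar{G})(g_i - \bar{g})}{\sqrt{\sum_{i=1}^{n} (G_i - \bar{G})^2}\,\sqrt{\sum_{i=1}^{n} (g_i - \bar{g})^2}}
$$

Here $G_i$ represents the expression level of the gene of interest in sample $i$, while $g_i$ is the expression of the other gene in sample $i$; $\bar{G}$ and $\bar{g}$ are the corresponding means over the $n$ samples. Spearman's correlation uses exactly the same formula, but with the expression values replaced by their respective ranks.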
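The two correlation methods described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the module's code. The helpers `pearson`, `ranks`, `spearman` and `associate`, the gene names other than APOE, and the expression values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson formula applied to ranks."""
    return pearson(ranks(x), ranks(y))

def associate(expression, gene_of_interest, method):
    """Score every other gene against the gene of interest."""
    target = expression[gene_of_interest]
    return {gene: method(target, values)
            for gene, values in expression.items()
            if gene != gene_of_interest}

# Made-up expression values across five samples.
expression = {
    "APOE":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 10.0],   # linear in APOE
    "GENE_B": [1.0, 4.0, 9.0, 16.0, 25.0],  # monotonic but not linear
    "GENE_C": [5.0, 4.0, 3.0, 2.0, 1.0],    # decreasing in APOE
}

pearson_scores = associate(expression, "APOE", pearson)
spearman_scores = associate(expression, "APOE", spearman)
print(pearson_scores)
print(spearman_scores)
```

For the monotonic but non-linear gene, the Pearson score falls slightly below 1 while the Spearman score is exactly 1, because only the ordering of the values matters once they are replaced by ranks.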

3 Performing the gene association analysis

When all the parameters are set, you can click on the "Run" button to compute the gene association analysis results. As soon as the results are computed, an interactive plot will appear in the track. Clicking on "View interactively" will allow you to view the results of the gene association analysis in the interactive plot page.

4 Visualization

In the interactive plot page of the gene association analysis, you can view the results in a correlation table. For each gene and partition group in your dataset, a correlation score is calculated.

The columns in the table are:

Rank: the rank for the features on that row.
All other columns: the groups you chose to include in the gene association analysis.

These results can now be used as input for downstream analyses such as the gene set enrichment, rank-based and rule-based meta-analyses. By default, when used as input for these analyses, the genes are ranked according to their score, from most positive to most negative.

Notice that some genes do not have a correlation score (i.e. the corresponding cell is blank). This happens when the expression of the gene is too low to calculate a correlation. These genes are not included in the downstream analyses.
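The default ordering and the handling of blank scores can be sketched as below. The scores and gene names are made up, `None` stands in for a blank table cell, and `rank_for_downstream` is a hypothetical helper rather than a platform function.

```python
def rank_for_downstream(scores):
    """Drop genes without a score, then sort from most positive to most negative."""
    kept = {gene: score for gene, score in scores.items() if score is not None}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)

# Made-up correlation scores for one partition group; None marks a blank cell.
scores = {
    "GENE_A": 0.82,
    "GENE_B": -0.40,
    "GENE_C": None,
    "GENE_D": 0.15,
}

ranked_genes = rank_for_downstream(scores)
for position, (gene, score) in enumerate(ranked_genes, start=1):
    print(position, gene, score)
```

With these values the resulting order is GENE_A, GENE_D, GENE_B, and GENE_C is left out because it has no score.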