Quantify the association between a gene of interest and all other genes
During the analysis of (single-cell) gene expression data, we may be interested in quantifying correlations between genes. A strong correlation may indicate the presence of a regulatory mechanism between genes, or that the two genes respond similarly to the same stimulus. Either way, a large positive (or negative) association between genes may point to a possible interaction that is worth studying further.
The network analysis module computes the association between a specific gene (the gene of interest, or GOI) and all other genes. Associations can be quantified across all samples at once, or separately for different subgroups of samples. In the latter case, association values across subgroups can be combined through meta-analysis to obtain a single association value for each gene. Furthermore, associations can be either univariate, meaning that the correlation between the GOI and each gene is estimated without considering other genes, or partial, meaning that they indicate how much of the association with the GOI cannot be explained by the mediation of the other genes.
We use the expression "gene association analysis" to describe this analysis in the context of a concrete example. However, this module can be used for estimating associations between any type of omics measurements, including metabolomics, proteomics, etc.
1 Creating a Plot
The first step of the analysis is to create a plot by clicking on the create plot icon. This will lead to a section where the algorithm of interest can be selected, in this case Network analysis.
To keep the plots organized, assign a name and a description to the analysis in the appropriate fields. Under "Choose algorithm to run your analysis", select "Network analysis".
2 Selecting data
In the field "Choose track element", the input can be a Normal or an Engineered matrix. For more information about the data to use as input, see the section on Useful concepts.
Using the "Select Cells" button, you can choose the observations to use as input. For more information see the section on Cell/sample selection.
It is strongly advised to use normalized and scaled data, and to ensure that any batch effect has been removed from the data.
- Using unnormalized or unscaled data makes it impossible to compare association values across genes
- Batch effects can create spurious associations, as well as hide real ones, either way hindering the validity of the analysis
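As an illustration of what "scaled" means here, the following NumPy sketch centers each gene to mean 0 and scales it to unit variance. The matrix layout (samples in rows, genes in columns), the simulated values and the function name `zscore_genes` are assumptions made for this example; it is not the platform's own normalization procedure.

```python
import numpy as np

def zscore_genes(expr):
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at 0 instead of dividing by zero
    return (expr - mean) / std

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples x 3 genes measured on very different scales.
expr = rng.normal(loc=[5.0, 200.0, 0.1], scale=[1.0, 40.0, 0.02], size=(50, 3))
scaled = zscore_genes(expr)
```

After this step every gene contributes on the same scale, which matters in particular for regression coefficients, whose size depends on the units of the predictors.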
3 Setting parameters
Once all input tracks have been selected, the Set parameters field will be displayed with the following tabs: Design, Feature of interest and Association type.
3.1 Setting parameters - Design
Here you can define whether the network analysis should be performed separately for subgroups of samples, or on all samples together. To work with all samples, leave "Design" empty. Otherwise, choose a categorical column in your metadata. Samples will be partitioned according to the groups defined by the column you chose.
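The partitioning can be sketched as follows: samples are split by a categorical label and the correlation with the GOI is computed within each group. The sketch also shows one common way of combining the per-group values into a single one, a fixed-effect average of Fisher z-transformed correlations. The document does not state which meta-analysis method the module uses, so this choice, the simulated data and all function names are assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def groupwise_correlation(goi, gene, groups):
    """Return {group label: (Pearson r, number of samples in the group)}."""
    result = {}
    for label in np.unique(groups):
        mask = groups == label
        result[str(label)] = (pearson(goi[mask], gene[mask]), int(mask.sum()))
    return result

def combine_fisher_z(per_group):
    """Assumed combination rule: average Fisher z values with weights n - 3."""
    z = np.array([np.arctanh(r) for r, _ in per_group.values()])
    w = np.array([n - 3 for _, n in per_group.values()], dtype=float)
    return float(np.tanh((w * z).sum() / w.sum()))

rng = np.random.default_rng(2)
# Hypothetical metadata column with two groups of 30 samples each.
groups = np.array(["A"] * 30 + ["B"] * 30)
goi = rng.normal(size=60)
gene = 0.8 * goi + rng.normal(scale=0.5, size=60)

per_group = groupwise_correlation(goi, gene, groups)
combined = combine_fisher_z(per_group)
```

The combined value always lies between the smallest and the largest per-group correlation, with larger groups pulling it more strongly towards their own estimate.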
3.2 Setting parameters - Feature of interest
Feature of interest: this is the gene we want to contrast against all other genes. The analysis will compute the correlation between your gene of interest and every other gene present in the dataset.
3.2 Setting parameters - Association type
The user can decide whether to use a univariate approach or to perform a partial association analysis. See below for details.
3.2.1 Univariate association
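Univariate correlation can be computed with the classical formula of Pearson's correlation coefficient:
$$r = \frac{\sum_{i=1}^{n} (G_i - \bar{G})(g_i - \bar{g})}{\sqrt{\sum_{i=1}^{n} (G_i - \bar{G})^2} \, \sqrt{\sum_{i=1}^{n} (g_i - \bar{g})^2}}$$
Here "Gi" represents the expression level of the gene of interest in sample i, while "gi" is the expression of the other gene in sample i; \bar{G} and \bar{g} are the corresponding means over the n samples.
Spearman's correlation uses exactly the same formula; however, the expression values are replaced with their respective ranks.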
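The two coefficients described above can be sketched in a few lines of NumPy. The function names, the sample-by-gene matrix layout and the toy values are assumptions for illustration only; the module's own implementation is not shown in this document.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def average_ranks(values):
    """Rank values from 1 to n; tied values receive their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):  # simple tie handling, fine for a sketch
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson's formula applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def goi_vs_all(expr, goi_index, method="pearson"):
    """Correlate the GOI column with every other column of expr."""
    corr = pearson if method == "pearson" else spearman
    goi = expr[:, goi_index]
    return {j: corr(goi, expr[:, j])
            for j in range(expr.shape[1]) if j != goi_index}

# Toy matrix: 5 samples x 3 genes; column 0 plays the role of the GOI.
expr = np.array([[1, 1, 5],
                 [2, 4, 4],
                 [3, 9, 3],
                 [4, 16, 2],
                 [5, 25, 1]], dtype=float)
pearson_result = goi_vs_all(expr, 0, method="pearson")
spearman_result = goi_vs_all(expr, 0, method="spearman")
```

In this toy matrix the second gene is a monotone but non-linear function of the GOI, so its Spearman coefficient is 1 while its Pearson coefficient is slightly lower. This is the practical difference between the two options.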
3.2.1.1 Correlation type - Pearson vs Spearman
If a univariate analysis is selected, the user can also specify whether Pearson's or Spearman's formula should be used.
3.2.2 Partial association
Partial association values are computed by regressing the GOI against all other genes using a ridge regression approach. In this type of regression we try to find the coefficients that minimize an objective which, in its standard form, reads:
$$\hat{\beta} = \arg\min_{\beta_0, \beta} \sum_{i=1}^{n} \Big( G_i - \beta_0 - \sum_{j=1}^{p} \beta_j g_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
Here g_{ij} is the expression of gene j in sample i, and p is the number of other genes. The first part of the expression tries to identify the beta coefficients that best approximate the GOI expression values using the other genes as independent predictors. The second part of the expression:
$$\lambda \sum_{j=1}^{p} \beta_j^2$$
is a penalization term that ensures that the coefficients do not assume arbitrarily large values, thus avoiding overfitting and allowing the model to be identified even when the number of genes is much larger than the number of samples. The parameter lambda is automatically identified using a cross-validation approach. Our implementation uses the R package glmnet; more details can be found in the glmnet documentation.
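The ridge approach described above can be illustrated with a minimal NumPy sketch: a closed-form ridge fit plus a simple k-fold cross-validation over a small grid of lambda values. This is not the glmnet implementation used by the module. glmnet scales its objective and standardizes predictors differently, so lambda values are not comparable between the two. The lambda grid, the fold layout, the simulated data and all function names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for centered data: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Mean squared prediction error of ridge over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Center with training data only, so the held-out fold stays unseen.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = ridge_fit(X[train] - x_mean, y[train] - y_mean, lam)
        pred = (X[test_idx] - x_mean) @ beta + y_mean
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

def partial_associations(X, y, lambdas):
    """Pick lambda by cross-validation, then refit on all samples."""
    best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
    beta = ridge_fit(X - X.mean(axis=0), y - y.mean(), best)
    return beta, best

rng = np.random.default_rng(1)
n_samples, n_genes = 30, 100  # more genes than samples
X = rng.normal(size=(n_samples, n_genes))  # expression of the other genes
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)  # GOI
lambdas = [0.1, 1.0, 10.0, 100.0]
beta, best_lambda = partial_associations(X, y, lambdas)
```

Because the penalty adds lambda to the diagonal of X'X, the linear system stays solvable even though there are more genes than samples, which is the identifiability property mentioned above.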
4 Performing the gene association analysis
Once all parameters are set, click on the Run button to run the analysis.
As soon as the analysis is over, a new table will appear in your track. You can click on the "View interactively" button to explore the results of the network analysis in the interactive plot page.
5 Network analysis interactive plot page
You can visualize the network analysis results as a ranks table, as a correlations table or as a network.
5.1 Ranks table
The table is interactive and sortable. The columns in the table are:
- Ranks: the rank of the genes
- All other columns: the ranked list of genes in each group, based on the association coefficients
5.2 Correlations table
The columns in the table are:
- Genes: the names of the genes
- All other columns: the association coefficients estimated during the analysis, one column per group.
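The relationship between the two tables can be sketched as follows: starting from a correlations table (one column of coefficients per group), each group's genes are sorted to produce the corresponding column of the ranks table. The gene names, group names and coefficients are hypothetical, and sorting by decreasing coefficient is an assumption; the document does not state whether the module ranks by signed or by absolute value.

```python
# Hypothetical correlations table: {group: {gene: association coefficient}}.
correlations = {
    "group_A": {"GENE1": 0.82, "GENE2": -0.40, "GENE3": 0.15},
    "group_B": {"GENE1": 0.10, "GENE2": 0.55, "GENE3": -0.70},
}

def ranks_table(correlations):
    """Build the ranks table: one row per rank, one gene column per group."""
    groups = list(correlations)
    # Assumed ordering: genes sorted by decreasing coefficient within a group.
    ordered = {
        g: sorted(correlations[g], key=correlations[g].get, reverse=True)
        for g in groups
    }
    n_genes = len(next(iter(ordered.values())))
    header = ["Ranks"] + groups
    rows = [[rank + 1] + [ordered[g][rank] for g in groups]
            for rank in range(n_genes)]
    return header, rows

header, rows = ranks_table(correlations)
for row in [header] + rows:
    print("\t".join(str(cell) for cell in row))
```

Reading a row across groups shows which gene holds the same rank in each subgroup, while reading a column gives the full ordering for one subgroup.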