Gene set enrichment analysis

Modified on Mon, 26 Feb 2024 at 04:08 PM


Introduction



Analyzing large datasets on a feature-by-feature basis makes it difficult to discern global patterns. For example, metabolism can be divided into well-characterized pathways: a powerful approach to study those pathways is to determine whether the transcripts, proteins or metabolites in a metabolic pathway are deregulated as a group. Such insight can be derived using enrichment analysis. The analysis is not restricted to pathways: it can be applied to any defined set of features.

The UniApp uses GSEA (Gene Set Enrichment Analysis) to perform the competitive set enrichment analysis. GSEA is a computational method that determines whether an a priori defined set of features shows statistically significant differences between two biological states (e.g. phenotypes). This type of enrichment analysis is called competitive because when each set is analysed, all the features in the data are considered, not only the features present in that particular set.
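To illustrate the idea, the running-sum statistic of the classic GSEA method can be sketched in Python as shown here. This is a minimal sketch and not necessarily how the UniApp computes it: the function name `enrichment_score`, the `weight` parameter and the example genes are assumptions made for illustration. Note that every gene in the ranked list contributes to the score, either as a hit or as a miss, which is what makes the test competitive.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one set (illustrative sketch).

    ranked: list of (gene, score) tuples sorted by score, highest first.
    gene_set: set of gene names.
    weight: exponent applied to |score| for genes in the set (assumed default 1.0).
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(in_set)
    hit_total = sum(abs(score) ** weight
                    for (_, score), hit in zip(ranked, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # the score is undefined without both hits and misses

    running, best = 0.0, 0.0
    for (_, score), hit in zip(ranked, in_set):
        if hit:
            running += abs(score) ** weight / hit_total  # step up on a set member
        else:
            running -= 1.0 / n_miss                      # step down on any other gene
        if abs(running) > abs(best):
            best = running                               # keep the largest deviation from zero
    return best


# Hypothetical ranked list, e.g. genes ordered by log fold change.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
top_heavy = enrichment_score(ranked, {"A", "B"})
bottom_heavy = enrichment_score(ranked, {"E", "F"})
```

A set concentrated at the top of the list yields a positive score, and a set concentrated at the bottom yields a negative one.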

Gene set enrichment analysis can be performed using any ranked gene list as input. In the UniApp, ranked gene lists can be created with differential gene expression analysis, marker genes, gene expression entropy, network analysis and rank-based meta-analysis.
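As a rough illustration of what a ranked gene list looks like, the sketch below orders the rows of a differential expression result by a chosen statistic. The field names (`gene`, `log2_fold_change`) and the function `rank_genes` are hypothetical and do not describe the UniApp's internal format.

```python
def rank_genes(results, key="log2_fold_change"):
    """Return (gene, value) pairs sorted from highest to lowest value.

    results: list of dicts, one per gene (hypothetical layout).
    key: name of the statistic used for ranking (assumed field name).
    """
    pairs = ((row["gene"], row[key]) for row in results)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up example rows, for illustration only.
results = [
    {"gene": "G1", "log2_fold_change": -1.5},
    {"gene": "G2", "log2_fold_change": 2.0},
    {"gene": "G3", "log2_fold_change": 0.3},
]
ranked = rank_genes(results)
```

The most upregulated genes end up at the top of the list and the most downregulated genes at the bottom.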


1. Algorithm settings

1.1 Creating a plot




As a first step of the analysis, a plot must be created by clicking on the create plot icon. This will lead you to a section where the analysis of interest (in this case gene set enrichment) can be selected.


The analysis can be organised by assigning a name and a description in the respective fields. Then, under "choose algorithm to run your analysis", select gene set enrichment.


1.2 Selecting data

Next, select the input analysis as the track element under "choose track element". Since the input is a ranked list of genes the is observation selection.


1.3 Setting parameters



The Set parameters field will change depending on your selected input.



When using differential gene expression analysis, gene expression entropy, gene set association analysis or rank-based meta-analysis as input, two tabs are available: Gene set, where you can select the gene set or sets to use in the set enrichment analysis, and Parameters, where you can set the parameters for the set enrichment analysis.

Additionally, when using inputs that contain multiple ranked gene lists arranged in a table, such as cluster marker genes or network analysis, you can select which columns to include in the Select columns to include tab.


1.3.1 Gene set


In the Gene set tab you can select which gene sets to use in your set enrichment analysis. Currently the UniApp supports gene sets hosted on MSigDB and KEGG that are annotated by domain experts. Additionally, you can provide your own custom gene set.

  • Gene set library: select the gene set or sets to use in the set enrichment analysis. Currently you can choose between sets from MSigDB and KEGG, or ask a Unicle member about the possibility of uploading a custom set.
  • Gene set: select a subset of the previously selected library.

1.3.2 Parameters


In the Parameters tab you can specify the parameters for your set enrichment analysis.

  • Minimum set size: defines the minimal set size. Sets whose size is below this threshold are ignored. This is important because the normalized score computed by the algorithm is not very accurate for extremely small sets.
  • Maximum set size: defines the maximal set size. Sets whose size is above this threshold are ignored.
  • Permutations: the number of permutations performed to assess the statistical significance of the enrichment score. The higher the number, the more accurate the p-values will be, at the cost of a longer computation.
  • Random seed: since the statistical significance is assessed by a permutation test, the same random seed must be used to reproduce the same results with the same parameters.
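How these four parameters interact can be sketched as follows. This is a simplified illustration under stated assumptions, not the UniApp's implementation: it uses an unweighted running-sum score, draws random gene sets of the same size as the null distribution, and applies the size limits to the genes of the set that are present in the ranked list. The names `running_sum_es` and `permutation_test` and all default values are hypothetical.

```python
import random


def running_sum_es(ranked_genes, gene_set):
    """Unweighted running-sum score over genes ordered by rank (sketch)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running = best = 0.0
    for hit in hits:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_test(ranked_genes, gene_set, permutations=1000, seed=42,
                     min_size=2, max_size=500):
    """Return (score, p_value), or None if the set is filtered out by size."""
    members = gene_set & set(ranked_genes)
    if not (min_size <= len(members) <= max_size):
        return None  # set ignored: too small or too large

    rng = random.Random(seed)  # fixed seed makes the permutations reproducible
    observed = running_sum_es(ranked_genes, members)
    null = [
        running_sum_es(ranked_genes, set(rng.sample(ranked_genes, len(members))))
        for _ in range(permutations)
    ]
    # Compare only against null scores with the same sign as the observed score.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    extreme = sum(1 for x in same_sign if abs(x) >= abs(observed))
    p_value = (extreme + 1) / (len(same_sign) + 1)
    return observed, p_value


genes = [f"g{i}" for i in range(50)]           # hypothetical ranked gene names
gene_set = {"g0", "g1", "g2", "g3"}            # hypothetical set at the top of the list
first = permutation_test(genes, gene_set, permutations=200, seed=1)
second = permutation_test(genes, gene_set, permutations=200, seed=1)
```

With the same seed and parameters, `first` and `second` are identical. With this estimator the smallest attainable p-value is roughly 1 divided by the number of permutations, which is why more permutations give finer p-values.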

1.3.3 Columns to include

In the columns to include tab you can choose which columns to include in the gene set enrichment analysis in case you have selected marker genes or network analysis as input.


1.4 Running the gene set enrichment analysis


When all parameters are set, click on the Run icon at the top right to compute the gene set enrichment analysis results.

2. Interactive plot page of the gene set enrichment analysis 

The results of the GSEA can be explored in four different ways in the interactive plot page: ranks, statistics, bar plot and waterfall plot.

2.1 Ranks

When clicking on the Ranks button a table will be displayed showing gene sets ranked by NES from highest to lowest. In case you performed the GSEA on multiple columns there will be a resulting ranked gene set column for each of the input columns.


2.2 Statistics

When clicking the Statistics button a table with statistics of the GSEA will be displayed. The table is interactive and sortable. The columns in the table are: 

  • Set: the name of the set.
  • Enrichment score: the enrichment score calculated by the algorithm. It is the degree to which the set is overrepresented at the top or bottom of the ranked list of features in the dataset. This score does not take into account the differences in set sizes, so it cannot be used to compare the different sets.
  • NES: the normalized enrichment score. This score takes into account the differences in set sizes, so it can be used to compare the different sets.
  • Direction: whether a set is upregulated (Up) or downregulated (Down) in the experimental group.
  • P-value: the significance of the result. Usually the significance threshold is set at 0.05.
  • Adjusted p-value: the p-value adjusted for multiple testing with the Benjamini-Hochberg procedure (false discovery rate, FDR).
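For reference, the Benjamini-Hochberg adjustment named in the last column can be sketched as follows. This is a generic textbook version of the procedure, with a hypothetical function name and made-up input values. It is not taken from the UniApp code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four sets.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value is at least as large as the raw p-value, so fewer sets pass a given threshold after adjustment.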

In the Data to plot tab you can select for which input column the results will be displayed in case you used multiple columns as input.


2.3 Bar plot

 

When clicking the Bar plot button a horizontal bar plot will be displayed, in which the x-axis is the NES (normalized enrichment score) and the y-axis represents the sets (from the most upregulated to the most downregulated). By default, the red color represents the upregulated sets (NES > 0), and the blue color represents the downregulated sets (NES < 0).
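The ordering and coloring described above can be sketched as a small data-preparation step. The function name `bar_plot_rows`, the set names and the NES values are hypothetical. Treating an NES of exactly 0 as blue is an assumption of this sketch, since only NES > 0 and NES < 0 are described.

```python
def bar_plot_rows(nes_by_set):
    """Return (set_name, nes, color) rows ordered from highest to lowest NES."""
    rows = sorted(nes_by_set.items(), key=lambda item: item[1], reverse=True)
    # Red for upregulated sets (NES > 0), blue otherwise (assumption for NES == 0).
    return [(name, nes, "red" if nes > 0 else "blue") for name, nes in rows]


# Made-up NES values for three sets.
rows = bar_plot_rows({"SET_A": -1.8, "SET_B": 2.1, "SET_C": 0.9})
```

The resulting rows can be passed to any plotting library to draw one horizontal bar per set.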


In the Data to plot tab you can select for which input column the results will be displayed in case you used multiple columns as input.


2.4 Waterfall plot

When clicking the Waterfall plot button a waterfall plot will be displayed, in which the x-axis is the rank of the sets (from the most downregulated to the most upregulated), and the y-axis is the NES (normalized enrichment score). Each dot on the plot is a set, and you can see the name of each set by hovering over the dot with the pointer of the mouse. By default, the red color represents the upregulated sets (NES > 0), and the blue color represents the downregulated sets (NES < 0).


In the Data to plot tab you can select for which input column the results will be displayed in case you used multiple columns as input.

