Gene set association

Modified on Thu, 09 Mar 2023 at 03:01 PM

Identifying key gene players can prove rather challenging in generalized genetic disorders that involve a wide array of symptoms and an unclear mechanism of action. Genetic studies yield large lists of genes, each of which can only be followed up by further investigation, potentially resulting in a sea of meaningless data.


The UniApp uses pathway analysis to create a ranked gene list based on the relevance of the genes to a biological mechanism: genes whose expression pattern is highly similar to that of the input genes are ranked highest. The analysis can be run on various patient signature data, which can not only provide new insights into the mechanisms of poorly understood disease pathways, but also point to potential targets for drug discovery and development. All in all, pathway analysis yields investigation-ready top genes, saving time, money and, most importantly, lives.


Gene sets entered into the pathway analysis are only recognised for further analysis if they use official gene symbol nomenclature. You can use GeneCards or UniProt to find the official symbol of the gene of interest.
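Before pasting a gene list, it can help to check it for aliases. The sketch below is only an illustration of that check, not part of the UniApp: the function name `split_recognised` is hypothetical, and `OFFICIAL_SYMBOLS` is a tiny hand-picked set standing in for a full symbol table that you would obtain from a resource such as GeneCards or UniProt.

```python
# Illustrative only: a tiny hand-picked set of official human gene symbols.
# A real check would use a complete symbol table from a gene database.
OFFICIAL_SYMBOLS = {"TP53", "ERBB2", "EGFR", "VEGFA", "KDR"}


def split_recognised(pasted_text, official=OFFICIAL_SYMBOLS):
    """Split pasted text into recognised and unrecognised gene symbols."""
    recognised, unrecognised = [], []
    seen = set()
    # Accept symbols separated by commas, spaces, tabs or new lines.
    for token in pasted_text.replace(",", " ").split():
        if token in seen:
            continue  # skip duplicates
        seen.add(token)
        if token in official:
            recognised.append(token)
        else:
            unrecognised.append(token)
    return recognised, unrecognised


# HER2 and VEGFR2 are common aliases (official symbols: ERBB2 and KDR),
# so they end up in the unrecognised list.
recognised, unrecognised = split_recognised("TP53, HER2 EGFR\nVEGFR2 TP53")
print("recognised:  ", recognised)
print("unrecognised:", unrecognised)
```

Any symbol that lands in the unrecognised list would need to be looked up and replaced with its official symbol before running the analysis.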


1. Algorithm settings

1.1 Creating a plot

As a first step of the analysis, a plot must be created by clicking on the create plot icon. This will lead you to a section where the analysis of interest (in this case Pathway analysis) can be selected.

The analysis can be organised by assigning a name and description in the respective fields. Subsequently, under "choose algorithm to run your analysis", pathway analysis must be selected.

1.2 Selecting data

Next, the input analysis can be selected as the track element under "choose track element". In the cell selection tab you can choose the observations to use as input. For more information see the section on Cell/sample selection. Note that subsetting at the pretreatment step is a "hard" subset, meaning that cells/samples excluded at this step will not be present in the downstream steps.

1.3 Setting parameters

In the set parameters field you can define how the pathway analysis is performed:

  • Input: From the "Data matrix as input" dropdown menu you select the type of input. Currently only PCA is supported. For this you can use any pretreatment analysis step, since it has precalculated principal components.
  • Method: From the "Method" dropdown menu you select the method. Currently only the neural network is supported.
  • Seed features: In the "Seed features" input you can paste your seed features. The pathway analysis algorithm will find genes with expression similar to that of the genes provided in "Seed features". At least 10 seed features are required for the analysis to run.
  • Advanced: From the "Advanced" tab you can set:
    • Number of bagging samples: Values 1-100 (suggested value: 40). This is the number of times non-seeds are sampled and a model is built from that sample. If 40 is selected, 40 neural networks are trained on 40 random samples and are combined at the end to give a final “probability of seed gene” score. More bagging samples could lead to a more accurate model, but at the expense of processing time.
    • Number of features selected: 100 by default.

Furthermore, at least 1 observational variable is required for the analysis.
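The bagging idea behind the "Number of bagging samples" setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the UniApp's implementation: a simple distance-to-centroid score stands in for the neural network, each bag samples as many non-seeds as there are seeds, and the function names, gene names and toy data are all hypothetical.

```python
import math
import random


def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def bagged_seed_scores(features, seeds, n_bags=40, min_seeds=10, rng_seed=0):
    """Average a stand-in model over n_bags random samples of non-seed genes.

    features: dict mapping gene name -> feature vector (e.g. principal components)
    seeds:    iterable of seed gene names
    """
    seeds = [g for g in seeds if g in features]
    if len(seeds) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seed features, got {len(seeds)}")
    seed_set = set(seeds)
    non_seeds = [g for g in features if g not in seed_set]
    if not non_seeds:
        raise ValueError("no non-seed genes to sample from")

    rng = random.Random(rng_seed)
    seed_centre = centroid([features[g] for g in seeds])
    totals = {g: 0.0 for g in features}
    for _ in range(n_bags):
        # Assumption: each bag draws as many non-seeds as there are seeds.
        sample = rng.sample(non_seeds, min(len(seeds), len(non_seeds)))
        other_centre = centroid([features[g] for g in sample])
        for gene, vec in features.items():
            d_seed = math.dist(vec, seed_centre)
            d_other = math.dist(vec, other_centre)
            total = d_seed + d_other
            # Stand-in for one model's output: close to the seeds -> near 1.
            totals[gene] += d_other / total if total > 0 else 0.5
    # Combine the bags by averaging their scores.
    return {g: t / n_bags for g, t in totals.items()}


# Toy data: 10 seed genes near (1, 1) and 30 other genes near (-1, -1).
rng = random.Random(1)
features = {f"SEED{i}": [1 + rng.gauss(0, 0.1), 1 + rng.gauss(0, 0.1)] for i in range(10)}
features.update({f"OTHER{i}": [-1 + rng.gauss(0, 0.1), -1 + rng.gauss(0, 0.1)] for i in range(30)})
scores = bagged_seed_scores(features, [f"SEED{i}" for i in range(10)], n_bags=40)
print(round(scores["SEED0"], 2), round(scores["OTHER0"], 2))
```

Because every bag fits a model on a different random sample of non-seeds, the run time grows roughly in proportion to the number of bagging samples, which is the trade-off described in the setting above.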

1.4 Running the analysis

Once you have selected all the necessary settings and commands, the analysis can be initiated by clicking on the "Run" icon on the top right. 

2. Interactive Plot Page

The output of the pathway analysis can be visualised through various plots and tables. The type of plot and the other plot parameters can be defined in the algorithm box under "data to show".


2.1 Selecting the plot type

There are 4 options for output visualisation, namely the ROC curve, the cutoff table, the confusion matrix and the ranked list. The analytical value of each is discussed in section 3.

2.2 Defining Plot Parameters

For each plot various plot parameters can be inputted allowing for cut-off values to be defined. 

Show Youden cutoff: Toggles whether to show the optimal threshold. The Youden index measures a diagnostic test's ability to balance sensitivity and specificity. A value of 1 indicates that there are no false positives or false negatives, meaning that the test is perfect.

Show Threshold cutoff: Toggles whether to show the threshold cutoff.

Youden index cutoff: A summary measure of the ROC curve, calculated as sensitivity + specificity - 1. It measures the effectiveness of a diagnostic marker and enables the selection of an optimal threshold value.

Fixed cutoff: If the user has a preferred cutoff score at which a gene is determined to be a seed feature, it can be entered manually in the fixed cutoff section.
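The Youden index cutoff can be sketched in a few lines. This is an illustration only, with made-up scores and a hypothetical function name. It assumes a gene is predicted to be a seed when its score is strictly greater than the cutoff, and it tries every observed score as a candidate cutoff.

```python
def youden_cutoff(scores, is_seed):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1."""
    positives = sum(is_seed)
    negatives = len(is_seed) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one seed and one non-seed gene")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        # True positives: seeds scoring above the cutoff.
        tp = sum(1 for s, y in zip(scores, is_seed) if y and s > cutoff)
        # True negatives: non-seeds scoring at or below the cutoff.
        tn = sum(1 for s, y in zip(scores, is_seed) if not y and s <= cutoff)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Made-up example: 4 seed genes and 4 non-seed genes.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
is_seed = [True, True, False, True, True, False, False, False]
cutoff, j = youden_cutoff(scores, is_seed)
print(cutoff, j)  # 0.35 0.75
```

In this example a cutoff of 0.35 captures all 4 seeds while misclassifying only one non-seed, which gives the largest index of all candidate cutoffs.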

3. Performing the pathway analysis 

As mentioned previously, the results can be visualised in 2 plots and 2 tables.

The first type of plot is the ROC curve, in which the x-axis shows the false positive fraction (1 - clinical specificity) and the y-axis shows the true positive fraction (clinical sensitivity). The ROC curve (along with the confusion matrix) serves as a tool to validate the model. The closer the area under the curve (AUC) is to 1, the higher the sensitivity and the specificity of the model. A model with an AUC of 1 is considered perfect, with the highest sensitivity and specificity.
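To make the axes concrete, the sketch below builds ROC points from a set of made-up scores and computes the AUC with the trapezoidal rule. The function names and data are hypothetical, and this is not necessarily how the UniApp computes its curve.

```python
def roc_points(scores, is_seed):
    """Return ROC points as (false positive fraction, true positive fraction)."""
    positives = sum(is_seed)
    negatives = len(is_seed) - positives
    ranked = sorted(zip(scores, is_seed), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the cutoff step by step, from the highest score to the lowest.
    for i, (score, seed) in enumerate(ranked):
        if seed:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last gene sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up example: 4 seed genes and 4 non-seed genes.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
is_seed = [True, True, False, True, True, False, False, False]
points = roc_points(scores, is_seed)
print(auc(points))  # 0.875
```

An AUC of 0.875 here means that a randomly chosen seed gene outscores a randomly chosen non-seed gene in 14 of the 16 possible pairs.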

The second type of plot that can be generated by the analysis is the confusion matrix. It shows the model's prediction of whether each gene is a seed feature against whether it actually is a seed feature. The default cutoff probability at which the model decides whether a gene is a seed is 0.5. For example, if a gene has a score > 0.5, it is assigned to be a seed gene. As mentioned previously, the function of the plot is to validate the model's prediction of the ranked list.
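The four cells of a confusion matrix can be counted as in the sketch below, which uses the default cutoff of 0.5 and the "score > cutoff" rule described above. The function name and the scores are made up for illustration.

```python
def confusion_matrix(scores, is_seed, cutoff=0.5):
    """Count true/false positives and negatives at a given cutoff."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, seed in zip(scores, is_seed):
        predicted_seed = score > cutoff
        if predicted_seed and seed:
            counts["TP"] += 1   # predicted seed, actually a seed
        elif predicted_seed and not seed:
            counts["FP"] += 1   # predicted seed, actually not a seed
        elif seed:
            counts["FN"] += 1   # predicted non-seed, actually a seed
        else:
            counts["TN"] += 1   # predicted non-seed, actually not a seed
    return counts


# Made-up example: 4 seed genes and 4 non-seed genes.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
is_seed = [True, True, False, True, True, False, False, False]
matrix = confusion_matrix(scores, is_seed)
print(matrix)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```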

The cutoff table shows how the metrics change with the threshold probability used by the model to determine the status of each gene (seed gene or not). A high threshold means that only genes with a very high probability are captured as seed genes, while a low threshold means that many other genes may also be captured as seed genes.
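The effect of the threshold can be illustrated with a small table. The sketch below is an assumption about what such a table might contain (number of predicted seeds, sensitivity and specificity); the metrics shown in the UniApp may differ, and the function name and scores are made up.

```python
def cutoff_table(scores, is_seed, cutoffs):
    """Compute simple metrics for each cutoff (predicted seed if score > cutoff)."""
    positives = sum(is_seed)
    negatives = len(is_seed) - positives
    rows = []
    for cutoff in cutoffs:
        tp = sum(1 for s, y in zip(scores, is_seed) if y and s > cutoff)
        fp = sum(1 for s, y in zip(scores, is_seed) if not y and s > cutoff)
        rows.append({
            "cutoff": cutoff,
            "predicted_seeds": tp + fp,
            "sensitivity": tp / positives,
            "specificity": (negatives - fp) / negatives,
        })
    return rows


# Made-up example: 4 seed genes and 4 non-seed genes.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
is_seed = [True, True, False, True, True, False, False, False]
rows = cutoff_table(scores, is_seed, cutoffs=[0.25, 0.5, 0.75])
for row in rows:
    print(f"{row['cutoff']:.2f}  {row['predicted_seeds']:>2}  "
          f"{row['sensitivity']:.2f}  {row['specificity']:.2f}")
```

Raising the cutoff from 0.25 to 0.75 shrinks the set of predicted seeds from 6 genes to 2, trading sensitivity for specificity.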

The ranked list generated by the algorithm includes the top genes whose expression pattern is predicted to be similar to that of the curated seed genes. Top genes are those the model predicts most confidently to resemble the seeds, and the list is ordered from highest ranked to lowest ranked. The list also includes predicted genes that were not in the original curated list, providing insights into potential novel drug targets.
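Building a ranked list from per-gene scores can be sketched as below. The gene names are placeholders, and `ranked_list` and its `top_n` parameter are hypothetical names used only for this illustration.

```python
def ranked_list(scores_by_gene, seeds, top_n=100):
    """Return (rank, gene, score, is_seed) tuples, highest score first."""
    seed_set = set(seeds)
    ordered = sorted(scores_by_gene.items(), key=lambda item: item[1], reverse=True)
    return [(rank, gene, score, gene in seed_set)
            for rank, (gene, score) in enumerate(ordered[:top_n], start=1)]


# Placeholder gene names and made-up scores.
scores_by_gene = {"GENE_A": 0.91, "GENE_B": 0.15, "GENE_C": 0.78, "GENE_D": 0.66}
ranking = ranked_list(scores_by_gene, seeds=["GENE_A", "GENE_D"], top_n=3)

# Highly ranked genes that were not in the curated seed list are the
# candidates for follow-up.
novel = [gene for _, gene, _, is_seed in ranking if not is_seed]
for row in ranking:
    print(row)
print("novel candidates:", novel)
```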

4. Exporting output

Under "export settings" the preferred export format for the table or plot can be selected, allowing publication-ready analysis output to be downloaded.

5. Useful links

Gene prioritisation analysis 

Guide to understanding ROC

Youden index

Confusion matrix
