**Table of contents**

- Introduction
- 1 Algorithm settings
- 2 Performing the clustering
- 3 Clustering visualization settings

**Introduction**

Clustering is a technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. We can use clustering analysis to gain some valuable insights from our data by finding which groups the data points fall into when we apply a clustering algorithm.

The UniApp enables you to perform clustering on all types of datasets using different methods: hierarchical clustering, K-means clustering, and Louvain clustering (mostly used on scRNA-seq datasets).

The visualization for the clustering analysis is based on the results of the Dimensionality reduction analysis (the dimensionality reduction plot is used). You should perform the dimensionality reduction before doing the clustering, otherwise you won’t have a visual aid available to help you in the clustering process (you will not be able to see the shape and position of the clusters).

# 1 Algorithm settings

In the Algorithm box you can define how to perform the clustering. First, define the input and the clustering method to use, and then set the parameters for the selected method.

- Select observations: in the select cells tab, you can choose the observations to use as input. For more information, see the section on Observation selection.
- Scaling: how you want to scale your data. For more information about data scaling, see the section on __Useful concepts__.
- Method: select one of the available clustering methods: hierarchical clustering, K-means clustering, or Louvain clustering (mostly used on scRNA-seq datasets).
- Parameters: the parameter settings for the selected clustering method.
- Run: submit the job.
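The effect of the Scaling setting can be illustrated with one common choice, z-score scaling (the matrix below is a made-up toy example, not UniApp data):

```python
import numpy as np

# Toy expression matrix: observations in rows, features in columns
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

# Z-score scaling per feature: each column gets mean 0 and standard
# deviation 1, so features on very different ranges contribute
# comparably to the distances used by the clustering methods
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
```

Without such scaling, the second feature would dominate every distance calculation simply because its values are a hundred times larger.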

## 1.2 Clustering method settings

The currently available clustering methods in the UniApp are:

- Hierarchical clustering: builds a hierarchy of clusters, starting by clustering the observations in a pair-wise way. Initially each observation is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until only a single cluster remains. It can be used on all types of data.
- K-means: partitions the observations into clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. At each iteration the centroid of every cluster is recalculated and observations are reassigned to the nearest centroid. It can be used on all types of data.
- Louvain: a graph-based clustering approach. Distances between cells are calculated based on previously identified components. The method embeds cells in a graph structure (for example a K-nearest-neighbour graph), with edges drawn between cells with similar expression patterns, and then attempts to partition this graph into highly interconnected quasi-cliques or communities. It is mostly used for single-cell RNA-seq datasets.

Each clustering method has its own parameters, which will be explained in the corresponding subsection.

## 1.2.1 Method settings for hierarchical clustering

The parameters you can set are the following:

- Number of PCA dimensions: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.
- Distance: the distance measure to use between observations during the distance calculation step. See the section on __Useful concepts__ for more information on distance measures.
- Agglomeration method: the agglomeration method to use to cluster the observations. See the section on __Useful concepts__ for more information on agglomeration methods.
- Use squared distance: whether to use squared distances. This can be useful with some agglomeration methods, like Centroid.

## 1.2.2 Method settings for K-means clustering

The parameters you can set are the following:

- Number of PCA dimensions: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.
- Number of clusters: the expected number of clusters. It must be a positive number less than the number of observations in the data.
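The assignment and centroid-update iteration described above can be sketched in a few lines of NumPy (a minimal Lloyd's algorithm on toy data, not the UniApp's actual implementation):

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen observations
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(n_clusters)])
    return labels, centroids

rng = np.random.default_rng(1)
# Toy stand-in for a few PCA dimensions of 60 observations in two groups
pca_coords = np.vstack([
    rng.normal(0.0, 0.2, size=(30, 5)),
    rng.normal(4.0, 0.2, size=(30, 5)),
])
labels, _ = kmeans(pca_coords, n_clusters=2)
```

Note that, unlike hierarchical clustering, the number of clusters must be fixed before the iteration starts, which is why it is a required parameter here.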

## 1.2.3 Method settings for Louvain clustering

The parameters you can set are the following:

- Number of PCA dimensions: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.
- Number of neighbours: the number of neighbours to use for the k-nearest-neighbour step. It determines the size of the smallest possible cluster.
- Resolution: sets the granularity of the downstream clustering, with higher values leading to a greater number of clusters.
- Random seed: since Louvain clustering is a stochastic algorithm, the same random seed must be used to reproduce the same results with the same parameters.

Unlike the other clustering methods, Louvain does not have a parameter to specify the number of clusters to compute. To obtain the expected number of clusters, adjust the Number of neighbours and Resolution parameters until you reach it. It is advised to perform a __Brute force analysis__ to get a feeling for the right number of true biological clusters.
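The full Louvain algorithm is involved, but its first step, embedding cells in a K-nearest-neighbour graph, can be sketched with NumPy; `n_neighbours` below plays the role of the Number of neighbours setting, and the data is a made-up toy example:

```python
import numpy as np

def knn_edges(points, n_neighbours):
    """Edges of a K-nearest-neighbour graph: each cell is linked to the
    n_neighbours cells closest to it in PCA space."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbour
    edges = set()
    for i, row in enumerate(dists):
        for j in np.argsort(row)[:n_neighbours]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

rng = np.random.default_rng(2)
# Toy stand-in for 3 PCA dimensions of 30 cells in two distinct groups
cells = np.vstack([
    rng.normal(0.0, 0.2, size=(15, 3)),
    rng.normal(5.0, 0.2, size=(15, 3)),
])
edges = knn_edges(cells, n_neighbours=5)
# With well-separated groups and few neighbours, no edge crosses groups,
# so community detection on this graph cannot merge the two groups
cross = [e for e in edges if (e[0] < 15) != (e[1] < 15)]
```

Louvain then partitions this graph into communities, which is why a small Number of neighbours tends to give many small clusters and a large one fewer, larger clusters.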

# 2 Performing the clustering

When the parameters are all set up, you can click on the Run button to compute the clustering results. This can take quite some time, depending on the clustering method and the size of the data.

Once the clustering is computed, the plot (if present) will be color coded with the clustering results, and the results will be added to the AutomatedClustering metadata variable. Any previous clustering results will be overwritten.

The AutomatedClustering metadata variable is reserved by the UniApp. This means that it cannot be used in most analyses. It is recommended to make a copy of the AutomatedClustering variable if you want to use it, see the Copy column subsection in the Metadata manipulation section. For example, when annotating clusters you can make a copy of the AutomatedClustering metadata column and edit its content to assign biologically meaningful annotation to automated clusters.
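The copy-then-annotate workflow described above might look as follows in a table-oriented sketch (the `CellType` column name and the cluster-to-cell-type mapping are hypothetical, and the UniApp does this through its interface, not through code):

```python
import pandas as pd

# Hypothetical observation metadata table; AutomatedClustering is the
# reserved column the UniApp writes clustering results into
metadata = pd.DataFrame({
    "AutomatedClustering": ["1", "1", "2", "2", "3"],
})

# Copy the reserved column so the copy can be used in other analyses
metadata["CellType"] = metadata["AutomatedClustering"]

# Annotate: replace automated cluster ids with biological labels
metadata["CellType"] = metadata["CellType"].replace(
    {"1": "T cells", "2": "B cells", "3": "Monocytes"}
)
```

The reserved column is left untouched, so rerunning the clustering overwrites only AutomatedClustering while the annotated copy survives.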

# 3 Clustering visualization settings

The visualization settings in the clustering module are the same as in the dimensionality reduction module. For more information on these visualization settings, see the chapter on __Dimensionality reduction__.
