Clustering is a technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. We can use clustering analysis to gain some valuable insights from our data by finding which groups the data points fall into when we apply a clustering algorithm.

The UniApp enables you to perform clustering on all types of datasets, using different methods. The available methods are hierarchical clustering, K-means clustering and Louvain clustering (the latter mostly used for scRNA-seq datasets).

The visualization for the clustering analysis is based on the results of the *Dimensionality reduction* analysis (the dimensionality reduction plot is used). You should perform the dimensionality reduction before doing the clustering, otherwise you won’t have a visual aid available to help you in the clustering process (you will not be able to see the shape and position of the clusters).

**1 Algorithm settings**

In the *Algorithm* box you can define how to perform the clustering. First, define the input and the clustering method to use, and then set the parameters for the selected method.

**Select cells**: in the *Select cells* tab, you can choose the observations to use as input. For more information, see the section on __Cell/sample selection__.

**Input**: can be *Normal* or *Engineered*. For more information about the data to use as input, see the section on __Useful concepts__.

**Scaling**: how you want to scale your data. For more information about data scaling, see the section on __Useful concepts__.

**Method**: select one of the available clustering methods: hierarchical clustering, K-means clustering or Louvain clustering (mostly used in scRNA-seq datasets).

**Parameters**: parameter settings for the selected clustering method.

**Download results**: downloads the output of the algorithm.

**Run**: submits the job.

**Bookmark links**: share the analysis with your colleagues or our consultants.

**1.2 Clustering method settings**

The currently available clustering methods in the UniApp are:

**Hierarchical clustering**: seeks to build a hierarchy of clusters, starting by clustering the observations in a pair-wise way. Initially, each observation is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until only a single cluster remains. It can be used on all types of data.

**K-means**: aims to partition the observations into clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. At each iteration, the centroid of each cluster is recalculated and the clusters are redefined by distance from the centroids. It can be used on all types of data.

**Louvain**: a graph-based clustering approach. Distances between the cells are calculated based on previously identified components. This method embeds cells in a graph structure (for example a K-nearest-neighbour graph), with edges drawn between cells with similar expression patterns, and then attempts to partition this graph into highly interconnected quasi-cliques or communities. It is mostly used for single-cell RNA-seq datasets.

Each clustering method has its own parameters, which will be explained in the corresponding subsection.

**1.2.1 Method settings for hierarchical clustering**

The parameters you can set are the following:

**Number of PCA dimensions**: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.

**Distance**: the distance measure to use between observations during the distance calculation step. See the section on __Useful concepts__ for more information on distance measures.

**Agglomeration method**: the agglomeration method to use to cluster the observations. See the section on __Useful concepts__ for more information on agglomeration methods.

**Use squared distance**: whether to use squared distances or not. This can be useful with some agglomeration methods, such as *Centroid*.
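To illustrate how these settings fit together, here is a minimal sketch using NumPy and SciPy equivalents of each step. This is not the UniApp's own code; the toy data, the 10-dimension cut-off, and the two-cluster cut are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy data: two well-separated groups of observations (100 x 20)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(6, 1, (50, 20))])

# "Number of PCA dimensions": project onto the first 10 principal components
Xc = X - X.mean(axis=0)
scores = (Xc @ np.linalg.svd(Xc, full_matrices=False)[2].T)[:, :10]

# "Distance": pairwise Euclidean distances between observations
d = pdist(scores, metric="euclidean")

# "Use squared distance": square d here first if the agglomeration method calls for it
# "Agglomeration method": Ward linkage, merging the two closest clusters at each step
Z = linkage(d, method="ward")

# Cut the resulting hierarchy into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the tree at different heights (or cluster counts) lets you inspect coarser or finer groupings from a single linkage run.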

**1.2.2 Method settings for K-means clustering**

The parameters you can set are the following:

**Number of PCA dimensions**: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.

**Number of clusters**: the expected number of clusters. It must be a positive number less than the number of observations in the data.
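As a rough illustration of these parameters, here is a sketch using SciPy's `kmeans2` as a stand-in for the UniApp's K-means step. The toy data and the choice of k = 3 are invented for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy data: three well-separated groups (120 observations x 5 features)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (40, 5)) for c in (0, 5, 10)])

# "Number of clusters": k must be positive and below the observation count.
# k-means++ seeding plus a fixed seed makes the run reproducible; each
# iteration recomputes the centroids and reassigns points to the nearest one.
centroids, labels = kmeans2(X, k=3, minit="++", seed=1)
```

Because K-means needs k up front, it is common to try several values and compare the resulting partitions before settling on one.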

**1.2.3 Method settings for Louvain clustering**

The parameters you can set are the following:

**Number of PCA dimensions**: the number of principal components to use for clustering; generally it should be the same number as used for t-SNE or UMAP.

**Number of neighbours**: the number of neighbours to use for the k-nearest-neighbour step. It determines the size of the smallest possible cluster.

**Resolution**: sets the granularity of the downstream clustering, with higher values leading to a greater number of clusters.

**Random seed**: since Louvain clustering is a stochastic algorithm, the same random seed must be used to reproduce the same results with the same parameters.

Unlike the other clustering methods, Louvain does not have a parameter that specifies the number of clusters to compute. To reach the expected number of clusters, adjust the *Number of neighbours* and *Resolution* parameters until you obtain it. It is advised to perform a __Brute force analysis__ to get a feeling for the right number of true biological clusters.
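Such a resolution sweep can be sketched with NetworkX's Louvain implementation standing in for the UniApp's (the toy data, the choice of 10 neighbours, and the resolution grid are invented for the example):

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

# Toy data: two well-separated groups of "cells" in 2D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

# "Number of neighbours": connect each cell to its 10 nearest neighbours
_, idx = cKDTree(X).query(X, k=11)  # the first neighbour is the cell itself
G = nx.Graph((i, j) for i, row in enumerate(idx) for j in row[1:])

# "Resolution" sweep: higher values tend to give more, smaller communities.
# "Random seed": a fixed seed keeps each stochastic run reproducible.
for res in (0.5, 1.0, 2.0):
    parts = nx.community.louvain_communities(G, resolution=res, seed=7)
    print(f"resolution {res}: {len(parts)} clusters")
```

Inspecting how the cluster count changes across the grid gives the same kind of feeling for stable, biologically plausible partitions that the brute-force analysis provides.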

**2 Performing the clustering**

When the parameters are all set up, you can click the *Run* button to compute the clustering results. This can take quite some time, depending on the clustering method and the size of the data.

As soon as the clustering is computed, the plot (if present) will be color-coded with the clustering results, and the clustering results will be added to the **AutomatedClustering** metadata variable. Any previous clustering results will be overwritten.

The **AutomatedClustering** metadata variable is reserved by the UniApp. This means that it cannot be used in most analyses. It is recommended to make a copy of the **AutomatedClustering** variable if you want to use it, see the *Copy column* subsection in the *Metadata manipulation* section. For example, when annotating clusters you can make a copy of the **AutomatedClustering** metadata column and edit its content to assign biologically meaningful annotation to automated clusters.

**3 Clustering visualization settings**

The visualization settings in the clustering module are the same as in the dimensionality reduction module. For more information, see the **Visualization settings** described in the chapter on *Dimensionality reduction*.
