Useful concepts

The UniApp is a broad piece of software in which many parameters have to be set before performing an analysis. After the analysis is performed, more parameters have to be set to customize the results (plots, tables, etc.). Since there is a lot of parameter handling to do, we tried to standardize the interface across all analyses, so that they can be performed without having to consult the manual each time. For that reason, many elements and parameters appear multiple times across different analyses. The settings that recur in this way are described in this section.

1 LOESS regression

LOESS regression is a nonparametric technique that uses local weighted regression to fit a smooth curve through data points. The procedure originated as LOWESS (LOcally WEighted Scatterplot Smoother). LOESS is based on the idea that any function can be well approximated in a small neighborhood by a low-order polynomial. LOESS can be useful for fitting a line through noisy or sparse data points, and can reveal trends in data that would be difficult to model with a parametric curve (like linear regression).

The main idea of LOESS is to iteratively fit a low-degree polynomial to a subset of the data, for each point in the dataset. The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The LOESS fit is completed after the regression function values have been computed for each of the n data points.

The low-degree polynomials fit to each subset of the data are almost always of first (local linear regression) or second degree (local polynomial fits). Using a zero-degree polynomial (Nadaraya-Watson estimator, local constant fitting) turns LOESS into a weighted moving average. Such a simple local model might work well for some situations, but may not always approximate the underlying function well enough.
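As an illustration, the sketch below fits a LOWESS curve to simulated noisy data. The use of Python and the statsmodels package is an assumption made for this example only; it is not how the UniApp itself performs the fit.

    # A minimal LOWESS sketch on simulated data (illustration only).
    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # noisy observations

    # frac sets the fraction of points used in each local fit (the neighborhood
    # size); it=3 runs robustifying iterations that down-weight outliers.
    smoothed = lowess(y, x, frac=0.3, it=3)               # returns sorted [x, fitted]
    x_fit, y_fit = smoothed[:, 0], smoothed[:, 1]

A smaller frac follows the data more closely, while a larger frac produces a smoother curve.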

2 Data matrix as input

In some analyses, you can decide which data to use as input. Usually the choice is between the normal data and the engineered data. The normal data is the data that was pretreated during the Pretreatment step, while the engineered data is the data that was generated during the Feature engineering analysis. Most of the time you will use the normal data as input, but in some cases it can be useful to perform an analysis on the engineered data: you could perform Dimensionality reduction on engineered data that contains pathway activities instead of feature abundances, to see the global effect of the pathway activities. Or you could perform Differential analysis on such engineered data, to see the magnitude of change of the pathways between two groups.

You can choose between:

  • Normal: the normal data is the data that was pretreated during the Pretreatment step.
  • Engineered: the engineered data is the data that was generated during the Feature engineering analysis.

In most analyses, only Normal and Engineered will be available. 

3 Scaling

In general data analysis, your dataset will most of the time contain features that vary greatly in magnitude, units and range. This can be a problem when computing the distance between two observations: for example, if you use the Euclidean distance between two data points, these differences in magnitude, units and range are not accounted for in the distance calculation, generating results that you would not expect. The results would vary greatly between different units: features with high magnitudes will weigh in far more in the distance calculation than features with low magnitudes.

Scaling the data is extremely important, and it is done for some analyses in the UniApp, like Dimensionality reduction, since dimensionality reduction techniques such as principal component analysis (PCA) are sensitive to this kind of problem.

The UniApp enables you to choose from these scaling techniques:

  • None: no scaling is performed.
  • Auto: it makes the values of each feature in the data have zero mean and unit variance (z-scores). This is a valid approach to correcting for different feature scaling and units if the predominant source of variance in each feature is signal rather than noise. Under these conditions, each feature is scaled such that its useful signal has an equal footing with the signal of the other variables. However, if a given feature has significant contributions from noise (i.e. a low signal-to-noise ratio) or has a standard deviation near zero, then autoscaling will cause this variable’s noise to have an equal footing with the signal in other variables.
  • Center: centering the data converts all the expression/abundances to fluctuations around zero instead of around the mean of the expression/abundance. It adjusts for differences in the offset between high and low abundant features: it subtracts the mean of the feature from the feature itself. When the data is heteroscedastic, the effect of this method is not always sufficient.
  • Scale: it divides the feature by its own standard deviation. Usually used with Center to perform the Auto scaling.
  • Range: in biology, the biological range could be used instead of the standard deviation to measure the spread of the data. The biological range is the difference between the minimal and the maximal concentration reached by a certain feature. Range scaling uses this biological range as the scaling factor. A disadvantage of range scaling with regard to the other scaling methods is that only two values are used to estimate the biological range, while for the standard deviation all measurements are taken into account. This makes range scaling more sensitive to outliers.
  • Pareto: Pareto scaling is a technique intermediate between centering and autoscaling. With this form of scaling the data is first mean-centered and then divided by the square root of the standard deviation of the variable. The net effect is that larger variables receive more importance than with autoscaling, but less than with mean centering alone.
  • Vast: vast is an acronym for variable stability, and it is an extension of autoscaling. It focuses on stable variables, the variables that do not show strong variation, using the standard deviation and the so-called coefficient of variation as scaling factors. The use of the coefficient of variation results in a higher importance for features with a small relative standard deviation and a lower importance for features with a large relative standard deviation.
  • Level: level scaling uses a size measure instead of a spread measure for the scaling. Level scaling converts the changes in the feature expression/abundance into changes relative to the average expression/abundance of the feature by using the mean as the scaling factor.

By default, Auto scaling is usually performed, since it is one of the most widely used scaling methods. A minimal sketch of how these transformations operate is shown at the end of this section.


There is no single best scaling method. Even though Auto scaling is usually preferred, it is not necessarily the best choice for your own data. Try experimenting with different scaling parameters to see what works best for your data.
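The sketch below shows how the scaling options described above can be expressed as column-wise operations. Python/NumPy and the example matrix are assumptions made purely for illustration; this is not the UniApp code.

    # Column-wise scaling of a small example matrix (rows = observations,
    # columns = features). Illustration only.
    import numpy as np

    X = np.array([[1.0, 200.0, 0.5],
                  [2.0, 180.0, 0.7],
                  [3.0, 240.0, 0.2]])

    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)

    centered = X - mean                                    # Center
    auto = (X - mean) / sd                                 # Auto (z-scores)
    ranged = (X - mean) / (X.max(axis=0) - X.min(axis=0))  # Range
    pareto = (X - mean) / np.sqrt(sd)                      # Pareto
    vast = ((X - mean) / sd) * (mean / sd)                 # Vast (coefficient of variation)
    level = (X - mean) / mean                              # Level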

4 Hierarchical clustering

Hierarchical clustering seeks to build a hierarchy of clusters by iteratively merging observations in a pair-wise fashion. In other words, initially each observation is assigned to its own cluster, and the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until there is just a single cluster.

This clustering method has two important parameters: the distance measure and the agglomeration method.

4.1 Distance measure

In order to decide which objects/clusters should be combined or divided, we need methods for measuring the similarity between objects. There are many methods to calculate the (dis)similarity information.

The distance measures available in the UniApp are:

  • Euclidean distance: the Euclidean metric is the ordinary straight-line distance between two points in Euclidean space.
  • Maximum: maximum distance between two components of x and y (supremum norm).
  • Manhattan: the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
  • Canberra: it is a weighted version of the Manhattan distance. It is often used for data scattered around the origin, as it is biased for measures around the origin and very sensitive for values close to zero.
  • Binary: the vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
  • Minkowski: a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance. Since p = 2 is used (for now), it is equivalent to the Euclidean distance.

4.2 Agglomeration methods

The agglomeration method takes the distance information and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original dataset are linked together in a hierarchical tree.

The agglomeration methods available in the UniApp are:

  • Complete: the distance between two clusters is the maximum distance between two objects, one from each cluster.
  • Centroid: the distance between two clusters is the distance between their centroids. You should use squared (Euclidean) distances.
  • Single: the distance between two clusters is the minimum distance between any two objects, one from each cluster.
  • Average: the distance between two clusters is the average of all pairwise distances between the members of both clusters.
  • McQuitty: it is the same as Average, but the cluster sizes are disregarded when calculating the average distances. As a consequence, smaller clusters will receive larger weight in the clustering process.
  • Median: the distance between two clusters is the distance between their medians.
  • Ward.D: the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward’s linkage is to minimize the within-cluster sum of squares. You should use squared (Euclidean) distances.
  • Ward.D2: same as Ward.D, but you should use non-squared (Euclidean) distances.
Be careful about how you combine the distance measure and the agglomeration method; some combinations are recommended over others. For example, if you want to use the Centroid agglomeration method, it is recommended to use squared Euclidean distances.
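The sketch below shows how a distance measure and an agglomeration method are typically combined in an agglomerative clustering run. Python/SciPy is assumed for illustration only (note that SciPy's naming differs slightly, e.g. 'weighted' corresponds to McQuitty and 'cityblock' to Manhattan); this is not the UniApp implementation.

    # Agglomerative clustering: pick a distance measure, then a linkage
    # (agglomeration) method, then cut the resulting tree. Illustration only.
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 5))                     # 20 observations, 5 features

    d = pdist(X, metric="euclidean")                 # pairwise distances
    Z = linkage(d, method="average")                 # agglomeration method
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters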

5 Differential analysis

Differential analysis means taking the (normalized) data and performing statistical analysis to discover quantitative changes in expression/abundance levels between experimental groups. Some concepts about how the differential analysis is done in the UniApp are explained in the following subsections.

5.1 Linear models

Linear models can account for fixed effects in the data. The assumption is that the data has been gathered from all the levels (the values) of the factor (the variable) that are of interest. For example, if the objective of an experiment is to compare the effects of three specific dosages of a drug, in this case the dosage is the factor, and the three specific dosages in the experiment are the levels: there is no intent to say anything about the other dosages.
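As an illustration of the dosage example, the sketch below fits a fixed-effect linear model in which the dosage is treated as a factor with three levels. Python/statsmodels and the column names are assumptions made for this example only.

    # Fixed-effect linear model: dosage is a factor with three specific levels.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "expression": [5.1, 4.8, 6.2, 6.0, 7.9, 8.1],
        "dosage": ["low", "low", "medium", "medium", "high", "high"],
    })

    # C(dosage) encodes the factor; the fitted coefficients describe only these
    # three levels, with no claim about other dosages.
    model = smf.ols("expression ~ C(dosage)", data=df).fit()
    print(model.summary())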

5.2 VOOM normalization

VOOM is an acronym for mean-variance modelling at the observational level. VOOM is a normalization specific to bulk RNA-seq data, and it is used to transform the data so that it can be analyzed as if it were microarray data. This normalization estimates the mean-variance trend of the data, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.

VOOM normalization is applied only when bulk RNA-seq data is annotated as Raw before performing the differential analysis (or the marker set analysis).
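The sketch below illustrates the mean-variance idea in a simplified, gene-level form: compute log-CPM values, fit a trend of the square-root standard deviation against the mean, and turn the fitted values into inverse-variance weights. Python/statsmodels and all variable names are assumptions made for illustration; this is not the limma voom implementation, which computes weights per observation from the fitted linear model.

    # Simplified, gene-level sketch of the mean-variance trend idea (NOT voom).
    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    counts = np.random.default_rng(2).poisson(lam=50, size=(1000, 6)).astype(float)

    lib_size = counts.sum(axis=0)                               # per-sample library size
    log_cpm = np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)  # log2 counts per million

    mean_log = log_cpm.mean(axis=1)                 # per-gene mean log-CPM
    sqrt_sd = np.sqrt(log_cpm.std(axis=1, ddof=1))  # per-gene square-root standard deviation

    # Fit the trend and convert the predicted sqrt-sd into precision weights
    # (inverse of the predicted variance, i.e. predicted^-4).
    trend = lowess(sqrt_sd, mean_log, frac=0.5, return_sorted=False)
    weights = trend ** -4.0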

6 Meta analysis

In general, there is no single definition of meta-analysis. One definition could be that meta-analysis is the application of statistical techniques to merge multiple results obtained from individual studies into a single outcome. Different strategies have been proposed to combine the information from multiple analyses in order to identify consistently deregulated features, and in this section our own strategy is described.

6.1 Strategy

We adopted a non-parametric statistic called “Rank combination”. The objective is to find the top commonly upregulated (or downregulated) features across all comparisons. In our method, a feature ranking is performed in each individual dataset, which is based on the differential analysis (or the competitive set enrichment analysis when comparing sets): the ranking is calculated from the most upregulated feature (rank 1) to the most downregulated feature (rank n) (the opposite can be done to find the most commonly downregulated features). The product, mean or sum of ranks (or other metrics) from all datasets can be used to calculate the overall rank.

When the differential analysis is used as the comparison, the log fold change will be used to rank the features. When the competitive set enrichment analysis is used as the comparison, the normalized enrichment score (NES) will be used to rank the sets.

As an example, let’s say we want to find the most commonly upregulated features in group A across two datasets, D1 and D2. In D1 we perform the comparisons (reference group vs experimental group) B vs A and C vs A, and in D2 we perform the same comparisons. These comparisons are performed through a differential analysis, to obtain the log fold change for each feature in each comparison.

Table 1: [Comparison step] The log fold changes for each performed comparison.
Feature   | B vs A (D1) | C vs A (D1) | B vs A (D2) | C vs A (D2)
Feature 1 |  1.5        | -2.0        |  0.1        |  5.0
Feature 2 | -2.0        |  0.9        | -1.1        | -2.0
Feature 3 |  0.7        |  1.5        |  2.0        |  1.0
Feature 4 |  0.4        |  1.7        |  1.0        | -0.5

Once all these comparisons are performed, we rank each feature from the most upregulated feature (rank 1) to the most downregulated feature (rank n). The ranking is done this way because we want to find the most commonly upregulated features.

Table 2: [Ranking step] The ranks for each performed comparison.
Feature   | B vs A (D1) | C vs A (D1) | B vs A (D2) | C vs A (D2)
Feature 1 | 1           | 4           | 3           | 1
Feature 2 | 4           | 3           | 4           | 4
Feature 3 | 2           | 2           | 1           | 2
Feature 4 | 3           | 1           | 2           | 3

Now these ranks must be merged together to calculate the overall rank. This can be done, for example, by multiplying all the ranks together. Once the rank product is calculated, we can find which feature was the most commonly upregulated by checking which feature has the lowest product of ranks. The product of ranks is called the score of the meta-analysis.

Table 3: [Merging step] The score is calculated for each feature.
Feature   | Score (product of ranks) | Final rank
Feature 1 | 12                       | 2
Feature 2 | 192                      | 4
Feature 3 | 8                        | 1
Feature 4 | 18                       | 3

By using this method (product rank), we find that Feature 3 is the most commonly upregulated feature across all comparisons.
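The worked example above can be reproduced with a few lines of code. Python/pandas is assumed purely for illustration; the numbers are the log fold changes from Table 1.

    # Rank-combination sketch reproducing the worked example (illustration only).
    import pandas as pd

    lfc = pd.DataFrame(
        {
            "B vs A (D1)": [1.5, -2.0, 0.7, 0.4],
            "C vs A (D1)": [-2.0, 0.9, 1.5, 1.7],
            "B vs A (D2)": [0.1, -1.1, 2.0, 1.0],
            "C vs A (D2)": [5.0, -2.0, 1.0, -0.5],
        },
        index=["Feature 1", "Feature 2", "Feature 3", "Feature 4"],
    )

    # Rank each comparison from most upregulated (rank 1) to most downregulated.
    ranks = lfc.rank(ascending=False)

    # Merge the ranks with the product; the lowest score is the most commonly
    # upregulated feature (Feature 3, score 8).
    score = ranks.prod(axis=1)
    final_rank = score.rank().astype(int)
    print(score.sort_values())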

6.2 Overall rank calculation

The overall rank calculation can be computed using different methods. These are the methods available in the UniApp:

  • Product: the product of the ranks will be used.
  • Sum: the sum of the ranks will be used.
  • Median: the median of the ranks will be used.

You can try all three rank-based methods (Product, Sum and Median) to check whether the top-ranking features change much across the different methods.

Other methods can be used, in which the score of the meta-analysis is calculated without passing through the ranking step:

  • Fisher: the Fisher’s combined probability test will be used to combine the p-values of the different comparisons together. These combined p-values will be the score of the meta-analysis.
  • Range: the difference of the maximum and minimum original comparison values will be the score of the meta-analysis.
  • Standard deviation: the standard deviation of the original comparison values will be the score of the meta-analysis.
  • Variance: the variance of the original comparison values will be the score of the meta-analysis.
  • Absolute: the sum of the absolute values of the original comparison values will be the score of the meta-analysis.
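As an illustration of the Fisher option, the sketch below combines the p-values of three hypothetical comparisons with Fisher's combined probability test. Python/SciPy and the example p-values are assumptions made for this example only.

    # Combine per-comparison p-values with Fisher's method (illustration only).
    from scipy.stats import combine_pvalues

    p_values = [0.01, 0.20, 0.03]  # one p-value per comparison (hypothetical)
    statistic, combined_p = combine_pvalues(p_values, method="fisher")
    print(combined_p)              # the combined p-value is the meta-analysis score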

7 Cluster similarity

To assess conservation of cell phenotypes, we calculate the similarity of marker feature sets using the pair-wise Jaccard similarity coefficients for all clusters/groups against all other clusters/groups. In addition, for each cluster/group, we define which features are common and specific compared to all the other groups.

7.1 Jaccard similarity score

The Jaccard similarity coefficient is defined as the size of the intersection divided by the size of the union of the two sets:

J(A, B) = |A ∩ B| / |A ∪ B|

where J is the Jaccard index and A and B are two sets of marker features.

This enables us to create a similarity score matrix (which is similar to a correlation matrix):

Table 4: The Jaccard similarity matrix.
Observation | Cluster A | Cluster B | Cluster C
Cluster A   | 1.00      | 0.75      | 0.01
Cluster B   | 0.75      | 1.00      | 0.21
Cluster C   | 0.01      | 0.21      | 1.00

For visualization purposes, we perform principal component analysis (PCA) on the Jaccard similarity matrix. The first two components are used to visually check how similar the clusters/groups are to each other.
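The sketch below computes the pair-wise Jaccard matrix for three hypothetical marker sets and projects it onto its first two principal components. Python/NumPy, the set contents and the use of an SVD-based PCA are assumptions made for illustration only.

    # Pair-wise Jaccard similarity matrix and a simple PCA for visualization.
    import numpy as np

    markers = {
        "Cluster A": {"F1", "F2", "F3", "F4"},
        "Cluster B": {"F1", "F2", "F3", "F5"},
        "Cluster C": {"F3", "F6", "F7"},
    }

    names = list(markers)
    J = np.array([[len(markers[a] & markers[b]) / len(markers[a] | markers[b])
                   for b in names] for a in names])

    # PCA via SVD of the column-centered similarity matrix; the first two
    # components give 2-D coordinates for each cluster.
    centered = J - J.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T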

7.2 Congruent features

For each cluster/group, we define which features are common and specific compared to all the other clusters/groups. This is done simply by counting how many features are common between the same clusters in different datasets/comparisons, and by calculating the median rank for each feature.

First, we compute the marker set for each dataset:

Table 5: The marker set for each cluster in each dataset (dataset 1 is D1, dataset 2 is D2, dataset 3 is D3).
Rank | Cluster A in D1 | Cluster A in D2 | Cluster A in D3
1    | Feature 1       | Feature 2       | Feature 9
2    | Feature 2       | Feature 1       | Feature 10
3    | Feature 5       | Feature 3       | Feature 11
4    | Feature 6       | Feature 4       | Feature 1
5    | Feature 7       | Feature 8       | Feature 12

Then we calculate:

  • The occurrence (in percentage) of each feature for each cluster/group.
  • The median rank for each feature.

For example, in this case, where we consider Cluster A, Feature 1 has a median rank of 2, and it occurs 100% of the time across the three datasets.

The features that have a high occurrence and a low (better) median rank are the features that are specific to the cluster, while the features that have a low occurrence and a high (worse) median rank are the features that are not specific to the cluster.
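The sketch below computes the occurrence percentage and median rank for the Cluster A marker sets from Table 5. Python/pandas is assumed for illustration only; this is not the UniApp implementation.

    # Occurrence and median rank of congruent features (illustration only).
    import pandas as pd

    ranked_markers = {
        "D1": ["Feature 1", "Feature 2", "Feature 5", "Feature 6", "Feature 7"],
        "D2": ["Feature 2", "Feature 1", "Feature 3", "Feature 4", "Feature 8"],
        "D3": ["Feature 9", "Feature 10", "Feature 11", "Feature 1", "Feature 12"],
    }

    # Long table of (dataset, feature, rank) records.
    records = [(dataset, feature, rank + 1)
               for dataset, features in ranked_markers.items()
               for rank, feature in enumerate(features)]
    long = pd.DataFrame(records, columns=["dataset", "feature", "rank"])

    summary = long.groupby("feature").agg(
        occurrence_pct=("dataset", lambda s: 100 * s.nunique() / len(ranked_markers)),
        median_rank=("rank", "median"),
    )
    # Feature 1: occurrence 100%, median rank 2, as in the example above.
    print(summary.sort_values(["occurrence_pct", "median_rank"],
                              ascending=[False, True]))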
