Data pretreatment

Modified on Wed, 13 Mar 2024 at 04:05 PM

The goal of data pretreatment is to clean the data so that it becomes amenable to downstream analyses. This step is mandatory: none of the downstream analyses can be used without it. Pretreatment encompasses not only filtering but also data normalization, which can shape your data in a way that strengthens the biological effects you want to study. Data pretreatment is specific to each type of data and each type of technology. Each type of pretreatment is described in this section, divided by data type and technology.

It is important to note that only data that was annotated as Raw in the Data annotation step of the upload module can be normalized. Data that was annotated as Normalized is presumed to be already normalized, thus there is no need to normalize it again in this module.

TABLE OF CONTENTS

1 Creating a plot
2 Selecting data
3 Setting parameters
4 Performing data pretreatment

1 Creating a plot

As a first step of the analysis, a plot must be created by clicking on the create plot icon in your analysis track. This will lead to a section where the analysis of interest can be selected.

To keep your analyses organized, a name and description must be assigned to the analysis in the appropriate fields. Subsequently, under "Choose algorithm to run your analysis", Data pretreatment must be selected. Then click on the "Select algorithm" button.

2 Selecting data

In the tab "Choose track element", the input analysis can be selected. In the SELECT OBSERVATIONS tab you can choose the observations to use as input. For more information see the section on Observation selection. Note that subsetting at the pretreatment step is a "hard" subset, meaning that cells/samples excluded at this step will not be present in the downstream steps.

Once you have selected the desired options, you can click on the "select track element" tab to proceed. A new field, the set parameters field, will then appear on the right.

3 Setting parameters

In the Data pretreatment Set parameters field, you can specify the settings with which you want to perform the data pretreatment. Note that the available options will vary depending on the type of technology used to generate the data.

3.1 Filtering: Cell filtering and Gene filtering

The settings described here are filtering methods specific to the technology used to generate the input data for the data pretreatment step.

3.1.1 Transcriptomics (Micro-array)

Since the background correction and normalization procedure for micro-array data is platform dependent (Affymetrix, Oligo, Illumina, etc.), you should upload data that has already been background corrected and normalized (e.g., with RMA for Affymetrix micro-arrays). For this reason, there is no special option to pretreat micro-array data, since it should be already pretreated.

3.1.2 Transcriptomics (bulk RNA-seq)

Features in bulk RNA-seq data can be filtered with two methods: observation percentage and average expression.

3.1.2.1 Observation percentage

For bulk RNA-seq data, an option to filter the low-quality features (genes) is provided. To do so, you need to set two parameters:

Feature expression threshold - the counts per million (CPM) value at or above which a feature is considered expressed. By default, it is 0 (no filtering), but in most cases it could be set to 1.
Percentage of samples expressing the feature - the percentage of observations (samples) that need to express a feature to keep that feature in the data. The actual number of observations is also displayed dynamically as you change the percentage threshold.

The filter works as follows: in each observation, a feature whose CPM is below the feature expression threshold is considered not expressed. We then count in how many observations the feature is expressed. If that number is below the chosen percentage of samples, the feature is filtered out from the data.
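The two-parameter filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: the function names, the samples-by-features list layout, and the choice of computing CPM from each sample's total count are all hypothetical.

```python
def counts_per_million(counts):
    """Convert raw counts (one list of feature counts per sample) to CPM."""
    result = []
    for sample in counts:
        total = sum(sample)
        result.append([c * 1_000_000 / total if total else 0.0 for c in sample])
    return result


def filter_by_observation_percentage(counts, cpm_threshold=0.0, min_percent=0.0):
    """Return the indices of the features that pass the filter.

    A feature counts as expressed in a sample when its CPM is not below
    cpm_threshold; it is kept when it is expressed in at least
    min_percent of the samples. The defaults (0 and 0) keep everything.
    """
    cpm = counts_per_million(counts)
    n_samples = len(counts)
    min_samples = min_percent / 100 * n_samples
    kept = []
    for j in range(len(counts[0])):
        n_expressing = sum(1 for i in range(n_samples) if cpm[i][j] >= cpm_threshold)
        if n_expressing >= min_samples:
            kept.append(j)
    return kept


# Hypothetical toy data: 4 samples (rows) x 4 features (columns).
counts = [
    [600_000, 399_998, 2, 0],
    [500_000, 499_999, 1, 0],
    [700_000, 300_000, 0, 0],
    [400_000, 600_000, 0, 0],
]
print(filter_by_observation_percentage(counts, cpm_threshold=1, min_percent=50))  # [0, 1, 2]
print(filter_by_observation_percentage(counts, cpm_threshold=1, min_percent=75))  # [0, 1]
```

In this toy data the third feature reaches 1 CPM in only two of the four samples, so it survives a 50% threshold but is removed at 75%.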

3.1.2.2 Average expression

Here you can provide one parameter to filter out low-quality features (genes) in bulk RNA-seq data:

Feature average expression threshold - all features whose average (mean) expression is below the defined threshold will be filtered out from the data. By default, it is 0 (no filtering), but usually thresholds from 0.001 to 0.01 are used.
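As a minimal sketch of this filter, assuming expression values are stored as one list per sample and are non-negative (the function name and data layout are hypothetical, and the unit of the expression values is not specified here):

```python
from statistics import mean


def filter_by_average_expression(expression, threshold=0.0):
    """Return the indices of features whose mean expression is not below threshold."""
    kept = []
    for j in range(len(expression[0])):
        average = mean(sample[j] for sample in expression)
        if average >= threshold:
            kept.append(j)
    return kept


# Hypothetical toy data: 3 samples (rows) x 3 features (columns).
expression = [
    [5.0, 0.0, 0.05],
    [3.0, 0.0, 0.0],
    [4.0, 0.003, 0.04],
]
print(filter_by_average_expression(expression, threshold=0.01))  # [0, 2]
```

With the default threshold of 0, every feature is kept, which matches the "no filtering" default described above.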

3.1.3 Transcriptomics (scRNA-seq)

scRNA-seq data has two unique options in the data pretreatment Algorithm box: cell filtering and gene filtering.

3.1.3.1 Cell filtering

In cell filtering, we aim to remove low-quality, dead or poorly sequenced cells. To do so, we use the filtering parameters listed below.

The thresholds for cell filtering are dynamically displayed on the data pretreatment density plot, which you can use to fine-tune them.

Minimum number of genes per cells - the minimum number of genes a true cell should express. Use this option to remove poorly sequenced cells.
Maximum number of genes per cells - the maximum number of genes a true cell should express. Use this option to remove potential cell doublets.
Feature average expression threshold - all features whose average (mean) expression is below the defined threshold will be filtered out from the data. By default, it is 0 (no filtering), but usually thresholds from 0.001 to 0.01 are used.
Maximum mitochondrial percentage per cell - the percentage of mitochondrial gene expression is calculated for each cell. Cells above the chosen threshold are considered low quality and are filtered out. Usually, thresholds from 5% to 10% are used. To skip mitochondrial filtering altogether, set the threshold to 100%.

Filtering data is more art than science. Although the default values work fine, there are certain caveats to keep in mind. Mitochondrial percentages from 5 to 10% are often used, but some cells, such as hepatocytes, normally have a high mitochondrial percentage. Normal quiescent cells can have fewer genes per cell than the default minimum of 200, while cells in culture can have more than the usually recommended maximum of 6000 genes per cell. We recommend using lenient filters first and then removing cell clusters that are impossible to annotate or are ambiguous.
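The cell filters above can be illustrated with a short Python sketch. It is not the platform's code: the function name and the dictionary layout are hypothetical, the 10% default is only an example value, and recognizing mitochondrial genes by the "MT-" prefix is an assumption that follows the usual human gene symbol convention.

```python
def filter_cells(cells, min_genes=200, max_genes=6000, max_mito_percent=10.0):
    """Return the ids of the cells that pass all three thresholds.

    cells maps a cell id to a dict of gene symbol -> raw count.
    """
    kept = []
    for cell_id, counts in cells.items():
        # Number of genes detected (non-zero count) in this cell.
        n_genes = sum(1 for c in counts.values() if c > 0)
        total = sum(counts.values())
        # Assumption: mitochondrial genes are those whose symbol starts with "MT-".
        mito = sum(c for g, c in counts.items() if g.upper().startswith("MT-"))
        mito_percent = 100 * mito / total if total else 0.0
        if min_genes <= n_genes <= max_genes and mito_percent <= max_mito_percent:
            kept.append(cell_id)
    return kept


# Hypothetical toy data; thresholds are scaled down to fit the tiny example.
cells = {
    "cell_ok": {"GAPDH": 10, "ACTB": 9, "MT-CO1": 1},
    "cell_few_genes": {"GAPDH": 3, "ACTB": 0, "MT-CO1": 0},
    "cell_doublet": {"GAPDH": 5, "ACTB": 5, "CD3E": 5, "CD19": 5, "MT-CO1": 0},
    "cell_dying": {"GAPDH": 2, "ACTB": 2, "MT-CO1": 6},
}
print(filter_cells(cells, min_genes=2, max_genes=3, max_mito_percent=10.0))
# ['cell_ok']
print(filter_cells(cells, min_genes=2, max_genes=3, max_mito_percent=100.0))
# ['cell_ok', 'cell_dying']
```

Setting the mitochondrial threshold to 100% in the second call disables that filter, so the cell with 60% mitochondrial counts is retained.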

3.1.3.2 Gene filtering

In the gene filtering tab, you can set the parameters that decide which genes will be dropped from the analysis. The rationale is that genes detected in too few cells, or genes that are ubiquitously expressed, might not be informative in the downstream analysis.

Minimum number of cells per gene - set the minimum number of cells that should express a true gene.
Maximum number of cells per gene - set the maximum number of cells that should express a true gene.
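A minimal Python sketch of these two thresholds follows. The function name and data layout are hypothetical, and treating a gene as "expressed" in a cell when its count is above zero is an assumption, not a statement about how the platform counts detection.

```python
def filter_genes(cells, min_cells=0, max_cells=None):
    """Return the sorted gene symbols detected in min_cells..max_cells cells.

    cells maps a cell id to a dict of gene symbol -> raw count.
    max_cells=None means no upper limit.
    """
    cells_per_gene = {}
    for counts in cells.values():
        for gene, count in counts.items():
            detected = 1 if count > 0 else 0
            cells_per_gene[gene] = cells_per_gene.get(gene, 0) + detected
    upper = len(cells) if max_cells is None else max_cells
    return sorted(g for g, n in cells_per_gene.items() if min_cells <= n <= upper)


# Hypothetical toy data: ACTB is in every cell, CD3E in two, RARE in one.
cells = {
    "c1": {"ACTB": 3, "CD3E": 1, "RARE": 0},
    "c2": {"ACTB": 2, "CD3E": 0, "RARE": 0},
    "c3": {"ACTB": 5, "CD3E": 2, "RARE": 1},
    "c4": {"ACTB": 1, "CD3E": 0, "RARE": 0},
}
print(filter_genes(cells, min_cells=2))               # ['ACTB', 'CD3E']
print(filter_genes(cells, min_cells=2, max_cells=3))  # ['CD3E']
```

The minimum removes the rarely detected gene, while the maximum additionally removes the gene found in every cell.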

3.2 Normalization types

Normalization transforms variables so that they are closer to a normal distribution. Certain statistical methods require normally distributed variables in order to give reliable results. For this reason, the variable is transformed so that its information is retained while it becomes suitable for the statistical model. The operation that brings a variable closer to a normal distribution depends on the variable's original distribution. Two operations normalize a variable in this analysis: standard and log.

There are three options available to normalize the data:

Standard: Centers and scales a variable by subtracting the mean and then dividing all the values by the standard deviation. The resulting variable has a mean of 0 and a variance of 1. Note that this rescales the variable but does not by itself change the shape of its distribution.
Log: Brings a variable closer to a normal distribution by applying the log function to it. This is appropriate if the original distribution is log-normal.
None: No transformation will be applied to the data.
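The generic Standard and Log operations listed above can be sketched as follows. This is an illustration only: the function names are hypothetical, the use of the population standard deviation and of a pseudocount of 1 (to avoid taking the log of zero) are assumptions, and the platform's technology-specific normalizations are not reproduced here.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable carries no variation; return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def log2_transform(values, pseudocount=1.0):
    """Apply log2 after adding a pseudocount (assumed here to handle zeros)."""
    return [math.log2(v + pseudocount) for v in values]


print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(log2_transform([0, 1, 3, 7]))
# [0.0, 1.0, 2.0, 3.0]
```

The standardized values have mean 0 and variance 1, while the log transform compresses large values much more strongly than small ones.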

The Standard normalization does the following:

Transcriptomics (Micro-array): log2 transformation (equal to the Log setting).
Transcriptomics (RNA-seq): TMM normalization.
Transcriptomics (scRNA-seq): Seurat normalization.
Proteomics: log2 transformation (equal to the Log setting).

This is a crucial step in the analysis pipeline: be sure to select the right normalization for the data you uploaded.

3.3 Advanced

Here, the feature expression threshold can be entered.

Feature expression threshold - the counts per million (CPM) value at or above which a feature is considered expressed. By default, it is 0 (no filtering), but in most cases it could be set to 1.

4 Performing data pretreatment

Once the pretreatment settings have been set up, click on the Run button to perform the pretreatment. Remember that subsetting at the pretreatment step is a "hard" subset, meaning that cells/samples excluded at this step will not be present in the downstream steps.