Pseudobulk transformation

Modified on Wed, 8 Nov, 2023 at 6:41 AM

Why use the pseudobulk transformation

Single-cell (sc) RNA-seq provides an expression profile for every individual cell within a given biological sample. However, it is sometimes necessary to summarize this information at a higher level, for example by specifying the expression profile of a specific cluster or even of a whole cell type.

The pseudobulk transformation summarizes the expression of each gene at the desired level of aggregation. The user specifies how cells should be grouped (e.g., by Seurat cluster or by cell type), and one expression profile is computed for each group.

There are multiple applications where the pseudobulk transformation can help in the analysis and interpretation of the data. Differential expression analysis on scRNA-seq data can be biased towards a large number of false positive results when multiple biological replicates are present for each experimental group. This is a statistical artifact: with hundreds or thousands of cells for each biological replicate, the total sample size is so high that any statistical test will tend to produce highly significant p-values, even in the absence of actual biological differences. This issue can be addressed by transforming the data into pseudobulk, that is, by obtaining a single expression profile for each biological replicate, and then identifying differentially expressed genes in the transformed data. A large-scale comparison involving both simulated and real data has shown that differential expression analysis computed on pseudobulk data is much less biased towards false positives than the same analysis performed on the original single-cell counts [1].
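As an illustration of the replicate-level aggregation described above, the following is a minimal sketch in Python with pandas. It is not the platform's implementation. The toy counts, the metadata column names `condition` and `replicate`, and the helper `pseudobulk_by_replicate` are all hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy raw counts: rows are cells, columns are genes (hypothetical values).
counts = pd.DataFrame(
    {"GeneA": [1, 2, 0, 4, 3, 1], "GeneB": [0, 1, 1, 0, 2, 2]},
    index=[f"cell{i}" for i in range(1, 7)],
)

# Toy metadata: one row per cell (column names are assumptions).
metadata = pd.DataFrame(
    {
        "condition": ["control"] * 3 + ["treated"] * 3,
        "replicate": ["rep1", "rep1", "rep2", "rep3", "rep3", "rep4"],
    },
    index=counts.index,
)


def pseudobulk_by_replicate(counts, metadata, keys=("condition", "replicate")):
    """Sum raw counts over all cells that share the same metadata keys."""
    missing = [key for key in keys if key not in metadata.columns]
    if missing:
        raise KeyError(f"metadata is missing columns: {missing}")
    # Align the metadata rows to the order of the cells in the count matrix.
    aligned = metadata.loc[counts.index, list(keys)]
    return counts.groupby([aligned[key] for key in keys]).sum()


replicate_profiles = pseudobulk_by_replicate(counts, metadata)
print(replicate_profiles)
```

In this sketch six cells collapse into four profiles, one per biological replicate, so a downstream test would compare two replicates per condition instead of three cells per condition.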

As with differential expression analysis, the pseudobulk transformation can also be used when computing correlations between expression profiles, or simply to simplify the visualization of the data.

The algorithm behind the pseudobulk transformation

The pseudobulk transformation is quite straightforward: for each gene, the number of reads is summed across all the cells belonging to the same group / cell type. As a formula:

$$ Y_{kj} = \sum_{i \in C_k} X_{ij} $$

where X_ij corresponds to the expression of gene j in cell i, Y_kj to the expression of gene j in cell type or cluster k, and C_k denotes the set of cells assigned to cell type or cluster k.
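The summation described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the platform's code: the function name `pseudobulk_sum` and the toy matrix are hypothetical, `X` is assumed to be a dense cells-by-genes matrix of raw counts, and `labels` holds one group label per cell.

```python
import numpy as np


def pseudobulk_sum(X, labels):
    """Return (groups, Y), where Y[k, j] is the sum of X[i, j] over cells i in group k."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    if X.shape[0] != labels.shape[0]:
        raise ValueError("X needs one row per cell and labels one entry per cell")
    # groups: sorted unique labels; group_index[i]: row of Y that cell i contributes to.
    groups, group_index = np.unique(labels, return_inverse=True)
    Y = np.zeros((groups.size, X.shape[1]), dtype=X.dtype)
    # Unbuffered accumulation, so several cells can add into the same group row.
    np.add.at(Y, group_index, X)
    return groups, Y


# Toy example: 4 cells, 2 genes, 2 groups (hypothetical values).
X = np.array([[1, 0], [0, 5], [3, 1], [2, 1]])
labels = ["T", "T", "B", "B"]
groups, Y = pseudobulk_sum(X, labels)
print(groups)
print(Y)
```

Here the rows of `Y` follow the sorted order of the group labels, so the first row holds the summed counts of group "B" and the second those of group "T".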

How to transform scRNA-seq data into pseudobulk

As a first step of the analysis, a plot must be created by clicking on the create plot icon. This will lead you to a section where the analysis of interest (in this case pseudobulk transformation) can be selected.

The next step is to choose the data to be transformed. The pseudobulk transformation is meant to be used on unnormalized counts from a scRNA-seq dataset.

Finally, one categorical column from the metadata matrix must be chosen for grouping the cells. This will usually be the column with cell type information, so that one pseudobulk profile is computed for each cell type. However, the user can choose other categorizations, for example clusters computed by tools like Seurat, experimental conditions, and more.

We are now ready to execute the pseudobulk transformation:

Example of pseudobulk transformation

The following heatmap presents a subset of count data from a lung cancer scRNA-seq experiment [2]. Each row corresponds to one gene and each column to one cell, for a total of 2825 cells. Lighter shades of color correspond to higher expression, darker shades to lower expression. Cells are grouped by cell type, as shown by the colored ribbon on top of the heatmap.

We can now transform these data into pseudobulk, grouping the cells according to their type. The transformation will produce only three expression profiles, one for each cell type.

We can see that the pseudobulk transformation highlights differences across the overall expression profiles of the three cell types that were not immediately recognizable in the previous heatmap.

Using the transformed data in downstream analyses

Once the data are transformed into pseudobulk, the transformed matrix can be used as input for all analyses that can process bulk RNA-seq data. This includes differential gene expression analysis (DGEA), as well as gene set enrichment analysis (GSEA), principal component analysis (PCA), and more.
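As one example of treating a pseudobulk matrix like bulk RNA-seq data, the sketch below applies a common bulk-style normalization, log2 counts per million (CPM), and then computes Pearson correlations between the resulting profiles. This is an assumption-laden illustration rather than a description of what the platform does: the helper `log_cpm`, the group names `type_1` to `type_3`, and the counts are all hypothetical.

```python
import numpy as np
import pandas as pd


def log_cpm(pseudobulk):
    """Scale each profile to counts per million, then apply log2(x + 1)."""
    library_size = pseudobulk.sum(axis=1)
    cpm = pseudobulk.div(library_size, axis=0) * 1e6
    return np.log2(cpm + 1)


# Toy pseudobulk matrix: rows are groups, columns are genes (hypothetical values).
profiles = pd.DataFrame(
    {"GeneA": [50, 10, 45], "GeneB": [30, 80, 35], "GeneC": [20, 10, 20]},
    index=["type_1", "type_2", "type_3"],
)

normalized = log_cpm(profiles)

# DataFrame.corr correlates columns, so transpose to compare profiles (rows).
correlation = normalized.T.corr(method="pearson")
print(correlation.round(3))
```

With only three genes the correlations are not meaningful in themselves; the point is that the pseudobulk matrix has the same groups-by-genes shape that bulk RNA-seq tools expect.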

References

[1] Squair, J.W., Gautier, M., Kathe, C. et al. Confronting false discoveries in single-cell differential expression. Nat Commun 12, 5692 (2021). https://doi.org/10.1038/s41467-021-25960-2

[2] Goveia, J., Rohlenova, K., Taverna, F. et al. An Integrated Gene Expression Landscape Profiling Approach to Identify Lung Tumor Endothelial Cell Heterogeneity and Angiogenic Candidates. Cancer Cell 37(1), 21-36.e13 (2020). https://doi.org/10.1016/j.ccell.2019.12.001