Trajectory inference

Modified on Mon, 11 Dec 2023 at 11:10 AM

Introduction

In many biological processes, cells change continuously over time. For example, a cell can change from one cell type to another, or a young, immature cell can develop into a specialized cell (differentiation). Ideally, we would track the expression levels of each individual cell to see how they change over time. However, cells are destroyed (lysed) when their RNA is extracted for sequencing, which makes it impossible to follow the expression profile of an individual cell over time. Instead, we would need to sample at multiple time points and obtain snapshots of the gene expression profiles of the different cell types.


Taking snapshots at several time points can be time-consuming and expensive. However, the cells captured in a single snapshot are at different stages of their development. This means that the expression trajectories of the cells can be reconstructed from just one snapshot, as long as it contains cells at varying points along the developmental progression. Trajectory inference algorithms are statistical methods that order the cells along one or more trajectories representing the underlying developmental processes. This ordering is referred to as pseudotime.


Differentiation trajectories that can be modelled by pseudotime can be cyclical, linear, bifurcating or multifurcating, or take the form of a tree or of a connected or disconnected graph.

While there are trajectory inference algorithms capable of inferring all of these trajectory types, we propose that each complex trajectory should be deconstructed into multiple linear trajectories for a more comprehensive understanding of the underlying biological processes.


The UniApp enables you to perform linear pseudotime analysis (SCORPIUS), in which the computed trajectory is linear. With this method, you can find out how the cells are ordered over time and see how the expression of certain features (e.g. genes) changes over time.

These pseudotime algorithms always find a differentiation trajectory, even when the trajectory makes no biological sense. Before computing a pseudotime trajectory, consider the biological processes at play in the data, and whether it makes sense for a differentiation trajectory to exist between them.



Scenario: In your scRNA-seq dataset of the bone marrow, you have identified three populations of interest based on canonical marker genes: hematopoietic stem cells, common myeloid progenitor cells and common lymphoid progenitor cells. You have observed a phenotypic continuum on the dimension reduction plot, which hints at a bifurcating differentiation trajectory running from the stem cells to the common myeloid progenitor cells and to the common lymphoid progenitor cells. You would now like to know which genes are driving this differentiation. To answer this question, you can use the trajectory inference module to examine each of the two trajectories separately: first the trajectory from the stem cells to the common myeloid progenitor cells, and then the trajectory from the stem cells to the common lymphoid progenitor cells.



1 Creating a plot 


As a first step of the analysis, create a plot by clicking on the create plot icon in your analysis track. This will take you to the create plot page. First, enter the plot name and fill in the plot template to provide the proper context for the analysis:



Next, choose the "Trajectory inference" algorithm from the "Choose algorithm to run your analysis" menu.


2 Selecting data


 

The trajectory inference algorithm only accepts normalized scRNA-seq data from pretreatment analysis steps as input. In the "Choose a project" menu, you can choose the project from which to select the input. The input data can then be selected from "Choose track element". To confirm your selection, click on the "Select track element" button. Observations can be selected via the "Select observations" button.


The trajectory inference algorithm only accepts normalized scRNA-seq data as input.



3 Setting parameters


In the "Set parameters" field you will be able to define how to perform the pseudotime analysis.  


In the first menu tab, "Feature selection", you can set which features will be used as input by the trajectory inference algorithm. It is usually recommended to use "Highly variable features" rather than "All" features (or to use a custom marker set of features). This way, the pseudotime analysis includes only the most informative features, since most values in single-cell RNA-seq datasets are zeros.
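To give an intuition for what restricting the input to highly variable features means, here is a minimal sketch that ranks features by their variance across cells and keeps the top ones. This is an illustration only: the function name, the toy gene names and the plain variance ranking are assumptions, and the selection method used by the UniApp may differ.

```python
from statistics import pvariance

def top_variable_features(expression, feature_names, n_top):
    """Return the n_top feature names with the highest variance across cells.

    expression: list of rows, one row per cell and one column per feature.
    """
    # Transpose so that each entry holds all cell values for one feature.
    columns = list(zip(*expression))
    variances = [pvariance(col) for col in columns]
    # Rank feature indices by decreasing variance.
    ranked = sorted(range(len(feature_names)),
                    key=lambda i: variances[i], reverse=True)
    return [feature_names[i] for i in ranked[:n_top]]

# Toy data: 4 cells x 3 hypothetical genes; GENE_B is almost always zero.
cells = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 1.0],
    [3.0, 0.0, 9.0],
    [4.0, 0.1, 3.0],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]
selected = top_variable_features(cells, genes, n_top=2)
```

In this toy example the nearly constant GENE_B is dropped, because it carries almost no information about differences between cells.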



Next, you need to define the type of trajectory inference algorithm to use in the "Method" tab. Currently, the only method available is the SCORPIUS method.

In the "Parameters" tab you can define the following:

  • PCA dimensions: the components (dimensions) to calculate during the principal component analysis (PCA) step.
  • K: the k parameter for the k-means clustering step.
  • Stretch factor: a stretch factor for the endpoints of the trajectory curve, allowing the curve to grow to avoid bunching at the end.
  • Random seed: since the algorithm has some stochastic elements, the same random seed must be used to reproduce the same results when using the same parameters.

4 Performing the trajectory inference

Once all parameters are set up, click on the "Run" button to compute the pseudotime analysis results. This can take quite some time, depending on the method used and the size of the data (up to several hours for extremely large datasets).


These algorithms do not infer the direction of the trajectory, only the trajectory itself.

SCORPIUS (linear pseudotime analysis) constructs an initial trajectory by clustering the data with k-means clustering and finds the shortest path through the cluster centers. This initial trajectory is subsequently refined in an iterative way using the principal curves algorithm. The individual cells can then be ordered by projecting the n-dimensional points onto the trajectory.
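The steps described above can be illustrated with a heavily simplified sketch. It is not the SCORPIUS implementation: it uses a basic k-means, a brute-force shortest path through the cluster centers, and a projection of each cell onto the resulting polyline, and it skips the principal curves refinement entirely. All function names and the toy data are hypothetical.

```python
import itertools
import math
import random

def kmeans(points, k, seed, n_iter=20):
    """Plain Lloyd's k-means; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def shortest_path(centers):
    """Brute-force the shortest open path visiting every center once."""
    def length(order):
        return sum(math.dist(a, b) for a, b in zip(order, order[1:]))
    return list(min(itertools.permutations(centers), key=length))

def project(point, path):
    """Arc-length position of the point on the polyline closest to `point`."""
    best_dist, best_pos, walked = math.inf, 0.0, 0.0
    for a, b in zip(path, path[1:]):
        seg = [bj - aj for aj, bj in zip(a, b)]
        seg_len2 = sum(s * s for s in seg)
        if seg_len2 == 0:
            continue  # skip degenerate segments between identical centers
        # Fraction along the segment, clamped to stay between a and b.
        t = sum((pj - aj) * sj for pj, aj, sj in zip(point, a, seg)) / seg_len2
        t = max(0.0, min(1.0, t))
        foot = tuple(aj + t * sj for aj, sj in zip(a, seg))
        d = math.dist(point, foot)
        if d < best_dist:
            best_dist, best_pos = d, walked + t * math.sqrt(seg_len2)
        walked += math.sqrt(seg_len2)
    return best_pos

def pseudotime(points, k=3, seed=1):
    """Order cells along the initial trajectory, rescaled to the range 0..1.

    Assumes the cells do not all project onto the same position.
    """
    path = shortest_path(kmeans(points, k, seed))
    positions = [project(p, path) for p in points]
    lo, hi = min(positions), max(positions)
    return [(x - lo) / (hi - lo) for x in positions]

# Toy data: 12 hypothetical cells lying roughly along a line in a 2-D embedding.
cells = [(float(i), 0.1 * ((i % 3) - 1)) for i in range(12)]
ordering = pseudotime(cells, k=3, seed=1)
```

Two things are worth noticing in this sketch. The ordering may come out in either direction, since nothing in the procedure tells the start of the trajectory from its end. In addition, cells lying beyond the outermost cluster centers all collapse onto the endpoints of the path.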

As soon as the trajectory is computed, an interactive plot will appear in the track. Clicking on "View interactively" will allow you to view the results of the trajectory inference on the interactive plot page.

5 Trajectory inference interactive plot page and settings

The trajectory inference interactive plot page can display three types of plots: trajectory plot, expression line plot and expression dot plot.

5.1 Trajectory plot

The trajectory plot shows you the inferred pseudotime trajectory. Each dot on the plot is a cell and the black line is the inferred trajectory.


Once the trajectory inference result is available, you can color code the trajectory plot by any feature or variable you want. To color code the plot by a metadata variable, you need to select Metadata, and then select the variable you want to visualize. To color code the plot by a feature (e.g. a gene), you need to select Original feature, and then select the feature of interest. 


The trajectory plot can be customized in different ways. Clicking on "Visualization settings" enables you to change the structure of the plot (dot size, grid, etc.), and the color coding of the clusters. 


5.2 Expression plot and smoothed line plot

The expression dot plot and smoothed line plot show you the expression profile of a feature (gene) over the inferred pseudotime trajectory. The x-axis represents the pseudotime and the y-axis the gene expression, while each dot is a cell. Although they plot the same data, the use case and interpretation of these two plots are different:

  • Expression dot plot: This plot is more suited to assess the distribution of individual data points and to identify outliers.
  • Expression line plot: Once the distribution is understood from the dot plot, a smoothed line plot can be employed to visualize trends and patterns more clearly. 


Although most parameters for these plots are the same, each plot also has some plot-specific parameters.


5.2.1 Select input

Both the expression line plot and the expression dot plot select their input in the same way. If you want to visualize a feature, you need to select "Original feature", and then select the feature of interest.


The expression plot can be color coded by a metadata variable. To color code the plot by a metadata variable, you need to select Metadata, and then select the variable you want to visualize.


Numeric metadata variables cannot be used to color code the expression plot.



5.2.2 Visualization settings


The parameters exclusive to the expression dot plot are:


  • Dots size: how big/small the dots representing the original data will be on the plot.


The parameters exclusive to the smoothed line plot are:


  • Color: the color scheme to use to color code the clusters.
  • Smoothed expression line width: how thick the smoothed expression line should be.
  • Show original size: whether or not to show the original (non-smoothed) data in the background.
  • Dots size: if you decide to show the original data, this parameter defines how big/small the dots representing the original data will be on the plot.
  • LOESS regression: explained in detail in section 5.2.3.


The parameters common to both the expression line and expression dot plot are:


  1. Quantilize groups: smooth the groups together to obtain a clearer line plot. More on this in the next subsection.
  2. Invert pseudotime: whether or not to invert the pseudotime axis.
  3. Scale expression (0 to 100): whether or not to scale the expression from 0 to 100 (where 0 is mapped to the original minimum value, and 100 to the maximum original value).
  4. Hide legend: whether to hide the legend or not.
  5. Hide grid: whether to hide the grid or not.
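As an illustration of the "Scale expression (0 to 100)" and "Invert pseudotime" options, the sketch below applies a min-max rescaling to expression values and flips a pseudotime axis. The function names are hypothetical, and the sketch assumes that pseudotime values lie between 0 and 1, which is not stated in this article.

```python
def scale_expression(values):
    """Min-max scale so the original minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat profile has no range to stretch; mapping it to 0 is an assumption.
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]

def invert_pseudotime(pseudotime):
    """Flip the axis so the last cell comes first (assumes values in 0..1)."""
    return [1.0 - t for t in pseudotime]

scaled = scale_expression([2.0, 4.0, 6.0])      # [0.0, 50.0, 100.0]
flipped = invert_pseudotime([0.0, 0.25, 1.0])   # [1.0, 0.75, 0.0]
```

Scaling makes features with very different expression ranges comparable on the same axis, while inverting is useful when the computed ordering runs opposite to the known biological direction.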

5.2.2.1 Quantilize groups

The quantilize groups option is used to smooth the groups together to obtain a smooth line plot. The groups in question are the ones that are being used to color code the plot. This can be useful because, by default, the line plot is not smooth but "fractured".

When using the quantilize groups option, the lower and upper quantiles are computed for each group (based on the settings specified in the slider). The groups are ordered by their lower quantile value (from minimum to maximum), and new group ranges are defined by averaging the upper quantile of one group and the lower quantile of the next group (in succession from the first group to the last). The newly defined group ranges are then used for plotting.

This option should only be used if the groups clearly appear in succession (for example, when group 1 is in the first part of the plot, group 2 in the middle, and group 3 in the last part). If that is not the case (i.e. the groups are randomly distributed), this option must not be used, since it would produce nonsensical results. Remember that this option is mainly meant to improve the visualization of the data: if the data after quantilization shows something completely different from the data before quantilization, this option must not be used.
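The quantilization procedure described in this subsection can be sketched as follows. This is an interpretation under stated assumptions rather than the UniApp implementation: the function names, the default quantiles of 0.25 and 0.75, the linear-interpolation quantile and the way cells are reassigned to the new ranges are all hypothetical.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (pos - low) * (ordered[high] - ordered[low])

def quantilize_groups(pseudotime, labels, lower_q=0.25, upper_q=0.75):
    """Return the groups ordered along pseudotime and the cuts between them."""
    by_group = {}
    for t, g in zip(pseudotime, labels):
        by_group.setdefault(g, []).append(t)
    # Lower and upper quantile of the pseudotime values in each group.
    bounds = {g: (quantile(v, lower_q), quantile(v, upper_q))
              for g, v in by_group.items()}
    # Order the groups by their lower quantile.
    order = sorted(bounds, key=lambda g: bounds[g][0])
    # Each cut averages the upper quantile of one group
    # and the lower quantile of the next group.
    cuts = [(bounds[a][1] + bounds[b][0]) / 2 for a, b in zip(order, order[1:])]
    return order, cuts

def assign(t, order, cuts):
    """Group whose new range contains pseudotime t."""
    for name, cut in zip(order, cuts):
        if t < cut:
            return name
    return order[-1]

# Toy data: group A comes first, but one A cell sits at pseudotime 6.
pseudotime = [0.0, 1.0, 2.0, 3.0, 6.0, 4.0, 5.0, 7.0, 8.0, 9.0]
labels = ["A"] * 5 + ["B"] * 5
order, cuts = quantilize_groups(pseudotime, labels)
new_labels = [assign(t, order, cuts) for t in pseudotime]
```

In this toy example the single cut falls at pseudotime 4, so the stray group A cell at pseudotime 6 is drawn as part of group B. This is the kind of smoothing the option is meant to provide, and it also shows why the result becomes meaningless when the groups do not follow each other along the pseudotime axis.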


5.2.3 LOESS regression

To easily find patterns in the expression data, we use the LOESS regression to smooth the original data into a line. 

LOESS regression is a nonparametric technique that uses locally weighted regression to fit a smooth curve through data points. The procedure originated as LOWESS (LOcally WEighted Scatterplot Smoother). LOESS is based on the idea that any function can be well approximated in a small neighborhood by a low-order polynomial. LOESS can be useful for fitting a line to data points where there are noisy data values and sparse data points, and can reveal trends in data that might be difficult to model with a parametric curve (such as linear regression).


The main idea of LOESS is to iteratively fit a low-degree polynomial to a subset of the data, for each point in the dataset. The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The LOESS fit is completed after the regression function values have been computed for each of the n data points.


The low-degree polynomials fit to each subset of the data are almost always of first (local linear regression) or second degree (local polynomial fits). Using a zero-degree polynomial (Nadaraya-Watson estimator, local constant fitting) turns LOESS into a weighted moving average. Such a simple local model might work well in some situations, but may not always approximate the underlying function well enough.

To decide how the regression should be performed, you can set the following parameters:

  • Regression model: the LOESS regression model to use. You can choose between Local linear regression (linear fit), Local polynomial fit (polynomial fit) and Nadaraya-Watson estimator (local constant fitting).
  • Regression span: it indicates how much data you want to use to perform the local regression at each iteration. 0.75 means that 75% of the data is used at each iteration.
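To make the roles of the regression model and the regression span more concrete, here is a minimal LOESS sketch. It only covers the local constant (degree 0) and local linear (degree 1) models, leaving out the second-degree fit, and it assumes tricube weights, a common choice that this article does not confirm for the UniApp. The function name and its arguments are hypothetical.

```python
import math

def loess(x, y, span=0.75, degree=1):
    """Smooth y over x with local constant (degree 0) or local linear (degree 1) fits."""
    n = len(x)
    # Number of neighbours in each local fit, e.g. 75% of the points for span=0.75.
    k = max(2, math.ceil(span * n))
    fitted = []
    for x0 in x:
        # The k points closest to x0 form the local neighbourhood.
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        radius = max(abs(x[i] - x0) for i in nearest) or 1.0
        # Tricube weights: 1 at x0, falling to 0 at the edge of the neighbourhood.
        w = [(1 - (abs(x[i] - x0) / radius) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        if degree == 0:
            # Local constant fit: a weighted moving average.
            fitted.append(my)
            continue
        # Local linear fit by weighted least squares.
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy data: expression rising linearly along 10 pseudotime positions.
xs = [float(i) for i in range(10)]
line = [2.0 * v + 1.0 for v in xs]
fit_linear = loess(xs, line, span=0.75, degree=1)
fit_constant = loess(xs, [3.0] * 10, span=0.5, degree=0)
```

A larger span uses more neighbours for each local fit and therefore gives a smoother but less detailed curve, while a smaller span follows the data more closely at the cost of more noise.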

6 Useful links
