Data upload

Modified on Wed, 20 Nov at 4:10 PM

TABLE OF CONTENTS

Navigation
1. Dataset name
2. Creating a new experiment and data upload - Analysis technology
3. Upload data matrices
4. Creating a new experiment and data upload - Annotate your files
5. Setting parameters
6. Completing dataset upload
7. Examples of datasets for each technology type

NOTICE: All data is expected to be clear of sensitive information that should not be distributed such as patient names, home addresses, or other (direct/indirect) traceable personal data.

NOTICE: After clicking 'Proceed to data staging', it is prudent to stay by the computer, since the upload process will have to start over from the beginning if the application goes into sleep mode. We are currently working on improving this step and will update our customers once the improvement is implemented.

In this section, we will explain how to start your journey with UniApp. This is one of the most crucial steps of your data analysis, as the correctness of your data upload will dictate how your downstream analyses will go. Double checking and the four eyes principle are warranted. In the first step, the data and metadata of your experiment should be uploaded from your device by clicking on Upload dataset.

This will bring you to a wizard that helps you upload and annotate your dataset.

1. Dataset name

In the first step, you have to provide your dataset name. You can choose any name, but it is a good practice to have a unique and informative name.

2. Creating a new experiment and data upload - Analysis technology

In the second step, the technology used to generate the experiment's data must be stated.

In case you are uploading data that needs to be processed by our consulting team, select 'Upload to Unicle consulting'.

3. Upload data matrices

The data matrix and metadata need to be uploaded in the second column. If you upload data to our consulting team, please upload one compressed/zipped file as this will decrease upload time.

How to compress files on your computer: select all files that you would like to upload, right-click to open the context menu and select 'Compressed (zipped) folder'. This will create a compressed folder with all your selected files. Alternatively, you can use software such as https://www.7-zip.org/ .
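If you prefer to script this step, the same kind of archive can be created with Python's standard zipfile module. The sketch below is only an illustration and is not part of UniApp; the function name and the file names in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_files(paths, archive_path):
    """Pack the given files into one compressed .zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            path = Path(path)
            # Store only the file name so the archive contains no nested folders.
            archive.write(path, arcname=path.name)
    return Path(archive_path)


# Usage with hypothetical file names:
# zip_files(["data_matrix.csv", "metadata.csv"], "upload.zip")
```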

For all other uploads you have to upload exactly two files: data matrix and metadata. But first, ensure that your data matrix and metadata are formatted correctly.

3.1 Basic data matrix format (bulk, scRNAseq, microarray, proteomics)

The data matrix must be in the following form:

Feature	Observation 1	Observation 2	Observation 3	Observation 4
Feature 1	0	2	0	56
Feature 2	20	12	20	0
Feature 3	0	25	31	15
Feature 4	7	32	7	40
Feature 5	6	0	6	0
Feature 6	7	0	7	17

The features and observations should be unique (UniApp will handle duplicates if they occur, but duplicated names are not recommended). Empty and missing values are not allowed, except for metabolomics and proteomics data, where they should be indicated with NA. The data must be uploaded as a .csv or a .txt file; other formats are not supported. The first column of the data is dedicated to the feature IDs, while all the other columns are dedicated to the expression/abundance of each feature in each sample. The features can be your genes, metabolites, or protein IDs (or names), while the observations are your sample names or cell names.

The expression/abundance of each feature must be in plain numeric format (using the scientific notation is not allowed).
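The formatting rules above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library. It assumes a tab-separated file with a header row, and the function name check_data_matrix is hypothetical. It is not the validation that UniApp itself performs.

```python
import csv
import io
import re

# Plain decimal numbers only; scientific notation such as 1e3 does not match.
PLAIN_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def check_data_matrix(text, delimiter="\t", allow_na=False):
    """Return a list of human-readable problems found in a data matrix."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    problems = []

    observations = header[1:]
    if len(set(observations)) != len(observations):
        problems.append("duplicate observation names in header")

    features = [row[0] for row in body]
    if len(set(features)) != len(features):
        problems.append("duplicate feature names in first column")

    for row in body:
        if len(row) != len(header):
            problems.append(f"{row[0]}: expected {len(header)} columns, found {len(row)}")
        for value in row[1:]:
            value = value.strip()
            # NA is only accepted when explicitly allowed (metabolomics, proteomics).
            if value == "NA" and allow_na:
                continue
            if not PLAIN_NUMBER.fullmatch(value):
                problems.append(f"{row[0]}: value {value!r} is not a plain number")
    return problems
```

An empty list means none of these particular checks failed. Set allow_na=True for metabolomics or proteomics data, where NA marks missing values.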

The metadata should be in the following form:

Observation	Condition	Batch
Observation 1	Control	1
Observation 2	Control	2
Observation 3	Treatment	1
Observation 4	Treatment	2

As with the data, the metadata can be uploaded as a .csv or a .txt file. Other formats (such as Excel or raw data files) are not supported. In the example above only the Condition and Batch columns are present; however, your metadata can have many columns (e.g. clinical data). The first column of the metadata is dedicated to the observation names, while all the other columns are dedicated to any relevant information associated with the observations (groups, progression-free survival, clusters, etc.). The more columns containing relevant information, the better.

Check if your data and metadata match before uploading.
The observation names in the metadata must match the observation names in the data file. If the observations do not match, the data file will be used as the ground truth to generate a metadata file that is consistent with the data. This means that observations that are in the metadata but not in the data will be removed, and observations that are in the data but not in the metadata will be added (with empty entries).
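To preview what this rule would do to your own files, the behavior described above can be imitated in a few lines of Python. This is only an illustration of the rule as stated here, not UniApp's actual implementation; the function name and the data structures are assumptions.

```python
def reconcile_metadata(data_observations, metadata_rows, columns):
    """Apply the described rule: the data matrix decides which observations stay.

    data_observations: observation names from the data matrix header.
    metadata_rows: dict mapping observation name -> dict of column values.
    columns: metadata column names to keep in the result.
    """
    in_data = set(data_observations)
    reconciled = {}
    for name in data_observations:
        # Observations missing from the metadata receive empty entries.
        row = metadata_rows.get(name, {})
        reconciled[name] = {column: row.get(column, "") for column in columns}
    # Observations present only in the metadata are dropped.
    removed = [name for name in metadata_rows if name not in in_data]
    return reconciled, removed
```

If removed is not empty, or if any reconciled row consists only of empty strings, the observation names in the two files most likely differ in spelling and should be corrected before uploading.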

After you have selected your data matrices, click upload files.

When working with large files, making a compressed/zipped csv or txt (.csv.zip/.txt.zip) file for upload will result in shorter uploading times.

3.2 10X data (scRNAseq)

10X data can be uploaded directly to the UniApp for single cell RNA data. The following files need to be uploaded to the UniApp:

mtx file
gene file
barcode file
metadata matrix

3.3 GCT

GCT files can be directly uploaded to our system.

3.4 Anndata

Anndata can be uploaded as an .h5ad file.

3.5 Seurat objects

Seurat objects can be uploaded as .rds files.

3.6 Metabolomics

Data for metabolomics experiments look very similar to the basic data uploads, with the inclusion of an HMDBID column that needs to be left empty.

Feature	Observation 1	Observation 2	Observation 3
Feature 1	2	0	56
Feature 2	12	20	0
Feature 3	25	31	15
Feature 4	32	7	40
Feature 5	0	6	0
Feature 6	0	7	17

For the metadata of metabolomics experiments, an extra column named 'Injection order' should be added.

Observation	Injection order	Condition
Observation 1	1	Control
Observation 2	2	Control
Observation 3	3	Treatment
Observation 4	4	Treatment

3.7 Spatial single cell 10X

For the time being, all spatial uploads have to go through our team.

The following files need to be uploaded:

10X file
json file
image

3.8 Gene metadata

The gene metadata is a versatile dataset type designed to store a broad array of gene information. Within this file format, the initial column contains gene identifiers, while subsequent columns are flexible, capable of holding categorical or numerical values.

This type of dataset can be visualized and explored using the gene metadata table algorithm. Additionally, it can be used as input in the rank and rule-based meta-analysis, allowing you to integrate any available gene information with the omic results you have created within the UniApp.

An example of a correctly formatted gene metadata file is shown below:

Gene	Gene type	Disease score	Source
BRCA1	Oncogene	0.84	CancerDatabase
BRCA1	Oncogene	0.90	Gene Expression Atlas
TP53	Tumor suppressor	0.77	Gene Expression Atlas
EGFR	Receptor	0.23	Gene Expression Atlas

Notice that duplicate gene identifiers are allowed. It is possible to handle these duplicates in downstream analyses.
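To see which gene identifiers occur more than once before uploading, the rows can be grouped by the first column. The sketch below uses only the Python standard library and assumes a tab-separated file with a header row. The function name group_by_gene is hypothetical, and the sketch does not reflect how UniApp treats duplicates internally.

```python
import csv
import io
from collections import defaultdict


def group_by_gene(text, delimiter="\t"):
    """Group gene metadata rows by the identifier in the first column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    groups = defaultdict(list)
    for row in reader:
        if row:  # skip blank lines
            groups[row[0]].append(dict(zip(header[1:], row[1:])))
    return dict(groups)


# Identifiers that appear on more than one row:
# duplicates = [gene for gene, rows in group_by_gene(text).items() if len(rows) > 1]
```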

4. Creating a new experiment and data upload - Annotate your files

As the last step, you must annotate the files you uploaded in the previous step. Simply select which file is which from the dropdown menu.

Once you have annotated your files, click on the "Proceed to data staging" button.

5. Setting parameters

In the setting parameters tab you will be able to annotate your data in data staging: specify data matrix type, organism of origin and gene name identifier.

In the Parameters data tab you can more precisely define your data matrix type and provide information that is used by certain downstream algorithms.

Select data matrix type: specify data matrix type, meaning if the data matrix has been previously normalized or not. A data matrix can be either raw or normalized. You can check if your data matrix has been previously normalized by taking a look at the digits in your data matrix. If the numbers in your data matrix are predominantly integers, no normalization was previously performed. This means you should perform data normalization in the subsequent Data pretreatment module. If the numbers are predominantly decimals the matrix is already normalized. In this case you can skip data normalization in the Data pretreatment module.
Select organism: specify the organism from which the data was derived. Current options are: human, mouse or rat. You can identify human data by the use of all capital letters for the features. Murine features are usually written with one capital letter, followed by lowercase letters.
Select gene identifier: select type of gene identifier used in the first column of the data matrix.
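The two rules of thumb above (integers suggest raw data, and capitalization hints at the organism) can be written as small helper functions if you want to inspect a large matrix quickly. This is a rough sketch of those heuristics only. The function names are hypothetical, the majority thresholds are an assumption, and the capitalization check is meaningful only for gene symbols, since other identifier types (for example Ensembl IDs) do not follow this convention. Always confirm the result by looking at the data yourself.

```python
def guess_matrix_type(values):
    """Guess 'raw' when most values are whole numbers, otherwise 'normalized'."""
    numbers = [float(v) for v in values if str(v).strip() != "NA"]
    if not numbers:
        raise ValueError("no numeric values to inspect")
    integer_like = sum(1 for number in numbers if number.is_integer())
    return "raw" if integer_like > len(numbers) / 2 else "normalized"


def guess_organism_style(feature_names):
    """Guess 'human' for ALL-CAPS symbols and 'murine' for Capitalized ones."""
    names = [n for n in feature_names if any(c.isalpha() for c in n)]
    upper = sum(1 for n in names if n == n.upper())
    capitalized = sum(
        1 for n in names
        if n[0].isupper() and n[1:] == n[1:].lower() and n != n.upper()
    )
    if upper > capitalized:
        return "human"
    if capitalized > upper:
        return "murine"
    return "unclear"
```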

MAKE SURE TO DOUBLE CHECK THIS STEP WITH YOUR SUPERVISOR IF NECESSARY. WRONG ANNOTATIONS WILL LEAD TO ERRORS IN YOUR DOWNSTREAM ANALYSIS OR WILL IMPEDE DATA UPLOAD.

6. Completing dataset upload

Click on 'Complete Dataset Upload' to successfully upload your new dataset.

7. Examples of datasets for each technology type

Here you can find a link to an example of a dataset (metadata + data) for each technology type. You can use these examples as templates for your own data, to check whether the upload works in the UniApp, or to compare against your own data files to find out where a possible upload problem lies.