Data upload

Modified on Wed, 20 Nov at 4:10 PM




NOTICE: All data is expected to be free of sensitive information that should not be distributed, such as patient names, home addresses, or other directly or indirectly traceable personal data.


NOTICE: After clicking 'Proceed to data staging', it is prudent to stay by the computer, since the upload process will have to start over from the beginning if the application goes into sleep mode. We are currently working on improving this step and will update our customers when this is implemented.


Navigation 


In this section, we will explain how to start your journey with UniApp. This is one of the most crucial steps of your data analysis, as the correctness of your data upload will dictate how your downstream analyses will go. Double checking and the four eyes principle are warranted. In the first step, the data and metadata of your experiment should be uploaded from your device by clicking on Upload dataset.




This will bring you to a wizard that helps you upload and annotate your dataset.


1. Dataset name

In the first step, you have to provide your dataset name. You can choose any name, but it is good practice to choose a unique and informative one.



2. Creating a new experiment and data upload - Analysis technology

In the second step, the technology used to generate the experiment's data must be stated. 

In case you are uploading data that needs to be processed by our consulting team, select 'Upload to Unicle consulting'.


3. Upload data matrices

The data matrix and metadata need to be uploaded in the second column. If you upload data to our consulting team, please upload one compressed/zipped file as this will decrease upload time. 

How to compress files on your computer: select all files that you would like to upload, right-click to open the context menu, and select 'Compressed (zipped) folder'. This will create a compressed folder containing all your selected files. Alternatively, you can use software such as https://www.7-zip.org/ .
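If you prefer to script the compression step, the following is a minimal sketch using Python's standard zipfile module. The function name zip_files and the file names in the usage note are hypothetical and only serve as an illustration.

```python
import zipfile
from pathlib import Path


def zip_files(paths, archive_path):
    """Bundle the given files into one compressed .zip archive."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            # Store each file under its bare name so the archive has no nested folders.
            archive.write(path, arcname=path.name)
    return archive_path
```

For example, zip_files(["counts.csv", "metadata.csv"], "my_dataset.zip") would create my_dataset.zip next to your script, assuming both input files exist.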

For all other uploads you have to upload exactly two files: data matrix and metadata. But first, ensure that your data matrix and metadata are formatted correctly.



3.1 Basic data matrix format (bulk, scRNAseq, microarray, proteomics)

The data matrix must be in the following form:

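| Feature   | Observation 1 | Observation 2 | Observation 3 | Observation 4 |
|-----------|---------------|---------------|---------------|---------------|
| Feature 1 | 0             | 2             | 0             | 56            |
| Feature 2 | 20            | 12            | 20            | 0             |
| Feature 3 | 0             | 25            | 31            | 15            |
| Feature 4 | 7             | 32            | 7             | 40            |
| Feature 5 | 6             | 0             | 6             | 0             |
| Feature 6 | 7             | 0             | 7             | 17            |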

The features and observations should be unique (UniApp will handle duplicates if they are present, but duplicated names are not recommended). Empty and missing values are not allowed, except for metabolomics and proteomics data, where they must be indicated with NA. The data must be uploaded as a .csv or a .txt file; other formats are not supported. The first column of the data is dedicated to the feature IDs, while all the other columns are dedicated to the expression/abundance of each feature in each sample. The features can be your genes, metabolites, or protein IDs (or names), while the observations are your sample names or cell names.

The expression/abundance of each feature must be in plain numeric format (scientific notation is not allowed).
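The formatting rules for the data matrix can be checked before uploading. The sketch below is one possible pre-upload check, not UniApp's own validation. The function name check_data_matrix is hypothetical, and the comma delimiter is an assumption (pass delimiter="\t" for a tab-separated .txt file). Set allow_na=True for metabolomics and proteomics data.

```python
import csv
import re

# Plain numbers only: optional sign, digits, optional decimal part (no exponent).
PLAIN_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def check_data_matrix(path, delimiter=",", allow_na=False):
    """Return a list of human-readable problems found in a data matrix file."""
    problems = []
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]

    observations = header[1:]
    if len(set(observations)) != len(observations):
        problems.append("duplicate observation names in the header")

    features = [row[0] for row in body if row]
    if len(set(features)) != len(features):
        problems.append("duplicate feature names in the first column")

    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        for value in row[1:]:
            value = value.strip()
            if value == "NA" and allow_na:
                continue
            # Rejects empty cells, text, and scientific notation such as 1e5.
            if not PLAIN_NUMBER.fullmatch(value):
                problems.append(f"line {line_no}: invalid value {value!r}")
    return problems
```

An empty list means that none of the checked rules were violated. Duplicates are reported as well, although UniApp is described above as tolerating them.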

The metadata should be in the following form:

| Observation   | Condition | Batch |
|---------------|-----------|-------|
| Observation 1 | Control   | 1     |
| Observation 2 | Control   | 2     |
| Observation 3 | Treatment | 1     |
| Observation 4 | Treatment | 2     |

As with the data, the metadata can be uploaded as a .csv or a .txt file. Other formats (such as Excel or raw data files) are not supported. The example above contains only the Condition and Batch columns; however, your metadata can have many columns (e.g. clinical data). The first column of the metadata is dedicated to the observation names, while all the other columns are dedicated to any relevant information associated with the observations (groups, progression-free survival, clusters, etc.). The more columns containing relevant information, the better.

Check if your data and metadata match before uploading. 

The observation names in the metadata must match the observation names in the data file. If the observations do not match, the data file will be used as the ground truth to generate a metadata file that is consistent with the data. This means that observations that are in the metadata but not in the data will be removed, and observations that are in the data but not in the metadata will be added (with empty entries).
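To preview which observations would be removed or added, the reconciliation described above can be mimicked locally. This is a sketch of the described behavior only, not UniApp's actual implementation. The function name reconcile_metadata and its argument layout are hypothetical.

```python
def reconcile_metadata(data_observations, metadata_rows, columns):
    """Align metadata to the data matrix, treating the data as ground truth.

    data_observations: observation names from the data matrix header, in order.
    metadata_rows: dict mapping observation name -> dict of metadata values.
    columns: metadata column names (excluding the observation column).
    """
    reconciled = {}
    for name in data_observations:
        # Observations missing from the metadata get empty entries.
        source = metadata_rows.get(name, {})
        reconciled[name] = {column: source.get(column, "") for column in columns}
    # Observations present only in the metadata are never visited above,
    # so they do not appear in the reconciled result.
    dropped = [name for name in metadata_rows if name not in reconciled]
    added = [name for name in data_observations if name not in metadata_rows]
    return reconciled, dropped, added
```

If both the dropped and the added lists are empty, your data and metadata already match.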


After you have selected your data matrices, click upload files.

When working with large files, making a compressed/zipped csv or txt (.csv.zip/.txt.zip) file for upload will result in shorter uploading times. 



3.2 10X data (scRNAseq)


10X data can be uploaded directly to the UniApp for single cell RNA data. The following files need to be uploaded to the UniApp: 


  • mtx file
  • gene file
  • barcode file
  • metadata matrix 
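Before uploading, you may want to confirm that a 10X output folder contains the matrix, gene, and barcode files. The sketch below looks for the file names commonly written by 10x Genomics Cell Ranger. These names, and whether UniApp accepts the gzipped variants, are assumptions, since this article does not state them. The function name find_10x_files is hypothetical, and the metadata matrix is not checked because its file name is up to you.

```python
from pathlib import Path

# File names as commonly written by 10x Genomics Cell Ranger; treated here as
# an assumption, since the names UniApp expects are not stated in this article.
EXPECTED = {
    "mtx file": ["matrix.mtx", "matrix.mtx.gz"],
    "gene file": ["genes.tsv", "features.tsv", "genes.tsv.gz", "features.tsv.gz"],
    "barcode file": ["barcodes.tsv", "barcodes.tsv.gz"],
}


def find_10x_files(folder):
    """Map each required 10X file role to the first matching path, or None."""
    folder = Path(folder)
    found = {}
    for role, candidates in EXPECTED.items():
        matches = [folder / name for name in candidates if (folder / name).is_file()]
        found[role] = matches[0] if matches else None
    return found
```

Any role mapped to None indicates a file that still needs to be located.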


3.3 GCT


GCT files can be directly uploaded to our system.


3.4 Anndata


AnnData objects can be uploaded as an .h5ad file.


3.5 Seurat objects


Seurat objects can be uploaded as .rds files.


3.6 Metabolomics


Data for metabolomics experiments look very similar to the basic data uploads, with the addition of an HMDBID column that must be left empty.


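| Feature   | HMDBID | Observation 1 | Observation 2 | Observation 3 |
|-----------|--------|---------------|---------------|---------------|
| Feature 1 |        | 2             | 0             | 56            |
| Feature 2 |        | 12            | 20            | 0             |
| Feature 3 |        | 25            | 31            | 15            |
| Feature 4 |        | 32            | 7             | 40            |
| Feature 5 |        | 0             | 6             | 0             |
| Feature 6 |        | 0             | 7             | 17            |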


For the metabolomics metadata, an extra column named 'Injection order' should be added.


| Observation   | Injection order | Condition |
|---------------|-----------------|-----------|
| Observation 1 | 1               | Control   |
| Observation 2 | 2               | Control   |
| Observation 3 | 3               | Treatment |
| Observation 4 | 4               | Treatment |
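If your files are already in the basic format, the two metabolomics-specific columns can be added programmatically. The sketch below assumes that each table has been read into a list of rows with the header first. The helper names are hypothetical, and the column positions follow the example tables in this section.

```python
def add_hmdbid_column(rows):
    """Insert an empty HMDBID column right after the feature column.

    rows: list of lists; the first row is the header (Feature, Observation 1, ...).
    """
    header, body = rows[0], rows[1:]
    new_header = [header[0], "HMDBID"] + header[1:]
    # The HMDBID cells are intentionally left empty.
    new_body = [[row[0], ""] + row[1:] for row in body]
    return [new_header] + new_body


def add_injection_order(metadata_rows, order):
    """Insert an 'Injection order' column right after the observation column.

    order: dict mapping observation name -> injection order.
    """
    header, body = metadata_rows[0], metadata_rows[1:]
    new_header = [header[0], "Injection order"] + header[1:]
    new_body = [[row[0], str(order[row[0]])] + row[1:] for row in body]
    return [new_header] + new_body
```

The resulting rows can then be written back to a .csv file, for example with csv.writer from the Python standard library.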

3.7 Spatial single cell 10X

For the time being, all spatial uploads have to go through our team.


The following files need to be uploaded: 

  • 10X file
  • json file
  • image

3.8 Gene metadata

The gene metadata is a versatile dataset type designed to store a broad array of gene information. Within this file format, the initial column contains gene identifiers, while subsequent columns are flexible, capable of holding categorical or numerical values. 


This dataset type can be visualized and explored using the gene metadata table algorithm. Additionally, it can be used as input in the rank and rule-based meta-analysis, allowing you to integrate any available gene information with the omics results you have created within the UniApp.


An example of a correctly formatted gene metadata file is shown below:


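| Gene  | Gene type        | Disease score | Source                |
|-------|------------------|---------------|-----------------------|
| BRCA1 | Oncogene         | 0.84          | CancerDatabase        |
| BRCA1 | Oncogene         | 0.90          | Gene Expression Atlas |
| TP53  | Tumor suppressor | 0.77          | Gene Expression Atlas |
| EGFR  | Receptor         | 0.23          | Gene Expression Atlas |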

Notice that duplicate gene identifiers are allowed. It is possible to handle these duplicates in downstream analyses.
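If you want to see which gene identifiers occur more than once before uploading, a small grouping step is enough. This is a minimal sketch that assumes the file has been read into a list of rows with the header first. The function name group_by_gene is hypothetical.

```python
from collections import defaultdict


def group_by_gene(rows):
    """Group gene metadata rows by the identifier in the first column."""
    header, body = rows[0], rows[1:]
    groups = defaultdict(list)
    for row in body:
        # Keep every entry, so duplicated identifiers collect several records.
        groups[row[0]].append(dict(zip(header[1:], row[1:])))
    return dict(groups)
```

Identifiers whose group holds more than one record are the duplicated ones.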




4. Creating a new experiment and data upload - Annotate your files 

As the last step, you must annotate the files you uploaded in the previous step. Simply select which file is which from the dropdown menu. 



Once you have annotated your files, click on the "Proceed to data staging" button.

5. Setting parameters

In the Setting parameters tab, you can annotate your data in data staging by specifying the data matrix type, the organism of origin, and the gene name identifier.




In the Parameters data tab, you can define your data matrix type more precisely and provide information that is used by certain downstream algorithms.


  • Select data matrix type: specify whether the data matrix is raw or has previously been normalized. You can check this by looking at the values in your data matrix. If the numbers are predominantly integers, no normalization was previously performed, and you should normalize the data in the subsequent Data pretreatment module. If the numbers are predominantly decimals, the matrix is already normalized, and you can skip data normalization in the Data pretreatment module.
  • Select organism: specify the organism from which the data was derived. Current options are human, mouse, or rat. Human data can usually be identified by feature names written in all capital letters. Murine feature names usually start with one capital letter followed by lowercase letters.
  • Select gene identifier: select the type of gene identifier used in the first column of the data matrix.
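The two rules of thumb above (integers versus decimals, and the capitalization of feature names) can be sketched as simple heuristics. The function names and the 0.5 cut-off used for "predominantly" are assumptions made for this illustration. The capitalization check only makes sense for gene symbols, not for identifiers such as Ensembl IDs. Treat the result as a hint and confirm it against how your data was generated.

```python
def guess_matrix_type(values, threshold=0.5):
    """Guess 'raw' when most values are whole numbers, else 'normalized'."""
    numbers = [float(v) for v in values if v != "NA"]
    if not numbers:
        raise ValueError("no numeric values to inspect")
    integer_share = sum(n.is_integer() for n in numbers) / len(numbers)
    return "raw" if integer_share > threshold else "normalized"


def guess_organism(symbols, threshold=0.5):
    """Guess 'human' for mostly all-uppercase gene symbols, 'murine' for
    mostly capitalized ones (such as 'Trp53'), and 'unknown' otherwise."""
    symbols = [s for s in symbols if s]
    if not symbols:
        raise ValueError("no gene symbols to inspect")
    upper_share = sum(s.isupper() for s in symbols) / len(symbols)
    capitalized_share = sum(
        s[0].isupper() and s[1:].islower() for s in symbols
    ) / len(symbols)
    if upper_share > threshold:
        return "human"
    if capitalized_share > threshold:
        return "murine"
    return "unknown"
```

For instance, the symbols BRCA1, TP53, and EGFR would be guessed as human, while Brca1, Trp53, and Egfr would be guessed as murine.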

MAKE SURE TO DOUBLE CHECK THIS STEP WITH YOUR SUPERVISOR IF NECESSARY. WRONG ANNOTATIONS WILL LEAD TO ERRORS IN YOUR DOWNSTREAM ANALYSIS OR WILL IMPEDE DATA UPLOAD.

6. Completing dataset upload


Click on 'Complete Dataset Upload' to successfully upload your new dataset.

7. Examples of datasets for each technology type


Here you can find a link to an example dataset (metadata + data) for each technology type. You can use these examples as a template to model your own data on, to check whether the upload functions in the UniApp, or to compare against your own data files to investigate where a possible problem with the upload might lie in your data.


7.1 Bulk RNA seq

https://drive.google.com/drive/folders/1Tm70Zv-DqsghuLAQAa-g28VRlGqAra_L

7.2 Microarray gene expression

https://drive.google.com/drive/folders/1ois5--P5fZ5HDiouY-DWgk1Q1gK19bDs

7.3 Gene expression spatial data

Expected soon.

7.4 Single cell RNA seq

https://drive.google.com/drive/folders/1t3eM3YxVm6NoGr3GBTvlAh53xUoBih2k

7.5 Proteomics

https://drive.google.com/drive/folders/13Ix4-0ZWUnC_u3GHmYfbP-sVTTsN5_9l

7.6 Metabolomics

https://drive.google.com/drive/folders/137CahatOugkvx1wd5MngG3hOP7cv3AC8


