Skip to content

Spatial mapping of cell types by integration of transcriptomics data

License

Notifications You must be signed in to change notification settings

almaan/stereoscope

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spatial mapping of cell types by integration of transcriptomics data

This repository contains the Python package stereoscope, the implementation of the method presented in the paper "Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography". In addition, scripts used to: preprocess, visualize, and compare results presented in the paper are also included.

Below examples illustrating how to use stereoscope can be found. While some of these examples are cast as a guides to reproduce the results presented in the paper, they can easily be generalized and applied to any data set. The three examples are:

  1. Reproducing the mouse brain analysis: Conducting the complete analysis of the mouse brain presented in the paper, from downloading data to visualization of the results
  2. Reproducing the method comparison: Generation of synthetic data, running all three methods and comparing them
  3. Using pre-estimated parameters : Examining four additional breast cancer samples, applying already estimated single cell parameters to set of ST data.

Installing stereoscope

In order to install stereoscope, we begin by cloning this very repository. Open your terminal and go to your desired download directory - then enter the following:

foo@bar:~$ git clone https://github.com/almaan/stereoscope 

Now to actually install stereoscope, just use the provided setup.py file. If you are working with conda or virtual environments, this is where you should create and source a new environment.

the following commands

foo@bar:~$ cd stereoscope
foo@bar:~$ ./setup.py install

NOTE: on some systems you may need to add the --user flag to the install module, resulting in ./setup.py install --user.

Assuming the installation was successful, you should now have access to both the stereoscope python package (stsc) and the stereoscope command line interface (CLI) meaning you can conduct the analysis from the terminal. To confirm that the installation was successful, we will run two tests, which are:

foo@bar:~$ python -c "import stsc; print(stsc.__version__)"
stereoscope : 0.2.0
foo@bar:~$ stereoscope test
successfully installed stereoscope CLI

If you cannot access stereoscope from the command line, and receive a message stating something similar to:

foo@bar:~$ stereoscope test
bash stereoscope: command not found..

This means that your install location is not a part of the PATH variable. Including the --user flag during installation should place your packages in your ~/.local directory. Thus, this issue may be resolved by entering the following into your terminal:

foo@bar:~$ export PATH=$PATH:/user/home/.local/bin

Having installed stereoscope, we are now ready to start with the analysis.


Reproducing The Mouse Brain Analysis

Let us begin by reproducing the results presented for the mouse brain (Hippocampal region), to briefly remind you what that means; we will analyze two (1k) ST-sections of the mouse brain (jointly) and map single cell data from the resource mousebrain.org onto it. Here we will go through the whole workflow including downloading the data, preprocessing, analysis and visualization. If you aren't that interested in the first parts and want to get started with the analysis, you can skip step 1-2 and use the already processed data found in the data/mousebrain folder. Simply unzip these files into a folder named data/curated.

1. Downloading the data

As mentioned above, the data we used in the paper is downloaded from mousebrain.org. More specifically, we used the loom-file containing cells from the Hippocampus region. If you do not have loompy installed, we refer you to this link for further instructions. You can either download the data set via the web browser, or the terminal using :

foo@bar:~$ cd data
foo@bar:~$ mkdir raw curated
foo@bar:~$ cd raw
foo@bar:~$ curl -O https://storage.googleapis.com/linnarsson-lab-loom/l1_hippocampus.loom  

2. Prepare Data

2.1 Subsample Single Cell Data

We will subsample the single cell data, this is not a requirement, but it allows us to run the analysis a bit faster. First we create a modified loom-file where a new set of labels is introduced, formed by joining the Cluster and Class labels together, this adds (some) context to our labels compared to just using cluster indices. Enter the following in the terminal:

foo@bar:~$ ../../preprocess/hippocampus/create-mod-hippo-loom.py l1_hippocampus.loom .
successfully created modifed loom-file >> mod_l1_hippocampus.loom

Next, we subsample our data set. We use a lower and upper bound (See Methods) of 25 repsectively 250 cells per cell type. Do this by entering:

foo@bar:~$ ../../preprocess/hippocampus/subsample-data.py -lf mod_l1_hippocampus.loom -o ../curated -lb 25 -ub 250 -cn
"Celltype_1"

Unique Identifier for set >> 20191029082957812352            
Astrocytes_13 | was discarded due to insufficient number of cells
Astrocytes_14 | Used 250 cells                               
Astrocytes_38 | was discarded due to insufficient number of cells
Astrocytes_39 | was discarded due to insufficient number of cells
Astrocytes_40 | Used 250 cells                               
Astrocytes_41 | Used 31 cells
Astrocytes_42 | Used 250 cells
Astrocytes_44 | Used 41 cells
....

This will create three files in the data/curated folder - a count matrix of the cells included in the set, a meta-file containing their respective labels and a ''stats-file'' which displays the composition of the set. All files belonging to the same set are marked with a unique identifier which is time and date-based (here : 20191029082957812352), your identifier will therefore be different from the one above. As you may notice, not all ''cell types'' are listed; since we specified a lower bound, all types with fewer than 25 members (cells) are excluded.

2.2 ST-data

We have included the two mouse brain sections presented in the paper as .tsv files in the repo. You do not need to preprocess these, but we have zipped them to reduce the file sizes. Unzip these files and place them in the data/curated folder, you could do this either interactively or for example by entering the following command into the terminal :

ST-data password : zNLXkYk3Q9znUseS

foo@bar:~$ 7z e mouse-st-data.zip

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21

Scanning the drive for archives:
1 file, 7254353 bytes (7085 KiB)

Extracting archive: mouse-st-data.zip
--
Path = mouse-st-data.zip
Type = zip
Physical Size = 7254353

    
    Enter password (will not be echoed):
....

foo@bar:~$ mv st-hippo*tsv curated/

If all steps were successfull, this (below) should be representative of the content in your data/curated folder

foo@bar:~$ ls -1 curated/*
20191029082957812352.cnt_data.tsv
20191029082957812352.mta_data.tsv
20191029082957812352.stats.tsv
st-hippo1.tsv
st-hippo2.tsv

2.3 Data Formatting

The output from the subsampling and the ST-data we provided you with are given in the correct formats, hence you will not have to do any additional work. However, we will briefly describe what type of files are required and how they should be formatted in order to run stereoscope :

  • Spatial Transcriptomics (ST) Count Data
    • Alt1 - a .tsv file with spots as rows and genes as columns
    • Alt2 - a .h5ad file with capture locations as observations (rows) and genes as variables (columns)
  • Single Cell Count Data File
    • Alt1 - a .tsv file with cells as observations (rows) and genes as variables (columns), each cell (row) should have a unique label
    • Alt2 - a .h5ad file with cells as rows and genes as columns. Cell type annotations can be read from this file as well, and should then be put in the .obs slot. See Alt2 below for more information.
  • Single Cell Annotation Data
    • Alt1 - a .tsv file with the same rownames as the count data file, either with one single column listing the annotations, or multiple columns where the column containing the labels should be named 'bio_celltype**.
    • Alt2 - use labels from .h5ad file. Make sure your labels are located in the .obs slot and then use the flat --label_colname KEY to indicate what the key to access these labels is (i.e., KEY).

NOTE : The h5ad support is still in beta. If you are experiencing any problems, we'd be tremendously happy if you'd let us know by posting an issue!

Some additional things to keep in mind are:

  • Make sure that your ST and single cell data use the same gene identifiers. For example, if one set uses ENSEMBL ids while the other one uses HGNC gene symbols, this will raise an error. However as long as the ids match, stereoscope is agnostic to the type of identifiers used.
  • Do not normalize your data - the model relies on using raw count data, your gene counts should thus always be integer numbers.
  • When using h5ad files, the index values of the var and obs slots will be used as gene (var.index) and capture location (obs.index) identifiers.

3. Analysis

3.1 Run the analysis

Now when the data has been prepared, we are ready to use `steroscope**. We will run the complete analysis, estimating rates and logits from the single cell data and then use these to infer the proportion values in our spatial data. We use the following specs:

parameter values
number of genes 5000
sc epochs 75000
sc batch size 100
st epochs 75000
st batch size 100
learning rate 0.01
gpu True

NOTE : In the original manuscript we used the top 5000 (w.r.t. expression levels) genes to show that stereoscope produces good results without any need for pre-processing. However, for optimal results we would recommend a more sophisticated selection of genes; one suggestion is to use the function highly_variable_genes from scanpy's pre-processing module (see documentation). Once a set of genes has been extracted, these can be provided (as a .txt file) to stereoscope using the -gl flag. In the next update, this will be introduced as an option in the run module.

To run the analysis enter the following into your terminal :

foo@bar:~$ cd ../../res
foo@bar:~$ stereoscope run --sc_cnt ../data/curated/*cnt*.tsv --sc_labels ../data/curated/*mta*.tsv -sce 75000  -o hippo_1 -n 5000 --st_cnt ../data/curated/st-hippo*tsv -ste 75000 --gpu -stb 100 -scb 100
[2019-10-29 09:07:50,891 - stsc - INFO ] >> Using device cuda
[2019-10-29 09:07:50,891 - stsc - INFO ] >> fitting sc data | count file : ../data/curated/20191029082957812352.cnt_data.tsv | labels file : ../data/curated/20191029082957812352.mta_data.tsv                                                                                 
[2019-10-29 09:09:51,527 - stsc - INFO ] >> SC data GENES : 5000  SC data CELLS : 8449  SC data TYPES : 56

Epoch : 211  /75000 | Loss : 3.004263E 07 | [                     ]

This will create a subfolder named hippo_1 in the res folder, where all results, logs and data related to this analysis will eventually be located.

For more information regarding which arguments and configurations you can make to your analysis use:

foo@bar:~$ stereoscope run -h

3.2 Monitoring progress

Even with GPU resources the analysis will take some time to complete. While the progress bar shows the current status - it's also of interest to put this into context and track the progress over time. We can do so by using the progress module accordingly:

foo@bar:~$ stereoscope progress -lf hippo_1/sc_loss*txt & disown

This will open up a interactive window, where you can zoom and move around. The plot will be updated every 10 seconds, allowing you to see how the loss changes over time. Appending & disown to the command, allows us to keep using the same terminal window. Below is an example of what the progress-plot may look like.

alt text

4. Inspection of results

4.1 Orienting the results

Upon completion of the analysis. The folder res/hippo_1 folder should contain the following set of files:

foo@bar:~$ ls -1
logits.2019-10-29090750.880065.tsv
R.2019-10-29090750.880065.tsv
sc_loss.2019-10-29090750.880065.txt
sc_model.2019-10-29090750.880065.pt
st-hippo1
st-hippo2
st_loss.2019-10-29090750.880065.txt
st_model.2019-10-29090750.880065.pt
stsc.2019-10-29090750.880065.log

With the subfolder content being:

foo@bar:~$ ls st-hippo*/ 
st-hippo1/:
W.2019-10-29090750.880065.tsv

st-hippo2/:
W.2019-10-29090750.880065.tsv

The W-files contain the proportion estimates, and thus what we are mainly interested in. Every section has its own ouput folder, while the W-files share the same name. These files are given in a matrix format n_spots x n_types where each element represents the proportion of a cell type within a specific spot. To further illustrate :

foo@bar:~$ head st-hippo1/W*tsv -n 5 | cut -f 1-5 | column -t
               Astrocytes_14  Astrocytes_40  Astrocytes_41  Astrocytes_42  
4.83x31.08     0.017654551    0.019365933    0.021137744    0.019530216
19.98x24.93    0.020201972    0.017161706    0.016906356    0.017648073
15.87x9.01     0.017025681    0.017077405    0.016912553    0.016671246
5.83x27.97     0.019220736    0.018679986    0.017026993    0.017045338

We could justify the usage of a "threshold", for example setting all proportions lower than a certain value to zero. We will not to adjust our results and keep all values as they are.

4.2 Visualization

We include a tool (the look module) for seamless visualization of the results generated by stereoscope. To use look do:

foo@bar:~$ stereoscope look -pp st-hippo*/W*tsv -o viz -sc i -sb s -nc 7 -c "umap" -g -ms 40

This will generate two types of images, saved to the folder viz

  1. Separate visualizations : spots are plotted according to their array coordinates, and the intensity of their blue facecolor corresponds to the proportion value of each celltype. All of these plots are scaled internally (this highlights the spatial patterns, but is not suitable for comparisons), changing the argument -sc i to -sc s will scale all values within each section.

  2. Joint visualizations : These are the type of "compressed" images described in the paper's Method. This type of visualization is found in the Supplementary. Regions of similar colors have a similar cell type compositions. The method used for the dimensionality reduction is umap, but you can also choose between pca (slightly faster) and tsne (much slower).

Below are some examples of the generated images:

alt text alt text

You can customize the output of stereoscope look in multiple ways by providing certain arguments, for more information run:

foo@bar:~$ stereoscope look -h

One could also overlay the spots on the tissue image, to see how the spatial patterns relates to the morphology, like we did in our figures. This type of visualization is not a part of the stereoscope package, but we do provide scripts for this.

The material that you need for this would be:

  • HE-image - image taken of the tissue, can be scaled but not cropped
  • Transformation Matrix - matrix that maps spot coordinates to pixel coordinates
  • Proportion Estimates - output from stereoscope analysis
  • Mask (optional)- mask to indicate which parts of the area to be included (transparent) resepectively excluded (black)

We actually use resized images (30% downscaled), since the original images are unnecessarily large.

Still being in the res folder, all you have to do is run:

foo@bar:~$ ./map2he.py -i ../data/mouse/rsc/st-hippo1.jpg -t ..data/mouse/rsc/st-hippo1-tmat.txt -p st-hippo1/W*tsv -sf 0.3 -si -o he_overlay

using image file ../data/mouse/rsc/st-hippo1.jpg
using proportions file st-hippo1/W.2019-10-29090750.880065.tsv
Rendering type Astrocytes_14
Rendering type Astrocytes_40
....

Resulting in images like these (we rotated these images in our paper):

alt text

Reproducing the Method Comparion

In the paper we compare stereoscope with two other methods DWLS and deconvSeq using synthetic data. This comparison is something we will reproduce in this section.

1. Generating Synthetic Data

Begin by unzipping the file data/comp/comp-data.zip into the data/comp/ folder

foo@bar:~$ cd data/comp
foo@bar:~$ unzip comp-data.zip 
foo@bar:~$ ls -1

comp-data.zip

real:
hippo-real-sc-cnt.tsv
hippo-real-sc-mta.tsv

synthetic:
counts.st-hippo-comp.tsv
generation.20190908194059502947.cnt_data.tsv
generation.20190908194059502947.mta_data.tsv
members.st-hippo-comp.tsv
proportions.st-hippo-comp.tsv
validation.20190908194059502947.cnt_data.tsv
validation.20190908194059502947.mta_data.tsv

As you can see, there is already a set of synthetic data available; you can use this data if you'd like to, if so, then move ahead to step 2. We will however go through the full procedure for generation of these files.

Begin by removing the prepared data, to make sure we don't mix thing up :

foo@bar:~$ rm synthetic/*

We first split our single cell data into a "generation" and ''validation'' set, preferably of equal size. The single cell data is the same Hippocampus set we used in the previous example. The only difference is that we subsampled w.r.t. to Subclass labels rather than Cluster labels.

foo@bar:~$ ../../comparison/synthetic_data_generation/make_sc_sets.py real/hippo-real-sc-cnt.tsv real/hippo-real-sc-mta.tsv synthetic

You should now have the following four files in your synthetic folder :

foo@bar:~$ ls -1 synthetic/
generation.20190908194059502947.cnt_data.tsv
generation.20190908194059502947.mta_data.tsv
validation.20190908194059502947.cnt_data.tsv
validation.20190908194059502947.mta_data.tsv

We will use the "generation" set to generate synthetic ST data. As in the paper we use 1000 spots and 500 genes. To easily keep track of our files we tag them with st-hippo-comp. The validation set is set aside, to later be used as single cell data input to each respective method upon deconvolution.

foo@bar:~$ ../../comparison/synthetic_data_generation/make_st_set.py -c synthetic/generation.20190908194059502947.cnt_data.tsv -l synthetic/generation.20190908194059502947.mta_data.tsv -ns 1000 -ng 500 -o synthetic -t st-hippo-comp

A total of three files will be generated - ST expression data (counts), proportion values (proportions) and the number of cells (members). Since we know the actual proportions (ground truth) of cell types within each spot, this data is suitable to use for comparison of method performance.

foo@bar:~$ ls -1 synthetic/
counts.st-hippo-comp.tsv
generation.20190908194059502947.cnt_data.tsv
generation.20190908194059502947.mta_data.tsv
members.st-hippo-comp.tsv
proportions.st-hippo-comp.tsv
validation.20190908194059502947.cnt_data.tsv
validation.20190908194059502947.mta_data.tsv

2. Running stereoscope

We will run stereoscope with the following arguments (see previous section for more details regarding the stereoscope interface):

foo@bar:~$ stereoscope run --sc_cnt synthetic/valid*cnt* --sc_labels synthetic/valid*mta* -scb 256 \ 
-sce 50000 --st_cnt synthetic/counts.st-hippo-comp.tsv -ste 50000 -o ../res/comp-stereoscope \
--gpu -lr 0.1

Where the results will be saved to comp-stereoscope in the res folder

3. Running other methods

Since none of the other two methods were designed with ST data in mind, we have written wrappers for them. This allows ST data to be used as input and render similar output files to stereoscope.

We'd also like to state that it is not our intent to criticize these methods, but rather show that they are not optimal for usage with ST-data.

Disclaimer: While we've tried to add checks for dependencies and automatic installation of the required R packages for these two methods; deconvSeq and DWLS are not our tools and hence package dependencies may become a problem - we refer to respective method's github page for further help in addressing such issues.

DWLS

Small modifications were made to the DWLS code; some of the provided functions crashed upon execution, these changes were however minor and do not affect the method itself. We provide a ''modded'' version of their file Deconvolution_functions.R. The two modifications we made are marked with the tag #MODIFICATION within the code. Using Seurat for extraction of DE-genes was not successfull, thus we used MAST for this purpose. Negative values were occasionally given for some proportions, all of these were set to zero.

To run DWLS go to the main directory of this repo and enter:

foo@bar:~$ comparison/alternative_methods/DWLS/DWLS-implementation.R -wd res/comp-DWLS \
-sc data/comp/synthetic/validation*cnt_data.tsv -mt data/comp/synthetic/validation.*.mta_data.tsv \
-st data/comp/synthetic/counts*.tsv 

This will generate an output file formatted like that of stereoscope, that is: a matrix with dimensions n_spots x n_types, where each element represents the proportion of a specific type within a given spot.

deconvSeq

We followed the instructions given in the HTML-Vignette for deconvSeq in order to estimate the proportions - no cell cycle filtering was performed.

To run deconvSeq go to the main directory of the repo and enter:

foo@bar:~$ comparison/alternative_methods/deconvSeq/deonvSeq-implementation.R \
 -sc data/comp/synthetic/validation.*.cnt_data.tsv \
 -mt data/comp/synthetic/validation.*.mta_data.tsv \ 
 -st data/comp/synthetic/counts.*.tsv -o res/comp-deconvSeq

NOTE : We noticed that the bioMart-dependent functions in deconvSeq do not always complete successfully, if such an error arise wait a minute and try again.

4. Comparing performance

Having estimated the proportions using all three methods, we can now compare them. As a performance metric we compute the RMSE between the actual proportion values within each spot and the estimated values. The results are visualizaed in a boxplot, and a one-sided test (Wilcoxon signed-rank test) is also conducted to see whether stereoscope performs better than the other methods.

foo@bar:~$ comparison/compare.py -rf  res/comp-stereoscope/*/W*tsv  res/deconvSeq-proportions.tsv res/comp-DWLS/proportions.tsv \
 -tf data/comp/synthetic/proportions.hippo.tsv -o res/comp -mn stereoscope deconvseq DWLS

Which will generate the following image:

alt text

The dashed line represents the mean of the average performance upon randomly sampling propotrions from a homogenous Dirichlet distribution (concentration 1), repeated 1000 times.

Using pre-estimated parameters

Occasionaly you might face a scenario where you have run stereoscope in a previous analysis and now want to use the same single cell data set but different (or more) ST-data. The way we have implemented our method and built stereoscope allows you to this in a seamless manner.

To exemplify, imagine you've been working with lung cancer ST and single cell data, and mapped the latter onto the former. Now assume your collaborators tell you that they have a nice set of breast cancer data that they want you to analyze for them. While a breast cancer data set would be ideal to work with, one initial analysis you could do is to see how the lung cancer single cell data maps onto the breast cancer spatial data - potenitally removing some of the lung-specific types.

We have downloaded and curated the data set presented in the publication _"Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species", meaning we removed cell types which we considered as lung-specific such as club cells and ciliated cells. For more infomration regarding the data set see Rapolas. et. al.. We have also estimated single cell parameters for this data, which we provide you with (to emulate the scenario described above).

As for the ST data - in the original ST publication 4 Breast Cancer sections were presented - we will analyze theses breast cancer samples using the Lung Cancer single cell data set. Fortunately half of the work is already done.

alt text

Above : parts of Figure 4 in the original ST publication, displaying some of the breast cancer tissue.

1. Downloading data

The data is available at this link, download the 4 breast cancer samples and put them in data/add-bc either interactively (make sure to rename them as below) or using the terminal :

foo@bar:~$ cd data
foo@bar:~$ mkdir add-bc
foo@bar:~$ cd add-bc
foo@bar:~$ for ii in 1 2 3 4; do curl https://www.spatialresearch.org/wp-content/uploads/2016/07/Layer${ii}_BC_count_matrix-1.tsv --output st-bc${ii}.tsv; done

Once the download have completed you should have the following content in your data/add-bc folder:

foo@bar:~$ ls -1
st-bc1.tsv
st-bc2.tsv
st-bc3.tsv
st-bc4.tsv

These files are already processed, only having spots under the tissue and uses HGNC gene symbols - hence there is no need to pre-process our data.

2. Running stereoscope

You will find the estimated single cell parameters (rates and logits) for the Lung Cancer data set in the data/params-lc.zip file, which when unzipped will create a folder data/params-lc. To use these in conjuction with the newly downloaded breast cancer ST-data simply do:

foo@bar:~$ stereoscope run --sc_fit ../params-lc/R*.tsv ../params/logits*.tsv --st_cnt st-bc*.tsv \
--ste 50000 -stb 256 -lr 0.01 --gpu -o ../../res/bc-add
[2019-10-30 05:44:00,255 - stsc - INFO ] >> Using device cuda
[2019-10-30 05:44:00,256 - stsc - INFO ] >> load sc parameter | rates (R) : ../params-lc/R.lc.tsv | logodds (logits) : ../params-lc/logits.lc.tsv
[2019-10-30 05:44:00,312 - stsc - INFO ] >> fit st data section(s) :['st-bc1.tsv', 'st-bc2.tsv', 'st-bc3.tsv','st-bc4.tsv']
[2019-10-30 05:44:13,668 - stsc - INFO ] >> ST data GENES : 16385  ST data SPOTS : 1031  

The flag --sc_fit expects two positional arguments, the first being the path to the rates and the second being the path to the logits.

3. Visualization

We can visualize the resullts by using the look module of stereoscope (see above for more detailed description). This is simply done by :

foo@bar:~$ cd ../../res/bc-add
foo@bar:~$ stereoscope look -pp bc*/W*tsv -ms 80 -nc 5 -sc i -sb s -c "umap" -g -y -o viz

Which will generate a set of images like these: alt text

We do not have a good reference like the Allen Brain Atlas to assess the quality of our results here, and we have no ambition of presenting these as new findings in a biological context. Rather, this is to be considered as partly a demonstration of how to use stereoscope in different settings, but also how data sets from different tissues can be combined if one desires.

TODO's

  • Add cell type priors based on single cell data
  • implement subsampling as module

Special Thanks

  • cartal - for identifying small bugs in the look module and providing suggestions for additional features to enhance visualization.
  • nik1sto - catching several bugs and providing great feedback along with suggestions for features to improve.