.. role:: python(code)
   :language: python
   :class: highlight

Single-cell Reference Guided Gene Expression Embedding
========================================================

A single-cell reference provides complementary molecular context beyond histology and can improve prediction. Below, we describe how we construct the gene expression embedding from the single-cell reference.

Cell-type deconvolution using RCTD
------------------------------------------------------

First, build a single-cell reference (paired with the ST data or from a similar tissue) with cell type labels saved as ``celltype`` in ``metadata``, and save it as ``sc_reference.RDS``. **This step is not included in the** ``spEnhance`` **workflow**.

Next, we compute the cell type-by-gene reference matrix :math:`\mathbf{R} \in\mathbb{R}^{C\times G}`, where each entry :math:`R_{c,g}` is the mean expression of gene :math:`g` across all cells of type :math:`c`. We then apply ``RCTD`` to deconvolve each spot, yielding :math:`\mathbf{\Pi}=[\pi_{s,c}]\in[0,1]^{S\times C}`, where :math:`\pi_{s,c}` estimates the fraction of cell type :math:`c` in spot :math:`s = 1, \dots, S` (each row sums to one).

.. code-block:: shell

   Rscript run_rctd.R ${prefix}sc_reference.RDS ${prefix}cnts_train_seed_1.csv ${prefix}locs.csv ${prefix} 4

+ **Input**:

  + ``${prefix}sc_reference.RDS``: paired single-cell reference. You can replace it with your own file name. Cell type labels should be saved as ``celltype`` in ``metadata``.
  + ``${prefix}cnts_train_seed_1.csv``: count matrix for deconvolution. **Do not use the unsplit data.**
  + ``${prefix}locs.csv``: spot location matrix paired with the count matrix above.

+ **Parameters**:

  + ``${prefix}``: path to the folder containing the files, e.g. ``data/``.

+ **Output**:

  + ``proportion_celltype.csv``: spot deconvolution results, with each row representing a spot and each column representing a cell type (each row sums to one).
  + ``locs_celltype.csv``: spot location matrix paired with ``proportion_celltype.csv``.
  + ``reference.csv``: cell type-by-gene reference matrix calculated from the single-cell reference.

Additionally, obtain the cell type names by running:

.. code-block:: shell

   python select_genes.py --n-top=600 ${prefix}"proportion_celltype.csv" ${prefix}"cell-type-names.txt"

The cell type names will be saved to ``cell-type-names.txt``.

Pixel-level cell type prediction
------------------------------------------------------

To predict pixel-level cell types, we train a graph convolutional network (GCN) that maps the histology embedding :math:`\mathbf{U}` to pixel-wise probabilities :math:`\mathbf{P} \in[0,1]^{H_1\times W_1\times C}`, using the spot-level deconvolution :math:`\mathbf{\Pi}` as weak supervision.

.. code-block:: shell

   python impute_slide_celltype.py ${prefix} --epochs=100 --device='cuda' --n_states=5

+ **Input**:

  + ``embeddings-hist-merged.pickle``: merged histology features.
  + ``proportion_celltype.csv``: spot deconvolution results, containing the estimated proportion of each cell type in each spot.
  + ``locs_celltype.csv``: spot location matrix paired with ``proportion_celltype.csv``.
  + ``cell-type-names.txt``: file containing the cell type names.

+ **Parameters**:

  + ``${prefix}``: path to the folder containing the files, e.g. ``data/``.
  + ``--device``: device to use, either ``cuda`` or ``cpu``.
  + ``--n_states``: number of states (the number of independent models trained, validated and used for prediction).

+ **Output**:

  + ``Cell_proportion/``: predicted cell type proportion for each pixel, with each cell type saved in a ``CELL-TYPE.pickle`` file.

Using a GPU is highly recommended.

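
To make the idea of weak supervision concrete, the sketch below shows one common way to compare pixel-wise probabilities with spot-level proportions: average :math:`\mathbf{P}` over the pixels covered by each spot and score the result against :math:`\mathbf{\Pi}` with a cross-entropy. This is only an illustration under stated assumptions, not necessarily the loss used in ``impute_slide_celltype.py``; the boolean ``spot_masks`` array and both function names are hypothetical.

.. code-block:: python

   import numpy as np

   def spot_level_proportions(pixel_probs, spot_masks):
       """Average pixel-wise probabilities over the pixels of each spot.

       pixel_probs: (H, W, C) array, each pixel sums to one over C.
       spot_masks:  (S, H, W) boolean array (hypothetical input),
                    True where a pixel is covered by spot s.
       Returns an (S, C) array whose rows sum to one.
       """
       masks = spot_masks.astype(float)
       summed = np.einsum("shw,hwc->sc", masks, pixel_probs)
       counts = masks.sum(axis=(1, 2))[:, None]
       return summed / counts

   def cross_entropy(target, predicted, eps=1e-8):
       """Mean cross-entropy between target and predicted proportions."""
       return float(-(target * np.log(predicted + eps)).sum(axis=1).mean())

   # Toy data: an 8 x 8 pixel grid, 3 cell types, 2 square spots.
   rng = np.random.default_rng(0)
   H, W, C, S = 8, 8, 3, 2
   logits = rng.normal(size=(H, W, C))
   pixel_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

   spot_masks = np.zeros((S, H, W), dtype=bool)
   spot_masks[0, :4, :4] = True
   spot_masks[1, 4:, 4:] = True

   # Spot-level proportions, as a deconvolution method would provide them.
   pi = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

   pred = spot_level_proportions(pixel_probs, spot_masks)
   loss = cross_entropy(pi, pred)

In a training loop, ``pixel_probs`` would come from the network output and ``loss`` would be minimized with respect to the network parameters.
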
Gene expression feature assignment
------------------------------------------------------------

Given the cell type-by-gene reference matrix :math:`\mathbf{R}` and the pixel-level cell-type probabilities :math:`\mathbf{P}`, we assign a gene expression value at pixel :math:`(h,w)` via :math:`V_{h,w,g}^{(0)} \;=\; \sum_{c=1}^C P_{h,w,c}\, R_{c,g}`. Collecting all genes and pixels forms the tensor :math:`\mathbf{V}^{(0)} \in \mathbb{R}^{H_1 \times W_1 \times G}`. To extract a compact representation, we reduce the gene dimension using truncated SVD to obtain the gene feature embedding :math:`\mathbf{V} \in \mathbb{R}^{H_1 \times W_1 \times G_1}`, where the reduced dimension :math:`G_1` is set by ``--dim``.

.. code-block:: shell

   python assign_reference.py ${prefix} --mode='combined' --normalize='gene-zscore' --dim=256

+ **Input**:

  + ``Cell_proportion/``: predicted cell type proportion for each pixel.
  + ``reference.csv``: cell type-by-gene reference matrix calculated from the single-cell reference.

+ **Parameters**:

  + ``${prefix}``: path to the folder containing the files, e.g. ``data/``.
  + ``--mode``: how reference gene expression is assigned to each pixel. Two modes are provided.

    + ``combined``: each pixel's gene expression is estimated by linearly combining the cell-type reference profiles, using the pixel's predicted cell-type proportions as weights.
    + ``uncombined``: each pixel's gene expression is set to the reference profile of its most probable cell type.

  + ``--normalize``: normalization method. Two modes are provided.

    + ``gene-zscore``: z-score normalization of each gene across all pixels.
    + ``celltype``: z-score normalization across all genes within the same cell type.

  + ``--dim``: number of reduced dimensions of the gene expression features.

+ **Output**:

  + ``embeddings-gene.pickle``: gene expression embedding.
  + ``embeddings-combined.pickle``: combined embeddings of both gene expression and histological features.

.. image:: /_static/celltype.png
   :width: 600px
   :align: center
   :alt: Celltype prediction
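
The assignment and reduction steps described in this section can be sketched in NumPy as follows. This is a minimal illustration, not the code in ``assign_reference.py``: the function name ``gene_embedding`` and the toy inputs are hypothetical, only the ``gene-zscore`` normalization is shown, and a full SVD truncated to the leading components stands in for whatever truncated SVD routine the script uses.

.. code-block:: python

   import numpy as np

   def gene_embedding(P, R, dim, mode="combined"):
       """Assign reference expression to pixels and reduce the gene dimension.

       P:   (H, W, C) pixel-wise cell type probabilities.
       R:   (C, G) cell type-by-gene reference matrix.
       dim: number of reduced dimensions to keep.
       Returns V0 with shape (H, W, G) and V with shape (H, W, dim).
       """
       H, W, C = P.shape
       G = R.shape[1]
       weights = P.reshape(-1, C)
       if mode == "uncombined":
           # Keep only the most probable cell type of each pixel (one-hot).
           weights = np.eye(C)[weights.argmax(axis=1)]
       # V0[h, w, g] = sum_c P[h, w, c] * R[c, g]
       V0 = weights @ R

       # Z-score each gene across all pixels; constant genes are left at zero.
       mean = V0.mean(axis=0, keepdims=True)
       std = V0.std(axis=0, keepdims=True)
       Z = (V0 - mean) / np.where(std > 0, std, 1.0)

       # Keep the leading `dim` singular directions as pixel coordinates.
       U, s, _ = np.linalg.svd(Z, full_matrices=False)
       V = U[:, :dim] * s[:dim]
       return V0.reshape(H, W, G), V.reshape(H, W, dim)

   # Toy example: 6 x 5 pixels, 4 cell types, 20 genes.
   rng = np.random.default_rng(0)
   H, W, C, G = 6, 5, 4, 20
   P = rng.dirichlet(np.ones(C), size=(H, W))   # each pixel sums to one
   R = rng.gamma(2.0, 1.0, size=(C, G))         # non-negative mean expression

   V0, V = gene_embedding(P, R, dim=3)
   V0_hard, V_hard = gene_embedding(P, R, dim=3, mode="uncombined")

Here ``V0`` corresponds to :math:`\mathbf{V}^{(0)}` and ``V`` to the reduced embedding :math:`\mathbf{V}`; in the ``uncombined`` case each pixel of ``V0_hard`` is exactly one row of ``R``.
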