Single-cell Reference Guided Gene Expression Embedding#

A single-cell reference provides complementary molecular context beyond histology and can improve prediction. Below, we describe how we construct the gene expression embedding from the single-cell reference.

Cell-type deconvolution using RCTD#

First, you need to build a single-cell reference (paired with ST data or from similar tissues) with cell type labels saved as celltype in metadata. Save the reference as sc_reference.RDS. This step is not included in spEnhance workflow.

Next, we computed the cell type-by-gene reference matrix \(\mathbf{R} \in\mathbb{R}^{C\times G}\), where each entry \(R_{c,g}\) is the mean expression of gene \(g\) across all cells of type \(c\).

Then we apply RCTD to deconvolve each spot, yielding \(\mathbf{\Pi}=[\pi_{s,c}]\in[0,1]^{S\times C}\), which estimates the fraction of each cell type present in spot \(s = 1, \dots, S\) (rows summing to one).

Rscript run_rctd.R ${prefix}sc_reference.RDS ${prefix}cnts_train_seed_1.csv ${prefix}locs.csv ${prefix} 4
  • Input:
    • ${prefix}sc_reference.RDS: paired single cell reference. You can replace it with file name of your own. Cell type labels shoule be saved as celltype in metadata

    • ${prefix}cnts_train_seed_1.csv: count matrix for deconvolution. Do not use the unsplit data.

    • ${prefix}locs.csv: spot location matrix paired with the previous count matrix.

  • Parameters:
    • ${prefix}: directory to the folder containing the files, i.e. data/.

  • Output:
    • proportion_celltype.csv: spot deconvolution results, with each row representing a spot and each column representing a cell type (row summing to one).

    • locs_celltype.csv: spot location matrix paired with proportion_celltype.csv.

    • reference.csv: cell type-by-gene reference matrix calculated using single-cell reference.

Additionally, obtain cell type names by running:

python select_genes.py --n-top=600 ${prefix}"proportion_celltype.csv" ${prefix}"cell-type-names.txt"

Cell type names will be saved into cell-type-names.txt.

Pixel-level cell type prediction#

To predict pixel-level cell types, we train a graph convolutional network (GCN) that maps the histology embedding \(\mathbf{U}\) to pixel-wise probabilities \(\mathbf{P} \in[0,1]^{H_1\times W_1\times C}\), using the spot-level deconvolution \(\mathbf\Pi\) for weak supervision.

python impute_slide_celltype.py ${prefix} --epochs=100 --device='cuda' --n_states=5
  • Input:
    • embeddings-hist-merged.pickle: merged histology features.

    • proportion_celltype.csv: spot deconvolution results, containing estimated proportion of each cell type in each spot.

    • locs_celltype.csv: spot location matrix paired with proportion_celltype.csv.

    • cell-type-names.txt: file containing cell type names.

  • Parameters:
    • ${prefix}: directory to the folder containing the files, i.e. data/.

    • --device: choosing which device to use, either cuda or cpu.

    • --n_states: number of states (number of independent models trained, validated and used for prediction)

  • Output:
    • Cell_proportion/: predicted cell type proportion for each pixel, with each cell type saved in a CELL-TYPE.pickle file.

The use of GPU is highly recommended.

Gene expression feature assignment#

Given the cell type-by gene reference matrix \(\mathbf{R}\) and the pixel-level cell-type probabilities \(\mathbf{P}\), we assign a gene-expression value at pixel \((h,w)\) via \(V_{h,w,g}^{(0)} \;=\; \sum_{c=1}^C P_{h,w,c}\, R_{c,g}\). Collecting all genes and pixels forms the tensor \(\mathbf{V}^{(0)} \in \mathbb{R}^{H_1 \times W_1 \times G}\).

To extract compact representations, we reduce the gene dimension using truncated SVD to obtain the gene feature embedding \(\mathbf{V} \in \mathbb{R}^{H_1 \times W_1 \times G_1}\).

python assign_reference.py ${prefix} --mode='combined' --normalize='gene-zscore' --dim=256
  • Input:
    • Cell_proportion/: predicted cell type proportion for each pixel.

    • reference.csv: cell type-by-gene reference matrix calculated using single-cell reference.

  • Parameters:
    • ${prefix}: directory to the folder containing the files, i.e. data/.

    • --mode: two modes of assigning reference gene expression for each cell type are provided.
      • combined: each pixel’s gene expression was estimated by linearly combining cell-type reference profiles using the pixel’s predicted cell-type proportions as weights.

      • uncombined: each pixel’s gene expression was estimated as the cell-type reference profiles of the most probable cell type.

    • --normalize: two modes of normalization offered.
      • gene-zscore: z-score normalization for each gene across all pixels.

      • celltype: z-score normalization for all genes in the same cell type.

    • --dim: number of reduced dimensions of gene expression features.

  • Output:
    • embeddings-gene.pickle: gene expression embedding.

    • embeddings-combined.pickle: combined embeddings of both gene expression and histological features.

Celltype prediction