Binary files /tmp/tmp9u2nr5ih/JXU5aINcP1/r-bioc-demixt-1.12.0+dfsg/build/vignette.rds and /tmp/tmp9u2nr5ih/xxrad0_Ugz/r-bioc-demixt-1.14.0+dfsg/build/vignette.rds differ diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/changelog r-bioc-demixt-1.14.0+dfsg/debian/changelog --- r-bioc-demixt-1.12.0+dfsg/debian/changelog 2022-05-15 19:13:20.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/changelog 2022-11-29 08:12:51.000000000 +0000 @@ -1,3 +1,23 @@ +r-bioc-demixt (1.14.0+dfsg-1) unstable; urgency=medium + + * Team upload. + * Repack upstream source to exclude HTML doc with compressed JS + * Build-Depends: r-bioc-dss + Closes: #1024597 + * Add lintian-overrides (see lintian bug #1017966) + + -- Andreas Tille Tue, 29 Nov 2022 09:12:51 +0100 + +r-bioc-demixt (1.14.0-1) unstable; urgency=medium + + * Team upload. + * New upstream version + * Fixed debian/watch for BioConductor + * dh-update-R to update Build-Depends (routine-update) + * Reduce piuparts noise in transitions (routine-update) + + -- Andreas Tille Mon, 21 Nov 2022 13:38:58 +0100 + r-bioc-demixt (1.12.0+dfsg-1) unstable; urgency=medium * Team upload. diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/control r-bioc-demixt-1.14.0+dfsg/debian/control --- r-bioc-demixt-1.12.0+dfsg/debian/control 2022-05-15 19:13:20.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/control 2022-11-29 08:12:51.000000000 +0000 @@ -13,6 +13,10 @@ r-cran-kernsmooth, r-cran-matrixcalc, r-cran-rmarkdown, + r-bioc-dss, + r-cran-dendextend, + r-cran-psych, + r-bioc-sva, r-cran-matrixstats, r-cran-truncdist, r-cran-base64enc, diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/copyright r-bioc-demixt-1.14.0+dfsg/debian/copyright --- r-bioc-demixt-1.12.0+dfsg/debian/copyright 2022-05-15 19:13:20.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/copyright 2022-11-29 08:12:51.000000000 +0000 @@ -2,7 +2,7 @@ Upstream-Name: DeMixT Upstream-Contact: Shaolong Cao, Peng Yang Source: https://bioconductor.org/packages/DeMixT/ -Files-Excluded: DeMixT_Log.nb.html +Files-Excluded: DeMixT_Log.html Files: * Copyright: 2014-2020 Zeya Wang , Shaolong diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/salsa-ci.yml r-bioc-demixt-1.14.0+dfsg/debian/salsa-ci.yml --- r-bioc-demixt-1.12.0+dfsg/debian/salsa-ci.yml 2022-05-15 19:13:20.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/salsa-ci.yml 2022-11-29 08:12:51.000000000 +0000 @@ -9,3 +9,7 @@ # Thus reprotest is disabled here variables: SALSA_CI_DISABLE_REPROTEST: 1 + +# Reduce piuparts noise in transitions +piuparts: + allow_failure: true diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/source/lintian-overrides r-bioc-demixt-1.14.0+dfsg/debian/source/lintian-overrides --- r-bioc-demixt-1.12.0+dfsg/debian/source/lintian-overrides 1970-01-01 00:00:00.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/source/lintian-overrides 2022-11-29 08:12:51.000000000 +0000 @@ -0,0 +1,2 @@ +# See lintian bug #1017966 +r-bioc-demixt source: source-is-missing [inst/doc/demixt.html] diff -Nru r-bioc-demixt-1.12.0+dfsg/debian/watch r-bioc-demixt-1.14.0+dfsg/debian/watch --- r-bioc-demixt-1.12.0+dfsg/debian/watch 2022-05-15 19:13:20.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/debian/watch 2022-11-29 08:12:51.000000000 +0000 @@ -1,3 +1,3 @@ version=4 -opts="downloadurlmangle=s?^(.*)\.\.?https:$1packages/release/bioc?,repacksuffix=+dfsg,dversionmangle=s/\+dfsg[0-9]*//g,repack,compression=xz" \ +opts="downloadurlmangle=s?^(.*)\.\.?https://bioconductor.org/packages/release/bioc?,repacksuffix=+dfsg,dversionmangle=auto" \ https://bioconductor.org/packages/release/bioc/html/DeMixT.html .*/DeMixT_(.*).tar.gz Binary files /tmp/tmp9u2nr5ih/JXU5aINcP1/r-bioc-demixt-1.12.0+dfsg/DeMixT_Log.pdf and /tmp/tmp9u2nr5ih/xxrad0_Ugz/r-bioc-demixt-1.14.0+dfsg/DeMixT_Log.pdf differ diff -Nru r-bioc-demixt-1.12.0+dfsg/DeMixT_Log.Rmd r-bioc-demixt-1.14.0+dfsg/DeMixT_Log.Rmd --- r-bioc-demixt-1.12.0+dfsg/DeMixT_Log.Rmd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/DeMixT_Log.Rmd 2022-11-01 19:11:00.000000000 +0000 @@ -7,6 +7,14 @@ # DeMixT Log Cell type-specific deconvolution of heterogeneous tumor samples with two or three components using expression data from RNAseq or microarray platforms. +## Update from version 1.12.0 to 1.14.0 + +Added preprocessing functions and modified vignettes accordingly + +Included R.h in DeMixT.c to fix compilation error caused by Rprintf. + +Fixed the logic operation in the IF_inverse function of DeMixT_GS. + ## Update from version 1.8.1 to 1.8.2 Revised the link for citation. diff -Nru r-bioc-demixt-1.12.0+dfsg/DESCRIPTION r-bioc-demixt-1.14.0+dfsg/DESCRIPTION --- r-bioc-demixt-1.12.0+dfsg/DESCRIPTION 2022-04-26 20:27:24.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/DESCRIPTION 2022-11-01 21:20:59.000000000 +0000 @@ -2,8 +2,8 @@ Title: Cell type-specific deconvolution of heterogeneous tumor samples with two or three components using expression data from RNAseq or microarray platforms -Version: 1.12.0 -Date: 2021-11-20 +Version: 1.14.0 +Date: 2022-10-04 Author: Zeya Wang , Shaolong Cao, Wenyi Wang Maintainer: Shuai Guo @@ -11,7 +11,8 @@ data from a mixture of two or three components. LazyData: TRUE Depends: R (>= 3.6.0), parallel, Rcpp (>= 1.0.0), SummarizedExperiment, - knitr, KernSmooth, matrixcalc, rmarkdown + knitr, KernSmooth, matrixcalc, rmarkdown, DSS, dendextend, + psych, sva Imports: matrixStats, stats, truncdist, base64enc, ggplot2 LinkingTo: Rcpp NeedsCompilation: yes @@ -19,10 +20,10 @@ biocViews: Software, StatisticalMethod, Classification, GeneExpression, Sequencing, Microarray, TissueMicroarray, Coverage License: GPL-3 -RoxygenNote: 7.1.1 +RoxygenNote: 7.1.2 +Packaged: 2022-11-01 21:20:59 UTC; biocbuild git_url: https://git.bioconductor.org/packages/DeMixT -git_branch: RELEASE_3_15 -git_last_commit: f7bf7d2 -git_last_commit_date: 2022-04-26 -Date/Publication: 2022-04-26 -Packaged: 2022-04-26 20:27:24 UTC; biocbuild +git_branch: RELEASE_3_16 +git_last_commit: 8da5af8 +git_last_commit_date: 2022-11-01 +Date/Publication: 2022-11-01 diff -Nru r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.html r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.html --- r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.html 2022-04-26 20:27:24.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.html 2022-11-01 21:20:59.000000000 +0000 @@ -10,9 +10,7 @@ - - A Vignette for DeMixT @@ -107,13 +105,20 @@ for (var i = 0; i < sheets.length; i++) { if (sheets[i].ownerNode.dataset["origin"] !== "pandoc") continue; try { var rules = sheets[i].cssRules; } catch (e) { continue; } - for (var j = 0; j < rules.length; j++) { + var j = 0; + while (j < rules.length) { var rule = rules[j]; // check if there is a div.sourceCode rule - if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") continue; + if (rule.type !== rule.STYLE_RULE || rule.selectorText !== "div.sourceCode") { + j++; + continue; + } var style = rule.style.cssText; // check if color or background-color is set - if (rule.style.color === '' && rule.style.backgroundColor === '') continue; + if (rule.style.color === '' && rule.style.backgroundColor === '') { + j++; + continue; + } // replace div.sourceCode by a pre.sourceCode rule sheets[i].deleteRule(j); sheets[i].insertRule('pre.sourceCode{' + style + '}', j); @@ -318,8 +323,11 @@

A Vignette for DeMixT

-

Zeya Wang, Fan Gao, Shaolong Cao and Peng Yang

-

2022-04-26

+

Last updated: 2022-11-01

+ + + +
@@ -339,16 +347,23 @@

3. Installation

-
-

3.1 Source file

-

DeMixT source files are compatible with Windows, Linux and macOS.

-

DeMixT_1.8.2 is the latest version, which is for a computer that has OpenMP. It can be downloaded from Bioconductor website. https://bioconductor.org/packages/release/bioc/html/DeMixT.html

-

To install DeMixT_1.8.2 from GitHub, start R and enter:

- -

For more information, please visit: http://bioinformatics.mdanderson.org/main/DeMixT

+

The DeMixT package is compatible with Windows, Linux and MacOS. The user can install it from Bioconductor:

+
if (!require("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
+
+BiocManager::install("DeMixT")
+

For Linux and MacOS, the user can also install the latest DeMixT from GitHub:

+
if (!require("devtools", quietly = TRUE))
+    install.packages('devtools')
+
+devtools::install_github("wwylab/DeMixT")
+

Check if DeMixT is installed successfully:

+
# load package
+library(DeMixT)
+

Note: DeMixT relies on OpenMP for parallel computing. Starting from R 4.00, R no longer supports OpenMP on MacOS, meaning the user can only run DeMixT with one core on MacOS. We therefore recommend the users to mainly use Linux system for running DeMixT to take advantage of the multi-core parallel computation.

-
-

3.2 Functions

+
+

4. Functions

The following table shows the functions included in DeMixT.

@@ -364,63 +379,66 @@ - + - + - + - + - + + + + +
DeMixTDeconvolution of tumor samples with two or three componentsDeconvolution of tumor samples with two or three components.
DeMixT_GSEstimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihoodEstimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihood.
DeMixT_DEEstimates the proportions of mixed samples for each mixing componentEstimates the proportions of mixed samples for each mixing component.
DeMixT_S2Deconvolves expressions of each sample for unknown componentDeconvolves expressions of each sample for unknown component.
Optimum_KernelCCall the C function used for parameter estimation in DeMixTCall the C function used for parameter estimation in DeMixT.
DeMixT_PreprocessingPreprocessing functions before running DeMixT.
-
-

4. Methods

+

5. Methods

-

4.1 Model

+

5.1 Model

Let \(Y_{ig}\) be the observed expression levels of the raw measured data from clinically derived malignant tumor samples for gene \(g, g = 1, \cdots, G\) and sample \(i, i = 1, \cdots, My\). \(G\) denotes the total number of probes/genes and \(My\) denotes the number of samples. The observed expression levels for solid tumors can be modeled as a linear combination of raw expression levels from three components: \[ {Y_{ig}} = \pi _{1,i}N_{1,ig} + \pi _{2,i}N_{2,ig} + (1 - \pi_{1,i} - \pi _{2,i}){T_{ig}} \label{eq:1} \]

Here \(N_{1,ig}\), \(N_{2,ig}\) and \({T_{ig}}\) are the unobserved raw expression levels from each of the three components. We call the two components for which we require reference samples the \(N_1\)-component and the \(N_2\)-component. We call the unknown component the T-component. We let \(\pi_{1,i}\) denote the proportion of the \(N_1\)-component, \(\pi_{2,i}\) denote the proportion of the \(N_2\)-component, and \(1 - \pi_{1,i}-\pi_{2,i}\) denote the proportion of the T-component. We assume that the mixing proportions of one specific sample remain the same across all genes.

Our model allows for one component to be unknown, and therefore does not require reference profiles from all components. A set of samples for \(N_{1,ig}\) and \(N_{2,ig}\), respectively, needs to be provided as input data. This three-component deconvolution model is applicable to the linear combination of any three components in any type of material. It can also be simplified to a two-component model, assuming there is just one \(N\)-component. For application in this paper, we consider tumor (\(T\)), stromal (\(N_1\)) and immune components (\(N_2\)) in an admixed sample (\(Y\)).

-

Following the convention that \(\log_2\)-transformed microarray gene expression data follow a normal distribution, we assume that the raw measures \(N_{1,ig} \sim LN({\mu _{{N_1}g}},\sigma _{{N_1}g}^2)\), \(N_{2,ig} \sim LN({\mu _{{N_2}g}},\sigma _{{N_2}g}^2)\) and \({T_{ig}} \sim LN({\mu _{Tg}}, \sigma _{Tg}^2)\), where LN denotes a \(\log_2\)-normal distribution and \(\sigma _{{N_1}g}^2\),\(\sigma _{{N_2}g}^2\), \(\sigma _{Tg}^2\) reflect the variations under \(\log_2\)-transformed data. Consequently, our model can be expressed as the convolution of the density function for three \(\log_2\)-normal distributions. Because there is no closed form of this convolution, we use numerical integration to evaluate the complete likelihood function (see the full likelihood in the Supplementary Materials).

+

Following the convention that \(\log_2\)-transformed microarray gene expression data follow a normal distribution, we assume that the raw measures \(N_{1,ig} \sim LN({\mu _{{N_1}g}},\sigma _{{N_1}g}^2)\), \(N_{2,ig} \sim LN({\mu _{{N_2}g}},\sigma _{{N_2}g}^2)\) and \({T_{ig}} \sim LN({\mu _{Tg}}, \sigma _{Tg}^2)\), where LN denotes a \(\log_2\)-normal distribution and \(\sigma _{{N_1}g}^2\),\(\sigma _{{N_2}g}^2\), \(\sigma _{Tg}^2\) reflect the variations under \(\log_2\)-transformed data. Consequently, our model can be expressed as the convolution of the density function for three \(\log_2\)-normal distributions. Because there is no closed form of this convolution, we use numerical integration to evaluate the complete likelihood function (see the full likelihood in the Supplementary Materials in [1]).

-

4.2 The DeMixT algorithm for deconvolution

+

5.2 The DeMixT algorithm for deconvolution

DeMixT estimates all distribution parameters and cellular proportions and reconstitutes the expression profiles for all three components for each gene and each sample. The estimation procedure (summarized in Figure 1b) has two main steps as follows.

  1. Obtain a set of parameters \(\{\pi_{1,i}, \pi_{2,i}\}_{i=1}^{My}\), \(\{\mu_T, \sigma_T\}_{g=1}^G\) to maximize the complete likelihood function, for which \(\{\mu_{N_{1,g}}, \sigma_{N_{1,g}}, \mu_{N_{2,g}}, \sigma_{N_{2,g}}\}_{g=1}^G\) were already estimated from the available unmatched samples of the \(N_1\) and \(N_2\) component tissues. (See further details in our paper.)

  2. Reconstitute the expression profiles by searching each set of \(\{n_{1,ig}, n_{2,ig}\}\) that maximizes the joint density of \(N_{1,ig}\), \(N_{2,ig}\) and \(T_{ig}\). The value of \(t_{ig}\) is solved as \({y_{ig}} - {{\hat \pi }_{1,i}}{n_{1,ig}} - {{\hat \pi }_{2,i}}{n_{2,ig}}\).

These two steps can be separately implemented using the function DeMixT_DE or DeMixT_GS for the first step and DeMixT_S2 for the second, which are combined in the function DeMixT(Note: DeMixT_GS is the default function for first step).

-

In version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, based on the observed normal reference samples. It has been shown to improve accuracy in proportion estimation for the scenario where a dataset consists of samples where true tumor proportions are skewed to the high end.

+

Since version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, based on the observed normal reference samples. It has been shown to improve accuracy in proportion estimation for the scenario where a dataset consists of samples where true tumor proportions are skewed to the high end.

-

5. Examples

+

6. Examples

-

5.1 Simulated two-component data

- - +

6.1 Simulated two-component data

+ +
##               PiN1       PiT
 ## Sample 1 0.5955120 0.4044880
 ## Sample 2 0.2759014 0.7240986
@@ -428,229 +446,304 @@
 ## Sample 4 0.4497041 0.5502959
 ## Sample 5 0.6516980 0.3483020
 ## Sample 6 0.4365191 0.5634809
- +
## [1] "Gene 418" "Gene 452" "Gene 421" "Gene 112" "Gene 154" "Gene 143"
- - + +
##         Sample 1   Sample 2   Sample 3  Sample 4   Sample 5
 ## Gene 1 18.857446  60.727041 159.878946 92.031635  40.873852
 ## Gene 2  2.322481   3.390938   2.406093  2.558962   2.438189
 ## Gene 3 48.843631 208.166410  66.986239 38.107580 460.556751
- +
##         Sample 1  Sample 2 Sample 3  Sample 4  Sample 5
 ## Gene 1  59.37087  71.80492  74.1755  73.55878  72.96267
 ## Gene 2 107.66874 131.20005 113.6376 120.35924 125.28224
 ## Gene 3 513.43184 669.79145 613.3042 491.09308 741.76507
- +
##            MuN1      MuT
 ## Gene 1 6.166484 5.924321
 ## Gene 2 6.677594 2.974551
 ## Gene 3 9.329628 7.396647
- +
##           SigmaN   SigmaT
 ## Gene 1 0.2222914 1.127726
 ## Gene 2 0.2319681 1.614169
 ## Gene 3 0.1881647 1.320477
-
-

5.2 Simulated two-component data with DeMixT Spike-in Normal

+
+

6.2 Simulated two-component data

In the simulation,

- +

where \(\pi_i \in (0.25, 0.95)\) is from truncated normal distribution. In general, the true distribution of tumor proportion does not follow a uniform distribution between \([0,1]\), but instead skewed to the upper part of the interval.

- +

This simulation was designed to compare previous DeMixT resutls with DeMixT spike-in results under both gene selection method.

- +

-

5.3 Simulated three-component data

+

6.3 Simulated three-component data

In this simulation,

- +

where \(G1\) is the number of genes that \(\mu_{N1}\) is close to \(\mu_{N2}\).

- - + +

-
-

5.4 PRAD in TCGA Dataset

- - -

Tumor proportion estimation remains consistence with and without spike-in normal, when its distribution is roughly symmetric around 0.5.

+
+

6.4 Real data: PRAD in TCGA dataset

+
+

6.4.1 Preprocessing

+

For the deconvolution of real data with DeMixT, the user may apply the following preprocessing steps first. Here, we use the PRAD (prostate adenocarcinoma) from TCGA as an example.

+
    +
  1. (Optional) Remove suspicious samples by hierarchical clustering
  2. +
+

It is possible that the some of the tumor and normal samples are mislabelled. We use the function

+
# hc_labels <- detect_suspicious_sample_by_hierarchical_clustering_2comp(count.matrix, normal.id, tumor.id)
+

to visually inspect the separation of tumor and normal samples based on the hierarchical clustering of their expressions, in which count.matrix is the raw count matrix; normal.id and tumor.id are the vectors of normal and tumor sample ids, respectively. Generally, one cluster contains tumor samples and the other contains normal samples. Any samples that are clustered outside of its own group label, e.g., tumor samples clustered within the normal sample cluster or normal samples in the tumor cluster, are considered as mislabelled samples and filtered out.

+

+

count.matrix, normal.id and tumor.id are updated by

+ +
    +
  1. Select genes with small variation in gene expression across samples
  2. +
+

In this step, we select a subset of ~9000 genes from the original gene set (>50,000) before running DeMixT with the GS (Gene Selection) method so that our model-based gene selection maintains good statistical properties. We use the function

+
# plot_sd(count.matrix, normal.id, tumor.id)
+

to visualize the distribution of standard deviation of log2 expressions space of normal and tumor samples, and use the function

+
# num_gene_remaining_different_cutoffs <- subset_sd_gene_remaining(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0),  cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1)
+

to find he a range of variance in genes from normal samples (cutoff_normal_range) and from tumor samples (cutoff_tumor_range) which results in roughly 9,000 genes.

+

We then use the function

+
# count.matrix <- subset_sd(count.matrix, normal.id, tumor.id, cutoff_normal = cutoff_normal_range, cutoff_tumor = cutoff_tumor_range)
+

to update the count.matrix such that it only includes the selected genes.

+
    +
  1. Scale normalization
  2. +
+

We apply a scale normalization at the 75th percentile across all the tumor and normal samples using the function

+
# count.matrix <- scale_normalization_75th_percentile(count.matrix)
+

to adjust the expression levels in the samples.

+

Note: The user may also use the function

+
# count.matrix <- DeMixT_preprocessing(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0),  cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1)
+

to perform the preprocessing steps of 2) and 3) in one go.

+
    +
  1. (Optional) Batch effect correction for tumor samples from different batches by ComBat
  2. +
+

If the tumor samples are from different batches, we recommend the user to inspect the batch effect using the function before running DeMixT

+
# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2)
+

This function will generate a PCA plot, in which the samples are colored by the labels indicating different batches of tumor samples, as well as normal samples. If there is a clear separation between different batches of tumor samples, there is likely batch effects. We use the function

+
# count.matrix.tumor <- batch_correction(count.matrix.tumor, labels)
+

to reduce this effect, where count.matrix.tumor is the raw count matrix only from tumor samples and labels is the factor of tumor batches. The user may choose other batch effect correction methods at this step.

+

The user may inspect the batch effect again after the above step using the mentioned function

+
# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2)
+
+
+

6.4.2 Deconvolution using DeMixT

+

To optimize the DeMixT parameter setting for the input data, we recommend testing an array of combinations of the number of spike-ins and the number of selected genes.

+

The number of CPU cores used by the DeMixT function for parallel computing is specified by the parameter nthread. By default (such as in the code block below), nthread = total_number_of_cores_on_the_machine - 1. The user can change nthread to a number between 0 and the total number of cores on the machine.

+
## Because of the random initial values and the spike-in samples within the DeMixT function, 
+## we would like to remind the user to set seeds to keep track. This seed setting will be 
+## internalized in DeMixT in the next update.
+
+# set.seed(1234)
+
+# data.Y = SummarizedExperiment(assays = list(counts = count.matrix[, tumor.id]))
+# data.N1 <- SummarizedExperiment(assays = list(counts = count.matrix[, normal.id]))
+
+## In practice, we set the maximum number of spike-in as min(n/3, 200), 
+## where n is the number of samples. 
+# nspikesin_list = c(0, 50, 100, 150)
+## One may set a wider range than provided below for studies other than TCGA.
+# ngene.selected_list = c(500, 1000, 1500, 2500)
+
+#for(nspikesin in nspikesin_list){
+#    for(ngene.selected in ngene.selected_list){
+#        name = paste("PRAD_demixt_GS_res_nspikesin", nspikesin, "ngene.selected", 
+#                      ngene.selected,  sep = "_");
+#        name = paste(name, ".RData", sep = "");
+#        res = DeMixT(data.Y = data.Y,
+#                     data.N1 = data.N1,
+#                     ngene.selected.for.pi = ngene.selected,
+#                     ngene.Profile.selected = ngene.selected,
+#                     filter.sd = 0.7, # same upper bound of gene expression standard deviation 
+#                     # for normal reference. i.e., preprocessed_data$sd_cutoff_normal[2]
+#                     gene.selection.method = "GS",
+#                     nspikein = nspikesin)
+#        save(res, file = name)
+#    }
+#}
+

We suggest selecting the optimal parameter combination that produces the largest average correlation of estimated tumor propotions with those produced by other combinations. The location of the mode of the Pi estimation may also be considered. The mode located too high or too low may suggest biased estimation.

+

Instead of selecting using the parameter combination with the highest correlation, one can also select the parameter combination that produces estimated tumor proportions that are most biologically meaningful.

+

A comprehensive tutorial of using DeMixT for real data deconvolution can be found at https://wwylab.github.io/DeMixT/tutorial.html.

+
-

5.5 Deconvolution using normal reference samples from GTEx

+

6.5 Deconvolution using normal reference samples from GTEx

We conducted experiments across cancer types to evaluate the impact of technical artifacts such as batch effects to the proportion estimation when using a different cohort. We applied GTEx expression data from normal prostate samples as the normal reference to deconvolute the TCGA prostate cancer samples, where normal tissues were selected without significant pathology. The estimated proportions showed a reasonable correlation (Spearman correlation coefficient = 0.65) with those generated using TCGA normal prostate samples as the normal reference.

- +

+
+

7. Reference

+

[1]. Wang, Z. et al. Transcriptome Deconvolution of Heterogeneous Tumor Samples with Immune Infiltration. iScience 9, 451–460 (2018).

+
-

6. Session Info

- -
## R version 4.2.0 RC (2022-04-19 r82224)
+

8. Session Info

+ +
## R version 4.2.1 (2022-06-23)
 ## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Ubuntu 20.04.4 LTS
+## Running under: Ubuntu 20.04.5 LTS
 ## 
 ## Matrix products: default
-## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
-## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
+## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
+## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
 ## 
 ## locale:
 ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
@@ -664,47 +757,72 @@
 ## character(0)
 ## 
 ## other attached packages:
-## [1] DeMixT_1.12.0
+## [1] DeMixT_1.14.0
 ## 
 ## loaded via a namespace (and not attached):
-##  [1] Rcpp_1.0.8.3                lattice_0.20-45            
-##  [3] assertthat_0.2.1            digest_0.6.29              
-##  [5] utf8_1.2.2                  R6_2.5.1                   
-##  [7] GenomeInfoDb_1.32.0         stats4_4.2.0               
-##  [9] evaluate_0.15               ggplot2_3.3.5              
-## [11] highr_0.9                   pillar_1.7.0               
-## [13] utils_4.2.0                 zlibbioc_1.42.0            
-## [15] rlang_1.0.2                 jquerylib_0.1.4            
-## [17] S4Vectors_0.34.0            Matrix_1.4-1               
-## [19] rmarkdown_2.14              labeling_0.4.2             
-## [21] stringr_1.4.0               RCurl_1.98-1.6             
-## [23] munsell_0.5.0               DelayedArray_0.22.0        
-## [25] compiler_4.2.0              xfun_0.30                  
-## [27] pkgconfig_2.0.3             stats_4.2.0                
-## [29] BiocGenerics_0.42.0         base64enc_0.1-3            
-## [31] htmltools_0.5.2             tidyselect_1.1.2           
-## [33] SummarizedExperiment_1.26.0 tibble_3.1.6               
-## [35] GenomeInfoDbData_1.2.8      matrixcalc_1.0-5           
-## [37] IRanges_2.30.0              matrixStats_0.62.0         
-## [39] grDevices_4.2.0             fansi_1.0.3                
-## [41] crayon_1.5.1                dplyr_1.0.8                
-## [43] withr_2.5.0                 bitops_1.0-7               
-## [45] grid_4.2.0                  jsonlite_1.8.0             
-## [47] gtable_0.3.0                lifecycle_1.0.1            
-## [49] DBI_1.1.2                   magrittr_2.0.3             
-## [51] datasets_4.2.0              scales_1.2.0               
-## [53] KernSmooth_2.23-20          cli_3.3.0                  
-## [55] stringi_1.7.6               farver_2.1.0               
-## [57] XVector_0.36.0              bslib_0.3.1                
-## [59] ellipsis_0.3.2              graphics_4.2.0             
-## [61] generics_0.1.2              vctrs_0.4.1                
-## [63] base_4.2.0                  tools_4.2.0                
-## [65] Biobase_2.56.0              glue_1.6.2                 
-## [67] purrr_0.3.4                 MatrixGenerics_1.8.0       
-## [69] parallel_4.2.0              fastmap_1.1.0              
-## [71] yaml_2.3.5                  colorspace_2.0-3           
-## [73] GenomicRanges_1.48.0        knitr_1.38                 
-## [75] sass_0.4.1                  methods_4.2.0
+## [1] colorspace_2.0-3 bsseq_1.34.0 +## [3] rjson_0.2.21 XVector_0.38.0 +## [5] GenomicRanges_1.50.0 base64enc_0.1-3 +## [7] farver_2.1.1 stats_4.2.1 +## [9] bit64_4.0.5 AnnotationDbi_1.60.0 +## [11] fansi_1.0.3 codetools_0.2-18 +## [13] splines_4.2.1 R.methodsS3_1.8.2 +## [15] sparseMatrixStats_1.10.0 mnormt_2.1.1 +## [17] cachem_1.0.6 knitr_1.40 +## [19] jsonlite_1.8.3 Rsamtools_2.14.0 +## [21] annotate_1.76.0 base_4.2.1 +## [23] png_0.1-7 R.oo_1.25.0 +## [25] DSS_2.46.0 HDF5Array_1.26.0 +## [27] compiler_4.2.1 httr_1.4.4 +## [29] assertthat_0.2.1 Matrix_1.5-1 +## [31] fastmap_1.1.0 limma_3.54.0 +## [33] cli_3.4.1 htmltools_0.5.3 +## [35] tools_4.2.1 gtable_0.3.1 +## [37] glue_1.6.2 GenomeInfoDbData_1.2.9 +## [39] dplyr_1.0.10 grDevices_4.2.1 +## [41] Rcpp_1.0.9 Biobase_2.58.0 +## [43] jquerylib_0.1.4 vctrs_0.5.0 +## [45] Biostrings_2.66.0 rhdf5filters_1.10.0 +## [47] nlme_3.1-160 rtracklayer_1.58.0 +## [49] DelayedMatrixStats_1.20.0 psych_2.2.9 +## [51] xfun_0.34 stringr_1.4.1 +## [53] lifecycle_1.0.3 restfulr_0.0.15 +## [55] gtools_3.9.3 XML_3.99-0.12 +## [57] dendextend_1.16.0 edgeR_3.40.0 +## [59] zlibbioc_1.44.0 scales_1.2.1 +## [61] BSgenome_1.66.0 graphics_4.2.1 +## [63] MatrixGenerics_1.10.0 parallel_4.2.1 +## [65] SummarizedExperiment_1.28.0 rhdf5_2.42.0 +## [67] utils_4.2.1 yaml_2.3.6 +## [69] memoise_2.0.1 gridExtra_2.3 +## [71] ggplot2_3.3.6 sass_0.4.2 +## [73] datasets_4.2.1 stringi_1.7.8 +## [75] RSQLite_2.2.18 highr_0.9 +## [77] genefilter_1.80.0 S4Vectors_0.36.0 +## [79] BiocIO_1.8.0 permute_0.9-7 +## [81] BiocGenerics_0.44.0 BiocParallel_1.32.0 +## [83] GenomeInfoDb_1.34.0 rlang_1.0.6 +## [85] pkgconfig_2.0.3 matrixStats_0.62.0 +## [87] bitops_1.0-7 evaluate_0.17 +## [89] lattice_0.20-45 Rhdf5lib_1.20.0 +## [91] GenomicAlignments_1.34.0 labeling_0.4.2 +## [93] bit_4.0.4 tidyselect_1.2.0 +## [95] magrittr_2.0.3 R6_2.5.1 +## [97] IRanges_2.32.0 generics_0.1.3 +## [99] DelayedArray_0.24.0 DBI_1.1.3 +## [101] pillar_1.8.1 withr_2.5.0 +## [103] mgcv_1.8-41 survival_3.4-0 +## [105] KEGGREST_1.38.0 RCurl_1.98-1.9 +## [107] tibble_3.1.8 crayon_1.5.2 +## [109] KernSmooth_2.23-20 utf8_1.2.2 +## [111] rmarkdown_2.17 viridis_0.6.2 +## [113] locfit_1.5-9.6 grid_4.2.1 +## [115] sva_3.46.0 data.table_1.14.4 +## [117] blob_1.2.3 methods_4.2.1 +## [119] matrixcalc_1.0-6 digest_0.6.30 +## [121] xtable_1.8-4 R.utils_2.12.1 +## [123] stats4_4.2.1 munsell_0.5.0 +## [125] viridisLite_0.4.1 bslib_0.4.0
diff -Nru r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.R r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.R --- r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.R 2022-04-26 20:27:24.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.R 2022-11-01 21:20:58.000000000 +0000 @@ -47,9 +47,6 @@ if(ofile) invisible(dev.off()) } -## ----install_DeMixT----------------------------------------------------------- -# devtools::install_github("wwylab/DeMixT") - ## ----Algorithm, echo=FALSE, out.width='100%'---------------------------------- knitr::include_graphics(path = paste0("Algorithm.png")) @@ -144,21 +141,16 @@ scale_shape_manual(values=c(seq(1:3))) + labs(x = 'True Proportion', y = 'Estimated Proportion') -## ----read_data_PRAD, warning=FALSE, message=FALSE---------------------------- -load('res.PRAD.RData'); +## ----hierarchical_clustering, echo=FALSE, out.width='85%', fig.align='center'---- +knitr::include_graphics(path = paste0("hierarchical_clustering.png")) -## ----PRAD, fig.height = 4, fig.width = 6, fig.align='center', warning=FALSE---- -res.PRAD.df = as.data.frame(cbind(res.PRAD$res.GS.PRAD, res.PRAD$res.GS.SP.PRAD)) -res.PRAD.df$V1 <- as.numeric(as.character(res.PRAD.df$V1)) -res.PRAD.df$V2 <- as.numeric(as.character(res.PRAD.df$V2)) -names(res.PRAD.df) = c('Estimated.Proportion', 'Estimated.Proportion.SP') -## Plot -ggplot(res.PRAD.df, aes(x=Estimated.Proportion, y=Estimated.Proportion.SP)) + - geom_point() + - geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", lwd = 0.5) + - xlim(0,1) + ylim(0,1) + - scale_shape_manual(values=c(seq(1:3))) + - labs(x = 'Estimated Proportion', y = 'Estimated Proportion After Spike-in') +## ----------------------------------------------------------------------------- +# normal.id <- setdiff(normal.id, names(hc_labels$cluster[hc_labels$cluster == 1])) +# tumor.id <- setdiff(tumor.id, names(hc_labels$cluster[hc_labels$cluster == 2])) +# count.matrix <- count.matrix[, c(normal.id, tumor.id)] + +## ----plot_sd, echo=FALSE, out.width='85%', fig.align='center'----------------- +knitr::include_graphics(path = paste0("sd_expresion.png")) ## ----GTEx, echo=FALSE, out.width='65%', fig.align='center'-------------------- knitr::include_graphics(path = paste0("GTEx_normal.png")) diff -Nru r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.Rmd r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.Rmd --- r-bioc-demixt-1.12.0+dfsg/inst/doc/demixt.Rmd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/inst/doc/demixt.Rmd 2022-11-01 19:11:00.000000000 +0000 @@ -1,7 +1,6 @@ --- title: "A Vignette for DeMixT" -author: "Zeya Wang, Fan Gao, Shaolong Cao and Peng Yang" -date: "`r Sys.Date()`" +date: "Last updated: `r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{DeMixT.Rmd} @@ -9,6 +8,7 @@ \usepackage[utf8]{inputenc} --- +
```{r setup, include=FALSE, message=FALSE} library(ggplot2) @@ -117,39 +117,46 @@ # 3. Installation -## 3.1 Source file - -DeMixT source files are compatible with Windows, Linux and macOS. +The DeMixT package is compatible with Windows, Linux and MacOS. The user can install it from ``Bioconductor``: +``` +if (!require("BiocManager", quietly = TRUE)) + install.packages("BiocManager") -DeMixT_1.8.2 is the latest version, which is for a computer that has OpenMP. -It can be downloaded from Bioconductor website. -https://bioconductor.org/packages/release/bioc/html/DeMixT.html +BiocManager::install("DeMixT") +``` -To install DeMixT_1.8.2 from GitHub, start R and enter: +For Linux and MacOS, the user can also install the latest DeMixT from GitHub: +``` +if (!require("devtools", quietly = TRUE)) + install.packages('devtools') -```{r install_DeMixT} -# devtools::install_github("wwylab/DeMixT") +devtools::install_github("wwylab/DeMixT") ``` -For more information, please visit: - +Check if DeMixT is installed successfully: +``` +# load package +library(DeMixT) +``` +**Note**: DeMixT relies on OpenMP for parallel computing. Starting from R 4.00, R no longer supports OpenMP on MacOS, meaning the user can only run DeMixT with one core on MacOS. We therefore recommend the users to mainly use Linux system for running DeMixT to take advantage of the multi-core parallel computation. -## 3.2 Functions +# 4. Functions The following table shows the functions included in DeMixT. Table Header | Second Header ------- | ---------------------------------- -DeMixT | Deconvolution of tumor samples with two or three components -DeMixT_GS | Estimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihood -DeMixT_DE | Estimates the proportions of mixed samples for each mixing component -DeMixT_S2 |Deconvolves expressions of each sample for unknown component -Optimum_KernelC | Call the C function used for parameter estimation in DeMixT +DeMixT | Deconvolution of tumor samples with two or three components. +DeMixT_GS | Estimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihood. +DeMixT_DE | Estimates the proportions of mixed samples for each mixing component. +DeMixT_S2 |Deconvolves expressions of each sample for unknown component. +Optimum_KernelC | Call the C function used for parameter estimation in DeMixT. +DeMixT_Preprocessing | Preprocessing functions before running DeMixT. -# 4. Methods +# 5. Methods -## 4.1 Model +## 5.1 Model Let \(Y_{ig}\) be the observed expression levels of the raw measured data from clinically derived malignant tumor samples for gene \(g, g = 1, \cdots, G\) and @@ -188,9 +195,9 @@ Consequently, our model can be expressed as the convolution of the density function for three \(\log_2\)-normal distributions. Because there is no closed form of this convolution, we use numerical integration to evaluate the complete -likelihood function (see the full likelihood in the Supplementary Materials). +likelihood function (see the full likelihood in the Supplementary Materials in [1]). -## 4.2 The DeMixT algorithm for deconvolution +## 5.2 The DeMixT algorithm for deconvolution DeMixT estimates all distribution parameters and cellular proportions and reconstitutes the expression profiles for all three components for each gene @@ -213,7 +220,7 @@ DeMixT_GS for the first step and DeMixT_S2 for the second, which are combined in the function DeMixT(Note: DeMixT_GS is the default function for first step). -In version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, +Since version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, based on the observed normal reference samples. It has been shown to improve accuracy in proportion estimation for the scenario where a dataset consists of samples where true tumor proportions are skewed to the high end. @@ -223,9 +230,9 @@ ``` -# 5. Examples +# 6. Examples -## 5.1 Simulated two-component data +## 6.1 Simulated two-component data ```{r, sim_2comp_GS, results="hide", message=FALSE} data("test.data.2comp") @@ -259,7 +266,7 @@ head(res.S2$decovSigma,3) ``` -## 5.2 Simulated two-component data with DeMixT Spike-in Normal +## 6.2 Simulated two-component data In the simulation, ````markdown @@ -337,7 +344,7 @@ labs(x = 'True Proportion', y = 'Estimated Proportion') ``` -## 5.3 Simulated three-component data +## 6.3 Simulated three-component data In this simulation, ````markdown @@ -422,30 +429,132 @@ ``` +## 6.4 Real data: PRAD in TCGA dataset -## 5.4 PRAD in TCGA Dataset +### 6.4.1 Preprocessing +For the deconvolution of real data with DeMixT, the user may apply the following preprocessing steps first. Here, we use the PRAD (prostate adenocarcinoma) from TCGA as an example. -```{r read_data_PRAD, warning=FALSE, message=FALSE} -load('res.PRAD.RData'); +1. **(Optional) Remove suspicious samples by hierarchical clustering** + +It is possible that the some of the tumor and normal samples are mislabelled. We use the function ``` +# hc_labels <- detect_suspicious_sample_by_hierarchical_clustering_2comp(count.matrix, normal.id, tumor.id) +``` +to visually inspect the separation of tumor and normal samples based on the hierarchical clustering of their expressions, in which ``count.matrix`` is the raw count matrix; ``normal.id`` and ``tumor.id`` are the vectors of normal and tumor sample ids, respectively. Generally, one cluster contains tumor samples and the other contains normal samples. Any samples that are clustered outside of its own group label, e.g., tumor samples clustered within the normal sample cluster or normal samples in the tumor cluster, are considered as mislabelled samples and filtered out. -```{r PRAD, fig.height = 4, fig.width = 6, fig.align='center', warning=FALSE} -res.PRAD.df = as.data.frame(cbind(res.PRAD$res.GS.PRAD, res.PRAD$res.GS.SP.PRAD)) -res.PRAD.df$V1 <- as.numeric(as.character(res.PRAD.df$V1)) -res.PRAD.df$V2 <- as.numeric(as.character(res.PRAD.df$V2)) -names(res.PRAD.df) = c('Estimated.Proportion', 'Estimated.Proportion.SP') -## Plot -ggplot(res.PRAD.df, aes(x=Estimated.Proportion, y=Estimated.Proportion.SP)) + - geom_point() + - geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", lwd = 0.5) + - xlim(0,1) + ylim(0,1) + - scale_shape_manual(values=c(seq(1:3))) + - labs(x = 'Estimated Proportion', y = 'Estimated Proportion After Spike-in') +```{r hierarchical_clustering, echo=FALSE, out.width='85%', fig.align='center'} +knitr::include_graphics(path = paste0("hierarchical_clustering.png")) ``` -Tumor proportion estimation remains consistence with and without spike-in normal, when its distribution is roughly symmetric around 0.5. +``count.matrix``, ``normal.id `` and ``tumor.id`` are updated by +```{r} +# normal.id <- setdiff(normal.id, names(hc_labels$cluster[hc_labels$cluster == 1])) +# tumor.id <- setdiff(tumor.id, names(hc_labels$cluster[hc_labels$cluster == 2])) +# count.matrix <- count.matrix[, c(normal.id, tumor.id)] +``` + +2. **Select genes with small variation in gene expression across samples** + +In this step, we select a subset of ~9000 genes from the original gene set (>50,000) before running DeMixT with the GS (Gene Selection) method so that our model-based gene selection maintains good statistical properties. We use the function + +``` +# plot_sd(count.matrix, normal.id, tumor.id) +``` +to visualize the distribution of standard deviation of log2 expressions space of normal and tumor samples, +```{r plot_sd, echo=FALSE, out.width='85%', fig.align='center'} +knitr::include_graphics(path = paste0("sd_expresion.png")) +``` +and use the function + +``` +# num_gene_remaining_different_cutoffs <- subset_sd_gene_remaining(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0), cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1) +``` +to find he a range of variance in genes from normal samples ``(cutoff_normal_range)`` and from tumor samples ``(cutoff_tumor_range)`` which results in roughly 9,000 genes. + +We then use the function +``` +# count.matrix <- subset_sd(count.matrix, normal.id, tumor.id, cutoff_normal = cutoff_normal_range, cutoff_tumor = cutoff_tumor_range) +``` +to update the ``count.matrix`` such that it only includes the selected genes. + +3. **Scale normalization** + +We apply a scale normalization at the 75th percentile across all the tumor and normal samples using the function + +``` +# count.matrix <- scale_normalization_75th_percentile(count.matrix) +``` +to adjust the expression levels in the samples. + +**Note**: The user may also use the function +``` +# count.matrix <- DeMixT_preprocessing(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0), cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1) +``` +to perform the preprocessing steps of 2) and 3) in one go. + + +4. **(Optional) Batch effect correction for tumor samples from different batches by ComBat** + +If the tumor samples are from different batches, we recommend the user to inspect the batch effect using the function before running DeMixT +``` +# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2) +``` +This function will generate a PCA plot, in which the samples are colored by the ``labels`` indicating different batches of tumor samples, as well as normal samples. If there is a clear separation between different batches of tumor samples, there is likely batch effects. We use the function +``` +# count.matrix.tumor <- batch_correction(count.matrix.tumor, labels) +``` +to reduce this effect, where ``count.matrix.tumor`` is the raw count matrix only from tumor samples and ``labels`` is the factor of tumor batches. The user may choose other batch effect correction methods at this step. + +The user may inspect the batch effect again after the above step using the mentioned function +``` +# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2) +``` + +### 6.4.2 Deconvolution using DeMixT +To optimize the DeMixT parameter setting for the input data, we recommend testing an array of combinations of the number of spike-ins and the number of selected genes. + +The number of CPU cores used by the DeMixT function for parallel computing is specified by the parameter ``nthread``. By default (such as in the code block below), ``nthread = total_number_of_cores_on_the_machine - 1``. The user can change ``nthread`` to a number between 0 and the total number of cores on the machine. + +``` +## Because of the random initial values and the spike-in samples within the DeMixT function, +## we would like to remind the user to set seeds to keep track. This seed setting will be +## internalized in DeMixT in the next update. + +# set.seed(1234) + +# data.Y = SummarizedExperiment(assays = list(counts = count.matrix[, tumor.id])) +# data.N1 <- SummarizedExperiment(assays = list(counts = count.matrix[, normal.id])) + +## In practice, we set the maximum number of spike-in as min(n/3, 200), +## where n is the number of samples. +# nspikesin_list = c(0, 50, 100, 150) +## One may set a wider range than provided below for studies other than TCGA. +# ngene.selected_list = c(500, 1000, 1500, 2500) + +#for(nspikesin in nspikesin_list){ +# for(ngene.selected in ngene.selected_list){ +# name = paste("PRAD_demixt_GS_res_nspikesin", nspikesin, "ngene.selected", +# ngene.selected, sep = "_"); +# name = paste(name, ".RData", sep = ""); +# res = DeMixT(data.Y = data.Y, +# data.N1 = data.N1, +# ngene.selected.for.pi = ngene.selected, +# ngene.Profile.selected = ngene.selected, +# filter.sd = 0.7, # same upper bound of gene expression standard deviation +# # for normal reference. i.e., preprocessed_data$sd_cutoff_normal[2] +# gene.selection.method = "GS", +# nspikein = nspikesin) +# save(res, file = name) +# } +#} +``` +We suggest selecting the optimal parameter combination that produces the largest average correlation of estimated tumor propotions with those produced by other combinations. The location of the mode of the Pi estimation may also be considered. The mode located too high or too low may suggest biased estimation. + +Instead of selecting using the parameter combination with the highest correlation, one can also select the parameter combination that produces estimated tumor proportions that are most biologically meaningful. + +A comprehensive tutorial of using DeMixT for real data deconvolution can be found at [https://wwylab.github.io/DeMixT/tutorial.html](https://wwylab.github.io/DeMixT/tutorial.html). -## 5.5 Deconvolution using normal reference samples from GTEx +## 6.5 Deconvolution using normal reference samples from GTEx We conducted experiments across cancer types to evaluate the impact of technical artifacts such as batch effects to the proportion estimation when using a different cohort. We applied GTEx expression data from normal prostate samples as the normal reference to deconvolute the TCGA prostate cancer samples, where normal tissues were selected without significant pathology. The estimated proportions showed a reasonable correlation (Spearman correlation coefficient = 0.65) with those generated using TCGA normal prostate samples as the normal reference. @@ -469,8 +578,10 @@ knitr::include_graphics(path = paste0("GTEx_normal.png")) ``` +# 7. Reference +[1]. Wang, Z. et al. Transcriptome Deconvolution of Heterogeneous Tumor Samples with Immune Infiltration. iScience 9, 451–460 (2018). -# 6. Session Info +# 8. Session Info ```{r} sessionInfo(package = "DeMixT") diff -Nru r-bioc-demixt-1.12.0+dfsg/man/batch_correction.Rd r-bioc-demixt-1.14.0+dfsg/man/batch_correction.Rd --- r-bioc-demixt-1.12.0+dfsg/man/batch_correction.Rd 1970-01-01 00:00:00.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/batch_correction.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/DeMixT_Preprocessing.R +\name{batch_correction} +\alias{batch_correction} +\title{batch_correction} +\usage{ +batch_correction(count.matrix, batch_labels) +} +\arguments{ +\item{count.matrix}{A matrix of raw expression count with \eqn{G} by \eqn{(My)}, where \eqn{G} is the number +of genes, \eqn{My} is the number of mixed tumor samples. Row names are genes + column names are tumor sample ids.} + +\item{batch_labels}{Factor of tumor samples from different batches} +} +\value{ +Batch effect corrected count matrix for tumor samples +} +\description{ +Batch effect correction for multiple batches of tumor samples using ComBat +} diff -Nru r-bioc-demixt-1.12.0+dfsg/man/DeMixT_DE.Rd r-bioc-demixt-1.14.0+dfsg/man/DeMixT_DE.Rd --- r-bioc-demixt-1.12.0+dfsg/man/DeMixT_DE.Rd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/DeMixT_DE.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -13,7 +13,7 @@ if.filter = TRUE, filter.sd = 0.5, ngene.selected.for.pi = NA, - nspikein = NULL, + nspikein = min(200, ceiling(ncol(data.Y) * 0.3)), mean.diff.in.CM = 0.25, tol = 10^(-5), pi01 = NULL, @@ -22,25 +22,24 @@ ) } \arguments{ -\item{data.Y}{A SummarizedExperiment object of expression data from mixed -tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the number -of genes and \eqn{My} is the number of mixed samples. Samples with the same -tissue type should be placed together in columns.} - -\item{data.N1}{A SummarizedExperiment object of expression data -from reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix -where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +\item{data.Y}{A SummarizedExperiment object of expression data from mixed +tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +number of genes and \eqn{My} is the number of mixed samples. Samples with +the same tissue type should be placed together in columns.} + +\item{data.N1}{A SummarizedExperiment object of expression data from +reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +where \eqn{G} is the number of genes and \eqn{M1} is the number of samples for component 1.} \item{data.N2}{A SummarizedExperiment object of expression data from -additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where \eqn{G} is the number of genes and \eqn{M2} is the number of samples for component 2. Component 2 is needed only for running a three-component model.} -\item{niter}{The maximum number of iterations used in the algorithm of -iterated conditional modes. A larger value better guarantees -the convergence in estimation but increases the running time. The default is -10.} +\item{niter}{The maximum number of iterations used in the algorithm of +iterated conditional modes. A larger value better guarantees the convergence +in estimation but increases the running time. The default is 10.} \item{nbin}{The number of bins used in numerical integration for computing complete likelihood. A larger value increases accuracy in estimation but @@ -50,7 +49,7 @@ \item{if.filter}{The logical flag indicating whether a predetermined filter rule is used to select genes for proportion estimation. The default is TRUE.} -\item{filter.sd}{The cut-off for the standard deviation of lognormal +\item{filter.sd}{The cut-off for the standard deviation of lognormal distribution. Genes whose log transferred standard deviation smaller than the cut-off will be selected into the model. The default is 0.5.} @@ -58,53 +57,52 @@ proportion estimation. The difference between the expression levels from mixed tumor samples and the known component(s) are evaluated, and the most differential expressed genes are selected, which is called DE. It is enabled -when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where -\eqn{G} is the number of genes. Users can also try using more genes, -ranging from \eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} +when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where \eqn{G} +is the number of genes. Users can also try using more genes, ranging from +\eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} \item{nspikein}{The number of spikes in normal reference used for proportion -estimation. The default value is \eqn{ min(200, 0.3*My)}, where -\eqn{My} the number of mixed samples. If it is set to 0, proportion -estimation is performed without any spike in normal reference.} - -\item{mean.diff.in.CM}{Threshold of expression difference for selecting genes -in the component merging strategy. We merge three-component to two-component -by selecting genes with similar expressions for the two known components. -Genes with the mean differences less than the threshold will be selected for -component merging. It is used in the three-component setting, and is enabled -when if.filter = TRUE. The default is 0.25.} +estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +number of mixed samples. If it is set to 0, proportion estimation is +performed without any spike in normal reference.} + +\item{mean.diff.in.CM}{Threshold of expression difference for selecting +genes in the component merging strategy. We merge three-component to +two-component by selecting genes with similar expressions for the two known +components. Genes with the mean differences less than the threshold will be +selected for component merging. It is used in the three-component setting, +and is enabled when if.filter = TRUE. The default is 0.25.} \item{tol}{The convergence criterion. The default is 10^(-5).} -\item{pi01}{Initialized proportion for first kown component. The default is +\item{pi01}{Initialized proportion for first kown component. The default is \eqn{Null} and pi01 will be generated randomly from uniform distribution.} -\item{pi02}{Initialized proportion for second kown component. pi02 is needed -only for running a three-component model. The default is \eqn{Null} and pi02 +\item{pi02}{Initialized proportion for second kown component. pi02 is needed +only for running a three-component model. The default is \eqn{Null} and pi02 will be generated randomly from uniform distribution.} \item{nthread}{The number of threads used for deconvolution when OpenMP is -available in the system. The default is the number of whole threads minus one. -In our no-OpenMP version, it is set to 1.} +available in the system. The default is the number of whole threads minus +one. In our no-OpenMP version, it is set to 1.} } \value{ -\item{pi}{A matrix of estimated proportion. First row and second row -corresponds to the proportion estimate for the known components and unkown -component respectively for two or three component settings, and each column -corresponds to one sample.} -\item{pi.iter}{Estimated proportions in each iteration. It is a \eqn{niter -*Ny*p} array, where \eqn{p} is the number of components. This is -enabled only when output.more.info = TRUE.} -\item{gene.name}{The names of genes used in estimating the proportions. -If no gene names are provided in the original data set, the genes will -be automatically indexed.} +\item{pi}{A matrix of estimated proportion. First row and second row +corresponds to the proportion estimate for the known components and unkown +component respectively for two or three component settings, and each column +corresponds to one sample.} \item{pi.iter}{Estimated proportions in each +iteration. It is a \eqn{niter *Ny*p} array, where \eqn{p} is the number of +components. This is enabled only when output.more.info = TRUE.} +\item{gene.name}{The names of genes used in estimating the proportions. If +no gene names are provided in the original data set, the genes will be +automatically indexed.} } \description{ -This function is designed to estimate the deconvolved -expressions of individual mixed tumor samples for unknown component -for each gene. +This function is designed to estimate the deconvolved expressions of +individual mixed tumor samples for unknown component for each gene. } \examples{ + # Example 1: estimate proportions for simulated two-component data # with spike-in normal reference data(test.data.2comp) @@ -133,10 +131,12 @@ # data.N2 = test.data.3comp$data.N2, # if.filter = TRUE) + } \references{ -Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of -Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: 451-460. +Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of +Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: +451-460. } \seealso{ http://bioinformatics.mdanderson.org/main/DeMixT diff -Nru r-bioc-demixt-1.12.0+dfsg/man/DeMixT_GS.Rd r-bioc-demixt-1.14.0+dfsg/man/DeMixT_GS.Rd --- r-bioc-demixt-1.12.0+dfsg/man/DeMixT_GS.Rd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/DeMixT_GS.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -2,8 +2,8 @@ % Please edit documentation in R/DeMixT_GS.R \name{DeMixT_GS} \alias{DeMixT_GS} -\title{Estimates the proportions of mixed samples for each mixing component -using profile likelihood gene selection} +\title{Estimates the proportions of mixed samples for each mixing component using +profile likelihood gene selection} \usage{ DeMixT_GS( data.Y, @@ -16,7 +16,7 @@ ngene.Profile.selected = NA, ngene.selected.for.pi = NA, mean.diff.in.CM = 0.25, - nspikein = NULL, + nspikein = min(200, ceiling(ncol(data.Y) * 0.3)), tol = 10^(-5), pi01 = NULL, pi02 = NULL, @@ -24,25 +24,24 @@ ) } \arguments{ -\item{data.Y}{A SummarizedExperiment object of expression data from mixed -tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the number -of genes and \eqn{My} is the number of mixed samples. Samples with the same -tissue type should be placed together in columns.} - -\item{data.N1}{A SummarizedExperiment object of expression data -from reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix -where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +\item{data.Y}{A SummarizedExperiment object of expression data from mixed +tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +number of genes and \eqn{My} is the number of mixed samples. Samples with +the same tissue type should be placed together in columns.} + +\item{data.N1}{A SummarizedExperiment object of expression data from +reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +where \eqn{G} is the number of genes and \eqn{M1} is the number of samples for component 1.} \item{data.N2}{A SummarizedExperiment object of expression data from -additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where \eqn{G} is the number of genes and \eqn{M2} is the number of samples for component 2. Component 2 is needed only for running a three-component model.} -\item{niter}{The maximum number of iterations used in the algorithm of -iterated conditional modes. A larger value better guarantees -the convergence in estimation but increases the running time. The default is -10.} +\item{niter}{The maximum number of iterations used in the algorithm of +iterated conditional modes. A larger value better guarantees the convergence +in estimation but increases the running time. The default is 10.} \item{nbin}{The number of bins used in numerical integration for computing complete likelihood. A larger value increases accuracy in estimation but @@ -52,79 +51,79 @@ \item{if.filter}{The logical flag indicating whether a predetermined filter rule is used to select genes for proportion estimation. The default is TRUE.} -\item{filter.sd}{The cut-off for the standard deviation of lognormal +\item{filter.sd}{The cut-off for the standard deviation of lognormal distribution. Genes whose log transferred standard deviation smaller than the cut-off will be selected into the model. The default is TRUE.} \item{ngene.Profile.selected}{The number of genes used for proportion -estimation ranked by profile likelihood. The default is +estimation ranked by profile likelihood. The default is \eqn{min(1500,0.1*G)}, where \eqn{G} is the number of genes.} \item{ngene.selected.for.pi}{The percentage or the number of genes used for proportion estimation. The difference between the expression levels from mixed tumor samples and the known component(s) are evaluated, and the most differential expressed genes are selected, which is called DE. It is enabled -when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where -\eqn{G} is the number of genes. Users can also try using more genes, -ranging from \eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} +when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where \eqn{G} +is the number of genes. Users can also try using more genes, ranging from +\eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} \item{mean.diff.in.CM}{Threshold of expression difference for selecting -genes in the component merging strategy. We merge three-component to +genes in the component merging strategy. We merge three-component to two-component by selecting genes with similar expressions for the two known -components. Genes with the mean differences less than the threshold will -be selected for component merging. It is used in the three-component -setting, and is enabled when if.filter = TRUE. The default is 0.25.} +components. Genes with the mean differences less than the threshold will be +selected for component merging. It is used in the three-component setting, +and is enabled when if.filter = TRUE. The default is 0.25.} \item{nspikein}{The number of spikes in normal reference used for proportion -estimation. The default value is \eqn{ min(200, 0.3*My)}, where -\eqn{My} the number of mixed samples. If it is set to 0, proportion -estimation is performed without any spike in normal reference.} +estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +number of mixed samples. If it is set to 0, proportion estimation is +performed without any spike in normal reference.} \item{tol}{The convergence criterion. The default is 10^(-5).} -\item{pi01}{Initialized proportion for first kown component. The default is +\item{pi01}{Initialized proportion for first kown component. The default is \eqn{Null} and pi01 will be generated randomly from uniform distribution.} -\item{pi02}{Initialized proportion for second kown component. pi02 is needed -only for running a three-component model. The default is \eqn{Null} and pi02 +\item{pi02}{Initialized proportion for second kown component. pi02 is needed +only for running a three-component model. The default is \eqn{Null} and pi02 will be generated randomly from uniform distribution.} -\item{nthread}{The number of threads used for deconvolution when OpenMP -is available in the system. The default is the number of whole threads -minus one. In our no-OpenMP version, it is set to 1.} +\item{nthread}{The number of threads used for deconvolution when OpenMP is +available in the system. The default is the number of whole threads minus +one. In our no-OpenMP version, it is set to 1.} } \value{ -\item{pi}{A matrix of estimated proportion. First row and second row -corresponds to the proportion estimate for the known components and unkown -component respectively for two or three component settings, and each column -corresponds to one sample.} -\item{pi.iter}{Estimated proportions in each iteration. It is a \eqn{niter -*My*p} array, where \eqn{p} is the number of components. This is -enabled only when output.more.info = TRUE.} -\item{gene.name}{The names of genes used in estimating the proportions. -If no gene names are provided in the original data set, the genes will -be automatically indexed.} +\item{pi}{A matrix of estimated proportion. First row and second row +corresponds to the proportion estimate for the known components and unkown +component respectively for two or three component settings, and each column +corresponds to one sample.} \item{pi.iter}{Estimated proportions in each +iteration. It is a \eqn{niter *My*p} array, where \eqn{p} is the number of +components. This is enabled only when output.more.info = TRUE.} +\item{gene.name}{The names of genes used in estimating the proportions. If +no gene names are provided in the original data set, the genes will be +automatically indexed.} } \description{ -This function is designed to estimate the proportions of all -mixed samples for each mixing component with a new proposed profile likelihood -based gene selection, which can select most identifiable genes as reference -gene sets to achieve better model fitting quality. We first calculated the -Hessian matrix of the parameter spaces and then derive the confidence interval -of the profile likelihood of each gene. We then utilized the length of -confidence interval as a metric to rank the identifiability of genes. As -a result, the proposed gene selection approach can improve the tumor-specific +This function is designed to estimate the proportions of all mixed samples +for each mixing component with a new proposed profile likelihood based gene +selection, which can select most identifiable genes as reference gene sets +to achieve better model fitting quality. We first calculated the Hessian +matrix of the parameter spaces and then derive the confidence interval of +the profile likelihood of each gene. We then utilized the length of +confidence interval as a metric to rank the identifiability of genes. As a +result, the proposed gene selection approach can improve the tumor-specific transcripts proportion estimation. } \note{ A Hessian matrix file will be created in the working directory and the corresponding Hessian matrix with an encoded name from the mixed tumor -sample data will be saved under this file. If a user reruns this function -with the same dataset, this Hessian matrix will be loaded to in place of +sample data will be saved under this file. If a user reruns this function +with the same dataset, this Hessian matrix will be loaded to in place of running the profile likelihood method and reduce running time. } \examples{ + # Example 1: estimate proportions for simulated two-component data # with spike-in normal reference data(test.data.2comp) @@ -146,11 +145,11 @@ # tol = 10^(-5)) + } \references{ -Gene Selection and Identifiability Analysis of RNA -Deconvolution Models using Profile Likelihood. Manuscript in -preparation. +Gene Selection and Identifiability Analysis of RNA Deconvolution +Models using Profile Likelihood. Manuscript in preparation. } \seealso{ http://bioinformatics.mdanderson.org/main/DeMixT diff -Nru r-bioc-demixt-1.12.0+dfsg/man/DeMixT_preprocessing.Rd r-bioc-demixt-1.14.0+dfsg/man/DeMixT_preprocessing.Rd --- r-bioc-demixt-1.12.0+dfsg/man/DeMixT_preprocessing.Rd 1970-01-01 00:00:00.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/DeMixT_preprocessing.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -0,0 +1,39 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/DeMixT_Preprocessing.R +\name{DeMixT_preprocessing} +\alias{DeMixT_preprocessing} +\title{DeMixT_preprocessing} +\usage{ +DeMixT_preprocessing( + count.matrix, + normal.id, + tumor.id, + cutoff_normal_range = c(0.1, 1), + cutoff_tumor_range = c(0, 2.5), + cutoff_step = 0.2 +) +} +\arguments{ +\item{count.matrix}{A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes + column names are sample ids.} + +\item{normal.id}{A vector of normal sample ids} + +\item{tumor.id}{A vector of tumor sample ids} + +\item{cutoff_normal_range}{A vector of two numeric values, indicating the lower and upper bounds of standard deviation of +log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6)} + +\item{cutoff_tumor_range}{A vector of two numeric values, indicating the lower and upper bounds to search standard deviation of +log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6)} + +\item{cutoff_step}{A scatter value indicating the step size of changing cutoff_normal_range and cutoff_tumor_range to find a +suitable subset of count matrix for downstream analysis} +} +\value{ +processed count matrix +} +\description{ +DeMixT preprocessing in one go +} diff -Nru r-bioc-demixt-1.12.0+dfsg/man/DeMixT.Rd r-bioc-demixt-1.14.0+dfsg/man/DeMixT.Rd --- r-bioc-demixt-1.12.0+dfsg/man/DeMixT.Rd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/DeMixT.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -15,7 +15,7 @@ filter.sd = 0.5, ngene.selected.for.pi = NA, mean.diff.in.CM = 0.25, - nspikein = NULL, + nspikein = min(200, ceiling(ncol(data.Y) * 0.3)), gene.selection.method = "GS", ngene.Profile.selected = NA, tol = 10^(-5), @@ -26,25 +26,24 @@ ) } \arguments{ -\item{data.Y}{A SummarizedExperiment object of expression data from mixed -tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the number -of genes and \eqn{My} is the number of mixed samples. Samples with the same -tissue type should be placed together in columns.} - -\item{data.N1}{A SummarizedExperiment object of expression data -from reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix -where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +\item{data.Y}{A SummarizedExperiment object of expression data from mixed +tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +number of genes and \eqn{My} is the number of mixed samples. Samples with +the same tissue type should be placed together in columns.} + +\item{data.N1}{A SummarizedExperiment object of expression data from +reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +where \eqn{G} is the number of genes and \eqn{M1} is the number of samples for component 1.} \item{data.N2}{A SummarizedExperiment object of expression data from -additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where \eqn{G} is the number of genes and \eqn{M2} is the number of samples for component 2. Component 2 is needed only for running a three-component model.} -\item{niter}{The maximum number of iterations used in the algorithm of -iterated conditional modes. A larger value better guarantees -the convergence in estimation but increases the running time. The default is -10.} +\item{niter}{The maximum number of iterations used in the algorithm of +iterated conditional modes. A larger value better guarantees the convergence +in estimation but increases the running time. The default is 10.} \item{nbin}{The number of bins used in numerical integration for computing complete likelihood. A larger value increases accuracy in estimation but @@ -54,7 +53,7 @@ \item{if.filter}{The logical flag indicating whether a predetermined filter rule is used to select genes for proportion estimation. The default is TRUE.} -\item{filter.sd}{The cut-off for the standard deviation of lognormal +\item{filter.sd}{The cut-off for the standard deviation of lognormal distribution. Genes whose log transferred standard deviation smaller than the cut-off will be selected into the model. The default is 0.5.} @@ -62,47 +61,47 @@ proportion estimation. The difference between the expression levels from mixed tumor samples and the known component(s) are evaluated, and the most differential expressed genes are selected, which is called DE. It is enabled -when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where -\eqn{G} is the number of genes. Users can also try using more genes, -ranging from \eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} - -\item{mean.diff.in.CM}{Threshold of expression difference for selecting genes -in the component merging strategy. We merge three-component to two-component -by selecting genes with similar expressions for the two known components. -Genes with the mean differences less than the threshold will be selected for -component merging. It is used in the three-component setting, and is enabled -when if.filter = TRUE. The default is 0.25.} +when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where \eqn{G} +is the number of genes. Users can also try using more genes, ranging from +\eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome.} + +\item{mean.diff.in.CM}{Threshold of expression difference for selecting +genes in the component merging strategy. We merge three-component to +two-component by selecting genes with similar expressions for the two known +components. Genes with the mean differences less than the threshold will be +selected for component merging. It is used in the three-component setting, +and is enabled when if.filter = TRUE. The default is 0.25.} \item{nspikein}{The number of spikes in normal reference used for proportion -estimation. The default value is \eqn{ min(200, 0.3*My)}, where -\eqn{My} the number of mixed samples. If it is set to 0, proportion -estimation is performed without any spike in normal reference.} - -\item{gene.selection.method}{The method of gene selection used for proportion -estimation. The default method is 'GS', which applies a profile likelihood based -method for gene selection. If it is set to 'DE', the most differential expressed -genes are selected.} +estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +number of mixed samples. If it is set to 0, proportion estimation is +performed without any spike in normal reference.} + +\item{gene.selection.method}{The method of gene selection used for +proportion estimation. The default method is 'GS', which applies a profile +likelihood based method for gene selection. If it is set to 'DE', the most +differential expressed genes are selected.} \item{ngene.Profile.selected}{The number of genes used for proportion -estimation ranked by profile likelihood. The default is -\eqn{min(1500,0.1*G)}, where \eqn{G} is the number of genes. -This is enabled only when gene.selection.method is set to 'GS'.} +estimation ranked by profile likelihood. The default is +\eqn{min(1500,0.1*G)}, where \eqn{G} is the number of genes. This is +enabled only when gene.selection.method is set to 'GS'.} \item{tol}{The convergence criterion. The default is 10^(-5).} \item{output.more.info}{The logical flag indicating whether to show the estimated proportions in each iteration in the output.} -\item{pi01}{Initialized proportion for first kown component. The default is +\item{pi01}{Initialized proportion for first kown component. The default is \eqn{Null} and pi01 will be generated randomly from uniform distribution.} -\item{pi02}{Initialized proportion for second kown component. pi02 is needed -only for running a three-component model. The default is \eqn{Null} and pi02 +\item{pi02}{Initialized proportion for second kown component. pi02 is needed +only for running a three-component model. The default is \eqn{Null} and pi02 will be generated randomly from uniform distribution.} \item{nthread}{The number of threads used for deconvolution when OpenMP is -available in the system. The default is the number of whole threads minus one. -In our no-OpenMP version, it is set to 1.} +available in the system. The default is the number of whole threads minus +one. In our no-OpenMP version, it is set to 1.} } \value{ \item{pi}{A matrix of estimated proportion. First row and second row @@ -131,6 +130,30 @@ \item{gene.name}{The names of genes used in estimating the proportions. If no gene names are provided in the original data set, the genes will be automatically indexed.} + +\item{pi}{A matrix of estimated proportion. First row and second row +corresponds to the proportion estimate for the known components and unkown +component respectively for two or three component settings, and each column +corresponds to one sample.} \item{pi.iter}{Estimated proportions in each +iteration. It is a \eqn{niter* My*p} array, where \eqn{p} is the number of +components. This is enabled only when output.more.info = TRUE.} +\item{ExprT}{A matrix of deconvolved expression profiles corresponding to +T-component in mixed samples for a given subset of genes. Each row +corresponds to one gene and each column corresponds to one sample.} +\item{ExprN1}{A matrix of deconvolved expression profiles corresponding to +N1-component in mixed samples for a given subset of genes. Each row +corresponds to one gene and each column corresponds to one sample.} +\item{ExprN2}{A matrix of deconvolved expression profiles corresponding to +N2-component in mixed samples for a given subset of genes in a +three-component setting. Each row corresponds to one gene and each column +corresponds to one sample.} \item{Mu}{A matrix of estimated \eqn{Mu} of +log2-normal distribution for both known (\eqn{MuN1, MuN2}) and unknown +component (\eqn{MuT}). Each row corresponds to one gene.} +\item{Sigma}{Estimated \eqn{Sigma} of log2-normal distribution for both +known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each +row corresponds to one gene.} \item{gene.name}{The names of genes used in +estimating the proportions. If no gene names are provided in the original +data set, the genes will be automatically indexed.} } \description{ DeMixT is a software that performs deconvolution on transcriptome @@ -175,15 +198,63 @@ # example <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE) # example.se <- SummarizedExperiment(assays = list(counts = example)) + +# Example 1: simulated two-component data by using GS(gene selection method) + data(test.data.2comp) +# res <- DeMixT(data.Y = test.data.2comp$data.Y, +# data.N1 = test.data.2comp$data.N1, +# data.N2 = NULL, nspikein = 50, +# gene.selection.method = 'GS', +# niter = 10, nbin = 50, if.filter = TRUE, +# ngene.selected.for.pi = 150, +# mean.diff.in.CM = 0.25, tol = 10^(-5)) +# res$pi +# head(res$ExprT, 3) +# head(res$ExprN1, 3) +# head(res$Mu, 3) +# head(res$Sigma, 3) +# +# Example 2: simulated two-component data by using DE(gene selection method) +# data(test.data.2comp) +# res <- DeMixT(data.Y = test.data.2comp$data.Y, +# data.N1 = test.data.2comp$data.N1, +# data.N2 = NULL, nspikein = 50, g +# ene.selection.method = 'DE', +# niter = 10, nbin = 50, if.filter = TRUE, +# ngene.selected.for.pi = 150, +# mean.diff.in.CM = 0.25, tol = 10^(-5)) +# +# Example 3: three-component mixed cell line data applying +# component merging strategy +# data(test.data.3comp) +# res <- DeMixT(data.Y = test.data.3comp$data.Y, +# data.N1 = test.data.3comp$data.N1, +# data.N2 = test.data.3comp$data.N2, +# if.filter = TRUE) +# +# Example: convert a matrix into the SummarizedExperiment format +# library(SummarizedExperiment) +# example <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE) +# example.se <- SummarizedExperiment(assays = list(counts = example)) + + } \references{ Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: 451-460. + +Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of +Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: +451-460. } \seealso{ http://bioinformatics.mdanderson.org/main/DeMixT + +http://bioinformatics.mdanderson.org/main/DeMixT } \author{ Zeya Wang, Wenyi Wang + +Zeya Wang, Wenyi Wang } \keyword{DeMixT} diff -Nru r-bioc-demixt-1.12.0+dfsg/man/DeMixT_S2.Rd r-bioc-demixt-1.14.0+dfsg/man/DeMixT_S2.Rd --- r-bioc-demixt-1.12.0+dfsg/man/DeMixT_S2.Rd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/DeMixT_S2.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -14,58 +14,58 @@ ) } \arguments{ -\item{data.Y}{A SummarizedExperiment object of expression data from mixed -tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the number -of genes and \eqn{My} is the number of mixed samples. Samples with the same -tissue type should be placed together in columns.} - -\item{data.N1}{A SummarizedExperiment object of expression data -from reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix -where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +\item{data.Y}{A SummarizedExperiment object of expression data from mixed +tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +number of genes and \eqn{My} is the number of mixed samples. Samples with +the same tissue type should be placed together in columns.} + +\item{data.N1}{A SummarizedExperiment object of expression data from +reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +where \eqn{G} is the number of genes and \eqn{M1} is the number of samples for component 1.} \item{data.N2}{A SummarizedExperiment object of expression data from -additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where \eqn{G} is the number of genes and \eqn{M2} is the number of samples for component 2. Component 2 is needed only for running a three-component model.} -\item{givenpi}{A vector of proportions for all mixed tumor samples. -In two-component analysis, it gives the proportions of the unknown reference -component, and in three-component analysis, it gives the proportions for the +\item{givenpi}{A vector of proportions for all mixed tumor samples. In +two-component analysis, it gives the proportions of the unknown reference +component, and in three-component analysis, it gives the proportions for the two known components.} -\item{nbin}{Number of bins used in numerical integration for computing +\item{nbin}{Number of bins used in numerical integration for computing complete likelihood. A larger value increases accuracy in estimation but -increases the running time, especially in a three-component deconvolution +increases the running time, especially in a three-component deconvolution problem. The default is 50.} \item{nthread}{The number of threads used for deconvolution when OpenMP is -available in the system. The default is the number of whole threads minus one. -In our no-OpenMP version, it is set to 1.} +available in the system. The default is the number of whole threads minus +one. In our no-OpenMP version, it is set to 1.} } \value{ -\item{decovExprT}{A matrix of deconvolved expression profiles corresponding to -T-component in mixed samples for a given subset of genes. Each row -corresponds to one gene and each column corresponds to one sample.} -\item{decovExprN1}{A matrix of deconvolved expression profiles corresponding to -N1-component in mixed samples for a given subset of genes. Each row -corresponds to one gene and each column corresponds to one sample.} -\item{decovExprN2}{A matrix of deconvolved expression profiles corresponding to -N2-component in mixed samples for a given subset of genes in a -three-component setting. Each row corresponds to one gene and each -column corresponds to one sample.} -\item{decovMu}{A matrix of estimated \eqn{Mu} of log2-normal distribution for -both known (\eqn{MuN1, MuN2}) and unknown component (\eqn{MuT}). Each row -corresponds to one gene.} -\item{decovSigma}{Estimated \eqn{Sigma} of log2-normal distribution for both -known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each +\item{decovExprT}{A matrix of deconvolved expression profiles +corresponding to T-component in mixed samples for a given subset of genes. +Each row corresponds to one gene and each column corresponds to one sample.} +\item{decovExprN1}{A matrix of deconvolved expression profiles corresponding +to N1-component in mixed samples for a given subset of genes. Each row +corresponds to one gene and each column corresponds to one sample.} +\item{decovExprN2}{A matrix of deconvolved expression profiles corresponding +to N2-component in mixed samples for a given subset of genes in a +three-component setting. Each row corresponds to one gene and each column +corresponds to one sample.} \item{decovMu}{A matrix of estimated \eqn{Mu} of +log2-normal distribution for both known (\eqn{MuN1, MuN2}) and unknown +component (\eqn{MuT}). Each row corresponds to one gene.} +\item{decovSigma}{Estimated \eqn{Sigma} of log2-normal distribution for both +known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each row corresponds to one gene.} } \description{ -This function is designed to estimate the deconvolved expressions -of individual mixed tumor samples for unknown component for each gene. +This function is designed to estimate the deconvolved expressions of +individual mixed tumor samples for unknown component for each gene. } \examples{ + # Example 1: two-component deconvolution given proportions data(test.data.2comp) givenpi <- c(t(as.matrix(test.data.2comp$pi[-2,]))) @@ -84,10 +84,12 @@ # givenpi = givenpi, # nbin = 50) + } \references{ -Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of -Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: 451-460. +Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of +Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: +451-460. } \seealso{ http://bioinformatics.mdanderson.org/main/DeMixT diff -Nru r-bioc-demixt-1.12.0+dfsg/man/detect_suspicious_sample_by_hierarchical_clustering_2comp.Rd r-bioc-demixt-1.14.0+dfsg/man/detect_suspicious_sample_by_hierarchical_clustering_2comp.Rd --- r-bioc-demixt-1.12.0+dfsg/man/detect_suspicious_sample_by_hierarchical_clustering_2comp.Rd 1970-01-01 00:00:00.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/detect_suspicious_sample_by_hierarchical_clustering_2comp.Rd 2022-11-01 19:11:00.000000000 +0000 @@ -0,0 +1,56 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/DeMixT_Preprocessing.R +\name{detect_suspicious_sample_by_hierarchical_clustering_2comp} +\alias{detect_suspicious_sample_by_hierarchical_clustering_2comp} +\alias{plot_sd} +\alias{plot_dim} +\title{detect_suspicious_sample_by_hierarchical_clustering_2comp} +\usage{ +detect_suspicious_sample_by_hierarchical_clustering_2comp( + count.matrix, + normal.id, + tumor.id +) + +plot_sd(count.matrix, normal.id, tumor.id) + +plot_dim( + count.matrix, + labels, + legend.position = "bottomleft", + legend.cex = 1.2 +) +} +\arguments{ +\item{count.matrix}{A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes + column names are sample ids.} + +\item{normal.id}{A vector of normal sample ids} + +\item{tumor.id}{A vector of tumor sample ids} + +\item{legend.position}{Position of legend in the plot. Default is bottomleft.} + +\item{legend.cex}{Character expansion factor relative to current par("cex"). Default = 1.2} +} +\value{ +list object + + + + +} +\description{ +Detect suspicious samples by a hierarchical clustering + +This function is designed to evaluate the separation of tumor samples and normal samples in a PCA space. +If some normal samples appear in the tumor-sample dominated cluster, these normal samples are likely to +be tumor samples and they are supposed to be filtered out before downstream analysis. But for those tumor +samples appearing in the normal-sample dominated cluster, we do not remove them since they may be the ones +with low tumor purity. + +Plot the standard deviation of log2 raw expression + +Plot the distribution of tumor and normal samples in a 2D PCA space based on their expressions +} diff -Nru r-bioc-demixt-1.12.0+dfsg/man/Optimum_KernelC.rd r-bioc-demixt-1.14.0+dfsg/man/Optimum_KernelC.rd --- r-bioc-demixt-1.12.0+dfsg/man/Optimum_KernelC.rd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/man/Optimum_KernelC.rd 2022-11-01 19:11:00.000000000 +0000 @@ -23,56 +23,56 @@ } \arguments{ \item{inputdata}{A matrix of expression data (e.g gene expressions) from -reference (e.g. normal) and mixed samples (e.g. mixed tumor samples). It is a -\eqn{G*M} matrix where \eqn{G} is the number of genes and \eqn{M} is the -number of samples including reference and mixed samples. Samples with the +reference (e.g. normal) and mixed samples (e.g. mixed tumor samples). It is +a \eqn{G*M} matrix where \eqn{G} is the number of genes and \eqn{M} is the +number of samples including reference and mixed samples. Samples with the same tissue type should be placed together in columns (e.g. cbind(normal amples, mixed tumor samples).} \item{groupid}{A vector of indicators to denote if the corresponding samples are reference samples or mixed tumor samples. DeMixT is able to deconvolve mixed tumor samples with at most three components. We use 1 and 2 to denote -the samples referencing the first and the second known component in mixed +the samples referencing the first and the second known component in mixed tumor samples. We use 3 to indicate mixed tumor samples prepared to be -deconvolved. For example, in two-component deconvolution, we have -c(1,1,...,3,3) and in three-component deconvolution, we have +deconvolved. For example, in two-component deconvolution, we have +c(1,1,...,3,3) and in three-component deconvolution, we have c(1,1,...,2,2,...,3,3).} \item{nspikein}{The number of spikes in normal reference used for proportion -estimation. The default value is \eqn{ min(200, 0.3*My)}, where -\eqn{My} the number of mixed tumor samples. If it is set to 0, proportion -estimation is performed without any spike in normal reference.} - -\item{setting.pi}{If it is set to 0, then deconvolution is performed without -any given proportions; if set to 1, deconvolution with given proportions -for the first and the second known component is run; if set to 2, -deconvolution is run with given tumor proportions. This option helps to -perform deconvolution in different settings. In estimation of -component-specific proportions, we use a subset of genes ; so when -it is required to deconvolve another subset of genes, we just easily plug -back our estimated proportions by setting this option to 1. In our two-step -estimation strategy in a three-component setting, this option is set to 2 to -implement the second step.} +estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +number of mixed tumor samples. If it is set to 0, proportion estimation is +performed without any spike in normal reference.} + +\item{setting.pi}{If it is set to 0, then deconvolution is performed without +any given proportions; if set to 1, deconvolution with given proportions for +the first and the second known component is run; if set to 2, deconvolution +is run with given tumor proportions. This option helps to perform +deconvolution in different settings. In estimation of component-specific +proportions, we use a subset of genes ; so when it is required to deconvolve +another subset of genes, we just easily plug back our estimated proportions +by setting this option to 1. In our two-step estimation strategy in a +three-component setting, this option is set to 2 to implement the second +step.} \item{givenpi}{\eqn{ST}-Vector of proportions. Given the number of mixed tumor samples is \eqn{My(My quantile(gene.normal.mean, 0.25) + gene.tumor.index <- gene.tumor.mean > quantile(gene.tumor.mean, 0.25) + sTable = count.matrix[gene.normal.index | gene.tumor.index, ] + rank.test.pvalue = NULL + + labels = c(rep("o", length(normal.id)), rep(".", length(tumor.id))) + pch = c(rep(1, length(normal.id)), rep(20, length(tumor.id))) + + y <- rep("Normal", dim(count.matrix)[2]) + y[colnames(count.matrix) %in% tumor.id] <- "Tumor" + names(y) <- colnames(count.matrix) + + test.pvalue <- apply(sTable, 1, function(x) wilcox.test(x ~ y)$p.value) + sorted.pvalue = sort.int(test.pvalue, index.return = TRUE) + #select the top 1000 genes with smallest p-value + top.gene.index = sorted.pvalue$ix[1:1000] + + #pca analysis + principal.res = prcomp(t(sTable[top.gene.index, ]), + retx = T, + center = T, + scale = T) + + top2.pcs <- principal.res$x[, 1:2] + top2.pcs.dis <- dist(top2.pcs, method = "euclidean") + top2.pcs.hclust <- hclust(top2.pcs.dis, method = "ward.D2") + top2.pcs.two.clusters <- cutree(top2.pcs.hclust, k = 2) + + labels <- y[top2.pcs.hclust$order] + hc_labels <- list() + hc_labels[["label"]] <- labels + hc_labels[["cluster"]] <- top2.pcs.two.clusters + + labels <- y[top2.pcs.hclust$order] + labels[labels == "Tumor"] <- "+" + labels[labels == "Normal"] <- "." + labels_colors <- labels + labels_colors[labels_colors == '+'] <- "blue" + labels_colors[labels_colors == '.'] <- "red" + + + par(mar = c(1, 2.5, 2, 0)) + as.dendrogram(top2.pcs.hclust) %>% set("labels", labels) %>% set("labels_col", labels_colors) %>% plot(main = "Hierarchical clustering of tumor and normal samples") + as.dendrogram(top2.pcs.hclust) %>% rect.dendrogram(k=2, border = 8, lty = 2, lwd = 1) + + legend( + "topright", + legend = c("Normal", "Tumor"), + pch = c(20, 3), + col = c("red", "blue"), + bty = "n" + ) + + return(hc_labels) +} + +#' @title plot_sd +#' @description Plot the standard deviation of log2 raw expression +#' @name plot_sd +#' @rdname detect_suspicious_sample_by_hierarchical_clustering_2comp +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @param normal.id A vector of normal sample ids +#' @param tumor.id A vector of tumor sample ids +#' @return +#' +#' @export plot_sd +plot_sd <- function(count.matrix, normal.id, tumor.id){ + if(length(normal.id) + length(tumor.id) != ncol(count.matrix)){ + stop("Total number of normal and tumor samples in normal.id and tumor.id must be the same with the numbe of columns in count.matrix.") + } + count.matrix[which(count.matrix == 0, arr.ind = T)] = 1 + sdn.obs <- apply(log2(count.matrix[, match(normal.id, colnames(count.matrix))]), 1, sd) + sdm.obs <- apply(log2(count.matrix[, match(tumor.id, colnames(count.matrix))]), 1, sd) + par(mfrow = c(1, 2)) + plot(sdn.obs, + xlab = 'Genes', ylab = 'Standard Deviation', main = 'Normal Reference') + plot(sdm.obs, + xlab = 'Genes', ylab = 'Standard Deviation', main = 'Mixed Tumor') + par(mfrow = c(1,1)) +} + +#' @title subset_sd +#' @description Subset a count matrix given the the ranges of the standard deviations of the +#' log2 expressions from the tumor and normal samples +#' @name subset_sd +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @param normal.id A vector of normal sample ids +#' @param tumor.id A vector of tumor sample ids +#' @param cutoff_normal A vector of two numeric values, indicating the lower and upper bounds of standard deviation of +#' log2 count matrix from the normal samples to subset. Default is c(0.1, 0.6) +#' @param cutoff_tumor A vector of two numeric values, indicating the lower and upper bounds of standard deviation of +#' log2 count matrix from the tumor samples to subset. Default is c(0.2, 0.8) +#' @return A subset of the count matrix +#' +#' @export subset_sd +subset_sd <- function(count.matrix, normal.id, tumor.id, + cutoff_normal = c(0.1, 0.6), + cutoff_tumor = c(0.2, 0.8)){ + + count.matrix[which(count.matrix == 0, arr.ind = T)] = 1 + sdn.obs <- apply(log2(count.matrix[, match(normal.id, colnames(count.matrix))]), 1, sd) + sdm.obs <- apply(log2(count.matrix[, match(tumor.id, colnames(count.matrix))]), 1, sd) + indx <- which(sdn.obs > cutoff_normal[1] & sdn.obs < cutoff_normal[2] & sdm.obs > cutoff_tumor[1] & sdm.obs < cutoff_tumor[2]) + count.matrix.subset <- count.matrix[indx, ] + return(count.matrix.subset) +} + +#' @title plot_dim +#' @description Plot the distribution of tumor and normal samples in a 2D PCA space based on their expressions +#' @name plot_dim +#' @rdname detect_suspicious_sample_by_hierarchical_clustering_2comp +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @param normal.id A vector of normal sample ids +#' @param tumor.id A vector of tumor sample ids +#' @param legend.position Position of legend in the plot. Default is bottomleft. +#' @param legend.cex Character expansion factor relative to current par("cex"). Default = 1.2 +#' @return +#' +#' @export plot_dim +plot_dim <- function(count.matrix, labels, + legend.position = 'bottomleft', + legend.cex = 1.2){ + + if(length(labels) != ncol(count.matrix)){ + stop("The length of labels must be the same with the numbe of columns in count.matrix.") + } + + N.label <- length(unique(labels)) + + count.matrix[which(count.matrix == 0, arr.ind = T)] = 1 + ## PCA dimension reduction + res = 0 + res <- prcomp(t(log2(count.matrix)), center = T, scale. = T) + ## PC variance + pct1 = round(res$sdev[1]/sum(res$sdev), 3)*100 + pct2 = round(res$sdev[2]/sum(res$sdev), 3)*100 + ## Dimension + dim1 = res$x[,1]; dim2 = res$x[,2] + ## x/ylab text + xlab.text = paste("First Comp: ", as.character(pct1), "% variance", sep="") + ylab.text = paste("Second Comp: ", as.character(pct2), "% variance", sep="") + main.text = 'PCA' + + ## Plot + plot(dim1, dim2, cex = 1, + xlab=xlab.text, ylab=ylab.text, main = main.text, + col=rainbow(N.label)[as.numeric(factor(labels))], + pch = as.numeric(factor((labels))), lwd=1.5) + abline(h=0, v=0, col="brown", lty=2) + abline(h=0, v=0, col="brown", lty=2) + center1<-tapply(dim1, factor(labels), mean) + center2<-tapply(dim2, factor(labels), mean) + ## add lines and centers + for (ii in 1:length(center1)) { + groupi<-cbind(dim1,dim2)[as.numeric(factor(labels))==ii, 1:2] + if (class(groupi)[1] =="matrix") { + for (j in (1:nrow(groupi))) { + segments(groupi[j,1], groupi[j,2], center1[ii], center2[ii], col = rainbow(N.label)[ii], lwd = 0.5) + } + }else { + segments(groupi[1], groupi[2], center1[ii], center2[ii], col = rainbow(N.label)[ii], lwd = 0.5) + } + } + points(center1, center2, pch = 7, lwd = 1.5, col = rainbow(N.label)) + legend(legend.position, legend=names(table(factor(labels))), + text.col=rainbow(N.label), pch=1:N.label, + col=rainbow(N.label), lty=1, cex = legend.cex) +} + + +#' @title scale_normalization_75th_percentile +#' @description Quantile normalization for the raw count matrix of tumor and normal reference using the 0.75 quantile scale normalization +#' @name scale_normalization_75th_percentile +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @return the scale normalized count matrix +#' +#' @export scale_normalization_75th_percentile +scale_normalization_75th_percentile <- function(count.matrix){ + newt <- count.matrix + colnames(newt) = NULL + rownames(newt) = NULL + + designs=c(rep("0", dim(count.matrix)[2])) + seqData=newSeqCountSet(as.matrix(newt), designs) + + # quantile normalization/total/median ###try different normalization method### + seqData=estNormFactors(seqData, "quantile") + k3=seqData@normalizationFactor + mk3=median(k3) + k3=k3/mk3 + + temp<-newt + + for(i in 1:ncol(newt)){ + temp[,i] = temp[,i]/k3[i] + } + count.matrix.normalized<-temp + colnames(count.matrix.normalized)<-colnames(count.matrix) + rownames(count.matrix.normalized)<-rownames(count.matrix) + + return(count.matrix.normalized) +} + + +#' @title subset_sd_gene_remaining +#' @description Find the cutoffs to filter out genes with large standard deviations of log2 expressions in both normal and tumor samples +#' @name subset_sd_gene_remaining +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @param normal.id A vector of normal sample ids +#' @param tumor.id A vector of tumor sample ids +#' @param cutoff_normal_range A vector of two numeric values, indicating the lower and upper bounds of standard deviation of +#' log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6) +#' @param cutoff_tumor_range A vector of two numeric values, indicating the lower and upper bounds to search standard deviation of +#' log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6) +#' @param cutoff_step A scatter value indicating the step size of changing cutoff_normal_range and cutoff_tumor_range to find a +#' suitable subset of count matrix for downstream analysis +#' +#' @export subset_sd_gene_remaining +subset_sd_gene_remaining <- function(count.matrix, normal.id, tumor.id, + cutoff_normal_range = c(0.2, 0.6), + cutoff_tumor_range = c(0.2, 0.8), + cutoff_step = 0.2){ + + if(length(normal.id) + length(tumor.id) != ncol(count.matrix)){ + stop("Total number of normal and tumor samples in normal.id and tumor.id must be the same with the numbe of columns in count.matrix.") + } + count.matrix[which(count.matrix == 0, arr.ind = T)] = 1 + sdn.obs <- apply(log2(count.matrix[, match(normal.id, colnames(count.matrix))]), 1, sd) + sdm.obs <- apply(log2(count.matrix[, match(tumor.id, colnames(count.matrix))]), 1, sd) + + cutoff_normal_range <- seq(cutoff_normal_range[1], cutoff_normal_range[2], by = cutoff_step) + cutoff_tumor_range <- seq(cutoff_tumor_range[1], cutoff_tumor_range[2], by = cutoff_step) + + num_gene_remaining_different_cutoffs <- NULL + + for(cutoff_normal in cutoff_normal_range){ + if(cutoff_normal == cutoff_normal_range[1]) next + + for(cutoff_tumor in cutoff_tumor_range){ + if(cutoff_tumor == cutoff_tumor_range[1]) next + + indx <- which(sdn.obs > cutoff_normal_range[1] & sdn.obs < cutoff_normal & + sdm.obs > cutoff_tumor_range[1] & sdm.obs < cutoff_tumor) + + num_gene_remaining_different_cutoffs <- rbind(num_gene_remaining_different_cutoffs, + data.frame(normal.cutoff.low=cutoff_normal_range[1], + normal.cutoff.high=cutoff_normal, + tumor.cutoff.low=cutoff_tumor_range[1], + tumor.cutoff.high=cutoff_tumor, + num.gene.remaining=length(indx))) + + } + } + return(num_gene_remaining_different_cutoffs) +} + +#' @title DeMixT_preprocessing +#' @description DeMixT preprocessing in one go +#' @name DeMixT_preprocessing +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My + M1)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed samples and \eqn{M1} is the number of normal samples. Row names are genes +#' column names are sample ids. +#' @param normal.id A vector of normal sample ids +#' @param tumor.id A vector of tumor sample ids +#' @param cutoff_normal_range A vector of two numeric values, indicating the lower and upper bounds of standard deviation of +#' log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6) +#' @param cutoff_tumor_range A vector of two numeric values, indicating the lower and upper bounds to search standard deviation of +#' log2 count matrix from the normal samples to subset. Default is c(0.2, 0.6) +#' @param cutoff_step A scatter value indicating the step size of changing cutoff_normal_range and cutoff_tumor_range to find a +#' suitable subset of count matrix for downstream analysis +#' +#' @return processed count matrix +#' +#' @export DeMixT_preprocessing +DeMixT_preprocessing <- function(count.matrix, normal.id, tumor.id, + cutoff_normal_range=c(0.1, 1.0), + cutoff_tumor_range=c(0, 2.5), + cutoff_step=0.2){ + + num_gene_remaining_different_cutoffs <- subset_sd_gene_remaining(count.matrix, normal.id, tumor.id, + cutoff_normal_range, + cutoff_tumor_range, + cutoff_step) + + + num_gene_remaining_different_cutoffs_filter <- num_gene_remaining_different_cutoffs[which(num_gene_remaining_different_cutoffs$num.gene.remaining - 9000 > 0), ] + num_gene_remaining_different_cutoffs_filter <- num_gene_remaining_different_cutoffs_filter[order(num_gene_remaining_different_cutoffs_filter$num.gene.remaining), ] + sd_cutoff_normal <- c(num_gene_remaining_different_cutoffs_filter$normal.cutoff.low[1], num_gene_remaining_different_cutoffs_filter$normal.cutoff.high[1]) + sd_cutoff_tumor <- c(num_gene_remaining_different_cutoffs_filter$tumor.cutoff.low[1], num_gene_remaining_different_cutoffs_filter$tumor.cutoff.high[1]) + + count.matrix <- subset_sd(count.matrix, normal.id, tumor.id, cutoff_normal = sd_cutoff_normal, cutoff_tumor = sd_cutoff_tumor) + + count.matrix = quantile_normalization(count.matrix) + + preprocessing_output <- list() + + preprocessing_output[['count.matrix']] <- count.matrix + + preprocessing_output[["sd_cutoff_normal"]] <- sd_cutoff_normal + preprocessing_output[["sd_cutoff_tumor"]] <- sd_cutoff_tumor + + return(preprocessing_output) + +} + +#' @title batch_correction +#' @description Batch effect correction for multiple batches of tumor samples using ComBat +#' @name batch_correction +#' @param count.matrix A matrix of raw expression count with \eqn{G} by \eqn{(My)}, where \eqn{G} is the number +#' of genes, \eqn{My} is the number of mixed tumor samples. Row names are genes +#' column names are tumor sample ids. +#' @param batch_labels Factor of tumor samples from different batches +#' @return Batch effect corrected count matrix for tumor samples +#' +#' @export batch_correction +batch_correction <- function(count.matrix, batch_labels){ + + if(length(batch_labels) != ncol(count.matrix)){ + stop("Total number of normal and tumor samples in normal.id and tumor.id must be the same with the numbe of columns in count.matrix.") + } + count.matrix[which(count.matrix == 0, arr.ind = T)] = 1 + count.matrix.combat.log2 = ComBat(dat = log2(count.matrix), + batch = batch_labels, mod = NULL, + par.prior=TRUE, mean.only = F) + + count.matrix.combat.combat <- 2^count.matrix.combat.log2 + return(count.matrix.combat.combat) +} + + diff -Nru r-bioc-demixt-1.12.0+dfsg/R/DeMixT.R r-bioc-demixt-1.14.0+dfsg/R/DeMixT.R --- r-bioc-demixt-1.12.0+dfsg/R/DeMixT.R 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/R/DeMixT.R 2022-11-01 19:11:00.000000000 +0000 @@ -144,11 +144,155 @@ #' #' @export + + + + + + +#' Deconvolution of heterogeneous tumor samples with two or three components +#' using expression data from RNAseq or microarray platforms +#' +#' DeMixT is a software that performs deconvolution on transcriptome data from +#' a mixture of two or three components. +#' +#' +#' @param data.Y A SummarizedExperiment object of expression data from mixed +#' tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +#' number of genes and \eqn{My} is the number of mixed samples. Samples with +#' the same tissue type should be placed together in columns. +#' @param data.N1 A SummarizedExperiment object of expression data from +#' reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +#' where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +#' for component 1. +#' @param data.N2 A SummarizedExperiment object of expression data from +#' additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +#' \eqn{G} is the number of genes and \eqn{M2} is the number of samples for +#' component 2. Component 2 is needed only for running a three-component model. +#' @param niter The maximum number of iterations used in the algorithm of +#' iterated conditional modes. A larger value better guarantees the convergence +#' in estimation but increases the running time. The default is 10. +#' @param nbin The number of bins used in numerical integration for computing +#' complete likelihood. A larger value increases accuracy in estimation but +#' increases the running time, especially in a three-component deconvolution +#' problem. The default is 50. +#' @param if.filter The logical flag indicating whether a predetermined filter +#' rule is used to select genes for proportion estimation. The default is TRUE. +#' @param filter.sd The cut-off for the standard deviation of lognormal +#' distribution. Genes whose log transferred standard deviation smaller than +#' the cut-off will be selected into the model. The default is 0.5. +#' @param ngene.selected.for.pi The percentage or the number of genes used for +#' proportion estimation. The difference between the expression levels from +#' mixed tumor samples and the known component(s) are evaluated, and the most +#' differential expressed genes are selected, which is called DE. It is enabled +#' when if.filter = TRUE. The default is \eqn{min(1500, 0.3*G)}, where \eqn{G} +#' is the number of genes. Users can also try using more genes, ranging from +#' \eqn{0.3*G} to \eqn{0.5*G}, and evaluate the outcome. +#' @param mean.diff.in.CM Threshold of expression difference for selecting +#' genes in the component merging strategy. We merge three-component to +#' two-component by selecting genes with similar expressions for the two known +#' components. Genes with the mean differences less than the threshold will be +#' selected for component merging. It is used in the three-component setting, +#' and is enabled when if.filter = TRUE. The default is 0.25. +#' @param nspikein The number of spikes in normal reference used for proportion +#' estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +#' number of mixed samples. If it is set to 0, proportion estimation is +#' performed without any spike in normal reference. +#' @param gene.selection.method The method of gene selection used for +#' proportion estimation. The default method is 'GS', which applies a profile +#' likelihood based method for gene selection. If it is set to 'DE', the most +#' differential expressed genes are selected. +#' @param ngene.Profile.selected The number of genes used for proportion +#' estimation ranked by profile likelihood. The default is +#' \eqn{min(1500,0.1*G)}, where \eqn{G} is the number of genes. This is +#' enabled only when gene.selection.method is set to 'GS'. +#' @param tol The convergence criterion. The default is 10^(-5). +#' @param output.more.info The logical flag indicating whether to show the +#' estimated proportions in each iteration in the output. +#' @param pi01 Initialized proportion for first kown component. The default is +#' \eqn{Null} and pi01 will be generated randomly from uniform distribution. +#' @param pi02 Initialized proportion for second kown component. pi02 is needed +#' only for running a three-component model. The default is \eqn{Null} and pi02 +#' will be generated randomly from uniform distribution. +#' @param nthread The number of threads used for deconvolution when OpenMP is +#' available in the system. The default is the number of whole threads minus +#' one. In our no-OpenMP version, it is set to 1. +#' @return \item{pi}{A matrix of estimated proportion. First row and second row +#' corresponds to the proportion estimate for the known components and unkown +#' component respectively for two or three component settings, and each column +#' corresponds to one sample.} \item{pi.iter}{Estimated proportions in each +#' iteration. It is a \eqn{niter* My*p} array, where \eqn{p} is the number of +#' components. This is enabled only when output.more.info = TRUE.} +#' \item{ExprT}{A matrix of deconvolved expression profiles corresponding to +#' T-component in mixed samples for a given subset of genes. Each row +#' corresponds to one gene and each column corresponds to one sample.} +#' \item{ExprN1}{A matrix of deconvolved expression profiles corresponding to +#' N1-component in mixed samples for a given subset of genes. Each row +#' corresponds to one gene and each column corresponds to one sample.} +#' \item{ExprN2}{A matrix of deconvolved expression profiles corresponding to +#' N2-component in mixed samples for a given subset of genes in a +#' three-component setting. Each row corresponds to one gene and each column +#' corresponds to one sample.} \item{Mu}{A matrix of estimated \eqn{Mu} of +#' log2-normal distribution for both known (\eqn{MuN1, MuN2}) and unknown +#' component (\eqn{MuT}). Each row corresponds to one gene.} +#' \item{Sigma}{Estimated \eqn{Sigma} of log2-normal distribution for both +#' known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each +#' row corresponds to one gene.} \item{gene.name}{The names of genes used in +#' estimating the proportions. If no gene names are provided in the original +#' data set, the genes will be automatically indexed.} +#' @author Zeya Wang, Wenyi Wang +#' @seealso http://bioinformatics.mdanderson.org/main/DeMixT +#' @references Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of +#' Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: +#' 451-460. +#' @keywords DeMixT +#' @examples +#' +#' # Example 1: simulated two-component data by using GS(gene selection method) +#' data(test.data.2comp) +#' # res <- DeMixT(data.Y = test.data.2comp$data.Y, +#' # data.N1 = test.data.2comp$data.N1, +#' # data.N2 = NULL, nspikein = 50, +#' # gene.selection.method = 'GS', +#' # niter = 10, nbin = 50, if.filter = TRUE, +#' # ngene.selected.for.pi = 150, +#' # mean.diff.in.CM = 0.25, tol = 10^(-5)) +#' # res$pi +#' # head(res$ExprT, 3) +#' # head(res$ExprN1, 3) +#' # head(res$Mu, 3) +#' # head(res$Sigma, 3) +#' # +#' # Example 2: simulated two-component data by using DE(gene selection method) +#' # data(test.data.2comp) +#' # res <- DeMixT(data.Y = test.data.2comp$data.Y, +#' # data.N1 = test.data.2comp$data.N1, +#' # data.N2 = NULL, nspikein = 50, g +#' # ene.selection.method = 'DE', +#' # niter = 10, nbin = 50, if.filter = TRUE, +#' # ngene.selected.for.pi = 150, +#' # mean.diff.in.CM = 0.25, tol = 10^(-5)) +#' # +#' # Example 3: three-component mixed cell line data applying +#' # component merging strategy +#' # data(test.data.3comp) +#' # res <- DeMixT(data.Y = test.data.3comp$data.Y, +#' # data.N1 = test.data.3comp$data.N1, +#' # data.N2 = test.data.3comp$data.N2, +#' # if.filter = TRUE) +#' # +#' # Example: convert a matrix into the SummarizedExperiment format +#' # library(SummarizedExperiment) +#' # example <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE) +#' # example.se <- SummarizedExperiment(assays = list(counts = example)) +#' +#' +#' @export DeMixT DeMixT <- function( data.Y, data.N1, data.N2 = NULL, niter = 10, nbin = 50, if.filter = TRUE, filter.sd = 0.5, ngene.selected.for.pi = NA, - mean.diff.in.CM = 0.25, nspikein = NULL, + mean.diff.in.CM = 0.25, nspikein = min(200, ceiling(ncol(data.Y)*0.3)), gene.selection.method = 'GS', ngene.Profile.selected = NA, tol = 10^(-5), output.more.info = FALSE, @@ -213,4 +357,4 @@ Mu = res.S2$decovMu, Sigma = res.S2$decovSigma, gene.name = res.pi$gene.name)) } -} \ No newline at end of file +} diff -Nru r-bioc-demixt-1.12.0+dfsg/R/DeMixT_S2.R r-bioc-demixt-1.14.0+dfsg/R/DeMixT_S2.R --- r-bioc-demixt-1.12.0+dfsg/R/DeMixT_S2.R 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/R/DeMixT_S2.R 2022-11-01 19:11:00.000000000 +0000 @@ -1,55 +1,55 @@ -#' @title Deconvolves expressions of each individual sample for unknown component -#' -#' @description This function is designed to estimate the deconvolved expressions -#' of individual mixed tumor samples for unknown component for each gene. -#' -#' @param data.Y A SummarizedExperiment object of expression data from mixed -#' tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the number -#' of genes and \eqn{My} is the number of mixed samples. Samples with the same -#' tissue type should be placed together in columns. -#' @param data.N1 A SummarizedExperiment object of expression data -#' from reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix -#' where \eqn{G} is the number of genes and \eqn{M1} is the number of samples -#' for component 1. +#' Deconvolves expressions of each individual sample for unknown component +#' +#' This function is designed to estimate the deconvolved expressions of +#' individual mixed tumor samples for unknown component for each gene. +#' +#' +#' @param data.Y A SummarizedExperiment object of expression data from mixed +#' tumor samples. It is a \eqn{G} by \eqn{My} matrix where \eqn{G} is the +#' number of genes and \eqn{My} is the number of mixed samples. Samples with +#' the same tissue type should be placed together in columns. +#' @param data.N1 A SummarizedExperiment object of expression data from +#' reference component 1 (e.g., normal). It is a \eqn{G} by \eqn{M1} matrix +#' where \eqn{G} is the number of genes and \eqn{M1} is the number of samples +#' for component 1. #' @param data.N2 A SummarizedExperiment object of expression data from -#' additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where +#' additional reference samples. It is a \eqn{G} by \eqn{M2} matrix where #' \eqn{G} is the number of genes and \eqn{M2} is the number of samples for #' component 2. Component 2 is needed only for running a three-component model. -#' @param givenpi A vector of proportions for all mixed tumor samples. -#' In two-component analysis, it gives the proportions of the unknown reference -#' component, and in three-component analysis, it gives the proportions for the +#' @param givenpi A vector of proportions for all mixed tumor samples. In +#' two-component analysis, it gives the proportions of the unknown reference +#' component, and in three-component analysis, it gives the proportions for the #' two known components. -#' @param nbin Number of bins used in numerical integration for computing +#' @param nbin Number of bins used in numerical integration for computing #' complete likelihood. A larger value increases accuracy in estimation but -#' increases the running time, especially in a three-component deconvolution +#' increases the running time, especially in a three-component deconvolution #' problem. The default is 50. #' @param nthread The number of threads used for deconvolution when OpenMP is -#' available in the system. The default is the number of whole threads minus one. -#' In our no-OpenMP version, it is set to 1. -#' -#' @return -#' \item{decovExprT}{A matrix of deconvolved expression profiles corresponding to -#' T-component in mixed samples for a given subset of genes. Each row -#' corresponds to one gene and each column corresponds to one sample.} -#' \item{decovExprN1}{A matrix of deconvolved expression profiles corresponding to -#' N1-component in mixed samples for a given subset of genes. Each row -#' corresponds to one gene and each column corresponds to one sample.} -#' \item{decovExprN2}{A matrix of deconvolved expression profiles corresponding to -#' N2-component in mixed samples for a given subset of genes in a -#' three-component setting. Each row corresponds to one gene and each -#' column corresponds to one sample.} -#' \item{decovMu}{A matrix of estimated \eqn{Mu} of log2-normal distribution for -#' both known (\eqn{MuN1, MuN2}) and unknown component (\eqn{MuT}). Each row -#' corresponds to one gene.} -#' \item{decovSigma}{Estimated \eqn{Sigma} of log2-normal distribution for both -#' known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each +#' available in the system. The default is the number of whole threads minus +#' one. In our no-OpenMP version, it is set to 1. +#' @return \item{decovExprT}{A matrix of deconvolved expression profiles +#' corresponding to T-component in mixed samples for a given subset of genes. +#' Each row corresponds to one gene and each column corresponds to one sample.} +#' \item{decovExprN1}{A matrix of deconvolved expression profiles corresponding +#' to N1-component in mixed samples for a given subset of genes. Each row +#' corresponds to one gene and each column corresponds to one sample.} +#' \item{decovExprN2}{A matrix of deconvolved expression profiles corresponding +#' to N2-component in mixed samples for a given subset of genes in a +#' three-component setting. Each row corresponds to one gene and each column +#' corresponds to one sample.} \item{decovMu}{A matrix of estimated \eqn{Mu} of +#' log2-normal distribution for both known (\eqn{MuN1, MuN2}) and unknown +#' component (\eqn{MuT}). Each row corresponds to one gene.} +#' \item{decovSigma}{Estimated \eqn{Sigma} of log2-normal distribution for both +#' known (\eqn{SigmaN1, SigmaN2}) and unknown component (\eqn{SigmaT}). Each #' row corresponds to one gene.} -#' #' @author Zeya Wang, Wenyi Wang -#' #' @seealso http://bioinformatics.mdanderson.org/main/DeMixT -#' +#' @references Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of +#' Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: +#' 451-460. +#' @keywords DeMixT_S2 #' @examples +#' #' # Example 1: two-component deconvolution given proportions #' data(test.data.2comp) #' givenpi <- c(t(as.matrix(test.data.2comp$pi[-2,]))) @@ -68,12 +68,8 @@ #' # givenpi = givenpi, #' # nbin = 50) #' -#' @references Wang Z, Cao S, Morris J S, et al. Transcriptome Deconvolution of -#' Heterogeneous Tumor Samples with Immune Infiltration. iScience, 2018, 9: 451-460. -#' -#' @keywords DeMixT_S2 #' -#' @export +#' @export DeMixT_S2 DeMixT_S2 <- function(data.Y, data.N1, data.N2 = NULL, givenpi, nbin = 50, nthread = parallel::detectCores() - 1) @@ -205,4 +201,4 @@ decovExprN2 = decovExprN2, decovMu = decovMu, decovSigma = decovSigma)) } -} \ No newline at end of file +} diff -Nru r-bioc-demixt-1.12.0+dfsg/R/Optimum_KernelC.R r-bioc-demixt-1.14.0+dfsg/R/Optimum_KernelC.R --- r-bioc-demixt-1.12.0+dfsg/R/Optimum_KernelC.R 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/R/Optimum_KernelC.R 2022-11-01 19:11:00.000000000 +0000 @@ -1,54 +1,55 @@ -#' @title Kernel function for optimizing parameters and hidden variables in DeMixT +#' Kernel function for optimizing parameters and hidden variables in DeMixT #' -#' @description This function is invoked by DeMixT_GS or DeMixT_DE and DeMixT_S2 to -#' finish parameter estimation by iterated conditional mode algorithm and reconstitute +#' This function is invoked by DeMixT_GS or DeMixT_DE and DeMixT_S2 to finish +#' parameter estimation by iterated conditional mode algorithm and reconstitute #' gene expression profile of all components. -#' +#' +#' #' @param inputdata A matrix of expression data (e.g gene expressions) from -#' reference (e.g. normal) and mixed samples (e.g. mixed tumor samples). It is a -#' \eqn{G*M} matrix where \eqn{G} is the number of genes and \eqn{M} is the -#' number of samples including reference and mixed samples. Samples with the +#' reference (e.g. normal) and mixed samples (e.g. mixed tumor samples). It is +#' a \eqn{G*M} matrix where \eqn{G} is the number of genes and \eqn{M} is the +#' number of samples including reference and mixed samples. Samples with the #' same tissue type should be placed together in columns (e.g. cbind(normal #' amples, mixed tumor samples). #' @param groupid A vector of indicators to denote if the corresponding samples #' are reference samples or mixed tumor samples. DeMixT is able to deconvolve #' mixed tumor samples with at most three components. We use 1 and 2 to denote -#' the samples referencing the first and the second known component in mixed +#' the samples referencing the first and the second known component in mixed #' tumor samples. We use 3 to indicate mixed tumor samples prepared to be -#' deconvolved. For example, in two-component deconvolution, we have -#' c(1,1,...,3,3) and in three-component deconvolution, we have +#' deconvolved. For example, in two-component deconvolution, we have +#' c(1,1,...,3,3) and in three-component deconvolution, we have #' c(1,1,...,2,2,...,3,3). #' @param nspikein The number of spikes in normal reference used for proportion -#' estimation. The default value is \eqn{ min(200, 0.3*My)}, where -#' \eqn{My} the number of mixed tumor samples. If it is set to 0, proportion -#' estimation is performed without any spike in normal reference. -#' @param setting.pi If it is set to 0, then deconvolution is performed without -#' any given proportions; if set to 1, deconvolution with given proportions -#' for the first and the second known component is run; if set to 2, -#' deconvolution is run with given tumor proportions. This option helps to -#' perform deconvolution in different settings. In estimation of -#' component-specific proportions, we use a subset of genes ; so when -#' it is required to deconvolve another subset of genes, we just easily plug -#' back our estimated proportions by setting this option to 1. In our two-step -#' estimation strategy in a three-component setting, this option is set to 2 to -#' implement the second step. +#' estimation. The default value is \eqn{ min(200, 0.3*My)}, where \eqn{My} the +#' number of mixed tumor samples. If it is set to 0, proportion estimation is +#' performed without any spike in normal reference. +#' @param setting.pi If it is set to 0, then deconvolution is performed without +#' any given proportions; if set to 1, deconvolution with given proportions for +#' the first and the second known component is run; if set to 2, deconvolution +#' is run with given tumor proportions. This option helps to perform +#' deconvolution in different settings. In estimation of component-specific +#' proportions, we use a subset of genes ; so when it is required to deconvolve +#' another subset of genes, we just easily plug back our estimated proportions +#' by setting this option to 1. In our two-step estimation strategy in a +#' three-component setting, this option is set to 2 to implement the second +#' step. #' @param givenpi \eqn{ST}-Vector of proportions. Given the number of mixed #' tumor samples is \eqn{My(My #include #include + #include #include #include /* DBL_EPSILON */ #include "DeMixTH.h" diff -Nru r-bioc-demixt-1.12.0+dfsg/src/RcppExports.cpp r-bioc-demixt-1.14.0+dfsg/src/RcppExports.cpp --- r-bioc-demixt-1.12.0+dfsg/src/RcppExports.cpp 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/src/RcppExports.cpp 2022-11-01 19:11:00.000000000 +0000 @@ -5,6 +5,11 @@ using namespace Rcpp; +#ifdef RCPP_USE_GLOBAL_ROSTREAM +Rcpp::Rostream& Rcpp::Rcout = Rcpp::Rcpp_cout_get(); +Rcpp::Rostream& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get(); +#endif + // ft_y double ft_y(double y, double mung, double mutg, double sng, double stg, double pi1, double pi2); RcppExport SEXP _DeMixT_ft_y(SEXP ySEXP, SEXP mungSEXP, SEXP mutgSEXP, SEXP sngSEXP, SEXP stgSEXP, SEXP pi1SEXP, SEXP pi2SEXP) { diff -Nru r-bioc-demixt-1.12.0+dfsg/vignettes/demixt.Rmd r-bioc-demixt-1.14.0+dfsg/vignettes/demixt.Rmd --- r-bioc-demixt-1.12.0+dfsg/vignettes/demixt.Rmd 2022-04-26 18:26:17.000000000 +0000 +++ r-bioc-demixt-1.14.0+dfsg/vignettes/demixt.Rmd 2022-11-01 19:11:00.000000000 +0000 @@ -1,7 +1,6 @@ --- title: "A Vignette for DeMixT" -author: "Zeya Wang, Fan Gao, Shaolong Cao and Peng Yang" -date: "`r Sys.Date()`" +date: "Last updated: `r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{DeMixT.Rmd} @@ -9,6 +8,7 @@ \usepackage[utf8]{inputenc} --- +
```{r setup, include=FALSE, message=FALSE} library(ggplot2) @@ -117,39 +117,46 @@ # 3. Installation -## 3.1 Source file - -DeMixT source files are compatible with Windows, Linux and macOS. +The DeMixT package is compatible with Windows, Linux and MacOS. The user can install it from ``Bioconductor``: +``` +if (!require("BiocManager", quietly = TRUE)) + install.packages("BiocManager") -DeMixT_1.8.2 is the latest version, which is for a computer that has OpenMP. -It can be downloaded from Bioconductor website. -https://bioconductor.org/packages/release/bioc/html/DeMixT.html +BiocManager::install("DeMixT") +``` -To install DeMixT_1.8.2 from GitHub, start R and enter: +For Linux and MacOS, the user can also install the latest DeMixT from GitHub: +``` +if (!require("devtools", quietly = TRUE)) + install.packages('devtools') -```{r install_DeMixT} -# devtools::install_github("wwylab/DeMixT") +devtools::install_github("wwylab/DeMixT") ``` -For more information, please visit: - +Check if DeMixT is installed successfully: +``` +# load package +library(DeMixT) +``` +**Note**: DeMixT relies on OpenMP for parallel computing. Starting from R 4.00, R no longer supports OpenMP on MacOS, meaning the user can only run DeMixT with one core on MacOS. We therefore recommend the users to mainly use Linux system for running DeMixT to take advantage of the multi-core parallel computation. -## 3.2 Functions +# 4. Functions The following table shows the functions included in DeMixT. Table Header | Second Header ------- | ---------------------------------- -DeMixT | Deconvolution of tumor samples with two or three components -DeMixT_GS | Estimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihood -DeMixT_DE | Estimates the proportions of mixed samples for each mixing component -DeMixT_S2 |Deconvolves expressions of each sample for unknown component -Optimum_KernelC | Call the C function used for parameter estimation in DeMixT +DeMixT | Deconvolution of tumor samples with two or three components. +DeMixT_GS | Estimates the proportions of mixed samples for each mixing component based on a new approach to select genes that utilizes profile likelihood. +DeMixT_DE | Estimates the proportions of mixed samples for each mixing component. +DeMixT_S2 |Deconvolves expressions of each sample for unknown component. +Optimum_KernelC | Call the C function used for parameter estimation in DeMixT. +DeMixT_Preprocessing | Preprocessing functions before running DeMixT. -# 4. Methods +# 5. Methods -## 4.1 Model +## 5.1 Model Let \(Y_{ig}\) be the observed expression levels of the raw measured data from clinically derived malignant tumor samples for gene \(g, g = 1, \cdots, G\) and @@ -188,9 +195,9 @@ Consequently, our model can be expressed as the convolution of the density function for three \(\log_2\)-normal distributions. Because there is no closed form of this convolution, we use numerical integration to evaluate the complete -likelihood function (see the full likelihood in the Supplementary Materials). +likelihood function (see the full likelihood in the Supplementary Materials in [1]). -## 4.2 The DeMixT algorithm for deconvolution +## 5.2 The DeMixT algorithm for deconvolution DeMixT estimates all distribution parameters and cellular proportions and reconstitutes the expression profiles for all three components for each gene @@ -213,7 +220,7 @@ DeMixT_GS for the first step and DeMixT_S2 for the second, which are combined in the function DeMixT(Note: DeMixT_GS is the default function for first step). -In version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, +Since version 1.8.2, DeMixT added simulated normal reference samples, i.e., spike-in, based on the observed normal reference samples. It has been shown to improve accuracy in proportion estimation for the scenario where a dataset consists of samples where true tumor proportions are skewed to the high end. @@ -223,9 +230,9 @@ ``` -# 5. Examples +# 6. Examples -## 5.1 Simulated two-component data +## 6.1 Simulated two-component data ```{r, sim_2comp_GS, results="hide", message=FALSE} data("test.data.2comp") @@ -259,7 +266,7 @@ head(res.S2$decovSigma,3) ``` -## 5.2 Simulated two-component data with DeMixT Spike-in Normal +## 6.2 Simulated two-component data In the simulation, ````markdown @@ -337,7 +344,7 @@ labs(x = 'True Proportion', y = 'Estimated Proportion') ``` -## 5.3 Simulated three-component data +## 6.3 Simulated three-component data In this simulation, ````markdown @@ -422,30 +429,132 @@ ``` +## 6.4 Real data: PRAD in TCGA dataset -## 5.4 PRAD in TCGA Dataset +### 6.4.1 Preprocessing +For the deconvolution of real data with DeMixT, the user may apply the following preprocessing steps first. Here, we use the PRAD (prostate adenocarcinoma) from TCGA as an example. -```{r read_data_PRAD, warning=FALSE, message=FALSE} -load('res.PRAD.RData'); +1. **(Optional) Remove suspicious samples by hierarchical clustering** + +It is possible that the some of the tumor and normal samples are mislabelled. We use the function ``` +# hc_labels <- detect_suspicious_sample_by_hierarchical_clustering_2comp(count.matrix, normal.id, tumor.id) +``` +to visually inspect the separation of tumor and normal samples based on the hierarchical clustering of their expressions, in which ``count.matrix`` is the raw count matrix; ``normal.id`` and ``tumor.id`` are the vectors of normal and tumor sample ids, respectively. Generally, one cluster contains tumor samples and the other contains normal samples. Any samples that are clustered outside of its own group label, e.g., tumor samples clustered within the normal sample cluster or normal samples in the tumor cluster, are considered as mislabelled samples and filtered out. -```{r PRAD, fig.height = 4, fig.width = 6, fig.align='center', warning=FALSE} -res.PRAD.df = as.data.frame(cbind(res.PRAD$res.GS.PRAD, res.PRAD$res.GS.SP.PRAD)) -res.PRAD.df$V1 <- as.numeric(as.character(res.PRAD.df$V1)) -res.PRAD.df$V2 <- as.numeric(as.character(res.PRAD.df$V2)) -names(res.PRAD.df) = c('Estimated.Proportion', 'Estimated.Proportion.SP') -## Plot -ggplot(res.PRAD.df, aes(x=Estimated.Proportion, y=Estimated.Proportion.SP)) + - geom_point() + - geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", lwd = 0.5) + - xlim(0,1) + ylim(0,1) + - scale_shape_manual(values=c(seq(1:3))) + - labs(x = 'Estimated Proportion', y = 'Estimated Proportion After Spike-in') +```{r hierarchical_clustering, echo=FALSE, out.width='85%', fig.align='center'} +knitr::include_graphics(path = paste0("hierarchical_clustering.png")) ``` -Tumor proportion estimation remains consistence with and without spike-in normal, when its distribution is roughly symmetric around 0.5. +``count.matrix``, ``normal.id `` and ``tumor.id`` are updated by +```{r} +# normal.id <- setdiff(normal.id, names(hc_labels$cluster[hc_labels$cluster == 1])) +# tumor.id <- setdiff(tumor.id, names(hc_labels$cluster[hc_labels$cluster == 2])) +# count.matrix <- count.matrix[, c(normal.id, tumor.id)] +``` + +2. **Select genes with small variation in gene expression across samples** + +In this step, we select a subset of ~9000 genes from the original gene set (>50,000) before running DeMixT with the GS (Gene Selection) method so that our model-based gene selection maintains good statistical properties. We use the function + +``` +# plot_sd(count.matrix, normal.id, tumor.id) +``` +to visualize the distribution of standard deviation of log2 expressions space of normal and tumor samples, +```{r plot_sd, echo=FALSE, out.width='85%', fig.align='center'} +knitr::include_graphics(path = paste0("sd_expresion.png")) +``` +and use the function + +``` +# num_gene_remaining_different_cutoffs <- subset_sd_gene_remaining(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0), cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1) +``` +to find he a range of variance in genes from normal samples ``(cutoff_normal_range)`` and from tumor samples ``(cutoff_tumor_range)`` which results in roughly 9,000 genes. + +We then use the function +``` +# count.matrix <- subset_sd(count.matrix, normal.id, tumor.id, cutoff_normal = cutoff_normal_range, cutoff_tumor = cutoff_tumor_range) +``` +to update the ``count.matrix`` such that it only includes the selected genes. + +3. **Scale normalization** + +We apply a scale normalization at the 75th percentile across all the tumor and normal samples using the function + +``` +# count.matrix <- scale_normalization_75th_percentile(count.matrix) +``` +to adjust the expression levels in the samples. + +**Note**: The user may also use the function +``` +# count.matrix <- DeMixT_preprocessing(count.matrix, normal.id, tumor.id, cutoff_normal_range = c(0.1, 1.0), cutoff_tumor_range = c(0, 2.5), cutoff_step = 0.1) +``` +to perform the preprocessing steps of 2) and 3) in one go. + + +4. **(Optional) Batch effect correction for tumor samples from different batches by ComBat** + +If the tumor samples are from different batches, we recommend the user to inspect the batch effect using the function before running DeMixT +``` +# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2) +``` +This function will generate a PCA plot, in which the samples are colored by the ``labels`` indicating different batches of tumor samples, as well as normal samples. If there is a clear separation between different batches of tumor samples, there is likely batch effects. We use the function +``` +# count.matrix.tumor <- batch_correction(count.matrix.tumor, labels) +``` +to reduce this effect, where ``count.matrix.tumor`` is the raw count matrix only from tumor samples and ``labels`` is the factor of tumor batches. The user may choose other batch effect correction methods at this step. + +The user may inspect the batch effect again after the above step using the mentioned function +``` +# plot_dim(count.matrix.tumor, labels, legend.position = 'bottomleft', legend.cex = 1.2) +``` + +### 6.4.2 Deconvolution using DeMixT +To optimize the DeMixT parameter setting for the input data, we recommend testing an array of combinations of the number of spike-ins and the number of selected genes. + +The number of CPU cores used by the DeMixT function for parallel computing is specified by the parameter ``nthread``. By default (such as in the code block below), ``nthread = total_number_of_cores_on_the_machine - 1``. The user can change ``nthread`` to a number between 0 and the total number of cores on the machine. + +``` +## Because of the random initial values and the spike-in samples within the DeMixT function, +## we would like to remind the user to set seeds to keep track. This seed setting will be +## internalized in DeMixT in the next update. + +# set.seed(1234) + +# data.Y = SummarizedExperiment(assays = list(counts = count.matrix[, tumor.id])) +# data.N1 <- SummarizedExperiment(assays = list(counts = count.matrix[, normal.id])) + +## In practice, we set the maximum number of spike-in as min(n/3, 200), +## where n is the number of samples. +# nspikesin_list = c(0, 50, 100, 150) +## One may set a wider range than provided below for studies other than TCGA. +# ngene.selected_list = c(500, 1000, 1500, 2500) + +#for(nspikesin in nspikesin_list){ +# for(ngene.selected in ngene.selected_list){ +# name = paste("PRAD_demixt_GS_res_nspikesin", nspikesin, "ngene.selected", +# ngene.selected, sep = "_"); +# name = paste(name, ".RData", sep = ""); +# res = DeMixT(data.Y = data.Y, +# data.N1 = data.N1, +# ngene.selected.for.pi = ngene.selected, +# ngene.Profile.selected = ngene.selected, +# filter.sd = 0.7, # same upper bound of gene expression standard deviation +# # for normal reference. i.e., preprocessed_data$sd_cutoff_normal[2] +# gene.selection.method = "GS", +# nspikein = nspikesin) +# save(res, file = name) +# } +#} +``` +We suggest selecting the optimal parameter combination that produces the largest average correlation of estimated tumor propotions with those produced by other combinations. The location of the mode of the Pi estimation may also be considered. The mode located too high or too low may suggest biased estimation. + +Instead of selecting using the parameter combination with the highest correlation, one can also select the parameter combination that produces estimated tumor proportions that are most biologically meaningful. + +A comprehensive tutorial of using DeMixT for real data deconvolution can be found at [https://wwylab.github.io/DeMixT/tutorial.html](https://wwylab.github.io/DeMixT/tutorial.html). -## 5.5 Deconvolution using normal reference samples from GTEx +## 6.5 Deconvolution using normal reference samples from GTEx We conducted experiments across cancer types to evaluate the impact of technical artifacts such as batch effects to the proportion estimation when using a different cohort. We applied GTEx expression data from normal prostate samples as the normal reference to deconvolute the TCGA prostate cancer samples, where normal tissues were selected without significant pathology. The estimated proportions showed a reasonable correlation (Spearman correlation coefficient = 0.65) with those generated using TCGA normal prostate samples as the normal reference. @@ -469,8 +578,10 @@ knitr::include_graphics(path = paste0("GTEx_normal.png")) ``` +# 7. Reference +[1]. Wang, Z. et al. Transcriptome Deconvolution of Heterogeneous Tumor Samples with Immune Infiltration. iScience 9, 451–460 (2018). -# 6. Session Info +# 8. Session Info ```{r} sessionInfo(package = "DeMixT") Binary files /tmp/tmp9u2nr5ih/JXU5aINcP1/r-bioc-demixt-1.12.0+dfsg/vignettes/hierarchical_clustering.png and /tmp/tmp9u2nr5ih/xxrad0_Ugz/r-bioc-demixt-1.14.0+dfsg/vignettes/hierarchical_clustering.png differ Binary files /tmp/tmp9u2nr5ih/JXU5aINcP1/r-bioc-demixt-1.12.0+dfsg/vignettes/sd_expresion.png and /tmp/tmp9u2nr5ih/xxrad0_Ugz/r-bioc-demixt-1.14.0+dfsg/vignettes/sd_expresion.png differ