Binary files /tmp/tmpaQHFqE/BABP0sRDz_/r-cran-metamix-0.2/build/vignette.rds and /tmp/tmpaQHFqE/Ug_NtA_Qf0/r-cran-metamix-0.3/build/vignette.rds differ diff -Nru r-cran-metamix-0.2/debian/changelog r-cran-metamix-0.3/debian/changelog --- r-cran-metamix-0.2/debian/changelog 2018-05-23 20:11:50.000000000 +0000 +++ r-cran-metamix-0.3/debian/changelog 2019-02-13 06:26:21.000000000 +0000 @@ -1,3 +1,11 @@ +r-cran-metamix (0.3-1) unstable; urgency=medium + + * New upstream version + * debhelper 12 + * Standards-Version: 4.3.0 + + -- Dylan Aïssi Wed, 13 Feb 2019 07:26:21 +0100 + r-cran-metamix (0.2-1) unstable; urgency=medium * Initial upload (Closes: #899401) diff -Nru r-cran-metamix-0.2/debian/compat r-cran-metamix-0.3/debian/compat --- r-cran-metamix-0.2/debian/compat 2018-05-23 20:11:50.000000000 +0000 +++ r-cran-metamix-0.3/debian/compat 2019-02-13 06:26:21.000000000 +0000 @@ -1 +1 @@ -11 +12 diff -Nru r-cran-metamix-0.2/debian/control r-cran-metamix-0.3/debian/control --- r-cran-metamix-0.2/debian/control 2018-05-23 20:11:50.000000000 +0000 +++ r-cran-metamix-0.3/debian/control 2019-02-13 06:26:21.000000000 +0000 @@ -1,20 +1,20 @@ Source: r-cran-metamix Maintainer: Debian R Packages Maintainers Uploaders: Alba Crespi , - Dylan Aïssi + Dylan Aïssi Section: gnu-r Testsuite: autopkgtest-pkg-r Priority: optional -Build-Depends: debhelper (>= 11~), +Build-Depends: debhelper (>= 12~), dh-r, - libopenmpi-dev, r-base-dev, + libopenmpi-dev, r-cran-data.table, r-cran-matrix, r-cran-gtools, r-cran-rmpi, r-cran-ggplot2 -Standards-Version: 4.1.4 +Standards-Version: 4.3.0 Vcs-Browser: https://salsa.debian.org/r-pkg-team/r-cran-metamix Vcs-Git: https://salsa.debian.org/r-pkg-team/r-cran-metamix.git Homepage: https://cran.r-project.org/package=metaMix diff -Nru r-cran-metamix-0.2/DESCRIPTION r-cran-metamix-0.3/DESCRIPTION --- r-cran-metamix-0.2/DESCRIPTION 2015-03-15 13:03:01.000000000 +0000 +++ r-cran-metamix-0.3/DESCRIPTION 2019-02-11 16:20:03.000000000 +0000 @@ -1,9 +1,9 @@ 
Package: metaMix Title: Bayesian Mixture Analysis for Metagenomic Community Profiling -Version: 0.2 +Version: 0.3 Author: Sofia Morfopoulou Maintainer: Sofia Morfopoulou -Depends: R (>= 3.0.1) +Depends: R (>= 3.2) Imports: data.table (>= 1.9.2), Matrix, gtools, Rmpi, ggplot2 Suggests: knitr VignetteBuilder: knitr @@ -15,7 +15,9 @@ License: GPL-3 LazyData: true SystemRequirements: Open MPI (>=1.4.3) -Packaged: 2015-03-15 12:32:53 UTC; psofaki +RoxygenNote: 6.1.1 +Encoding: UTF-8 NeedsCompilation: yes +Packaged: 2019-02-07 15:41:40 UTC; sophia Repository: CRAN -Date/Publication: 2015-03-15 14:03:01 +Date/Publication: 2019-02-11 16:20:03 UTC Binary files /tmp/tmpaQHFqE/BABP0sRDz_/r-cran-metamix-0.2/inst/doc/guide.pdf and /tmp/tmpaQHFqE/Ug_NtA_Qf0/r-cran-metamix-0.3/inst/doc/guide.pdf differ diff -Nru r-cran-metamix-0.2/inst/doc/guide.Rnw r-cran-metamix-0.3/inst/doc/guide.Rnw --- r-cran-metamix-0.2/inst/doc/guide.Rnw 2015-03-15 12:32:53.000000000 +0000 +++ r-cran-metamix-0.3/inst/doc/guide.Rnw 2019-02-07 15:39:43.000000000 +0000 @@ -24,9 +24,15 @@ options(tidy=TRUE, width=80) @ + + \section{What is new} \textbf{Version 0.2:} Added Bayes Factor computation. + \textbf{Version 0.3:} Allow user to specify number of MCMC iterations (Step 3) and Burn-in percentage (Step 4). Due to NCBI phasing out GI numbers, added support for protein accession identifier file for mapping to taxon (STEP1). Fixed bug in step4 for Bayes Factor calculation when only one species is present, that is only human and unknown. Also fixed bug with semi-colon separated taxonids in custom BLAST format, as well as NAs (STEP 1). Replaced deprecated cBind function with cbind. + + + \section{Installation} You will need to have openMPI (Message Passing Interface) installed to be able to install the R package \verb!Rmpi!, which provides the interface to openMPI. @@ -270,6 +276,9 @@ At iter. 13 species 2 was present, while at the next one, it no longer is there.
That means that the attempt at swapping the values of the two neighboring chains was successful. This information is also recorded, i.e how many swaps were attempted and how many accepted. +Each chain will produce a log file that will be printed in your working directory. + + \subsection*{Step4} Having explored the different possible models, the final step is to perform model averaging. @@ -315,17 +324,15 @@ \section{Submit jobs on cluster compute servers} In order to run steps 1, 2 and 4 of metaMix (i.e \verb!generative.prob!, \verb!reduce.space!, \verb!bayes.model.aver!) efficiently, these should be submitted as jobs to a compute cluster. -In our experience, 4G of memory, 1 hour of wall clock time and 1 processor should be plenty. - -In order to run the parallel tempering efficiently, we need at least 12 parallel chains, each with at least 1G-2G of RAM. +In our experience, 4G of memory, 1 hour of wall clock time and 1 processor should be plenty. +In order to run the parallel tempering efficiently (step3), we need at least 12 parallel chains. I usually request 6 or 12 threads, each with 2G of RAM. The wall clock time depends on how many iterations will be performed. -Also a larger number of reads mean that the computations will become slower. -We typically ask for 12 hours to be on the safe side. +A larger number of reads means that the computations will become slower. +We typically submit all 4 steps in one go using one submission script. +I usually assess the size of the dataset, but 12 hours for all 4 steps should be safe. - -This is a sample submission script for the third step. -It requests 12 processors on 1 node for 12 hours. +This is a sample submission script; it requests 6 processors on 1 node for 12 hours (more processors, less time necessary).
\begin{verbatim} #!/bin/bash @@ -333,17 +340,30 @@ #$ -o cluster/out #$ -e cluster/error #$ -cwd -#$ -pe smp 12 -#$ -l tmem=1.1G,h_vmem=1.1G +#$ -pe smp 6 +#$ -l tmem=2G,h_vmem=2G +#$ -l h_rt=12:00:00 #$ -V #$ -R y -mpirun -np 1 R-3.0.1/bin/R --slave CMD BATCH --no-save --no-restore step3.R +mpirun -np 1 R-3.0.1/bin/R --slave CMD BATCH --no-save --no-restore allsteps.R +\end{verbatim} + +An example allsteps.R looks like this: +\begin{verbatim} +library(metaMix) + +step1<-generative.prob(blast.output.file="sample_diamond.tab", read.length.file="read_lengths.tab", contig.weight.file="contig_weights.tab", outDir="./", gi.taxon.file="gi_taxid_prot.dmp", blast.default=TRUE, gi.or.prot="gi") + +step2<-reduce.space(step1="step1.RData") + +step3<-parallel.temper(step2="step2.RData") + +step4<-bayes.model.aver(step2="step2.RData", step3="step3.RData", taxon.name.map="names.dmp") + + \end{verbatim} -in step3.R, we simply load the object produced from \verb!reduce.space! and then call \verb!generative.prob!.
-Each chain will produce a log file that will be printed in your working directory \section{Technical information about the R session} diff -Nru r-cran-metamix-0.2/man/bayes.model.aver.Rd r-cran-metamix-0.3/man/bayes.model.aver.Rd --- r-cran-metamix-0.2/man/bayes.model.aver.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/bayes.model.aver.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,36 +1,39 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/step4_bayesModelAver.R \name{bayes.model.aver} \alias{bayes.model.aver} \alias{bayes.model.aver.explicit} \title{Bayesian Model Averaging} \usage{ -bayes.model.aver(step2, step3, taxon.name.map = NULL, poster.prob.thr = 0.9) +bayes.model.aver(step2, step3, taxon.name.map = NULL, + poster.prob.thr = 0.9, burnin = 0.4) bayes.model.aver.explicit(result, pij.sparse.mat, read.weights, outDir, - gen.prob.unknown, taxon.name.map = NULL, poster.prob.thr = 0.9) + gen.prob.unknown, taxon.name.map = NULL, poster.prob.thr = 0.9, + burnin = 0.4) } \arguments{ -\item{step3}{list. The output from parallel.temper(), i.e the third step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step3 list was saved.} - \item{step2}{list. The output from reduce.space(), i.e the second step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step2 list was saved.} +\item{step3}{list. The output from parallel.temper(), i.e the third step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step3 list was saved.} + \item{taxon.name.map}{The 'names.dmp' taxonomy names file, mapping each taxon identifier to the corresponding scientific name. 
It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz}} \item{poster.prob.thr}{Posterior probability of presence of species threshold for reporting in the species summary.} +\item{burnin}{Percentage of burn-in iterations; the default value is 0.4.} + \item{result}{The list produced by parallel.temper() (or parallel.temper.nucl()). It holds a detailed record for each chain: what moves were proposed, which were accepted and which were rejected, as well as the log-likelihood through the iterations.} \item{pij.sparse.mat}{see ?reduce.space} \item{read.weights}{see ?reduce.space} -\item{gen.prob.unknown}{see ?reduce.space} - \item{outDir}{see ?reduce.space} + +\item{gen.prob.unknown}{see ?reduce.space} } \description{ -Bayesian Model Averaging - Perform Bayesian Model Averaging. We concentrate on the chain with temperature=1, i.e the untempered posterior, to study the distribution over the model choices and perform model averaging. We consider as present the species that have a posterior probability greater than 0.9. We then fit the mixture model with these species in order to obtain relative abundances and read classification probabilities. A tab-separated file with a species summary is produced, as well as log-likelihood traceplots and cumulative histogram plots. bayes.model.aver.explicit is the same function as bayes.model.aver with a more involved syntax.
@@ -48,8 +51,7 @@ step4<-bayes.model.aver(step2="pathtoFile/step2.RData", step3="pathtoFile/step3.RData", taxon.name.map="pathtoFile/taxon.file") -} +} } \keyword{bayes.model.aver} \keyword{bayes.model.aver.explicit} - diff -Nru r-cran-metamix-0.2/man/generative.prob.Rd r-cran-metamix-0.3/man/generative.prob.Rd --- r-cran-metamix-0.2/man/generative.prob.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/generative.prob.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,25 +1,31 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/step1_preprocess.R \name{generative.prob} \alias{generative.prob} \alias{generative.prob.nucl} \title{Compute generative probabilities from BLAST output} \usage{ generative.prob(blast.output.file = NULL, read.length.file = 80, - contig.weight.file = 1, gi.taxon.file = NULL, gen.prob.unknown = 1e-06, - outDir = NULL, blast.default = TRUE) + contig.weight.file = 1, gi.taxon.file = NULL, + protaccession.taxon.file = NULL, gi.or.prot = "prot", + gen.prob.unknown = 1e-06, outDir = NULL, blast.default = TRUE) generative.prob.nucl(blast.output.file = NULL, read.length.file = 80, contig.weight.file = 1, gi.taxon.file, gen.prob.unknown = 1e-20, outDir = NULL, genomeLength = NULL, blast.default = TRUE) } \arguments{ -\item{blast.output.file}{This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). It can either be the default output format or a specific custom output format, incorporating read length and taxon identifier. Please see the vignette for column order and the exact BLAST command to use.} +\item{blast.output.file}{This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). 
It can either be the default output format or a specific custom output format, incorporating read length and taxon identifier. Please see the vignette for column order and the exact BLAST command to use. You can also use DIAMOND instead of BLASTx which is much faster and produces default format according to BLAST default output specifications.} \item{read.length.file}{This argument can either be a file mapping each read to its length or a numerical value, representing the average read length.} \item{contig.weight.file}{This argument can either be a file where weights are assigned to reads and contigs. For unassembled reads the weight is equal to 1 while for contigs the weight should reflect the number of reads that assembled it.} -\item{gi.taxon.file}{For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz} . For generative.prob.nucl() this would be the 'gi_taxid_nucl.dmp' taxonomy file, mapping each nucleotide gi identifier to the corresponding taxon identifier. It an be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz}.} +\item{gi.taxon.file}{For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz} . For generative.prob.nucl() this would be the 'gi_taxid_nucl.dmp' taxonomy file, mapping each nucleotide gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz}.} + +\item{protaccession.taxon.file}{This parameter has been added as NCBI is phasing out the usage of GI identifiers. 
For generative.prob() this would be the prot.accession2taxid taxonomy file, mapping each protein accession identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz}. I have found that it is useful to concatenate it with \url{ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz} so you can search in both files for the protein identifier (sometimes obsolete sequences can still be present in latest RefSeq releases but not in taxonomy files and vice versa and these mismatches can cause loss of information). TODO add support for nucleotides as well.} + +\item{gi.or.prot}{This parameter specifies whether the user is using the GI identifiers or protein accession identifiers to map to taxon identifiers. Values are 'gi' or 'prot'. The default value is 'prot'.} \item{gen.prob.unknown}{User-defined generative probability for unknown category. Default value for generative.prob() is 1e-06, while for generative.prob.nucl() is 1e-20.} @@ -33,8 +39,6 @@ step1: A list with five elements. The first one (pij.sparse.mat) is a sparse matrix with the generative probability between each read and each species. The second (ordered.species) is matrix containing all the potential species as recorded by BLAST, ordered by the number of reads. The third one (read.weights) is a data.frame mapping each contig to a weight which is essentially the number of reads that were used to assemble it. For unassembled reads the weight is equal to one. The fourth one is the generative probability for the unknown category (gen.prob.unknown), to be used in all subsequent steps. Finally we also record the output directory (outDir) where the results will be stored. } \description{ -Compute generative probabilities from BLAST output - generative.prob() computes the probability for a read to be generated by a certain species, given that it originates from this species. 
The input for this function is the tabular BLAST output format, either default or custom. The function uses the recorded mismatches to produce a Poisson probability. generative.prob.nucl() for when we have nucleotide similarity, i.e we have performed BLASTn. @@ -55,4 +59,3 @@ } \keyword{generative.prob} \keyword{generative.prob.nucl} - diff -Nru r-cran-metamix-0.2/man/parallel.temper.nucl.Rd r-cran-metamix-0.3/man/parallel.temper.nucl.Rd --- r-cran-metamix-0.2/man/parallel.temper.nucl.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/parallel.temper.nucl.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,11 +1,12 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/step3_parallelTemper_nucl.R \name{parallel.temper.nucl} \alias{parallel.temper.nucl} \alias{parallel.temper.nucl.explicit} \title{Parallel Tempering MCMC} \usage{ -parallel.temper.nucl(step2, readSupport = 30, noChains = 12, seed = 1, - median.genome.length = 284332) +parallel.temper.nucl(step2, readSupport = 30, noChains = 12, + seed = 1, median.genome.length = 284332) parallel.temper.nucl.explicit(readSupport = 30, noChains = 12, pij.sparse.mat, read.weights, ordered.species, gen.prob.unknown, outDir, @@ -36,8 +37,6 @@ step3: A list with two elements. The first one (result) is a list that records MCMC information from each parallel chain. The second one (duration) records how much time the MCMC exploration took. } \description{ -Parallel Tempering MCMC - Performs Parallel Tempering MCMC to explore the species state space. Two types of moves are implemented: a mutation step (within chain) and an exchange step (between neighboring chains). If working with BLASTx data, use parallel.temper(). parallel.temper.nucl.explicit is the same function as parallel.temper.nucl with a more involved syntax. 
@@ -47,4 +46,3 @@ } \keyword{parallel.temper.nucl} \keyword{parallel.temper.nucl.explicit} - diff -Nru r-cran-metamix-0.2/man/parallel.temper.Rd r-cran-metamix-0.3/man/parallel.temper.Rd --- r-cran-metamix-0.2/man/parallel.temper.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/parallel.temper.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,22 +1,29 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/step3_parallelTemper.R \name{parallel.temper} \alias{parallel.temper} \alias{parallel.temper.explicit} \title{Parallel Tempering MCMC} \usage{ -parallel.temper(step2, readSupport = 10, noChains = 12, seed = 1) +parallel.temper(step2, readSupport = 10, noChains = 12, seed = 1, + iter = 500, bypass = FALSE) -parallel.temper.explicit(readSupport = 10, noChains = 12, pij.sparse.mat, - read.weights, ordered.species, gen.prob.unknown, outDir, seed = 1) +parallel.temper.explicit(readSupport = 10, noChains = 12, + pij.sparse.mat, read.weights, ordered.species, gen.prob.unknown, outDir, + seed = 1, iter = 500, bypass = FALSE) } \arguments{ +\item{step2}{list. The output from reduce.space(). Alternatively, it can be a character string containing the path name of the ".RData" file where step2 list was saved.} + \item{readSupport}{The number of reads the user requires in order to believe in the presence of the species. It is used to compute the penalty factor. The default value is 10. We compute the logarithmic penalty value as the log-likelihood difference between two models: one where all N reads belong to the "unknown" category and one where r reads have a perfect match to some unspecified species and the remaining reads belong to the "unknown" category.} \item{noChains}{The number of parallel chains to run. The default value is 12.} \item{seed}{Optional argument that sets the random seed (default is 1) to make results reproducible.} -\item{step2}{list. 
The output from reduce.space(). Alternatively, it can be a character string containing the path name of the ".RData" file where step2 list was saved.} +\item{iter}{The number of MCMC iterations. The default behavior of metaMix is to take into account the number of potential species after step 2 in order to compute the number of MCMC iterations. By default metaMix will choose the greater value between a) the user-specified value for iter and b) the product of (5 * the number of potential species). This behavior can be bypassed by setting the bypass parameter to TRUE. Then the MCMC will run for exactly the user-specified number of iterations, iter.} + +\item{bypass}{A logical flag. If set to TRUE the MCMC will run for exactly "iter" iterations. If FALSE, metaMix defaults to choosing the greater value between "iter" and "5*(nrow(ordered.species))".} \item{pij.sparse.mat}{sparse matrix of generative probabilities, see value of ?reduce.space.} @@ -32,8 +39,6 @@ step3: A list with two elements. The first one (result) is a list that records MCMC information from each parallel chain. The second one (duration) records how much time the MCMC exploration took. } \description{ -Parallel Tempering MCMC - Performs Parallel Tempering MCMC to explore the species state space. Two types of moves are implemented: a mutation step (within chain) and an exchange step (between neighboring chains). If working with BLASTn data, use parallel.temper.nucl(). parallel.temper.explicit is the same function as parallel.temper but with a more involved syntax.
@@ -55,4 +60,3 @@ } \keyword{parallel.temper} \keyword{parallel.temper.explicit} - diff -Nru r-cran-metamix-0.2/man/reduce.space.Rd r-cran-metamix-0.3/man/reduce.space.Rd --- r-cran-metamix-0.2/man/reduce.space.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/reduce.space.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,4 +1,5 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/step2_allInclusive_MM.R \name{reduce.space} \alias{reduce.space} \alias{reduce.space.explicit} @@ -6,8 +7,8 @@ \usage{ reduce.space(step1, read.cutoff = 1, EMiter = 500, seed = 1) -reduce.space.explicit(pij.sparse.mat, ordered.species, read.weights, outDir, - gen.prob.unknown, read.cutoff = 1, EMiter = 500, seed = 1) +reduce.space.explicit(pij.sparse.mat, ordered.species, read.weights, + outDir, gen.prob.unknown, read.cutoff = 1, EMiter = 500, seed = 1) } \arguments{ \item{step1}{list. The output from generative.prob() (or generative.prob.nucl(), that is the first step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step1 list was saved.} @@ -32,8 +33,6 @@ step2: A list with six elements. The first one (ordered.species) is a data.frame containing all the non-empty species categories, as decided by the all inclusive mixture model, ordered by the number of reads assigned to them. The second one (pij.sparse.mat) is a sparse matrix with the generative probability between each read and each species. read.weights, gen.prob.unknown, outDir are all carried forward from the "step1" object. Finally outputEM which records the species abundances throughout the EM iterations (not used in step3 and step4). 
} \description{ -Reduce the space of potential species by fitting the mixture model with all potential species as categories - Having the generative probabilities from step1 (generative.prob() or generative.prob.nucl()), we could proceed directly with the PT MCMC to explore the state space. Typically the number of total potential species is large. Therefore we reduce the size of the state-space, by decreasing the number of species to the low hundreds. We achieve this by fitting a Mixture Model with as many categories as all the potential species. Post fitting, we retain only the species categories that are not empty, that is categories that have at least one read assigned to them. reduce.space.explicit is the same function as reduce.space but with more involved syntax. @@ -52,4 +51,3 @@ } \keyword{reduce.space} \keyword{reduce.space.explicit} - diff -Nru r-cran-metamix-0.2/man/step1.Rd r-cran-metamix-0.3/man/step1.Rd --- r-cran-metamix-0.2/man/step1.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/step1.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,4 +1,5 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/metaMix-package.r \name{step1} \alias{step1} \title{Example output of generative.prob() for use in the vignette/examples} @@ -6,4 +7,3 @@ \description{ Example output of generative.prob() for use in the vignette/examples } - diff -Nru r-cran-metamix-0.2/man/step2.Rd r-cran-metamix-0.3/man/step2.Rd --- r-cran-metamix-0.2/man/step2.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/step2.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,4 +1,5 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/metaMix-package.r \name{step2} \alias{step2} \title{Example output of reduce.space() for use in the vignette/examples} @@ -6,4 +7,3 @@ \description{ Example output of reduce.space() 
for use in the vignette/examples } - diff -Nru r-cran-metamix-0.2/man/step3.Rd r-cran-metamix-0.3/man/step3.Rd --- r-cran-metamix-0.2/man/step3.Rd 2015-03-15 10:33:17.000000000 +0000 +++ r-cran-metamix-0.3/man/step3.Rd 2019-02-04 13:12:26.000000000 +0000 @@ -1,4 +1,5 @@ -% Generated by roxygen2 (4.0.1): do not edit by hand +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/metaMix-package.r \name{step3} \alias{step3} \title{Example output of parallel.temper() for use in the vignette/examples} @@ -6,4 +7,3 @@ \description{ Example output of parallel.temper() for use in the vignette/examples } - diff -Nru r-cran-metamix-0.2/MD5 r-cran-metamix-0.3/MD5 --- r-cran-metamix-0.2/MD5 2015-03-15 13:03:02.000000000 +0000 +++ r-cran-metamix-0.3/MD5 2019-02-11 16:20:03.000000000 +0000 @@ -1,24 +1,24 @@ -9dd81d1d788990a863170ba6deeedc35 *DESCRIPTION -e66e614738dfa021a3a21c766ee7b940 *NAMESPACE +80a03a2bd6c13bdd20ad4e225ae1b957 *DESCRIPTION +b4f6522a319bf8d2a154097df39d8d65 *NAMESPACE f7981051fdc87cd0143a44803371667d *R/EM.R -a605a7a62d94245fcd421804721c2f3c *R/Gibbs.R +76c3280254ae796e1b3a580625400f2c *R/Gibbs.R 9601c3cccbb67a14d787cdf2be14eb24 *R/computePenalty.R f590768dbed832890df0ea1db8fd39e6 *R/getArgs.R d889a02088cf85fb26cf073d14748ead *R/logmean.R 56107e17c3594b8cb228ded8ec364066 *R/metaMix-package.r -461534451d91e1e9e70310b1113228dc *R/step1_preprocess.R +aae30c54a5c29cb1b32e0bad89cf3bc4 *R/step1_preprocess.R 7b4dc11d4a1451e034cd4ae52afd8c4d *R/step2_allInclusive_MM.R -c3de500e56eac8cfeab2e301ebae2877 *R/step3_parallelTemper.R -c87e556577fce9e7a38935c99fc15db4 *R/step3_parallelTemper_nucl.R -48a09cd6ccd020d527c10b7a2f6df3e5 *R/step4_bayesModelAver.R +e3b4bda895cb90eb59d6ab692169dd18 *R/step3_parallelTemper.R +fc5b75bb51bd9b71e7811886bfb9c472 *R/step3_parallelTemper_nucl.R +53147ec4e352d030fe46b0a2a9f5b500 *R/step4_bayesModelAver.R a80ff79ae5bd51e3928d200932b61940 *R/zzz.R -364413a8569dba7c2d700554cd514631 *build/vignette.rds 
+2be6d8e7d363324ccfeb8a016b90b00f *build/vignette.rds b873ed41550485a23a9ef1d07c340dcf *data/step1.RData 2c2debad39291aa20f133159ec44ad7f *data/step2.RData e06525e0f87c09ff5241a1c1bc7913db *data/step3.RData eaea8c38db30c6e0cefb736fcc994087 *inst/doc/guide.R -0f2434e09c960732a14cfd8cf46fcaf7 *inst/doc/guide.Rnw -21e319045554055867d99fe9acd2c4ae *inst/doc/guide.pdf +ff1f2e74fca0c2ccf5955266abba88b9 *inst/doc/guide.Rnw +c46aae1b9a99e49b78bbcb5bfadb54a5 *inst/doc/guide.pdf 42a7d808b5f0df7e6030c9f32c5eca6f *inst/extdata/PT_plots.RData f943aed8a86d436f375cd866e5963f5a *inst/extdata/blastOut_custom.tab 12ae20acc7ac927796fbd9c6d9182e03 *inst/extdata/blastOut_default.tab @@ -39,14 +39,15 @@ 969c0d1f53be01283db36d9b2ddea6bd *inst/extdata/names_example.dmp 0a3116df0500ab3b0826fbaea931c3ac *inst/extdata/read_lengths.tab 6da431dbc9f4fbcfc47e177bcb2d9a22 *inst/extdata/read_weights.tab -34951018b9e1d31ccddec2d51e87c17c *man/bayes.model.aver.Rd -f99cb11f746b179e9f2ac42804379003 *man/generative.prob.Rd -91fc553a31865685ebf8486c5f9590a4 *man/parallel.temper.Rd -67dac4f76a6221f654eb8198d472c1a8 *man/parallel.temper.nucl.Rd -419b51e838d6c648937497158cd6fa84 *man/reduce.space.Rd -96927c5374e301f9e60fd06727de6a61 *man/step1.Rd -55f3e04c84e04a5656501483056b94fb *man/step2.Rd -750cf9fdcb6c749591fd1be146f23c98 *man/step3.Rd +a4bfaea0e759be02b6d9aa1fbc27f96d *man/bayes.model.aver.Rd +b62837a5026cdab4273c7d11f6d2214d *man/generative.prob.Rd +d7fd9e7b31ae70dfdd5367b7afa685ca *man/parallel.temper.Rd +2c11b5bb703e28571cd3e73c2a7c51fa *man/parallel.temper.nucl.Rd +e3a5aed79ae31189612b4dc186203be1 *man/reduce.space.Rd +738fa72ad3beeeb4f4f63e49a59b522f *man/step1.Rd +4e4c624ca379fad02d08fdc7c24c14bd *man/step2.Rd +54da4a8fc9494bdc2d957375bb77f4c9 *man/step3.Rd 14cd8c3d566b0b93a63b2df78347b84c *src/fast_multinom_weight.cpp -0f2434e09c960732a14cfd8cf46fcaf7 *vignettes/guide.Rnw +fe7c8d7aca8b273b92379e458b9857a0 *src/init.c +ff1f2e74fca0c2ccf5955266abba88b9 *vignettes/guide.Rnw 
03a8857d3f74dd307bb5b668ce392843 *vignettes/guide.Rnw.bck diff -Nru r-cran-metamix-0.2/NAMESPACE r-cran-metamix-0.3/NAMESPACE --- r-cran-metamix-0.2/NAMESPACE 2015-03-15 10:33:16.000000000 +0000 +++ r-cran-metamix-0.3/NAMESPACE 2019-02-06 13:49:48.000000000 +0000 @@ -1,4 +1,4 @@ -# Generated by roxygen2 (4.0.1): do not edit by hand +# Generated by roxygen2: do not edit by hand export(bayes.model.aver) export(bayes.model.aver.explicit) @@ -14,5 +14,12 @@ import(Rmpi) import(data.table) import(ggplot2) +importFrom(grDevices,dev.off) +importFrom(grDevices,pdf) +importFrom(graphics,plot) importFrom(gtools,rdirichlet) +importFrom(stats,ppois) +importFrom(stats,quantile) +importFrom(stats,runif) +importFrom(utils,write.table) useDynLib(metaMix) diff -Nru r-cran-metamix-0.2/R/Gibbs.R r-cran-metamix-0.3/R/Gibbs.R --- r-cran-metamix-0.2/R/Gibbs.R 2015-03-15 11:54:19.000000000 +0000 +++ r-cran-metamix-0.3/R/Gibbs.R 2018-12-10 15:03:34.000000000 +0000 @@ -39,7 +39,7 @@ nj <- colSums(assignmWeighted) -######## STEP4. Generate w^(t) from \pi (w|z^(t)). Sample w from Dirichlet (a1+n1, ..., ak+nk) +######## STEP4. Generate w^(t) from \pi(w|z^(t)). Sample w from Dir~(a1+n1, ..., ak+nk) ####new parameters for Dirichlet alpha<-nj + hyperParam abund<-rdirichlet(1,alpha) diff -Nru r-cran-metamix-0.2/R/step1_preprocess.R r-cran-metamix-0.3/R/step1_preprocess.R --- r-cran-metamix-0.2/R/step1_preprocess.R 2015-03-15 10:28:26.000000000 +0000 +++ r-cran-metamix-0.3/R/step1_preprocess.R 2019-02-07 13:12:08.000000000 +0000 @@ -1,4 +1,3 @@ - ################################################ CALCULATE pij: PROBABILITY READ IS GENERATED BY CERTAIN SPECIES ###############################################Use Poisson probabilities to incorporate quality of match. 
#' @name generative.prob #' @aliases generative.prob @@ -8,10 +7,12 @@ #' @rdname generative.prob #' @title generative.prob #' @description generative.prob() computes the probability for a read to be generated by a certain species, given that it originates from this species. The input for this function is the tabular BLAST output format, either default or custom. The function uses the recorded mismatches to produce a Poisson probability. -#' @param blast.output.file This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). It can either be the default output format or a specific custom output format, incorporating read length and taxon identifier. Please see the vignette for column order and the exact BLAST command to use. +#' @param blast.output.file This is the tabular BLASTx output format for generative.prob(), while it is the tabular BLASTn output format for generative.prob.nucl(). It can either be the default output format or a specific custom output format, incorporating read length and taxon identifier. Please see the vignette for column order and the exact BLAST command to use. You can also use DIAMOND instead of BLASTx which is much faster and produces default format according to BLAST default output specifications. #' @param read.length.file This argument can either be a file mapping each read to its length or a numerical value, representing the average read length. #' @param contig.weight.file This argument can either be a file where weights are assigned to reads and contigs. For unassembled reads the weight is equal to 1 while for contigs the weight should reflect the number of reads that assembled it. -#' @param gi.taxon.file For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz} . 
For generative.prob.nucl() this would be the 'gi_taxid_nucl.dmp' taxonomy file, mapping each nucleotide gi identifier to the corresponding taxon identifier. It an be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz}. +#' @param gi.taxon.file For generative.prob() this would be the 'gi_taxid_prot.dmp' taxonomy file, mapping each protein gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz} . For generative.prob.nucl() this would be the 'gi_taxid_nucl.dmp' taxonomy file, mapping each nucleotide gi identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz}. +#' @param protaccession.taxon.file This parameter has been added as NCBI is phasing out the usage of GI identifiers. For generative.prob() this would be the prot.accession2taxid taxonomy file, mapping each protein accession identifier to the corresponding taxon identifier. It can be downloaded from \url{ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz}. I have found that it is useful to concatenate it with \url{ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz} so you can search in both files for the protein identifier (sometimes obsolete sequences can still be present in latest RefSeq releases but not in taxonomy files and vice versa and these mismatches can cause loss of information). TODO add support for nucleotides as well. +#' @param gi.or.prot This parameter specifies whether the user is using the GI identifiers or protein accession identifiers to map to taxon identifiers. Values are 'gi' or 'prot'. The default value is 'prot'. #' @param gen.prob.unknown User-defined generative probability for unknown category. Default value for generative.prob() is 1e-06, while for generative.prob.nucl() is 1e-20. #' @param outDir Output directory. 
#' @param blast.default logical. Is the input the default blast output tabular format? Default value is TRUE. That means that the BLAST output file needs to have the following fields:Query id, Subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score. Alternatively we can use the 'blast.default=FALSE' option, providing a custom blast output that has been produced using the option -outfmt '6 qacc qlen sacc slen stitle bitscore length pident evalue staxids'. @@ -19,6 +20,8 @@ #' @keywords generative.prob #' @export generative.prob #' @import Matrix data.table +#' @importFrom stats ppois +#' @importFrom stats quantile #' @examples #' # See vignette for more details #' @@ -34,9 +37,28 @@ #' } ############################################################################################################################################################## -generative.prob = function(blast.output.file=NULL, read.length.file=80, contig.weight.file=1, gi.taxon.file=NULL, gen.prob.unknown=1e-06, outDir=NULL, blast.default=TRUE){ +generative.prob = function(blast.output.file=NULL, read.length.file=80, contig.weight.file=1, gi.taxon.file=NULL, protaccession.taxon.file=NULL, gi.or.prot="prot", gen.prob.unknown=1e-06, outDir=NULL, blast.default=TRUE){ if (blast.default){ + + if (gi.or.prot=="gi") { + if (is.null(gi.taxon.file)) { + stop("Please provide the 'gi_taxid_prot.dmp' file. This can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz") + } else { + taxon.prot<-fread(input=gi.taxon.file, sep="\t", header=F) + setnames(x=taxon.prot, old=c("proteinID","taxonID" )) + } + } + + if (gi.or.prot=="prot") { + if (is.null(protaccession.taxon.file)) { + stop("Please provide the 'prot.accession2taxid' file. This can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz . 
I have found that it is useful to concatenate it with ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_prot.accession2taxid.gz so you can search in both for the protein identifier, since obsolete sequences can still be present in RefSeq releases. However due to its size you need to increase the memory requirements quite a bit, more than 20G.") + } else { + taxon.prot<-fread(input=protaccession.taxon.file, sep="\t", header=T, select=c(2,3)) + setnames(x=taxon.prot, old=c("proteinID", "taxonID")) + } + } + if (!is.null(blast.output.file)) { @@ -48,8 +70,13 @@ } else if (ncol(check.output)==12){ blast.output<-fread(input=blast.output.file, sep="\t", header=F, select=c(1,3, 4, 12)) setnames(x=blast.output, old=c("read", "ident", "aln", "bit.score")) - + + if (gi.or.prot=="prot"){ + protID<-fread(input=blast.output.file, sep="|", header=F, select=c(4)) + } else { protID<-fread(input=blast.output.file, sep="|", header=F, select=c(2)) + + } setnames(x=protID, old="proteinID") blast.out.gi<-cbind.data.frame(blast.output, protID) @@ -79,21 +106,22 @@ } else { stop("Please provide either a file containing the weights for reads (=1) and contigs (>1) or, assuming you haven't assembled contigs enter contig.weight.file=1.") } + + rm(blast.output) + gc() - if (is.null(gi.taxon.file)) { - stop("Please provide the 'gi_taxid_prot.dmp' file. 
It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz") - } else { - taxon.prot<-fread(input=gi.taxon.file, sep="\t", header=F) - setnames(x=taxon.prot, old=c("proteinID","taxonID" )) - } - - blast.length<-merge(blast.out.gi, read.length, by="read") + rm(blast.out.gi) + gc() + blast.length$mismatch <- (blast.length[["length"]]/3)-(blast.length[["ident"]]* blast.length[["aln"]])/100 ## the mismatches throughout read length not hsp length blast.length.weight<-merge(blast.length, contig.weights, by ="read", all.x=T) + rm(blast.length) + gc() + indx<-which(is.na(blast.length.weight[["weight"]])) blast.length.weight$weight[indx]<-1 read.weights<- unique(blast.length.weight[,c("read","weight"), with=FALSE]) @@ -102,8 +130,8 @@ allInfo.temp<-merge(blast.length.weight, taxon.prot, by="proteinID", all.x=T) - if (length(unique(allInfo.temp[which(is.na(allInfo.temp[["taxonID"]])),get('proteinID')]))!=0){ - message("Some of the proteins in your reference database are not in the 'gi_taxon_prot.dmp' file. Therefore these cannot be assigned a taxon identifier. \n We will remove these hits from subsequent analyses. \n For reference the respective gis are: ") + if (length(unique(allInfo.temp[which(is.na(allInfo.temp[["taxonID"]])), get('proteinID')]))!=0){ + message("Some of the proteins in your reference database are not in the 'gi_taxid_prot.dmp' or the 'prot.accession2taxid' file (depending on which you used; if you used the latter, it may be worth appending dead_prot.accession2taxid.gz to it, see the help for download details). Therefore these cannot be assigned a taxon identifier. \n We will remove these hits from subsequent analyses.
\n For reference the respective gis are: ") print(unique(allInfo.temp[which(is.na(allInfo.temp[["taxonID"]])),get('proteinID')])) allInfo<-merge(blast.length.weight, taxon.prot, by="proteinID") } else { @@ -118,7 +146,9 @@ data.grouped<-data.dt[,list(pij=max(pij)),by=c("read", "weight", "taxonID")] ### one hit per taxonid (best one) pij<-as.data.frame(data.grouped) + taxonID<-NULL + ############use Sparse Matrix "reshape" long -> wide fromat pij.sparse<-with(pij, sparseMatrix(i = as.numeric(factor(get('read'))), j=as.numeric((factor(taxonID))), x=pij, dimnames=list(levels(factor(get('read'))), levels(factor(taxonID))))) @@ -129,11 +159,11 @@ ordered.species<-allspecies[order(-allspecies[,2]) , ] #### order them by read count ### add unknown bin and assign a pij - pij.sparse.mat<-cBind(pij.sparse, gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse, gen.prob.unknown) colnames(pij.sparse.mat)<-c(colnames(pij.sparse), "unknown") if (length(which(!(rownames(read.weights)%in%rownames(pij.sparse.mat))))!=0) { - message("The following reads match only proteins that are no longer supported, i.e their gis are not in 'gi_taxon_prot.dmp'. \n Therefore the reads were removed from subsequent analyses.") + message("The following reads match only proteins that are no longer supported, i.e. their GIs are not in 'gi_taxid_prot.dmp' or their protein accessions are not in 'prot.accession2taxid' (depending on which you used; if you used the latter, it may be worth appending dead_prot.accession2taxid.gz to it, see the help for download details).
\n Therefore the reads were removed from subsequent analyses as these could be misclassified.") print(rownames(read.weights[which(!(rownames(read.weights)%in%rownames(pij.sparse.mat))),])) read.weights<-read.weights[which(rownames(read.weights)%in%rownames(pij.sparse.mat)),] read.weights<-data.frame(read.weights, row.names=read.weights[["read"]] ) @@ -178,17 +208,32 @@ check.output<-fread(input=blast.output.file, sep="\t", header=F) if (ncol(check.output)!=10) { ###quick check for format - stop("Please check your blast output file. The custom blast tabular format we accept as input has the following 10 columns 'qacc qlen sacc slen stitle bitscore length pident evalue staxids'. Alternatively you can use the default blast tabular output and use it along with 'blast.default=TRUE' option. You will need also to provide the 'gi_taxid_prot.dmp' file, which can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz") + stop("Please check your blast output file. The custom blast tabular format we accept as input has the following 10 columns 'qacc qlen sacc slen stitle bitscore length pident evalue staxids'. Alternatively you can use the default blast tabular output and use it along with 'blast.default=TRUE' option. You will also need to provide the 'gi_taxid_prot.dmp' or the 'prot.accession2taxid' file.
The first can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz and the latter from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz") } else if (ncol(check.output==10)) { - blast.output.length<-fread(input=blast.output.file, sep="\t", header=F, select=c(1,2, 7, 8, 10)) - setnames(x=blast.output.length, old=c("read", "length", "aln", "ident", "taxonID")) + blast.output.length.temp<-fread(input=blast.output.file, sep="\t", header=F, select=c(1,2, 7, 8, 10)) + setnames(x=blast.output.length.temp, old=c("read", "length", "aln", "ident", "taxonID")) } else { stop("Please provide the output file from BLASTx. The default tabular format is accepted, using the '-outfmt 6' flag in the BLASTx command.") } } - + + taxonID<-NULL + read<-NULL + length<-NULL + aln<-NULL + ident<-NULL + + if ( is.character(blast.output.length.temp[,taxonID])){ + + blast.output.length.temp2 <- blast.output.length.temp[, list(taxonID = as.integer(unlist(strsplit(taxonID, ";")))), by=list(read, length, aln, ident)] ## handle semicolon-separated taxonIDs and NAs + blast.output.length<- blast.output.length.temp2[which(!is.na(blast.output.length.temp2[,taxonID])),] + + } else { + blast.output.length<-blast.output.length.temp + + } if (!is.integer(blast.output.length[["taxonID"]])) { stop("Your blast output does not have correctly formatted taxon identifiers.
Please use the 'blast.default=TRUE' option, providing the default blast tabular format (using the blast option '-outfmt 6' and provide seperately the 'gi_taxid_prot.dmp' file.") @@ -203,7 +248,7 @@ contig.weights<-cbind.data.frame("read"=unique(blast.output.length[["read"]]), "weight"=contig.weight.file) rownames(contig.weights)<-contig.weights[,"read"] } else { - stop("Please provide either a file containing the weights for reads (=1) and contigs (>1) or, assuming you haven't assembled contigs enter contig.weight.file=1.") + stop("Please provide either a file containing the weights for reads (=1) and contigs (>1) or, assuming you haven't assembled contigs enter contig.weight.file=1") } @@ -211,9 +256,13 @@ blast.output.length$mismatch <- (blast.output.length[["length"]]/3)-(blast.output.length[["ident"]]* blast.output.length[["aln"]])/100 ## the mismatches throughout read length not hsp length blast.length.weight<-merge(blast.output.length, contig.weights, by ="read", all.x=T) + + rm(blast.length) + gc() + indx<-which(is.na(blast.length.weight[["weight"]])) blast.length.weight$weight[indx]<-1 - + read.weights<- unique(blast.length.weight[,c("read","weight"), with=FALSE]) rownames(read.weights)<-read.weights[["read"]] @@ -224,7 +273,7 @@ data.dt<- data.table(data_0.03) data.grouped<-data.dt[,list(pij=max(pij)),by=c("read", "weight", "taxonID")] ### one hit per taxonid (best one) pij<-as.data.frame(data.grouped) - + ############use Sparse Matrix "reshape" long -> wide fromat pij.sparse<-with(pij, sparseMatrix(i = as.numeric(factor(get('read'))), j=as.numeric((factor(taxonID))), x=pij, dimnames=list(levels(factor(get('read'))), levels(factor(taxonID))))) @@ -235,7 +284,7 @@ ordered.species<-allspecies[order(-allspecies[,2]) , ] #### order them by read count ### add unknown bin and assign a pij - pij.sparse.mat<-cBind(pij.sparse, gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse, gen.prob.unknown) colnames(pij.sparse.mat)<-c(colnames(pij.sparse), "unknown") if 
(length(which(!(rownames(read.weights)%in%rownames(pij.sparse.mat))))!=0) { @@ -401,6 +450,9 @@ data.grouped<-data.dt[,list(pij=max(pij)),by=c("read", "weight", "taxonID")] ### one hit per taxonid (best one) pij<-as.data.frame(data.grouped) + + taxonID<-NULL + ############use Sparse Matrix "reshape" long -> wide fromat pij.sparse<-with(pij, sparseMatrix(i = as.numeric(factor(get('read'))), j=as.numeric((factor(taxonID))), x=pij, dimnames=list(levels(factor(get('read'))), levels(factor(taxonID))))) @@ -410,7 +462,7 @@ ordered.species<-allspecies[order(-allspecies[,2]) , ] #### order them by read count ### add unknown bin and assign a pij - pij.sparse.mat<-cBind(pij.sparse, gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse, gen.prob.unknown) colnames(pij.sparse.mat)<-c(colnames(pij.sparse), "unknown") if (length(which(!(rownames(read.weights)%in%rownames(pij.sparse.mat))))!=0) { @@ -464,14 +516,34 @@ } else if (ncol(check.output)==10){ - blast.output.length<-fread(input=blast.output.file, sep="\t", header=F, select=c(1,2, 4, 7, 8, 10)) - setnames(x=blast.output.length, old=c("read", "length", "genome.length", "aln", "ident", "taxonID")) + blast.output.length.temp<-fread(input=blast.output.file, sep="\t", header=F, select=c(1,2, 4, 7, 8, 10)) + setnames(x=blast.output.length.temp, old=c("read", "length", "genome.length", "aln", "ident", "taxonID")) } else { stop("Please provide the output file from BLASTx. 
The default tabular format is accepted, using the '-outfmt 6' flag in the BLASTx command.") } } - + + taxonID<-NULL + read<-NULL + length<-NULL + aln<-NULL + ident<-NULL + genome.length<-NULL + + + if ( is.character(blast.output.length.temp[,taxonID])){ + + blast.output.length.temp2 <- blast.output.length.temp[, list(taxonID = as.integer(unlist(strsplit(taxonID, ";")))), by=list(read, length, genome.length, aln, ident)] ## handle semicolon-separated taxonIDs and NAs + blast.output.length<- blast.output.length.temp2[which(!is.na(blast.output.length.temp2[,taxonID])),] + + } else { + blast.output.length<-blast.output.length.temp + + } + + + if (!is.integer(blast.output.length[["taxonID"]])) { stop("Your blast output does not have correctly formatted taxon identifiers. Please use the 'blast.default=TRUE' option, providing the default blast tabular format (using the blast option '-outfmt 6' and provide seperately the 'gi_taxid_prot.dmp' file.") } @@ -488,7 +560,7 @@ } - blast.output.length$mismatch<-ifelse(blast.output.length[["length"]]>=200, (1-blast.output.length[["ident"]]/100)*blast.output.length[["aln"]], blast.output.length[["length"]]-(blast.output.length[["ident"]]* blast.output.length[["aln"]])/100) +blast.output.length$mismatch<-ifelse(blast.output.length[["length"]]>=200, (1-blast.output.length[["ident"]]/100)*blast.output.length[["aln"]], blast.output.length[["length"]]-(blast.output.length[["ident"]]* blast.output.length[["aln"]])/100) # blast.output.length$mismatch<-mismatchNew @@ -519,7 +591,7 @@ ordered.species<-allspecies[order(-allspecies[,2]) , ] #### order them by read count ### add unknown bin and assign a pij - pij.sparse.mat<-cBind(pij.sparse, gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse, gen.prob.unknown) colnames(pij.sparse.mat)<-c(colnames(pij.sparse), "unknown") if (length(which(!(rownames(read.weights)%in%rownames(pij.sparse.mat))))!=0) { diff -Nru r-cran-metamix-0.2/R/step3_parallelTemper_nucl.R
r-cran-metamix-0.3/R/step3_parallelTemper_nucl.R --- r-cran-metamix-0.2/R/step3_parallelTemper_nucl.R 2015-03-15 10:28:26.000000000 +0000 +++ r-cran-metamix-0.3/R/step3_parallelTemper_nucl.R 2019-02-07 13:15:35.000000000 +0000 @@ -16,6 +16,7 @@ #' @export parallel.temper.nucl #' @import Rmpi Matrix #' @importFrom gtools rdirichlet +#' @importFrom stats runif ###################################################################################################################### parallel.temper.nucl = function(step2, readSupport=30, noChains=12, seed=1, median.genome.length=284332){ @@ -69,12 +70,13 @@ } print("Please use mpi.quit() to quit R") .Call("mpi_finalize", PACKAGE='metaMix') + } } - pij.sparse.mat<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) gc() @@ -458,7 +460,7 @@ } mpi.close.Rslaves(dellog=FALSE) - # mpi.quit() + mpi.quit() return(step3) } @@ -518,12 +520,13 @@ } print("Please use mpi.quit() to quit R") .Call("mpi_finalize", PACKAGE='metaMix') + } } -pij.sparse.mat<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) +pij.sparse.mat<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) gc() @@ -908,8 +911,11 @@ mpi.close.Rslaves(dellog=FALSE) -#mpi.quit() +#mpi.quit() ## terminates MPI execution environment and quits R + mpi.exit() ##terminates MPI execution environment and detaches the library Rmpi. After that, you can still work on R + + return(step3) } diff -Nru r-cran-metamix-0.2/R/step3_parallelTemper.R r-cran-metamix-0.3/R/step3_parallelTemper.R --- r-cran-metamix-0.2/R/step3_parallelTemper.R 2015-03-15 10:28:26.000000000 +0000 +++ r-cran-metamix-0.3/R/step3_parallelTemper.R 2019-02-07 13:12:50.000000000 +0000 @@ -7,6 +7,8 @@ #' @description Performs Parallel Tempering MCMC to explore the species state space. Two types of moves are implemented: a mutation step (within chain) and an exchange step (between neighboring chains). If working with BLASTn data, use parallel.temper.nucl(). 
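The description above mentions the exchange step between neighboring tempered chains. The diff does not show metaMix's acceptance rule itself, so the following is only the textbook Metropolis swap criterion for two chains at inverse temperatures beta_a and beta_b, sketched in Python for illustration (metaMix is R; these names are hypothetical):

```python
import math

def swap_accept_prob(beta_a, beta_b, logL_a, logL_b):
    # Textbook Metropolis acceptance probability for exchanging the
    # states of two tempered chains:
    #   min(1, exp((beta_a - beta_b) * (logL_b - logL_a)))
    return min(1.0, math.exp((beta_a - beta_b) * (logL_b - logL_a)))
```

When the hotter chain (smaller beta) currently holds the higher log-likelihood, the swap is always accepted; a successful exchange is exactly how a species can be present in the coldest chain at one iteration and gone at the next, as the vignette excerpt at the end of this diff describes.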
#' @param readSupport The number of reads the user requires in order to believe in the presence of the species. It is used to compute the penalty factor. The default value is 10. We compute the logarithmic penalty value as the log-likelihood difference between two models: one where all N reads belong to the "unknown" category and one where r reads have a perfect match to some unspecified species and the remaining reads belong to the "unknown" category. #' @param noChains The number of parallel chains to run. The default value is 12. +#' @param iter The number of MCMC iterations. The default behavior of metaMix is to take into account the number of potential species after step 2 in order to compute the number of MCMC iterations. By default metaMix will choose the greater value between a) the user-specified value for iter and b) the product of (5 * the number of potential species). This behavior can be bypassed by setting the bypass parameter to TRUE. Then the MCMC will run for exactly the user-specified number of iterations, iter. +#' @param bypass A logical flag. If set to TRUE the MCMC will run for exactly "iter" iterations. If FALSE, metaMix defaults to choosing the greater value between "iter" and "5*(nrow(ordered.species))". #' @param seed Optional argument that sets the random seed (default is 1) to make results reproducible. #' @param step2 list. The output from reduce.space(). Alternatively, it can be a character string containing the path name of the ".RData" file where step2 list was saved. #' @return step3: A list with two elements. The first one (result) is a list that records MCMC information from each parallel chain. The second one (duration) records how much time the MCMC exploration took.
@@ -15,6 +17,7 @@ #' @export parallel.temper #' @import Rmpi Matrix #' @importFrom gtools rdirichlet +#' @importFrom stats runif #' @examples #' ## See vignette for more details #' @@ -27,7 +30,7 @@ #' step3 <- parallel.temper(step2="/pathtoFile/step2.RData") #' } ###################################################################################################################### -parallel.temper = function(step2, readSupport=10, noChains=12, seed=1){ +parallel.temper = function(step2, readSupport=10, noChains=12, seed=1, iter=500, bypass=FALSE){ if (is.character(step2)) { load(step2) @@ -42,7 +45,9 @@ stop() } else { - parallel.temper.wrapped<-function(readSupport.internal=readSupport, noChains.internal=noChains, pij.sparse.mat=step2$pij.sparse.mat, read.weights=step2$read.weights, ordered.species=step2$ordered.species, gen.prob.unknown=step2$gen.prob.unknown, outDir=step2$outDir, seed.internal=seed){ +# parallel.temper.wrapped<-function(readSupport.internal=readSupport, noChains.internal=noChains, pij.sparse.mat=step2$pij.sparse.mat, read.weights=step2$read.weights, ordered.species=step2$ordered.species, gen.prob.unknown=step2$gen.prob.unknown, outDir=step2$outDir, seed.internal=seed){ + + parallel.temper.wrapped<-function(readSupport.internal=readSupport, noChains.internal=noChains, pij.sparse.mat=step2$pij.sparse.mat, read.weights=step2$read.weights, ordered.species=step2$ordered.species, gen.prob.unknown=step2$gen.prob.unknown, outDir=step2$outDir, seed.internal=seed, iter.internal=iter){ set.seed(seed.internal); #print(warnings()) @@ -77,12 +82,13 @@ } print("Please use mpi.quit() to quit R") .Call("mpi_finalize", PACKAGE='metaMix') + } } - pij.sparse.mat<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) + pij.sparse.mat<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) #rm(step2) gc() @@ -97,7 +103,17 @@ ### PT parameters exchangeInterval<-1 ###leave chains run in parallel for that many iterations - ExternIter<-5*(nrow(ordered.species)) ### make chains 
communicate 1 times + + if (bypass==FALSE){ + if (iter.internal > 5*(nrow(ordered.species))) { ### choose the number of iterations that is greater between 1) the user-defined number 2) the product of the number of potential species times 5 + ExternIter<-iter.internal + } else { + ExternIter<-5*(nrow(ordered.species)) ### make chains communicate 1 times + } + } else { + ExternIter<-iter.internal + } + TotalIter<-exchangeInterval * ExternIter ##Tempering Vector --Power Decay @@ -467,8 +483,9 @@ mpi.close.Rslaves(dellog=FALSE) - # mpi.quit() - + #mpi.quit() ## terminates MPI execution environment and quits R +# mpi.exit() ##terminates MPI execution environment and detaches the library Rmpi. After that, you can still work on R, problems with detaching Rmpi + mpi.finalize() return(step3) } parallel.temper.wrapped() @@ -489,7 +506,7 @@ #' @import Rmpi Matrix #' @importFrom gtools rdirichlet -parallel.temper.explicit<-function(readSupport=10, noChains=12, pij.sparse.mat, read.weights, ordered.species, gen.prob.unknown, outDir, seed = 1){ +parallel.temper.explicit<-function(readSupport=10, noChains=12, pij.sparse.mat, read.weights, ordered.species, gen.prob.unknown, outDir, seed = 1, iter=500, bypass=FALSE){ set.seed(seed); #print(warnings()) @@ -524,12 +541,13 @@ } print("Please use mpi.quit() to quit R") .Call("mpi_finalize", PACKAGE='metaMix') + } } -pij.sparse.mat<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) +pij.sparse.mat<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) #rm(step2) gc() @@ -544,9 +562,21 @@ ### PT parameters exchangeInterval<-1 ###leave chains run in parallel for that many iterations before attempting exchange -ExternIter<-5*(nrow(ordered.species)) ### make chains communicate 1 times +#ExternIter<-5*(nrow(ordered.species)) ### make chains communicate 1 times + + if (bypass==FALSE){ + if (iter > 5*(nrow(ordered.species))) { ### choose the number of iterations that is greater between 1) the user-defined number 2) the product of the number of 
potential species times 5 + ExternIter<-iter + } else { + ExternIter<-5*(nrow(ordered.species)) ### make chains communicate 1 times + } + } else { + ExternIter<-iter + } + TotalIter<-exchangeInterval * ExternIter + ##Tempering Vector --Power Decay temper<-vector() K<-0.001 @@ -914,7 +944,9 @@ mpi.close.Rslaves(dellog=FALSE) -#mpi.quit() - +## mpi.quit() ## terminates MPI execution environment and quits R +## mpi.exit() ##terminates MPI execution environment and detaches the library Rmpi. After that, you can still work on R + mpi.finalize() + return(step3) } diff -Nru r-cran-metamix-0.2/R/step4_bayesModelAver.R r-cran-metamix-0.3/R/step4_bayesModelAver.R --- r-cran-metamix-0.2/R/step4_bayesModelAver.R 2015-03-15 10:28:26.000000000 +0000 +++ r-cran-metamix-0.3/R/step4_bayesModelAver.R 2019-02-07 13:13:07.000000000 +0000 @@ -8,11 +8,16 @@ #' @param step3 list. The output from parallel.temper(), i.e the third step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step3 list was saved. #' @param step2 list. The output from reduce.space(), i.e the second step of the pipeline. Alternatively, it can be a character string containing the path name of the ".RData" file where step2 list was saved. #' @param taxon.name.map The 'names.dmp' taxonomy names file, mapping each taxon identifier to the corresponding scientific name. It can be downloaded from \url{ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz} -#' @param poster.prob.thr Posterior probability of presence of species threshold for reporting in the species summary. +#' @param poster.prob.thr Posterior probability of presence of species threshold for reporting in the species summary. 
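bayes.model.aver() summarizes the coldest chain's record after discarding an initial burn-in fraction of iterations (this version parameterizes it as burnin, default 0.4), reporting each species' posterior probability of presence as the mean of its 0/1 inclusion indicator over the retained iterations. A hedged Python mirror of that bookkeeping (metaMix is R; these helper names are invented):

```python
def burn_in_start(n_iter, burnin=0.4):
    # Mirrors record[round(nIter*burnin):nIter, ] from the diff: index of
    # the first retained MCMC iteration after discarding the burn-in.
    return round(n_iter * burnin)

def posterior_presence(indicator):
    # Posterior probability of presence: mean of the 0/1 inclusion
    # indicator over the retained iterations.
    return sum(indicator) / len(indicator)
```

With the default burnin=0.4, only the last 60% of iterations contribute to the posterior probabilities and to the second log-likelihood traceplot.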
+#' @param burnin Proportion of MCMC iterations to discard as burn-in; the default value is 0.4 (40%) #' @keywords bayes.model.aver #' @export bayes.model.aver #' @import Matrix ggplot2 data.table #' @importFrom gtools rdirichlet +#' @importFrom grDevices dev.off +#' @importFrom grDevices pdf +#' @importFrom graphics plot +#' @importFrom utils write.table #' @useDynLib metaMix #' @examples #' ## See vignette for more details @@ -30,7 +35,7 @@ #' } ###################################################################################################################### -bayes.model.aver = function(step2, step3, taxon.name.map=NULL, poster.prob.thr=0.9){ +bayes.model.aver = function(step2, step3, taxon.name.map=NULL, poster.prob.thr=0.9, burnin=0.4){ if (is.character(step2)) { load(step2) @@ -70,7 +75,7 @@ nIter<- nrow(result$slave1$record) - pijSparseUnknown<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) + pijSparseUnknown<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) # message("Associate taxonIDs with scientific names: reading \"names.dmp\" can take a few minutes") @@ -82,7 +87,7 @@ taxonNames<-as.data.frame(taxonNames) - ofInterest<-result$slave1$record[round(nIter/5):nIter,4:(ncol(result$slave1$record)-1)] ##burn-in 20% + ofInterest<-result$slave1$record[round(nIter*burnin):nIter,4:(ncol(result$slave1$record)-1)] ##burn-in 40% present.probabilities<- round(apply(ofInterest, MARGIN=2, function(x) sum(x)/length(x)), digits=2) poster.prob.all<-present.probabilities[present.probabilities>0] @@ -101,41 +106,53 @@ if (length(present.probabilities[present.probabilities>0.5])==1) { - stop("The method did not find any present organisms in this dataset, defining as present species with posterior probability greater than 0.5. Maybe you used the wrong reference database to annotate your sequences?") + stop("The method did not find any present organisms in this dataset, defined as present species with posterior probability greater than 0.5.
Maybe you used the wrong reference database to annotate your sequences?") } poster.probM<-as.data.frame(poster.prob) poster.probM$taxonID<-rownames(poster.probM) poster.prob.final<-merge(taxonNames, poster.probM, by.y="taxonID", by.x="taxonID", all.y=T) + + ### compute Bayes Factor namesBF<-rownames(poster.probM)[!(rownames(poster.probM)%in%"unknown")] - ofInterestBF<-result$slave1$record[round(nIter/5):nIter,c(namesBF, "logL")] ##burn-in 20% + ofInterestBF<-result$slave1$record[round(nIter*burnin):nIter,c(namesBF, "logL")] ##burn-in, default 40% BayesFactor<-matrix(0, ncol=2, nrow=(length(colnames(ofInterestBF))-1) ) BayesFactor<-as.data.frame(BayesFactor) colnames(BayesFactor)<-c("taxonID", "log10BF") EMiter<-10 - pij.sparse.unknown<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown) - + pij.sparse.unknown<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown) + + for ( i in 1:(length(colnames(ofInterestBF))-1) ){ BayesFactor[i,"taxonID"]<-colnames(ofInterestBF)[i] - if (!all(ofInterestBF[,colnames(ofInterestBF)[i]]==1)) { - BayesFactor[i,"log10BF"]<-logmean(ofInterestBF[which(ofInterestBF[,colnames(ofInterestBF)[i]]==1),"logL"])/log(10) - logmean(ofInterestBF[which(ofInterestBF[,colnames(ofInterestBF)[i]]==0),"logL"])/log(10) + + if (length(colnames(ofInterestBF))-1==1) { + onlyUnknown.logL<- sum(read.weights[,"weight"])*log(gen.prob.unknown) + + BayesFactor[i,"log10BF"]<-logmean(ofInterestBF[which(ofInterestBF[,colnames(ofInterestBF)[i]]==1),"logL"])/log(10) - onlyUnknown.logL + } else { - maxLog<- ofInterestBF[which(ofInterestBF[,"logL"]==max(ofInterestBF[which(ofInterestBF[,i]==1),"logL"]))[1],] - namesSp<-colnames(maxLog[which(maxLog==1)]) - tempSet<- namesSp[!(namesSp %in% BayesFactor[i, "taxonID"])] - tentSet<- c(tempSet,"unknown") - noSpecies<-length(tentSet) - hyperP<-rep(1, noSpecies) - startW<-rdirichlet(1, hyperP) - output10Tent<-EM(pij=pij.sparse.unknown, iter=EMiter, species=tentSet, abund=startW, readWeights = read.weights) + + if
 (!all(ofInterestBF[,colnames(ofInterestBF)[i]]==1)) {
+        BayesFactor[i,"log10BF"]<-logmean(ofInterestBF[which(ofInterestBF[,colnames(ofInterestBF)[i]]==1),"logL"])/log(10) - logmean(ofInterestBF[which(ofInterestBF[,colnames(ofInterestBF)[i]]==0),"logL"])/log(10)
+      } else {
+        maxLog<- ofInterestBF[which(ofInterestBF[,"logL"]==max(ofInterestBF[which(ofInterestBF[,i]==1),"logL"]))[1],]
+        namesSp<-colnames(maxLog[which(maxLog==1)])
+        tempSet<- namesSp[!(namesSp %in% BayesFactor[i, "taxonID"])]
+        tentSet<- c(tempSet,"unknown")
+        noSpecies<-length(tentSet)
+        hyperP<-rep(1, noSpecies)
+        startW<-rdirichlet(1, hyperP)
+        output10Tent<-EM(pij=pij.sparse.unknown, iter=EMiter, species=tentSet, abund=startW, readWeights = read.weights)
         #lpenalty<-(computePenalty(readSupport=result$readSupport, readWeights=read.weights, pUnknown=gen.prob.unknown))/log(10)
-        lpenalty<-(result$slave1$lpenalty)/log(10)
-        estimator <- (output10Tent$logL[EMiter,2])/log(10) + (noSpecies * lpenalty)
-        BayesFactor[i,"log10BF"]<-(maxLog[,"logL"])/log(10) - estimator
+        lpenalty<-(result$slave1$lpenalty)/log(10)
+        estimator <- (output10Tent$logL[EMiter,2])/log(10) + (noSpecies * lpenalty)
+        BayesFactor[i,"log10BF"]<-(maxLog[,"logL"])/log(10) - estimator
+      }
     }
   }
@@ -240,9 +257,9 @@
   plot(result$slave1$record[1:nIter,"logL"], type="l", xlab="All iterations", ylab="Log-likelihood", main="Parallel Tempering - Coldest Chain", lwd=1.5)
   dev.off()
-  traceplot2<-paste(outDir, "/logLikelihood_traceplot_80.pdf", sep="")
+  traceplot2<-paste(outDir, "/logLikelihood_traceplot_60.pdf", sep="")
   pdf(traceplot2)
-  plot(result$slave1$record[(nIter/5):nIter,"logL"], type="l", col="dodgerblue", xlab="Last 80% of iterations", ylab="Log-likelihood", main="Parallel Tempering - Coldest Chain", lwd=1.5)
+  plot(result$slave1$record[(nIter*burnin):nIter,"logL"], type="l", col="dodgerblue", xlab="Last 60% of iterations", ylab="Log-likelihood", main="Parallel Tempering - Coldest Chain", lwd=1.5)
   dev.off()
@@ -285,7 +302,7 @@
 #' @useDynLib metaMix
 ##########################-------------------------------------------- MAIN ---------------------------------------------------------------------------------
-bayes.model.aver.explicit<-function(result, pij.sparse.mat, read.weights, outDir, gen.prob.unknown, taxon.name.map=NULL, poster.prob.thr=0.9){
+bayes.model.aver.explicit<-function(result, pij.sparse.mat, read.weights, outDir, gen.prob.unknown, taxon.name.map=NULL, poster.prob.thr=0.9, burnin=0.4){
   fast.rmultinom.weight <- function(proba.matrix, z.matrix, seed, weights) {
     return( .Call("C_rmultinom_weight", proba.matrix, z.matrix, weights, PACKAGE='metaMix') )
@@ -300,7 +317,7 @@
   nIter<- nrow(result$slave1$record)
-  pijSparseUnknown<-cBind(pij.sparse.mat, "unknown"=gen.prob.unknown)
+  pijSparseUnknown<-cbind(pij.sparse.mat, "unknown"=gen.prob.unknown)
   message("Associate taxonIDs with scientific names: reading \"names.dmp\" could take a couple of minutes")
@@ -312,7 +329,7 @@
   taxonNames<-as.data.frame(taxonNames)
-  ofInterest<-result$slave1$record[round(nIter/5):nIter,4:(ncol(result$slave1$record)-1)] ##burn-in 20%
+  ofInterest<-result$slave1$record[round(nIter*burnin):nIter,4:(ncol(result$slave1$record)-1)] ##burn-in 20%
   present.probabilities<- round(apply(ofInterest, MARGIN=2, function(x) sum(x)/length(x)), digits=2)
   poster.prob.all<-present.probabilities[present.probabilities>0]
@@ -433,7 +450,7 @@
   traceplot2<-paste(outDir, "/logLikelihood_traceplot_80.pdf", sep="")
   pdf(traceplot2)
-  plot(result$slave1$record[(nIter/5):nIter,"logL"], type="l", col="dodgerblue", xlab="Last 80% of iterations", ylab="Log-likelihood", main="Parallel Tempering - Coldest Chain", lwd=1.5)
+  plot(result$slave1$record[(nIter*burnin):nIter,"logL"], type="l", col="dodgerblue", xlab="Last 80% of iterations", ylab="Log-likelihood", main="Parallel Tempering - Coldest Chain", lwd=1.5)
   dev.off()
   #message("Results in ", summary.name)
diff -Nru r-cran-metamix-0.2/src/init.c r-cran-metamix-0.3/src/init.c
--- r-cran-metamix-0.2/src/init.c	1970-01-01 00:00:00.000000000 +0000
+++ r-cran-metamix-0.3/src/init.c	2019-02-07 15:41:40.000000000 +0000
@@ -0,0 +1,24 @@
+#include <R.h>
+#include <Rinternals.h>
+#include <stdlib.h> // for NULL
+#include <R_ext/Rdynload.h>
+
+/* FIXME:
+   Check these declarations against the C/Fortran source code.
+*/
+
+/* .Call calls */
+extern SEXP C_rmultinom_weight(SEXP, SEXP, SEXP);
+/*extern SEXP mpi_finalize();*/
+
+static const R_CallMethodDef CallEntries[] = {
+    {"C_rmultinom_weight", (DL_FUNC) &C_rmultinom_weight, 3},
+    /* {"mpi_finalize", (DL_FUNC) &mpi_finalize, 0},*/
+    {NULL, NULL, 0}
+};
+
+void R_init_metaMix(DllInfo *dll)
+{
+    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
+    R_useDynamicSymbols(dll, FALSE);
+}
diff -Nru r-cran-metamix-0.2/vignettes/guide.Rnw r-cran-metamix-0.3/vignettes/guide.Rnw
--- r-cran-metamix-0.2/vignettes/guide.Rnw	2015-03-15 11:17:29.000000000 +0000
+++ r-cran-metamix-0.3/vignettes/guide.Rnw	2019-02-07 15:39:43.000000000 +0000
@@ -24,9 +24,15 @@
 options(tidy=TRUE, width=80)
 @
+
+
 \section{What is new}
 \textbf{Version 0.2:} Added Bayes Factor computation.
+\textbf{Version 0.3:} Allow user to specify number of MCMC iterations (Step 3) and Burn-in percentage (Step 4). Due to NCBI phasing out GI numbers, added support for protein accession identifier file for mapping to taxon (STEP1). Fixed bug in step4 for Bayes Factor calculation when only one species is present, that is only human and unknown. Also fixed bug with semi-colon separated taxonids in custom BLAST format, as well as NAs (STEP 1). Replaced deprecated cBind function with cbind.
+
+
+
 \section{Installation}
 You will need to have openMPI (Message Passage Interface) installed to be able to install the R package \verb!Rmpi!, which provides the interface to openMPI.
@@ -270,6 +276,9 @@
 At iter. 13 species 2 was present, while at the next one, it no longer is there.
 That means that the attempt at swapping the values of the two neighboring chains was successful.
 This information is also recorded, i.e. how many swaps were attempted and how many accepted.
+Each chain will produce a log file that will be printed in your working directory.
+
+
 \subsection*{Step4}
 Having explored the different possible models, the final step is to perform model averaging.
@@ -315,17 +324,15 @@
 \section{Submit jobs on cluster compute servers}
 In order to run steps 1, 2 and 4 of metaMix (i.e \verb!generative.prob!, \verb!reduce.space!, \verb!bayes.model.aver!) efficiently, these should be submitted as jobs to a compute cluster.
-In our experience, 4G of memory, 1 hour of wall clock time and 1 processor should be plenty.
-
-In order to run the parallel tempering efficiently, we need at least 12 parallel chains, each with at least 1G-2G of RAM.
+In our experience, 4G of memory, 1 hour of wall clock time and 1 processor should be plenty.
+In order to run the parallel tempering efficiently (step3), we need at least 12 parallel chains. I usually request 6 or 12 threads, each with 2G of RAM.
 The wall clock time depends on how many iterations will be performed.
-Also a larger number of reads mean that the computations will become slower.
-We typically ask for 12 hours to be on the safe side.
+A larger number of reads means that the computations will become slower.
+We typically submit all 4 steps in one go using one submission script.
+I usually assess the size of the dataset, but 12 hours for all 4 steps should be safe.
-
-This is a sample submission script for the third step.
-It requests 12 processors on 1 node for 12 hours.
+This is a sample submission script; it requests 6 processors on 1 node for 12 hours (more processors, less time needed).
 \begin{verbatim}
 #!/bin/bash
@@ -333,17 +340,30 @@
 #$ -o cluster/out
 #$ -e cluster/error
 #$ -cwd
-#$ -pe smp 12
-#$ -l tmem=1.1G,h_vmem=1.1G
+#$ -pe smp 6
+#$ -l tmem=2G,h_vmem=2G
 #$ -l h_rt=12:00:00
 #$ -V
 #$ -R y
-mpirun -np 1 R-3.0.1/bin/R --slave CMD BATCH --no-save --no-restore step3.R
+mpirun -np 1 R-3.0.1/bin/R --slave CMD BATCH --no-save --no-restore allsteps.R
+\end{verbatim}
+
+An example allsteps.R looks like this:
+\begin{verbatim}
+library(metaMix)
+
+step1<-generative.prob(blast.output.file="sample_diamond.tab", read.length.file="read_lengths.tab", contig.weight.file="contig_weights.tab", outDir="./", gi.taxon.file="gi_taxid_prot.dmp", blast.default=TRUE, gi.or.prot="gi")
+
+step2<-reduce.space(step1="step1.RData")
+
+step3<-parallel.temper(step2="step2.RData")
+
+step4<-bayes.model.aver(step2="step2.RData", step3="step3.RData", taxon.name.map="names.dmp")
+
+
 \end{verbatim}
-in step3.R, we simply load the object produced from \verb!reduce.space! and then call \verb!generative.prob!.
-Each chain will produce a log file that will be printed in your working directory
 \section{Technical information about the R session}
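Reviewer note on the `burnin` parameter this patch threads through `bayes.model.aver.explicit`: version 0.2 hard-coded a 20% burn-in via `round(nIter/5)`, while 0.3 uses `round(nIter*burnin)` with a default of 0.4, so posterior summaries and trace plots keep the last 60% of recorded MCMC iterations by default. A minimal sketch of the indexing change, with a hypothetical `nIter` (not taken from the package):

```r
## Illustrative comparison of the retained MCMC window, 0.2 vs 0.3.
nIter  <- 1000          # hypothetical number of recorded MCMC iterations
burnin <- 0.4           # new user-settable burn-in fraction (0.3 default)

keep_old <- round(nIter / 5):nIter       # 0.2: fixed 20% burn-in -> 200:1000
keep_new <- round(nIter * burnin):nIter  # 0.3: burnin fraction   -> 400:1000

length(keep_old)   # 801 iterations kept
length(keep_new)   # 601 iterations kept
```

A larger `burnin` discards more pre-convergence samples at the cost of fewer iterations contributing to the species presence probabilities.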