diff -Nru swarm-cluster-2.1.5/debian/changelog swarm-cluster-2.1.6/debian/changelog --- swarm-cluster-2.1.5/debian/changelog 2015-09-22 16:59:05.000000000 +0000 +++ swarm-cluster-2.1.6/debian/changelog 2015-12-31 12:49:47.000000000 +0000 @@ -1,3 +1,9 @@ +swarm-cluster (2.1.6-1) unstable; urgency=medium + + * New upstream version + + -- Andreas Tille Thu, 31 Dec 2015 13:49:43 +0100 + swarm-cluster (2.1.5-1) unstable; urgency=medium * New upstream version diff -Nru swarm-cluster-2.1.5/man/swarm.1 swarm-cluster-2.1.6/man/swarm.1 --- swarm-cluster-2.1.5/man/swarm.1 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/man/swarm.1 2015-12-14 10:01:36.000000000 +0000 @@ -1,5 +1,5 @@ .\" ============================================================================ -.TH swarm 1 "September 8, 2015" "version 2.1.5" "USER COMMANDS" +.TH swarm 1 "December 14, 2015" "version 2.1.6" "USER COMMANDS" .\" ============================================================================ .SH NAME swarm \(em find clusters of nearly-identical nucleotide amplicons @@ -14,25 +14,26 @@ .SH DESCRIPTION Environmental or clinical molecular studies generate large volumes of amplicons (e.g., 16S or 18S SSU-rRNA sequences) that need to be -clustered into molecular operational taxonomic units. Common +clustered into molecular operational taxonomic units (OTUs). Common clustering methods are based on greedy, input-order dependent algorithms, with arbitrary selection of global cluster size and cluster centroids. To address that problem, we developed \fBswarm\fR, a fast and robust method that recursively groups amplicons with -\fId\fR or less differences. \fBswarm\fR produces stable clusters (or -"swarms"), free from centroid selection induced input-order -dependency. +\fId\fR or less differences. \fBswarm\fR produces natural and stable +clusters centered on local peaks of abundance, free from centroid +selection induced input-order dependency. 
.PP Exact clustering is impractical on large data sets when using a naïve all-vs-all approach (more precisely a 2-combination without repetitions), as it implies unrealistic numbers of pairwise comparisons. \fBswarm\fR is based on a maximum number of differences -\fId\fR, and focuses only on very close relationships. For \fId\fR = 1 -(default), swarm uses an algorithm of linear complexity that performs -exact-string matching by comparing hash-values. For \fId\fR = 2 or -greater, swarm uses an algorithm of quadratic complexity that performs -pairwise string comparisons. An efficient \fIk\fR-mer-based filtering -and an astute use of comparisons results obtained during the process +\fId\fR between two amplicons, and focuses only on very close local +relationships. For \fId\fR = 1 (default value), swarm uses an +algorithm of linear complexity that performs exact-string matching by +comparing hash-values. For \fId\fR = 2 or greater, swarm uses an +algorithm of quadratic complexity that performs pairwise string +comparisons. An efficient \fIk\fR-mer-based filtering and an astute +use of comparisons results obtained during the clustering process allows to avoid most of the amplicon comparisons needed in a naïve approach. To speed up the remaining amplicon comparisons, \fBswarm\fR implements an extremely fast Needleman-Wunsch algorithm making use of @@ -60,54 +61,55 @@ .SS General options .TP 9 .BI \-b\fP,\fB\ \-\-boundary\~ "positive integer" -when using the option \-\-fastidious (\-f) , define the minimum mass -of a large OTU as the number given with this option. The default value -is 3, indicating that any OTU with mass 3 or more is considered large. -By default, an OTU is small if it has a mass of 2 or less, meaning +when using the option \-\-fastidious (\-f), define the minimum mass of +a large OTU as the number given with this option. The default value is +3, indicating that any OTU with mass 3 or more is considered "large". 
+By default, an OTU is "small" if it has a mass of 2 or less, meaning that it is composed of either one amplicon of abundance 2, or two amplicons of abundance 1. Any positive value greater than 1 can be specified. Using higher boundary values will speed up the second pass, -but also reduce the taxonomical resolution of swarm results. +but also reduce the taxonomical resolution of \fBswarm\fR results. .TP .BI \-c\fP,\fB\ \-\-ceiling\~ "positive integer" -when using the option \-\-fastidious (\-f), define swarm's maximum -memory footprint (in megabytes). Swarm will adjust the \-\-bloom\-bits -(\-y) value of the Bloom filter to fit within the specified amount of -memory. That option is not active by default. +when using the option \-\-fastidious (\-f), define \fBswarm\fR's +maximum memory footprint (in megabytes). \fBswarm\fR will adjust the +\-\-bloom\-bits (\-y) value of the Bloom filter to fit within the +specified amount of memory. That option is not active by default. .TP .BI \-d\fP,\fB\ \-\-differences\~ "zero or positive integer" maximum number of differences allowed between two amplicons, meaning that two amplicons will be grouped if they have \fIinteger\fR (or -less) differences. This is swarm's most important parameter. The +less) differences. This is \fBswarm\fR's most important parameter. The number of differences is calculated as the number of mismatches (substitutions, insertions or deletions) between the two amplicons once the optimal pairwise global alignment has been found (see -"advanced options" for parameters influencing the pairwise -alignment). Any \fIinteger\fR between 0 and 256 can be used, but high -\fId\fR values will decrease the taxonomical resolution of swarm -results. Commonly used values are 1, 2 or 3, rarely higher. When using -\fId\fR = 0, swarm will output results corresponding to a strict -dereplication of the dataset (warning, swarm still requires fasta -entries to present abundance values). Default number of differences is -1. 
+"pairwise alignment advanced options" to influencing that step). Any +\fIinteger\fR between 0 and 256 can be used, but high \fId\fR values +will decrease the taxonomical resolution of \fBswarm\fR +results. Commonly used \fId\fR values are 1, 2 or 3, rarely +higher. When using \fId\fR = 0, \fBswarm\fR will output results +corresponding to a strict dereplication of the dataset, i.e. merging +identical amplicons. Warning, \fBswarm\fR still requires fasta entries +to present abundance values. Default number of differences is 1. .TP .B \-f\fP,\fB\ \-\-fastidious when working with \fId\fR = 1, perform a second clustering pass to -reduce the number of small OTUs (recommended option). During the stone -stepping process at \fId\fR = 1, a step can be missing for purely -stochastic reasons, interrupting the agglomerative process. That -option will create virtual steps to graft small OTUs upon bigger -ones. By default, an OTU is small if it has a mass of 2 or less (see -the \-\-boundary option to increase that value). To speed things up, -swarm uses a Bloom filter to store intermediate results. Warning, that -second pass can be 2 to 3 times slower than the first pass and -requires much more memory. See the options \-\-bloom\-bits (\-y) or -\-\-ceiling (\-c) to control the memory footprint of the Bloom -filter. Warning, the fastidious option modifies OTUs: output files +reduce the number of small OTUs (recommended option). During the +clustering process with \fId\fR = 1, an intermediate amplicon can be +missing for purely stochastic reasons, interrupting the aggregation +process. That option will create virtual amplicons, allowing to graft +small OTUs upon bigger ones. By default, an OTU is "small" if it has a +mass of 2 or less (see the \-\-boundary option to increase that +value). To speed things up, \fBswarm\fR uses a Bloom filter to store +intermediate results. Warning, that second pass can be 2 to 3 times +slower than the first pass and requires much more memory. 
See the +options \-\-bloom\-bits (\-y) or \-\-ceiling (\-c) to control the +memory footprint of the Bloom filter. Warning, the fastidious option +modifies clustering results. The output files produced by the options \-\-log (\-l), \-\-output\-file (\-o), \-\-mothur (\-r), \-\-uclust\-file, and \-\-seeds (\-w) are updated to reflect these modifications; the file \-\-statistics\-file (\-s) is partially -updated (columns 6 and 7 are not updated); output file +updated (columns 6 and 7 are not updated); the output file \-\-internal\-structure (\-i) is not updated. .TP .B \-h\fP,\fB\ \-\-help @@ -117,9 +119,9 @@ deactivate the built-in OTU refinement (not recommended). Amplicon abundance values are used to identify transitions among in-contact OTUs and to separate them, yielding higher-resolution clustering -results. That option prevents that, and in practice, allows the -creation of a link between amplicon A and B, even if the abundance of -B is higher than the abundance of A. +results. That option prevents that separation, and in practice, allows +the creation of a link between amplicons A and B, even if the +abundance of B is higher than the abundance of A. .TP .BI \-t\fP,\fB\ \-\-threads\~ "positive integer" number of computation threads to use. The number of threads should be @@ -137,15 +139,18 @@ will require more memory. Any value between 4 and 20 can be used. Default value is 16. See the \-\-ceiling (\-c) option for an alternative way to control the memory footprint. -.TP -.BI \-a\fP,\fB\ \-\-append\-abundance\~ "positive integer" -the abundance value to use when abundance information about the -sequence is missing in the input file. A fatal error will be -generated if abundance is not given and this option is not present. 
.LP .\" ---------------------------------------------------------------------------- .SS Input/output options .TP 9 +.BI \-a\fP,\fB\ \-\-append\-abundance\~ "positive integer" +set abundance value to use when some or all amplicons in the input +file lack abundance values. Warning, it is not recommended to use +\fBswarm\fR on datasets where abundance values are all identical. We +provide that option as a courtesy to advanced users, please use it +carefully. \fBswarm\fR exits with an error message if abundance values +are missing and if this option is not used. +.TP .BI \-i\fP,\fB\ \-\-internal\-structure \0filename output all pairs of nearly-identical amplicons to \fIfilename\fR using a five-columns tab-delimited format: @@ -176,29 +181,43 @@ (for example, with certain job schedulers). .TP .BI \-o\fP,\fB\ \-\-output\-file \0filename -output result to \fIfilename\fR. Result is a list of OTUs, one OTU -per line. A OTU is a list of amplicon identifiers separated by -spaces. Default is to write to standard output. +output clustering results to \fIfilename\fR. Results consist of a list +of OTUs, one OTU per line. An OTU is a list of amplicon identifiers +separated by spaces. Default is to write to standard output. .TP .B \-r\fP,\fB\ \-\-mothur -output results in a format compatible with Mothur. That option -modifies swarm's default output format. +output clustering results in a format compatible with Mothur. That +option modifies \fBswarm\fR's default output format. .TP .BI \-s\fP,\fB\ \-\-statistics\-file \0filename -output statistics to the specified file. 
The file is a tab-separated -table with one OTU per row and seven columns of information: number of -unique amplicons in the OTU, total copy number of amplicons in the -OTU, identifier of the initial seed, initial seed copy number (if -applicable), number of singletons (amplicons with a copy number of 1), -maximum number of generations (i.e., numbers of iterations before the -OTU reached its natural limits), and the theoretical maximum radius of -the OTU (i.e., number of cummulated differences between the seed and -the furthermost amplicon in the OTU. The actual pairwise distance and -max radius is often much smaller). +output statistics to \fIfilename\fR. The file is a tab-separated table +with one OTU per row and seven columns of information: +.RS +.RS +.nr step 1 1 +.IP \n[step]. 4 +number of unique amplicons in the OTU, +.IP \n+[step]. +total copy number of amplicons in the OTU, +.IP \n+[step]. +identifier of the initial seed, +.IP \n+[step]. +initial seed copy number, +.IP \n+[step]. +number of amplicons with a copy number of 1 in the OTU, +.IP \n+[step]. +maximum number of iterations before the OTU reached its natural +limits), +.IP \n+[step]. +theoretical maximum radius of the OTU (i.e., number of cummulated +differences between the seed and the furthermost amplicon in the +OTU). The actual maximum radius of the OTU is often much smaller. +.RE +.RE .TP .BI \-u\fP,\fB\ \-\-uclust\-file \0filename -output results in uclust-like file format to the specified file. That -option does not modify swarm default output format. +output clustering results in uclust-like file format to the specified +file. That option does not modify \fBswarm\fR's default output format. .TP .BI \-w\fP,\fB\ \-\-seeds \0filename output OTU representatives to \fIfilename\fR in fasta format. The @@ -206,46 +225,50 @@ all the amplicons in the OTU. .TP .B \-z\fP,\fB\ \-\-usearch\-abundance -accept amplicon abundances specified using the usearch/vsearch style -(">label;size=\fIinteger\fR[;]"). 
That option influences the +accept amplicon abundance values in usearch/vsearch's style +(>label;size=\fIinteger\fR[;]). That option influences the abundance annotation style used in output files. +.\" which files are modified? -w at least. .LP .\" ---------------------------------------------------------------------------- -.SS Pairwise alignment options +.SS Pairwise alignment advanced options when using \fId\fR > 1, \fBswarm\fR recognizes advanced command-line options modifying the pairwise global alignment scoring parameters: .RS .TP 9 .BI \-m\fP,\fB\ \-\-match\-reward\~ "positive integer" -reward for a nucleotide match. Default is 5. +set the reward for a nucleotide match. Default is 5. .TP .BI \-p\fP,\fB\ \-\-mismatch\-penalty\~ "positive integer" -penalty for a nucleotide mismatch. Default is 4. +set the penalty for a nucleotide mismatch. Default is 4. .TP .BI \-g\fP,\fB\ \-\-gap\-opening\-penalty\~ "positive integer" -gap open penalty. Default is 12. +set the gap open penalty. Default is 12. .TP .BI \-e\fP,\fB\ \-\-gap\-extension\-penalty\~ "positive integer" -gap extension penalty. Default is 4. +set the gap extension penalty. Default is 4. .LP -As \fBswarm\fR focuses on close relationships, final results are -resilient to model parameters modifications. Modifying model -parameters only impacts analysis using a high number of differences. +.RE +As \fBswarm\fR focuses on close relationships (i.e. \fId\fR = 2 or 3), +clustering results are resilient to pairwise alignment model +parameters modifications. Modifying model parameters has a stronger +impact when clustering using a higher \fId\fR value. 
.\" classic parameters are +5/-4/-12/-1 .\" ============================================================================ .SH EXAMPLES .PP -Divide the data set \fImyfile.fasta\fR into OTUs with the finest +Clusterize the data set \fImyfile.fasta\fR into OTUs with the finest resolution possible (1 difference, built-in breaking, fastidious option) using 4 computation threads. OTUs are written to the file -\fImyfile.swarms\fR. +\fImyfile.swarms\fR, and OTU representatives are written to +\fImyfile.representatives.fasta\fR. .PP .RS .B swarm -\-t 4 \-f \-o -.I myfile.swarms myfile.fasta +\-t 4 \-f \-w +.I myfile.representatives.fasta < myfile.fasta > myfile.swarms .RE -.PP +.LP .\" ============================================================================ .\" .SH LIMITATIONS .\" List known limitations or bugs. @@ -254,9 +277,13 @@ Concept by Frédéric Mahé, implementation by Torbjørn Rognes. .\" ============================================================================ .SH CITATION -Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: -robust and fast clustering method for amplicon-based -studies. \fIPeerJ\fR 2:e593 +Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) +Swarm: robust and fast clustering method for amplicon-based studies. +\fIPeerJ\fR 2:e593 +.PP +Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) +Swarm v2: highly-scalable and high-resolution amplicon clustering. +\fIPeerJ\fR 3:e1420 .\" ============================================================================ .SH REPORTING BUGS Submit suggestions and bug-reports at @@ -302,25 +329,32 @@ or minor bug releases are not mentioned): .RS .TP +.BR v2.1.6\~ "released December 14, 2015" +Version 2.1.6 fixes problems with older compilers that do not have +the x86intrin.h header file. It also fixes a bug in the output of seeds +with the `-w` option when d>1. +.TP .BR v2.1.5\~ "released September 8, 2015" Version 2.1.5 fixes minor bugs. 
.TP .BR v2.1.4\~ "released September 4, 2015" -Version 2.1.4 fixes minor bugs in the swarm algorithm used for d=1. +Version 2.1.4 fixes minor bugs in the swarm algorithm used for \fId\fR += 1. .TP .BR v2.1.3\~ "released August 28, 2015" Version 2.1.3 adds checks of numeric option arguments. .TP .BR v2.1.1\~ "released March 31, 2015" -Version 2.1.1 fixes a bug with the fastidious option that caused it -to ignore some connections between heavy and light swarms. +Version 2.1.1 fixes a bug with the fastidious option that caused it to +ignore some connections between large and small OTUs. .TP .BR v2.1.0\~ "released March 24, 2015" -Version 2.1.0 marks the first official release of swarm 2. +Version 2.1.0 marks the first official release of swarm v2. .TP .BR v2.0.7\~ "released March 18, 2015" Version 2.0.7 writes abundance information in usearch style when using -options \-w (\-\-seeds) in combination with \-z (\-\-usearch\-abundance). +options \-w (\-\-seeds) in combination with \-z +(\-\-usearch\-abundance). .TP .BR v2.0.6\~ "released March 13, 2015" Version 2.0.6 fixes a minor bug. @@ -330,16 +364,16 @@ adds options to control memory usage of the Bloom filter (\-y and \-c). In addition, an option (\-w) allows to output OTU representatives sequences with updated abundances (sum of all -abundances inside each OTU). This version also enables dereplication -when \fId\fR = 0. +abundances inside each OTU). This version also enables \fBswarm\fR to +run with \fId\fR = 0. .TP .BR v2.0.4\~ "released March 6, 2015" Version 2.0.4 includes a fully parallelised implementation of the fastidious option. .TP .BR v2.0.3\~ "released March 4, 2015" -Version 2.0.3 includes a working implementation of the fastidious option, -but only the initial clustering is parallelized. +Version 2.0.3 includes a working implementation of the fastidious +option, but only the initial clustering is parallelized. .TP .BR v2.0.2\~ "released February 26, 2015" Version 2.0.2 fixes SSSE3 problems. 
@@ -352,13 +386,13 @@ Version 2.0.0 is faster and easier to use, providing new output options (\-\-internal\-structure and \-\-log), new control options (\-\-boundary, \-\-fastidious, \-\-no\-otu\-breaking), and built-in -OTU refinement. When using default parameters, a novel and -considerably faster algorithmic approach is used, guaranteeing swarm's -scalability. +OTU refinement (no need to use the python script anymore). When using +default parameters, a novel and considerably faster algorithmic +approach is used, guaranteeing \fBswarm\fR's scalability. .TP .BR v1.2.21\~ "released February 26, 2015" -Version 1.2.21 is supposed to fix some problems related to the use of the -SSSE3 CPU instructions which are not always available. +Version 1.2.21 is supposed to fix some problems related to the use of +the SSSE3 CPU instructions which are not always available. .TP .BR v1.2.20\~ "released November 6, 2014" Version 1.2.20 presents a production-ready version of the alternative @@ -394,8 +428,8 @@ in order of decreasing abundance. .TP .BR v1.2.14\~ "released September 27, 2014" -Version 1.2.14 fixes a bug in the output with the swarm_breaker option -(\-b) when using the alternative algorithm (\-a). +Version 1.2.14 fixes a bug in the output with the \-\-swarm_breaker +option (\-b) when using the alternative algorithm (\-a). .TP .BR v1.2.12\~ "released August 18, 2014" Version 1.2.12 introduces an option \-\-alternative\-algorithm to use @@ -404,31 +438,32 @@ has been noticeably improved. .TP .BR v1.2.10\~ "released August 8, 2014" -allows amplicon abundances to be specified using the usearch style in -the sequence header (e.g. ">id;size=1") when the \-z option is chosen. +Version 1.2.10 allows amplicon abundances to be specified using the +usearch style in the sequence header (e.g. ">id;size=1") when the \-z +option is chosen. .TP .BR v1.2.8\~ "released August 5, 2014" -swarm 1.2.8 fixes an error with the gap extension penalty. 
Previous +Version 1.2.8 fixes an error with the gap extension penalty. Previous versions used a gap penalty twice as large as intended. That bug correction induces small changes in clustering results. .TP .BR v1.2.6\~ "released May 23, 2014" -Version 1.2.6 introduces an option \-\-mothur to output swarm results in -a format compatible with the microbial ecology community analysis -software suite Mothur. +Version 1.2.6 introduces an option \-\-mothur to output clustering +results in a format compatible with the microbial ecology community +analysis software suite Mothur (). .TP .BR v1.2.5\~ "released April 11, 2014" Version 1.2.5 removes the need for a POPCNT hardware instruction to be -present. Swarm now automatically checks whether POPCNT is available -and uses a slightly slower software implementation if not. Only basic -SSE2 instructions are now required to run swarm. +present. \fBswarm\fR now automatically checks whether POPCNT is +available and uses a slightly slower software implementation if +not. Only basic SSE2 instructions are now required to run \fBswarm\fR. .TP .BR v1.2.4\~ "released January 30, 2014" Version 1.2.4 introduces an option \-\-break\-swarms to output all pairs of amplicons with \fId\fR differences to standard error. That option is used by the companion script `swarm_breaker.py` to refine -swarm results. The syntax of the inline assembly code is changed for -compatibility with more compilers. +\fBswarm\fR results. The syntax of the inline assembly code is changed +for compatibility with more compilers. .TP .BR v1.2\~ "released May 16, 2013" Version 1.2 greatly improves speed by using alignment-free comparisons @@ -436,20 +471,20 @@ presence-absence of all possible 5-mers is computed and recorded in a 1024-bits vector. Vector comparisons are extremely fast and drastically reduce the number of costly pairwise alignments performed -by swarm. 
While remaining exact, swarm 1.2 can be more than 100-times -faster than swarm 1.1, when using a single thread with a large set of -sequences. The minor version 1.1.1, published just before, adds -compatibility with Apple computers, and corrects an issue in the -pairwise global alignment step that could lead to sub-optimal +by \fBswarm\fR. While remaining exact, \fBswarm\fR 1.2 can be more +than 100-times faster than \fBswarm\fR 1.1, when using a single thread +with a large set of sequences. The minor version 1.1.1, published just +before, adds compatibility with Apple computers, and corrects an issue +in the pairwise global alignment step that could lead to sub-optimal alignments. .TP .BR v1.1\~ "released February 26, 2013" Version 1.1 introduces two new important options: the possibility to -output swarming results using the uclust output format, and the -possibility to output detailed statistics on each swarms. Swarm 1.1 is -also faster: new filterings based on pairwise amplicon sequence +output clustering results using the uclust output format, and the +possibility to output detailed statistics on each OTU. \fBswarm\fR 1.1 +is also faster: new filterings based on pairwise amplicon sequence lengths and composition comparisons reduce the number of pairwise -alignments needed and speed up the swarming. +alignments needed and speed up the clustering. .TP .BR v1.0\~ "released November 10, 2012" First public release. 
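The swarm.1 hunk above lists the seven tab-separated columns written by `--statistics-file` (`-s`). A minimal sketch of reading such a file back in Python follows. It assumes the file has no header row and that all numeric columns are integers. The field names and the sample row are hypothetical, chosen only for illustration; they are not part of swarm.

```python
import csv
import io
from typing import NamedTuple


class OtuStats(NamedTuple):
    """One row of the statistics file; field names are illustrative only."""
    unique_amplicons: int   # column 1: number of unique amplicons in the OTU
    total_abundance: int    # column 2: total copy number of amplicons
    seed_id: str            # column 3: identifier of the initial seed
    seed_abundance: int     # column 4: initial seed copy number
    singletons: int         # column 5: amplicons with a copy number of 1
    iterations: int         # column 6: maximum number of iterations
    max_radius: int         # column 7: theoretical maximum radius


def read_stats(handle):
    """Parse tab-separated statistics rows (assumed to have no header)."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        unique, total, seed, seed_ab, single, iters, radius = fields
        rows.append(OtuStats(int(unique), int(total), seed, int(seed_ab),
                             int(single), int(iters), int(radius)))
    return rows


# Hypothetical sample row, not real swarm output.
sample = "3\t12\tseq_a\t9\t1\t2\t2\n"
stats = read_stats(io.StringIO(sample))
print(stats[0].seed_id, stats[0].total_abundance)
```

As the man page notes for `--fastidious`, columns 6 and 7 are not updated by the second pass, so those two fields may be stale when `-f` is used.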
Binary files /tmp/tmpyiOBiV/9dGftMXfdq/swarm-cluster-2.1.5/man/swarm_manual.pdf and /tmp/tmpyiOBiV/xgnc7d0eXQ/swarm-cluster-2.1.6/man/swarm_manual.pdf differ diff -Nru swarm-cluster-2.1.5/README.md swarm-cluster-2.1.6/README.md --- swarm-cluster-2.1.5/README.md 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/README.md 2015-12-14 10:01:36.000000000 +0000 @@ -42,6 +42,7 @@ * [Third-party pipelines](#pipelines) * [Alternatives](#alternatives) * [New features](#features) + * [version 2.1.6](#version216) * [version 2.1.5](#version215) * [version 2.1.4](#version214) * [version 2.1.3](#version213) @@ -80,8 +81,6 @@ * [version 1.2.0](#version120) * [version 1.1.1](#version111) * [version 1.1.0](#version110) - * [Statistics](#stats) - * [Uclust-like output format](#uclust) ## Common misconceptions ## @@ -236,8 +235,22 @@ [user manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf)). The role of the dereplication step is to identify, merge and sort -identical sequences by decreasing abundance. Here is an example using -standard command line tools: +identical sequences by decreasing abundance. Here is a command using +[vsearch](https://github.com/torognes/vsearch) v1.3.3 or superior: + +```sh +vsearch \ + --derep_fulllength amplicons.fasta \ + --sizeout \ + --relabel_sha1 \ + --fasta_width 0 \ + --output amplicons_linearized_dereplicated.fasta +``` + +The command performs the dereplication, the linearization +(`--fasta_width 0`) and the renaming with hashing values +(`--relabel_sha1`). If you can't or don't want to use vsearch, here is +an example using standard command line tools: ```sh grep -v "^>" amplicons_linearized.fasta | \ @@ -371,13 +384,13 @@ To cite **swarm**, please refer to: -Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: -robust and fast clustering method for amplicon-based studies. PeerJ -2:e593 http://dx.doi.org/10.7717/peerj.593 - -Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. 
(in preparation) -Swarm v2: highly-scalable and high-resolution amplicon -clustering. https://dx.doi.org/10.7287/peerj.preprints.1222v2 +Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) +Swarm: robust and fast clustering method for amplicon-based studies. +PeerJ 2:e593 doi: [10.7717/peerj.593](http://dx.doi.org/10.7717/peerj.593) + +Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) +Swarm v2: highly-scalable and high-resolution amplicon clustering. +PeerJ 3:e1420 doi: [10.7717/peerj.1420](http://dx.doi.org/10.7717/peerj.1420) @@ -395,6 +408,10 @@ **swarm** is available in third-party pipelines: +* [FROGS](https://github.com/geraldinepascal/FROGS): a + [Galaxy](https://galaxyproject.org/)/CLI workflow designed to + produce an OTU count matrix from high depth sequencing amplicon + data. * [LotuS (v1.30)](http://psbweb05.psb.ugent.be/lotus/): extremely fast OTU building, annotation, phylogeny and abundance matrix pipeline, based on raw sequencer output. @@ -417,6 +434,13 @@ ## New features## + +### version 2.1.6 ### + +**swarm** 2.1.6 fixes problems with older compilers that do not have +the x86intrin.h header file. It also fixes a bug in the output of seeds +with the `-w` option when d>1. + ### version 2.1.5 ### @@ -685,25 +709,17 @@ the previous version on our test dataset. It also introduces two new output options: `statistics` and `uclust-like` format. - - -#### Statistics #### - By specifying the `-s` option to **swarm** it will now output detailed -statistics about each swarm to a specified file. It will print the -number of unique amplicons, the number of occurrences, the name of the -seed and its abundance, the number of singletons (amplicons with an -abundance of 1), the number of iterations and the maximum radius of -the swarm (i.e. number of differences between the seed and the -furthermost amplicon). When using input data sorted by decreasing -abundance, the seed is the most abundant amplicon in the swarm. 
- - - -#### Uclust-like output format #### + statistics about each swarm to a specified file. It will print the + number of unique amplicons, the number of occurrences, the name of + the seed and its abundance, the number of singletons (amplicons with + an abundance of 1), the number of iterations and the maximum radius + of the swarm (i.e. number of differences between the seed and the + furthermost amplicon). When using input data sorted by decreasing + abundance, the seed is the most abundant amplicon in the swarm. Some pipelines use the -[uclust output format](http://www.drive5.com/uclust/uclust_userguide_1_1_579.html#_Toc257997686 -"page describing the uclust output format") as input for subsequent -analyses. **swarm** can now output results in this format to a -specified file with the `-u` option. + [uclust output format](http://www.drive5.com/uclust/uclust_userguide_1_1_579.html#_Toc257997686 + "page describing the uclust output format") as input for subsequent + analyses. **swarm** can now output results in this format to a + specified file with the `-u` option. diff -Nru swarm-cluster-2.1.5/scripts/graph_plot.py swarm-cluster-2.1.6/scripts/graph_plot.py --- swarm-cluster-2.1.5/scripts/graph_plot.py 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/scripts/graph_plot.py 2015-12-14 10:01:36.000000000 +0000 @@ -79,8 +79,11 @@ with open(swarms, "rU") as swarms: for i, swarm in enumerate(swarms): if i == OTU - 1: - amplicons = [tuple(item.rsplit("_", 1)) # TODO: deal with ";size=" - for item in swarm.strip().split(" ")] + # Deal with ";size=" in a rather clumsy way... 
but it works + amplicons = [ + tuple( + item.replace(";size=", "_").rstrip(";").rsplit("_", 1)) + for item in swarm.strip().split(" ")] break # Drop amplicons with a low abundance (remove connections too) diff -Nru swarm-cluster-2.1.5/src/algo.cc swarm-cluster-2.1.6/src/algo.cc --- swarm-cluster-2.1.5/src/algo.cc 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/src/algo.cc 2015-12-14 10:01:36.000000000 +0000 @@ -516,6 +516,7 @@ unsigned long mass = 0; unsigned previd = amps[0].swarmid; unsigned prevamp = amps[0].ampliconid; + unsigned seed = prevamp; mass += db_getabundance(prevamp); for (unsigned long i=1; i"); - fprint_id_with_new_abundance(fp_seeds, prevamp, mass); + fprint_id_with_new_abundance(fp_seeds, seed, mass); fprintf(fp_seeds, "\n"); db_fprintseq(fp_seeds, prevamp, 0); mass = 0; + seed = amps[i].ampliconid; } previd = id; @@ -538,7 +540,7 @@ } fprintf(fp_seeds, ">"); - fprint_id_with_new_abundance(fp_seeds, prevamp, mass); + fprint_id_with_new_abundance(fp_seeds, seed, mass); fprintf(fp_seeds, "\n"); db_fprintseq(fp_seeds, prevamp, 0); diff -Nru swarm-cluster-2.1.5/src/swarm.cc swarm-cluster-2.1.6/src/swarm.cc --- swarm-cluster-2.1.5/src/swarm.cc 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/src/swarm.cc 2015-12-14 10:01:36.000000000 +0000 @@ -189,28 +189,30 @@ /* 01234567890123456789012345678901234567890123456789012345678901234567890123456789 */ fprintf(stderr, "Usage: swarm [OPTIONS] [filename]\n"); + fprintf(stderr, " -b, --boundary INTEGER min mass of large OTU for fastidious (3)\n"); + fprintf(stderr, " -c, --ceiling INTEGER max memory in MB used for fastidious\n"); fprintf(stderr, " -d, --differences INTEGER resolution (1)\n"); + fprintf(stderr, " -f, --fastidious link nearby low-abundance swarms\n"); fprintf(stderr, " -h, --help display this help and exit\n"); - fprintf(stderr, " -o, --output-file FILENAME output result filename (stdout)\n"); + fprintf(stderr, " -n, --no-otu-breaking never break OTUs\n"); fprintf(stderr, 
" -t, --threads INTEGER number of threads to use (1)\n"); fprintf(stderr, " -v, --version display version information and exit\n"); + fprintf(stderr, " -y, --bloom-bits INTEGER bits used per Bloom filter entry (16)\n"); + fprintf(stderr, "Input/output options:\n"); + fprintf(stderr, " -a, --append-abundance INTEGER value to use when abundance is missing\n"); + fprintf(stderr, " -i, --internal-structure FILENAME write internal swarm structure to file\n"); + fprintf(stderr, " -l, --log FILENAME log to file, not to stderr\n"); + fprintf(stderr, " -o, --output-file FILENAME output result filename (stdout)\n"); + fprintf(stderr, " -r, --mothur output in mothur list file format\n"); + fprintf(stderr, " -s, --statistics-file FILENAME dump swarm statistics to file\n"); + fprintf(stderr, " -u, --uclust-file FILENAME output in UCLUST-like format to file\n"); + fprintf(stderr, " -w, --seeds FILENAME write seed seqs with abundances to FASTA\n"); + fprintf(stderr, " -z, --usearch-abundance abundance annotation in usearch style\n"); + fprintf(stderr, "Pairwise alignment advanced options:\n"); fprintf(stderr, " -m, --match-reward INTEGER reward for nucleotide match (5)\n"); fprintf(stderr, " -p, --mismatch-penalty INTEGER penalty for nucleotide mismatch (4)\n"); fprintf(stderr, " -g, --gap-opening-penalty INTEGER gap open penalty (12)\n"); fprintf(stderr, " -e, --gap-extension-penalty INTEGER gap extension penalty (4)\n"); - fprintf(stderr, " -s, --statistics-file FILENAME dump swarm statistics to file\n"); - fprintf(stderr, " -u, --uclust-file FILENAME output in UCLUST-like format to file\n"); - fprintf(stderr, " -r, --mothur output in mothur list file format\n"); - fprintf(stderr, " -z, --usearch-abundance abundance annotation in usearch style\n"); - fprintf(stderr, " -i, --internal-structure FILENAME write internal swarm structure to file\n"); - fprintf(stderr, " -l, --log FILENAME log to file, not to stderr\n"); - fprintf(stderr, " -n, --no-otu-breaking never break OTUs\n"); 
- fprintf(stderr, " -w, --seeds FILENAME write seed seqs with abundances to FASTA\n"); - fprintf(stderr, " -f, --fastidious link nearby low-abundance swarms\n"); - fprintf(stderr, " -b, --boundary INTEGER min mass of large OTU for fastidious (3)\n"); - fprintf(stderr, " -y, --bloom-bits INTEGER bits used per Bloom filter entry (16)\n"); - fprintf(stderr, " -c, --ceiling INTEGER max memory in MB used for fastidious\n"); - fprintf(stderr, " -a, --append-abundance INTEGER value to use when abundance is missing\n"); fprintf(stderr, "\n"); fprintf(stderr, "See 'man swarm' for more details.\n"); } diff -Nru swarm-cluster-2.1.5/src/swarm.h swarm-cluster-2.1.6/src/swarm.h --- swarm-cluster-2.1.5/src/swarm.h 2015-09-08 12:32:42.000000000 +0000 +++ swarm-cluster-2.1.6/src/swarm.h 2015-12-14 10:01:36.000000000 +0000 @@ -25,26 +25,34 @@ #include #include #include -#include #include #include #include #include #include #include + #ifdef __APPLE__ #include #else #include #endif +#ifdef __SSE2__ +#include +#endif + +#ifdef __SSSE3__ +#include +#endif + /* constants */ #ifndef LINE_MAX #define LINE_MAX 2048 #endif -#define SWARM_VERSION "2.1.5" +#define SWARM_VERSION "2.1.6" #define WIDTH 32 #define WIDTH_SHIFT 5 #define BLOCKWIDTH 32
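The graph_plot.py hunk in this diff replaces the old `rsplit("_", 1)` parsing with an expression that also accepts usearch-style `;size=` annotations. The sketch below isolates that expression in a helper so its behaviour can be checked on its own. The helper name `split_label` and the sample labels are hypothetical; only the chained string calls are taken from the patch.

```python
def split_label(item):
    """Split an amplicon label into (identifier, abundance as a string).

    Same steps as the patched expression: ";size=" is rewritten to "_",
    a trailing ";" is dropped, then the label is split once from the
    right on "_".
    """
    return tuple(item.replace(";size=", "_").rstrip(";").rsplit("_", 1))


# Hypothetical labels covering both abundance styles.
labels = ["seq1_42", "seq2;size=7;", "seq_3;size=5"]
parsed = [split_label(label) for label in labels]
print(parsed)
```

The abundance stays a string here, as in the patched expression itself. A label with neither an underscore nor a `;size=` annotation comes back as a one-element tuple, so callers may want to check the length before unpacking.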