Merge branch 'develop'

ddesvillechabrol · Dec 26, 2017 · f7c1c1b · f7c1c1b
2 parents 9fe02d9 16fa95b
commit f7c1c1b
Show file tree

Hide file tree

Showing 33 changed files with 2,151 additions and 97 deletions.
diff --git a/doc/Changelog.rst b/doc/Changelog.rst
@@ -12,7  12,17 @@ Changelog
 * BUGS:
 
     * Fix regression bug (https://github.com/sequana/sequana/issues/484)
     * Fix missing N_final column in table of the quality_control multi-summary
       page
     * Remove phix174.fa requirements in RNAseq pipeline config file
     * Fix path starting with tilde (https://github.com/sequana/sequana/issues/486)
 
 * NEWS:
 
     * add isoseq Class
     * add vcf_filter module back to help in filtering VCF files created with
       mpileup for instance
     * add sequana_vcf_filter standalone
 
 0.6.1
 ~~~~~~~~~~

diff --git a/doc/developers.rst b/doc/developers.rst
@@ -616,17  616,12 @@ and then used as::
     singularity exec sequana-sequana-master.img sequana_coverage --help
 
 One issue here is that we should push on master only when we are sure that the
-repository is not bugged. Since people may use the container, we should no
-remove them... but the will be lots of containers. 
 repository is not bugged. Since people may use the container, we should not
 remove them... but there will be lots of containers.
 
-Instead, we could create a specific tag e.g. release_0_5_2 , activate the
-branches in the singularity hub page, push an empty commit and disable that
-branch. In doing so, we have a unique container for the release 0_5_2
 Instead, we could create a specific tag e.g. release_0_5_2, activate the
 branch in the singularity hub page, push an empty commit and disable that
 branch. In doing so, we have a unique container for the release 0_5_2.
 
 
-======= ========= ==================
-branch  version   sequana version
-======= ========= ==================
-0_5_2   3253      0.5.2
-======= ========= ==================
 
diff --git a/doc/index.rst b/doc/index.rst
@@ -10,7  10,7 @@ Sequana documentation
 
 
     <a href="http://bioconda.github.io/recipes/sequana/README.html">
-    <img src="https://img.shields.io/badge/install with-bioconda-brightgreen.svg?style=flat-square">
     <img src="https://img.shields.io/badge/install with-bioconda-brightgreen.svg?style=flat-square"></a>
 
     <a href="https://pypi.python.org/pypi/sequana">
     <img src="https://badge.fury.io/py/sequana.svg"></a>
@@ -39,7  39,7 @@ Sequana documentation
 :Source: See  `http://github.com/sequana/sequana <https://github.com/sequana/sequana/>`_.
 :Issues: Please fill a report on `github <https://github.com/sequana/sequana/issues>`_
 :How to cite: For Sequana in general including the pipelines, please use 
-    the `JOSS DOI (10.21105/joss.00352) <http://www.doi2bib.org/#/doi/10.21105/joss.00352>`_
     the `JOSS DOI (10.21105/joss.00352) <http://www.doi2bib.org/bib/10.21105/joss.00352>`_
 
 
     For the **genome coverage** tool (sequana_coverage):  Dimitri Desvillechabrol,

diff --git a/doc/installation.rst b/doc/installation.rst
@@ -10,8  10,8 @@ available on **bioconda**. Note, however, that
 releases of Sequana are also available on Pypi so you could also use **pip**. 
 
 If you just want to test **Sequana** or **Sequanix** or one of the Sequana
-standalone, we have started to provide **Singularity** containers since version
-0.5.2. This is a great solution for reproducibility as well. Containers are
 standalone, we also provide **Singularity** containers. This is a great 
 solution for reproducibility as well. Containers are
 available on https://singularity-hub.org/collections/114/. 
 
 
@@ -20,7  20,7 @@ Overview of installation methods
 
 We support 3 types of installations:
 
-#. Singularity. Strictly speaking, there is no compilation. This method is for testing and production. It downloads an image / container that is ready-to-use::
 #. Singularity (tested with version 2.4.2) . Strictly speaking, there is no compilation. This method is for testing and production. It downloads an image / container that is ready-to-use::
 
     singularity pull --name sequana.img shub://sequana/sequana
     singularity shell sequana.img
@@ -160,10  160,10 @@ environment) and would work for Linux users only for the time being. The main
 reason being that under Mac and windows a virtualbox is used by Singularity
 preventing a X connection. This should be solved in the near future.
 
-First, install singularity (http://singularity.lbl.gov/). For example ,here is
-how to download version 2.4.0 and install::
 First, install singularity (http://singularity.lbl.gov/). You must use at least
 version 2.4. We tested this recipe with version 2.4.2 (Dec 2017)::
 
-    VERSION=2.4
     VERSION=2.4.2
     wget https://github.com/singularityware/singularity/releases/download/$VERSION/singularity-$VERSION.tar.gz
     tar xvf singularity-$VERSION.tar.gz
     cd singularity-$VERSION
@@ -177,7  177,7 @@ Second, download a Sequana image. For instance, for the latest master version::
 
 or for the release 0.6.1::
 
-    singularity pull --name sequana.img shub://sequana/sequana@release_0_6_1
     singularity pull --name sequana_0_6_1.img shub://sequana/sequana@release_0_6_1
 
 
 Do not interrupt the download (1.5Go). Once downloaded,

diff --git a/environment_rtd.yml b/environment_rtd.yml
@@ -11,14  11,9 @@ dependencies:
 - numpy
 - pandas
 - matplotlib
-- scipy
-- pysam
-- krona
-- kraken
-- pillow
-- sphinx
 - pysam==0.12.0.1
 - reports
-- easydev>=0.9.34
 - easydev>=0.9.36
 - bioservices
 - pip
 - shustring

diff --git a/requirements.txt b/requirements.txt
@@ -3,7  3,7 @@ bioservices>=1.4.14
 biokit>=0.4.1
 colorlog
 docutils
-easydev>=0.9.34
 easydev>=0.9.36
 matplotlib>=2.0.0
 mock
 multiqc==1.0

diff --git a/requirements_pipelines.txt b/requirements_pipelines.txt
@@ -23,4  23,4 @@ snpeff
 spades
 star
 subread
-multiqc
 multiqc==1.0
diff --git a/sequana/__init__.py b/sequana/__init__.py
@@ -5,12  5,14 @@
 except:
     version = ">=0.20.0"
 
-import colorlog as logger
-def sequana_debug_level(level="WARNING"):
-    """A deubg level setter at top level of the library"""
-    assert level in ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
-    logging_level = getattr(logger.logging.logging, level)
-    logger.getLogger().setLevel(logging_level)
 
 try:
     from easydev.logging_tools import Logging
     logger = Logging("sequana", "WARNING")
 except:
     import colorlog
     logger = colorlog.getLogger("sequana")
 
 
 
 from easydev import CustomConfig
@@ -53,6  55,3 @@ def _download_biokit_taxon():
     tt._load_flat_file()
 try: _download_biokit_taxon()
 except: pass
-
-
-
diff --git a/sequana/busco.py b/sequana/busco.py
@@ -109,3  109,111 @@ def download(self, uncompress=True):
                 execute("rm -f %s" % ( target))
 
 
 class BuscoAnalysis(object):
 
     def __init__(self, bin_path=None, config_path=None, species=None,
                 sample_name=None, outpath=None):
         """
 
         :param bin_path: path to busco binary file. 
         :param config_path: is the path to augustus config path. See example
             in sequana/resources/busco
         :param species: one of busco dataset (e.g. bacteria_odb9)
         :param sample_name: prefix used for all files
         :param outpath: main output path
 
         mode    = config['busco']['mode_choice'],
         options = config['busco']['options'],
         wkdir   =  __busco__workdir,
         species = config['busco']['species_choice'],
 
 
         If bin_path or config_path are none, uses CONDA_PREFIX or
         CONDA_ENV_PATH.
 
         """
         assert outpath is not None
         assert sample_name is not None
         assert species is not None
 
         import os
         #sample_name = params.wkdir.split(os.sep)[0]
         if bin_path is None:
             # try conda
             try:
                 self.conda_bin_path = os.environ['CONDA_PREFIX']   os.sep  "bin"
             except:
                 self.conda_bin_path = os.environ['CONDA_ENV_PATH']   os.sep  "bin"
         else:
             self.conda_bin_path = bin_path
 
         if config_path is None:
             try:
                 self.augustus_config_path = os.environ['CONDA_PREFIX']   os.sep   "config"
             except KeyError:
                 self.augustus_config_path = os.environ['CONDA_ENV_PATH']   os.sep   "config"
         else:
             self.augustus_config_path = config_path
 
         self.species = species # TODO make it an attribute that calls set_config
         self.sample_name = sample_name
         self.outpath = outpath
 
 
     def _set_config(self):
         config = BuscoConfig(
             self.species,
             sample_name=self.sample_name,
             outpath=self.outpath,
             conda_bin_path=self.conda_bin_path,
             Rscript_bin_path="", # not required by our analysis
             tmp_path="./tmp_{}".format(self.sample_name)
         )
         from easydev import mkdirs
         mkdirs(self.outpath)
         self.config_filename = self.outpath   "/config.ini"
         config.save_config_file(self.config_filename)
 
     def run(self, infile, threads=4, mode="genome"):
         """
         mode_choice: genome, transcriptomics, proteins
         species: name of a BUSCO dataset. (use Sequanix to get the list)
         options: any options understood by busco
         """
         self._set_config()
         # -f to force overwritten existing files
         cmd = "export BUSCO_CONFIG_FILE={};".format(self.config_filename)
         cmd  = "export AUGUSTUS_CONFIG_PATH={};".format(self.augustus_config_path)
         cmd  = "run_busco -i {infile} -m {mode} -c {threads} -f 2>>{log};"
         cmd = cmd.format(**{"infile":infile, 'mode':mode, "threads":threads, "log":"log"})
 
         # Note that execute() does not keep the shell info
         from easydev import execute as shell
         from snakemake import shell
         shell(cmd)
         shell("rm -rf ./tmp_{}".format(self.sample_name))
 
 
     def multirun(self, infile, threads=4, mode="genome", species=[]):
         """
 
         datasets = BuscoDownload().filenames
         res = b.multirun("canu_test10X.contigs.fasta", threads=1,
             species=datasets)
 
         """
         res = {}
         buffer_ = self.species
         for this in species:
             self.species = this
             try:
                 self.run(infile, threads=threads, mode=mode)
                 from sequana.assembly import BUSCO
                 ab = BUSCO("{}/run_{}/full_table_{}.tsv".format(
                     self.outpath, self.sample_name, self.sample_name))
                 res[this] = ab.score
             except:
                 pass
         self.species = buffer_
 
         return res
diff --git a/sequana/cigar.py b/sequana/cigar.py
@@ -39,16  39,22 @@ class Cigar(object):
 
     Possible CIGAR types are:
 
-    - "M" for MATCH  (0)
-    - "I" for Insertion (1)
-    - "D" for deletion 2
-    - "N" for reference skipped 3
-    - "S" for soft clipping 4
-    - "H" for hard clipping 5
-    - "P" for padding
     - "M" for alignment MATCH  (0)
     - "I" for Insertion to the reference (1)
     - "D" for deletion from the reference 2
     - "N" for skipped region from the reference 3
     - "S" for soft clipping (clipped sequence present in seq) 4
     - "H" for hard clipping (clipped sequence NOT present in seq) 5
     - "P" for padding (silent deletion from padded reference)
     - "=" for equal
-    - "X" for diff
-    - "B" for back
     - "X" for diff (sequence mismatched)
     - "B" for back     !!!! could be also NM ???
     
     !!! BWA MEM get_cigar_stats returns list with 11 items
     Last item is 
     !!! what is the difference between M and = ???
     Last item is I   S   X 
     !!! dans BWA, mismatch (X) not provided... should be deduced from last item - I - S
 
     .. note:: the length of the query sequence based on the CIGAR is calculated
         by adding the M, I, S, =, or X and other operations are ignored.

diff --git a/sequana/fastq.py b/sequana/fastq.py
@@ -1,6  1,4 @@
-"""Utilities to manipulate FASTQ and Reads
-
-"""
 """Utilities to manipulate FASTQ and Reads"""
 import io
 import time
 import zlib
@@ -686,6  684,7 @@ def _to_fasta(self, output_filename="test.fasta", level=6, CHUNKSIZE=65536):
 
         if tozip is True: self._gzip(output_filename)
         """
 
     def filter(self, identifiers_list=[], min_bp=None, max_bp=None,
         progressbar=True, output_filename='filtered.fastq', remove=True):
         """Filter reads
@@ -782,6  781,11 @@ def to_krona(self, k=7, output_filename="fastq.krona"):
                 letters = "\t".join([x for x in index.decode()])
                 fout.write("%s\t" % count   letters   "\n")
 
     def stats(self):
         self.rewind()
         data = [len(read['sequence']) for read in self]
         return {"mean_read_length": pylab.mean(data), "N": len(data)}
 
     def __eq__(self, other):
         if id(other) == id(self):
             return True

diff --git a/sequana/freebayes_bcf_filter.py b/sequana/freebayes_bcf_filter.py
@@ -186,6  186,10 @@ def _filter_line(self, bcf_line):
 
         Return line if all filters are passed.
         """
         # SRF="Number of reference observations on the forward strand
         # SRR="Number of reference observations on the reverse strand
         # SAF="Number of alternate observations on the forward strand
         # SAR=Number of alternate observations on the reverse strand
         if bcf_line.qual < self.filters["freebayes_score"]:
             return False
 
@@ -285,6  289,7 @@ def strand_ratio(number1, number2):
         return 0
     return division
 
 
 def compute_freq(bcf_line):
     """ Compute frequency of alternate allele.
         alt_freq = Count Alternate Allele / Depth
@@ -295,6  300,7 @@ def compute_freq(bcf_line):
                 for count in bcf_line.info["AO"]]
     return alt_freq
 
 
 def compute_strand_bal(bcf_line):
     """ Compute strand balance of alternate allele include in [0,0.5].
         strand_bal = Alt Forward / (Alt Forward   Alt Reverse)