gzip/bzip2 output options #747

b-wyss · 2015-01-30T18:20:47Z

Feature creation in response to #505

mr-c · 2015-02-10T18:19:56Z

Jenkins, retest this please

mr-c · 2015-02-10T18:20:51Z

setup.py

@@ -15,6 +15,7 @@
 import shutil
 import subprocess
 import tempfile
+import bz2file as bz2


mr-c · 2015-02-10T18:21:40Z

Jenkins, test this please

mr-c · 2015-02-11T11:10:20Z

Jenkins, test this please

mr-c · 2015-02-11T11:19:24Z

@b-wyss Looks like you have some PEP8 violations. make autopep8 should clear them up.

Keep on going, you've made good progress!

b-wyss · 2015-02-16T22:15:09Z

ctb · 2015-02-17T14:37:31Z

Looking at http://khmer.readthedocs.org/en/v1.3/user/scripts.html, I think:

filter-abund and filter-abund-single;
count-median;
extract-partitions;
extract-long-sequences;
extract-paired-reads;
fastq-to-fasta
interleave-reads
sample-reads-randomly
split-paired-reads

and eventually trim-low-abund when #759 is merged :)

ctb · 2015-02-17T14:39:43Z

khmer/kfile.py

+                        help='Option to output as bz2')
+
+
+def enable_output_compression(args):


I think this can be generalized a bit - right now the output file has to be in args.output, but for some of the scripts I think there will be a need for different output names/files, especially in the scripts that have multiple output files.

So maybe enable_output_compression can take a file handle, and return an fp (what gzip.GzipFile/bz2file.open give) for it?

(note: open for discussion. just a thought.)
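A minimal sketch of the file-handle-in, fp-out idea suggested above. The helper name `wrap_output` and its two boolean flags are hypothetical, and the standard-library `bz2` module (Python 3) stands in for the `bz2file` backport discussed in this thread.

```python
import bz2
import gzip
import io


def wrap_output(file_handle, do_gzip=False, do_bzip=False):
    """Return a writable object that compresses into file_handle.

    Hypothetical helper: the name and the two boolean flags are
    assumptions, not the signature used in this pull request.
    """
    if do_gzip and do_bzip:
        raise ValueError("choose at most one compression format")
    if do_gzip:
        # GzipFile can wrap any binary file object via fileobj=.
        return gzip.GzipFile(fileobj=file_handle, mode="wb")
    if do_bzip:
        # Python 3's bz2.BZ2File also accepts an open binary file object.
        return bz2.BZ2File(file_handle, mode="wb")
    return file_handle


# Usage: compress two independent outputs, as a multi-output script might.
first, second = io.BytesIO(), io.BytesIO()
gz_out = wrap_output(first, do_gzip=True)
bz_out = wrap_output(second, do_bzip=True)
gz_out.write(b">read1\nACGT\n")
bz_out.write(b">read2\nTTGA\n")
gz_out.close()  # closing the wrapper finishes the stream; the BytesIO stays open
bz_out.close()
```

Because the wrapper only needs an open binary handle, each output of a multi-file script could be wrapped separately instead of relying on a single args.output.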

mr-c · 2015-02-17T20:21:50Z

@ctb First pass is for single outputs (-o or similar)

bocajnotnef · 2015-07-21T17:52:13Z

@b-wyss Mind if I vulture this?

mr-c · 2015-07-21T17:57:39Z

Go for it



b-wyss · 2015-07-21T18:08:03Z

Please do! Sorry I fell off the face of the earth, things got really crazy
towards the end of the spring, centered around a bit of a family situation.
I meant to officially leave and try to make it on good terms, but it never
really happened. You guys deserved to at least know what was going on with
me, and I didn't let anyone know. Seems like the lab has transferred over
pretty successfully - congratulations!


bocajnotnef · 2015-07-21T18:30:42Z

All good! Stuff happens. Hope everything is okay and you're doing well!

bocajnotnef · 2015-07-21T18:38:59Z

TO-DO:

Generalize enable compression: currently it takes args and converts args.output to a gz/bz writer. As per @ctb's recommendation, will shift to a func that takes a file handle and returns an fp. This will enable use in scripts a la normalize-by-median with multiple files.
Enable compression in normalize-by-median
- determine which outputs in normalize-by-median should be compressed
Figure out what happens if we do -o - --gzip in normalize-by-median (probably nothing good)

bocajnotnef · 2015-07-21T20:58:45Z

question: should the func that adds the compression args be moved to khmer_args?

https://github.com/dib-lab/khmer/pull/747/files#diff-46d242f8c7f4ca46fd5c784bd24184c7R165
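For context, a function that adds the compression arguments could look roughly like the sketch below. The function name and the `--bzip` flag spelling are assumptions for illustration. Only `--gzip` and the bz2 help text appear in this thread, and making the two flags mutually exclusive is a design choice of the sketch, not something the pull request is known to do.

```python
import argparse


def add_output_compression_type(parser):
    """Add mutually exclusive compression flags to parser.

    The function name and the --bzip flag are assumptions made for
    this sketch.
    """
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--gzip", default=False, action="store_true",
                       help="Option to output as gzip")
    group.add_argument("--bzip", default=False, action="store_true",
                       help="Option to output as bz2")
    return parser


parser = add_output_compression_type(argparse.ArgumentParser())
args = parser.parse_args(["--gzip"])
```

Since the helper only touches the parser, it could live next to other shared argument builders without depending on file-handling code.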

bocajnotnef · 2015-07-22T21:56:38Z

normalize-by-median's stdout 3 test is leaking output

(It's my doing, I just need to remember to fix it)

bocajnotnef · 2015-07-23T15:54:52Z

bocajnotnef · 2015-07-23T16:37:06Z

split-paired-reads has this:

parser.add_argument('infile', nargs='?', default='/dev/stdin')

we should make that an argparse-handled file open
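One way to let argparse do the opening is `argparse.FileType`, which maps `-` to standard input. This is a sketch of the idea, not the change that was made to split-paired-reads. Newer Python releases discourage `FileType`, so check the argparse documentation for the version in use.

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# FileType('r') opens the named file for reading; '-' maps to sys.stdin.
parser.add_argument("infile", nargs="?", type=argparse.FileType("r"),
                    default=sys.stdin)

# With no positional argument the default handle (stdin) is used as-is.
args = parser.parse_args([])
```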

bocajnotnef · 2015-07-23T17:47:12Z

khmer/kfile.py

+    if file_handle is sys.stdout:
+        return sys.stdout
+    else:
+        assert type(file_handle) == file, type(file_handle)


this is primarily a debugging thing and will be removed

shrug it's ok to leave it in, with a brief explanation.

bocajnotnef · 2015-07-23T17:50:09Z

retest this, please

ctb · 2015-07-23T17:58:21Z

scripts/fastq-to-fasta.py

@@ -67,7 +70,7 @@ def main():
    else:
        print('No lines dropped from file.', file=sys.stderr)

-    print('Wrote output to', args.output, file=sys.stderr)
+    print('Wrote output to', str(args.output), file=sys.stderr)


I think this can be args.output.name, no?

Fails if we pass sys.stdout as output--it doesn't have a name.

That said, my str solution doesn't really help. I should check to see if it is in fact a block device or whathaveyou.

@mr-c recommended making a function to do checking on file handles to see if they are fifos, block devices, etc. and factoring those checks out of the rest of the codebase.

yarp. +1.
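A rough sketch of such a check, using `os.fstat` and the `stat` module. The helper name `handle_kind` and its return labels are hypothetical. It assumes the handle exposes `fileno()`, and the pipe example assumes a POSIX system.

```python
import os
import stat
import tempfile


def handle_kind(file_handle):
    """Classify a file handle as 'fifo', 'block', 'char', 'regular' or 'other'.

    Hypothetical helper name; relies on the handle exposing fileno().
    """
    mode = os.fstat(file_handle.fileno()).st_mode
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISBLK(mode):
        return "block"
    if stat.S_ISCHR(mode):
        return "char"
    if stat.S_ISREG(mode):
        return "regular"
    return "other"


with tempfile.TemporaryFile() as tmp:
    kind_of_tmp = handle_kind(tmp)   # an ordinary on-disk file

read_fd, write_fd = os.pipe()
with os.fdopen(read_fd) as r, os.fdopen(write_fd, "w") as w:
    kind_of_pipe = handle_kind(w)    # pipes report as FIFOs on POSIX
```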

ctb · 2015-07-30T18:24:47Z

Good question. No for now, but file an issue.

On Jul 30, 2015, at 11:22 AM, Jake Fenton wrote:

question re trim-low-abund: we set aside stuff in a file, then go back to that file, pull things from it and then write to output. Currently, I don't compress the aside_file. Should I?

bocajnotnef · 2015-07-30T18:31:17Z

Right. Everything is in place except for more extensive test coverage.

Probably gonna make a test_output_compression.py or summuch

ctb · 2015-07-30T18:32:47Z

I would suggest adding streaming tests to the new test_streaming_io.


bocajnotnef · 2015-07-30T18:35:43Z

Agreed, but my concern is that for nearly every case where we have an output file we could have a compressed output file, and we should be testing all of those--but that's like, 50 tests.

bocajnotnef · 2015-07-30T20:26:16Z

From a high-bandwidth conversation: we'll employ stupidity-driven testing here. There's no point in going through the combinatorial matrix (well, there is a point, but the cost/benefit isn't there). So if somebody finds something that's super borked, we'll fix it and add a test for it, but as we stand we're probably on solid footing.
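When a broken case does get reported, the regression test added for it could follow a simple round-trip shape like the sketch below. It exercises only the standard-library compressors. The record contents and the helper name `roundtrip` are made up for illustration, and a real test would run the affected khmer script instead.

```python
import bz2
import gzip
import os
import tempfile

RECORD = b">seq1\nACGTACGT\n"


def roundtrip(opener):
    """Write RECORD through opener, read it back, and return the bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with opener(path, "wb") as out:
            out.write(RECORD)
        with opener(path, "rb") as inp:
            return inp.read()
    finally:
        os.remove(path)


def test_gzip_roundtrip():
    assert roundtrip(gzip.open) == RECORD


def test_bzip_roundtrip():
    assert roundtrip(bz2.open) == RECORD
```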

bocajnotnef · 2015-07-30T21:06:10Z

On that note, @ctb Merge?

ctb · 2015-07-31T14:20:36Z

scripts/fastq-to-fasta.py

@@ -55,8 +61,8 @@ def main():
                n_count += 1
                continue

-        args.output.write('>' + name + '\n')
-        args.output.write(sequence + '\n')
+        del record['quality']


ctb · 2015-07-31T14:26:16Z

The number of changed files and the number of files mentioned in the ChangeLog entry are different; what's up?
Is is_block really only used in one place in the codebase? Is there nowhere else it belongs?
More generally, in the 'Wrote output to' section of fastq-to-fasta: is there a need for a general way to provide a human-readable output name for things that may-or-may-not-be-files? I would prefer that logic like if output_is_block be placed somewhere central and reusable (kfile.py?), perhaps a function named 'describe_filehandle' or some such that returns a string?
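A sketch of what a central `describe_filehandle` might look like. The name comes from the comment above, but the behaviour and the label wording here are assumptions, not the code that was merged.

```python
import io
import os
import stat
import sys


def describe_filehandle(file_handle):
    """Return a human-readable label for file_handle.

    Sketch only: the name comes from the review comment, but the
    behaviour and the wording of the labels are assumptions.
    """
    if file_handle is sys.stdout:
        return "standard output"
    try:
        mode = os.fstat(file_handle.fileno()).st_mode
    except (AttributeError, OSError, ValueError):
        # In-memory or wrapped objects may have no usable descriptor.
        return "<{}>".format(type(file_handle).__name__)
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISFIFO(mode):
        return "fifo"
    return str(getattr(file_handle, "name", "<unnamed file>"))


print("Wrote output to", describe_filehandle(io.StringIO()), file=sys.stderr)
```

Scripts could then print the returned string without caring whether the output is a regular file, a pipe, or standard output.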

ctb · 2015-07-31T14:27:12Z

Overall looks good. Double-check your diff-cover, fix issues above, ask for re-review :)


bocajnotnef · 2015-07-31T21:24:34Z

@ctb Cleaned up, ready for merge.


ctb · 2015-08-01T16:00:11Z

LGTM.


bocajnotnef · 2015-08-01T16:12:09Z

Thanks!

mr-c reviewed Feb 10, 2015

setup.py

@@ -15,6 +15,7 @@
 import shutil
 import subprocess
 import tempfile
+import bz2file as bz2

Not needed

ctb reviewed Feb 17, 2015

mr-c added low-hanging-fruit Python labels May 13, 2015

mr-c added this to the 1.4 milestone May 13, 2015

ctb modified the milestones: 2.0, 1.4 Jun 12, 2015

bocajnotnef self-assigned this Jul 21, 2015

bocajnotnef reviewed Jul 23, 2015

ctb reviewed Jul 23, 2015

bocajnotnef mentioned this pull request Jul 23, 2015

Automatic output file naming. #1195

Open

ctb mentioned this pull request Jul 31, 2015

Compression for working file in trim-low-abund? #1213

Open

ctb reviewed Jul 31, 2015

ctb mentioned this pull request Jul 31, 2015

Add --output-orphaned option to split-paired-reads.py #1164

Merged

bocajnotnef mentioned this pull request Jul 31, 2015

Bz2file as a requirement for khmer (and screed) #1217

Closed

bocajnotnef added 2 commits July 31, 2015 14:56

cleanup, minor refactor for generalizing output info logic

d23f7f8

Merge branch 'master' of github.com:ged-lab/khmer into feature/gzip505

c03462f

Conflicts: ChangeLog

ctb added 6 commits August 1, 2015 08:19

Merge branch 'master' of github.com:dib-lab/khmer into feature/gzip505

d64f5ef

Conflicts: ChangeLog

minor cleanup

89e9e3e

updated ChangeLog; preserve minimum greppability

5e9ee32

more minor cleanup

266ad84

fix heading foo

a85ed3a

updated ChangeLog

529f5e1

ctb added a commit that referenced this pull request Aug 1, 2015

Merge pull request #747 from dib-lab/feature/gzip505

0709ff8

gzip/bzip2 output options

ctb merged commit 0709ff8 into master Aug 1, 2015

ctb deleted the feature/gzip505 branch August 1, 2015 16:00

bocajnotnef mentioned this pull request Aug 3, 2015

Add options for outputting gzipped/bzip2ed sequence #505

Closed
