Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip/bzip2 output options #747

Merged
merged 30 commits into from
Aug 1, 2015
Merged

gzip/bzip2 output options #747

merged 30 commits into from
Aug 1, 2015

Conversation

b-wyss
Copy link
Contributor

@b-wyss b-wyss commented Jan 30, 2015

Feature creation in response to #505

@mr-c
Copy link
Contributor

mr-c commented Feb 10, 2015

Jenkins, retest this please

@@ -15,6 15,7 @@
import shutil
import subprocess
import tempfile
import bz2file as bz2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

@mr-c
Copy link
Contributor

mr-c commented Feb 10, 2015

Jenkins, test this please

1 similar comment
@mr-c
Copy link
Contributor

mr-c commented Feb 11, 2015

Jenkins, test this please

@mr-c
Copy link
Contributor

mr-c commented Feb 11, 2015

@b-wyss Looks like you have some PEP8 violations. make autopep8 should clear them up.

Keep on going, you've made good progress!

@b-wyss
Copy link
Contributor Author

b-wyss commented Feb 16, 2015

Scripts to update:

  • normalize-by-median (multiple files);
  • count-median;
  • extract-partitions;
  • extract-long-sequences;
  • extract-paired-reads;
  • fastq-to-fasta;
  • filter-abund;
  • filter-abund-single;
  • sample-reads-randomly;
  • split paired-reads;
  • interleave-reads;
  • trim-low-abund;

@ctb
Copy link
Member

ctb commented Feb 17, 2015

Looking at http://khmer.readthedocs.org/en/v1.3/user/scripts.html, I think:

filter-abund and filter-abund-single;
count-median;
extract-partitions;
extract-long-sequences;
extract-paired-reads;
fastq-to-fasta
interleave-reads
sample-reads-randomly
split-paired-reads

and eventually trim-low-abund when #759 is merged :)

help='Option to output as bz2')


def enable_output_compression(args):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be generalized a bit - right now the output file has to be in args.output, but for some of the scripts I think there will be a need for different output names/files, especially in the scripts that have multiple output files.

So maybe enable_output_compression can take a file handle, and return an fp (what gzip.GzipFile/bz2file.open give) for it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note: open for discussion. just a thought.)

@mr-c
Copy link
Contributor

mr-c commented Feb 17, 2015

@ctb First pass is for single outputs ( -o or similar)

@mr-c mr-c added this to the 1.4 milestone May 13, 2015
@ctb ctb modified the milestones: 2.0, 1.4 Jun 12, 2015
@bocajnotnef
Copy link
Contributor

@b-wyss Mind if I vulture this?

@mr-c
Copy link
Contributor

mr-c commented Jul 21, 2015

Go for it

On Tue, Jul 21, 2015 at 10:52 AM Jake Fenton [email protected]
wrote:

@b-wyss https://github.com/b-wyss Mind if I vulture this?


Reply to this email directly or view it on GitHub
#747 (comment).

Michael R. Crusoe: Programmer & Bioinformatician [email protected]
The lab for Data Intensive Biology; University of California, Davis
https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe

@b-wyss
Copy link
Contributor Author

b-wyss commented Jul 21, 2015

Please do! Sorry I fell off the face of the earth, things got really crazy
towards the end of the spring, centered around a bit of a family situation.
I meant to officially leave and try to make it on good terms, but it never
really happened. You guys deserved to at least know what was going on with
me, and I didn't let anyone know. Seems like the lab has transferred over
pretty successfully - congratulations!

On Tue, Jul 21, 2015 at 1:57 PM Michael R. Crusoe [email protected]
wrote:

Go for it

On Tue, Jul 21, 2015 at 10:52 AM Jake Fenton [email protected]
wrote:

@b-wyss https://github.com/b-wyss Mind if I vulture this?


Reply to this email directly or view it on GitHub
#747 (comment).

Michael R. Crusoe: Programmer & Bioinformatician [email protected]
The lab for Data Intensive Biology; University of California, Davis
https://impactstory.org/MichaelRCrusoe http://twitter.com/biocrusoe


Reply to this email directly or view it on GitHub
#747 (comment).

@bocajnotnef
Copy link
Contributor

All good! Stuff happens. Hope everything is okay and you're doing well!

@bocajnotnef bocajnotnef self-assigned this Jul 21, 2015
@bocajnotnef
Copy link
Contributor

TO-DO:

  • Generalize enable compression: currently it takes args and converts args.output to a gz/bz writer. as per @ctb reccomendation, will shift to a func that takes a file handle and returns an FP. This will enable use in scripts a la normalize-by-median with multiple files.
  • Enable compression in normalize-by-median
    • determine which outputs in normalize-by-median should be compressed
  • figure out what happens if we do -o - --gzip in normalize-by-median (probably nothing good)

@bocajnotnef
Copy link
Contributor

question: should the func that adds the compression args be moved to khmer_args?

https://github.com/dib-lab/khmer/pull/747/files#diff-46d242f8c7f4ca46fd5c784bd24184c7R165

@bocajnotnef
Copy link
Contributor

normalize-by-median's stdout 3 test is leaking output

(It's my doing, I just need to remember to fi x it)

@bocajnotnef
Copy link
Contributor

"revert" to argparse handling opening files:

  • normalize-by-median (multiple files);
  • count-median;
  • extract-partitions;
  • extract-long-sequences;
  • extract-paired-reads;
  • fastq-to-fasta;
  • filter-abund;
  • filter-abund-single;
  • sample-reads-randomly;
  • split paired-reads;
  • interleave-reads;
  • trim-low-abund;

@bocajnotnef
Copy link
Contributor

split-paired-reads has this:

parser.add_argument('infile', nargs='?', default='/dev/stdin')

we should make than an argparse handled file open

if file_handle is sys.stdout:
return sys.stdout
else:
assert type(file_handle) == file, type(file_handle)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is primarily a debugging thing and will be removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shrug it's ok to leave it in, with a brief explanation.

@bocajnotnef
Copy link
Contributor

retest this, please

@@ -67,7 70,7 @@ def main():
else:
print('No lines dropped from file.', file=sys.stderr)

print('Wrote output to', args.output, file=sys.stderr)
print('Wrote output to', str(args.output), file=sys.stderr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be args.output.name, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fails if we pass sys.stdout as output--it doesn't have a name.

That said, my str solution doesn't really help. I should check to see if it is in fact a block device or whathaveyou.

@mr-c reccomended making a function to do checking on file handles to see if they are fifos, block devices, etc. and factoring those checks out of the rest of the codebase.

Copy link
Member

@ctb ctb Jul 23, 2015 via email

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctb
Copy link
Member

ctb commented Jul 30, 2015

Good question. No for now, but file an issue.

Titus Brown, [email protected]

On Jul 30, 2015, at 11:22 AM, Jake Fenton [email protected] wrote:

question re trim-low-abund: we set aside stuff in a file, then go back to that file, pull things from it and then write to output. Currently, I don't compress the aside_file--Should I?


Reply to this email directly or view it on GitHub.

@bocajnotnef
Copy link
Contributor

Right. Everything is in place except for more extensive test coverage.

Probably gonna make a test_output_compression.py or summuch

@ctb
Copy link
Member

ctb commented Jul 30, 2015

I would suggest adding streaming tests to the new test_streaming_io.

Titus Brown, [email protected]

On Jul 30, 2015, at 11:31 AM, Jake Fenton [email protected] wrote:

Right. Everything is in place except for more extensive test coverage.

Probably gonna make a test_output_compression.py or summuch


Reply to this email directly or view it on GitHub.

@bocajnotnef
Copy link
Contributor

Agreed, but my concern is that for nearly every case where we have an output file we could have a compressed output file, and we should be testing all of those--but that's like, 50 tests.

@bocajnotnef
Copy link
Contributor

from high-bandwidth conversation; We'll employ stupidity-driven-testing here. No point in going through the combinatorial matrix (Well, there is a point, but the cost/benefit isn't there). So if somebody finds something that's super borked we'll fix it and add a test for it but as we stand we're probably on solid footing.

@bocajnotnef
Copy link
Contributor

On that note, @ctb Merge?

@@ -55,8 61,8 @@ def main():
n_count = 1
continue

args.output.write('>' name '\n')
args.output.write(sequence '\n')
del record['quality']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, ok ;)

@ctb
Copy link
Member

ctb commented Jul 31, 2015

  • The number of changed files and the number of files mentioned in the ChangeLog entry are different;
    what's up?
  • Is is_block really only used in one place in the codebase? Is there nowhere else it belongs?
  • More generally, in the 'Wrote output to" section of fastq-to-fasta - is there a need for a general way to
    provide a human-readable output name for things that may-or-may-not-be-files? I would prefer that
    logic like if output_is_block be placed somewhere central and reusable (kfile.py?) - perhaps a
    function named 'describe_filehandle` or some such that returns a string?

@ctb
Copy link
Member

ctb commented Jul 31, 2015

Overall looks good. Double-check your diff-cover, fix issues above, ask for re-review :)

@bocajnotnef
Copy link
Contributor

@ctb Cleaned up, ready for merge.

@ctb
Copy link
Member

ctb commented Aug 1, 2015

LGTM.

ctb added a commit that referenced this pull request Aug 1, 2015
@ctb ctb merged commit 0709ff8 into master Aug 1, 2015
@ctb ctb deleted the feature/gzip505 branch August 1, 2015 16:00
@bocajnotnef
Copy link
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants