Normalize-by-median refactor: Stage 0 #1010

bocajnotnef · 2015-05-18T15:24:57Z

Initial implementation of context managers to move some boilerplate out of the way in normalize-by-median; addresses parts of #1006.

Much more to come....

bocajnotnef · 2015-05-18T15:25:54Z

scripts/normalize-by-median.py

@@ -170,7 +181,7 @@ def normalize_by_median_and_check(input_filename, htable, single_output_file,
                .format(inp=input_filename, kept=total - discarded,
                        total=total, perc=int(100. - discarded /
                                              float(total) * 100.))
-            print >> sys.stderr, 'output in', output_name
+            print >> sys.stderr, 'output in', outfp


this will be reverted

bocajnotnef · 2015-05-18T15:26:31Z

@camillescott @ctb quick once-over to make sure I'm on the right track?

My concern is that so far this is just moving code around.

ctb · 2015-05-18T15:28:11Z

On Mon, May 18, 2015 at 08:26:31AM -0700, Jake Fenton wrote:

> @camillescott @ctb quick once-over to make sure I'm on the right track?
>
> My concern is that so far this is just moving code around.

didn't look, but please proceed with confidence. "just moving code around" is fine.

bocajnotnef · 2015-05-18T15:29:03Z

Mkay. Proceeding with prejudice.

bocajnotnef · 2015-05-22T14:59:57Z

@luizirber @camillescott @ctb CR, please?

ctb · 2015-05-22T15:02:53Z

Looks good on a quick skim. Please only ask for a CR after the Jenkins tests pass, though.

ctb · 2015-05-22T15:03:24Z

p.s. "just moving code around without breaking anything" is refactoring done properly. This looks like it's already a big improvement!

bocajnotnef · 2015-05-22T15:06:16Z

Huh. I only changed the changelog, since the last commit--Jenkins shouldn't have broken.

Looks like code coverage dropped; denominator issue. I'll go write a test. (Batchwise stuff isn't tested at all, it seems.)

ctb · 2015-05-22T15:07:53Z

On Fri, May 22, 2015 at 08:06:16AM -0700, Jake Fenton wrote:

> Huh. I only changed the changelog, since the last commit--Jenkins shouldn't have broken.

It was a general point, not a specific one :)

luizirber · 2015-05-22T15:44:18Z

scripts/normalize-by-median.py

@@ -126,6 +128,22 @@ def handle_error(error, output_name, input_name, fail_save, htable):
        print >> sys.stderr, '** ERROR: problem removing corrupt filtered file'


+@contextmanager
+def FailSafe(ifile, ofile, save_on_fail, ht, corrupted, total, dicarded, jedi):


nice pun with jedi/force, but it is harder to read

It was to fit in 80 chars. I can change it, but I'll have to throw in a line break.

That's fine, because there is a typo in 'discarded' too
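
For readers unfamiliar with the pattern under review, here is a minimal sketch of a `contextmanager`-based fail-safe. It is not the code from this PR: `fail_safe`, `FakeTable`, and the parameter names are hypothetical, with `force` standing in for `jedi`. On an `IOError` it optionally saves the table, then either swallows the error (when forced) or re-raises it. The sketch assumes Python 3.

```python
import sys
from contextlib import contextmanager


class FakeTable(object):
    """Hypothetical stand-in for a counting table; records save() calls."""

    def __init__(self):
        self.saved_to = []

    def save(self, path):
        self.saved_to.append(path)


@contextmanager
def fail_safe(input_name, table, save_on_fail, force,
              backup_name='backup.ct'):
    """Run the wrapped block; on IOError, optionally save `table`.

    If `force` is true the error is reported and swallowed; otherwise it
    is re-raised to the caller.
    """
    try:
        yield
    except IOError as err:
        print('** ERROR:', err, file=sys.stderr)
        print('** Failed on {}'.format(input_name), file=sys.stderr)
        if save_on_fail:
            table.save(backup_name)
        if not force:
            raise


def run_demo():
    """Simulate a failing read with both save_on_fail and force enabled."""
    table = FakeTable()
    with fail_safe('reads.fa', table, save_on_fail=True, force=True):
        raise IOError('simulated read failure')
    return table.saved_to
```

Spelling out `force` and `discarded` costs a line break in the signature but keeps the call sites readable, which is the trade-off discussed above.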

bocajnotnef · 2015-05-22T15:54:55Z

@ctb Updated; CR?

ctb · 2015-05-22T15:57:50Z

Not gonna get to it today. Keep on working :)

bocajnotnef · 2015-05-22T16:59:16Z

scripts/normalize-by-median.py

+                print '...saving to', hashname
+            else:
+                hashname = 'backup.ct'
+                print 'Nothing given for savetable, saving to', hashname


Should this be to stderr?

bocajnotnef · 2015-05-22T17:32:38Z

I think I've refactored much of the obvious.

ctb · 2015-05-26T14:33:40Z

scripts/normalize-by-median.py

-    total = 0
-    discarded = 0
-    for index, batch in enumerate(batchwise(screed.open(
-            input_filename, parse_description=False), batch_size)):
        if index > 0 and index % 100000 == 0:
            print >>sys.stderr, '... kept {kept} of {total} or'\


I think this 'print' could usefully be encapsulated in a function; just a thought.

Talking about print, it would be nice to move to the print function syntax:

add from __future__ import print_function to the top of the script

move sys.stderr to the file parameter:

print('... kept {kept} of {total} or {perc:2}%'.format(kept=total - discarded, total=total, perc=int(100. - discarded / float(total) * 100.)), file=sys.stderr)

This will help for the Python 3 compatibility.
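
Both suggestions in this thread (wrapping the report in a function, and using the print function) can be combined in one small helper. The following is a sketch only: `percent_kept` and `report_progress` are hypothetical names, not functions from the script, and the code assumes Python 3. It also guards against `total == 0`, which the original expression does not.

```python
import sys

# On Python 2, `from __future__ import print_function` would have to be
# the first statement of the script for the calls below to work.


def percent_kept(total, discarded):
    """Integer percentage of reads kept, as in the script's report line."""
    if total == 0:  # assumption: report 0% rather than divide by zero
        return 0
    return int(100. - discarded / float(total) * 100.)


def report_progress(total, discarded, stream=sys.stderr):
    """Write one progress line to `stream` (stderr by default)."""
    print('... kept {kept} of {total} or {perc:2}%'.format(
        kept=total - discarded, total=total,
        perc=percent_kept(total, discarded)), file=stream)
```

Calling `report_progress(200000, 50000)` would write `... kept 150000 of 200000 or 75%` to stderr; passing a different `stream` makes the helper easy to test.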

bocajnotnef · 2015-05-29T15:41:36Z

retest this please

bocajnotnef · 2015-05-29T15:55:50Z

retest this please

bocajnotnef · 2015-05-29T15:59:43Z

@ctb @luizirber Updated; CR when possible, please.

bocajnotnef · 2015-05-29T16:00:34Z

scripts/normalize-by-median.py

-    desired_coverage = cutoff
-    ksize = htable.ksize()
+    index = 0
+    # global total, discarded


Whoops. I'll eliminate this in the next commit. (Assuming there's more stuff.)

ctb · 2015-05-29T16:45:35Z

I have several minor comments, but my major comment is really: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check?

bocajnotnef · 2015-05-29T18:59:36Z

Hrm. That does make the most sense--and is in keeping with the spirit of refactoring.

I never really understood the difference between normalize_..._and_check and regular normalize anywho.

I'll distill down the functionality over the weekend--gonna be busy from this afternoon through tomorrow.

The accounting for this will be so much fun.

bocajnotnef · 2015-06-01T15:21:14Z

@ctb Updated

ctb · 2015-06-01T15:26:51Z

Looking much neater, but my major comment remains: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check?

bocajnotnef · 2015-06-01T15:33:36Z

I did that. normalize_by_median(...) no longer exists and all the processing that was done there has been moved to normalize_by_median_and_check(...), which gets a Normalizer object from main.

Which reminds me, there's some stuff I can move.
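
The shape of that change can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `Normalizer` below is a hypothetical, simplified stand-in that only tracks the running `total`/`discarded` counts and `corrupt_files`, and its `should_keep` predicate is a placeholder for the real median-coverage test.

```python
class Normalizer(object):
    """Hypothetical, simplified holder for state shared across input files."""

    def __init__(self, should_keep):
        self.should_keep = should_keep  # placeholder for the coverage test
        self.total = 0
        self.discarded = 0
        self.corrupt_files = []

    def __call__(self, reads):
        """Yield the reads that pass, updating the running counts."""
        for read in reads:
            self.total += 1
            if self.should_keep(read):
                yield read
            else:
                self.discarded += 1


def normalize_by_median_and_check(input_name, reads, norm):
    """Process one input; record it on `norm` as corrupt if reading fails."""
    kept = []
    try:
        for read in norm(reads):
            kept.append(read)
    except IOError:
        norm.corrupt_files.append(input_name)
    return kept


def main():
    # The object is created once and passed to every per-file call, so the
    # counts accumulate across inputs without globals or extra return values.
    norm = Normalizer(should_keep=lambda read: 'N' not in read)
    inputs = {'a.fa': ['ACGT', 'ACNT'], 'b.fa': ['GGCC']}
    for name in sorted(inputs):
        normalize_by_median_and_check(name, inputs[name], norm)
    return norm.total, norm.discarded, norm.corrupt_files
```

Because all accounting lives on the object, the caller reads `norm.total`, `norm.discarded`, and `norm.corrupt_files` after the loop instead of threading them through return values.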

bocajnotnef · 2015-06-01T15:34:03Z

scripts/normalize-by-median.py

-            htable.save(hashname)
+                f, htable, args.single_output_file,
+                args.fail_save, args.paired, args.force, norm, report_fp)
+        corrupt_files += corrupt


could just pull from norm.corrupt_files instead.

ctb · 2015-06-01T16:27:13Z

I'm kind of allergic to the use of 'norm' in WithDiagnostics, but will accept it as a short-term evil :). (If you didn't have 'norm' in there, this could be a more generally useful class; as it is, it's only usable in this script.) I also think the code will become much cleaner with broken_paired_iterator.

So I'm -0 on merging this right now but will do so if you fix #1000 in this PR, and then promise to move quickly on the next stage of the refactor!
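
One way to remove the dependency on `norm`, sketched here under assumptions rather than taken from the PR: let the wrapper accept any zero-argument callable that returns the current `(total, discarded)` pair. The class name `WithDiagnostics` comes from the discussion above, but the constructor signature, the `get_counts` callback, and the `interval` and `stream` parameters are hypothetical. The sketch assumes Python 3.

```python
import sys


class WithDiagnostics(object):
    """Wrap an iterable and report counts every `interval` items.

    Hypothetical sketch: `get_counts` is any callable returning
    (total, discarded), so the wrapper knows nothing about a Normalizer.
    """

    def __init__(self, items, get_counts, interval=100000,
                 stream=sys.stderr):
        self.items = items
        self.get_counts = get_counts
        self.interval = interval
        self.stream = stream

    def __iter__(self):
        for index, item in enumerate(self.items):
            # Same trigger as the script's loop: skip index 0, then report
            # on every multiple of the interval.
            if index > 0 and index % self.interval == 0:
                total, discarded = self.get_counts()
                print('... kept {} of {}'.format(total - discarded, total),
                      file=self.stream)
            yield item
```

A caller holding a normalizer-like object could pass `lambda: (norm.total, norm.discarded)` as `get_counts`; other scripts could pass whatever counters they keep.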

bocajnotnef · 2015-06-01T18:32:36Z

retest this please

bocajnotnef · 2015-06-01T18:39:16Z

@ctb I agree re norm in WithDiagnostics--I didn't really know of an easier way to get the information into WithDiagnostics that we needed (that being the rolling total/discarded counts).

I've implemented @drtamermansour's fix for the PE countings in such a way that we maintain the current behavior AFAICT.

I've got to get rid of tests/test-data/paired_withN.fa as I wiped out that test (or perhaps adapt it to check for the current behavior)

Next stage of the refactor would be implementing broken_paired_reader, which looks fun. I'll get to it as soon as this is CR'd/merged.

ctb · 2015-06-01T19:01:41Z

OK, looks good to me. Let's get on that 2nd round cleanup tho :)

Normalize-by-median refactor: Stage 0

initial CM implementation

1fd4a7e

bocajnotnef reviewed May 18, 2015

bocajnotnef added 2 commits May 18, 2015 13:59

nobody here but us generators!

a2c3b22

Changelog

bdad78e

bocajnotnef added 3 commits May 22, 2015 11:34

fixed norm's -R arg, added test

dc290ee

updated changelog

84a88b4

Merge branch 'master' of github.com:ged-lab/khmer into rfac/normbymed…

9c80074

…/stage0 Conflicts: ChangeLog

luizirber reviewed May 22, 2015

bringing balance to the force....

77527c5

bocajnotnef added 2 commits May 22, 2015 12:22

revenge of the iterators!

e1986ec

pep8

1ce050d

bocajnotnef reviewed May 22, 2015

bocajnotnef added 2 commits May 22, 2015 13:10

the generators strike back!

36d26cb

pointed diagnostics to stderr

6f3e546

ctb reviewed May 26, 2015

bocajnotnef reviewed May 29, 2015

bocajnotnef added 2 commits June 1, 2015 11:16

we live in a class-ist society--long live the proletariat!

7216d8e

Merge branch 'master' of github.com:ged-lab/khmer into rfac/normbymed…

ce33510

…/stage0 Conflicts: ChangeLog

bocajnotnef reviewed Jun 1, 2015

drtamermansour and others added 3 commits June 1, 2015 13:39

add tests for the 2 new behaviors

658e9b1

Conflicts: ChangeLog

added tests for correct paired-end behaviour

a8fe9b9

also added correct seq substituting

pep8

43178a2

fixed ChangeLog

a0b403a

ctb added a commit that referenced this pull request Jun 1, 2015

Merge pull request #1010 from dib-lab/rfac/normbymed/stage0

864d87b

Normalize-by-median refactor: Stage 0

ctb merged commit 864d87b into master Jun 1, 2015

bocajnotnef mentioned this pull request Jun 1, 2015

normalize-by-median skip consuming kmers of PE reads #1000

Closed

ctb deleted the rfac/normbymed/stage0 branch June 5, 2015 01:55

This was referenced Jun 11, 2015

Refactor normalize-by-median with extreme prejudice #1006

Closed

Properly handle singleton reads in normalize-by-median. #988

Closed

update the diginorm documentation #1081

Merged

ctb mentioned this pull request Jul 1, 2015

Clean up & simplify trim-low-abund by using generators #1138

Merged

This was referenced Jul 20, 2015

make diginorm reporting format be cumulative #504

Closed

Refactor diginorm reporting #1182

Merged
