-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize-by-median refactor: Stage 0 #1010
Conversation
@@ -170,7 181,7 @@ def normalize_by_median_and_check(input_filename, htable, single_output_file, | |||
.format(inp=input_filename, kept=total - discarded, | |||
total=total, perc=int(100. - discarded / | |||
float(total) * 100.)) | |||
print >> sys.stderr, 'output in', output_name | |||
print >> sys.stderr, 'output in', outfp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will be reverted
@camillescott @ctb quick once-over to make sure I'm on the right track? My concern is that so far this is just moving code around. |
On Mon, May 18, 2015 at 08:26:31AM -0700, Jake Fenton wrote:
didn't look, but please proceed with confidence. "just moving code around" is fine. |
Mkay. Proceeding with prejudice. |
@luizirber @camillescott @ctb CR, please? |
Looks good on a quick skim. Please only ask for a CR after the Jenkins tests pass, though. |
p.s. "just moving code around without breaking anything" is refactoring done properly. This looks like it's already a big improvement! |
Huh. I only changed the changelog, since the last commit--Jenkins shouldn't have broken. Looks like code coverage dropped; denominator issue. I'll go write a test. (Batchwise stuff isn't tested at all, it seems.) |
On Fri, May 22, 2015 at 08:06:16AM -0700, Jake Fenton wrote:
It was a general point, not a specific one :) |
…/stage0 Conflicts: ChangeLog
@@ -126,6 128,22 @@ def handle_error(error, output_name, input_name, fail_save, htable): | |||
print >> sys.stderr, '** ERROR: problem removing corrupt filtered file' | |||
|
|||
|
|||
@contextmanager | |||
def FailSafe(ifile, ofile, save_on_fail, ht, corrupted, total, dicarded, jedi): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice pun with jedi/force, but it is harder to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to fit in 80 chars. I can change it, but I'll have to throw in a line break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine, because there is a typo in 'discarded' too
@ctb Updated; CR? |
Not gonna get to it today. Keep on working :)
|
print '...saving to', hashname | ||
else: | ||
hashname = 'backup.ct' | ||
print 'Nothing given for savetable, saving to', hashname |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be to stderr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've refactored much of the obvious. |
total = 0 | ||
discarded = 0 | ||
for index, batch in enumerate(batchwise(screed.open( | ||
input_filename, parse_description=False), batch_size)): | ||
if index > 0 and index % 100000 == 0: | ||
print >>sys.stderr, '... kept {kept} of {total} or'\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this 'print' could usefully be encapsulated in a function; just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking about print
, it would be nice to move to the print function syntax:
- add
from __future__ import print_function
to the top of the script - move
sys.stderr
to thefile
parameter:
print('... kept {kept} of {total} or {perc:2}%'
.format(kept=total - discarded,
total=total,
perc=int(100. - discarded / float(total) * 100.)),
file=sys.stderr)
This will help for the Python 3 compatibility.
retest this please |
1 similar comment
retest this please |
@ctb @luizirber Updated; CR when possible, please. |
desired_coverage = cutoff | ||
ksize = htable.ksize() | ||
index = 0 | ||
# global total, discarded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. I'll eliminate this in the next commit. (Assuming there's more stuff.)
I have several minor comments, but my major comment is really: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check? |
Hrm. That does make the most sense--and is in keeping with the spirit of refactoring. I never really understood the difference between normalize_..._and_check and regular normalize anywho. I'll distill down the functionality over the weekend--gonna be busy from this afternoon through tomorrow. The accounting for this will be so much fun. |
@ctb Updated |
Looking much neater, but my major comment remains: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check? |
I did that. Which reminds me, there's some stuff I can move. |
htable.save(hashname) | ||
f, htable, args.single_output_file, | ||
args.fail_save, args.paired, args.force, norm, report_fp) | ||
corrupt_files = corrupt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could just pull from norm.corrupt_files
instead.
I'm kind of allergic to the use of 'norm' in WithDiagnostics, but will accept it as a short-term evil :). (If you didn't have 'norm' in there, this could be a more generally useful class; as it is, it's only usable in this script.) I also think the code will become much cleaner with broken_paired_iterator. So I'm -0 on merging this right now but will do so if you fix #1000 in this PR, and then promise to move quickly on the next stage of the refactor! |
Conflicts: ChangeLog
also added correct seq substituting
retest this please |
@ctb I agree re norm in WithDiagnostics--I didn't really know of an easier way to get the information into WithDiagnostics that we needed (that being the rolling total/discarded counts). I've implemented @drtamermansour's fix for the PE countings in such a way that we maintain the current behavior AFAICT. I've got to get rid of Next stage of the refactor would be implementing broken_paired_reader, which looks fun. I'll get to it as soon as this is CR'd/merged. |
OK, looks good to me. Let's get on that 2nd round cleanup tho :) |
Normalize-by-median refactor: Stage 0
Initial implementation of context managers of move some boilerplate out of the way in normalize-by-median; addresses parts of #1006.
Much more to come....