Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize-by-median refactor: Stage 0 #1010

Merged
merged 22 commits into from
Jun 1, 2015
Merged

Normalize-by-median refactor: Stage 0 #1010

merged 22 commits into from
Jun 1, 2015

Conversation

bocajnotnef
Copy link
Contributor

Initial implementation of context managers of move some boilerplate out of the way in normalize-by-median; addresses parts of #1006.

Much more to come....

@@ -170,7 181,7 @@ def normalize_by_median_and_check(input_filename, htable, single_output_file,
.format(inp=input_filename, kept=total - discarded,
total=total, perc=int(100. - discarded /
float(total) * 100.))
print >> sys.stderr, 'output in', output_name
print >> sys.stderr, 'output in', outfp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be reverted

@bocajnotnef
Copy link
Contributor Author

@camillescott @ctb quick once-over to make sure I'm on the right track?

My concern is that so far this is just moving code around.

@ctb
Copy link
Member

ctb commented May 18, 2015

On Mon, May 18, 2015 at 08:26:31AM -0700, Jake Fenton wrote:

@camillescott @ctb quick once-over to make sure I'm on the right track?

My concern is that so far this is just moving code around.

didn't look, but please proceed with confidence. "just moving code around" is fine.

@bocajnotnef
Copy link
Contributor Author

Mkay. Proceeding with prejudice.

@bocajnotnef
Copy link
Contributor Author

@luizirber @camillescott @ctb CR, please?

@ctb
Copy link
Member

ctb commented May 22, 2015

Looks good on a quick skim. Please only ask for a CR after the Jenkins tests pass, though.

@ctb
Copy link
Member

ctb commented May 22, 2015

p.s. "just moving code around without breaking anything" is refactoring done properly. This looks like it's already a big improvement!

@bocajnotnef
Copy link
Contributor Author

Huh. I only changed the changelog, since the last commit--Jenkins shouldn't have broken.

Looks like code coverage dropped; denominator issue. I'll go write a test. (Batchwise stuff isn't tested at all, it seems.)

@ctb
Copy link
Member

ctb commented May 22, 2015

On Fri, May 22, 2015 at 08:06:16AM -0700, Jake Fenton wrote:

Huh. I only changed the changelog, since the last commit--Jenkins shouldn't have broken.

It was a general point, not a specific one :)

@@ -126,6 128,22 @@ def handle_error(error, output_name, input_name, fail_save, htable):
print >> sys.stderr, '** ERROR: problem removing corrupt filtered file'


@contextmanager
def FailSafe(ifile, ofile, save_on_fail, ht, corrupted, total, dicarded, jedi):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice pun with jedi/force, but it is harder to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to fit in 80 chars. I can change it, but I'll have to throw in a line break.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, because there is a typo in 'discarded' too

@bocajnotnef
Copy link
Contributor Author

@ctb Updated; CR?

@ctb
Copy link
Member

ctb commented May 22, 2015 via email

print '...saving to', hashname
else:
hashname = 'backup.ct'
print 'Nothing given for savetable, saving to', hashname
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be to stderr?

Copy link
Member

@ctb ctb May 22, 2015 via email

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bocajnotnef
Copy link
Contributor Author

I think I've refactored much of the obvious.

total = 0
discarded = 0
for index, batch in enumerate(batchwise(screed.open(
input_filename, parse_description=False), batch_size)):
if index > 0 and index % 100000 == 0:
print >>sys.stderr, '... kept {kept} of {total} or'\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this 'print' could usefully be encapsulated in a function; just a thought.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking about print, it would be nice to move to the print function syntax:

  • add from __future__ import print_function to the top of the script
  • move sys.stderr to the file parameter:
print('... kept {kept} of {total} or {perc:2}%'
      .format(kept=total - discarded,
              total=total,
              perc=int(100. - discarded / float(total) * 100.)),
      file=sys.stderr)

This will help for the Python 3 compatibility.

@bocajnotnef
Copy link
Contributor Author

retest this please

1 similar comment
@bocajnotnef
Copy link
Contributor Author

retest this please

@bocajnotnef
Copy link
Contributor Author

@ctb @luizirber Updated; CR when possible, please.

desired_coverage = cutoff
ksize = htable.ksize()
index = 0
# global total, discarded
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. I'll eliminate this in the next commit. (Assuming there's more stuff.)

@ctb
Copy link
Member

ctb commented May 29, 2015

I have several minor comments, but my major comment is really: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check?

@bocajnotnef
Copy link
Contributor Author

Hrm. That does make the most sense--and is in keeping with the spirit of refactoring.

I never really understood the difference between normalize_..._and_check and regular normalize anywho.

I'll distill down the functionality over the weekend--gonna be busy from this afternoon through tomorrow.

The accounting for this will be so much fun.

@bocajnotnef
Copy link
Contributor Author

@ctb Updated

@ctb
Copy link
Member

ctb commented Jun 1, 2015

Looking much neater, but my major comment remains: why not replace normalize_by_median(...) with a call to the Normalizer object, which would be created in main() and passed into normalize_by_median_and_check?

@bocajnotnef
Copy link
Contributor Author

I did that. normalize_by_median(...) no longer exists and all the processing that was done there has been moved to normalize_by_median_and_check(...), which gets a Normalizer object from main.

Which reminds me, there's some stuff I can move.

htable.save(hashname)
f, htable, args.single_output_file,
args.fail_save, args.paired, args.force, norm, report_fp)
corrupt_files = corrupt
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just pull from norm.corrupt_files instead.

@ctb
Copy link
Member

ctb commented Jun 1, 2015

I'm kind of allergic to the use of 'norm' in WithDiagnostics, but will accept it as a short-term evil :). (If you didn't have 'norm' in there, this could be a more generally useful class; as it is, it's only usable in this script.) I also think the code will become much cleaner with broken_paired_iterator.

So I'm -0 on merging this right now but will do so if you fix #1000 in this PR, and then promise to move quickly on the next stage of the refactor!

drtamermansour and others added 3 commits June 1, 2015 13:39
@bocajnotnef
Copy link
Contributor Author

retest this please

@bocajnotnef
Copy link
Contributor Author

@ctb I agree re norm in WithDiagnostics--I didn't really know of an easier way to get the information into WithDiagnostics that we needed (that being the rolling total/discarded counts).

I've implemented @drtamermansour's fix for the PE countings in such a way that we maintain the current behavior AFAICT.

I've got to get rid of tests/test-data/paired_withN.fa as I wiped out that test (or perhaps adapt it to check for the current behavior)

Next stage of the refactor would be implementing broken_paired_reader, which looks fun. I'll get to it as soon as this is CR'd/merged.

@ctb
Copy link
Member

ctb commented Jun 1, 2015

OK, looks good to me. Let's get on that 2nd round cleanup tho :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants