Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing --max-mem and auto'd hashtable agrs #1117

Closed
bocajnotnef opened this issue Jun 23, 2015 · 24 comments
Closed

Implementing --max-mem and auto'd hashtable agrs #1117

bocajnotnef opened this issue Jun 23, 2015 · 24 comments

Comments

@bocajnotnef
Copy link
Contributor

Related to: #1050, #390, #1106, #732

Currently: @ctb is implementing --max-mem in #1050 where, given a memory ceiling, automagically calculates graph sizes
#1106 also has methods for automagically calculating graph sizes given number of unique kmers, memory ceiling and/or desired FP rate.

So, my question: is it worth including the methods added in #1106 more widely if #1050 is going to address some/most of the usability?

If so, I move that the argparser stuff currently here https://github.com/dib-lab/khmer/pull/1106/files#diff-c9f39ec665e36b5911296d1f4028270cR61 gets moved into the oxli module so it can be added to scripts easily (a la threading args). Some additional processing stuff will need to be added to the scripts that implement this but it should be along the lines of "if memory options then use returned values"

Sticking points:

making sure #1106 doesn't overlap with #1050 in terms of function--#1050 has --max-mem already, so #1106 might be limited to tablesize estimations and FP-rate based estimations.

implementing the size calculations via argparser without duplicating code--Should be fairly easy to just use the added oxli functions but we all know how "fairly easy" stuff tends to go when I get involved.

@ctb @camillescott @luizirber @mr-c @macmanes @kdmurray91 @drtamermansour Thoughts?

@ctb
Copy link
Member

ctb commented Jun 23, 2015

On Tue, Jun 23, 2015 at 03:53:21PM -0700, Jake Fenton wrote:

Related to: #1050, #390, #1106, #732
#1106 also has methods for automagically calculating graph sizes given number of unique kmers, memory ceiling and/or desired FP rate.

So, my question: is it worth including the methods added in #1106 more widely if #1050 is going to address some/most of the usability?

Question: from reading through #1106 it looks to me like all of your methods
require some information (desired FP rate, or number of k-mers) along with
desired memory usage. Is that correct?

@bocajnotnef
Copy link
Contributor Author

Yes. Specifically, number of kmers paired with either desired FP or desired memory.

@ctb
Copy link
Member

ctb commented Jun 23, 2015

On Tue, Jun 23, 2015 at 04:00:45PM -0700, Jake Fenton wrote:

Yes. Specifically, number of kmers paired with either desired FP or desired memory.

So in that case I see no conflict in the feature sets. I think #1050 will
be (is already) a sound platform on which to build in your additional
functionality. No?

@ctb
Copy link
Member

ctb commented Jun 23, 2015

And, more explicitly, #1050 is much more about refactoring and cleanup than it is about new functionality, while #1106 is adding some useful new features. So again, no conflict.

@bocajnotnef
Copy link
Contributor Author

Ah, fair point. From my reading of #1050 it looked like --max-mem took care of some of what #1106 was made to accomplish

@ctb
Copy link
Member

ctb commented Jun 23, 2015

On Tue, Jun 23, 2015 at 04:04:21PM -0700, Jake Fenton wrote:

Ah, fair point. From my reading of #1050 it looked like --max-mem took care of some of what #1106 was made to accomplish

Do you still feel that way? :)

@bocajnotnef
Copy link
Contributor Author

#1106 is adding some useful new features.

Not any more : D

@ctb
Copy link
Member

ctb commented Jun 24, 2015

Great - can you outline a new plan of attack?

@bocajnotnef
Copy link
Contributor Author

Right, so, new plan: implement new args (perhaps using a subcommand structure?) that take in no. of kmer and (fp rate or max mem), perhaps pulling from --max-mem if it's set. Using the algorithms that're being added to oxli in #1106 we'll then automagically get the rest of the parameters.

What would the targets for this be? Since we're making args for graph settings I would assume this would affect diginorm and load-graph

@ctb
Copy link
Member

ctb commented Jun 28, 2015

I would start by taking into account desired FP rates, and allow for specification of N alone (at which point max memory usage is automatically calculated from the default FP rate).

For diginorm, fp of 0.1 is acceptable. For load-graph, I would go with .01.

@bocajnotnef
Copy link
Contributor Author

I assume I should allow for overriding of FP rates, but use a default value and not require that one be set?

@bocajnotnef
Copy link
Contributor Author

Also, from reading the relevant issues and comments (specifically #732 (comment)) I'm taking this to mean we'll take in a specified N and M to set a good hash size, but given an unset M we default to estimating based off of what we think a good FP rate is--Yes?

@ctb
Copy link
Member

ctb commented Jun 29, 2015

All sounds good.

Titus Brown, [email protected]

On Jun 28, 2015, at 8:04 PM, Jake Fenton [email protected] wrote:

Also, from reading the relevant issues and comments (specifically #732 (comment)) I'm taking this to mean we'll take in a specified N and M to set a good hash size, but given an unset M we default to estimating based off of what we think a good FP rate is--Yes?


Reply to this email directly or view it on GitHub.

@bocajnotnef
Copy link
Contributor Author

@ctb Alright, I'm back to feeling like I'm duplicating work. I skimmed through the docs for #1050 (specifically https://github.com/dib-lab/khmer/pull/1050/files#diff-f774d7b19ba2beb15a8f5a125c400abeR8) and then glanced through your additions to khmerags and it seems a lot like the functionality I've proposed is essentially already in place, albiet without FP rate handling.

If anything the stuff I'd be adding would be somewhat less useful as it requires knowing the number of unique khmers up front in order to do any of the calculations.

It seems like there's something I'm missing, here.

@ctb
Copy link
Member

ctb commented Jun 29, 2015

Know N, want M set automatically to minimum.

Know M, fix M, want it to fail when fp is going to be too high.

Both are valid use cases that require knowing N. All good.

On Jun 28, 2015, at 8:57 PM, Jake Fenton [email protected] wrote:

@ctb Alright, I'm back to feeling like I'm duplicating work. I skimmed through the docs for #1050 (specifically https://github.com/dib-lab/khmer/pull/1050/files#diff-f774d7b19ba2beb15a8f5a125c400abeR8) and then glanced through your additions to khmerags and it seems a lot like the functionality I've proposed is essentially already in place, albiet without FP rate handling.

If anything the stuff I'd be adding would be somewhat less useful as it requires knowing the number of unique khmers up front in order to do any of the calculations.


Reply to this email directly or view it on GitHub.

@bocajnotnef
Copy link
Contributor Author

Okay. Makes sense, understanding acquired. Thanks.

@bocajnotnef
Copy link
Contributor Author

Should I go ahead and add this into all scripts that create a hashtable? And if so, what FP should I use for them?

@ctb
Copy link
Member

ctb commented Jul 29, 2015

On Wed, Jul 29, 2015 at 02:47:40PM -0700, Jake Fenton wrote:

Should I go ahead and add this into all scripts that create a hashtable? And if so, what FP should I use for them?

For now, go with a default FP rate of 0.15 at max, but provide a way to
override it in the function call.

Please change only 1-2 scripts, then let's take a look before doing any more...

@bocajnotnef
Copy link
Contributor Author

Currently, this is implemented in build/load-graph and normalize, so we have 1-2.

@ctb
Copy link
Member

ctb commented Jul 29, 2015

Where? In what pull request?

@bocajnotnef
Copy link
Contributor Author

#1126

@ctb
Copy link
Member

ctb commented Jul 30, 2015

OK, well, let's start a new pull request :). Here are a few requested changes:

  • change do_sanity_checking name to check_fp_rate
  • should this function only be called if -U is set? For now, I think so...
  • change oxutils shortname to oxfuncs, since it's `oxli.functions' you're importing.
  • get rid of "optimization" language in comments - it's automatic argument picking, not optimization :)

(The goal of all of this is to enable script readability.)

Is there a reason that this code isn't called generically in create_countgraph/other khmer_args functions? That seems like a better place for it than adding it into each script individually, but I'm probably missing something.

@bocajnotnef
Copy link
Contributor Author

The desired_fp rate varies from script to script, which is why we currently call it on a script-by-script basis--that said, it shouldn't be too hard to pass that along to create_countgraph.

@mr-c mr-c added this to the unscheduled milestone Jul 30, 2015
@ctb
Copy link
Member

ctb commented Aug 7, 2015

Fixed in #1214.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants