Implementing --max-mem and auto'd hashtable agrs #1117

bocajnotnef · 2015-06-23T22:53:20Z

Currently: @ctb is implementing --max-mem in #1050 where, given a memory ceiling, automagically calculates graph sizes
#1106 also has methods for automagically calculating graph sizes given number of unique kmers, memory ceiling and/or desired FP rate.

So, my question: is it worth including the methods added in #1106 more widely if #1050 is going to address some/most of the usability?

If so, I move that the argparser stuff currently here https://github.com/dib-lab/khmer/pull/1106/files#diff-c9f39ec665e36b5911296d1f4028270cR61 gets moved into the oxli module so it can be added to scripts easily (a la threading args). Some additional processing stuff will need to be added to the scripts that implement this but it should be along the lines of "if memory options then use returned values"

Sticking points:

making sure #1106 doesn't overlap with #1050 in terms of function--#1050 has --max-mem already, so #1106 might be limited to tablesize estimations and FP-rate based estimations.

implementing the size calculations via argparser without duplicating code--Should be fairly easy to just use the added oxli functions but we all know how "fairly easy" stuff tends to go when I get involved.

@ctb @camillescott @luizirber @mr-c @macmanes @kdmurray91 @drtamermansour Thoughts?

ctb · 2015-06-23T22:59:32Z

On Tue, Jun 23, 2015 at 03:53:21PM -0700, Jake Fenton wrote:

Related to: #1050, #390, #1106, #732
#1106 also has methods for automagically calculating graph sizes given number of unique kmers, memory ceiling and/or desired FP rate.

So, my question: is it worth including the methods added in #1106 more widely if #1050 is going to address some/most of the usability?

Question: from reading through #1106 it looks to me like all of your methods
require some information (desired FP rate, or number of k-mers) along with
desired memory usage. Is that correct?

bocajnotnef · 2015-06-23T23:00:44Z

Yes. Specifically, number of kmers paired with either desired FP or desired memory.

ctb · 2015-06-23T23:02:34Z

On Tue, Jun 23, 2015 at 04:00:45PM -0700, Jake Fenton wrote:

Yes. Specifically, number of kmers paired with either desired FP or desired memory.

So in that case I see no conflict in the feature sets. I think #1050 will
be (is already) a sound platform on which to build in your additional
functionality. No?

ctb · 2015-06-23T23:03:28Z

And, more explicitly, #1050 is much more about refactoring and cleanup than it is about new functionality, while #1106 is adding some useful new features. So again, no conflict.

bocajnotnef · 2015-06-23T23:04:20Z

Ah, fair point. From my reading of #1050 it looked like --max-mem took care of some of what #1106 was made to accomplish

ctb · 2015-06-23T23:05:08Z

On Tue, Jun 23, 2015 at 04:04:21PM -0700, Jake Fenton wrote:

Ah, fair point. From my reading of #1050 it looked like --max-mem took care of some of what #1106 was made to accomplish

Do you still feel that way? :)

bocajnotnef · 2015-06-23T23:06:19Z

#1106 is adding some useful new features.

Not any more : D

ctb · 2015-06-24T14:47:02Z

Great - can you outline a new plan of attack?

bocajnotnef · 2015-06-24T18:12:20Z

Right, so, new plan: implement new args (perhaps using a subcommand structure?) that take in no. of kmer and (fp rate or max mem), perhaps pulling from --max-mem if it's set. Using the algorithms that're being added to oxli in #1106 we'll then automagically get the rest of the parameters.

What would the targets for this be? Since we're making args for graph settings I would assume this would affect diginorm and load-graph

ctb · 2015-06-28T17:37:43Z

I would start by taking into account desired FP rates, and allow for specification of N alone (at which point max memory usage is automatically calculated from the default FP rate).

For diginorm, fp of 0.1 is acceptable. For load-graph, I would go with .01.

bocajnotnef · 2015-06-29T03:01:46Z

I assume I should allow for overriding of FP rates, but use a default value and not require that one be set?

bocajnotnef · 2015-06-29T03:04:17Z

Also, from reading the relevant issues and comments (specifically #732 (comment)) I'm taking this to mean we'll take in a specified N and M to set a good hash size, but given an unset M we default to estimating based off of what we think a good FP rate is--Yes?

ctb · 2015-06-29T03:19:14Z

All sounds good.

Titus Brown, [email protected]

On Jun 28, 2015, at 8:04 PM, Jake Fenton [email protected] wrote:

Also, from reading the relevant issues and comments (specifically #732 (comment)) I'm taking this to mean we'll take in a specified N and M to set a good hash size, but given an unset M we default to estimating based off of what we think a good FP rate is--Yes?

—
Reply to this email directly or view it on GitHub.

bocajnotnef · 2015-06-29T03:57:52Z

@ctb Alright, I'm back to feeling like I'm duplicating work. I skimmed through the docs for #1050 (specifically https://github.com/dib-lab/khmer/pull/1050/files#diff-f774d7b19ba2beb15a8f5a125c400abeR8) and then glanced through your additions to khmerags and it seems a lot like the functionality I've proposed is essentially already in place, albiet without FP rate handling.

If anything the stuff I'd be adding would be somewhat less useful as it requires knowing the number of unique khmers up front in order to do any of the calculations.

It seems like there's something I'm missing, here.

ctb · 2015-06-29T04:10:14Z

Know N, want M set automatically to minimum.

Know M, fix M, want it to fail when fp is going to be too high.

Both are valid use cases that require knowing N. All good.

On Jun 28, 2015, at 8:57 PM, Jake Fenton [email protected] wrote:

@ctb Alright, I'm back to feeling like I'm duplicating work. I skimmed through the docs for #1050 (specifically https://github.com/dib-lab/khmer/pull/1050/files#diff-f774d7b19ba2beb15a8f5a125c400abeR8) and then glanced through your additions to khmerags and it seems a lot like the functionality I've proposed is essentially already in place, albiet without FP rate handling.

If anything the stuff I'd be adding would be somewhat less useful as it requires knowing the number of unique khmers up front in order to do any of the calculations.

—
Reply to this email directly or view it on GitHub.

bocajnotnef · 2015-06-29T04:10:57Z

Okay. Makes sense, understanding acquired. Thanks.

bocajnotnef · 2015-07-29T21:47:40Z

Should I go ahead and add this into all scripts that create a hashtable? And if so, what FP should I use for them?

ctb · 2015-07-29T22:54:31Z

On Wed, Jul 29, 2015 at 02:47:40PM -0700, Jake Fenton wrote:

Should I go ahead and add this into all scripts that create a hashtable? And if so, what FP should I use for them?

For now, go with a default FP rate of 0.15 at max, but provide a way to
override it in the function call.

Please change only 1-2 scripts, then let's take a look before doing any more...

bocajnotnef · 2015-07-29T23:13:21Z

Currently, this is implemented in build/load-graph and normalize, so we have 1-2.

ctb · 2015-07-29T23:24:44Z

Where? In what pull request?

bocajnotnef · 2015-07-29T23:34:57Z

#1126

ctb · 2015-07-30T14:39:25Z

OK, well, let's start a new pull request :). Here are a few requested changes:

change do_sanity_checking name to check_fp_rate
should this function only be called if -U is set? For now, I think so...
change oxutils shortname to oxfuncs, since it's `oxli.functions' you're importing.
get rid of "optimization" language in comments - it's automatic argument picking, not optimization :)

(The goal of all of this is to enable script readability.)

Is there a reason that this code isn't called generically in create_countgraph/other khmer_args functions? That seems like a better place for it than adding it into each script individually, but I'm probably missing something.

bocajnotnef · 2015-07-30T18:02:14Z

The desired_fp rate varies from script to script, which is why we currently call it on a script-by-script basis--that said, it shouldn't be too hard to pass that along to create_countgraph.

ctb · 2015-08-07T17:26:52Z

Fixed in #1214.

bocajnotnef added question discussion-needed labels Jun 23, 2015

ctb mentioned this issue Jun 29, 2015

User experience upgrades for khmer #732

Closed

This was referenced Jun 29, 2015

Implement auto'd optimal hashtable args and memory limitation checking #1126

Merged

hashtable arg estimation functions may be wrong #1146

Closed

ctb mentioned this issue Jul 29, 2015

Deal appropriately with --unique-kmers argument #1179

Closed

mr-c added this to the unscheduled milestone Jul 30, 2015

bocajnotnef mentioned this issue Jul 30, 2015

Cleaning up automatic argument setting #1214

Merged

ctb closed this as completed Aug 7, 2015

ctb mentioned this issue Jun 22, 2017

Checking how full a CQF is splatlab/cqf#4

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing --max-mem and auto'd hashtable agrs #1117

Implementing --max-mem and auto'd hashtable agrs #1117

bocajnotnef commented Jun 23, 2015

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 23, 2015

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 24, 2015

bocajnotnef commented Jun 24, 2015

ctb commented Jun 28, 2015

bocajnotnef commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

ctb commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

ctb commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 30, 2015

bocajnotnef commented Jul 30, 2015

ctb commented Aug 7, 2015

Implementing --max-mem and auto'd hashtable agrs #1117

Implementing --max-mem and auto'd hashtable agrs #1117

Comments

bocajnotnef commented Jun 23, 2015

making sure #1106 doesn't overlap with #1050 in terms of function--#1050 has --max-mem already, so #1106 might be limited to tablesize estimations and FP-rate based estimations.

implementing the size calculations via argparser without duplicating code--Should be fairly easy to just use the added oxli functions but we all know how "fairly easy" stuff tends to go when I get involved.

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 23, 2015

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 23, 2015

bocajnotnef commented Jun 23, 2015

ctb commented Jun 24, 2015

bocajnotnef commented Jun 24, 2015

ctb commented Jun 28, 2015

bocajnotnef commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

ctb commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

ctb commented Jun 29, 2015

bocajnotnef commented Jun 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 29, 2015

bocajnotnef commented Jul 29, 2015

ctb commented Jul 30, 2015

bocajnotnef commented Jul 30, 2015

ctb commented Aug 7, 2015