-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing --max-mem and auto'd hashtable agrs #1117
Comments
On Tue, Jun 23, 2015 at 03:53:21PM -0700, Jake Fenton wrote:
Question: from reading through #1106 it looks to me like all of your methods |
Yes. Specifically, number of kmers paired with either desired FP or desired memory. |
On Tue, Jun 23, 2015 at 04:00:45PM -0700, Jake Fenton wrote:
So in that case I see no conflict in the feature sets. I think #1050 will |
Not any more : D |
Great - can you outline a new plan of attack? |
Right, so, new plan: implement new args (perhaps using a subcommand structure?) that take in no. of kmer and (fp rate or max mem), perhaps pulling from --max-mem if it's set. Using the algorithms that're being added to oxli in #1106 we'll then automagically get the rest of the parameters. What would the targets for this be? Since we're making args for graph settings I would assume this would affect diginorm and load-graph |
I would start by taking into account desired FP rates, and allow for specification of N alone (at which point max memory usage is automatically calculated from the default FP rate). For diginorm, fp of 0.1 is acceptable. For load-graph, I would go with .01. |
I assume I should allow for overriding of FP rates, but use a default value and not require that one be set? |
Also, from reading the relevant issues and comments (specifically #732 (comment)) I'm taking this to mean we'll take in a specified N and M to set a good hash size, but given an unset M we default to estimating based off of what we think a good FP rate is--Yes? |
All sounds good. Titus Brown, [email protected]
|
@ctb Alright, I'm back to feeling like I'm duplicating work. I skimmed through the docs for #1050 (specifically https://github.com/dib-lab/khmer/pull/1050/files#diff-f774d7b19ba2beb15a8f5a125c400abeR8) and then glanced through your additions to khmerags and it seems a lot like the functionality I've proposed is essentially already in place, albiet without FP rate handling. If anything the stuff I'd be adding would be somewhat less useful as it requires knowing the number of unique khmers up front in order to do any of the calculations. It seems like there's something I'm missing, here. |
Know N, want M set automatically to minimum. Know M, fix M, want it to fail when fp is going to be too high. Both are valid use cases that require knowing N. All good.
|
Okay. Makes sense, understanding acquired. Thanks. |
Should I go ahead and add this into all scripts that create a hashtable? And if so, what FP should I use for them? |
On Wed, Jul 29, 2015 at 02:47:40PM -0700, Jake Fenton wrote:
For now, go with a default FP rate of 0.15 at max, but provide a way to Please change only 1-2 scripts, then let's take a look before doing any more... |
Currently, this is implemented in build/load-graph and normalize, so we have 1-2. |
Where? In what pull request? |
OK, well, let's start a new pull request :). Here are a few requested changes:
(The goal of all of this is to enable script readability.) Is there a reason that this code isn't called generically in |
The desired_fp rate varies from script to script, which is why we currently call it on a script-by-script basis--that said, it shouldn't be too hard to pass that along to create_countgraph. |
Fixed in #1214. |
Related to: #1050, #390, #1106, #732
Currently: @ctb is implementing --max-mem in #1050 where, given a memory ceiling, automagically calculates graph sizes
#1106 also has methods for automagically calculating graph sizes given number of unique kmers, memory ceiling and/or desired FP rate.
So, my question: is it worth including the methods added in #1106 more widely if #1050 is going to address some/most of the usability?
If so, I move that the argparser stuff currently here https://github.com/dib-lab/khmer/pull/1106/files#diff-c9f39ec665e36b5911296d1f4028270cR61 gets moved into the oxli module so it can be added to scripts easily (a la threading args). Some additional processing stuff will need to be added to the scripts that implement this but it should be along the lines of "if memory options then use returned values"
Sticking points:
making sure #1106 doesn't overlap with #1050 in terms of function--#1050 has --max-mem already, so #1106 might be limited to tablesize estimations and FP-rate based estimations.
implementing the size calculations via argparser without duplicating code--Should be fairly easy to just use the added oxli functions but we all know how "fairly easy" stuff tends to go when I get involved.
@ctb @camillescott @luizirber @mr-c @macmanes @kdmurray91 @drtamermansour Thoughts?
The text was updated successfully, but these errors were encountered: