More datasets and regression problems #53
Comments
Re more datasets: szilard/GBM-perf#4 (comment). My focus now is on the top GBM implementations (including on GPUs). Doing more by doing less. I dockerized the most important things in a separate repo: https://github.com/szilard/GBM-perf Also see this summary I wrote recently: https://github.com/szilard/benchm-ml#summary
I just watched your talk, very interesting. In my opinion, one of the directions that should be developed further (and which you already mentioned) is AutoML: packages for automatic tuning, automatic ensembling, automatic feature engineering, etc., in a time-efficient way.
Oh, I forgot to mention in my last comment: re OpenML, those datasets are ridiculously small: https://gist.github.com/szilard/b82635fa9060227514af3423b3225a29 There is another set of datasets that is also too small: https://gist.github.com/szilard/d8279374646fb5f372317db2a4074f2f I would want a set of datasets with sizes from 1K to 10M with a median size of 100K (so it should cover 1K-10K-100K-1M-10M). Re AutoML: indeed, that's super interesting. However, benchmarking it is much more difficult because of the tricky tradeoff between computation time and accuracy. I've been looking at a few solutions, but nothing formal (I just tried them out). Btw, most of them have GBMs as building blocks, so benchmarking the components can already give you some idea of performance. Btw, when you say my talk, is it the KDD one? That's probably the most up to date, though my experiments with AutoML and a few other things/results happened after the talk.
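To make the suggested size grid concrete, here is a minimal sketch (assuming Python with NumPy; the dataset is a synthetic stand-in, not one of the benchmark datasets) of subsampling one large dataset at the log-spaced sizes 1K through 10M, whose median is 100K:

```python
import numpy as np

# Log-spaced size grid from the comment above: 1K, 10K, 100K, 1M, 10M.
# The median of this grid is 100K.
SIZES = [10 ** k for k in range(3, 8)]

def subsample(X, n, seed=0):
    """Draw n rows without replacement from dataset X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx]

if __name__ == "__main__":
    # Synthetic stand-in for a real dataset; for the demo we only
    # materialize 100K rows and subsample the sizes that fit.
    rng = np.random.default_rng(42)
    X_full = rng.standard_normal((100_000, 5))
    for n in [s for s in SIZES if s <= X_full.shape[0]]:
        print(n, subsample(X_full, n).shape)
```

In a real benchmark each subsample would then be fed to the same training/timing harness, so speed and accuracy can be compared across the whole size range on identical data distributions.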
OK, there are only a few datasets with size above 10K in the OpenML or PMLB benchmarking suites. The AutoML solutions should have a time-constraint parameter, so that one can, e.g., compare the results after 1 hour across these algorithms. Of course, in reality they often lack this feature. Yes, the KDD one, quite inspiring.
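The equal-time-budget comparison described above could be sketched as follows. This is a toy illustration, not any real AutoML package: the two "tuners" are a random search and a grid refinement over a 1-D objective, and the budget is shrunk to a fraction of a second instead of an hour.

```python
import random
import time

def objective(x):
    # Toy stand-in for model accuracy; optimum at x = 0.3.
    return -(x - 0.3) ** 2

def random_search(budget_s, rng):
    """Sample uniformly until the wall-clock budget runs out."""
    deadline = time.monotonic() + budget_s
    best = float("-inf")
    while time.monotonic() < deadline:
        best = max(best, objective(rng.uniform(0.0, 1.0)))
    return best

def coarse_grid(budget_s, rng):
    """Sweep an ever-finer grid until the budget runs out."""
    deadline = time.monotonic() + budget_s
    best, step = float("-inf"), 0.5
    while time.monotonic() < deadline and step > 1e-9:
        best = max(best, max(objective(i * step)
                             for i in range(int(1 / step) + 1)))
        step /= 2
    return best

if __name__ == "__main__":
    budget = 0.05  # seconds; a real comparison would use e.g. 1 hour
    rng = random.Random(0)
    for name, tuner in [("random_search", random_search),
                        ("coarse_grid", coarse_grid)]:
        print(name, round(tuner(budget, rng), 6))
```

The point of the fixed deadline is that every candidate spends the same wall-clock time, so the reported best scores are directly comparable; without such a parameter, as noted above, time-vs-accuracy tradeoffs make the comparison ill-defined.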
Did you consider using more datasets?
And how about regression problems?
There is, for example, this benchmarking suite, accessible via the OpenML packages: https://arxiv.org/abs/1708.03731