More datasets and regression problems #53
Comments
Re more datasets: szilard/GBM-perf#4 (comment). My focus now is on the top GBM implementations (including on GPUs). Doing more by doing less. I dockerized the most important things in a separate repo: https://github.com/szilard/GBM-perf Also see this summary I wrote recently: https://github.com/szilard/benchm-ml#summary
I just watched your talk, very interesting. In my opinion, one of the directions that should be developed further (and which you already mentioned) is AutoML: packages for automatic tuning, automatic ensembling, automatic feature engineering, etc., in a time-efficient way.
Oh, I forgot to mention in my last comment: re OpenML, those datasets are ridiculously small: https://gist.github.com/szilard/b82635fa9060227514af3423b3225a29 There is another set of datasets that is also too small: https://gist.github.com/szilard/d8279374646fb5f372317db2a4074f2f I would want a set of datasets with sizes from 1K to 10M with a median size of 100K (so it should cover 1K-10K-100K-1M-10M). Re AutoML: indeed, that's super interesting. However, benchmarking it is much more difficult because of the tricky tradeoff between computation time and accuracy. I've been looking at a few solutions, but nothing formal (I just tried them out). Btw, most of them have GBMs as building blocks, so benchmarking the components can already give you some idea of performance. Btw, when you say my talk, is it the KDD one? That's probably the most up to date, though my experiments with AutoML and a few other things/results happened after the talk.
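To make the suggested size grid concrete, here is a minimal sketch (assuming Python with NumPy; the dataset is a synthetic stand-in, not one of the benchmark datasets) of subsampling one large dataset at the log-spaced sizes 1K through 10M, whose median is 100K:

```python
import numpy as np

# Log-spaced size grid from the comment above: 1K, 10K, 100K, 1M, 10M.
# The median of this grid is 100K.
SIZES = [10 ** k for k in range(3, 8)]

def subsample(X, n, seed=0):
    """Draw n rows without replacement from dataset X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx]

if __name__ == "__main__":
    # Synthetic stand-in for a real dataset; for the demo we only
    # materialize 100K rows and subsample the sizes that fit.
    rng = np.random.default_rng(42)
    X_full = rng.standard_normal((100_000, 5))
    for n in [s for s in SIZES if s <= X_full.shape[0]]:
        print(n, subsample(X_full, n).shape)
```

In a real benchmark each subsample would then be fed to the same training/timing harness, so speed and accuracy can be compared across the whole size range on identical data distributions.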
OK, there are only a few datasets with size above 10K in the OpenML or PMLB benchmarking suites. The AutoML solutions should have a time-constraint parameter, so that one can, e.g., compare the results after 1 hour across these algorithms. Of course, in reality they often lack this feature. Yes, the KDD one, quite inspiring.
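The equal-time-budget comparison described above could be sketched as follows. This is a toy illustration, not any real AutoML package: the two "tuners" are a random search and a grid refinement over a 1-D objective, and the budget is shrunk to a fraction of a second instead of an hour.

```python
import random
import time

def objective(x):
    # Toy stand-in for model accuracy; optimum at x = 0.3.
    return -(x - 0.3) ** 2

def random_search(budget_s, rng):
    """Sample uniformly until the wall-clock budget runs out."""
    deadline = time.monotonic() + budget_s
    best = float("-inf")
    while time.monotonic() < deadline:
        best = max(best, objective(rng.uniform(0.0, 1.0)))
    return best

def coarse_grid(budget_s, rng):
    """Sweep an ever-finer grid until the budget runs out."""
    deadline = time.monotonic() + budget_s
    best, step = float("-inf"), 0.5
    while time.monotonic() < deadline and step > 1e-9:
        best = max(best, max(objective(i * step)
                             for i in range(int(1 / step) + 1)))
        step /= 2
    return best

if __name__ == "__main__":
    budget = 0.05  # seconds; a real comparison would use e.g. 1 hour
    rng = random.Random(0)
    for name, tuner in [("random_search", random_search),
                        ("coarse_grid", coarse_grid)]:
        print(name, round(tuner(budget, rng), 6))
```

The point of the fixed deadline is that every candidate spends the same wall-clock time, so the reported best scores are directly comparable; without such a parameter, as noted above, time-vs-accuracy tradeoffs make the comparison ill-defined.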
Did you consider using more datasets?
And how about regression problems?
There is, for example, this benchmarking suite, accessible via the OpenML packages: https://arxiv.org/abs/1708.03731