
[ENH] access to internal data of forecasters #6914

Open
kdekker-kdr4 opened this issue Aug 8, 2024 · 18 comments

Labels
enhancement Adding new functionality
module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@kdekker-kdr4
Is your feature request related to a problem? Please describe.
The need for access to internal data arises from multiple angles:

  • Rolling update: Each forecaster has an update method to update the internal y, X data.
    However, not all of that data may be required for prediction. For example, with a regression-based model, you might train on a large amount of data but only need the last n timepoints to generate lags for the cutoff.
    A rolling update would facilitate this by keeping only the last N timepoints when called.
  • Model validation: In MLOps it can be relevant to compare the performance of older vs newer models. Forecasting with an older model then requires replacing its internal data with the latest data on which a newer model was trained.

Describe the solution you'd like

  • Rolling update: Add an argument nr_timepoints to BaseForecaster.update that allows filtering the internal data (see the sketch after this list).
  • Model validation: Allow deleting internal data. After this, the update mechanism can provide the new internal data.
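For illustration, the rolling update could look like this - nr_timepoints is a hypothetical argument, not existing sktime API:

forecaster.update(y_new, X=X_new, nr_timepoints=100)
# after this call, the internal _y and _X would hold only the
# last 100 timepoints per time series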

Describe alternatives you've considered
Currently I modify the internal data myself via the attributes _y and _X, but this is not future proof.

@kdekker-kdr4 kdekker-kdr4 added the enhancement Adding new functionality label Aug 8, 2024
@fkiraly
Collaborator

fkiraly commented Aug 8, 2024

interesting - as a starting point, may I kindly ask you to write some speculative code for both use cases? I don't think I've fully grokked them yet.

@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label Aug 8, 2024
@fkiraly fkiraly changed the title [ENH] access to internal data forecaster [ENH] access to internal data of forecasters Aug 8, 2024
@kdekker-kdr4
Author

Rolling update: filtering timepoints after adding the new data:

from sktime.split import temporal_train_test_split

# nr_timepoints: number of trailing timepoints to keep
# (self.nr_timepoints in my wrapper class)
if hasattr(forecaster, "_y") and forecaster._y is not None:
    _, forecaster._y = temporal_train_test_split(
        y=forecaster._y, test_size=nr_timepoints
    )

if hasattr(forecaster, "_X") and forecaster._X is not None:
    _, forecaster._X = temporal_train_test_split(
        y=forecaster._X, test_size=nr_timepoints
    )

Model validation: deleting data capability:

# drop the remembered training data entirely; the update mechanism
# can then repopulate the internal data
if hasattr(forecaster, "_y"):
    del forecaster._y
if hasattr(forecaster, "_X"):
    del forecaster._X

But for the capabilities asked above, the internal data update somehow needs to propagate to all elements in a pipeline, since the data can be stored multiple times via (shallow) copies. A speculative sketch follows below.
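This is not an existing sktime function; it assumes fitted pipeline components are reachable via a steps_ attribute holding (name, estimator) pairs, as in ForecastingPipeline:

def remove_internal_data(estimator):
    # drop the remembered data on this estimator, if present
    for attr in ("_y", "_X"):
        if hasattr(estimator, attr):
            delattr(estimator, attr)
    # recurse into fitted pipeline steps
    for _, component in getattr(estimator, "steps_", []):
        remove_internal_data(component)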

@fkiraly
Collaborator

fkiraly commented Aug 12, 2024

Hm, could you kindly add more text on why you are doing this and what the workflow tries to achieve?

@kdekker-kdr4
Author

To keep memory usage low: for some forecasters the internal data can become large, e.g. in hierarchical/panel/global forecasting cases. On our end we store the fitting data separately and only want to keep the data relevant for prediction in the internal data of the forecaster.

@kdekker-kdr4
Author

The above comment is about the rolling update part. For model validation, I think the reason is already clearly stated in the initial description. Let me know if there is a specific point you would like me to elaborate on.

@fkiraly
Collaborator

fkiraly commented Aug 12, 2024

Ahhh, I see. For deleting - or, more precisely, not remembering - the _X, _y data, there is a config field that turns it off: remember_data.
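A minimal usage sketch - NaiveForecaster here is just an example forecaster; whether predict still works with the data not remembered depends on the forecaster, as discussed further below:

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()
forecaster = NaiveForecaster(strategy="last")
# turn off remembering of _y, _X before fitting
forecaster.set_config(remember_data=False)
forecaster.fit(y, fh=[1, 2, 3])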

@kdekker-kdr4
Author

It comes close. Not remembering seems to me only relevant to set before a model is first fitted; it does not cover the case of updating the model after fitting. In that case the old data is, it seems, still kept, and there is no clean option to delete data once it is stored internally on the model.

@fkiraly
Collaborator

fkiraly commented Sep 30, 2024

Thanks for replying! Can you explain what you would wish instead? The config can be set after construction, before the model is fitted.

@kdekker-kdr4
Author

Two things would be nice (see the sketch after this list):

  • A public method to delete the internal data from a forecaster - and, e.g. in the case of a ForecastingPipeline, from all its steps as well.
  • A config option that defines how much internal data to keep, i.e. how many timepoints to keep per time series stored in the internal data. There is often no need - neither for training nor for prediction - to keep all the data, which is what currently happens after multiple update calls.
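Purely hypothetical sketch of both wishes - neither forget_data nor keep_timepoints exists in sktime, the names are illustrative:

# wish 1: public method that drops _y, _X on the forecaster and all steps
forecaster.forget_data()

# wish 2: config capping how many timepoints are kept per series
forecaster.set_config(keep_timepoints=48)
forecaster.update(y_new)  # internal data now holds at most 48 timepoints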

@hudsonhoch

Maybe this deserves a separate issue, but this would solve a problem I'm having with _StatsModelsAdapter forecasters, where calling set_config(remember_data=False) breaks prediction, since _predict uses some metadata items of self._y. Ideally, I would be able to delete all data that is not needed.

@fkiraly
Collaborator

fkiraly commented Nov 18, 2024

Yes, I forgot to add here that set_config(remember_data=False) is meant to solve this at least temporarily.

We should probably decouple all forecasters from the remembered data in _y, _X; this would allow a cleaner move to not remembering it at all.

@hudsonhoch, I agree that a separate issue would be great for the _StatsModelsAdapter. We should be able to fix the problem by adding lines in _fit that check for the presence of _X, _y and, if not present, write them. Roughly like the sketch below.
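A sketch only, assuming this runs inside _StatsModelsAdapter._fit with access to y and X; not actual sktime code:

# with remember_data=False, the boilerplate fit does not store _y, _X;
# write them here so _predict can still read the metadata it needs
if not hasattr(self, "_y") or self._y is None:
    self._y = y
if not hasattr(self, "_X") or self._X is None:
    self._X = X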

@hudsonhoch

Created issue #7405.

We should probably decouple all forecasters from the remembered data in _y, _X; this would allow a cleaner move to not remembering it at all.

I'm not exactly sure what this means, but what is the idea behind saving _y and _X rather than only the data needed for post-_fit methods? FWIW, I frequently hit memory issues, so I've resorted to writing wrapper classes to get rid of bloat.

@fkiraly
Collaborator

fkiraly commented Nov 18, 2024

@hudsonhoch, interesting - could you share the wrapper classes, and how you fixed the _StatsModelsAdapter? Perhaps in a draft PR?

I would be very curious - maybe we can use this to reduce the memory footprint.

@hudsonhoch

I suspect my wrapper classes wouldn't be much use in the module, as they are hacky and break things: they just add a remove_data method, which clears almost all of the data without any regard for the downstream repercussions.

The reason for my earlier question on _y and _X is that I think it makes more sense for _fit to be responsible for saving only the data necessary for downstream use. For example, _StatsModelsAdapter._predict only needs the following pieces of information: _y.index[0], len(_y), and _y.name.

@fkiraly
Collaborator

fkiraly commented Nov 19, 2024

They just add a remove_data method, which clears most all of the data without any care of the repercussions downstream.

What I would like to understand is: when do you use this?

For example, _StatsModelsAdapter._predict only needs the following pieces of information: _y.index[0], len(_y), and _y.name.

I see! These are data that are either saved in metadata fields already, or that can be remembered in _fit. That is what I meant: we just remember in _fit what we need, and do not access it via _y in _predict. That should be an easy modification to the adapter (a sketch follows the list below).

  • _y.index[0] is not stored outside _y, but we can remember it!
  • _y.name is inferred by the input checks and stored as part of _y_metadata, namely in the "feature_names" field. It should also be written to self.feature_names_in_. The _y_metadata dictionary is created even if remember_data=False.
  • len(_y) is not stored outside _y but could be remembered. We could also add it to _y_metadata, although this would require some changes to the datatypes metadata inference, by adding a new field. Perhaps it is better to remember it case by case, since lengths can differ in the panel case.
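As a sketch of the idea - attribute names like _y_first_index are illustrative, not existing sktime internals:

def _fit(self, y, X=None, fh=None):
    # remember only the pieces _predict needs, instead of all of _y
    self._y_first_index = y.index[0]
    self._y_len = len(y)
    self._y_name = y.name
    ...  # then fit the wrapped statsmodels estimator as usual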

Might I suggest two actions?

  1. issue and/or PR for uncoupling _StatsModelsAdapter from _X, _y
  2. new issue on uncoupling every forecaster from _X, _y. We could check this with a diagnostic test, seeing what happens if remember_data=False.

@hudsonhoch

I use my wrappers like the following:

forecaster.fit(y, X=fit_X)
p = forecaster.predict(fh=fh, X=predict_X)
# drop everything that is not needed after prediction
forecaster.remove_data()

where remove_data deletes everything I don't need later. I'm forecasting hourly data with >50 groups of ~5 forecasters each; >250 hourly forecasters quickly eat up my memory, so I only keep the bare necessities, for debugging purposes. A rough sketch of such a wrapper follows below.
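Simplified sketch of the kind of wrapper I mean - the real ones are hackier and forecaster-specific:

class SlimForecaster:
    """Thin wrapper that can drop the bulky remembered data after use."""

    def __init__(self, forecaster):
        self.forecaster = forecaster

    def fit(self, y, X=None, fh=None):
        self.forecaster.fit(y, X=X, fh=fh)
        return self

    def predict(self, fh=None, X=None):
        return self.forecaster.predict(fh=fh, X=X)

    def remove_data(self):
        # clear remembered data, accepting that update etc. may break
        for attr in ("_y", "_X"):
            if hasattr(self.forecaster, attr):
                delattr(self.forecaster, attr)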

Might I suggest two actions?

  1. issue and/or PR for uncoupling _StatsModelsAdapter from _X, _y
  2. new issue on uncoupling every forecaster from _X, _y. We could check this with a diagnostic test, seeing what happens if remember_data=False.

For 1., I'll submit a PR to fix [BUG] Unable to use _StatsModelsAdapter.predict if config remember_data=False #7405. For 2., I can create an issue for this.

@fkiraly
Collaborator

fkiraly commented Nov 19, 2024

I see. So you do not use update?

1 and 2 would be appreciated!

I would then comment on the memory footprint: perhaps we should, by default, not make forecasters remember _X, _y, but this will take some serious change cycles.

@hudsonhoch

I see. So you do not use update?

I don't, but maybe that's because I don't understand the use case. For my specific application, I just re-instantiate forecasters and fit with new data.
