
[ENH] access to internal data of forecasters #6914

Open
kdekker-kdr4 opened this issue Aug 8, 2024 · 18 comments

Labels
enhancement Adding new functionality
module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@kdekker-kdr4
Is your feature request related to a problem? Please describe.
The need for access to internal data arises from multiple angles:

  • Rolling update: Each forecaster has an update method to update the internal y, X data.
    However, not all of that data may be required for prediction. For example, with a regression-based model, you might train on a large amount of data but only need the last n timepoints to generate lags for the cutoff.
    A rolling update would facilitate this by keeping only the last N timepoints when called.
  • Model validation: In MLOps it can be relevant to compare the performance of older vs newer models. Forecasting with an older model then requires replacing its internal data with the latest data on which a newer model was trained.

Describe the solution you'd like

  • Rolling update: Add an argument nr_timepoints to BaseForecaster.update that allows filtering the internal data (see the sketch after this list).
  • Model validation: Allow deleting internal data. After this, the update mechanism can provide the new internal data.
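For illustration, the rolling update could look like this - nr_timepoints is a hypothetical argument, not existing sktime API:

forecaster.update(y_new, X=X_new, nr_timepoints=100)
# after this call, the internal _y and _X would hold only the
# last 100 timepoints per time series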

Describe alternatives you've considered
Currently I modify the internal data myself via the attributes _y and _X, but this is not future proof.

@kdekker-kdr4 kdekker-kdr4 added the enhancement Adding new functionality label Aug 8, 2024
@fkiraly
Collaborator

fkiraly commented Aug 8, 2024

interesting - as a starting point, may I kindly ask you to write some speculative code for both use cases? I don't think I've fully grokked them yet.

@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label Aug 8, 2024
@fkiraly fkiraly changed the title [ENH] access to internal data forecaster [ENH] access to internal data of forecasters Aug 8, 2024
@kdekker-kdr4
Author

Rolling update: filtering timepoints after adding the new data:

from sktime.split import temporal_train_test_split

# nr_timepoints: number of trailing timepoints to keep
# (self.nr_timepoints in my wrapper class)
if hasattr(forecaster, "_y") and forecaster._y is not None:
    _, forecaster._y = temporal_train_test_split(
        y=forecaster._y, test_size=nr_timepoints
    )

if hasattr(forecaster, "_X") and forecaster._X is not None:
    _, forecaster._X = temporal_train_test_split(
        y=forecaster._X, test_size=nr_timepoints
    )

Model validation: deleting data capability:

# drop the remembered training data entirely; the update mechanism
# can then repopulate the internal data
if hasattr(forecaster, "_y"):
    del forecaster._y
if hasattr(forecaster, "_X"):
    del forecaster._X

But for the capabilities asked above, the internal data update somehow needs to propagate to all elements in a pipeline, since the data can be stored multiple times via (shallow) copies. A speculative sketch follows below.
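This is not an existing sktime function; it assumes fitted pipeline components are reachable via a steps_ attribute holding (name, estimator) pairs, as in ForecastingPipeline:

def remove_internal_data(estimator):
    # drop the remembered data on this estimator, if present
    for attr in ("_y", "_X"):
        if hasattr(estimator, attr):
            delattr(estimator, attr)
    # recurse into fitted pipeline steps
    for _, component in getattr(estimator, "steps_", []):
        remove_internal_data(component)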

@fkiraly
Collaborator

fkiraly commented Aug 12, 2024

Hm, could you kindly add more text on why you are doing this and what the workflow tries to achieve?

@kdekker-kdr4
Author

To keep memory usage low: for some forecasters the internal data can become large, e.g. in hierarchical/panel/global forecasting cases. On our end we store the fitting data separately and only want to keep the data relevant for prediction in the internal data of the forecaster.

@kdekker-kdr4
Author

The above comment is about the rolling update part. For model validation, I think the reason is already clearly stated in the initial description. Let me know if there is a specific point you would like me to elaborate on.

@fkiraly
Collaborator

fkiraly commented Aug 12, 2024

Ahhh, I see. For deleting - or, more precisely, not remembering - the _X, _y data, there is a config field that turns it off: remember_data.
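A minimal usage sketch - NaiveForecaster here is just an example forecaster; whether predict still works with the data not remembered depends on the forecaster, as discussed further below:

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()
forecaster = NaiveForecaster(strategy="last")
# turn off remembering of _y, _X before fitting
forecaster.set_config(remember_data=False)
forecaster.fit(y, fh=[1, 2, 3])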

@kdekker-kdr4
Author

It comes close. Not remembering seems to me only relevant to set before a model is first fitted; it does not cover the case of updating the model after fitting. In that case the old data is, it seems, still kept, and there is no clean option to delete data once it is stored internally on the model.

@fkiraly
Collaborator

fkiraly commented Sep 30, 2024

Thanks for replying! Can you explain what you would wish instead? The config can be set after construction, before the model is fitted.

@kdekker-kdr4
Author

Two things would be nice (see the sketch after this list):

  • A public method to delete the internal data from a forecaster - and, e.g. in the case of a ForecastingPipeline, from all its steps as well.
  • A config option that defines how much internal data to keep, i.e. how many timepoints to keep per time series stored in the internal data. There is often no need - neither for training nor for prediction - to keep all the data, which is what currently happens after multiple update calls.
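Purely hypothetical sketch of both wishes - neither forget_data nor keep_timepoints exists in sktime, the names are illustrative:

# wish 1: public method that drops _y, _X on the forecaster and all steps
forecaster.forget_data()

# wish 2: config capping how many timepoints are kept per series
forecaster.set_config(keep_timepoints=48)
forecaster.update(y_new)  # internal data now holds at most 48 timepoints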

@hudsonhoch

Maybe this deserves a separate issue, but this would solve a problem I'm having with _StatsModelsAdapter forecasters, where calling set_config(remember_data=False) breaks prediction, since _predict uses some metadata items of self._y. Ideally, I would be able to delete all data that is not needed.

@fkiraly
Collaborator

fkiraly commented Nov 18, 2024

Yes, I forgot to add here that set_config(remember_data=False) is meant to solve this at least temporarily.

We should probably decouple all forecasters from the remembered data in _y, _X; this would allow a cleaner move to not remembering it at all.

@hudsonhoch, I agree that a separate issue would be great for the _StatsModelsAdapter. We should be able to fix the problem by adding lines in _fit that check for the presence of _X, _y and, if not present, write them. Roughly like the sketch below.
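A sketch only, assuming this runs inside _StatsModelsAdapter._fit with access to y and X; not actual sktime code:

# with remember_data=False, the boilerplate fit does not store _y, _X;
# write them here so _predict can still read the metadata it needs
if not hasattr(self, "_y") or self._y is None:
    self._y = y
if not hasattr(self, "_X") or self._X is None:
    self._X = X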

@hudsonhoch

Created issue #7405.

We should probably decouple all forecasters from the remembered data in _y, _X; this would allow a cleaner move to not remembering it at all.

I'm not exactly sure what this means, but what is the idea behind saving _y and _X rather than only the data needed for post-_fit methods? FWIW, I frequently hit memory issues, so I've resorted to writing wrapper classes to get rid of bloat.

@fkiraly
Collaborator

fkiraly commented Nov 18, 2024

@hudsonhoch, interesting - could you share the wrapper classes, and how you fixed the _StatsModelsAdapter? Perhaps in a draft PR?

I would be very curious - maybe we can use this to reduce the memory footprint.

@hudsonhoch

I suspect my wrapper classes wouldn't be much use in the module, as they are hacky and break things: they just add a remove_data method, which clears almost all of the data without any regard for the downstream repercussions.

The reason for my earlier question on _y and _X is that I think it makes more sense for _fit to be responsible for saving only the data necessary for downstream use. For example, _StatsModelsAdapter._predict only needs the following pieces of information: _y.index[0], len(_y), and _y.name.

@fkiraly
Collaborator

fkiraly commented Nov 19, 2024

They just add a remove_data method, which clears most all of the data without any care of the repercussions downstream.

What I would like to understand is: when do you use this?

For example, _StatsModelsAdapter._predict only needs the following pieces of information: _y.index[0], len(_y), and _y.name.

I see! These are data that are either saved in metadata fields already, or that can be remembered in _fit. That is what I meant: we just remember in _fit what we need, and do not access it via _y in _predict. That should be an easy modification to the adapter (a sketch follows the list below).

  • _y.index[0] is not stored outside _y, but we can remember it!
  • _y.name is inferred by the input checks and stored as part of _y_metadata, namely in the "feature_names" field. It should also be written to self.feature_names_in_. The _y_metadata dictionary is created even if remember_data=False.
  • len(_y) is not stored outside _y but could be remembered. We could also add it to _y_metadata, although this would require some changes to the datatypes metadata inference, by adding a new field. Perhaps it is better to remember it case by case, since lengths can differ in the panel case.
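As a sketch of the idea - attribute names like _y_first_index are illustrative, not existing sktime internals:

def _fit(self, y, X=None, fh=None):
    # remember only the pieces _predict needs, instead of all of _y
    self._y_first_index = y.index[0]
    self._y_len = len(y)
    self._y_name = y.name
    ...  # then fit the wrapped statsmodels estimator as usual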

Might I suggest two actions?

  1. issue and/or PR for uncoupling _StatsModelsAdapter from _X, _y
  2. new issue on uncoupling every forecaster from _X, _y. We could check this with a diagnostic test, seeing what happens if remember_data=False.

@hudsonhoch

I use my wrappers like the following:

forecaster.fit(y, X=fit_X)
p = forecaster.predict(fh=fh, X=predict_X)
# drop everything that is not needed after prediction
forecaster.remove_data()

where remove_data deletes everything I don't need later. I'm forecasting hourly data with >50 groups of ~5 forecasters each; >250 hourly forecasters quickly eat up my memory, so I only keep the bare necessities, for debugging purposes. A rough sketch of such a wrapper follows below.
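Simplified sketch of the kind of wrapper I mean - the real ones are hackier and forecaster-specific:

class SlimForecaster:
    """Thin wrapper that can drop the bulky remembered data after use."""

    def __init__(self, forecaster):
        self.forecaster = forecaster

    def fit(self, y, X=None, fh=None):
        self.forecaster.fit(y, X=X, fh=fh)
        return self

    def predict(self, fh=None, X=None):
        return self.forecaster.predict(fh=fh, X=X)

    def remove_data(self):
        # clear remembered data, accepting that update etc. may break
        for attr in ("_y", "_X"):
            if hasattr(self.forecaster, attr):
                delattr(self.forecaster, attr)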

Might I suggest two actions?

  1. issue and/or PR for uncoupling _StatsModelsAdapter from _X, _y
  2. new issue on uncoupling every forecaster from _X, _y. We could check this with a diagnostic test, seeing what happens if remember_data=False.

For 1., I'll submit a PR to fix [BUG] Unable to use _StatsModelsAdapter.predict if config remember_data=False #7405. For 2., I can create an issue for this.

@fkiraly
Collaborator

fkiraly commented Nov 19, 2024

I see. So you do not use update?

1 and 2 would be appreciated!

I would then comment on the memory footprint: perhaps we should, by default, not make forecasters remember _X, _y, but this will take some serious change cycles.

@hudsonhoch

I see. So you do not use update?

I don't, but maybe that's because I don't understand the use case. For my specific application, I just re-instantiate forecasters and fit with new data.
