[ENH] access to internal data of forecasters #6914
Interesting - as a starting point, may I kindly ask you to write some speculative code for both use cases? I don't think I've fully grokked them yet.
Rolling update: filtering timepoints after adding the new data:

```python
from sktime.split import temporal_train_test_split

if hasattr(forecaster, "_y") and forecaster._y is not None:
    # keep only the last nr_timepoints of the remembered endogenous data
    _, forecaster._y = temporal_train_test_split(
        y=forecaster._y, test_size=self.nr_timepoints
    )
if hasattr(forecaster, "_X") and forecaster._X is not None:
    # same trim for the remembered exogenous data; the split is purely
    # index-based, so passing _X as y works
    _, forecaster._X = temporal_train_test_split(
        y=forecaster._X, test_size=self.nr_timepoints
    )
```

Model validation: deleting-data capability:

```python
if hasattr(forecaster, "_y"):
    del forecaster._y
if hasattr(forecaster, "_X"):
    del forecaster._X
```

But for the capabilities asked for above, the internal data update somehow needs to propagate to all elements in the pipeline, as data can be stored multiple times via (shallow) copies.
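A hypothetical sketch of what such propagation could look like - the helper name `trim_remembered_data` and the `steps_` traversal are illustrative assumptions, not an existing sktime API:

```python
from sktime.split import temporal_train_test_split


def trim_remembered_data(estimator, nr_timepoints):
    """Keep only the last nr_timepoints of _y/_X, recursing into components."""
    for attr in ("_y", "_X"):
        data = getattr(estimator, attr, None)
        if data is not None:
            # index-based split, so an exogenous frame can be passed as y too
            _, tail = temporal_train_test_split(data, test_size=nr_timepoints)
            setattr(estimator, attr, tail)
    # recurse into fitted pipeline components, assuming a steps_-style list
    for step in getattr(estimator, "steps_", []):
        component = step[-1] if isinstance(step, tuple) else step
        trim_remembered_data(component, nr_timepoints)
```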
Hm, could you kindly add more text on why you are doing this, and what the workflow tries to achieve?
To keep memory low: for some forecasters, the internal data can become large, as in hierarchical/panel/global forecasting cases. On our end, we keep the data for fitting stored separately, and only want to keep the data relevant for prediction in the internal data of the forecaster.
The above comment is for the rolling update part. For the model validation part, I think the reason is already clearly stated in the initial description. Let me know if there is a specific point you would like me to elaborate on.
Ahhh, I see. For deleting - or, more precisely, not remembering - the data, there is the `remember_data` config flag. Does that cover your use case?
It comes close. Not remembering seems to me only relevant when set before a model is first fitted; it does not cover the case after fitting, when updating the model. In that case the old data is still kept, and there is no clean option to delete data once it is stored internally on the model.
Thanks for replying! Can you explain what you would wish for instead? The config can be set after construction, before the model is fitted.
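For illustration, a minimal sketch of that workflow, assuming the `remember_data` config flag discussed above (`NaiveForecaster` and the airline dataset are placeholders only):

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()

forecaster = NaiveForecaster(strategy="last")
# set after construction, before the first fit, as described above
forecaster.set_config(**{"remember_data": False})
forecaster.fit(y, fh=[1, 2, 3])
# with remember_data off, the forecaster should not retain _y/_X after fit
```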
Two things would be nice:

1. a clean, public way to delete the remembered data (`_y`, `_X`) after fitting, propagating through pipeline components, and
2. a rolling update option that keeps only the last N timepoints of the remembered data.
Maybe this deserves a separate issue, but this would solve a problem I'm having with the memory footprint of fitted forecasters.
Yes, I forgot to add that here. We should probably decouple all forecasters from the remembered data. @hudsonhoch, I agree that a separate issue would be great for the problem you describe.
Created issue #7405.
I'm not exactly sure what this means, but what is the idea behind saving `_y` and `_X` on the forecaster in the first place?
@hudsonhoch, interesting - could you share the wrapper classes, and how you fixed the memory issue? I would be very curious - maybe we can use this to reduce the memory footprint.
I suspect my wrapper classes wouldn't be much use in the module, as they are hacky and break things: they just add a `remove_data` method that deletes `_y` and `_X` from the fitted forecaster. The reason for my earlier question on saving `_y` and `_X` is what I would like to understand: when do you use this data?
I see! These are actually data either saved in metadata fields, or they can be remembered in `_y` and `_X`.
Might I suggest two actions?
I use my wrappers like the following:

```python
forecaster.fit(y, X=fit_X)
p = forecaster.predict(fh=fh, X=predict_X)
forecaster.remove_data()
```

where `remove_data` deletes the remembered `_y` and `_X` from the fitted forecaster.
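A hypothetical reconstruction of what such a wrapper could look like - not the actual code from this thread; `DataRemoverMixin` and `SlimNaiveForecaster` are made-up names:

```python
from sktime.forecasting.naive import NaiveForecaster


class DataRemoverMixin:
    """Adds remove_data, which drops the remembered _y and _X in place."""

    def remove_data(self):
        for attr in ("_y", "_X"):
            if hasattr(self, attr):
                delattr(self, attr)


class SlimNaiveForecaster(DataRemoverMixin, NaiveForecaster):
    """A NaiveForecaster whose remembered data can be dropped after predict."""
```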
I see. So you do not use `update`? 1 and 2 would be appreciated! I would then comment on the memory footprint - perhaps we should by default not make forecasters remember `_y` and `_X` unless they need it.
I don't, but maybe that's because I don't understand the use case. For my specific application, I just re-instantiate forecasters and fit with new data. |
Is your feature request related to a problem? Please describe.
The need for access to internal data arises from multiple angles.
However, not all data might be required for prediction. For example, with a regression-based model, you might train on a large amount of data, but only need the last n timepoints to generate lags at the cutoff.
A rolling update would facilitate this by keeping only the last N timepoints when called.
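A toy illustration of the lag argument (the series and `max_lag` are made up):

```python
import pandas as pd

y = pd.Series(range(100))  # full training series
max_lag = 3                # largest lag the regression model uses

# to build lag features at the cutoff, only the last max_lag points matter
tail = y.iloc[-max_lag:]
lag_features = tail.values[::-1]  # lags 1, 2, 3 for the next timepoint
```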
Describe the solution you'd like
Describe alternatives you've considered
Currently I modify the internal data myself via the attributes `_y` and `_X`, but this is not future-proof.