-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added arguments to azuremldataasset for better control of output path #75
base: develop
Are you sure you want to change the base?
added arguments to azuremldataasset for better control of output path #75
Conversation
…e path within azureml where the dataset is saved
Hi @Gabriel2409, In case it helps to be more concrete, seeing a basic working example of supporting |
- This lead to a failure when outputting uri_file - removing this check does not seem to have any negative impact
Giving up the possibility to set versioned to FalseI ran additional tests. Ensuring no overwrites on saveWith that in mind, I modified the code to ensure that there is absolutely no risk of overwrite (at least in standard scenarios), by including the data asset name in the path. Solving the uri_file issueI also found what caused the uri_file issue. When I inspected the logs, I had the following error: Current PR state@fdroessler Thanks for the feedback, I always welcome constructive criticism. My initial inclination was to discuss the changes incrementally since I'm still wrapping my head around how the entire plugin works. However, I understand that you'd like to see something more substantial upfront. With these new commits, I think I have come up with a solution that ensures versioning, saves to specified folders in AzureML (rather than relying on job IDs), and appears to work for both Summary of current state using your pros and cons:
|
kedro_azureml/generator.py
Outdated
|
||
# versioning system: to be discussed | ||
output_path = ( | ||
f"{output_path}/{ds._azureml_dataset}/{ds.resolve_save_version()}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem I see with this approach is that this hard codes the version in the pipeline specification. This might not be an issue for a single run using kedro azureml run
, but it means a rerun of that pipeline or a scheduled job created from the result of kedro azureml compile
will (attempt) to save to the same location.
Ideally, the versioning would be added to the path when the node is run, but that doesn't look feasible given that the directory has to be empty for Azure ML to accept it.
I might be overlooking something, but I now feel that we won't be able to add a sensible implementation of this feature unless Azure ML accepts non-empty output directories...
@tomasvanpottelbergh By adding {name} to the output path, it injects the job id when the job is run (see last commit) |
@Gabriel2409 Interesting, I didn't know about this feature! Where did you find this in the documentation? I agree that this makes the path too complex. If there is no way to let Azure ML inject the current timestamp in the path, I would stick to using the job ID for versioning purposes. This way we could probably make the |
@tomasvanpottelbergh https://learn.microsoft.com/en-us/azure/machine-learning/concept-expressions?view=azureml-api-2 gives you all the variables. You can also find a lot of example pipelines on github. For ex, here. Note: these variables are actually already used in the plugin: when I inspected the kedro execute command, I had something looking like |
Current PR StateNotes about versions
return Output(
type=ds._azureml_type,
name=ds._azureml_dataset,
path=output_path,
version=ds.resolve_save_version()
) In azureml, new version becomes timestamp instead of number. Test resultsuri_folder
uri_file
changing arguments
I also ran some tests with more complex urifolders in a nested structure and all the files are correctly saved Follow upBefore continuing with the PR, I need to know if you are ok with the current approach, or if you feel we should change the arguments, their names, or if you have any remarks. |
Hi sorry I was off on holiday for some days. I think the proposed solution looks ok. If there is a way to consolidate the azureml and local folder path that would be great but I'm struggling to see one that won't impose any structure onto other people's projects ... maybe we should rename it then to local/remote or local/azureml? |
Hi @fdroessler, the problem I see with renaming And do you have any insights on why the runs fail for uri_files but not uri_folders if click checks for the path existence? |
@Gabriel2409 I don't have a strong opinion on the naming, but like @fdroessler I think we need two separate Regarding the |
Thank you for the explanation for click. It makes sense. To be honest, it caused me a bit of a headache at the beginning before I managed to identify the issue. Regarding the two different If you are fine with these names, I can add the tests and update the documentation. |
@Gabriel2409 I'm fine with the naming, but maybe @marrrcin has another idea? |
Modify multi_catalog to include extra parameters. Update test_azureml_asset_dataset to ensure asset is constructured with correct type. Add checks on specific node outputs in test_can_generate_azure_pipeline
I added some tests. For the generated command, in test_can_generate_azure_pipeline, I added checks on specific nodes directly, which may be suboptimal as it means the test will fail if we change the dummy pipeline. |
@@ -81,6 85,8 @@ def __init__( | |||
filepath_arg: str = "filepath", | |||
azureml_type: AzureMLDataAssetType = "uri_folder", | |||
version: Optional[Version] = None, | |||
datastore: str = "${{default_datastore}}", | |||
azureml_root_dir: str = "kedro_azureml", # maybe combine with root_dir? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment can be removed here
@@ -172,12 172,22 @@ def _get_output(self, name): | |||
if name in self.catalog.list() and isinstance( | |||
ds := self.catalog._get_dataset(name), AzureMLAssetDataSet | |||
): | |||
output_path = ( | |||
f"azureml://datastores/{ds._datastore}/paths/{ds._azureml_root_dir}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to strip off trailing slashes from ds._azureml_root_dir
, or use some URI parsing lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an URI parsing lib may be overkill. Moreover datastore
can be equal to ${{default_datastore}}
and I don't know if that will pose a problem with these libraries.
However, what about ds._azureml_root_dir.strip("/")
to remove both leading and trailing slash?
Another solution would be to add checks on the parameters on data set creation and throw an error if they don't match the correct format. Actually if we go this route, this is probably something we want to add to the datastore
and root_dir
parameter as well.
@tomasvanpottelbergh Note that it is not a check on the Output url itself but I think it is enough. |
Hi @marrrcin @fdroessler @tomasvanpottelbergh, I wanted to let you know that I've completed the final checks for this PR. Please take your time to review it at your convenience. |
Thanks @Gabriel2409 , sorry for leaving this untouched for so long. The PR is really promissing and we will review it this month, due to our current workload. Thanks for your patience! |
Any update on this @marrrcin? It would be nice to have this feature merged finally. |
@marrrcin @fdroessler @tomasvanpottelbergh
Following issue #70,
I propose this PR as a base for discussion, it is not meant to be merged as is for now.
Main changes is that we specify a path in the Output of the command.
It is important to note that this was only tested with a uri_folder outputting a single file. I think it may be worth it to test it with different types of uri_folders, especially datasets that write multiple files.
Here is a summary of my experimentations:
azureml/<job_id>/<output_name>/
: this is currently the behavior in the pluginazureml://datastores/<datastore_name>/paths/<azureml_root_dir>
For this pull request, I decided to completely ignore the
root_dir
and use another argument instead (azureml_root_dir
). This allows to have a local organisation of the files different from the organisation on azureml.I also added automatic versioning on save so we don't have the overwrite problem (it does not solve everything though, see below).
With the following catalog.yml:
I tested the following pipeline
which takes a csv as an input and saves it to the blob storage as output
When I do a local run,
kedro run
, data is saved todata/00_azurelocals/projects_train_dataset/local/projects_train_dataset.csv
When I do an azure run
kedro azureml run
, data is saved on azure in my datastore herekedro_azureml/2023-09-14T14.12.29.983Z/projects_train_dataset.csv
. The Explore function of azureml works.Important note: for now, the versioning solution I propose is not perfect. If two nodes run at the same time (for ex for train and test, both files will be saved in the same folder, for ex
kedro_azureml/2023-09-14T14.12.29.983Z/projects_train_dataset.csv
andkedro_azureml/2023-09-14T14.12.29.983Z/projects_test_dataset.csv
. This leads to the impression that both files are part of the same dataset. Note that in theory, people should not put two different datasets in the same folder though... Maybe changing the default to include the dataset name would be enough.Finally, I also tried to use the dataasset as an input for both a local and azureml run. It works (for local runs, data is first downloaded to
data/00_azurelocals/projects_train_dataset/1/projects_train_dataset.csv
(here 1 is the dataset version). Note that this is not the same path as when the data asset is saved. That means that a local run using the data asset as output of a node and input of another will incorrectly use the initial data. However that this won't pose problems for runs on azureml.Please tell me what you think and what we should test next.