Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added arguments to azuremldataasset for better control of output path #75

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

Gabriel2409
Copy link
Contributor

@marrrcin @fdroessler @tomasvanpottelbergh
Following issue #70,
I propose this PR as a base for discussion, it is not meant to be merged as is for now.

Main changes is that we specify a path in the Output of the command.
It is important to note that this was only tested with a uri_folder outputting a single file. I think it may be worth it to test it with different types of uri_folders, especially datasets that write multiple files.

Here is a summary of my experimentations:

  • If you don't specify a path in the output, it will be saved in the following folder in the default datastore: azureml/<job_id>/<output_name>/: this is currently the behavior in the plugin
  • If you specify a path, it must be of the form azureml://datastores/<datastore_name>/paths/<azureml_root_dir>

For this pull request, I decided to completely ignore the root_dir and use another argument instead (azureml_root_dir). This allows to have a local organisation of the files different from the organisation on azureml.
I also added automatic versioning on save so we don't have the overwrite problem (it does not solve everything though, see below).

With the following catalog.yml:

projects_train_dataset#web:
  type: pandas.CSVDataSet
  filepath: https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/dataset.csv

projects_train_dataset#urifolder:
  type: kedro_azureml.datasets.AzureMLAssetDataSet
  versioned: True
  azureml_type: uri_folder
  azureml_dataset: projects_train_dataset
  root_dir: data/00_azurelocals/ # for local runs only
  dataset:
    type: pandas.CSVDataSet
    filepath: "projects_train_dataset.csv"

I tested the following pipeline

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=lambda x: x,
                inputs="projects_train_dataset#web",
                outputs="projects_train_dataset#urifolder",
                name="create_train_dataasset",
            )
        ]
    )

which takes a csv as an input and saves it to the blob storage as output

When I do a local run, kedro run, data is saved to data/00_azurelocals/projects_train_dataset/local/projects_train_dataset.csv

When I do an azure run kedro azureml run, data is saved on azure in my datastore here kedro_azureml/2023-09-14T14.12.29.983Z/projects_train_dataset.csv. The Explore function of azureml works.

Important note: for now, the versioning solution I propose is not perfect. If two nodes run at the same time (for ex for train and test, both files will be saved in the same folder, for ex kedro_azureml/2023-09-14T14.12.29.983Z/projects_train_dataset.csv and kedro_azureml/2023-09-14T14.12.29.983Z/projects_test_dataset.csv. This leads to the impression that both files are part of the same dataset. Note that in theory, people should not put two different datasets in the same folder though... Maybe changing the default to include the dataset name would be enough.

Finally, I also tried to use the dataasset as an input for both a local and azureml run. It works (for local runs, data is first downloaded to
data/00_azurelocals/projects_train_dataset/1/projects_train_dataset.csv (here 1 is the dataset version). Note that this is not the same path as when the data asset is saved. That means that a local run using the data asset as output of a node and input of another will incorrectly use the initial data. However that this won't pose problems for runs on azureml.

Please tell me what you think and what we should test next.

…e path within azureml where the dataset is saved
@fdroessler
Copy link
Contributor

Hi @Gabriel2409,
thanks for making the initial step so quickly. I think from my perspective (and I invite anyone to disagree on this) when I talked about an initial PR to discuss, I meant something that implements the pro's that we talked about in the issue #70 as reasonable as possible while trying to mitigate the cons (or if not possible an initial implementation including some cons that we can then refine together). The PR as it is now doesn't offer enough basis for a discussion in my opinion. I hope this is not perceived as harsh I simply don't know what to discuss here that has not been brought up in the issue before.

In case it helps to be more concrete, seeing a basic working example of supporting uri_file while maintaining functionality with the uri_folder as well as an initial implementation of versioning considering best practices (as you say not necessarily saving to the same location) would be fantastic. E.g. when I added together with @tomasvanpottelbergh some things around v2 dataset support remote/local we created an end to end pipeline that included files at root level, files at nested folder level as well as partitioned parquet datasets (reading a folder as a whole), as these were the use-cases we faced on a daily basis and wanted to solve. I hope that gives some pointers and do let me know if you need any more input.

- This lead to a failure when outputting uri_file
- removing this check does not seem to have any negative impact
@Gabriel2409
Copy link
Contributor Author

Giving up the possibility to set versioned to False

I ran additional tests.
Basically Output in azure always create a new version of the data asset. The only way we could update the data asset without changing its version would be to write directly to the storage where the data asset points to, which goes against azure philosophy of immutable data assets. As such, I propose to go with the initial proposal, which was to hardcode versioned as True.

Ensuring no overwrites on save

With that in mind, I modified the code to ensure that there is absolutely no risk of overwrite (at least in standard scenarios), by including the data asset name in the path.

Solving the uri_file issue

I also found what caused the uri_file issue. When I inspected the logs, I had the following error: Error: Invalid value for '--az-output'. So the pipeline failed because of a check done by click.
This is solved by removing the click check on the existence of the path. I don't understand why this did not seem to pose problems for uri_folder though and any insight on this is welcome.

Current PR state

@fdroessler Thanks for the feedback, I always welcome constructive criticism. My initial inclination was to discuss the changes incrementally since I'm still wrapping my head around how the entire plugin works. However, I understand that you'd like to see something more substantial upfront.

With these new commits, I think I have come up with a solution that ensures versioning, saves to specified folders in AzureML (rather than relying on job IDs), and appears to work for both uri_files and uri_folders (based on my initial testing). If you have a more complex uri_folder configuration in your catalog.yml, I'd appreciate it if you could share it. This would help me test and refine the code if necessary.

Summary of current state using your pros and cons:

  • Support of uri_file: Yes
  • Support versioned = False option: No, I propose not to implement it
  • Allowing users to decide on nicer co-location of files that belong together in the underlying storage account: Yes, with some limitations (such as having to add the dataset name and the version to the path)
  • Potentially breaking lineage for local runs when allowing fsspec writing: did not cause issues for me as the changes to Output do not affect local runs. Moreover, the download of the dataset locally works whatever the underlying data asset path is
  • Increasing configuration complexity of datasets: 2 new arguments which are only useful for saving data assets. This could be a problem.
  • fsspec is potentially dependent on kedro/kedro-dataset changes (but maybe not a blocker see above): I don't know about this one but I think that if it breaks, it will break whether we specify the output path or not in Output
  • Changing traceability on the UUID in the path: uuid not in the path anymore but data asset shows which job created it in azure ml so you can always go back to the job and its parent pipeline.


# versioning system: to be discussed
output_path = (
f"{output_path}/{ds._azureml_dataset}/{ds.resolve_save_version()}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem I see with this approach is that this hard codes the version in the pipeline specification. This might not be an issue for a single run using kedro azureml run, but it means a rerun of that pipeline or a scheduled job created from the result of kedro azureml compile will (attempt) to save to the same location.

Ideally, the versioning would be added to the path when the node is run, but that doesn't look feasible given that the directory has to be empty for Azure ML to accept it.

I might be overlooking something, but I now feel that we won't be able to add a sensible implementation of this feature unless Azure ML accepts non-empty output directories...

@Gabriel2409
Copy link
Contributor Author

@tomasvanpottelbergh By adding {name} to the output path, it injects the job id when the job is run (see last commit)
I think this solves the issue you mentioned: Each new data asset is created in an empty folder.
However, the full path is starting to become pretty complex.
And it seems unit tests are failing with this change.

image
image

@tomasvanpottelbergh
Copy link
Contributor

@Gabriel2409 Interesting, I didn't know about this feature! Where did you find this in the documentation?

I agree that this makes the path too complex. If there is no way to let Azure ML inject the current timestamp in the path, I would stick to using the job ID for versioning purposes. This way we could probably make the uri_file dataset have the same default behaviour as the uri_folder when the user doesn't specify a path?

@Gabriel2409
Copy link
Contributor Author

@tomasvanpottelbergh https://learn.microsoft.com/en-us/azure/machine-learning/concept-expressions?view=azureml-api-2 gives you all the variables.

You can also find a lot of example pipelines on github. For ex, here.
https://github.com/Azure/azureml-examples/blob/ca613835cb7d2dc10dde501a243220c54f9f7fe0/sdk/python/jobs/parallel/2a_iris_batch_prediction/iris_batch_prediction.ipynb#L227

Note: these variables are actually already used in the plugin: when I inspected the kedro execute command, I had something looking like kedro azureml -e local execute --pipeline=__default__ --node=create_train_dataasset --az-output=projects_train_raw ${{outputs.projects_train_raw}}

@Gabriel2409
Copy link
Contributor Author

Current PR State

Notes about versions

  • It is possible to add the version to Output:
 return Output(
                type=ds._azureml_type,
                name=ds._azureml_dataset,
                path=output_path,
                version=ds.resolve_save_version()
            )

In azureml, new version becomes timestamp instead of number.
I am personally against it as azureml gives you the creation date of a new version.
Note that kedro does not allow to specify save-version with cli, only --load-versions.

Test results

uri_folder

  • basic_uri_folder as output:

    • local runs: saved in root_dir/basic_uri_folder/local. If I rerun, overwrites.
      I think it is ok as local saves of data assets should only be done for testing anyways.

    • azureml runs: correctly saves to azureml. Automatically creates a new version. No overwrites.

  • basic_uri_folder as input:

    • local runs: correctly downloads the latest version in root_dir/basic_uri_folder/<dataset_version> before launching the run

    • local runs with specified load version: kedro run -p create_azure_dataasset_from_local_files --load-versions basic_uri_folder:2: correctly downloads the correct version

    • azureml run: correctly loads the data asset

    • azureml run with load version: correctly loads the correct version

uri_file

  • basic_uri_file as output:

    • local runs: correctly saved in root_dir/basic_uri_file/local. If I rerun, overwrites.

    • azureml runs: correctly saves to azureml. Automatically creates a new version. No overwrites.

  • basic_uri_file as input:

    • local runs: correctly downloads the file in root_dir/basic_uri_file/<dataset_version> before launching the run

    • kedro run -p create_azure_dataasset_from_local_files --load-versions basic_uri_file:1: correctly downloads the correct version

    • azureml run: correctly loads the data asset

    • azureml run with load version: correctly loads the correct version

changing arguments

  • changing datastore:

    • datastore does not exist: failure
    • datastore exists: works correctly
  • change azureml_root_dir

    • nested structure: works correctly

I also ran some tests with more complex urifolders in a nested structure and all the files are correctly saved

Follow up

Before continuing with the PR, I need to know if you are ok with the current approach, or if you feel we should change the arguments, their names, or if you have any remarks.
Thank you very much.

@fdroessler
Copy link
Contributor

Hi sorry I was off on holiday for some days. I think the proposed solution looks ok. If there is a way to consolidate the azureml and local folder path that would be great but I'm struggling to see one that won't impose any structure onto other people's projects ... maybe we should rename it then to local/remote or local/azureml?

@Gabriel2409
Copy link
Contributor Author

Hi @fdroessler, the problem I see with renaming root_dir is that we would need to rename it for AzureMLPipelineDataSet as well for consistency.
But maybe azureml_root_dir is a bad argument name. I could rename it to azureml_parent_dir, azureml_datastore_subfolder, or another argument if you think this is better to avoid any confusion.

And do you have any insights on why the runs fail for uri_files but not uri_folders if click checks for the path existence?
This is the reason I had to change type=(str, click.Path(exists=True, file_okay=True, dir_okay=True)), to type=(str, click.Path(exists=False)), but I am not really happy with this change to be honest. Note that this error happens whether or not we specify the path in Output
My guess is that this comes from how the path property is defined.

@tomasvanpottelbergh
Copy link
Contributor

@Gabriel2409 I don't have a strong opinion on the naming, but like @fdroessler I think we need two separate root_dir parameters.

Regarding the click.Path check: I think there is no way around exists=False for uri_file datasets, since Azure ML passes the path of the file in that case, which doesn't exist yet before you save your dataset, contrary to the folder which Azure ML creates and passes for uri_folder datasets.

@Gabriel2409
Copy link
Contributor Author

Hi @tomasvanpottelbergh,

Thank you for the explanation for click. It makes sense. To be honest, it caused me a bit of a headache at the beginning before I managed to identify the issue.

Regarding the two different root_dir, I am not sure if you want me to modify something or not.
In the current PR, we have root_dir and azureml_root_dir so we currently have two different names.
The reason I did not rename root_dir to local_root_dir was to stay consistent with AzureMLPipelineDataSet.

If you are fine with these names, I can add the tests and update the documentation.

@tomasvanpottelbergh
Copy link
Contributor

@Gabriel2409 I'm fine with the naming, but maybe @marrrcin has another idea?

Modify multi_catalog to include extra parameters.
Update test_azureml_asset_dataset to ensure asset is constructured with
correct type. Add checks on specific node outputs in test_can_generate_azure_pipeline
@Gabriel2409
Copy link
Contributor Author

I added some tests.
As I only really changed the generated command, I think there is no need to add some tests on the dataset.
However, I did include some modifications to ensure that the multi_catalog contains one uri_file and one uri_folder and that the AzureMLDataAsset type is the same as the one passed in the request in test_azureml_asset_dataset.

For the generated command, in test_can_generate_azure_pipeline, I added checks on specific nodes directly, which may be suboptimal as it means the test will fail if we change the dummy pipeline.

@@ -81,6 85,8 @@ def __init__(
filepath_arg: str = "filepath",
azureml_type: AzureMLDataAssetType = "uri_folder",
version: Optional[Version] = None,
datastore: str = "${{default_datastore}}",
azureml_root_dir: str = "kedro_azureml", # maybe combine with root_dir?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment can be removed here

@@ -172,12 172,22 @@ def _get_output(self, name):
if name in self.catalog.list() and isinstance(
ds := self.catalog._get_dataset(name), AzureMLAssetDataSet
):
output_path = (
f"azureml://datastores/{ds._datastore}/paths/{ds._azureml_root_dir}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to strip off trailing slashes from ds._azureml_root_dir, or use some URI parsing lib

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an URI parsing lib may be overkill. Moreover datastore can be equal to ${{default_datastore}} and I don't know if that will pose a problem with these libraries.

However, what about ds._azureml_root_dir.strip("/") to remove both leading and trailing slash?

Another solution would be to add checks on the parameters on data set creation and throw an error if they don't match the correct format. Actually if we go this route, this is probably something we want to add to the datastore and root_dir parameter as well.

@Gabriel2409
Copy link
Contributor Author

@tomasvanpottelbergh
following your comment on azureml_root_dir, I added validation on dataset creation to ensure datastore and azureml_root_dir are correctly formatted.

Note that it is not a check on the Output url itself but I think it is enough.

@Gabriel2409
Copy link
Contributor Author

Hi @marrrcin @fdroessler @tomasvanpottelbergh,

I wanted to let you know that I've completed the final checks for this PR.
Thank you very much for your insights and your patience throughout the whole process.

Please take your time to review it at your convenience.

@marrrcin
Copy link
Contributor

marrrcin commented Dec 5, 2023

Thanks @Gabriel2409 , sorry for leaving this untouched for so long. The PR is really promissing and we will review it this month, due to our current workload. Thanks for your patience!

@tomasvanpottelbergh
Copy link
Contributor

Any update on this @marrrcin? It would be nice to have this feature merged finally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants