Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable factory references to create new dimensions on load. #6168

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

pp-mo
Copy link
Member

@pp-mo pp-mo commented Oct 9, 2024

Relates to #5369 , #6165

@pp-mo
Copy link
Member Author

pp-mo commented Oct 11, 2024

Now updated with prototype loading-control object
NOTE To-dos at least :

  • no proper tests yet
  • work needed to implement the loading policy in load_cubes/load_cube
  • fix crashing plot tests (now skipped)

Copy link
Contributor

@stephenworsley stephenworsley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a couple comments.

return policy


def _apply_loading_policy(cubes, policy=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be a case for making something like this public, perhaps as a method of LoadPolicy. There could concievably be a situation where a user would want an equivalent combination, but it's not appropriate to do on load. Perhaps in cases where there needs to be some processing of cubes in order to make them mergeable which is not possible to do as a callback.

Copy link
Member Author

@pp-mo pp-mo Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree this could be a useful general tool in its own right.
In that case it would be useful to expose the "POLICY" settings as simple arguments here
(especially if they get simpler, as described in the "new settings strategy" below).
Going forward, I'll try make that workable.

self.found_multiple_refs = False


# A single global object (per thread) to record whether multiple reference fields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an instance per thread, is there any risk that behaviour is dependant on how multiprocessing is done? Or is the relevant loading done entirely within a single thread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the loading code is all single-threaded in operation. It has to be since, while it has the input file open, it might not be safe to run parallel loading tasks since the file system interface may not be thread-safe (e.g. as for netcdf4-python) .

# the latest load operation.
# This is used purely to implement the iris.LOAD_POLICY.multiref_triggers_concatenate
# functionality.
_MULTIREF_DETECTION = MultipleReferenceFieldDetector()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this would be instantiated at the module level rather than during a load call. Would this not change your loading behaviour depending on what files you have already loaded during your session?

Copy link
Member Author

@pp-mo pp-mo Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's explained here, that all the iris.load_xxx load functions call via _load_collection, which resets this. It is admittedly a weak contract (!).
We need it to behave like the settings controls. i.e. to exist as a common (but per-thread) global reference, so it can affect low-level behaviour without the need to pass the control down through all functions in the call chain.

Comment on lines 327 to 358
def __init__(
self,
support_multiple_references: bool = False,
multiref_triggers_concatenate: bool = False,
use_concatenate: bool = False,
use_merge: bool = True,
cat_before_merge: bool = False,
repeat_until_done: bool = False,
):
"""Container for loading controls."""
self.support_multiple_references = support_multiple_references
self.multiref_triggers_concatenate = multiref_triggers_concatenate
self.use_concatenate = use_concatenate
self.use_merge = use_merge
self.cat_before_merge = cat_before_merge
self.repeat_until_done = repeat_until_done

def __repr__(self):
msg = (
"LoadPolicy("
f"support_multiple_references={self.support_multiple_references}, "
f"multiref_triggers_concatenate={self.multiref_triggers_concatenate}, "
f"use_concatenate={self.use_concatenate}, "
f"use_merge={self.use_merge}, "
f"cat_before_merge={self.cat_before_merge}, "
f"repeat_until_done={self.repeat_until_done}"
")"
)
return msg

def copy(self):
return LoadPolicy(**{key: getattr(self, key) for key in self._allkeys})
Copy link
Contributor

@trexfeathers trexfeathers Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think data classes give you some/all of this, and you can still add your own methods in addition.

Comment on lines 319 to 324
"support_multiple_references",
"multiref_triggers_concatenate",
"use_concatenate",
"use_merge",
"cat_before_merge",
"repeat_until_done",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of giving users more controls to play with, but having 6 booleans is very confusing, so I'm keen to look for ways it could be simpler to understand.

  1. It doesn't look like support_multiple_references is necessary
  2. use_concatenate, use_merge and cat_before_merge could become some sort of controllable sequence e.g. {"c", "m"} / {"m"} / {"m", "c"}.

Copy link
Member Author

@pp-mo pp-mo Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, as discussed offline, I think we can simplify this with benefits.

See new strategy proposals, below.

Comment on lines 388 to 394
LOAD_POLICY_LEGACY = LoadPolicy()
LOAD_POLICY_RECOMMENDED = LoadPolicy(
support_multiple_references=True, multiref_triggers_concatenate=True
)
LOAD_POLICY_COMPREHENSIVE = LoadPolicy(
support_multiple_references=True, use_concatenate=True, repeat_until_done=True
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be simpler to follow if these 'presets' were all part of another object. Could they be class attributes e.g. LoadPolicy.recommended? It would be good if the only global constant were LOAD_POLICY.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't like this much.
Under review, watch this space!

Comment on lines 362 to 365
"""Return context manager for temporary options.

Modifies the given parameters within a context, for the active thread.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with more detail and examples - I'm struggling to picture use cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is not intended to be complete yet -- still just draft ideas

Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had my rough look around.

I think the user experience is almost there. Seems like a good compromise that still holds the user's hand as much as possible (they don't write the code themselves), while still offering more customisation than currently.

The developer experience feels very complicated. I think this is worth some attention, since the developer experience around loading is already too complicated and I have noticed can be a big barrier for less experienced people. Afraid I don't have any immediate suggestions, only that it's worth some time.

Thanks!

@pp-mo
Copy link
Member Author

pp-mo commented Oct 14, 2024

@trexfeathers I think the user experience is almost there. Seems like a good compromise that still holds the user's hand as much as possible (they don't write the code themselves), while still offering more customisation than currently.

I'm not providing additional docstrings yet, but it's clearly needed (if not a userguide section?)
Likewise what you already said : the new functionality badly needs usage examples
-- on the TODO list

@pp-mo
Copy link
Member Author

pp-mo commented Oct 14, 2024

@trexfeathers The developer experience feels very complicated. I think this is worth some attention, since the developer experience around loading is already too complicated and I have noticed can be a big barrier for less experienced people. Afraid I don't have any immediate suggestions, only that it's worth some time.

You're totally right, and I'm currently planning to remove all the pieces here to a single sub-module "iris.io.loading", which I think will help a bit.
I have a draft of that, and it gathers the _CubeFilterCollection and _CubeFilter classes, supporting routines "_generate_cubes" and "_load_collection", all the new "policy" code and all the load_xxx functions.
Along with that, I am considering removing the _CubeFilterCollection and _CubeFilter classes altogether as I think they may obscure rather than help.

However, I want to add some proper testcases before I attempt a lift-and-shift, as otherwise some breakages might go unnoticed.

@pp-mo
Copy link
Member Author

pp-mo commented Oct 14, 2024

Current best ideas for a "new control strategy"

Following offline discussion wuith @trexfeathers

Just 3 control keywords:

  • support_multiple_references = True/False
    • (as before, but now implies the old multiref_triggers_concatenate=True behaviour)
  • merge_concatenate_sequence = "m" / "c" / "mc" / "cm"
  • repeat_until_no_changes = True/False

Plus...

  • combine settings in a simple dictionary or namedtuple
  • use simple class constants for common choices
  • make support_multiple_references True by default, so it ..
    • includes the triggering of a final concatenate
    • delivers required solution for the key new cases (i.e. time-dependent hybrid height)
    • delivers existing result for 99% of existing cases
    • for 1% of broken legacy loads, can be easily forced to "old behaviour"

@pp-mo
Copy link
Member Author

pp-mo commented Oct 14, 2024

Progress Update :

For now I'm deferring the user-API improvements detailed above.
The plan now is roughly :

  • fix the existing functionality
    • note : latest commit includes changes to the 'unique' keyword in merges, which ought to fix some of the outstanding test failures
  • add explicit tests for various loading testcases, including multiple-references ones
  • fix the outstanding test failures resolve/reinstate skipped tests (i.e. the plotting crashes)
  • move all the disparate moving parts into a "iris.io.loading" submodule
  • simplify the user API
  • ? fix default loading policy to new-style ? (and fix any tests it breaks)
  • add new user docstrings whatsnew describing changes

Copy link

codecov bot commented Oct 14, 2024

Codecov Report

Attention: Patch coverage is 61.94690% with 43 lines in your changes missing coverage. Please review.

Project coverage is 89.64%. Comparing base (54f430b) to head (027b7a0).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
lib/iris/__init__.py 53.33% 24 Missing and 11 partials ⚠️
lib/iris/fileformats/rules.py 74.19% 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6168       /-   ##
==========================================
- Coverage   89.81%   89.64%   -0.18%     
==========================================
  Files          88       88              
  Lines       23178    23277       99     
  Branches     4313     4333       20     
==========================================
  Hits        20818    20867       49     
- Misses       1628     1663       35     
- Partials      732      747       15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pp-mo pp-mo force-pushed the load_factory_dims branch 2 times, most recently from f7ddb65 to 0f1bae3 Compare October 15, 2024 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants