Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic ItemCollection implementation #430

Merged
merged 14 commits into from
Jun 14, 2021

Conversation

duckontheweb
Copy link
Contributor

@duckontheweb duckontheweb commented Jun 10, 2021

Related Issue(s): #

Description:

Adds ItemCollection class to work with GeoJSON Feature Collections containing only STAC Items. This class can be serialized/deserialized from JSON, and also implements Collection to make it more convenient to loop over items, get an item by index, and check if an Item is included. Note that until we implement more specific STAC object equality (as per #380), assert item in item_collection will always fail unless clone_items=False is used when instantiating the ItemCollection.

This PR also updates the I/O functions in pystac.__init__ to work with the new ItemCollection class in response to #371. Since ItemCollections are not STACObjects, this means that the read functions in that file (read_file, read_dict) are now not guaranteed to return a STACObject. This means users will need to either handle both cases in their code, or cast the result of those functions to the appropriate type. I'm interested in feedback on whether this is worth the change, or if we should keep these functions unchanged and make people use ItemCollection.from_file in order to read ItemCollections.

PR Checklist:

  • Code is formatted (run scripts/format)
  • Tests pass (run scripts/test)
  • This PR maintains or improves overall codebase code coverage.
  • Changes are added to the CHANGELOG. See the docs for information about adding to the changelog.

cc: @scottyhq

UPDATE 2021-06-10:

Implemented __contains__ to make ItemCollection a typed.Collection since we now have the option not to clone Items on instantiation and we are no longer manipulating the Item links.

@duckontheweb duckontheweb linked an issue Jun 10, 2021 that may be closed by this pull request
@duckontheweb duckontheweb added this to the 1.0.0-rc.1 milestone Jun 10, 2021
@matthewhanson
Copy link
Member

I think I lean toward STACObject.read_file only ever returning a STACObject. Seems wrong that a class factory function would return something other than that class.

I think it's ok to require users having to actively open an ItemCollection with ItemCollection.from_file. As long as they know that's how you open an ItemCollection.

I know it doesn't meet the request from @scottyhq , but I think that's just because before it was ambiguous on if an ItemCollection was a STACObject, because there was the ItemCollection class in stac-api-spec, and it had stac_version, etc.

Now that it's clear an ItemCollection is just a GeoJSON FeatureCollection with features that are STAC Items, it's reasonable that it you would open it differently.

@matthewhanson
Copy link
Member

Wait, I think I'm wrong here....I don't think a STACObject function should return a non STACObject.

But the pystac.read_file is just a convenience function, and now that I look at the code I think this is exactly the right way to do it. STACObject still only returns a STACObject, but the pystac.read_file function can return any class defined within the PySTAC library

@matthewhanson
Copy link
Member

matthewhanson commented Jun 10, 2021

Finished going through this, looks really good overall, with some comments above.

I can see eventually having some high level convenience functions in ItemCollection (such as calculating summaries over an ItemCollection), but I think this great for an initial implementation.

@codecov-commenter
Copy link

codecov-commenter commented Jun 10, 2021

Codecov Report

Merging #430 (2bb16a1) into main (8946c9b) will increase coverage by 0.11%.
The diff coverage is 94.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #430       /-   ##
==========================================
  Coverage   89.56%   89.67%    0.11%     
==========================================
  Files          39       40        1     
  Lines        5115     5172       57     
==========================================
  Hits         4581     4638       57     
  Misses        534      534              
Impacted Files Coverage Δ
pystac/item_collection.py 94.64% <94.64%> (ø)
pystac/__init__.py 100.00% <100.00%> ( 2.77%) ⬆️
pystac/serialization/__init__.py 91.30% <100.00%> ( 4.34%) ⬆️
pystac/stac_io.py 72.46% <0.00%> (ø)
pystac/serialization/migrate.py 85.00% <0.00%> ( 1.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8946c9b...2bb16a1. Read the comment docs.

@duckontheweb
Copy link
Contributor Author

@lossyrob This should be ready for your review, too, if you want to take a look.

Copy link
Member

@lossyrob lossyrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm besides the top level read and write methods, which I commented on. I'm -0.5 for the change to those methods, so if there's someone feels strongly it should be that way I'm fine with things getting merged in as is.

@@ -71,7 72,7 @@
)


def read_file(href: str) -> STACObject:
def read_file(href: str) -> Union[STACObject, ItemCollection]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it makes sense to change these top level package read/write methods to include ItemCollection. It breaks user code that may rely on the STACObject type, forcing the type differentiation to happen by the caller. Also the last two parameters of write_file make it feel a bit shoe-horned. On the other hand, I can see people getting confused about why read_file wouldn't work on an ItemCollection if they didn't know it wasn't a core stac object.

I think my preference would be to keep these methods working with core STAC types, and force users to treat ItemCollections as a separate concept, which makes more sense with the spec as it currently stands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with all of the points you make here, and I think my preference would be to not change the top-level functions as well. If that ends up being the decision, I will add more clear documentation to the ItemCollection class indicating that it is not a STACObject and cannot be read using those top-level methods.

@scottyhq I'm curious how much of a priority it is for you to be able to read Item Collections using pystac.read_file vs. ItemCollection.from_file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this! Yes, I'm coming at this primarily from someone new to STAC, who is unfamiliar with spec details. The key expectation I have is that after searching a STAC API for data, and saving 'results.json' to work with later, there is an straightforward way to open that file and navigate it. For Python it seems like pystac_client takes care of the searching and pystac should care of the I/O and navigation of the results. I don't think people should have to understand the concepts of ItemCollections versus Collections, Core vs Not, for this fundamental workflow.

So if pystac.read_file can't handle ItemCollections, and a separate ItemCollection.from_file() is the way forward (or pystac_client.read_file()`?), I think that just needs to be clearly documented.

Also useful (and I think a non-breaking change) would then be for pystac.read_file to have error handing that can recognize the json is an ItemCollection and suggest the correct method to open it, rather than the current KeyError: 'id' ?

Copy link
Contributor Author

@duckontheweb duckontheweb Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scottyhq, that all makes sense to me. I'll change those top-level functions back to their original signatures and make sure we have clear docs on how to work with ItemCollections. I'm pretty sure the issue with a KeyError being raised in pystac.read_file is fixed by #402, but I'll add a test to be sure.

@matthewhanson Looking back at the code example in the original issue it seems like the name of the ItemSearch.items_as_collection method might also be misleading. Maybe we should rename that to items_as_item_collection so that users don't think they are saving a STAC Collection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed ItemCollection handling from top-level functions and added test that pystac.read_file raises a STACTypeError instead of the KeyError in 404bb99

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duckontheweb ItemSearch.items_as_collection was misleading. I've renamed it already, there's now get_pages (get the raw JSON of the pages), get_item_collection (gets pages as item collections), get_items (iterator through all pages and items), and get_all_items (gets all items from all pages and returns as single ItemCollection). This matches the get_ syntax used in PySTAC

@duckontheweb duckontheweb force-pushed the add/371-item-collection branch from 404bb99 to 2bb16a1 Compare June 14, 2021 01:31
@duckontheweb
Copy link
Contributor Author

@lossyrob It also just occurred to me that including ITEMCOLLECTION in the STACObjectTypes enum might be misleading as well. It looks like removing that would impact the migration and validation scripts (since it's part of the schema URI mapping), but I think we could work around that. What do you think about removing that from the Enum?

@lossyrob
Copy link
Member

@duckontheweb I agree that's not great, and should have been removed when ItemCollection was dropped. I'm 1 for dropping it from that Enum and using another separate constant in the migration/validation places where needed.

@duckontheweb
Copy link
Contributor Author

@duckontheweb I agree that's not great, and should have been removed when ItemCollection was dropped. I'm 1 for dropping it from that Enum and using another separate constant in the migration/validation places where needed.

I removed this in 47017df. I ended up just removing any ItemCollection references in the DefaultSchemaUriMap and in the migration code rather than try to patch it. My thinking is that we're not claiming to support ItemCollections as STACObjects, and since they won't appear while crawling a catalog this shouldn't negatively affect anything outside of users who are working with ItemCollections directly.

@duckontheweb
Copy link
Contributor Author

@lossyrob @matthewhanson Ready for another review whenever you have time. Thanks!

@matthewhanson
Copy link
Member

@duckontheweb Looks good to me to merge!

@duckontheweb duckontheweb merged commit 46fbaa7 into stac-utils:main Jun 14, 2021
@duckontheweb duckontheweb deleted the add/371-item-collection branch June 14, 2021 17:03
gjoseph92 pushed a commit to gjoseph92/stackstac that referenced this pull request Jul 7, 2021
stac-utils/pystac#430 implemented `pystac.ItemCollection`, which is being removed from `pystac-client` in the next release. This PR adds similar handling for `pystac.ItemCollection`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compatibility with pystac-client
5 participants