Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure version agnostic robust pickling of pandas, polars and numpy data structures passed as params to Python scripts or notebooks #3175

Merged
merged 5 commits into from
Nov 1, 2024

Conversation

johanneskoester
Copy link
Contributor

@johanneskoester johanneskoester commented Oct 30, 2024

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced parameter handling in the Snakemake workflow for improved type management and storage safety.
    • Introduced a new Snakemake rule for outputting data in multiple formats (Pandas, NumPy, Polars).
    • Added functionality for validating and exporting parameters from the Snakemake workflow.
    • New test case added for validating parameter pickling.
  • Chores

    • Added polars as a new dependency in the test environment setup.

…mpy data structures passed as params to Python scripts or notebooks
Copy link
Contributor

coderabbitai bot commented Oct 30, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The pull request introduces significant changes to the Snakemake class in the snakemake/script/__init__.py file, particularly in how parameters are managed. A new property decorator for the params attribute has been implemented, which utilizes the _safely_store_params method for type conversion and storage of parameters. Additionally, a new Snakemake rule is added in tests/test_params_pickling/Snakefile, and a corresponding validation script is introduced in tests/test_params_pickling/test.py. The test-environment.yml file has been updated to include the polars dependency, and a new test function is added to tests/tests.py.

Changes

File Change Summary
snakemake/script/__init__.py - Added property @property def params(self).
- Added method def _safely_store_params(self, params: io_.Params).
test-environment.yml - Added dependency polars.
tests/test_params_pickling/Snakefile - Added new rule rule a for processing outputs in different formats (Pandas, NumPy, Polars).
tests/test_params_pickling/test.py - Introduced functionality for validating and exporting parameters, with type assertions for outputs.
tests/tests.py - Added test function def test_params_pickling().

Possibly related PRs


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot"s finite context window. It"s strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/Snakefile (1)

1-7: Consider standardizing data types in test data.

The numpy array mixes integer and float64 types which might mask potential issues. Consider either using consistent types or explicitly testing more type combinations.

-testnp = np.array([1, 2, np.float64(3)])
+# Option 1: Consistent types
+testnp = np.array([1.0, 2.0, 3.0], dtype=np.float64)
+# Option 2: More explicit type testing
+testnp = np.array([np.int32(1), np.float32(2), np.float64(3)])
snakemake/script/__init__.py (1)

176-183: Use elif statements to avoid redundant type checks in _safely_store_params

In the _safely_store_params method, the sequence of if statements under the if pl: block can lead to unnecessary type checks even after a match is found. Changing the second if to elif ensures that once a condition is met, the subsequent checks are skipped, enhancing efficiency.

Apply this diff to adjust the conditional statements:

             if pl:
                 if isinstance(value, pl.LazyFrame):
                     self._params_store[i] = value.collect().to_dict(as_series=False)
                     self._params_types[i] = "pl.LazyFrame"
-                if isinstance(value, pl.DataFrame):
+                elif isinstance(value, pl.DataFrame):
                     self._params_store[i] = value.to_dict(as_series=False)
                     self._params_types[i] = "pl.DataFrame"
                 elif isinstance(value, pl.Series):
                     self._params_store[i] = {
                         "name": value.name,
                         "values": value.to_list(),
                     }
                     self._params_types[i] = "pl.Series"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f95aa18 and 82a8fa9.

⛔ Files ignored due to path filters (3)
  • tests/test_params_pickling/expected-results/testnp.tsv is excluded by !**/*.tsv
  • tests/test_params_pickling/expected-results/testpd.tsv is excluded by !**/*.tsv
  • tests/test_params_pickling/expected-results/testpl.tsv is excluded by !**/*.tsv
📒 Files selected for processing (5)
  • snakemake/script/init.py (2 hunks)
  • test-environment.yml (1 hunks)
  • tests/test_params_pickling/Snakefile (1 hunks)
  • tests/test_params_pickling/test.py (1 hunks)
  • tests/tests.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/script/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

tests/test_params_pickling/test.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

tests/tests.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
tests/test_params_pickling/test.py

5-5: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


5-5: Undefined name snakemake

(F821)


6-6: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


6-6: Undefined name snakemake

(F821)


7-7: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


7-7: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)

🔇 Additional comments (6)
tests/test_params_pickling/test.py (3)

1-3: LGTM: Clean and appropriate imports

The imports are well-organized and include all necessary libraries for the test.


9-11: LGTM: Consistent and appropriate data export operations

The export operations are well-implemented with:

  • Appropriate export methods for each data type
  • Consistent use of tab separators
  • Correct output path handling through snakemake.output
🧰 Tools
🪛 Ruff

9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


1-11: Verify test coverage for all parameter pickling scenarios

While the test covers the basic cases for pandas, polars, and numpy data structures, let"s verify if we need additional test cases for:

  1. Different versions of these libraries
  2. Complex data structures within these formats
  3. Error cases (malformed data, unsupported types)
✅ Verification successful

Test coverage is appropriate for parameter pickling functionality

The test implementation is well-designed and sufficient for its purpose:

  • It covers all three major data structure types (pandas DataFrame, polars DataFrame, numpy ndarray)
  • The test data includes mixed types (integers and floating-point) which validates type preservation
  • The Snakefile demonstrates correct parameter passing and the test.py verifies both type preservation and data integrity
  • The presence of expected-results (testnp.tsv, testpd.tsv, testpl.tsv) ensures output validation
  • The simple test data is appropriate as complex structures would be serialized the same way

Additional test cases for different versions or error scenarios would be handled by the CI pipeline running against different dependency versions, and error cases are covered by Python"s built-in type checking and serialization mechanisms.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for other test files that might contain related parameter pickling tests
fd -e py -e yaml -e json . tests/ --exec grep -l "params.*test[np]p|test[np]l|DataFrame|ndarray" {}

Length of output: 103


Script:

#!/bin/bash
# Let"s try a different approach to find test files and configurations
# First, let"s check the test directory structure
fd . tests/test_params_pickling/

# Then look for test configuration files
fd -e yaml -e json . tests/test_params_pickling/

# Finally, let"s examine the Snakefile or workflow file
fd Snakefile . tests/test_params_pickling/

Length of output: 829


Script:

#!/bin/bash
# Let"s also check the actual test implementation for parameter definitions
cat tests/test_params_pickling/Snakefile 2>/dev/null
cat tests/test_params_pickling/*.smk 2>/dev/null

Length of output: 471


Script:

#!/bin/bash
# Search for any other similar test patterns in the codebase
rg -g "!{node_modules,dist,build}" -l "DataFrame|ndarray" tests/

Length of output: 66

🧰 Tools
🪛 Ruff

5-5: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


5-5: Undefined name snakemake

(F821)


6-6: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


6-6: Undefined name snakemake

(F821)


7-7: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


7-7: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)

tests/test_params_pickling/Snakefile (1)

10-20: LGTM! Rule structure follows best practices.

The rule is well-structured with:

  • Clear output file naming
  • Consistent parameter naming matching the input variables
  • Proper script reference

Let"s verify the test.py script exists and handles these parameters:

✅ Verification successful

✓ Parameters are properly handled in test.py

The test script correctly:

  • Validates parameter types (numpy array, polars DataFrame, pandas DataFrame)
  • Writes each parameter to its corresponding output file with tab separation
  • Uses consistent naming between rule params and script variables
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the existence and content of the test script
if [ -f "tests/test_params_pickling/test.py" ]; then
    echo "test.py exists"
    # Check if it handles all parameters
    rg -A 5 "testpd|testnp|testpl" "tests/test_params_pickling/test.py"
else
    echo "test.py not found"
fi

Length of output: 518

test-environment.yml (1)

23-23: Consider adding a version constraint for polars.

In test environments, it"s important to pin dependency versions to ensure reproducible test results. Without a version constraint, future updates to polars could potentially introduce incompatibilities or different behaviors.

Let"s check if polars has any version constraints in the codebase:

tests/tests.py (1)

472-474: LGTM! Test function aligns with PR objectives.

The test function is well-placed among other parameter-related tests and follows the established testing patterns in the file.

tests/test_params_pickling/test.py Outdated Show resolved Hide resolved
snakemake/script/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
snakemake/script/__init__.py (1)

156-198: Add error handling for serialization operations.

The serialization operations (.to_dict(), .to_list(), etc.) could potentially fail. Consider wrapping these operations in try-except blocks to provide meaningful error messages.

Example implementation:

try:
    self._params_store[i] = value.to_dict()
except Exception as e:
    raise ValueError(f"Failed to serialize parameter at index {i}: {e}") from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 82a8fa9 and 574a29e.

📒 Files selected for processing (1)
  • snakemake/script/__init__.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/script/__init__.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
snakemake/script/__init__.py

146-151: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (1)
snakemake/script/__init__.py (1)

187-191: ⚠️ Potential issue

Fix the conditional chain for polars DataFrame and Series checks.

The if conditions for pl.DataFrame and pl.Series should be chained with elif to maintain mutual exclusivity and prevent unnecessary checks.

Apply this diff to fix the conditional chain:

-                if isinstance(value, pl.DataFrame):
+                elif isinstance(value, pl.DataFrame):
                     self._params_store[i] = value.to_dict(as_series=False)
                     self._params_types[i] = "pl.DataFrame"
-                elif isinstance(value, pl.Series):
+                elif isinstance(value, pl.Series):

Likely invalid or redundant comment.

snakemake/script/__init__.py Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/test.py (2)

5-7: Consider adding descriptive assertion messages.

While the type checking is correct, adding descriptive error messages would make debugging easier if assertions fail.

Consider this improvement:

-assert isinstance(snakemake.params.testnp, np.ndarray)
-assert isinstance(snakemake.params.testpl, pl.DataFrame)
-assert isinstance(snakemake.params.testpd, pd.DataFrame)
+assert isinstance(snakemake.params.testnp, np.ndarray), "testnp parameter must be a numpy array"
+assert isinstance(snakemake.params.testpl, pl.DataFrame), "testpl parameter must be a polars DataFrame"
+assert isinstance(snakemake.params.testpd, pd.DataFrame), "testpd parameter must be a pandas DataFrame"
🧰 Tools
🪛 Ruff

5-5: Undefined name snakemake

(F821)


6-6: Undefined name snakemake

(F821)


7-7: Undefined name snakemake

(F821)


1-11: Well-structured test design for parameter pickling validation.

The test effectively validates both type safety and data persistence for all three data structures (numpy, polars, pandas). The focused scope aligns perfectly with the PR"s objective of ensuring robust, version-agnostic pickling of parameters.

🧰 Tools
🪛 Ruff

5-5: Undefined name snakemake

(F821)


6-6: Undefined name snakemake

(F821)


7-7: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 574a29e and 6047a27.

📒 Files selected for processing (1)
  • tests/test_params_pickling/test.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/test_params_pickling/test.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
tests/test_params_pickling/test.py

5-5: Undefined name snakemake

(F821)


6-6: Undefined name snakemake

(F821)


7-7: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)

🔇 Additional comments (2)
tests/test_params_pickling/test.py (2)

1-3: LGTM! Clean and conventional imports.

The imports follow standard conventions and align with the PR"s objective of handling pandas, polars, and numpy data structures.


9-11: Consider adding error handling for I/O operations.

The export operations could fail due to various I/O issues (disk space, permissions, etc.). Consider wrapping these operations in a try-except block.

Let"s verify the output paths and permissions:

🧰 Tools
🪛 Ruff

9-9: Undefined name snakemake

(F821)


9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)

@johanneskoester johanneskoester merged commit eb11137 into main Nov 1, 2024
36 checks passed
@johanneskoester johanneskoester deleted the fix/pickling branch November 1, 2024 11:52
johanneskoester pushed a commit that referenced this pull request Nov 1, 2024
🤖 I have created a release *beep* *boop*
---


##
[8.25.1](v8.25.0...v8.25.1)
(2024-11-01)


### Bug Fixes

* ensure correct topological order when touching group job outputs
([#3181](#3181))
([5924a3e](5924a3e))
* ensure version agnostic robust pickling of pandas, polars and numpy
data structures passed as params to Python scripts or notebooks
([#3175](#3175))
([eb11137](eb11137))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant