-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ensure version agnostic robust pickling of pandas, polars and numpy data structures passed as params to Python scripts or notebooks #3175
Conversation
…mpy data structures passed as params to Python scripts or notebooks
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces significant changes to the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot"s finite context window. It"s strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/Snakefile (1)
1-7
: Consider standardizing data types in test data.
The numpy array mixes integer and float64 types which might mask potential issues. Consider either using consistent types or explicitly testing more type combinations.
-testnp = np.array([1, 2, np.float64(3)])
+# Option 1: Consistent types
+testnp = np.array([1.0, 2.0, 3.0], dtype=np.float64)
+# Option 2: More explicit type testing
+testnp = np.array([np.int32(1), np.float32(2), np.float64(3)])
snakemake/script/__init__.py (1)
176-183
: Use elif
statements to avoid redundant type checks in _safely_store_params
In the _safely_store_params
method, the sequence of if
statements under the if pl:
block can lead to unnecessary type checks even after a match is found. Changing the second if
to elif
ensures that once a condition is met, the subsequent checks are skipped, enhancing efficiency.
Apply this diff to adjust the conditional statements:
if pl:
if isinstance(value, pl.LazyFrame):
self._params_store[i] = value.collect().to_dict(as_series=False)
self._params_types[i] = "pl.LazyFrame"
- if isinstance(value, pl.DataFrame):
+ elif isinstance(value, pl.DataFrame):
self._params_store[i] = value.to_dict(as_series=False)
self._params_types[i] = "pl.DataFrame"
elif isinstance(value, pl.Series):
self._params_store[i] = {
"name": value.name,
"values": value.to_list(),
}
self._params_types[i] = "pl.Series"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (3)
tests/test_params_pickling/expected-results/testnp.tsv
is excluded by!**/*.tsv
tests/test_params_pickling/expected-results/testpd.tsv
is excluded by!**/*.tsv
tests/test_params_pickling/expected-results/testpl.tsv
is excluded by!**/*.tsv
📒 Files selected for processing (5)
- snakemake/script/init.py (2 hunks)
- test-environment.yml (1 hunks)
- tests/test_params_pickling/Snakefile (1 hunks)
- tests/test_params_pickling/test.py (1 hunks)
- tests/tests.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/script/__init__.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
tests/test_params_pickling/test.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
tests/tests.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff
tests/test_params_pickling/test.py
5-5: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
5-5: Undefined name snakemake
(F821)
6-6: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
6-6: Undefined name snakemake
(F821)
7-7: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🔇 Additional comments (6)
tests/test_params_pickling/test.py (3)
1-3
: LGTM: Clean and appropriate imports
The imports are well-organized and include all necessary libraries for the test.
9-11
: LGTM: Consistent and appropriate data export operations
The export operations are well-implemented with:
- Appropriate export methods for each data type
- Consistent use of tab separators
- Correct output path handling through snakemake.output
🧰 Tools
🪛 Ruff
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
1-11
: Verify test coverage for all parameter pickling scenarios
While the test covers the basic cases for pandas, polars, and numpy data structures, let"s verify if we need additional test cases for:
- Different versions of these libraries
- Complex data structures within these formats
- Error cases (malformed data, unsupported types)
✅ Verification successful
Test coverage is appropriate for parameter pickling functionality
The test implementation is well-designed and sufficient for its purpose:
- It covers all three major data structure types (pandas DataFrame, polars DataFrame, numpy ndarray)
- The test data includes mixed types (integers and floating-point) which validates type preservation
- The Snakefile demonstrates correct parameter passing and the test.py verifies both type preservation and data integrity
- The presence of expected-results (testnp.tsv, testpd.tsv, testpl.tsv) ensures output validation
- The simple test data is appropriate as complex structures would be serialized the same way
Additional test cases for different versions or error scenarios would be handled by the CI pipeline running against different dependency versions, and error cases are covered by Python"s built-in type checking and serialization mechanisms.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for other test files that might contain related parameter pickling tests
fd -e py -e yaml -e json . tests/ --exec grep -l "params.*test[np]p|test[np]l|DataFrame|ndarray" {}
Length of output: 103
Script:
#!/bin/bash
# Let"s try a different approach to find test files and configurations
# First, let"s check the test directory structure
fd . tests/test_params_pickling/
# Then look for test configuration files
fd -e yaml -e json . tests/test_params_pickling/
# Finally, let"s examine the Snakefile or workflow file
fd Snakefile . tests/test_params_pickling/
Length of output: 829
Script:
#!/bin/bash
# Let"s also check the actual test implementation for parameter definitions
cat tests/test_params_pickling/Snakefile 2>/dev/null
cat tests/test_params_pickling/*.smk 2>/dev/null
Length of output: 471
Script:
#!/bin/bash
# Search for any other similar test patterns in the codebase
rg -g "!{node_modules,dist,build}" -l "DataFrame|ndarray" tests/
Length of output: 66
🧰 Tools
🪛 Ruff
5-5: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
5-5: Undefined name snakemake
(F821)
6-6: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
6-6: Undefined name snakemake
(F821)
7-7: Use is
and is not
for type comparisons, or isinstance()
for isinstance checks
(E721)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
tests/test_params_pickling/Snakefile (1)
10-20
: LGTM! Rule structure follows best practices.
The rule is well-structured with:
- Clear output file naming
- Consistent parameter naming matching the input variables
- Proper script reference
Let"s verify the test.py script exists and handles these parameters:
✅ Verification successful
✓ Parameters are properly handled in test.py
The test script correctly:
- Validates parameter types (numpy array, polars DataFrame, pandas DataFrame)
- Writes each parameter to its corresponding output file with tab separation
- Uses consistent naming between rule params and script variables
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the existence and content of the test script
if [ -f "tests/test_params_pickling/test.py" ]; then
echo "test.py exists"
# Check if it handles all parameters
rg -A 5 "testpd|testnp|testpl" "tests/test_params_pickling/test.py"
else
echo "test.py not found"
fi
Length of output: 518
test-environment.yml (1)
23-23
: Consider adding a version constraint for polars.
In test environments, it"s important to pin dependency versions to ensure reproducible test results. Without a version constraint, future updates to polars could potentially introduce incompatibilities or different behaviors.
Let"s check if polars has any version constraints in the codebase:
tests/tests.py (1)
472-474
: LGTM! Test function aligns with PR objectives.
The test function is well-placed among other parameter-related tests and follows the established testing patterns in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
snakemake/script/__init__.py (1)
156-198
: Add error handling for serialization operations.The serialization operations (
.to_dict()
,.to_list()
, etc.) could potentially fail. Consider wrapping these operations in try-except blocks to provide meaningful error messages.Example implementation:
try: self._params_store[i] = value.to_dict() except Exception as e: raise ValueError(f"Failed to serialize parameter at index {i}: {e}") from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake/script/__init__.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/script/__init__.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff
snakemake/script/__init__.py
146-151: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (1)
snakemake/script/__init__.py (1)
187-191
:
Fix the conditional chain for polars DataFrame and Series checks.
The if
conditions for pl.DataFrame
and pl.Series
should be chained with elif
to maintain mutual exclusivity and prevent unnecessary checks.
Apply this diff to fix the conditional chain:
- if isinstance(value, pl.DataFrame):
+ elif isinstance(value, pl.DataFrame):
self._params_store[i] = value.to_dict(as_series=False)
self._params_types[i] = "pl.DataFrame"
- elif isinstance(value, pl.Series):
+ elif isinstance(value, pl.Series):
Likely invalid or redundant comment.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/test.py (2)
5-7
: Consider adding descriptive assertion messages.While the type checking is correct, adding descriptive error messages would make debugging easier if assertions fail.
Consider this improvement:
-assert isinstance(snakemake.params.testnp, np.ndarray) -assert isinstance(snakemake.params.testpl, pl.DataFrame) -assert isinstance(snakemake.params.testpd, pd.DataFrame) +assert isinstance(snakemake.params.testnp, np.ndarray), "testnp parameter must be a numpy array" +assert isinstance(snakemake.params.testpl, pl.DataFrame), "testpl parameter must be a polars DataFrame" +assert isinstance(snakemake.params.testpd, pd.DataFrame), "testpd parameter must be a pandas DataFrame"🧰 Tools
🪛 Ruff
5-5: Undefined name
snakemake
(F821)
6-6: Undefined name
snakemake
(F821)
7-7: Undefined name
snakemake
(F821)
1-11
: Well-structured test design for parameter pickling validation.The test effectively validates both type safety and data persistence for all three data structures (numpy, polars, pandas). The focused scope aligns perfectly with the PR"s objective of ensuring robust, version-agnostic pickling of parameters.
🧰 Tools
🪛 Ruff
5-5: Undefined name
snakemake
(F821)
6-6: Undefined name
snakemake
(F821)
7-7: Undefined name
snakemake
(F821)
9-9: Undefined name
snakemake
(F821)
9-9: Undefined name
snakemake
(F821)
10-10: Undefined name
snakemake
(F821)
10-10: Undefined name
snakemake
(F821)
11-11: Undefined name
snakemake
(F821)
11-11: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
tests/test_params_pickling/test.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/test_params_pickling/test.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff
tests/test_params_pickling/test.py
5-5: Undefined name snakemake
(F821)
6-6: Undefined name snakemake
(F821)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🔇 Additional comments (2)
tests/test_params_pickling/test.py (2)
1-3
: LGTM! Clean and conventional imports.
The imports follow standard conventions and align with the PR"s objective of handling pandas, polars, and numpy data structures.
9-11
: Consider adding error handling for I/O operations.
The export operations could fail due to various I/O issues (disk space, permissions, etc.). Consider wrapping these operations in a try-except block.
Let"s verify the output paths and permissions:
🧰 Tools
🪛 Ruff
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🤖 I have created a release *beep* *boop* --- ## [8.25.1](v8.25.0...v8.25.1) (2024-11-01) ### Bug Fixes * ensure correct topological order when touching group job outputs ([#3181](#3181)) ([5924a3e](5924a3e)) * ensure version agnostic robust pickling of pandas, polars and numpy data structures passed as params to Python scripts or notebooks ([#3175](#3175)) ([eb11137](eb11137)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
QC
docs/
) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
Release Notes
New Features
Chores
polars
as a new dependency in the test environment setup.