feat: direct subsampling and applying of OSW during scoring #205
Merged
singjc merged 8 commits into PyProphet:master from Apr 26, 2026
…scores arrays in lookup_values_from_error_table
…meters are returned as a NumPy array
…ompatibility issues
- Enhanced `PyProphetRunner` to return output file for OSW file type.
- Improved error message for missing PYPROPHET_WEIGHTS table to include classifier type.
- Introduced new test outputs for OSW subsampling and weight application.
- Updated `OSWTestStrategy` to handle subsampling and weight application workflows.
- Added tests for OSW subsampling and applying weights to the full dataset.
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
This PR adds first-class OSW subsampling support during scoring by pushing the sampled subset into DuckDB views, so downstream feature queries operate only on the sampled precursors (similar to parquet workflows). It also adjusts scoring/apply-weights behavior for OSW and includes a few compatibility fixes (NumPy/Pandas).
Changes:
- Add OSW DuckDB initialization to materialize a sampled precursor-id set and apply it across OSW feature views.
- Extend scoring CLI subsampling validation to include OSW (alongside parquet variants).
- Improve weight persistence/application behavior (SVM weights in OSW weight table, commit), plus NumPy/Pandas compatibility fixes and new regression tests.
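The sampled-ID pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses Python's standard-library `sqlite3` as a stand-in for DuckDB, and the table, view, and function names (`feature`, `sampled_precursor_ids`, `feature_sampled`, `init_sampled_views`) are hypothetical. Rounding the sample size up with `math.ceil` is also an assumption.

```python
import math
import random
import sqlite3


def init_sampled_views(con, subsample_ratio, seed=0):
    """Store a random subset of precursor IDs in a temp table and expose a
    view of the feature table restricted to that subset.

    Hypothetical helper; all names here are illustrative assumptions.
    """
    ids = [
        row[0]
        for row in con.execute(
            "SELECT DISTINCT precursor_id FROM feature ORDER BY precursor_id"
        )
    ]
    # Assumed rounding rule: keep ceil(n * ratio) precursors, capped at n.
    n_keep = min(len(ids), math.ceil(len(ids) * subsample_ratio))
    sampled = random.Random(seed).sample(ids, n_keep)

    con.execute(
        "CREATE TEMP TABLE sampled_precursor_ids (precursor_id INTEGER PRIMARY KEY)"
    )
    con.executemany(
        "INSERT INTO sampled_precursor_ids VALUES (?)", [(i,) for i in sampled]
    )
    # Downstream queries read from this view, so they only see sampled rows.
    con.execute(
        "CREATE TEMP VIEW feature_sampled AS "
        "SELECT f.* FROM feature AS f "
        "JOIN sampled_precursor_ids AS s ON s.precursor_id = f.precursor_id"
    )
    return n_keep


# Toy data: 10 precursors with 3 feature rows each.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE feature (id INTEGER PRIMARY KEY, precursor_id INTEGER, score REAL)"
)
con.executemany(
    "INSERT INTO feature (precursor_id, score) VALUES (?, ?)",
    [(p, 0.1 * k) for p in range(10) for k in range(3)],
)

n_keep = init_sampled_views(con, subsample_ratio=0.5)
n_rows = con.execute("SELECT COUNT(*) FROM feature_sampled").fetchone()[0]
n_prec = con.execute(
    "SELECT COUNT(DISTINCT precursor_id) FROM feature_sampled"
).fetchone()[0]
```

Because the filter lives in the view, queries written against it need no per-query sampling logic, which is the property the overview attributes to the DuckDB views.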
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pyprophet/io/_base.py` | Adds `BaseOSWReader._init_duckdb_views()` to create a sampled precursor-id temp table when subsampling. |
| `pyprophet/io/scoring/osw.py` | Calls `_init_duckdb_views()` and threads subsample filtering into DuckDB feature-view creation; updates OSW weight saving logic. |
| `pyprophet/cli/score.py` | Treats OSW as a supported format for `--subsample_ratio` workflows. |
| `pyprophet/scoring/runner.py` | Returns OSW path from scoring runner; improves apply-weights error message. |
| `pyprophet/scoring/classifiers.py` | Forces feature matrices/parameters to NumPy arrays (dtype float32) for better Pandas/NumPy interop and clearer failure mode. |
| `pyprophet/stats.py` | Copies arrays before calling optimized matching to avoid read-only buffer issues in newer NumPy. |
| `tests/test_pyprophet_score.py` | Adds OSW subsample and OSW apply-weights regression tests; plumbs `--subsample_ratio` into OSW strategy execution. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample.out` | Adds golden output for OSW subsampling test. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample_apply_weights.out` | Adds golden output for OSW apply-weights test. |
| `pyprophet/scoring/_optimized.c` | Updates cython-generated references (no intended functional change). |
…adjust test commands for subsampling ratio
Co-authored-by: Copilot <copilot@github.com>
… commands
- Modified expected output values in `test_pyprophet_score.test_osw_1.out` and `test_pyprophet_score.test_tsv_1.out` to reflect changes in scoring results.
- Increased subsampling ratio from 0.5 to 1.0 in pyprophet scoring commands for both metabolomics and regular OSW workflows in `test_pyprophet_score.py`.
Co-authored-by: Copilot <copilot@github.com>
…nd improve stability across platforms
This pull request introduces improved and more flexible subsampling support for OSW files, aligning the behavior with other supported file types (such as parquet) and making it easier to perform semi-supervised learning on subsets of the data. The implementation ensures that subsampling is handled efficiently within DuckDB views, and that all downstream feature queries respect the sampled subset. Additionally, there are minor improvements to file type validation and weight saving logic.
Enhanced OSW subsampling and feature querying:
- Added `_init_duckdb_views` method to `BaseOSWReader` to create a temporary table of sampled precursor IDs when subsampling is enabled, allowing efficient filtering for all feature queries. This method is now called in the OSW reader before creating feature views.
- Updated the feature-fetching methods in `pyprophet/io/scoring/osw.py` (`_fetch_ms2_features_duckdb`, `_fetch_ms1_features_duckdb`, `_fetch_transition_features_duckdb`, `_fetch_alignment_features_duckdb`) to optionally filter by the sampled precursor IDs, ensuring that only the subsampled data is processed when requested.
User experience and compatibility improvements:
Minor/maintenance:
- Added `import math` to `pyprophet/io/_base.py` for use in subsample size calculation.
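As a rough illustration of the kind of subsample size calculation that needs `math`, the sketch below converts a ratio into a precursor count. The function name `subsample_size` is hypothetical, and the choice of `math.ceil` (rather than floor or round) and the accepted ratio range are assumptions; the PR text does not state which rule is used.

```python
import math


def subsample_size(n_total, ratio):
    """Return how many precursors to keep for a given subsample ratio.

    Hypothetical helper: rounding up is an assumed rule, chosen so that any
    positive ratio keeps at least one precursor when n_total > 0.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"ratio must be in (0, 1], got {ratio}")
    return min(n_total, math.ceil(n_total * ratio))


quarter = subsample_size(1000, 0.25)
odd_half = subsample_size(3, 0.5)
everything = subsample_size(7, 1.0)
```

Rounding up avoids an empty sample for small inputs, and a ratio of 1.0 leaves the dataset unchanged.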