feat: direct subsampling and applying of OSW during scoring by singjc · Pull Request #205 · PyProphet/pyprophet

singjc · 2026-04-26T17:09:13Z

This pull request introduces improved and more flexible subsampling support for OSW files, aligning the behavior with other supported file types (such as parquet) and making it easier to perform semi-supervised learning on subsets of the data. The implementation ensures that subsampling is handled efficiently within DuckDB views, and that all downstream feature queries respect the sampled subset. Additionally, there are minor improvements to file type validation and weight saving logic.

Enhanced OSW subsampling and feature querying:

Added a new `_init_duckdb_views` method to `BaseOSWReader` to create a temporary table of sampled precursor IDs when subsampling is enabled, allowing efficient filtering for all feature queries. This method is now called in the OSW reader before creating feature views.
Updated all DuckDB feature view creation methods in `pyprophet/io/scoring/osw.py` (`_fetch_ms2_features_duckdb`, `_fetch_ms1_features_duckdb`, `_fetch_transition_features_duckdb`, `_fetch_alignment_features_duckdb`) to optionally filter by the sampled precursor IDs, ensuring that only the subsampled data is processed when requested.
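The temp-table approach described above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses the standard-library `sqlite3` module in place of DuckDB so it runs without extra dependencies, and the table names, column names, `create_sampled_views` helper, seed, and the rounded-up sample size are all assumptions made for the example.

```python
import math
import random
import sqlite3


def create_sampled_views(con, subsample_ratio, seed=42):
    """Create a temp table of sampled precursor IDs and a feature view filtered by it."""
    ids = [row[0] for row in con.execute("SELECT ID FROM PRECURSOR ORDER BY ID")]
    # Assumption: sample size is ratio * precursor count, rounded up.
    n_sample = min(len(ids), math.ceil(subsample_ratio * len(ids)))
    sampled = random.Random(seed).sample(ids, n_sample)

    con.execute(
        "CREATE TEMP TABLE sampled_precursor_ids (PRECURSOR_ID INTEGER PRIMARY KEY)"
    )
    con.executemany(
        "INSERT INTO sampled_precursor_ids VALUES (?)", [(i,) for i in sampled]
    )
    # The feature view joins against the temp table, so only rows that
    # belong to a sampled precursor are visible to downstream queries.
    con.execute(
        """
        CREATE TEMP VIEW ms2_features AS
        SELECT f.ID AS FEATURE_ID, f.PRECURSOR_ID, f.SCORE
        FROM FEATURE AS f
        INNER JOIN sampled_precursor_ids AS s
            ON s.PRECURSOR_ID = f.PRECURSOR_ID
        """
    )
    return n_sample


# Toy data: 10 precursors with two features each (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY)")
con.execute(
    "CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER, SCORE REAL)"
)
con.executemany("INSERT INTO PRECURSOR VALUES (?)", [(i,) for i in range(10)])
con.executemany(
    "INSERT INTO FEATURE (PRECURSOR_ID, SCORE) VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(20)],
)

n_sampled = create_sampled_views(con, subsample_ratio=0.5)
n_rows = con.execute("SELECT COUNT(*) FROM ms2_features").fetchone()[0]
n_precursors = con.execute(
    "SELECT COUNT(DISTINCT PRECURSOR_ID) FROM ms2_features"
).fetchone()[0]
```

Sampling at the precursor level keeps all features of a selected precursor together, which is presumably why the filter is keyed on precursor IDs rather than on individual feature rows.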

User experience and compatibility improvements:

Extended file type validation in the scoring CLI so that OSW files, in addition to parquet formats, now support subsampling directly. The warning message has been updated to reflect this, and users of unsupported formats are advised to manually subsample their data.
In the OSW writer, updated the logic so that SVM classifier weights are saved in the same table as LDA weights, and ensured that the database commit is performed after saving.
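The file type check described above might look roughly like the sketch below. The function name, the set of format labels, and the choice to fall back to a ratio of 1.0 for unsupported formats are assumptions for illustration; the PR only states that OSW joins the parquet formats and that other formats get a warning advising manual subsampling.

```python
import warnings

# Assumption: labels for the formats that can be subsampled directly.
SUBSAMPLE_FORMATS = {"osw", "parquet", "parquet_split", "parquet_split_multi"}


def effective_subsample_ratio(file_type, subsample_ratio):
    """Return the ratio to use, warning when the format cannot be subsampled."""
    if not 0.0 < subsample_ratio <= 1.0:
        raise ValueError("subsample_ratio must be in the interval (0, 1]")
    if subsample_ratio < 1.0 and file_type not in SUBSAMPLE_FORMATS:
        warnings.warn(
            f"Direct subsampling is not supported for '{file_type}' files; "
            "subsample the input manually before scoring."
        )
        # Assumption: unsupported formats are scored on the full data.
        return 1.0
    return subsample_ratio


ratio = effective_subsample_ratio("osw", 0.5)
```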

Minor/maintenance:

Added import math to pyprophet/io/_base.py for use in subsample size calculation.
Updated Cython-generated file references to reflect newer NumPy versions (no functional impact).


- Enhanced `PyProphetRunner` to return output file for OSW file type.
- Improved error message for missing PYPROPHET_WEIGHTS table to include classifier type.
- Introduced new test outputs for OSW subsampling and weight application.
- Updated `OSWTestStrategy` to handle subsampling and weight application workflows.
- Added tests for OSW subsampling and applying weights to the full dataset.

Co-authored-by: Copilot <copilot@github.com>
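A check of the kind the improved error message implies could be sketched like this. OSW files are SQLite databases, so the example uses the standard-library `sqlite3` module; the `require_weights_table` helper, the exception type, and the message wording are hypothetical and only illustrate including the classifier type in the error.

```python
import sqlite3


def require_weights_table(con, classifier):
    """Raise a descriptive error if the PYPROPHET_WEIGHTS table is missing."""
    row = con.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'PYPROPHET_WEIGHTS'"
    ).fetchone()
    if row is None:
        raise RuntimeError(
            f"PYPROPHET_WEIGHTS table not found; cannot apply {classifier} "
            "weights. Run scoring first (for example on a subsample) to "
            "learn and store weights."
        )


con = sqlite3.connect(":memory:")
try:
    require_weights_table(con, "SVM")
    message = ""
except RuntimeError as exc:
    message = str(exc)
```

Naming the classifier in the message tells the user which scoring run has to be repeated before weights can be applied to the full dataset.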

Copilot

Pull request overview

This PR adds first-class OSW subsampling support during scoring by pushing the sampled subset into DuckDB views, so downstream feature queries operate only on the sampled precursors (similar to parquet workflows). It also adjusts scoring/apply-weights behavior for OSW and includes a few compatibility fixes (NumPy/Pandas).

Changes:

Add OSW DuckDB initialization to materialize a sampled precursor-id set and apply it across OSW feature views.
Extend scoring CLI subsampling validation to include OSW (alongside parquet variants).
Improve weight persistence/application behavior (SVM weights in OSW weight table, commit), plus NumPy/Pandas compatibility fixes and new regression tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `pyprophet/io/_base.py` | Adds `BaseOSWReader._init_duckdb_views()` to create a sampled precursor-id temp table when subsampling. |
| `pyprophet/io/scoring/osw.py` | Calls `_init_duckdb_views()` and threads subsample filtering into DuckDB feature-view creation; updates OSW weight saving logic. |
| `pyprophet/cli/score.py` | Treats OSW as a supported format for `--subsample_ratio` workflows. |
| `pyprophet/scoring/runner.py` | Returns OSW path from scoring runner; improves apply-weights error message. |
| `pyprophet/scoring/classifiers.py` | Forces feature matrices/parameters to NumPy arrays (dtype float32) for better Pandas/NumPy interop and clearer failure mode. |
| `pyprophet/stats.py` | Copies arrays before calling optimized matching to avoid read-only buffer issues in newer NumPy. |
| `tests/test_pyprophet_score.py` | Adds OSW subsample and OSW apply-weights regression tests; plumbs `--subsample_ratio` into OSW strategy execution. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample.out` | Adds golden output for OSW subsampling test. |
| `tests/_regtest_outputs/test_pyprophet_score.test_osw_subsample_apply_weights.out` | Adds golden output for OSW apply-weights test. |
| `pyprophet/scoring/_optimized.c` | Updates Cython-generated references (no intended functional change). |


… commands
- Modified expected output values in `test_pyprophet_score.test_osw_1.out` and `test_pyprophet_score.test_tsv_1.out` to reflect changes in scoring results.
- Increased subsampling ratio from 0.5 to 1.0 in pyprophet scoring commands for both metabolomics and regular OSW workflows in `test_pyprophet_score.py`.

Co-authored-by: Copilot <copilot@github.com>


singjc and others added 4 commits April 17, 2026 09:57

Fix read-only buffer issue by ensuring writable copies of cutoff and scores arrays in lookup_values_from_error_table

0a7f816

Fix sequence type error in LDALearner.get_parameters by ensuring parameters are returned as a NumPy array

7b8c452

Ensure proper numpy array conversion in score methods to fix pandas compatibility issues

954dac1

Copilot AI review requested due to automatic review settings April 26, 2026 17:09

singjc enabled auto-merge April 26, 2026 17:09

Copilot started reviewing on behalf of singjc April 26, 2026 17:09

Copilot AI reviewed Apr 26, 2026

Comment thread pyprophet/io/scoring/osw.py

Comment thread pyprophet/io/scoring/osw.py

Comment thread pyprophet/io/scoring/osw.py

Comment thread tests/test_pyprophet_score.py

Comment thread tests/test_pyprophet_score.py

singjc and others added 4 commits April 26, 2026 13:58

Update OSWReader to accept customizable precursor ID expressions and adjust test commands for subsampling ratio

Co-authored-by: Copilot <copilot@github.com>

98a2d88

Refactor regtest output handling to normalize floating-point values and improve stability across platforms

010018c

Fix precision of var_bseries_score in regtest output

ac42e6a
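One way to normalize floating-point values in regression-test output, as the regtest commits above describe, is to round every decimal literal to a fixed number of digits before comparison. The sketch below is an assumption about the general technique, not the PR's implementation; the `normalize_floats` name, the regular expression, and the default of four digits are illustrative.

```python
import re

# Matches decimal literals such as 1.25, -0.5 or 3.0e-05 (not bare integers).
_FLOAT_RE = re.compile(r"-?\d+\.\d+(?:[eE][-+]?\d+)?")


def normalize_floats(text, ndigits=4):
    """Round every decimal literal in text to ndigits decimal places."""

    def _round(match):
        return f"{float(match.group(0)):.{ndigits}f}"

    return _FLOAT_RE.sub(_round, text)


line = "var_bseries_score 1.23456789 0.30000000000000004 id 42"
normalized = normalize_floats(line)
```

Rounding both the expected and the actual output this way hides last-digit differences between platforms while leaving integer columns untouched.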

singjc merged commit 301e13b into PyProphet:master Apr 26, 2026
1 check passed

singjc merged 8 commits into PyProphet:master from singjc:master
