Check for silent lexicographic comparison against string-typed value#240

Draft
thodson-usgs wants to merge 5 commits into DOI-USGS:main from thodson-usgs:add-filter-pitfall-check

Conversation


thodson-usgs (Collaborator) commented Apr 24, 2026

Summary (draft — for discussion)

Every queryable field on every OGC collection in the Water Data API is type=string server-side. So any unquoted numeric comparison in a CQL-text filter — value >= 1000, parameter_code = 60, district_code = 1 — isn't a valid numeric comparison: empirically the server returns HTTP 500 Internal Server Error. Even if the user does quote the literal, the comparison is lexicographic (e.g. value > '10' returns rows where value='34.52' because first-char '3' > '1'), and zero-padded codes like parameter_code = '60' silently match nothing because the real values are '00060'-shaped.

Either way the user's intent was numeric, and what they get back is either a 500 or silently wrong rows — in both cases the failure is opaque. This PR catches the pattern client-side and raises a clear ValueError before the request fires.
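Both failure modes follow from plain string-comparison semantics and are easy to reproduce locally without touching the API:

```python
# 1. Lexicographic ordering: strings compare character-by-character,
#    so '34.52' sorts after '10' because '3' > '1'.
print('34.52' > '10')   # True — value > '10' matches value='34.52'
print('12' > '1000')    # True — '12' sorts above '1000'

# 2. Zero-padded codes: string equality is exact, so an unpadded
#    code never matches the stored '00060'-shaped form.
print('60' == '00060')  # False — parameter_code = '60' matches nothing
```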

Motivation

Flagged during review of DOI-USGS/dataRetrieval#880 (R side), where ldecicco-USGS pushed back on exposing a generic filter kwarg:

If I saw a generic filter argument, my first thought would be SWEET, I want to answer all these interesting questions about the data. So I'll set filter=value >1000. The problem with that is it works, but since value is a character, it's filtering all the values that are alphabetically above "1000" (like "12"). … My gut says there would be way more people trying stuff like that and then either get the wrong results unknowingly, or complain that dataRetrieval is broken…

This is the smallest change that addresses her concern without removing the filter feature.

What the check does

Runs once per cql-text filter, inside _plan_filter_chunks, before any HTTP traffic:

>>> waterdata.get_continuous(
...     monitoring_location_id="USGS-02238500",
...     parameter_code="00060",
...     filter="value >= 1000",
...     filter_lang="cql-text",
... )
ValueError: Filter compares 'value' to unquoted numeric 1000. Every queryable
on the Water Data API is typed as a string, so ``value >= 1000`` is not a
valid numeric comparison — empirically the server rejects unquoted numeric
literals with HTTP 500. Even if you quote the literal (``value >= '1000'``)
the comparison is lexicographic, which silently misses zero-padded codes
(e.g. ``parameter_code = '60'`` matches nothing because the real codes are
``'00060'``-shaped) and sorts ``value='12'`` above ``value='1000'``. For a
numeric filter, fetch a wider result and reduce in pandas after the call.
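The workaround the error message recommends — fetch a wider result and reduce client-side — looks roughly like this (the sample frame is illustrative; in practice `df` would come from a `waterdata.get_continuous(...)` call without the numeric filter):

```python
import pandas as pd

# Stand-in for a result frame whose 'value' column arrives string-typed.
df = pd.DataFrame({"value": ["34.52", "1200.0", "980"]})

# Coerce the string column to numeric, then filter in pandas.
numeric = pd.to_numeric(df["value"], errors="coerce")
high = df[numeric >= 1000]
print(high)  # only the '1200.0' row survives
```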

Scope: universal, not a watchlist

Every queryable property across every OGC endpoint is type=string — confirmed empirically:

| endpoint | numeric-looking string fields |
| --- | --- |
| continuous | value, parameter_code, statistic_id |
| daily | value, parameter_code, statistic_id |
| field-measurements | value, parameter_code |
| latest-continuous | value, parameter_code, statistic_id |
| latest-daily | value, parameter_code, statistic_id |
| time-series-metadata | parameter_code, statistic_id, hydrologic_unit_code |
| monitoring-locations | monitoring_location_number, district_code, state_code, county_code |
| channel-measurements | measurement_number, channel_flow, channel_width, channel_area, channel_velocity |

Since there's no such thing as a legitimate numeric comparison on this API, the regex flags any <identifier> <op> <unquoted numeric literal> (or the reverse), regardless of field. Quoted literals (value >= '1000') are not flagged — the caller has signalled they want sort-order semantics.
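A minimal sketch of such a check — not the PR's actual implementation; the function name and exact pattern are illustrative. Quoted literals are blanked first, which is what keeps identifiers that only appear inside string literals from being flagged:

```python
import re

# Blank out quoted string literals so identifiers inside them
# (e.g. name = 'see district_code = 1 in docs') are never flagged.
_QUOTED = re.compile(r"'[^']*'")

# <identifier> <op> <unquoted numeric literal>, or the reverse.
_NUM_CMP = re.compile(
    r"\b[A-Za-z_]\w*\s*(?:>=|<=|!=|=|>|<)\s*-?\d+(?:\.\d+)?\b"
    r"|-?\d+(?:\.\d+)?\s*(?:>=|<=|!=|=|>|<)\s*\b[A-Za-z_]\w*\b"
)

def has_numeric_comparison(filter_text: str) -> bool:
    return _NUM_CMP.search(_QUOTED.sub("''", filter_text)) is not None

print(has_numeric_comparison("value >= 1000"))    # True  — would raise
print(has_numeric_comparison("value >= '1000'"))  # False — quoted, allowed
```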

Live evidence

filter="parameter_code = '00060'"   → 200, 5 rows (correct)
filter="parameter_code = 60"        → 500 Internal Server Error  ← we catch this
filter="value > '10'"               → 200, 31 rows of '34.52', '63160', …  (lex)
filter="value > 10"                 → 500 Internal Server Error  ← we catch this

Test plan

  • ruff check / ruff format --check pass.
  • pytest tests/waterdata_utils_test.py — 66/66 pass (32 prior + 34 new). The new tests cover:
    • 21 raise cases: every op (>=, >, <=, <, =, !=) × both orderings (x OP N and N OP x), floats, negatives, multiple real-world fields (value, parameter_code, statistic_id, district_code, county_code, hydrologic_unit_code, channel_flow, channel_velocity), and nested in AND/OR expressions.
    • 14 allow cases: quoted literals for every watchlist-replaced field, pure string comparisons, IN lists, and false-positive guards (identifiers appearing only inside quoted string literals like name = 'see district_code = 1 in docs').
    • End-to-end: the error surfaces through get_continuous before _construct_api_requests is ever called (mock-verified).
  • Full non-live suite — 136/136 pass.
  • Live-probed actual server behavior against USGS-02238500 / continuous — unquoted numeric RHS consistently returns 500; quoted literal returns 200 with lex-sorted results.

Open for discussion

  • Warn vs. raise? I went with raise because empirically the alternative is either a 500 or silently wrong results — a warning would be easy to miss, and "broken dataRetrieval" bug reports would follow.
  • Surface location? Currently in _plan_filter_chunks alongside the other filter validation (chunkability, URL-budget). Could hoist to the get_ogc_data entry.
  • False positives on CAST / function-call syntax (CAST(value AS FLOAT) > 1000): the regex is scoped to simple \b<ident>\b \s* <op> \s* <num> patterns and — empirically — the server doesn't support CAST or CQL2 functions on these endpoints anyway, so the false-positive rate should be near zero in practice.

Marked draft so it can ride along with the R-side discussion before landing.

🤖 Generated with Claude Code

thodson-usgs and others added 3 commits April 23, 2026 15:18
Add ``filter`` and ``filter_lang`` kwargs to the OGC ``waterdata``
getters (``get_continuous``, ``get_daily``, ``get_monitoring_locations``,
``get_time_series_metadata``, ``get_latest_continuous``,
``get_latest_daily``, ``get_field_measurements``, ``get_channel``, and
the other collections built on the same plumbing). The value is
forwarded verbatim to the server's OGC API ``filter`` query
parameter, letting callers use server-side CQL expressions that
aren't expressible via the other kwargs — most commonly, OR'ing
several disjoint time ranges into a single request.

- A new ``FILTER_LANG = Literal["cql-text", "cql-json"]`` type alias
  is added in ``waterdata.types`` and re-exported from the package.
  The server accepts ``cql-text`` (default) and ``cql-json`` today;
  the ``cql2-*`` dialects are not yet supported.

- A long top-level ``OR`` chain is transparently split into multiple
  sub-requests that each fit under the server's URI length limit,
  and the results are concatenated. Filters without a top-level
  ``OR`` are sent as a single request unchanged.

- ``get_continuous`` gains docstring examples showing both the
  simple two-window case and the programmatically-built many-window
  case that exercises the auto-chunking path.

- NEWS.md gains a v1.1.0 highlights block covering this change
  along with the other user-visible additions since release.

- ``tests/waterdata_utils_test.py`` grows coverage for filter
  forwarding, the OR-chunking paths, and error handling.
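The OR-chunking described above can be sketched as a greedy packer — a hypothetical helper under an illustrative name, not the package's actual splitter. Each returned chunk becomes one sub-request, and the caller concatenates the results:

```python
from urllib.parse import quote

def chunk_or_filter(clauses, base_url_len, max_url_len=2000):
    """Greedily pack top-level OR clauses into sub-filters that each
    fit under the URL budget once percent-encoded (illustrative only;
    a single oversized clause is passed through as its own chunk)."""
    chunks, current = [], []
    for clause in clauses:
        candidate = " OR ".join(current + [clause])
        if current and base_url_len + len(quote(candidate)) > max_url_len:
            chunks.append(" OR ".join(current))  # flush the full chunk
            current = [clause]
        else:
            current.append(clause)
    if current:
        chunks.append(" OR ".join(current))
    return chunks
```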
For each target timestamp, fetch the nearest continuous observation in
a single round trip. Builds a CQL OR-chain of per-target bracketed
windows, pipes it through ``get_continuous`` (which already auto-chunks
long filters across multiple sub-requests), then selects the single
observation closest to each target client-side.

Exists because the Water Data API matches a single-instant ``time=``
parameter exactly (10:30:31 returns zero rows on a 15-minute gauge),
does not implement ``sortby`` for arbitrary queryables, and does not
expose a ``T_NEAREST`` CQL function. The narrow-window + client-side
reduction is the one pattern that works today for multi-target
nearest lookups in one API call.

Tie handling is configurable via ``on_tie``:
  - "first" (default): keep the earlier observation
  - "last":  keep the later observation
  - "mean":  average numeric columns; set ``time`` to the target
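The on_tie semantics can be sketched as follows (hypothetical helper name and column names; the commit's actual implementation lives in nearest.py):

```python
import pandas as pd

def resolve_tie(ties, target, on_tie="first"):
    """Pick one observation from rows equidistant to ``target``."""
    ties = ties.sort_values("time")
    if on_tie == "first":
        return ties.iloc[0]   # keep the earlier observation
    if on_tie == "last":
        return ties.iloc[-1]  # keep the later observation
    if on_tie == "mean":
        row = ties.iloc[0].copy()
        for c in ties.select_dtypes("number").columns:
            row[c] = ties[c].mean()  # average numeric columns
        row["time"] = target         # stamp the target time
        return row
    raise ValueError(f"unknown on_tie: {on_tie!r}")
```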
@thodson-usgs thodson-usgs force-pushed the add-filter-pitfall-check branch from 559b466 to 68d1a81 Compare April 24, 2026 19:51
thodson-usgs and others added 2 commits April 24, 2026 15:03
``get_nearest_continuous`` is built entirely on top of the CQL
``filter`` passthrough (it constructs an OR-chain and calls
``get_continuous(..., filter=..., filter_lang='cql-text')``), so it
has no meaning without the filter feature. If the filter feature is
ever rolled back, this function rolls back with it.

Putting the function + its four private helpers in a dedicated
``dataretrieval/waterdata/nearest.py`` sibling of ``filters.py``
makes that linked rollback clean:

    rm filters.py nearest.py
    rm tests/waterdata_filters_test.py tests/waterdata_nearest_test.py

plus trimming two re-export lines in ``__init__.py`` and the filter
kwargs in ``api.py``. Versus the previous layout, that replaces
"find and excise a 232-line cluster across two blocks inside
``api.py``" with "delete a file."

Same public surface (``waterdata.get_nearest_continuous`` still
imported from the package root); mock paths in
``tests/waterdata_nearest_test.py`` updated to
``dataretrieval.waterdata.nearest.get_continuous`` since ``nearest.py`` does
``from .api import get_continuous`` and mocks need to patch where
the name is looked up.

All 176 non-live tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running the rollback proposal literally against a disposable working
tree showed the documented "rollback surface" in ``filters.py``
understated the real cost by ~4× — specifically, api.py was carrying
~120 lines of filter-specific docstring text that would also have to
be removed.

Three changes to close the gap:

- Replace the 8 × 11-line ``filter`` / ``filter_lang`` kwarg docstring
  stanzas with 5-line pointers to ``dataretrieval.waterdata.filters``
  (one "see there for grammar / chunking / pitfall"). Users clicking
  through their IDE still get a useful reference; the authoritative
  prose lives in one place (the filters module docstring) rather than
  being duplicated eight times. Saves ~48 lines of api.py and makes
  each stanza disposable in 5 lines flat.

- Trim ``get_continuous``'s inline filter examples from two extended
  blocks (34 lines, showing both a hand-rolled OR pair and a
  programmatic chain) to a single compact example (14 lines, the
  two-window case). The second example was pedagogically redundant
  with what the filters module docstring covers. Saves ~20 lines.

- Update the "Isolation contract" in ``filters.py``'s module docstring
  to accurately enumerate the rollback surface: include
  ``nearest.py`` + its test file, the ``__all__`` entries in
  ``__init__.py``, the docstring pointers (now trimmed), and the one
  remaining filter example. Acknowledges the two-line ``filter_lang``
  → ``filter-lang`` URL translation as optional dead-code removal.

No behaviour change. All 176 non-live tests still pass. Measured
rollback surface after this commit: ~70 lines of surgery across 3
source files + 4 file deletions, vs. ~150 lines before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>