
Toxicity evaluation #90

Open

rkritika1508 wants to merge 47 commits into main from feat/toxicity-evaluation

Conversation


rkritika1508 (Collaborator) commented Apr 13, 2026

Summary

Target issue is #89
Adds an offline evaluation script for toxicity detection covering three guardrail validators across two benchmark datasets.

New file: backend/app/evaluation/toxicity/run.py

  • Evaluates LlamaGuard7B, NSFWText, and ProfanityFree validators
  • Runs against two datasets:
    • HASOC (toxicity_test_hasoc.csv) — English/Hindi tweets labeled HOF/NOT
    • ShareChat (toxicity_test_sharechat.csv) — Hindi comments with binary labels
  • Collates all three validator predictions into a single CSV per dataset
  • Metrics (accuracy, precision, recall, F1, latency, memory) written per validator into a single JSON per dataset
  • Tracks latency (mean, p95, max) and peak memory via Profiler
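
A rough sketch of the collation loop this describes is shown below (the Profiler interface, validator construction, and the pass/fail check are simplified assumptions, not the exact implementation in run.py):

```python
# Simplified sketch of the per-dataset collation loop described above.
# `make_profiler` and the validator factories stand in for the real helpers
# in run.py; the pass/fail check on the Guardrails result is also an assumption.
import pandas as pd

def collate_predictions(csv_path, text_col, validators, make_profiler):
    df = pd.read_csv(csv_path)
    texts = df[text_col].fillna("").astype(str)
    for name, make_validator in validators.items():
        validator = make_validator()
        profiler = make_profiler()   # assumed to track latency (mean/p95/max) and peak memory
        preds = []
        for text in texts:
            outcome = profiler.record(lambda t=text: validator.validate(t, metadata={}))
            preds.append(0 if getattr(outcome, "outcome", "fail") == "pass" else 1)
        df[f"{name}_pred"] = preds   # e.g. nsfw_text_pred, profanity_free_pred
    return df
```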

Updated: backend/app/evaluation/README.md

  • Added toxicity to folder structure
  • Added a full Toxicity section documenting datasets, expected columns, outputs, the run command, and notes on remote inference / model download
  • Updated "Running All Evaluations" blurb to include toxicity

Updated: backend/scripts/run_all_evaluations.sh

  • Added toxicity/run.py to the RUNNERS array

Output structure

outputs/toxicity/
  predictions_hasoc.csv       # text, y_true, llamaguard_7b_pred, nsfw_text_pred, profanity_free_pred
  metrics_hasoc.json          # all 3 validators' metrics in one file
  predictions_sharechat.csv
  metrics_sharechat.json
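
For reference, each metrics JSON could be derived from the collated predictions roughly as follows (a sketch assuming scikit-learn; the real script's helpers and exact JSON layout may differ, and the latency/memory figures come from the Profiler rather than this step):

```python
# Sketch: turn one collated predictions CSV into one metrics JSON per dataset.
import json
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def write_metrics(predictions_csv, metrics_json, validator_cols):
    df = pd.read_csv(predictions_csv)
    metrics = {}
    for col in validator_cols:
        precision, recall, f1, _ = precision_recall_fscore_support(
            df["y_true"], df[col], average="binary", zero_division=0
        )
        metrics[col] = {
            "accuracy": float(accuracy_score(df["y_true"], df[col])),
            "precision": float(precision),
            "recall": float(recall),
            "f1": float(f1),
        }
    with open(metrics_json, "w") as fh:
        json.dump(metrics, fh, indent=2)

write_metrics(
    "outputs/toxicity/predictions_hasoc.csv",
    "outputs/toxicity/metrics_hasoc.json",
    ["llamaguard_7b_pred", "nsfw_text_pred", "profanity_free_pred"],
)
```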

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

  • New Features

    • NSFW text detection validator added with configurable threshold, validation method, device, and model
    • API now recognizes "nsfw_text" as an available validator type
  • Evaluation

    • Offline toxicity evaluation runner added and documented to run NSFW + other validators across sample datasets
    • Container image pre-downloads the NSFW model to reduce first-run latency
  • Tests

    • Expanded integration and unit tests covering NSFW validator behavior and multi-validator combinations


coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Adds an NSFW text validator: schema, config class, enum and manifest registration, docs, Docker pre-download of the HF model, tests, evaluation runner, and required dependencies.

Changes

  • Validator Registration & Schema (backend/app/core/enum.py, backend/app/core/validators/validators.json, backend/app/schemas/guardrail_config.py): Added the NSFWText enum member, registered nsfw_text in validators.json, and included NSFWTextSafetyValidatorConfig in the ValidatorConfigItem discriminated union.
  • Validator Configuration (backend/app/core/validators/config/nsfw_text_safety_validator_config.py): New NSFWTextSafetyValidatorConfig class with fields (type, threshold, validation_method, device, model_name) and build() returning a Guardrails Hub NSFWText validator wired to those params and on_fail.
  • Documentation & Docker (backend/Dockerfile, backend/app/api/API_USAGE.md, backend/app/core/validators/README.md): Set HF_HOME and added HF model pre-download in the Dockerfile; documented nsfw_text in the API usage and validators READMEs.
  • Tests (backend/app/tests/test_guardrails_api_integration.py, backend/app/tests/test_toxicity_hub_validators.py): Added integration tests exercising nsfw_text alone and combined with other validators; added unit tests for NSFWTextSafetyValidatorConfig.build() and validation/edge cases.
  • Evaluation & Runner (backend/app/evaluation/toxicity/run.py, backend/scripts/run_all_evaluations.sh, backend/app/evaluation/README.md): New offline toxicity evaluation script (HASOC, ShareChat) running NSFW + other validators; added the runner to run_all_evaluations.sh and updated the README.
  • Dependencies (backend/pyproject.toml): Added transformers>=5.0.0 and torch>=2.0.0; configured uv to use the PyTorch CPU wheel index for torch.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant ValidatorConfig
    participant GuardrailsHub
    participant HFCache as "HF Cache\n(/app/hf_cache)"

    Client->>API: POST /validate (includes `nsfw_text` config)
    API->>ValidatorConfig: parse & resolve `NSFWTextSafetyValidatorConfig`
    ValidatorConfig->>GuardrailsHub: build() -> instantiate NSFWText(model_name, device, threshold, on_fail)
    GuardrailsHub->>HFCache: load model/tokenizer (cache_dir=/app/hf_cache)
    GuardrailsHub-->>API: validation result (pass/fail or fix)
    API-->>Client: response (success, data, errors)
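
As a rough client-side illustration of this flow, something like the following could exercise the nsfw_text path (the endpoint path and the payload/response field names are assumptions inferred from the tests discussed in this PR, not confirmed API shapes):

```python
# Hypothetical client call mirroring the sequence diagram; the URL path and
# payload/response field names are assumptions, not taken from the repository.
import requests

payload = {
    "request_id": "demo-001",
    "organization_id": "org-1",
    "project_id": "proj-1",
    "input": "text to screen for NSFW content",
    "validators": [
        {"type": "nsfw_text", "threshold": 0.8, "validation_method": "sentence", "on_fail": "exception"},
    ],
}

response = requests.post("http://localhost:8000/api/validate", json=payload, timeout=60)
body = response.json()
print(body["success"], body.get("errors"))
```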

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • nishika26
  • AkhileshNegi
  • dennyabrain

Poem

🐰 I hopped through docs and code so neat,
Cached models snug beneath my feet,
NSFW checks now guard the gate,
Tests and runners all in state,
A little hop for safer feeds. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 21.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title 'Toxicity evaluation' directly describes the main change: adding an offline evaluation pipeline for toxicity detection across three validators and two datasets.



coderabbitai (bot) left a comment


Actionable comments posted: 10

🧹 Nitpick comments (4)
backend/app/tests/test_guardrails_api_integration.py (2)

348-366: The low-threshold test doesn’t validate threshold behavior.

Using a clearly explicit sentence likely fails at both low and default/high thresholds, so this test may still pass if threshold handling is broken. Consider an A/B assertion with a borderline input across two thresholds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_guardrails_api_integration.py` around lines 348 - 366,
The test test_input_guardrails_with_nsfw_text_with_low_threshold uses an
obviously explicit sentence so it doesn't verify threshold behavior; replace the
input with a borderline phrase (e.g., mildly suggestive but not explicit) and
assert A/B behavior by posting the same input twice: once with a low threshold
(e.g., threshold=0.1) expecting success False (validator fails) and once with a
high threshold (e.g., threshold=0.9) expecting success True (validator passes);
refer to the VALIDATE_API_PATH call and the "validators" payload to implement
the two assertions within this test (or split into two tests) so threshold
sensitivity is validated.

331-383: Two NSFW exception tests overlap heavily.

test_input_guardrails_with_nsfw_text_on_explicit_content and test_input_guardrails_with_nsfw_text_exception_action currently validate almost the same path. Merging them would keep coverage while reducing integration test runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_guardrails_api_integration.py` around lines 331 - 383,
Two tests duplicate coverage:
test_input_guardrails_with_nsfw_text_on_explicit_content and
test_input_guardrails_with_nsfw_text_exception_action both post similar NSFW
inputs and assert failure; merge them by removing one and consolidating into a
single test (e.g., keep
test_input_guardrails_with_nsfw_text_on_explicit_content) that covers both input
variants or parameterize the test to run both payloads, ensuring you still call
VALIDATE_API_PATH with request_id/organization_id/project_id and the nsfw_text
validator (on_fail: "exception") and assert response.status_code == 200 and
body["success"] is False; update or remove the redundant test function to avoid
duplicate assertions and reduce runtime.
backend/app/tests/test_toxicity_hub_validators.py (1)

295-442: Consider parametrizing repeated NSFW config tests.

This block is clear, but a lot of cases follow the same patch/build/assert pattern. Consolidating with pytest.mark.parametrize would reduce repetition and future maintenance overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_toxicity_hub_validators.py` around lines 295 - 442,
Refactor the repeated patch/build/assert patterns in
TestNSFWTextSafetyValidatorConfig by consolidating similar test cases into
parametrized tests using pytest.mark.parametrize: identify groups like
build-with-defaults/custom/threshold/device/model/on_fail mappings (tests
test_build_with_defaults, test_build_with_custom_params,
test_build_with_threshold_at_zero, test_build_with_threshold_at_one,
test_build_with_device_none, test_build_with_model_name_none,
test_on_fail_fix_resolves_to_callable,
test_on_fail_exception_resolves_to_exception_action,
test_on_fail_rephrase_resolves_to_callable) and replace each group with a single
parametrized function that supplies (input-config-kwargs, expected kwargs or
callable/OnFailAction) and inside the test perform the same with
patch(_NSFW_PATCH) call to config.build() and assert mock_validator.call_args or
result; keep the standalone tests that exercise _on_fix behavior and
ValidationError cases (test_on_fix_sets_validator_metadata_when_fix_value_empty,
test_on_fix_does_not_set_metadata_when_fix_value_present,
test_invalid_on_fail_raises, test_wrong_type_literal_rejected,
test_extra_fields_rejected, test_threshold_must_be_numeric) unchanged.
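
A sketch of what that parametrization might look like (NSFWTextSafetyValidatorConfig and _NSFW_PATCH are the names referenced above; the expected kwargs are illustrative, not taken from the actual tests):

```python
# Illustrative parametrization of the repeated patch/build/assert pattern.
import pytest
from unittest.mock import patch

@pytest.mark.parametrize(
    ("config_kwargs", "expected"),
    [
        ({}, {"threshold": 0.8, "validation_method": "sentence"}),
        ({"threshold": 0.0}, {"threshold": 0.0}),
        ({"threshold": 1.0, "device": "cpu"}, {"threshold": 1.0, "device": "cpu"}),
    ],
)
def test_build_passes_expected_kwargs(config_kwargs, expected):
    config = NSFWTextSafetyValidatorConfig(type="nsfw_text", **config_kwargs)
    with patch(_NSFW_PATCH) as mock_validator:
        config.build()
    for key, value in expected.items():
        assert mock_validator.call_args.kwargs[key] == value
```
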
backend/app/core/validators/config/nsfw_text_safety_validator_config.py (1)

11-11: Constrain validation_method to known values.

Using plain str here allows typos to pass schema validation and fail later at runtime. Restrict this field to explicit literals used by the validator interface.

Proposed fix
-    validation_method: str = "sentence"
+    validation_method: Literal["sentence", "full"] = "sentence"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/config/nsfw_text_safety_validator_config.py` at
line 11, Change the validation_method field from a bare str to a Literal of the
allowed values to prevent typos: update validation_method: str = "sentence" to
validation_method: Literal["sentence", "document"] = "sentence" (or the exact
literals used by the validator interface), and add the necessary import for
Literal from typing (or typing_extensions if supporting older Python versions);
ensure the class in nsfw_text_safety_validator_config.py uses the Literal type
so schema/type checkers enforce the allowed values.
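
Putting this nitpick together with the threshold constraint suggested in the inline comments below, the config might look roughly like this (a sketch assuming Pydantic v2; the second Literal value, "full" versus "document", depends on what the Hub validator actually accepts):

```python
# Sketch of the constrained config; field names follow the PR description, but
# the exact Literal values and defaults should be checked against the real validator.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class NSFWTextSafetyValidatorConfigSketch(BaseModel):
    type: Literal["nsfw_text"] = "nsfw_text"
    threshold: float = Field(default=0.8, ge=0.0, le=1.0)
    validation_method: Literal["sentence", "full"] = "sentence"
    device: Optional[str] = None
    model_name: Optional[str] = None
    on_fail: str = "exception"
```
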
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/API_USAGE.md`:
- Around line 100-104: The documentation contains a duplicated "type=" filter
line in API_USAGE.md (the two similar bullet lines listing allowed type values);
remove the stale duplicate so only one definitive "type=..." entry remains (keep
the version that includes nsfw_text if that is the intended supported value),
ensuring the Optional filters section lists each query param exactly once and
update any surrounding punctuation/formatting to remain consistent.

In `@backend/app/core/validators/config/nsfw_text_safety_validator_config.py`:
- Line 10: Add a schema-level constraint to the threshold field on
NSFWTextSafetyValidatorConfig so invalid floats outside [0.0, 1.0] fail
validation; replace the loose float declaration for threshold with a Pydantic
Field that enforces ge=0.0 and le=1.0 (keeping the default 0.8) and import Field
from pydantic if not already present.

In `@backend/app/core/validators/README.md`:
- Around line 430-434: Update the "Default stage strategy" section so it mirrors
the earlier recommendation by adding `nsfw_text` to both the `input` and
`output` lists; ensure the operational summary later in the README also includes
`nsfw_text` for input and output and that the wording/justification matches the
earlier "input" and "output" recommendation for consistency across the document.

In `@backend/app/evaluation/toxicity/run.py`:
- Around line 32-41: The VALIDATORS dict currently always instantiates
LlamaGuard7B which requires remote inference; change the registration so remote
validators are only added when a runtime flag is enabled (e.g.,
enable_remote_validators or USE_REMOTE_VALIDATORS env var) or provide a separate
OFFLINE_VALIDATORS mapping; specifically, update the code that builds VALIDATORS
to conditionally include the "llamaguard_7b" entry (the LlamaGuard7B(...)
lambda) based on that flag, or move LlamaGuard7B into a distinct
REMOTE_VALIDATORS collection and merge it into VALIDATORS only when the flag is
true so the runner can operate fully offline without creating LlamaGuard7B.
- Around line 94-98: The module currently executes the evaluation loop at import
time (iterating DATASETS and calling run_dataset, then printing OUT_DIR); wrap
that logic in a main guard so it only runs when executed as a script. Move the
for dataset_name, dataset_cfg in DATASETS: loop and the final print("Done.
Results saved to", OUT_DIR) into an if __name__ == "__main__": block (keeping
references to DATASETS, run_dataset, and OUT_DIR intact) so importing the module
won't trigger dataset reads or model loading.
- Around line 63-69: The current use of .astype(str) turns NaN into the literal
"nan" and causes validator.validate (called via p.record and assigned to
df[f"{validator_name}_result"]) to score missing text; change the preprocessing
to either skip null rows or replace missing values with an empty string before
validation (e.g., operate on df[text_col].fillna("").astype(str) or filter
df[text_col].notna()), then call p.record(lambda t: validator.validate(t,
metadata={}), <cleaned text>) so missing inputs are not treated as real samples
and latency/statistics aren't skewed.
- Around line 51-54: After calling df["y_true"] = df[label_col].map(label_map)
when label_map is not None, immediately validate for unmapped/NaN values and
raise an error listing the unexpected original labels instead of proceeding;
e.g., check df["y_true"].isna(), compute unexpected =
df.loc[df["y_true"].isna(), label_col].unique(), and raise ValueError with those
values (keep the existing astype(int) branch unchanged when label_map is None).

In `@backend/app/tests/test_guardrails_api_integration.py`:
- Around line 331-346: The test
test_input_guardrails_with_nsfw_text_on_explicit_content only asserts success is
False and can pass for infrastructure/errors rather than actual NSFW detection;
update the test (and the similar test around lines 368-383) to include a
positive control that should pass NSFW validation (e.g., a benign clean input)
and assert it returns success is True, and for the explicit-content case assert
the failure payload structure/content (inspect body["errors"] or body["reason"]
depending on how VALIDATE_API_PATH returns validator failures) to verify the
nsfw_text validator actually flagged the input; use integration_client and the
same request payload shape but change "input" and check specific fields
(validator type "nsfw_text", on_fail outcome, and expected failure
message/shape) instead of only asserting success is False.

In `@backend/Dockerfile`:
- Around line 53-56: Update the Dockerfile pre-download step to pin the Hugging
Face model by adding a fixed revision SHA to both
AutoTokenizer.from_pretrained(...) and
AutoModelForSequenceClassification.from_pretrained(...) calls; then add a
model_revision field to the NSFWTextSafetyValidatorConfig and pass that value
through when constructing the NSFWText validator so the validator defaults use
the same pinned revision SHA; ensure the same revision string is used in the
Dockerfile prefetch and the NSFWTextSafetyValidatorConfig.model_revision to
guarantee reproducible builds and evaluations.

In `@backend/pyproject.toml`:
- Around line 50-58: The uv configuration in [tool.uv.sources] currently pins
the torch package to the pytorch-cpu index (name "pytorch-cpu" / url
"https://download.pytorch.org/whl/cpu" with explicit = true), which prevents
installing CUDA-enabled wheels and conflicts with the NSFW validator's
documented device="cuda" option; fix this by either removing or making
non-explicit the torch entry so normal indices can resolve CUDA wheels, or
introduce separate dependency profiles (e.g., "pytorch-cpu" and "pytorch-cuda")
and update documentation and the NSFW validator docs/devices accordingly so
device="cuda" is only advertised when the CUDA profile is used (reference the
[tool.uv.sources] / [[tool.uv.index]] entries and the NSFW validator
device="cuda" documentation).
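
Taken together, the run.py suggestions above might look roughly like this (a sketch: the validator imports, on_fail values, and the DATASETS/run_dataset/OUT_DIR shapes are assumptions based on the descriptions in these comments, not the actual script):

```python
# Sketch combining the env-gated remote validator, label-map validation, and
# main-guard suggestions above; DATASETS, run_dataset, and OUT_DIR mirror the
# names described in the review comments and are defined elsewhere in run.py.
import os
from guardrails.hub import NSFWText, ProfanityFree, LlamaGuard7B  # assumed import path

USE_REMOTE_VALIDATORS = os.getenv("USE_REMOTE_VALIDATORS", "0") == "1"

VALIDATORS = {
    "nsfw_text": lambda: NSFWText(threshold=0.8, on_fail="noop"),
    "profanity_free": lambda: ProfanityFree(on_fail="noop"),
}
if USE_REMOTE_VALIDATORS:
    # LlamaGuard7B requires remote inference, so register it only when asked for.
    VALIDATORS["llamaguard_7b"] = lambda: LlamaGuard7B(on_fail="noop")

def map_labels(df, label_col, label_map):
    """Map raw labels to 0/1 and fail loudly on anything unexpected."""
    y_true = df[label_col].map(label_map)
    if y_true.isna().any():
        unexpected = df.loc[y_true.isna(), label_col].unique()
        raise ValueError(f"Unexpected labels in {label_col}: {list(unexpected)}")
    return y_true.astype(int)

if __name__ == "__main__":
    # Dataset iteration only happens when executed as a script, not on import.
    for dataset_name, dataset_cfg in DATASETS:
        run_dataset(dataset_name, dataset_cfg)
    print("Done. Results saved to", OUT_DIR)
```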


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5638df70-fc69-4bc5-a07e-10e516566e61

📥 Commits

Reviewing files that changed from the base of the PR and between e470a13 and edcdd26.

📒 Files selected for processing (11)
  • backend/Dockerfile
  • backend/app/api/API_USAGE.md
  • backend/app/core/enum.py
  • backend/app/core/validators/README.md
  • backend/app/core/validators/config/nsfw_text_safety_validator_config.py
  • backend/app/core/validators/validators.json
  • backend/app/evaluation/toxicity/run.py
  • backend/app/schemas/guardrail_config.py
  • backend/app/tests/test_guardrails_api_integration.py
  • backend/app/tests/test_toxicity_hub_validators.py
  • backend/pyproject.toml


coderabbitai (bot) left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/README.md`:
- Around line 290-295: The fenced code block containing the four output paths
(lines showing outputs/toxicity/predictions_hasoc.csv,
outputs/toxicity/metrics_hasoc.json, outputs/toxicity/predictions_sharechat.csv,
outputs/toxicity/metrics_sharechat.json) in README.md should include a language
identifier to satisfy MD040; update the opening fence from ``` to ```text so the
block becomes a ```text fenced block; keep the same contents and closing fence
unchanged.
- Around line 288-295: Update the documented example output paths under the
"**Output per dataset:**" section so they match the folder-tree layout (use
per-dataset subfolders), i.e., change the flat paths like
`outputs/toxicity/predictions_hasoc.csv` and
`outputs/toxicity/metrics_sharechat.json` to the nested forms
`outputs/toxicity/hasoc/predictions.csv`, `outputs/toxicity/hasoc/metrics.json`,
`outputs/toxicity/sharechat/predictions.csv`, and
`outputs/toxicity/sharechat/metrics.json` so the README's example paths align
with the folder structure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b240205-a6e8-4279-a05f-da2c128ffb09

📥 Commits

Reviewing files that changed from the base of the PR and between edcdd26 and a8b9818.

📒 Files selected for processing (2)
  • backend/app/evaluation/README.md
  • backend/scripts/run_all_evaluations.sh
✅ Files skipped from review due to trivial changes (1)
  • backend/scripts/run_all_evaluations.sh

rkritika1508 self-assigned this on Apr 23, 2026
rkritika1508 added the enhancement (New feature or request) and ready-for-review labels on Apr 23, 2026
nishika26 removed this from Kaapi-dev on Apr 23, 2026