Changes from all commits
47 commits
650369c
added toxicity detection validators
rkritika1508 Apr 1, 2026
949647d
fixed import error
rkritika1508 Apr 1, 2026
da50537
removed redundant validators
rkritika1508 Apr 2, 2026
9ab64c7
Added NSFW text validator
rkritika1508 Apr 2, 2026
b64d0e9
fixed test
rkritika1508 Apr 2, 2026
57d97b2
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 2, 2026
09b6a05
fix: profanity free validator description
dennyabrain Apr 6, 2026
f4a11fa
doc: updated details of sentence parameter
dennyabrain Apr 7, 2026
f330f1b
fix: remove vscode files
dennyabrain Apr 7, 2026
51c9266
Added integration tests
rkritika1508 Apr 7, 2026
141e5fc
Merge branch 'main' into feat/toxicity-hub-validators
rkritika1508 Apr 7, 2026
c76f829
added integration tests
rkritika1508 Apr 7, 2026
baac9e4
fix: profanity free validator description
dennyabrain Apr 6, 2026
627fb4f
Added integration tests
rkritika1508 Apr 7, 2026
8b3da89
validator config: add name to config (#79)
nishika26 Apr 7, 2026
cc0bb14
added integration tests
rkritika1508 Apr 7, 2026
3037eb8
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 7, 2026
b69883d
added integration tests
rkritika1508 Apr 7, 2026
8f67176
updated readme
rkritika1508 Apr 7, 2026
affe72d
Added installation of huggingface model in dockerfile
rkritika1508 Apr 7, 2026
8b0a183
resolved comment
rkritika1508 Apr 7, 2026
14f6dc1
removed blank line
rkritika1508 Apr 7, 2026
74f8a82
updated policies for llama guard
rkritika1508 Apr 7, 2026
6676414
fixed tests
rkritika1508 Apr 7, 2026
0d15d0c
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 7, 2026
6443c1b
updated readme and fixed llama guard inference
rkritika1508 Apr 8, 2026
af933ef
fixed test
rkritika1508 Apr 8, 2026
9b6616a
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 9, 2026
9aca5f2
Merge branch 'main' into feat/toxicity-hub-validators
rkritika1508 Apr 10, 2026
664ded8
resolved comments
rkritika1508 Apr 10, 2026
0ce6ebb
Added evaluation readme (#82)
rkritika1508 Apr 10, 2026
ba27b80
resolved comments
rkritika1508 Apr 10, 2026
d7c5eba
resolved comments
rkritika1508 Apr 10, 2026
02fd043
fixed llama guard
rkritika1508 Apr 10, 2026
d9569ba
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 10, 2026
31af2f6
Toxicity Detection validators (#80)
rkritika1508 Apr 10, 2026
a061af8
Merge branch 'main' into feat/toxicity-huggingface-model
rkritika1508 Apr 10, 2026
88c1b56
removed unnecessary changes
rkritika1508 Apr 10, 2026
5b2fe3b
fix: update default nsfw_text model to michellejieli/NSFW_text_classi…
rkritika1508 Apr 10, 2026
fd3cddc
fix: use textdetox/xlmr-large-toxicity-classifier as default nsfw_tex…
rkritika1508 Apr 10, 2026
7264771
updated readme
rkritika1508 Apr 10, 2026
217ba9b
Merge branch 'main' into feat/toxicity-huggingface-model
nishika26 Apr 13, 2026
edcdd26
added toxicity evaluation
rkritika1508 Apr 13, 2026
28befd0
updated
rkritika1508 Apr 13, 2026
a8b9818
Merge branch 'main' into feat/toxicity-evaluation
rkritika1508 Apr 16, 2026
e64ad88
Merge branch 'main' into feat/toxicity-evaluation
rkritika1508 Apr 22, 2026
69fb065
Merge branch 'main' into feat/toxicity-evaluation
rkritika1508 Apr 23, 2026
49 changes: 47 additions & 2 deletions backend/app/evaluation/README.md
@@ -48,7 +48,8 @@ backend/app/evaluation/
│   └── run.py                  # PII evaluation script
├── topic_relevance/
│   └── run.py                  # Topic relevance evaluation script
└── toxicity/                   # Toxicity evaluation scripts
└── toxicity/
    └── run.py                  # Toxicity evaluation script (LlamaGuard7B, NSFWText, ProfanityFree)
```

## Prerequisites
@@ -91,7 +92,7 @@ Validators that use LLM-as-judge approach will require credentials for LLM provi

## Running All Evaluations

To run all individual validator evaluations in sequence (lexical slur, PII, gender assumption bias, ban list, topic relevance):
To run all individual validator evaluations in sequence (lexical slur, PII, gender assumption bias, ban list, topic relevance, toxicity):

```bash
bash scripts/run_all_evaluations.sh
@@ -262,6 +263,49 @@ python3 app/evaluation/topic_relevance/run.py

---

### Toxicity (`llamaguard_7b`, `nsfw_text`, `profanity_free`)

**Script:** `app/evaluation/toxicity/run.py`

**Datasets:**
- `datasets/toxicity/toxicity_test_hasoc.csv`
- `datasets/toxicity/toxicity_test_sharechat.csv`

Expected columns — HASOC dataset:

- `text` — tweet/comment text to validate
- `task1` — ground truth label (`HOF` = hate/offensive/profanity → `1`, `NOT` → `0`)
- `lang` — language code (informational)

Expected columns — ShareChat dataset:

- `commentText` — comment text to validate
- `label` — binary ground truth (`1` = toxic, `0` = not toxic)
- `language` — language label (informational)
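
Both label schemes are reduced to the same binary `y_true` before scoring. A minimal sketch of that normalization, mirroring what the evaluation script (shown later in this diff) does; the example rows here are hypothetical:

```python
import pandas as pd

# Hypothetical single-row examples of the two datasets.
hasoc = pd.DataFrame({"text": ["<tweet text>"], "task1": ["HOF"], "lang": ["hi"]})
sharechat = pd.DataFrame({"commentText": ["<comment text>"], "label": [1], "language": ["Hindi"]})

# HASOC labels are mapped to binary; ShareChat labels are already 0/1.
hasoc["y_true"] = hasoc["task1"].map({"HOF": 1, "NOT": 0})
sharechat["y_true"] = sharechat["label"].astype(int)
```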

**What it does:** Runs three validators — `LlamaGuard7B`, `NSFWText`, and `ProfanityFree` — independently on each dataset. For each validator, a binary prediction is recorded (`1` if it returns a `FailResult`, `0` otherwise) and compared against the ground-truth label to compute classification metrics.
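
A rough sketch of how one prediction is derived, using the same `guardrails` calls as the script below (the sample text is a placeholder):

```python
from guardrails.hub import ProfanityFree
from guardrails.validators import FailResult

# Validators are called directly on the raw text; a FailResult counts as a
# positive (toxic) prediction.
validator = ProfanityFree(on_fail="noop")
result = validator.validate("some user comment", metadata={})
prediction = int(isinstance(result, FailResult))  # 1 = flagged, 0 = passed
```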

**Output per dataset:**

```
outputs/toxicity/predictions_hasoc.csv
outputs/toxicity/metrics_hasoc.json
outputs/toxicity/predictions_sharechat.csv
outputs/toxicity/metrics_sharechat.json
```

Each predictions CSV contains the source text, ground truth (`y_true`), and one `*_pred` column per validator. Each metrics JSON contains accuracy, precision, recall, F1, and performance stats broken down per validator.
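
For the HASOC split, for example, the predictions file is expected to have this column layout (derived from how the script writes its output; one `*_pred` column per validator):

```
text,y_true,llamaguard_7b_pred,nsfw_text_pred,profanity_free_pred
```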

**Run:**

```bash
python3 app/evaluation/toxicity/run.py
```

> **Note:** `LlamaGuard7B` uses remote inference, so it requires a valid `GUARDRAILS_HUB_API_KEY` and internet access. `NSFWText` downloads the `textdetox/xlmr-large-toxicity-classifier` model on first run.

---

## Multiple Validators Evaluation (End-to-End)

This evaluation runs multiple validators **together** against a dataset via the live guardrails API. Unlike the individual evaluations above, this is an **end-to-end integration test** — it hits the API rather than calling validators directly.
@@ -393,6 +437,7 @@ Each evaluation script expects a specific filename — files must be named exact
| Ban List | `ban_list_testing_dataset.csv` |
| Multiple Validators | `multi_validator_whatsapp_dataset.csv` |
| Topic Relevance | `topic_relevance/education-topic-relevance-dataset.csv`, `topic_relevance/healthcare-topic-relevance-dataset.csv` |
| Toxicity | `toxicity/toxicity_test_hasoc.csv`, `toxicity/toxicity_test_sharechat.csv` |

Topic relevance also requires plain-text topic config files alongside each dataset:

98 changes: 98 additions & 0 deletions backend/app/evaluation/toxicity/run.py
@@ -0,0 +1,98 @@
from pathlib import Path
import pandas as pd
from guardrails.hub import LlamaGuard7B, NSFWText, ProfanityFree
from guardrails.validators import FailResult

from app.evaluation.common.helper import (
    build_evaluation_report,
    compute_binary_metrics,
    Profiler,
    write_csv,
    write_json,
)

BASE_DIR = Path(__file__).resolve().parent.parent
OUT_DIR = BASE_DIR / "outputs" / "toxicity"

DATASETS = {
    "hasoc": {
        "path": BASE_DIR / "datasets" / "toxicity" / "toxicity_test_hasoc.csv",
        "text_col": "text",
        "label_col": "task1",
        "label_map": {"HOF": 1, "NOT": 0},
    },
    "sharechat": {
        "path": BASE_DIR / "datasets" / "toxicity" / "toxicity_test_sharechat.csv",
        "text_col": "commentText",
        "label_col": "label",
        "label_map": None,  # already binary int
    },
}

VALIDATORS = {
    "llamaguard_7b": lambda: LlamaGuard7B(on_fail="noop"),
    "nsfw_text": lambda: NSFWText(
        threshold=0.8,
        validation_method="sentence",
        device="cpu",
        model_name="textdetox/xlmr-large-toxicity-classifier",
        on_fail="noop",
    ),
    "profanity_free": lambda: ProfanityFree(on_fail="noop"),
}


def run_dataset(dataset_name: str, dataset_cfg: dict):
    df = pd.read_csv(dataset_cfg["path"])
    text_col = dataset_cfg["text_col"]
    label_col = dataset_cfg["label_col"]
    label_map = dataset_cfg["label_map"]

    # Normalize ground truth to a binary y_true column
    # (HASOC: HOF -> 1, NOT -> 0; ShareChat labels are already 0/1).
    if label_map is not None:
        df["y_true"] = df[label_col].map(label_map)
    else:
        df["y_true"] = df[label_col].astype(int)

    all_metrics = {}

    for validator_name, build_fn in VALIDATORS.items():
        print(f" Running {validator_name} on {dataset_name}...")
        validator = build_fn()

        # Run every validate() call through the shared Profiler so the report
        # can include performance stats alongside the predictions.
        with Profiler() as p:
            df[f"{validator_name}_result"] = (
                df[text_col]
                .astype(str)
                .apply(
                    lambda x: p.record(lambda t: validator.validate(t, metadata={}), x)
                )
            )

        # A FailResult means the validator flagged the text, recorded as a 1 (toxic) prediction.
        df[f"{validator_name}_pred"] = df[f"{validator_name}_result"].apply(
            lambda r: int(isinstance(r, FailResult))
        )

        metrics = compute_binary_metrics(df["y_true"], df[f"{validator_name}_pred"])
        all_metrics[validator_name] = build_evaluation_report(
            guardrail=validator_name,
            dataset=dataset_name,
            num_samples=len(df),
            profiler=p,
            metrics=metrics,
        )

        df = df.drop(columns=[f"{validator_name}_result"])

    pred_cols = ["y_true"] + [f"{v}_pred" for v in VALIDATORS]
    write_csv(
        df[[text_col, *pred_cols]],
        OUT_DIR / f"predictions_{dataset_name}.csv",
    )
    write_json(all_metrics, OUT_DIR / f"metrics_{dataset_name}.json")


# Evaluate every configured dataset with all three validators.
for dataset_name, dataset_cfg in DATASETS.items():
    print(f"Evaluating dataset: {dataset_name}")
    run_dataset(dataset_name, dataset_cfg)

print("Done. Results saved to", OUT_DIR)
1 change: 1 addition & 0 deletions backend/scripts/run_all_evaluations.sh
@@ -11,6 +11,7 @@ RUNNERS=(
"$EVAL_DIR/gender_assumption_bias/run.py"
"$EVAL_DIR/ban_list/run.py"
"$EVAL_DIR/topic_relevance/run.py"
"$EVAL_DIR/toxicity/run.py"
)

echo "Running validator evaluations..."