Classifies domains A–F based on DNS records, WHOIS registration data, and website content analysis. Designed for B2B data quality workflows where you need to quickly separate active business domains from parked pages, redirects, and junk.
Requirements: Python 3.9+
These helper scripts are for macOS / Linux shells.
Bootstrap the virtual environment, upgrade pip, install dependencies, and install Playwright Chromium:
./scripts/bootstrap.sh
If you want custom runtime settings or AI comparison, create .env before starting the API:
cp .env.example .env
Launch the FastAPI server:
./scripts/run-api.sh
The API will be available at http://localhost:8000/docs.
To override the bind host or port:
HOST=127.0.0.1 PORT=9000 ./scripts/run-api.sh

./scripts/bootstrap.sh: Creates .venv if needed, activates it, upgrades pip, installs the local sibling repos ../content-analyzer, ../ml-classifier, and ../httpfetch[full], installs this repo, and runs playwright install chromium.
./scripts/run-api.sh: Activates .venv and starts uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload.
bootstrap.sh expects these sibling repos to exist next to this project:
../content-analyzer
../ml-classifier
../httpfetch
If they live elsewhere, you can override their paths when running bootstrap:
CONTENT_ANALYZER_DIR=/path/to/content-analyzer \
ML_CLASSIFIER_DIR=/path/to/ml-classifier \
HTTPFETCH_DIR=/path/to/httpfetch \
./scripts/bootstrap.sh

macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e ../content-analyzer
pip install -e ../ml-classifier
pip install -e "../httpfetch[full]"
pip install -e ".[dev]"
playwright install chromium
uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

Windows PowerShell:
py -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e ..\content-analyzer
pip install -e ..\ml-classifier
pip install -e "..\httpfetch[full]"
pip install -e ".[dev]"
playwright install chromium
uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

| Grade | Classification | Meaning |
|---|---|---|
| A | business (score > 0.6) or government | Confirmed business or institutional domain; high confidence |
| B | business (score ≤ 0.6) or DNS-upgraded | Likely business; moderate confidence, or strong DNS signals behind a block |
| C | inactive, parked, ads, adult, gambling, personal, under_construction, gated, blocked, timeout, error | Non-business or unclassifiable content |
| D | redirect | Domain redirects to a different domain |
| E | ns_only | Nameserver exists but no A record, so no website |
| F | unregistered | No nameserver; domain is not registered |

| Value | Description |
|---|---|
| business | Website content indicates an active business |
| government | Institutional TLD (.gov, .mil, .edu, .gov.uk, .gc.ca, etc.); pre-pipeline fast path |
| inactive | Page loaded but has no meaningful content, or served inside a frameset |
| parked | Parked domain: ad lander, for-sale page, or content farm redirect |
| ads | Ad-only page with no substantive content |
| under_construction | Explicit "coming soon" / under construction page |
| gated | Login or CAPTCHA wall; no indexable content visible |
| blocked | WAF / Cloudflare challenge page |
| adult | Adult content |
| gambling | Gambling content |
| personal | Personal site, blog, or portfolio with no business signals |
| redirect | Redirects to a different domain |
| ns_only | DNS nameserver exists, no web server |
| unregistered | No DNS records |
| timeout | No response within time limit |
| error | Pipeline or HTTP error |

Grade B upgrades — a domain that would otherwise receive C is promoted to B when strong DNS signals indicate a real organisation behind a block or blank page (enterprise MX provider, strict DMARC policy, Microsoft 365 tenant verification), or when the DBI API confirms a real web presence via domainrank.
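
The grade table and the B-upgrade rule can be pictured as a small mapping function. This is a minimal sketch rather than the project's actual code: `assign_grade`, the `upgraded` flag, and `C_CLASSIFICATIONS` are hypothetical names, and the 0.6 threshold is taken from the grade table.

```python
# Hypothetical sketch of the grade table; all names here are illustrative only.
C_CLASSIFICATIONS = {
    "inactive", "parked", "ads", "adult", "gambling", "personal",
    "under_construction", "gated", "blocked", "timeout", "error",
}


def assign_grade(classification: str, score: float = 1.0, upgraded: bool = False) -> str:
    """Map a classification (plus score and upgrade signal) to a grade A-F."""
    if classification == "government":
        return "A"
    if classification == "business":
        return "A" if score > 0.6 else "B"
    if classification in C_CLASSIFICATIONS:
        # Assumption: `upgraded` stands in for the DNS/DBI signals described above.
        return "B" if upgraded else "C"
    if classification == "redirect":
        return "D"
    if classification == "ns_only":
        return "E"
    if classification == "unregistered":
        return "F"
    raise ValueError(f"unknown classification: {classification!r}")


print(assign_grade("business", 0.923))         # A
print(assign_grade("business", 0.541))         # B
print(assign_grade("blocked", upgraded=True))  # B
print(assign_grade("parked"))                  # C
```

The sketch treats the upgrade as a single boolean; in the real pipeline that signal would come from the MX, DMARC, Microsoft 365, or DBI checks.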
Language is reported as a separate language field (ISO 639-1 code, e.g. de, zh, fr) and does not affect the classification or grade.
Use the helper script for the standard local setup:
./scripts/bootstrap.sh
Or follow the manual commands shown in Quick Start.
If you want AI comparison or custom runtime settings, copy the example file first:
cp .env.example .env
Then edit .env and set the values you need, for example DC_AI_API_KEY.
The main CLI entrypoint in this repo is:
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify --help
To run a classification job:
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --help
This avoids accidentally picking up a different domain-classify on your PATH.
If you are already in the repo root, the equivalent shorter form is:
.venv/bin/domain-classify --help
To start the API with the helper script:
./scripts/run-api.sh

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --timeout 45 --workers 10 --verbose
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt

domains.txt contains one domain per line; lines starting with # are ignored:
# Known businesses
stripe.com
dnb.com
salesforce.com
# Suspected junk
17work.cn
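
The input format above is simple enough to sketch in a few lines. This is an illustrative reader, not the CLI's own parser; `parse_domain_list` is a hypothetical name, and skipping blank lines is an assumption (only `#` lines are documented as ignored).

```python
def parse_domain_list(text: str) -> list[str]:
    """Return domains from a one-per-line listing, skipping '#' comment lines."""
    domains = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines are skipped as well as '#' comment lines.
        if not line or line.startswith("#"):
            continue
        domains.append(line)
    return domains


sample = """# Known businesses
stripe.com
dnb.com
salesforce.com
# Suspected junk
17work.cn
"""
print(parse_domain_list(sample))
# ['stripe.com', 'dnb.com', 'salesforce.com', '17work.cn']
```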
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --output results.csv
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --format json
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --no-cache
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --proxy http://user:pass@host:port
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --log-file classifier.log --verbose

Use --delimiter and --field to parse structured input files (CSV, TSV, pipe-delimited, etc.). The classifier appends its output columns to each original row and writes a <stem>_output<ext> file automatically.
# CSV with a header row — domain is in field 1
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input leads.csv --delimiter , --field 1
# Tab-delimited export — domain is in field 3
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input export.tsv --delimiter $'\t' --field 3
# Pipe-delimited, no header
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input data.txt --delimiter '|' --field 2

Usage: /Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify [OPTIONS] [DOMAIN]
Arguments:
  DOMAIN                 Single domain to classify [optional]
Options:
  -i, --input PATH       File with one domain per line
  -o, --output PATH      Output CSV/file path
      --format TEXT      Output format: table|json|csv [default: table]
  -d, --delimiter TEXT   Field delimiter for structured input (e.g. ',' or '\t')
  -f, --field INT        1-based index of the domain field in delimited input
  -w, --workers INT      Concurrent domains [default: 5]
      --no-cache         Disable SQLite result cache
  -t, --timeout INT      Per-domain timeout (seconds) [default: 30]
  -v, --verbose          Enable debug logging to stderr
      --log-file PATH    Write logs to file (also prints to stderr when --verbose)
      --ai-compare       Compare static results with Claude AI classifier
      --ai-all           Run AI comparison on every domain (default: low-confidence only)
      --proxy TEXT       Proxy URL for all HTTP/Playwright fetches
      --help             Show this message and exit
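
To make the `--delimiter` / `--field` behaviour concrete, here is a minimal sketch of pulling a 1-based field out of delimited rows and deriving the `<stem>_output<ext>` file name. `extract_domains` and `output_path` are hypothetical helpers, not functions from this repo, and the sketch does no header detection.

```python
import csv
import io
from pathlib import Path


def extract_domains(text: str, delimiter: str, field: int) -> list[str]:
    """Pull the 1-based `field` column out of each delimited row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Rows that are too short to contain the field are skipped.
    return [row[field - 1].strip() for row in reader if len(row) >= field]


def output_path(input_path: str) -> Path:
    """Build the <stem>_output<ext> name next to the input file."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_output{p.suffix}")


rows = "1|stripe.com|US\n2|dnb.com|US\n"
print(extract_domains(rows, "|", 2))  # ['stripe.com', 'dnb.com']
print(output_path("leads.csv"))       # leads_output.csv
```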
The CLI also includes a comparison command for before/after result files:
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare before.csv after.csv
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare before.csv after.csv --changed-only

compare options:
Usage: /Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare [OPTIONS] BEFORE AFTER
Arguments:
  BEFORE                 CSV from an earlier run
  AFTER                  CSV from the more recent run
Options:
  -c, --changed-only     Only show domains where grade or classification changed
      --help             Show this message and exit
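
The idea behind `compare --changed-only` can be sketched with the standard library. This is an illustration only, assuming both files carry the `domain`, `grade`, and `classification` columns; `load_results` and `changed_domains` are hypothetical names, and the real command's output format may differ.

```python
import csv
import io


def load_results(csv_text: str) -> dict[str, tuple[str, str]]:
    """Index a results CSV by domain -> (grade, classification)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["domain"]: (row["grade"], row["classification"]) for row in reader}


def changed_domains(before_csv: str, after_csv: str) -> dict[str, tuple]:
    """Return domains present in both runs whose grade or classification differs."""
    before, after = load_results(before_csv), load_results(after_csv)
    return {
        d: (before[d], after[d])
        for d in before.keys() & after.keys()
        if before[d] != after[d]
    }


before = "domain,grade,classification\nstripe.com,A,business\nold-co.com,C,blocked\n"
after = "domain,grade,classification\nstripe.com,A,business\nold-co.com,D,redirect\n"
print(changed_domains(before, after))
# {'old-co.com': (('C', 'blocked'), ('D', 'redirect'))}
```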
Domain Classification Results
┌──────────────────┬───────┬──────────────────┬──────┬───────┬──────────┐
│ Domain           │ Grade │ Classification   │ Lang │ Score │ Time (ms)│
├──────────────────┼───────┼──────────────────┼──────┼───────┼──────────┤
│ stripe.com       │ A     │ business         │ en   │ 0.923 │ 4821     │
│ salesforce.com   │ A     │ business         │ en   │ 0.881 │ 3204     │
│ cdc.gov          │ A     │ government       │ en   │ 1.000 │ 12       │
│ example.de       │ B     │ business         │ de   │ 0.541 │ 2109     │
│ 17work.cn        │ C     │ timeout          │      │ 1.000 │ 30012    │
│ parked.biz       │ C     │ parked           │ en   │ 1.000 │ 1832     │
│ old-co.com       │ D     │ redirect         │ en   │ 1.000 │ 521      │
│ empty.io         │ E     │ ns_only          │      │ 1.000 │ 412      │
│ ghost-xyz.net    │ F     │ unregistered     │      │ 1.000 │ 203      │
└──────────────────┴───────┴──────────────────┴──────┴───────┴──────────┘
domain, grade, classification, language, classification_score,
has_dns_a_record, is_registered, is_reachable, final_url, redirects_to,
copyright_found, has_phone, has_email, domain_age_days, registrar,
ssl_cert_error, processing_time_ms, error,
ai_classification, ai_confidence, ai_match, ai_reasoning,
dbi_company_name, dbi_domainrank, dbi_classification, dbi_grade, dbi_category
The classifier is also available as a FastAPI server.
uvicorn domain_classifier.api:app --reload
API docs available at http://localhost:8000/docs.
To bind a different host/port:
uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

# Single domain
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"domain": "stripe.com"}'
# Batch (≤10 domains — synchronous response)
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"domains": ["stripe.com", "parked.biz", "ghost-xyz.net"]}'
# Batch (>10 domains — returns a job_id)
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"domains": ["a.com", "b.com", "..."]}'
# Poll job status
curl http://localhost:8000/result/<job_id>

# Health check
curl http://localhost:8000/health
# View saved labels
curl http://localhost:8000/labels
# Check retrain status
curl http://localhost:8000/retrain/status
# Feature importance from the trained model
curl http://localhost:8000/feature-importance

The classifier ships with a heuristic scorer that works without any training data. To improve accuracy for your specific domain set, train an SGD model on labelled examples.
python -m domain_classifier.ml.train \
--data labelled_domains.csv \
--output domain_classifier/ml/models/classifier.joblib

labelled_domains.csv format:
domain,classification
stripe.com,business
parked-example.com,parked
under-construction.net,under_construction

The browser extension (see chrome_extension/README.md) provides a labelling loop (browse → classify → label → retrain) without leaving Chrome.
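
Before training, it can help to sanity-check the labelled file. The sketch below is a hypothetical pre-flight check, not part of `domain_classifier.ml.train`: `find_bad_rows` and `KNOWN_LABELS` are illustrative names, and the label set is simply the classification values listed in this README (whether the trainer accepts every one of them is an assumption).

```python
import csv
import io

# Classification values listed in this README (assumed to be valid labels).
KNOWN_LABELS = {
    "business", "government", "inactive", "parked", "ads", "under_construction",
    "gated", "blocked", "adult", "gambling", "personal", "redirect",
    "ns_only", "unregistered", "timeout", "error",
}


def find_bad_rows(csv_text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) for rows that don't fit the expected format."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        domain = (row.get("domain") or "").strip()
        label = (row.get("classification") or "").strip()
        if not domain:
            problems.append((line_no, "missing domain"))
        elif label not in KNOWN_LABELS:
            problems.append((line_no, f"unknown classification: {label!r}"))
    return problems


sample = "domain,classification\nstripe.com,business\nparked-example.com,parkd\n"
print(find_bad_rows(sample))
# [(3, "unknown classification: 'parkd'")]
```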
All settings can be overridden with environment variables prefixed DC_, or placed in a .env file in the project root.
| Variable | Default | Description |
|---|---|---|
| DC_WORKERS | 5 | Concurrent domains in batch |
| DC_BROWSER_POOL_SIZE | 5 | Playwright browser/page concurrency pool |
| DC_HTTP_TIMEOUT | 10 | HTTP request timeout (seconds) |
| DC_PLAYWRIGHT_TIMEOUT | 20 | Playwright render timeout (seconds) |
| DC_DOMAIN_TIMEOUT | 30 | Per-domain pipeline timeout (seconds) |
| DC_MIN_BODY_CHARS | 500 | Minimum response body size before browser fallback is considered |
| DC_MAX_REDIRECTS | 10 | Maximum redirects to follow |
| DC_PROXY_URL | (unset) | Proxy URL used by HTTP and Playwright fetches |
| DC_CACHE_ENABLED | true | Enable SQLite result cache |
| DC_CACHE_PATH | domain_cache.db | Cache database path |
| DC_CACHE_TTL_HOURS | 24 | Cache entry lifetime (hours) |
| DC_MODEL_PATH | domain_classifier/ml/models/classifier.joblib | Trained model path |
| DC_AI_API_KEY | (unset) | Anthropic API key for AI comparison mode |
| DC_AI_MODEL | claude-haiku-4-5-20251001 | Claude model used for AI comparison |
| DC_AI_COMPARE_THRESHOLD | 0.6 | Only run AI comparison when static score is below this |
| DC_DBI_ENABLED | true | Enable DBI API lookups for inactive/blocked/error domains |
| DC_DBI_AUTH_HEADER_PATH | ~/.config/dbiapi/auth_header | Path to DBI API auth header file |
| DC_DBI_TIMEOUT | 8 | DBI API request timeout (seconds) |
| DC_LOG_LEVEL | INFO | Default logging level |
| DC_LOG_FILE | (unset) | Optional rotating log file path |
| DC_LOG_MAX_BYTES | 10485760 | Max size of each log file before rotation |
| DC_LOG_BACKUP_COUNT | 5 | Number of rotated log files to keep |

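
The DC_ prefix convention can be illustrated with a small loader. This is a sketch under stated assumptions, not the project's settings code: `Settings` and `load_settings` are hypothetical, only a handful of the variables above are modelled, and `.env` file loading is left out.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    # Defaults copied from the table above; only a subset is shown.
    workers: int = 5
    http_timeout: int = 10
    domain_timeout: int = 30
    cache_enabled: bool = True
    cache_path: str = "domain_cache.db"


def load_settings(env=None) -> Settings:
    """Override defaults from DC_<NAME> variables in `env` (os.environ by default)."""
    env = os.environ if env is None else env
    settings = Settings()
    for f in fields(Settings):
        raw = env.get(f"DC_{f.name.upper()}")
        if raw is None:
            continue
        if isinstance(f.default, bool):  # check bool first: bool is a subclass of int
            value = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            value = type(f.default)(raw)
        setattr(settings, f.name, value)
    return settings


s = load_settings({"DC_WORKERS": "10", "DC_CACHE_ENABLED": "false"})
print(s.workers, s.cache_enabled, s.domain_timeout)  # 10 False 30
```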
python3 -m pytest tests/ -v

DNS check
│
├─ No NS record ──────────────────────────────────► unregistered (F)
│
├─ Institutional TLD (.gov/.mil/.edu/etc.) ───────► government (A)
│
├─ NS only, no A record ──► WHOIS ───────────────► ns_only (E) or B if enterprise DNS
│
▼
Fetch (httpx → Playwright fallback for JS-rendered sites)
│
├─ Redirect to parking provider ─────────────────► parked (C)
├─ Redirect to other domain ──────────────────────► redirect (D)
├─ Blocked / error ───────────────────────────────► blocked/error → DNS upgrade → B or C
│
▼
Content analysis (BeautifulSoup, langdetect)
│
├─ Rule-based patterns (parked, gated, adult, etc.) ──► early exit C
│
▼
ML classification (SGDClassifier or heuristic fallback)
│
▼
Grade assignment (A–F)
│
▼
DBI lookup (inactive / blocked / error at C only)
│
└─ domainrank or business classification ─────────► upgrade to B
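
The first three branches of the diagram can be sketched as a triage step that either returns an early classification or lets the domain continue to the fetch stage. This is an illustration only: `dns_triage` is a hypothetical name, the suffix list contains just the TLDs named in this README, and the WHOIS / enterprise-DNS upgrade for ns_only domains is omitted.

```python
from typing import Optional

# Suffixes named in this README; the real list is likely longer (assumption).
INSTITUTIONAL_SUFFIXES = (".gov", ".mil", ".edu", ".gov.uk", ".gc.ca")


def dns_triage(domain: str, has_ns: bool, has_a: bool) -> Optional[str]:
    """Return an early classification, or None to continue to the fetch stage."""
    if not has_ns:
        return "unregistered"  # grade F
    if domain.lower().rstrip(".").endswith(INSTITUTIONAL_SUFFIXES):
        return "government"    # grade A
    if not has_a:
        return "ns_only"       # grade E (before any WHOIS-based upgrade)
    return None


print(dns_triage("ghost-xyz.net", has_ns=False, has_a=False))  # unregistered
print(dns_triage("cdc.gov", has_ns=True, has_a=True))          # government
print(dns_triage("empty.io", has_ns=True, has_a=False))        # ns_only
print(dns_triage("stripe.com", has_ns=True, has_a=True))       # None
```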
Playwright is used automatically when the httpx response has fewer than 500 raw bytes or fewer than 50 visible words. The word-count check catches JS-heavy SPAs that return a large skeleton HTML with no rendered text.
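
A minimal sketch of that fallback decision, using only the standard library. `needs_browser` and `_VisibleText` are hypothetical names, and the visible-word count here (text outside script, style, and noscript elements) is a simplification of whatever the real content analyzer does.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect words that appear outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def needs_browser(body: bytes, min_bytes: int = 500, min_words: int = 50) -> bool:
    """True when the static response looks too thin to classify without rendering."""
    if len(body) < min_bytes:
        return True
    parser = _VisibleText()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return len(parser.words) < min_words


# A large JS skeleton with no rendered text versus a page with real content.
skeleton = b"<html><head><script>" + b"var x = 1;" * 200 + b"</script></head><body><div id='root'></div></body></html>"
article = b"<html><body><p>" + b"word " * 120 + b"</p></body></html>"
print(needs_browser(skeleton))  # True
print(needs_browser(article))   # False
```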