ProfoundNetworks/domain-classifier

Domain Classifier

Classifies domains A–F based on DNS records, WHOIS registration data, and website content analysis. Designed for B2B data quality workflows where you need to quickly separate active business domains from parked pages, redirects, and junk.


Quick Start

Requirements: Python 3.9+

Start here

These helper scripts are for macOS / Linux shells.

Bootstrap the virtual environment, upgrade pip, install dependencies, and install Playwright Chromium:

./scripts/bootstrap.sh

If you want custom runtime settings or AI comparison, create .env before starting the API:

cp .env.example .env

Launch the FastAPI server:

./scripts/run-api.sh

The API will be served at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

To override the bind host or port:

HOST=127.0.0.1 PORT=9000 ./scripts/run-api.sh

What the scripts do

  • ./scripts/bootstrap.sh: creates .venv if needed, activates it, upgrades pip, installs the local sibling repos ../content-analyzer, ../ml-classifier, and ../httpfetch[full], installs this repo, and runs playwright install chromium.
  • ./scripts/run-api.sh: activates .venv and starts uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload. The host and port shown are the defaults; set HOST and PORT to override them.

Local dependency layout

bootstrap.sh expects these sibling repos to exist next to this project:

../content-analyzer
../ml-classifier
../httpfetch

If they live elsewhere, you can override their paths when running bootstrap:

CONTENT_ANALYZER_DIR=/path/to/content-analyzer \
ML_CLASSIFIER_DIR=/path/to/ml-classifier \
HTTPFETCH_DIR=/path/to/httpfetch \
./scripts/bootstrap.sh

Manual setup

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e ../content-analyzer
pip install -e ../ml-classifier
pip install -e "../httpfetch[full]"
pip install -e ".[dev]"
playwright install chromium
uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

Windows PowerShell:

py -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e ..\content-analyzer
pip install -e ..\ml-classifier
pip install -e "..\httpfetch[full]"
pip install -e ".[dev]"
playwright install chromium
uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

Grades

| Grade | Classification | Meaning |
| --- | --- | --- |
| A | business (score > 0.6) or government | Confirmed business or institutional domain; high confidence |
| B | business (score ≤ 0.6) or DNS-upgraded | Likely business; moderate confidence, or strong DNS signals behind a block |
| C | inactive, parked, ads, adult, gambling, personal, under_construction, gated, blocked, timeout, error | Non-business or unclassifiable content |
| D | redirect | Domain redirects to a different domain |
| E | ns_only | Nameserver exists but no A record; no website |
| F | unregistered | No nameserver; domain is not registered |

Classifications

| Value | Description |
| --- | --- |
| business | Website content indicates an active business |
| government | Institutional TLD (.gov, .mil, .edu, .gov.uk, .gc.ca, etc.); pre-pipeline fast path |
| inactive | Page loaded but has no meaningful content, or served inside a frameset |
| parked | Parked domain: ad lander, for-sale page, or content farm redirect |
| ads | Ad-only page with no substantive content |
| under_construction | Explicit "coming soon" / under construction page |
| gated | Login or CAPTCHA wall; no indexable content visible |
| blocked | WAF / Cloudflare challenge page |
| adult | Adult content |
| gambling | Gambling content |
| personal | Personal site, blog, or portfolio with no business signals |
| redirect | Redirects to a different domain |
| ns_only | DNS nameserver exists, no web server |
| unregistered | No DNS records |
| timeout | No response within time limit |
| error | Pipeline or HTTP error |

Grade B upgrades — a domain that would otherwise receive C is promoted to B when strong DNS signals indicate a real organisation behind a block or blank page (enterprise MX provider, strict DMARC policy, Microsoft 365 tenant verification), or when the DBI API confirms a real web presence via domainrank.
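The grade rules above can be sketched as a small lookup. This is an illustrative reading of the Grades table, not the project's implementation: the function name assign_grade and the upgraded flag (standing in for any of the DNS or DBI upgrade signals) are hypothetical.

```python
# Classifications that the Grades table maps to C.
GRADE_C = {
    "inactive", "parked", "ads", "adult", "gambling", "personal",
    "under_construction", "gated", "blocked", "timeout", "error",
}


def assign_grade(classification: str, score: float = 1.0, upgraded: bool = False) -> str:
    """Map a classification (and its score) to a grade from A to F.

    Hypothetical helper: `upgraded` is an assumption that collapses the
    DNS and DBI upgrade signals into a single boolean.
    """
    if classification == "government":
        return "A"
    if classification == "business":
        # The table splits business domains at a score of 0.6.
        return "A" if score > 0.6 else "B"
    if classification in GRADE_C:
        # A would-be C is promoted to B when an upgrade signal is present.
        return "B" if upgraded else "C"
    if classification == "redirect":
        return "D"
    if classification == "ns_only":
        return "E"
    if classification == "unregistered":
        return "F"
    raise ValueError(f"unknown classification: {classification!r}")
```

For example, assign_grade("business", 0.541) yields B, matching the example.de row in the sample table output.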

Language is reported as a separate language field (ISO 639-1 code, e.g. de, zh, fr) and does not affect the classification or grade.


Setup

Use the helper script for the standard local setup:

./scripts/bootstrap.sh

Or follow the manual commands shown in Quick Start.

Optional environment file

If you want AI comparison or custom runtime settings, copy the example file first:

cp .env.example .env

Then edit .env and set the values you need, for example DC_AI_API_KEY.


Running The Classifier

The main CLI entrypoint in this repo is:

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify --help

To see the options for a classification job:

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --help

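Calling the script by its full path inside .venv avoids accidentally picking up a different domain-classify on your PATH. The absolute path used here and in the examples below comes from one particular checkout; substitute the path to your own clone.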

If you are already in the repo root, the equivalent shorter form is:

.venv/bin/domain-classify --help

To start the API with the helper script:

./scripts/run-api.sh

CLI Usage

Classify a single domain

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com

Classify with explicit runtime options

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --timeout 45 --workers 10 --verbose

Classify multiple domains from a file

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt

domains.txt — one domain per line, lines starting with # are ignored:

# Known businesses
stripe.com
dnb.com
salesforce.com

# Suspected junk
17work.cn

Save results to CSV

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --output results.csv

Output as JSON

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --format json

Run without the SQLite cache

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --no-cache

Route fetches through a proxy

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify stripe.com --proxy http://user:pass@host:port

Write logs to a file

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input domains.txt --log-file classifier.log --verbose

Classify from a delimited file

Use --delimiter and --field to parse structured input files (CSV, TSV, pipe-delimited, etc.). The classifier appends its output columns to each original row and automatically writes a <stem>_output<ext> file, for example leads_output.csv for leads.csv.

# CSV with a header row — domain is in field 1
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input leads.csv --delimiter , --field 1

# Tab-delimited export — domain is in field 3
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input export.tsv --delimiter $'\t' --field 3

# Pipe-delimited, no header
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify --input data.txt --delimiter '|' --field 2
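To make the --field indexing and the output naming concrete, here is a rough sketch using only the standard library. Both function names are hypothetical, and two details are assumptions: that the output file is written next to the input, and that header rows get no special treatment (the sketch returns every row, header included).

```python
import csv
import io
from pathlib import Path


def output_path(input_path: str) -> Path:
    """Build the <stem>_output<ext> name, assumed to sit beside the input."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_output{p.suffix}")


def extract_domains(text: str, delimiter: str, field: int) -> list[str]:
    """Pull the domain column out of delimited text.

    `field` is 1-based, like --field; rows too short to contain it are skipped.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row[field - 1].strip() for row in reader if len(row) >= field]
```

With a pipe-delimited input and field 2, extract_domains("a|stripe.com|x", "|", 2) returns ["stripe.com"].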

All options

Usage: /Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify classify [OPTIONS] [DOMAIN]

Arguments:
  DOMAIN                    Single domain to classify  [optional]

Options:
  -i, --input PATH          File with one domain per line
  -o, --output PATH         Output CSV/file path
      --format TEXT         Output format: table|json|csv  [default: table]
  -d, --delimiter TEXT      Field delimiter for structured input (e.g. ',' or '\t')
  -f, --field INT           1-based index of the domain field in delimited input
  -w, --workers INT         Concurrent domains  [default: 5]
      --no-cache            Disable SQLite result cache
  -t, --timeout INT         Per-domain timeout (seconds)  [default: 30]
  -v, --verbose             Enable debug logging to stderr
      --log-file PATH       Write logs to file (also prints to stderr when --verbose)
      --ai-compare          Compare static results with Claude AI classifier
      --ai-all              Run AI comparison on every domain (default: low-confidence only)
      --proxy TEXT          Proxy URL for all HTTP/Playwright fetches
      --help                Show this message and exit

Compare two CSV snapshots

The CLI also includes a comparison command for before/after result files:

/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare before.csv after.csv
/Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare before.csv after.csv --changed-only

compare options:

Usage: /Users/blakesitney/Projects/Domain_Classifier/.venv/bin/domain-classify compare [OPTIONS] BEFORE AFTER

Arguments:
  BEFORE                    CSV from an earlier run
  AFTER                     CSV from the more recent run

Options:
  -c, --changed-only        Only show domains where grade or classification changed
      --help                Show this message and exit
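The idea behind --changed-only can be illustrated with a short script. This is a sketch of the comparison, not the compare command's code; it assumes both files have a header row with domain, grade, and classification columns, and it ignores domains that appear in only one file.

```python
import csv
import io


def load_snapshot(csv_text: str) -> dict:
    """Index the rows of a results CSV by domain."""
    return {row["domain"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def changed_domains(before: dict, after: dict) -> list:
    """List domains whose grade or classification differs between snapshots."""
    changes = []
    for domain, new in after.items():
        old = before.get(domain)
        if old is None:
            continue  # only present in the newer file; nothing to compare
        if (old["grade"], old["classification"]) != (new["grade"], new["classification"]):
            changes.append(
                (domain, old["grade"], new["grade"],
                 old["classification"], new["classification"])
            )
    return changes
```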

Example table output

              Domain Classification Results
┌──────────────────┬───────┬──────────────────┬──────┬───────┬──────────┐
│ Domain           │ Grade │ Classification   │ Lang │ Score │ Time (ms)│
├──────────────────┼───────┼──────────────────┼──────┼───────┼──────────┤
│ stripe.com       │ A     │ business         │ en   │ 0.923 │ 4821     │
│ salesforce.com   │ A     │ business         │ en   │ 0.881 │ 3204     │
│ cdc.gov          │ A     │ government       │ en   │ 1.000 │ 12       │
│ example.de       │ B     │ business         │ de   │ 0.541 │ 2109     │
│ 17work.cn        │ C     │ timeout          │      │ 1.000 │ 30012    │
│ parked.biz       │ C     │ parked           │ en   │ 1.000 │ 1832     │
│ old-co.com       │ D     │ redirect         │ en   │ 1.000 │ 521      │
│ empty.io         │ E     │ ns_only          │      │ 1.000 │ 412      │
│ ghost-xyz.net    │ F     │ unregistered     │      │ 1.000 │ 203      │
└──────────────────┴───────┴──────────────────┴──────┴───────┴──────────┘

CSV columns

domain, grade, classification, language, classification_score,
has_dns_a_record, is_registered, is_reachable, final_url, redirects_to,
copyright_found, has_phone, has_email, domain_age_days, registrar,
ssl_cert_error, processing_time_ms, error,
ai_classification, ai_confidence, ai_match, ai_reasoning,
dbi_company_name, dbi_domainrank, dbi_classification, dbi_grade, dbi_category
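A typical downstream step is to keep only the domains graded A or B. The sketch below assumes results.csv starts with a header row that uses the column names listed above; domains_with_grades is a hypothetical helper, not part of the package.

```python
import csv
import io


def domains_with_grades(csv_text: str, grades=("A", "B")) -> list[str]:
    """Return the domains whose grade column is one of `grades`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["domain"] for row in reader if row["grade"] in grades]
```

To apply it to a saved file, pass the file contents, for example domains_with_grades(open("results.csv", newline="").read()).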

REST API

The classifier is also available as a FastAPI server.

Start the server

uvicorn domain_classifier.api:app --reload

The API docs are available at http://localhost:8000/docs.

To bind a different host/port:

uvicorn domain_classifier.api:app --host 0.0.0.0 --port 8000 --reload

Classify via API

# Single domain
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domain": "stripe.com"}'

# Batch (≤10 domains — synchronous response)
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domains": ["stripe.com", "parked.biz", "ghost-xyz.net"]}'

# Batch (>10 domains — returns a job_id)
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domains": ["a.com", "b.com", "..."]}'

# Poll job status
curl http://localhost:8000/result/<job_id>
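The same calls can be made from Python with the standard library. This is a minimal client sketch under stated assumptions: the helper names are hypothetical, the server is assumed to be running on localhost:8000, and because the response schema is not documented here, the functions simply return the decoded JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the default host and port


def build_payload(domains: list) -> dict:
    """Use the single-domain body for one domain, the batch body otherwise."""
    if len(domains) == 1:
        return {"domain": domains[0]}
    return {"domains": list(domains)}


def classify(domains: list, base_url: str = BASE_URL):
    """POST to /classify and return the decoded JSON response."""
    body = json.dumps(build_payload(domains)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def get_result(job_id: str, base_url: str = BASE_URL):
    """GET /result/<job_id> and return the decoded JSON response."""
    with urllib.request.urlopen(f"{base_url}/result/{job_id}") as response:
        return json.load(response)
```

For batches of more than 10 domains, the response to classify carries a job_id, which can then be passed to get_result until the job finishes.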

Other API endpoints

# Health check
curl http://localhost:8000/health

# View saved labels
curl http://localhost:8000/labels

# Check retrain status
curl http://localhost:8000/retrain/status

# Feature importance from the trained model
curl http://localhost:8000/feature-importance

ML Model

The classifier ships with a heuristic scorer that works without any training data. To improve accuracy for your specific domain set, train an SGD model on labelled examples.

Train from a CSV

python -m domain_classifier.ml.train \
  --data labelled_domains.csv \
  --output domain_classifier/ml/models/classifier.joblib

labelled_domains.csv format:

domain,classification
stripe.com,business
parked-example.com,parked
under-construction.net,under_construction
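Before training, it can help to check the label column for typos. The sketch below assumes that valid labels are the values in the Classifications table; the trainer may accept a narrower set, so treat this as a sanity check only. invalid_labels is a hypothetical helper.

```python
import csv
import io

# Assumption: labels are drawn from the Classifications table.
KNOWN_LABELS = {
    "business", "government", "inactive", "parked", "ads",
    "under_construction", "gated", "blocked", "adult", "gambling",
    "personal", "redirect", "ns_only", "unregistered", "timeout", "error",
}


def invalid_labels(csv_text: str) -> list:
    """Return (line_number, domain, label) for rows with an unknown label."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        label = (row.get("classification") or "").strip()
        if label not in KNOWN_LABELS:
            bad.append((line_no, row.get("domain", ""), label))
    return bad
```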

Active learning via the Chrome Extension

The browser extension (see chrome_extension/README.md) provides a labelling loop: browse → classify → label → retrain — without leaving Chrome.


Configuration

All settings can be overridden with environment variables prefixed DC_, or placed in a .env file in the project root.

| Variable | Default | Description |
| --- | --- | --- |
| DC_WORKERS | 5 | Concurrent domains in batch |
| DC_BROWSER_POOL_SIZE | 5 | Playwright browser/page concurrency pool |
| DC_HTTP_TIMEOUT | 10 | HTTP request timeout (seconds) |
| DC_PLAYWRIGHT_TIMEOUT | 20 | Playwright render timeout (seconds) |
| DC_DOMAIN_TIMEOUT | 30 | Per-domain pipeline timeout (seconds) |
| DC_MIN_BODY_CHARS | 500 | Minimum response body size before browser fallback is considered |
| DC_MAX_REDIRECTS | 10 | Maximum redirects to follow |
| DC_PROXY_URL | (unset) | Proxy URL used by HTTP and Playwright fetches |
| DC_CACHE_ENABLED | true | Enable SQLite result cache |
| DC_CACHE_PATH | domain_cache.db | Cache database path |
| DC_CACHE_TTL_HOURS | 24 | Cache entry lifetime (hours) |
| DC_MODEL_PATH | domain_classifier/ml/models/classifier.joblib | Trained model path |
| DC_AI_API_KEY | (unset) | Anthropic API key for AI comparison mode |
| DC_AI_MODEL | claude-haiku-4-5-20251001 | Claude model used for AI comparison |
| DC_AI_COMPARE_THRESHOLD | 0.6 | Only run AI comparison when static score is below this |
| DC_DBI_ENABLED | true | Enable DBI API lookups for inactive/blocked/error domains |
| DC_DBI_AUTH_HEADER_PATH | ~/.config/dbiapi/auth_header | Path to DBI API auth header file |
| DC_DBI_TIMEOUT | 8 | DBI API request timeout (seconds) |
| DC_LOG_LEVEL | INFO | Default logging level |
| DC_LOG_FILE | (unset) | Optional rotating log file path |
| DC_LOG_MAX_BYTES | 10485760 | Max size of each log file before rotation (bytes) |
| DC_LOG_BACKUP_COUNT | 5 | Number of rotated log files to keep |
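The override behaviour can be pictured as "environment value if set, otherwise the documented default". The sketch below shows that idea for a handful of the variables; it is not the project's settings loader, load_settings is a hypothetical name, the accepted boolean spellings are an assumption, and .env file parsing is left out.

```python
import os

# A subset of the documented defaults, for illustration.
DEFAULTS = {
    "DC_WORKERS": 5,
    "DC_HTTP_TIMEOUT": 10,
    "DC_DOMAIN_TIMEOUT": 30,
    "DC_CACHE_ENABLED": True,
    "DC_CACHE_PATH": "domain_cache.db",
}


def load_settings(env=None) -> dict:
    """Resolve each setting from `env` (os.environ by default) or its default."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None:
            settings[name] = default
        elif isinstance(default, bool):
            # Checked before int because bool is a subclass of int.
            settings[name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(default, int):
            settings[name] = int(raw)
        else:
            settings[name] = raw
    return settings
```

Calling load_settings({"DC_WORKERS": "10"}) raises the worker count to 10 and leaves every other value at its default.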

Run Tests

python3 -m pytest tests/ -v

Pipeline Stages

DNS check
    │
    ├─ No NS record ──────────────────────────────────► unregistered (F)
    │
    ├─ Institutional TLD (.gov/.mil/.edu/etc.) ───────► government (A)
    │
    ├─ NS only, no A record ──► WHOIS ───────────────► ns_only (E) or B if enterprise DNS
    │
    ▼
Fetch (httpx → Playwright fallback for JS-rendered sites)
    │
    ├─ Redirect to parking provider ─────────────────► parked (C)
    ├─ Redirect to other domain ──────────────────────► redirect (D)
    ├─ Blocked / error ───────────────────────────────► blocked/error → DNS upgrade → B or C
    │
    ▼
Content analysis (BeautifulSoup, langdetect)
    │
    ├─ Rule-based patterns (parked, gated, adult, etc.) ──► early exit C
    │
    ▼
ML classification (SGDClassifier or heuristic fallback)
    │
    ▼
Grade assignment (A–F)
    │
    ▼
DBI lookup (inactive / blocked / error at C only)
    │
    └─ domainrank or business classification ─────────► upgrade to B
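The first three branches of the diagram amount to a short decision function. This sketch follows the order shown in the diagram; dns_stage is a hypothetical name, the suffix list contains only the TLDs named in this README, and the WHOIS step and enterprise-DNS upgrade on the ns_only branch are omitted.

```python
# Institutional suffixes named in the Classifications table (not exhaustive).
INSTITUTIONAL_SUFFIXES = (".gov", ".mil", ".edu", ".gov.uk", ".gc.ca")


def dns_stage(domain: str, has_ns: bool, has_a: bool):
    """Return an early classification, or None to continue to the fetch stage."""
    if not has_ns:
        return "unregistered"
    if domain.lower().rstrip(".").endswith(INSTITUTIONAL_SUFFIXES):
        return "government"
    if not has_a:
        return "ns_only"
    return None
```

A result of None means the domain has both NS and A records and is not institutional, so it proceeds to the fetch stage.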

Playwright is used automatically when the httpx response has fewer than 500 raw bytes or fewer than 50 visible words. The word-count check catches JS-heavy SPAs that return a large skeleton HTML with no rendered text.
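The fallback rule can be approximated with the standard library alone. The sketch below is not the project's code (the pipeline lists BeautifulSoup for content analysis); needs_browser is a hypothetical name, and "visible words" is approximated as whitespace-separated text outside script, style, and noscript elements.

```python
from html.parser import HTMLParser

_HIDDEN_TAGS = ("script", "style", "noscript")


class _VisibleWordCounter(HTMLParser):
    """Count words in text that is not inside script/style/noscript."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in _HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.words += len(data.split())


def needs_browser(html: str, min_bytes: int = 500, min_words: int = 50) -> bool:
    """True when the response is too small or has too little visible text."""
    if len(html.encode("utf-8")) < min_bytes:
        return True
    counter = _VisibleWordCounter()
    counter.feed(html)
    counter.close()
    return counter.words < min_words
```

A page whose body is a single empty root div plus a large script bundle passes the byte check but fails the word check, so it would be re-fetched with a browser.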
