Agents for loading documents into the Soliplex Ingester system. This package provides tools to collect, validate, and ingest documents from multiple sources including local filesystems, WebDAV servers, and source code management platforms (GitHub, Gitea).
- Filesystem Agent (`fs`): Ingest documents from local directories
  - Recursive directory scanning
  - MIME type detection
  - Configuration validation
  - Status checking to avoid re-ingesting unchanged files
- WebDAV Agent (`webdav`): Ingest documents from WebDAV servers
  - Support for any WebDAV-compliant server (Nextcloud, ownCloud, SharePoint, etc.)
  - Recursive directory scanning
  - MIME type detection
  - Authentication support (username/password)
  - Status checking to avoid re-ingesting unchanged files
  - URL export for reviewing discovered files before ingestion
  - URL-based ingestion from a curated file list with per-URL error tracking
  - Skip hash check option for faster ingestion when re-downloading is acceptable
- Web Agent (`web`): Ingest web pages via HTTP
  - Fetch and ingest HTML content from URLs
  - URL list support (inline, file, or single URL)
  - Batch processing with workflow support
- SCM Agent (`scm`): Ingest files and issues from Git repositories
  - Support for GitHub and Gitea platforms
  - Automatic file type filtering
  - Issue ingestion with comments (rendered as Markdown)
  - Batch processing with workflow support
  - Status checking to avoid re-ingesting unchanged files
- Manifest Runner (`manifest`): Declarative multi-source ingestion
  - YAML-based manifest files defining ingestion components
  - Supports all agent types (fs, scm, webdav, web) in a single manifest
  - Shared configuration (metadata, extensions, workflow settings)
  - Stale document removal (`delete_stale`) across all components in a manifest
  - Cron-based scheduling via the REST API server
  - Per-component credential and extension overrides
  - Directory-level execution for running multiple manifests at once
- REST API Server: Run agents as a web service
  - FastAPI-based HTTP endpoints for all operations
  - Multiple authentication methods (API key, OAuth2 proxy)
  - Interactive API documentation with Swagger UI
  - Health check endpoint for monitoring
  - Container-ready with Docker support
Requirements:
- Python 3.13 or higher
- Soliplex Ingester running and accessible
Before using these tools, a working Soliplex Ingester instance must be available, and its URL must be configured via environment variables.
Install with uv:

```shell
uv add soliplex.agents
```

Or with pip:

```shell
pip install soliplex.agents
```

Or from source:

```shell
git clone <repository-url>
cd ingester-agents
uv sync
```

The agents use environment variables for configuration. Create a .env file or export these variables:
```shell
# Soliplex Ingester API endpoint
ENDPOINT_URL=http://localhost:8000/api/v1

# Ingester API authentication (for connecting to protected Ingester instances)
INGESTER_API_KEY=your-api-key
```

The agents use unified authentication settings that work across all SCM providers (GitHub, Gitea, etc.):
```shell
# SCM authentication token (GitHub personal access token or Gitea API token)
scm_auth_token=your_scm_token_here

# SCM base URL (required for Gitea, optional for GitHub)
# For Gitea: Full API URL including /api/v1
# For GitHub: Defaults to https://api.github.com if not specified
scm_base_url=https://your-gitea-instance.com/api/v1
```

Examples:

For GitHub:

```shell
export scm_auth_token=ghp_YourGitHubToken
# scm_base_url not needed for public GitHub
```

For Gitea:

```shell
export scm_auth_token=your_gitea_token
export scm_base_url=https://gitea.example.com/api/v1
```

```shell
# WebDAV server URL
WEBDAV_URL=https://webdav.example.com

# WebDAV authentication
WEBDAV_USERNAME=your-username
WEBDAV_PASSWORD=your-password

# TLS certificate verification (default: true; set to false to disable)
SSL_VERIFY=true
```

All WebDAV credentials can also be provided via command-line options (--webdav-url, --webdav-username, --webdav-password), which override the environment variables.
```shell
# File extensions to include (default: md,pdf,doc,docx)
EXTENSIONS=md,pdf,doc,docx

# Logging level (default: INFO)
LOG_LEVEL=INFO

# API Server Configuration
SERVER_HOST=127.0.0.1
SERVER_PORT=8001

# Authentication (for API server)
API_KEY=your-api-key
API_KEY_ENABLED=false
AUTH_TRUST_PROXY_HEADERS=false

# Manifest scheduling (requires SCHEDULER_ENABLED=true)
MANIFEST_DIR=/path/to/manifests

# S3-compatible storage (for urls_file references using s3:// URLs)
S3_ENDPOINT_URL=https://minio.example.com:9000
```

For large repositories or rate-limited APIs, you can use the git command line for file synchronization instead of API calls. This clones the repository locally and reads files from the filesystem.
```shell
# Enable git CLI mode
scm_use_git_cli=true

# Optional: Custom directory for cloned repos (default: system temp directory)
scm_git_repo_base_dir=/var/lib/soliplex/repos

# Optional: Timeout for git operations in seconds (default: 300)
scm_git_cli_timeout=600
```

How it works:
- First sync: Clones the repository to a local temp directory (shallow clone, single branch)
- Subsequent syncs: Pulls latest changes using `git pull --ff-only`
- Pull failure: If the pull fails, deletes the local clone and re-clones
- After sync: Runs `git clean -fd` to remove untracked files
Notes:
- Issues are still fetched via API (git doesn't provide issue data)
- Requires git to be installed in the runtime environment
- The Docker image includes git by default
- All credentials are masked in log output for security
Security: Git CLI mode uses strict input sanitization to prevent command injection. Only alphanumeric characters, dashes, underscores, dots, and forward slashes are allowed in repository names and paths.
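The allow-list rule above can be sketched as a small validator. This is an illustrative sketch, not the agent's actual implementation; the function name and pattern are assumptions based on the characters the section says are permitted.

```python
import re

# Allow only alphanumerics, dashes, underscores, dots, and forward slashes,
# as described above, before a value is passed to the git CLI.
_SAFE_PATTERN = re.compile(r"^[A-Za-z0-9._/-]+$")

def is_safe_repo_path(value: str) -> bool:
    """Return True if the value contains only allowed characters."""
    return bool(_SAFE_PATTERN.match(value))

print(is_safe_repo_path("myorg/my-repo"))   # accepted
print(is_safe_repo_path("repo; rm -rf /"))  # rejected: ';' and spaces
```

Rejecting shell metacharacters outright (rather than trying to quote them) keeps the attack surface minimal.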
The CLI tool si-agent provides six main modes of operation:

- `fs`: Filesystem agent for ingesting local documents
- `web`: Web agent for ingesting HTML pages from URLs
- `scm`: SCM agent for ingesting from Git repositories
- `webdav`: WebDAV agent for ingesting documents from WebDAV servers
- `manifest`: Manifest runner for declarative multi-source ingestion
- `serve`: REST API server exposing agent functionality via HTTP
Ingest documents directly from a directory:

```shell
si-agent fs run-inventory /path/to/documents my-source-name
```

That's it! The tool automatically:
- Scans the directory
- Builds the configuration
- Validates files
- Ingests documents
If you want to review or modify the inventory before ingestion:
1. Build Configuration (Optional)
Scan a directory and create an inventory file:
```shell
si-agent fs build-config /path/to/documents
```

This creates an inventory.json file containing metadata for all discovered files. You can edit this file to add custom metadata or exclude specific files.
2. Validate Configuration
Check if files are supported (accepts file OR directory):
```shell
# Validate existing inventory file
si-agent fs validate-config /path/to/inventory.json

# Or validate directory directly (builds config on-the-fly)
si-agent fs validate-config /path/to/documents
```

3. Check Status
See which files need to be ingested (accepts file OR directory):
```shell
# Using inventory file
si-agent fs check-status /path/to/inventory.json my-source-name

# Or using directory directly
si-agent fs check-status /path/to/documents my-source-name
```

Add the --detail flag to see the full list of files:

```shell
si-agent fs check-status /path/to/documents my-source-name --detail
```

The status check compares file hashes against the Ingester database:
- new: File doesn't exist in the database
- mismatch: File exists but content has changed
- match: File is unchanged (will be skipped during ingestion)
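The three states above amount to a hash comparison between what was discovered locally and what the Ingester has stored. A minimal sketch (function name and dict shapes are illustrative, not the actual API):

```python
# Map each discovered URI to new / mismatch / match against stored state.
def classify(local_hashes: dict[str, str], stored_hashes: dict[str, str]) -> dict[str, str]:
    status = {}
    for uri, digest in local_hashes.items():
        if uri not in stored_hashes:
            status[uri] = "new"          # not in the database yet
        elif stored_hashes[uri] != digest:
            status[uri] = "mismatch"     # content has changed
        else:
            status[uri] = "match"        # unchanged, skipped during ingestion
    return status

print(classify({"a.md": "h1", "b.md": "h2"}, {"b.md": "OLD"}))
# {'a.md': 'new', 'b.md': 'mismatch'}
```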
4. Load Inventory
Ingest documents (accepts file OR directory):
```shell
# From inventory file
si-agent fs run-inventory /path/to/inventory.json my-source-name

# Or from directory directly (recommended!)
si-agent fs run-inventory /path/to/documents my-source-name
```

Advanced options:

```shell
# Process a subset of files (e.g., files 10-50)
si-agent fs run-inventory inventory.json my-source --start 10 --end 50

# Start workflows after ingestion
si-agent fs run-inventory /path/to/documents my-source \
  --start-workflows \
  --workflow-definition-id my-workflow \
  --param-set-id my-params \
  --priority 10
```

List all issues from a repository:
```shell
# GitHub
si-agent scm list-issues github myorg/my-repo

# Gitea
si-agent scm list-issues gitea admin/my-repo
```

List files in a repository:
```shell
# GitHub
si-agent scm get-repo github myorg/my-repo

# Gitea
si-agent scm get-repo gitea admin/my-repo
```

Ingest both files and issues from a repository. Issues are rendered as Markdown documents with their comments.
```shell
# GitHub
si-agent scm run-inventory github myorg/my-repo

# Gitea
si-agent scm run-inventory gitea admin/my-repo
```

Note on Workflows: By default, start_workflows=True. To skip workflow triggering, explicitly set --no-start-workflows.
Run commit-based incremental synchronization. Only processes files that changed since the last sync, significantly reducing API calls and bandwidth usage.
```shell
# First run performs full sync and establishes sync state
si-agent scm run-incremental gitea admin/my-repo

# Subsequent runs only process changes since last sync
si-agent scm run-incremental gitea admin/my-repo --branch main
```

With workflow triggering:
```shell
si-agent scm run-incremental gitea admin/my-repo \
  --start-workflows \
  --workflow-definition-id my-workflow \
  --param-set-id my-params \
  --priority 5
```

Output JSON format:

```shell
si-agent scm run-incremental gitea admin/my-repo --do-json
```

View and manage sync state for repositories:
```shell
# View current sync state
si-agent scm get-sync-state gitea admin/my-repo

# Reset sync state (forces full sync on next run)
si-agent scm reset-sync gitea admin/my-repo
```

The WebDAV agent allows you to ingest documents directly from WebDAV servers such as Nextcloud, ownCloud, and SharePoint.
Ingest documents directly from a WebDAV directory:
```shell
# Set up environment
export WEBDAV_URL=https://webdav.example.com
export WEBDAV_USERNAME=your-username
export WEBDAV_PASSWORD=your-password

# Ingest documents from WebDAV path
si-agent webdav run-inventory /documents my-source-name
```

That's it! The tool automatically:
- Connects to the WebDAV server
- Scans the directory
- Builds the configuration
- Validates files
- Ingests documents
1. Export URLs
Scan a WebDAV directory and export discovered file URLs to a file for review. This uses only directory listing (PROPFIND) and does not download file content:
```shell
si-agent webdav export-urls /documents urls.txt \
  --webdav-url https://webdav.example.com \
  --webdav-username user \
  --webdav-password pass
```

The output file contains one absolute WebDAV path per line:
```text
/documents/report.md
/documents/sub/readme.pdf
/documents/notes.docx
```
Only files matching the configured EXTENSIONS filter are included.
2. Validate Configuration
Check if files are supported (downloads files to compute hashes):
```shell
# Validate WebDAV directory directly
si-agent webdav validate-config /documents \
  --webdav-url https://webdav.example.com \
  --webdav-username user \
  --webdav-password pass
```

3. Check Status
See which files need to be ingested:
```shell
si-agent webdav check-status /documents my-source-name \
  --webdav-url https://webdav.example.com \
  --webdav-username user \
  --webdav-password pass
```

Add the --detail flag to see the full list of files.
4. Load Inventory
Ingest documents from a WebDAV directory:
```shell
si-agent webdav run-inventory /documents my-source-name
```

Advanced options:
```shell
si-agent webdav run-inventory /documents my-source \
  --start-workflows \
  --workflow-definition-id my-workflow \
  --param-set-id my-params \
  --priority 10 \
  --webdav-url https://webdav.example.com \
  --webdav-username user \
  --webdav-password pass
```

5. Run from URL List
Ingest specific files from a URL list file instead of scanning an entire directory. This is useful when you want to ingest only a curated subset of files:
```shell
si-agent webdav run-from-urls urls.txt my-source-name
```

Each URL in the file is processed independently. If a file fails to download, the error is recorded and processing continues with the remaining URLs. Results are written to a JSON file named <input-file>.results.<timestamp>.json:
```json
[
  {"url": "/documents/report.md", "status": "success"},
  {"url": "/documents/broken.pdf", "status": "error", "error_message": "404 Not Found"}
]
```

Use --skip-hash-check to skip downloading files for hash comparison, which avoids downloading each file twice (once for hashing, once for ingestion):

```shell
si-agent webdav run-from-urls urls.txt my-source-name --skip-hash-check
```

The manifest runner executes declarative YAML manifests that define multi-source ingestion jobs. A single manifest can combine filesystem, WebDAV, SCM, and web components under a shared source and configuration.
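Because the results file is plain JSON in the shape shown above, failed URLs are easy to extract for a retry list. A small sketch (the inlined data stands in for reading an actual results file):

```python
import json

# Results in the documented shape: one entry per URL, with an
# error_message field present only on failures.
results = json.loads("""
[
  {"url": "/documents/report.md", "status": "success"},
  {"url": "/documents/broken.pdf", "status": "error", "error_message": "404 Not Found"}
]
""")

failed = [r for r in results if r["status"] == "error"]
for r in failed:
    print(f"{r['url']}: {r['error_message']}")
print(f"{len(results) - len(failed)} succeeded, {len(failed)} failed")
```

Writing the failed URLs back out (one per line) produces a file that can be fed straight into another run-from-urls invocation.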
Run a single manifest file:
```shell
si-agent manifest run /path/to/manifest.yml
```

Run all manifests in a directory:

```shell
si-agent manifest run /path/to/manifests/
```

Output results as JSON:

```shell
si-agent manifest run /path/to/manifest.yml --json
```

A manifest file defines the ingestion source, optional shared configuration, and one or more components:
```yaml
id: my-ingestion
name: My Document Ingestion
source: my-source-name

schedule:
  cron: "0 0 * * *"

config:
  metadata:
    project: my-project
  extensions:
    - md
    - pdf
  delete_stale: true
  start_workflows: true
  workflow_definition_id: my-workflow
  param_set_id: my-params
  priority: 5

components:
  - name: local-docs
    type: fs
    path: /path/to/documents

  - name: web-pages
    type: web
    urls:
      - https://example.com/page1
      - https://example.com/page2

  - name: repo-docs
    type: scm
    platform: github
    owner: myorg
    repo: my-repo
    incremental: true

  - name: shared-drive
    type: webdav
    url: https://webdav.example.com
    path: /documents
```

Top-level fields:
- id (required): Unique identifier for the manifest. Must be unique across all manifests when running from a directory.
- name (required): Human-readable name for display and logging.
- source (required): Source name used for batch management in the Ingester.
- schedule: Optional cron schedule for automated execution via the REST API server.
  - cron: Cron expression (e.g., `"0 0 * * *"` for daily at midnight).
- config: Optional shared configuration applied to all components.
  - metadata: Key-value pairs attached to all ingested documents.
  - extensions: File extensions to include (overrides the global `EXTENSIONS` setting).
  - delete_stale: Remove documents from the Ingester that no longer appear in any component (default: false). See Stale Document Removal below.
  - start_workflows: Whether to start workflows after ingestion (default: false).
  - workflow_definition_id: Workflow definition ID (required when start_workflows is true).
  - param_set_id: Parameter set ID (required when start_workflows is true).
  - priority: Workflow priority (default: 0).
- components (required): List of ingestion components (see below).
Filesystem (fs):
- name (required): Component name (must be unique within the manifest).
- path (required): Path to a local directory or inventory file.
- extensions: Override extensions for this component.
- metadata: Additional metadata merged with config-level metadata.
Web (web):
- name (required): Component name.
- url: Single URL to fetch.
- urls: List of URLs to fetch.
- urls_file: Path to a file containing URLs (one per line). Supports local paths, `s3://bucket/key` URLs, and `http(s)://` WebDAV URLs.
- Exactly one of `url`, `urls`, or `urls_file` must be specified.
- extensions: Override extensions for this component.
- metadata: Additional metadata merged with config-level metadata.
SCM (scm):
- name (required): Component name.
- platform (required): `github` or `gitea`.
- owner (required): Repository owner or organization.
- repo (required): Repository name.
- incremental: Use commit-based incremental sync (default: false).
- branch: Branch to sync (default: `main`).
- content_filter: What to ingest: `all`, `files`, or `issues` (default: `all`).
- base_url: Override SCM base URL (uses the `scm_base_url` env var if not set).
- auth_token: Override auth token name (resolved via Docker secrets or env vars).
- extensions: Override extensions for this component.
- metadata: Additional metadata merged with config-level metadata.
WebDAV (webdav):
- name (required): Component name.
- url (required): WebDAV server URL.
- path: WebDAV directory path to scan recursively.
- urls: List of specific WebDAV file paths to ingest.
- urls_file: Path to a file containing WebDAV URLs (one per line). Supports local paths, `s3://bucket/key` URLs, and `http(s)://` WebDAV URLs (fetched using the same WebDAV credentials).
- Exactly one of `path`, `urls`, or `urls_file` must be specified.
- username: Override WebDAV username (resolved via Docker secrets or env vars).
- password: Override WebDAV password (resolved via Docker secrets or env vars).
- extensions: Override extensions for this component.
- metadata: Additional metadata merged with config-level metadata.
Settings are resolved in the following order (highest priority first):

1. Component-level settings (e.g., `extensions` on a component)
2. Manifest config-level settings (e.g., `config.extensions`)
3. Global environment settings (e.g., the `EXTENSIONS` env var)
For metadata, config-level and component-level values are merged, with component values taking precedence for duplicate keys.
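The resolution rules can be sketched in a few lines. This is an illustrative sketch of the precedence and merge behavior described above, not the runner's actual internals; the function and argument names are assumptions.

```python
# Component settings override manifest config, which overrides global
# environment defaults; metadata dicts are merged with component keys winning.
def resolve(global_env: dict, config: dict, component: dict) -> dict:
    extensions = (
        component.get("extensions")
        or config.get("extensions")
        or global_env.get("EXTENSIONS")
    )
    # Later updates win, so component metadata overrides config metadata.
    metadata = {**config.get("metadata", {}), **component.get("metadata", {})}
    return {"extensions": extensions, "metadata": metadata}

resolved = resolve(
    {"EXTENSIONS": ["md", "pdf"]},
    {"metadata": {"project": "x", "team": "docs"}},
    {"metadata": {"project": "y"}},
)
print(resolved)
# {'extensions': ['md', 'pdf'], 'metadata': {'project': 'y', 'team': 'docs'}}
```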
When delete_stale: true is set in a manifest's config block, the runner will remove documents from the Ingester that no longer appear in any of the manifest's components. This keeps the Ingester in sync with the actual source data.
How it works:
- All components execute sequentially, collecting every discovered URI and its hash
- After all components complete successfully, a single `check_status` call is made to the Ingester with the consolidated URI set and `delete_stale=true`
- The Ingester compares the submitted URIs against what it has stored for the source. Any documents belonging to the source whose URI is not in the submitted set are deleted.

Safety:
- If any component produces an error (exception or unknown type), `delete_stale` is skipped entirely for that manifest run. This prevents accidental deletions when the URI set is incomplete due to a failed component.
- Components that succeed still have their documents ingested normally; only the stale deletion step is skipped.
Example:
```yaml
id: synced-docs
name: Synced Documentation
source: docs-source

config:
  delete_stale: true

components:
  - name: local-docs
    type: fs
    path: /data/docs

  - name: shared-drive
    type: webdav
    url: https://webdav.example.com
    path: /shared/docs
```

If a file is removed from /data/docs or from the WebDAV server, the next manifest run will detect that its URI is no longer present and delete it from the Ingester.
Note: SCM components using incremental: true only return files changed since the last sync, not the full file listing. When delete_stale is enabled with incremental SCM components, the stale detection may not have complete URI coverage for those components. Consider using full inventory mode (incremental: false) when delete_stale is needed with SCM sources.
When the REST API server is started with SCHEDULER_ENABLED=true and MANIFEST_DIR is set, manifests with a schedule block are automatically registered as cron jobs:
```shell
export SCHEDULER_ENABLED=true
export MANIFEST_DIR=/path/to/manifests
si-agent serve
```

The server loads all manifests from the directory at startup, validates that all manifest IDs are unique, and registers cron jobs for manifests that have a schedule defined.
Note: All commands support WebDAV credentials via environment variables (WEBDAV_URL, WEBDAV_USERNAME, WEBDAV_PASSWORD) or command-line options (--webdav-url, --webdav-username, --webdav-password).
Git Bash on Windows: If using Git Bash on Windows, use double slashes for WebDAV paths to prevent path conversion (e.g., //documents instead of /documents).
1. Discovery: Files are discovered from the source (filesystem, WebDAV, SCM, or web)
2. Hashing: Each file's hash is calculated
   - Filesystem/WebDAV/Web sources: SHA256 hash
   - SCM sources: SHA3-256 hash for files, SHA256 for issues
3. Status Check: The system checks which files have changed or are new against the Ingester database
4. Batch Management:
   - The system searches for an existing batch matching the source name
   - If found, new documents are added to the existing batch (incremental ingestion)
   - If not found, a new batch is created
   - This enables efficient re-ingestion: only new or changed files are processed
5. Ingestion: Files are uploaded to the Soliplex Ingester API
6. Workflow Trigger (optional): Workflows can be started to process the ingested documents. See the Ingester documentation for details.
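The hashing step above maps directly onto Python's hashlib. A minimal sketch; the exact byte/hex conventions the agents use internally are assumptions here:

```python
import hashlib

# SHA256 for filesystem/WebDAV/web content, SHA3-256 for SCM file content,
# as described in the hashing step above.
def fs_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def scm_file_hash(content: bytes) -> str:
    return hashlib.sha3_256(content).hexdigest()

doc = b"# My Document\n"
print(fs_hash(doc))
print(scm_file_hash(doc))  # different algorithm, so a different digest
```

Because the two source families use different algorithms, a digest from one can never be compared against a digest from the other; status checks only ever compare like with like.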
The run-incremental command uses commit-based tracking for efficient synchronization:
- Sync State Check: Retrieves last processed commit SHA from the ingester
- Commit Enumeration: Fetches only commits since the last sync
- Change Detection: Extracts changed and removed file paths from commits
- Selective Fetch: Downloads only files that were modified
- Ingestion: Uploads changed files to the ingester
- State Update: Stores the latest commit SHA for subsequent syncs
This approach reduces API calls and bandwidth by 80-95% compared to full repository scans. On first run (or after reset), a full sync is performed to establish the baseline.
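The change-detection step can be sketched as follows. The commit shape here (a list of per-file entries with a path and a change status) is an assumption for illustration; the real providers map GitHub/Gitea API responses into something equivalent.

```python
# Walk commits oldest-to-newest, tracking which paths were modified and
# which were removed. A later removal cancels an earlier modification
# and vice versa, so only the net change survives.
def extract_changes(commits: list[dict]) -> tuple[set[str], set[str]]:
    changed: set[str] = set()
    removed: set[str] = set()
    for commit in commits:
        for f in commit["files"]:
            if f["status"] == "removed":
                removed.add(f["path"])
                changed.discard(f["path"])
            else:  # added / modified / renamed
                changed.add(f["path"])
                removed.discard(f["path"])
    return changed, removed

commits = [
    {"files": [{"path": "docs/a.md", "status": "modified"}]},
    {"files": [{"path": "docs/b.md", "status": "removed"}]},
]
print(extract_changes(commits))  # ({'docs/a.md'}, {'docs/b.md'})
```

Only the paths in the first set need to be downloaded, which is where the bandwidth savings come from.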
The agents filter files by the EXTENSIONS configuration. The default extensions are: md, pdf, doc, docx.
To add more types:
```shell
export EXTENSIONS=md,pdf,doc,docx,txt,rst
```

Validation also checks that files have supported content types and rejects:
- ZIP archives
- RAR archives
- 7z archives
- Generic binary files without proper MIME types
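An extension filter plus a content-type deny list can be sketched with the standard-library mimetypes module. This is illustrative only; the agents' actual MIME detection lives in their own config helpers, and the deny list below simply mirrors the archive formats named above.

```python
import mimetypes

# MIME types rejected per the list above: archives and generic binaries.
REJECTED = {
    "application/zip",
    "application/x-rar-compressed",
    "application/x-7z-compressed",
    "application/octet-stream",  # generic binary without a proper type
}

def is_supported(filename: str) -> bool:
    """Guess a MIME type from the filename and apply the deny list."""
    mime, _ = mimetypes.guess_type(filename)
    return mime is not None and mime not in REJECTED

print(is_supported("report.pdf"))  # True
print(is_supported("bundle.zip"))  # False
```

Extension-less files fall through as unsupported, matching the "no proper MIME type" rule.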
The SCM agent only includes files with extensions specified in the EXTENSIONS configuration (default: md, pdf, doc, docx).
For SCM sources, issues (including their comments) are rendered as Markdown documents and ingested alongside repository files. This enables full-text search and analysis of issue discussions.
As an example, the Soliplex documentation can be loaded either from the filesystem or via git.
Quick version (NEW - no inventory.json needed):

```shell
git clone https://github.com/soliplex/soliplex.git

# Set up environment
export ENDPOINT_URL=http://localhost:8000/api/v1

# Ingest directly from directory!
uv run si-agent fs run-inventory <path-to-checkout>/soliplex/docs soliplex-docs

# Check that documents are in the ingester (your batch id may be different)
curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/document/?batch_id=1' \
  -H 'accept: application/json'
```

Traditional version (with inventory.json):
```shell
git clone https://github.com/soliplex/soliplex.git

# Set up environment
export ENDPOINT_URL=http://localhost:8000/api/v1

# Create inventory (optional - only if you want to review/modify it)
uv run si-agent fs build-config <path-to-checkout>/soliplex/docs
# You may see messages about ignored files
# If you want to update the inventory.json file, do it here

# Validate configuration
uv run si-agent fs validate-config <path-to-checkout>/soliplex/docs/inventory.json
# If there are errors, fix them now

# Ingest
uv run si-agent fs run-inventory <path-to-checkout>/soliplex/docs/inventory.json soliplex-docs

# Check that documents are in the ingester (your batch id may be different)
curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/document/?batch_id=1' \
  -H 'accept: application/json'
```

```shell
# Set up environment
export ENDPOINT_URL=http://localhost:8000/api/v1
export scm_auth_token=ghp_your_token_here

# Ingest repository
si-agent scm run-inventory github mycompany/soliplex

# Check that documents are in the ingester (your batch id may be different)
curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/document/?batch_id=2' \
  -H 'accept: application/json'
```
```shell
# Set up environment
export ENDPOINT_URL=http://localhost:8000/api/v1
export WEBDAV_URL=https://nextcloud.example.com/remote.php/dav/files/username
export WEBDAV_USERNAME=your-username
export WEBDAV_PASSWORD=your-password

# Ingest directly from WebDAV directory
si-agent webdav run-inventory /Documents/project-docs webdav-docs

# Check that documents are in the ingester (your batch id may be different)
curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/document/?batch_id=3' \
  -H 'accept: application/json'
```

```shell
# Ingest and trigger processing workflows
si-agent fs run-inventory ./documents my-docs \
  --start-workflows \
  --workflow-definition-id document-analysis \
  --priority 5
```

The agents can be run as a REST API server using FastAPI. This exposes all agent operations as HTTP endpoints with support for authentication and interactive documentation.
```shell
# Basic
si-agent serve

# Custom host and port
si-agent serve --host 0.0.0.0 --port 8080

# Development mode with auto-reload
si-agent serve --reload

# Production with multiple workers
si-agent serve --workers 4
```

The server supports multiple authentication methods:
No authentication (default):

```shell
si-agent serve
# All requests allowed
```

API key:

```shell
export API_KEY=your-api-key
export API_KEY_ENABLED=true
si-agent serve
```

Clients must include the API key in the Authorization header:

```shell
curl -H "Authorization: Bearer your-api-key" http://localhost:8001/api/fs/status
```

OAuth2 proxy:

```shell
export AUTH_TRUST_PROXY_HEADERS=true
si-agent serve
```

The server will trust authentication headers from a reverse proxy (e.g., OAuth2 Proxy):

- X-Auth-Request-User
- X-Forwarded-User
- X-Forwarded-Email
NEW: All endpoints now accept both file paths (inventory.json) and directory paths!
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/fs/build-config | Build inventory from directory |
| POST | /api/v1/fs/validate-config | Validate inventory (file or directory) |
| POST | /api/v1/fs/check-status | Check which files need ingestion (file or directory) |
| POST | /api/v1/fs/run-inventory | Ingest documents (file or directory) |
Examples:
```shell
# Build configuration from directory
curl -X POST http://localhost:8001/api/v1/fs/build-config \
  -F "path=/path/to/docs"

# Validate using directory (no inventory.json needed)
curl -X POST http://localhost:8001/api/v1/fs/validate-config \
  -F "config_file=/path/to/docs"

# Or validate using existing inventory file
curl -X POST http://localhost:8001/api/v1/fs/validate-config \
  -F "config_file=/path/to/docs/inventory.json"

# Ingest directly from directory
curl -X POST http://localhost:8001/api/v1/fs/run-inventory \
  -F "config_file=/path/to/docs" \
  -F "source=my-source"
```

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/scm/{platform}/{repo}/issues | List repository issues |
| GET | /api/scm/{platform}/{repo}/files | List repository files |
| POST | /api/scm/{platform}/{repo}/ingest | Ingest repo files and issues |
Example:
```shell
# List GitHub issues
curl http://localhost:8001/api/scm/github/my-repo/issues?owner=myuser

# Ingest repository
curl -X POST http://localhost:8001/api/scm/github/my-repo/ingest \
  -H "Content-Type: application/json" \
  -d '{"owner": "myuser", "source": "my-source"}'
```

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/webdav/validate-config | Validate inventory from WebDAV path |
| POST | /api/v1/webdav/check-status | Check which files need ingestion |
| POST | /api/v1/webdav/run-inventory | Ingest documents from WebDAV directory |
| POST | /api/v1/webdav/run-from-file | Ingest documents from an uploaded URL list file |
Examples:
```shell
# Validate using WebDAV path
curl -X POST http://localhost:8001/api/v1/webdav/validate-config \
  -F "config_path=/documents" \
  -F "webdav_url=https://webdav.example.com"

# Ingest from WebDAV directory
curl -X POST http://localhost:8001/api/v1/webdav/run-inventory \
  -F "config_path=/documents" \
  -F "source=my-source" \
  -F "webdav_url=https://webdav.example.com" \
  -F "webdav_username=user" \
  -F "webdav_password=pass"

# Ingest from uploaded URL list file (with skip hash check)
curl -X POST http://localhost:8001/api/v1/webdav/run-from-file \
  -F "file=@urls.txt" \
  -F "source=my-source" \
  -F "skip_hash_check=true" \
  -F "webdav_url=https://webdav.example.com" \
  -F "webdav_username=user" \
  -F "webdav_password=pass"
```

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/web/run-inventory | Ingest web pages from a JSON array of URLs |
| POST | /api/v1/web/run-from-file | Ingest web pages from an uploaded URL list file |
Examples:
```shell
# Ingest web pages from URL list
curl -X POST http://localhost:8001/api/v1/web/run-inventory \
  -F "urls=[\"https://example.com/page1\", \"https://example.com/page2\"]" \
  -F "source=my-source"

# Ingest web pages with workflow and metadata
curl -X POST http://localhost:8001/api/v1/web/run-inventory \
  -F "urls=[\"https://example.com/page1\"]" \
  -F "source=my-source" \
  -F "start_workflows=true" \
  -F "workflow_definition_id=my-workflow" \
  -F "param_set_id=my-params" \
  -F "metadata={\"project\": \"test\"}"

# Ingest web pages from uploaded file
curl -X POST http://localhost:8001/api/v1/web/run-from-file \
  -F "file=@urls.txt" \
  -F "source=my-source"
```

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/manifest/run | Run manifests from a file or directory path |
| POST | /api/v1/manifest/run-single | Run a single manifest file |
| POST | /api/v1/manifest/validate | Validate manifests without executing |
Examples:
```shell
# Run all manifests in a directory
curl -X POST http://localhost:8001/api/v1/manifest/run \
  -F "path=/path/to/manifests"

# Run a single manifest
curl -X POST http://localhost:8001/api/v1/manifest/run-single \
  -F "path=/path/to/manifest.yml"

# Validate manifest files
curl -X POST http://localhost:8001/api/v1/manifest/validate \
  -F "path=/path/to/manifests"
```

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Server health check |
Example:
```shell
curl http://localhost:8001/health
# Returns: {"status": "healthy"}
```

Interactive API documentation is available at:

- Swagger UI: http://localhost:8001/docs
- ReDoc: http://localhost:8001/redoc
- OpenAPI JSON: http://localhost:8001/openapi.json
The server is designed to run in containers:
```shell
# Build image
docker build -t ingester-agents:latest .

# Run with environment variables
docker run -d \
  -p 8001:8000 \
  -e ENDPOINT_URL=http://ingester:8000/api/v1 \
  -e API_KEY_ENABLED=true \
  -e API_KEY=your-secret-key \
  ingester-agents:latest

# Check health
curl http://localhost:8001/health
```

The Docker image includes:
- Non-root user for security
- Health checks for orchestration
- Proper signal handling
- Production-ready uvicorn configuration
Ensure your tokens have the required permissions:
- GitHub: `repo` scope for private repositories, public access for public repos
- Gitea: Access token with read permissions
Verify the ENDPOINT_URL is correct and the Ingester API is running:
```shell
curl http://localhost:8000/api/v1/batch/
```

For SCM agents, ensure the repository name and owner are correct. Use the exact repository name, not the URL.
```shell
# Clone repository
git clone <repository-url>
cd ingester-agents

# Install dependencies with dev tools
uv sync

# Run tests
uv run pytest

# Run linter
uv run ruff check
```

The project uses pytest with a 100% code coverage requirement:
```shell
# Run unit tests with coverage
uv run pytest

# Run specific tests
uv run pytest tests/unit/test_client.py

# Generate coverage report
uv run pytest --cov-report=html
```

The project uses Ruff for linting and code formatting:
```shell
# Check code
uv run ruff check

# Auto-fix issues
uv run ruff check --fix

# Format code
uv run ruff format
```

```text
soliplex.agents/
├── src/soliplex/agents/
│   ├── cli.py                 # Main CLI entry point (includes 'serve' command)
│   ├── client.py              # Soliplex Ingester API client
│   ├── config.py              # Configuration, settings, and manifest models
│   ├── server/                # FastAPI server
│   │   ├── __init__.py        # FastAPI app initialization, scheduler
│   │   ├── auth.py            # Authentication (API key & OAuth2 proxy)
│   │   └── routes/
│   │       ├── __init__.py
│   │       ├── fs.py          # Filesystem API endpoints
│   │       ├── scm.py         # SCM API endpoints
│   │       ├── webdav.py      # WebDAV API endpoints
│   │       ├── web.py         # Web API endpoints
│   │       └── manifest.py    # Manifest API endpoints
│   ├── common/                # Shared utilities
│   │   ├── urls_file.py       # URL list reader (local, S3, WebDAV)
│   │   ├── s3.py              # S3 object reader
│   │   └── config.py          # MIME type detection, config helpers
│   ├── fs/                    # Filesystem agent
│   │   ├── app.py             # Core filesystem logic
│   │   └── cli.py             # Filesystem CLI commands
│   ├── web/                   # Web agent
│   │   └── app.py             # Core web fetching logic
│   ├── webdav/                # WebDAV agent
│   │   ├── app.py             # Core WebDAV logic
│   │   └── cli.py             # WebDAV CLI commands
│   ├── manifest/              # Manifest runner
│   │   ├── runner.py          # YAML loading, validation, dispatch
│   │   └── cli.py             # Manifest CLI commands
│   └── scm/                   # SCM agent
│       ├── app.py             # Core SCM logic
│       ├── cli.py             # SCM CLI commands
│       ├── base.py            # Base SCM provider interface
│       ├── github/            # GitHub implementation
│       ├── gitea/             # Gitea implementation
│       └── lib/
│           ├── templates/     # Issue rendering templates
│           └── utils.py       # Utility functions
├── example-manifests/         # Example manifests (fs, scm, web, webdav, composite, delete-stale)
├── tests/                     # Test suite
│   └── unit/
│       ├── test_server_*.py   # Server API tests
│       └── ...
├── Dockerfile                 # Production container
├── .dockerignore              # Build context exclusions
└── DOCKERFILE_CHANGES.md      # Docker implementation documentation
```
CLI Layer:
- `cli.py` - Main entry point with `fs`, `web`, `scm`, `webdav`, `manifest`, and `serve` commands
- Agent-specific CLI commands in `fs/cli.py`, `webdav/cli.py`, `scm/cli.py`, and `manifest/cli.py`

Server Layer:
- `server/` - FastAPI application
- `server/auth.py` - Flexible authentication (none, API key, OAuth2 proxy)
- `server/routes/` - REST API endpoints mirroring CLI functionality

Agent Layer:
- `fs/app.py` - Filesystem operations (shared by CLI and API)
- `web/app.py` - Web page fetching and ingestion (shared by CLI and API)
- `webdav/app.py` - WebDAV operations (shared by CLI and API)
- `scm/app.py` - SCM operations (shared by CLI and API)
- `manifest/runner.py` - Manifest loading, validation, and dispatch to agents
- `client.py` - HTTP client for the Soliplex Ingester API

Configuration:
- `config.py` - Pydantic settings and manifest component models
- Environment variables or a `.env` file for configuration
- YAML manifest files for declarative multi-source ingestion
See LICENSE file for details.
For issues and questions, please open an issue on the repository.