Axigen (ai) Insight is a milter-based email analysis service that integrates with Axigen and other MTAs to perform real-time phishing detection, sentiment analysis, and email summarization using a large language model. By default it connects to any OpenAI-compatible inference endpoint (including vLLM — which can run either locally or on another server on the network). It also supports Ollama when quick proof-of-concept or testing environments are required.
This article shows how to run the full stack entirely on-premise using an NVIDIA DGX Spark box as a dedicated AI inference server, with Qwen/Qwen3.6-35B-A3B-FP8 as the example model — a 35B Mixture-of-Experts model with a 262k-token context window in FP8 quantization. All email content is processed locally: no data leaves your network, and there is no dependency on any cloud AI API.
Note: This guide uses Axigen (ai) Insight's OpenAI-compatible backend. Axigen (ai) Insight ships with Axigen X7 and is kept up to date, so X7 deployments already include it and no specific version is required. For standalone deployments on older Axigen versions, make sure Axigen (ai) Insight is updated to a build that includes the OpenAI-compatible backend.
What Is NVIDIA DGX Spark?
The NVIDIA DGX Spark is a desktop AI supercomputer powered by the GB10 Grace Blackwell Superchip. It combines a 72-core Grace CPU and a Blackwell GPU on a single die with 128 GB of unified memory — CPU and GPU share the same physical memory pool, which enables running large language models that would otherwise require a rack-mounted GPU server.
The DGX Spark platform is sold under several OEM brands:
| Vendor | Model |
|---|---|
| NVIDIA | NVIDIA DGX Spark Founders Edition |
| ASUS | ASUS GX10 |
| HP | HP ZGX Nano AI Station G1n |
| Dell | Dell DGX Spark |
| Lenovo | Lenovo DGX Spark |
All variants use the same GB10 hardware and run the same DGX OS (Ubuntu-based). The configuration steps in this article apply equally to every variant.
Architecture Overview
│ milter protocol — port 8891
▼
axigen-insight daemon
│ HTTP POST /v1/chat/completions
│ model: qwen3.6-35b
▼
vLLM inference server (dgx-spark.example.com:8000)
│
▼
Qwen/Qwen3.6-35B-A3B-FP8 weights
128 GB unified memory, GB10 Grace Blackwell
Prerequisites
- An NVIDIA DGX Spark box (any OEM variant) with DGX OS installed and network-accessible from the Axigen host
- Docker with NVIDIA Container Toolkit configured on the DGX Spark box
- A Hugging Face account with a read token (HF_TOKEN) — required to download Qwen/Qwen3.6-35B-A3B-FP8
- A compatible LLM — this guide uses Qwen/Qwen3.6-35B-A3B-FP8. Any instruction-tuned model that fits in 128 GB and returns JSON from prompt instructions will work.
- Axigen (ai) Insight installed on the Axigen mail server host (it ships with Axigen X7 and is kept up to date)
Step 1: Start vLLM on the DGX Spark Box
Create the following docker-compose.yml on the DGX Spark box (for example in ~/vllm/docker-compose.yml):
vllm-qwen3.6:
image: vllm/vllm-openai:latest
container_name: vllm-qwen3.6
restart: unless-stopped
network_mode: host
ipc: host
shm_size: 16gb
environment:
- HF_HOME=/models
- HF_TOKEN=${HF_TOKEN}
- VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
volumes:
- /data/models:/models
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command: >
Qwen/Qwen3.6-35B-A3B-FP8
--served-model-name qwen3.6-35b
--host 0.0.0.0
--port 8000
--max-model-len 262144
--gpu-memory-utilization 0.75
--enable-prefix-caching
--enable-chunked-prefill
--kv-cache-dtype fp8
--reasoning-parser qwen3
--max-num-batched-tokens 8192
--max-num-seqs 4
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
healthcheck:
test: ["CMD-SHELL", "curl -sf http://localhost:8000/health || exit 1"]
interval: 30s
timeout: 10s
retries: 5
start_period: 600s
Start and follow the logs:
docker compose up -d
docker compose logs -f
First launch downloads model weights. On the first start, vLLM downloads the Qwen3.6-35B-A3B-FP8 weights (~20 GB) from Hugging Face and compiles CUDA kernels. Download time depends on your internet connection. The API is unavailable until INFO: Application startup complete. appears in the logs. Subsequent restarts from locally cached weights take approximately 5 minutes.
Step 2: Verify the Inference Server
From the Axigen mail server host, confirm the model is available:
curl http://dgx-spark.example.com:8000/v1/models | python3 -m json.tool
The response should list the model with its served name:
"object": "list",
"data": [
{
"id": "qwen3.6-35b",
"object": "model"
}
]
}
Note the value of "id" — you will need it in the next step. The id is controlled by the --served-model-name argument in your docker-compose.yml.
Step 3: Configure Axigen (ai) Insight
Axigen (ai) Insight uses a layered configuration system. The base configuration is in config/config.yaml; local overrides live in config/config.local.yaml, which is never overwritten by upgrades.
This guide changes only the backend setting needed to point Axigen (ai) Insight at your vLLM endpoint. For everything else — the smart processing rules that decide which messages get analyzed, header behavior, fallback options, and operational tuning — follow the steps for configuring Axigen (ai) Insight in the Operation and Integration Guide.
Create or edit config/config.local.yaml on the Axigen mail server host:
llm_backend: "openai"
openai:
base_url: "http://dgx-spark.example.com:8000/v1" # replace with your DGX Spark hostname or IP
api_key: "openai" # vLLM does not enforce auth by default; any non-empty string works
model: "qwen3.6-35b" # must match --served-model-name
temperature: 0
top_p: 0.5
top_k: 20
seed: 42
max_tokens: 4096
timeout: 120s
max_retries: 3
retry_delay: 1s
think: false
Model name must match exactly. The value of model: must equal the --served-model-name passed to vLLM — in this example qwen3.6-35b, not Qwen/Qwen3.6-35B-A3B-FP8. A mismatch produces a "model not found" 404 error.
Step 4: Restart Axigen (ai) Insight and Verify
# Confirm successful connection to vLLM
journalctl -u axigen-insight -f | grep -E "backend|openai|health|error"
On successful startup you will see:
INF OpenAI backend health check passed model=qwen3.6-35b
Send a test message through Axigen and inspect the injected headers in the received email. A successfully analyzed message will carry headers similar to:
X-Axigen-Insight-Phishing-Confidence: 0.95
X-Axigen-Insight-Sentiment-Analysis: NEUTRAL
X-Axigen-Insight-Sentiment-Confidence: 0.98
X-Axigen-Insight-Summary: Internal message discussing team meeting logistics.
What to Expect
With a dedicated inference server running Qwen3.6-35B-A3B-FP8, axigen-insight processes each email in roughly 5–15 seconds end-to-end, depending on message length and current queue depth. Throughput scales with the number of concurrent connections Axigen routes through the milter — vLLM's continuous batching means short messages do not wait behind long ones.
Compared to a CPU-only or consumer-GPU Ollama setup, a dedicated GPU inference server eliminates the primary bottleneck for high-volume deployments. At typical enterprise mail volumes (tens of thousands of messages per day), average latency stays well under the milter timeout.
Technical Notes
Memory Usage
With --gpu-memory-utilization 0.75, vLLM reserves approximately 96 GB out of the 128 GB unified pool. Of this, the FP8 model weights consume roughly 20 GB; the remainder is pre-allocated KV cache. At low traffic the actual KV cache occupancy is a small fraction of that — the large allocation is intentional so vLLM knows in advance how many parallel requests it can serve without running out of memory.
Prefix Caching
With --enable-prefix-caching, vLLM caches the KV attention states for the system prompt, which is identical for every email analysis request. After a warm-up period of a few hundred requests, approximately 60% of all prompt tokens are served from cache, reducing the compute cost of each request.
FP8 Quantization
The Qwen3.6-35B-A3B-FP8 weights use FP8 quantization, which is natively supported by the GB10 Blackwell GPU in DGX Spark. This means no dequantization fallback is needed — inference runs at full hardware efficiency. The --kv-cache-dtype fp8 flag extends FP8 to the KV cache as well, further reducing memory pressure and enabling larger effective batch sizes.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| connection refused on startup | vLLM still loading the model | Wait for Application startup complete in the vLLM logs before starting Axigen (ai) Insight |
| A "model not found" error response | Model name mismatch | Query the models endpoint and use the exact id value in config.local.yaml |
| WRN Failed to load local config | YAML syntax error in config.local.yaml | Validate the file with a YAML parser before restarting the service |
| High failure rate in workflow statistics | JSON parse failure or timeout | Check the Axigen (ai) Insight service logs for errors; increase timeout: if requests are racing the clock under burst load |