Back to Blog Home
Jun 24, 2026

How to Run Axigen (ai) Insight with a Local LLM on NVIDIA DGX Spark

Axigen (ai) Insight is a milter-based email analysis service that integrates with Axigen and other MTAs to perform real-time phishing detection, sentiment analysis, and email summarization using a large language model. By default it connects to any OpenAI-compatible inference endpoint (including vLLM — which can run either locally or on another server on the network). It also supports Ollama when quick proof-of-concept or testing environments are required.

This article shows how to run the full stack entirely on-premise using an NVIDIA DGX Spark box as a dedicated AI inference server, with Qwen/Qwen3.6-35B-A3B-FP8 as the example model — a 35B Mixture-of-Experts model with a 262k-token context window in FP8 quantization. All email content is processed locally: no data leaves your network, and there is no dependency on any cloud AI API.

Axigen (ai) Insight on NVIDIA DGX Spark

Note: This guide uses Axigen (ai) Insight's OpenAI-compatible backend. Axigen (ai) Insight ships with Axigen X7 and is kept up to date, so X7 deployments already include it and no specific version is required. For standalone deployments on older Axigen versions, make sure Axigen (ai) Insight is updated to a build that includes the OpenAI-compatible backend.

What Is NVIDIA DGX Spark?

The NVIDIA DGX Spark is a desktop AI supercomputer powered by the GB10 Grace Blackwell Superchip. It combines a 72-core Grace CPU and a Blackwell GPU on a single die with 128 GB of unified memory — CPU and GPU share the same physical memory pool, which enables running large language models that would otherwise require a rack-mounted GPU server.

The DGX Spark platform is sold under several OEM brands:

Vendor Model
NVIDIA NVIDIA DGX Spark Founders Edition
ASUS ASUS GX10
HP HP ZGX Nano AI Station G1n
Dell Dell DGX Spark
Lenovo Lenovo DGX Spark

All variants use the same GB10 hardware and run the same DGX OS (Ubuntu-based). The configuration steps in this article apply equally to every variant.

Architecture Overview

Axigen MTA  (mail-server.example.com)
    │  milter protocol — port 8891
    ▼
axigen-insight daemon
    │  HTTP POST /v1/chat/completions
    │  model: qwen3.6-35b
    ▼
vLLM inference server  (dgx-spark.example.com:8000)
    │
    ▼
Qwen/Qwen3.6-35B-A3B-FP8 weights
    128 GB unified memory, GB10 Grace Blackwell

Prerequisites

  • An NVIDIA DGX Spark box (any OEM variant) with DGX OS installed and network-accessible from the Axigen host
  • Docker with NVIDIA Container Toolkit configured on the DGX Spark box
  • A Hugging Face account with a read token (HF_TOKEN) — required to download Qwen/Qwen3.6-35B-A3B-FP8
  • A compatible LLM — this guide uses Qwen/Qwen3.6-35B-A3B-FP8. Any instruction-tuned model that fits in 128 GB and returns JSON from prompt instructions will work.
  • Axigen (ai) Insight installed on the Axigen mail server host (it ships with Axigen X7 and is kept up to date)

Step 1: Start vLLM on the DGX Spark Box

Create the following docker-compose.yml on the DGX Spark box (for example in ~/vllm/docker-compose.yml):

services:
  vllm-qwen3.6:
    image: vllm/vllm-openai:latest
    container_name: vllm-qwen3.6
    restart: unless-stopped
    network_mode: host
    ipc: host
    shm_size: 16gb
    environment:
      - HF_HOME=/models
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    volumes:
      - /data/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      Qwen/Qwen3.6-35B-A3B-FP8
      --served-model-name qwen3.6-35b
      --host 0.0.0.0
      --port 8000
      --max-model-len 262144
      --gpu-memory-utilization 0.75
      --enable-prefix-caching
      --enable-chunked-prefill
      --kv-cache-dtype fp8
      --reasoning-parser qwen3
      --max-num-batched-tokens 8192
      --max-num-seqs 4
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 600s

 

Start and follow the logs:

cd ~/vllm
docker compose up -d
docker compose logs -f

 

First launch downloads model weights. On the first start, vLLM downloads the Qwen3.6-35B-A3B-FP8 weights (~20 GB) from Hugging Face and compiles CUDA kernels. Download time depends on your internet connection. The API is unavailable until INFO: Application startup complete. appears in the logs. Subsequent restarts from locally cached weights take approximately 5 minutes.

Step 2: Verify the Inference Server

From the Axigen mail server host, confirm the model is available:

curl http://dgx-spark.example.com:8000/v1/models | python3 -m json.tool

The response should list the model with its served name:

{
    "object": "list",
    "data": [
        {
            "id": "qwen3.6-35b",
            "object": "model"
        }
    ]
}

 

Note the value of "id" — you will need it in the next step. The id is controlled by the --served-model-name argument in your docker-compose.yml.

Step 3: Configure Axigen (ai) Insight

Axigen (ai) Insight uses a layered configuration system. The base configuration is in config/config.yaml; local overrides live in config/config.local.yaml, which is never overwritten by upgrades.

This guide changes only the backend setting needed to point Axigen (ai) Insight at your vLLM endpoint. For everything else — the smart processing rules that decide which messages get analyzed, header behavior, fallback options, and operational tuning — follow the steps for configuring Axigen (ai) Insight in the Operation and Integration Guide.

Create or edit config/config.local.yaml on the Axigen mail server host:

# Use the OpenAI-compatible backend (vLLM)
llm_backend: "openai"

openai:
  base_url: "http://dgx-spark.example.com:8000/v1"  # replace with your DGX Spark hostname or IP
  api_key: "openai"  # vLLM does not enforce auth by default; any non-empty string works
  model: "qwen3.6-35b"  # must match --served-model-name
  temperature: 0
  top_p: 0.5
  top_k: 20
  seed: 42
  max_tokens: 4096
  timeout: 120s
  max_retries: 3
  retry_delay: 1s
  think: false

 

Model name must match exactly. The value of model: must equal the --served-model-name passed to vLLM — in this example qwen3.6-35b, not Qwen/Qwen3.6-35B-A3B-FP8. A mismatch produces a "model not found" 404 error.

Step 4: Restart Axigen (ai) Insight and Verify

systemctl restart axigen-insight

# Confirm successful connection to vLLM
journalctl -u axigen-insight -f | grep -E "backend|openai|health|error"

 

On successful startup you will see:

INF Starting axigen-insight ... llm_backend=openai
INF OpenAI backend health check passed model=qwen3.6-35b

 

Send a test message through Axigen and inspect the injected headers in the received email. A successfully analyzed message will carry headers similar to:

X-Axigen-Insight-Phishing-Analysis: SAFE
X-Axigen-Insight-Phishing-Confidence: 0.95
X-Axigen-Insight-Sentiment-Analysis: NEUTRAL
X-Axigen-Insight-Sentiment-Confidence: 0.98
X-Axigen-Insight-Summary: Internal message discussing team meeting logistics.

What to Expect

With a dedicated inference server running Qwen3.6-35B-A3B-FP8, axigen-insight processes each email in roughly 5–15 seconds end-to-end, depending on message length and current queue depth. Throughput scales with the number of concurrent connections Axigen routes through the milter — vLLM's continuous batching means short messages do not wait behind long ones.

Compared to a CPU-only or consumer-GPU Ollama setup, a dedicated GPU inference server eliminates the primary bottleneck for high-volume deployments. At typical enterprise mail volumes (tens of thousands of messages per day), average latency stays well under the milter timeout.

Technical Notes

Memory Usage

With --gpu-memory-utilization 0.75, vLLM reserves approximately 96 GB out of the 128 GB unified pool. Of this, the FP8 model weights consume roughly 20 GB; the remainder is pre-allocated KV cache. At low traffic the actual KV cache occupancy is a small fraction of that — the large allocation is intentional so vLLM knows in advance how many parallel requests it can serve without running out of memory.

Prefix Caching

With --enable-prefix-caching, vLLM caches the KV attention states for the system prompt, which is identical for every email analysis request. After a warm-up period of a few hundred requests, approximately 60% of all prompt tokens are served from cache, reducing the compute cost of each request.

FP8 Quantization

The Qwen3.6-35B-A3B-FP8 weights use FP8 quantization, which is natively supported by the GB10 Blackwell GPU in DGX Spark. This means no dequantization fallback is needed — inference runs at full hardware efficiency. The --kv-cache-dtype fp8 flag extends FP8 to the KV cache as well, further reducing memory pressure and enabling larger effective batch sizes.

Troubleshooting

Symptom Likely cause Fix
connection refused on startup vLLM still loading the model Wait for Application startup complete in the vLLM logs before starting Axigen (ai) Insight
A "model not found" error response Model name mismatch Query the models endpoint and use the exact id value in config.local.yaml
WRN Failed to load local config YAML syntax error in config.local.yaml Validate the file with a YAML parser before restarting the service
High failure rate in workflow statistics JSON parse failure or timeout Check the Axigen (ai) Insight service logs for errors; increase timeout: if requests are racing the clock under burst load

Related Resources

 

About the author:

Bogdan Moldovan

In my career of 25 years in IT, I’ve gathered vast experience & know-how in everything related to software development, telecom, VoIP, business development, sales, management, and more. My articles are generally technical & tailored to the email geeks out there — but I also like talking about thought leadership ideas and management topics from my unique perspective as CEO of Axigen.