What is CMVK?
CMVK (Cross-Model Verification Kernel) is a hallucination detection system that verifies AI outputs by comparing responses across multiple language models. Instead of trusting a single model's response, CMVK queries multiple models and uses consensus algorithms to determine accuracy and reliability.
Why Multi-Model Consensus?
Different LLMs have different training data and failure modes. When GPT-4, Claude, and Gemini all agree on a factual answer, it's far more likely to be accurate than any single model's response. CMVK leverages this principle for high-stakes decisions.
Key Features
- Multi-Model Queries — Query GPT-4, Claude, Gemini, and more simultaneously
- Consensus Algorithms — Configurable thresholds for agreement levels
- Semantic Comparison — Uses embeddings to compare meaning, not just text
- Drift Detection — Monitor model behavior changes over time
- Confidence Scoring — Quantify reliability of each verification
- Batch Processing — Efficiently verify multiple prompts at once
Installation
Install CMVK as a standalone module or as part of the full Agent OS kernel:
```bash
# Standalone installation
pip install agent-os-cmvk

# Or install via the kernel with CMVK extras
pip install agent-os-kernel[cmvk]

# Install with all LLM provider support
pip install agent-os-cmvk[all]

# Install with specific providers
pip install agent-os-cmvk[openai]     # GPT-4, GPT-3.5
pip install agent-os-cmvk[anthropic]  # Claude models
pip install agent-os-cmvk[google]     # Gemini models
```
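To confirm the install worked, a quick smoke test using the same import path as the examples below:

```python
# Smoke test: this import should succeed after installation
from agent_os.cmvk import verify

print("CMVK is installed")
```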
Environment Variables
Configure API keys for each provider you want to use:
```bash
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
```
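Assuming CMVK reads these keys from the process environment, a .env file must be loaded before your verification code runs. A minimal sketch using the third-party python-dotenv package (not part of CMVK):

```python
# Minimal sketch: load API keys from a .env file with python-dotenv
# (a separate package: pip install python-dotenv)
import os

from dotenv import load_dotenv

load_dotenv()  # copies entries from .env into os.environ

# Warn about any provider keys that are missing
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    if not os.environ.get(key):
        print(f"Warning: {key} is not set")
```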
Quick Example
The simplest way to use CMVK is with the verify() function:
```python
from agent_os.cmvk import verify

# Verify a factual claim across multiple models
result = await verify(
    prompt="What is the capital of France?",
    models=["gpt-4", "claude-3-sonnet", "gemini-pro"],
    threshold=0.8
)

print(f"Consensus: {result.consensus}")        # True
print(f"Confidence: {result.confidence:.2f}")  # 0.95
print(f"Answer: {result.answer}")              # "Paris"
print(f"Agreement: {result.agreement_ratio}")  # 1.0 (all models agreed)

# Check which models responded
for model, response in result.responses.items():
    print(f"{model}: {response.answer}")
```
Handling Verification Failures
When models disagree, CMVK provides detailed information:
```python
from agent_os.cmvk import verify

result = await verify(
    prompt="What is the best programming language?",
    models=["gpt-4", "claude-3-sonnet", "gemini-pro"],
    threshold=0.9
)

if not result.consensus:
    print("Models disagreed!")
    print(f"Confidence: {result.confidence:.2f}")

    # Inspect individual responses
    for model, response in result.responses.items():
        print(f"  {model}: {response.answer}")

    # Get the majority answer even without consensus
    print(f"Majority answer: {result.majority_answer}")
    print(f"Dissenting models: {result.dissenting_models}")
```
ConsensusVerifier Class
For more control over verification behavior, use the ConsensusVerifier class.
This allows you to configure model weights, comparison strategies, and caching.
```python
from agent_os.cmvk import ConsensusVerifier, ModelConfig

# Create a verifier with custom configuration
verifier = ConsensusVerifier(
    models=[
        ModelConfig("gpt-4", weight=1.0, timeout=30),
        ModelConfig("claude-3-opus", weight=1.2, timeout=45),  # Higher weight
        ModelConfig("gemini-pro", weight=0.8, timeout=25),
    ],
    threshold=0.85,
    comparison_strategy="semantic",  # "exact", "semantic", or "fuzzy"
    cache_responses=True,
    max_retries=3
)

# Verify a claim
result = await verifier.verify("The speed of light is approximately 300,000 km/s")

print(f"Verified: {result.consensus}")
print(f"Weighted confidence: {result.weighted_confidence:.2f}")
```
Comparison Strategies
| Strategy | Description | Best For |
|---|---|---|
| `exact` | Exact string matching (case-insensitive) | Single-word answers, numbers |
| `semantic` | Embedding-based similarity comparison | Natural language answers |
| `fuzzy` | Token-level fuzzy matching | Lists, structured data |
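The strategy determines when two answers count as "the same". A sketch of the practical difference, using the ConsensusVerifier API shown above (the prompt is illustrative): exact matching fails on paraphrases like "Paris" vs "The capital is Paris", while semantic comparison should accept them.

```python
from agent_os.cmvk import ConsensusVerifier

# Same question, two comparison strategies (illustrative sketch).
# "Paris" vs "The capital of France is Paris" fails exact matching,
# but should pass semantic comparison since the meanings are close.
exact = ConsensusVerifier(
    models=["gpt-4", "claude-3-sonnet"],
    comparison_strategy="exact",
)
semantic = ConsensusVerifier(
    models=["gpt-4", "claude-3-sonnet"],
    comparison_strategy="semantic",
)

prompt = "What is the capital of France?"
print((await exact.verify(prompt)).consensus)     # may be False if phrasing differs
print((await semantic.verify(prompt)).consensus)  # tolerant of paraphrase
```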
Model Weighting
Assign different weights to models based on their reliability for your use case:
```python
from agent_os.cmvk import ConsensusVerifier, ModelConfig

# For code-related questions, weight models differently
code_verifier = ConsensusVerifier(
    models=[
        ModelConfig("gpt-4", weight=1.0),
        ModelConfig("claude-3-opus", weight=1.3),  # Better at code
        ModelConfig("gemini-pro", weight=0.7),
    ],
    threshold=0.8,
    comparison_strategy="exact"  # Code needs exact matching
)

# For factual questions, use different weights
fact_verifier = ConsensusVerifier(
    models=[
        ModelConfig("gpt-4", weight=1.2),  # Strong on facts
        ModelConfig("claude-3-sonnet", weight=1.0),
        ModelConfig("gemini-pro", weight=1.1),
    ],
    threshold=0.9,
    comparison_strategy="semantic"
)
```
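Neither verifier is invoked in the snippet above; a short usage sketch with illustrative prompts:

```python
# Route each question to the verifier weighted for its domain (illustrative)
code_result = await code_verifier.verify("Does sorted([3, 1, 2]) return [1, 2, 3]?")
fact_result = await fact_verifier.verify("Is Mount Everest the tallest mountain above sea level?")

print(f"Code check: {code_result.consensus}")
print(f"Fact check: {fact_result.consensus}")
```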
DriftDetector Class
The DriftDetector monitors how model responses change over time. This is
crucial for detecting when a model's behavior shifts due to updates or fine-tuning.
```python
from agent_os.cmvk import DriftDetector

# Initialize drift detector with baseline responses
detector = DriftDetector(
    models=["gpt-4", "claude-3-sonnet"],
    baseline_prompts=[
        "What is 2+2?",
        "What is the capital of France?",
        "Who wrote Romeo and Juliet?",
    ],
    drift_threshold=0.15,  # Alert if drift exceeds 15%
    check_interval_hours=24
)

# Establish baseline
await detector.establish_baseline()

# Later, check for drift
drift_report = await detector.check_drift()

print(f"Overall drift: {drift_report.overall_drift:.2%}")
print(f"Drift detected: {drift_report.drift_detected}")

# Inspect per-model drift
for model, metrics in drift_report.model_metrics.items():
    print(f"\n{model}:")
    print(f"  Semantic drift: {metrics.semantic_drift:.2%}")
    print(f"  Response length change: {metrics.length_change:.1f}%")
    print(f"  Tone shift: {metrics.tone_shift:.2%}")
```
Drift Alerts
Configure automatic alerts when drift exceeds thresholds:
```python
from agent_os.cmvk import DriftDetector, DriftAlert

detector = DriftDetector(
    models=["gpt-4", "claude-3-sonnet"],
    drift_threshold=0.15,
    alerts=[
        DriftAlert(
            type="webhook",
            url="https://your-api.com/drift-alert",
            min_severity="warning"
        ),
        DriftAlert(
            type="email",
            recipient="alerts@company.com",
            min_severity="critical"
        ),
    ]
)

# Run continuous monitoring
await detector.start_monitoring()  # Runs in background
```
Batch Verification
For high-throughput scenarios, use batch verification to efficiently process multiple prompts with optimized API calls.
```python
from agent_os.cmvk import ConsensusVerifier, BatchVerifier

# Create batch verifier
verifier = ConsensusVerifier(
    models=["gpt-4", "claude-3-sonnet", "gemini-pro"],
    threshold=0.8
)

batch = BatchVerifier(
    verifier=verifier,
    batch_size=10,     # Process 10 prompts at a time
    max_concurrent=5,  # Max 5 concurrent API calls per model
    retry_failed=True
)

# Verify multiple claims
claims = [
    "The Earth orbits the Sun",
    "Water boils at 100°C at sea level",
    "Python was created by Guido van Rossum",
    "The Great Wall of China is visible from space",  # Common misconception
    "Lightning never strikes the same place twice",   # Myth
]

results = await batch.verify_all(claims)

# Process results
for claim, result in zip(claims, results):
    status = "✓" if result.consensus else "✗"
    print(f"{status} {claim}")
    if not result.consensus:
        print(f"  Confidence: {result.confidence:.2f}")
```
Batch with Progress Tracking
```python
from agent_os.cmvk import BatchVerifier
from tqdm import tqdm

async def verify_with_progress(claims: list[str]):
    batch = BatchVerifier(verifier=verifier)
    results = []
    with tqdm(total=len(claims), desc="Verifying") as pbar:
        async for result in batch.verify_stream(claims):
            results.append(result)
            pbar.update(1)  # advance the progress bar as each result arrives
    return results

# Or use a callback
async def on_result(index: int, result):
    print(f"Claim {index}: {'✓' if result.consensus else '✗'}")

await batch.verify_all(claims, on_complete=on_result)
```
Configuration Options
CMVK offers extensive configuration options for fine-tuning verification behavior:
Threshold Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | float | 0.8 | Minimum confidence for consensus (0.0-1.0) |
| `min_models` | int | 2 | Minimum models that must respond |
| `require_unanimous` | bool | False | Require all models to agree |
Model Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | list | ["gpt-4", "claude-3"] | Models to query for verification |
| `timeout` | int | 30 | Seconds to wait per model response |
| `max_retries` | int | 3 | Retry attempts for failed API calls |
| `fallback_models` | list | [] | Backup models if primary fails |
Full Configuration Example
```python
from agent_os.cmvk import ConsensusVerifier, ModelConfig, CMVKConfig

config = CMVKConfig(
    # Threshold settings
    threshold=0.85,
    min_models=2,
    require_unanimous=False,

    # Model settings
    models=[
        ModelConfig(
            name="gpt-4",
            weight=1.0,
            timeout=30,
            temperature=0.0,  # Deterministic
            max_tokens=500
        ),
        ModelConfig(
            name="claude-3-opus",
            weight=1.2,
            timeout=45,
            temperature=0.0,
            max_tokens=500
        ),
        ModelConfig(
            name="gemini-pro",
            weight=0.9,
            timeout=25,
            temperature=0.0,
            max_tokens=500
        ),
    ],
    fallback_models=["gpt-3.5-turbo", "claude-3-haiku"],

    # Comparison settings
    comparison_strategy="semantic",
    embedding_model="text-embedding-3-small",
    similarity_threshold=0.85,

    # Performance settings
    cache_responses=True,
    cache_ttl_seconds=3600,
    max_concurrent_requests=10,

    # Retry settings
    max_retries=3,
    retry_delay_seconds=1.0,
    exponential_backoff=True
)

verifier = ConsensusVerifier(config=config)
```
Use Cases
1. Fact-Checking
Verify factual claims before presenting them to users:
```python
from agent_os.cmvk import verify

async def fact_check(claim: str) -> dict:
    """Verify a factual claim with high confidence."""
    result = await verify(
        prompt=f"Is this statement true or false? Answer only 'true' or 'false': {claim}",
        models=["gpt-4", "claude-3-opus", "gemini-pro"],
        threshold=0.9,
        comparison_strategy="exact"
    )
    return {
        "claim": claim,
        "verified": result.consensus and result.answer.lower() == "true",
        "confidence": result.confidence,
        "sources": [m for m, r in result.responses.items() if r.answer.lower() == "true"]
    }

# Usage
result = await fact_check("The Eiffel Tower is in Paris")
print(f"Verified: {result['verified']} (confidence: {result['confidence']:.0%})")
```
2. Code Validation
Verify that generated code is correct before execution:
````python
from agent_os.cmvk import ConsensusVerifier

code_verifier = ConsensusVerifier(
    models=["gpt-4", "claude-3-opus"],
    threshold=0.95,  # High threshold for code
    comparison_strategy="exact"
)

async def validate_code(code: str, expected_behavior: str) -> bool:
    """Verify code does what it's supposed to do."""
    prompt = f"""
    Analyze this code and determine if it correctly implements the expected behavior.

    Code:
    ```python
    {code}
    ```

    Expected behavior: {expected_behavior}

    Answer only "correct" or "incorrect".
    """
    result = await code_verifier.verify(prompt)
    return result.consensus and result.answer.lower() == "correct"

# Usage
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

is_valid = await validate_code(code, "Calculate the nth Fibonacci number")
print(f"Code is valid: {is_valid}")
````
3. Content Quality Assurance
Verify AI-generated content meets quality standards:
```python
from agent_os.cmvk import ConsensusVerifier

qa_verifier = ConsensusVerifier(
    models=["gpt-4", "claude-3-sonnet", "gemini-pro"],
    threshold=0.8,
    comparison_strategy="semantic"
)

async def qa_content(content: str, guidelines: list[str]) -> dict:
    """Check if content meets quality guidelines."""
    checks = {}
    for guideline in guidelines:
        prompt = f"""
        Does this content follow this guideline? Answer 'yes' or 'no'.

        Guideline: {guideline}

        Content:
        {content}
        """
        result = await qa_verifier.verify(prompt)
        checks[guideline] = {
            "passes": result.consensus and result.answer.lower() == "yes",
            "confidence": result.confidence
        }
    return {
        "all_passed": all(c["passes"] for c in checks.values()),
        "checks": checks
    }

# Usage
guidelines = [
    "Content is professional and appropriate",
    "No harmful or misleading information",
    "Grammar and spelling are correct"
]
qa_result = await qa_content("Your AI-generated article here...", guidelines)
```
4. Medical/Legal Verification
High-stakes verification with unanimous consensus:
```python
from agent_os.cmvk import ConsensusVerifier, ModelConfig

# Ultra-strict verification for high-stakes scenarios
strict_verifier = ConsensusVerifier(
    models=[
        ModelConfig("gpt-4", weight=1.0, timeout=60),
        ModelConfig("claude-3-opus", weight=1.0, timeout=60),
        ModelConfig("gemini-pro", weight=1.0, timeout=60),
    ],
    threshold=0.99,
    require_unanimous=True,  # ALL models must agree
    min_models=3,
    comparison_strategy="semantic"
)

async def verify_medical_claim(claim: str) -> dict:
    """Verify medical information with maximum scrutiny."""
    result = await strict_verifier.verify(
        f"Is this medical statement accurate according to current medical consensus? {claim}"
    )

    if not result.consensus:
        return {
            "verified": False,
            "warning": "CAUTION: Models did not reach unanimous consensus",
            "confidence": result.confidence,
            "recommendation": "Consult a medical professional"
        }

    return {
        "verified": True,
        "confidence": result.confidence,
        "disclaimer": "This is AI-generated verification, not medical advice"
    }
```
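Following the # Usage pattern of the earlier examples, a brief call sketch (the claim is illustrative):

```python
# Usage (illustrative claim)
report = await verify_medical_claim("Vitamin C is a water-soluble vitamin")
print(f"Verified: {report['verified']} (confidence: {report['confidence']:.0%})")
```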
Integration with KernelSpace
CMVK integrates seamlessly with the Agent OS KernelSpace for automatic verification of agent outputs.
```python
from agent_os import KernelSpace
from agent_os.cmvk import ConsensusVerifier

# Create kernel with CMVK integration
kernel = KernelSpace(
    policy="strict",
    verification=ConsensusVerifier(
        models=["gpt-4", "claude-3-sonnet"],
        threshold=0.8
    )
)

@kernel.register
async def research_agent(query: str):
    """Agent that automatically verifies its outputs."""
    # Your agent logic here
    response = await llm.complete(query)
    # KernelSpace automatically verifies before returning
    return response

# The kernel will:
# 1. Run the agent
# 2. Verify the output with CMVK
# 3. Block the response if consensus is not reached
# 4. Log verification results to the Flight Recorder
```
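The snippet doesn't show how a blocked response reaches the caller. One plausible pattern, assuming the kernel raises the VerificationError that agent_os.cmvk exports when consensus is not reached (this raising behavior is an assumption, not documented above; see the API Reference):

```python
# Sketch: handling a blocked response at the call site.
# ASSUMPTION: KernelSpace surfaces a failed verification as
# agent_os.cmvk.VerificationError; confirm against the API Reference.
from agent_os.cmvk import VerificationError

try:
    answer = await research_agent("Summarize the latest findings on fusion energy")
except VerificationError as err:
    # Fall back safely: log, degrade, or escalate to human review
    print(f"Response blocked by CMVK: {err}")
    answer = "I couldn't verify this answer; please consult a primary source."
```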
Policy-Based Verification
Configure verification requirements per policy:
```python
from agent_os import KernelSpace, Policy
from agent_os.cmvk import ConsensusVerifier

# Define verification policy
verification_policy = Policy(
    name="high-stakes-verification",
    rules=[
        {
            "action": "financial_decision",
            "verify": True,
            "threshold": 0.95,
            "models": ["gpt-4", "claude-3-opus", "gemini-pro"]
        },
        {
            "action": "general_query",
            "verify": True,
            "threshold": 0.7,
            "models": ["gpt-4", "claude-3-sonnet"]
        },
        {
            "action": "internal_logging",
            "verify": False  # Skip verification for logs
        }
    ]
)

kernel = KernelSpace(
    policies=[verification_policy],
    cmvk=ConsensusVerifier()
)

@kernel.register(action="financial_decision")
async def approve_transaction(amount: float, recipient: str):
    """High-stakes action requiring 95% consensus."""
    # Automatically verified with strict threshold
    return {"approved": True, "amount": amount}
```
Verification Events
Subscribe to verification events for monitoring:
```python
from agent_os import KernelSpace
from agent_os.cmvk import VerificationEvent

kernel = KernelSpace(policy="strict")

@kernel.on(VerificationEvent.CONSENSUS_REACHED)
async def on_consensus(event):
    print(f"✓ Verified: {event.confidence:.0%} confidence")

@kernel.on(VerificationEvent.CONSENSUS_FAILED)
async def on_failure(event):
    print(f"✗ Verification failed: {event.confidence:.0%}")
    print(f"  Dissenting models: {event.dissenting_models}")
    # Optionally escalate to human review
    await notify_human_reviewer(event)

@kernel.on(VerificationEvent.MODEL_ERROR)
async def on_error(event):
    print(f"⚠ Model {event.model} failed: {event.error}")
    # Failover to backup models is handled automatically
```
Next Steps
- API Reference — Complete CMVK API documentation
- EMK Module — Store verification results in episodic memory
- Examples — Production-ready CMVK examples
- Integrations — Use CMVK with LangChain, CrewAI