Headroom

Python context-optimization middleware: ContentRouter smart routing + SmartCrusher statistical compression + CCR on-demand retrieval

Architecture Overview

The Problem: Tool-Output Bloat

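In agentic coding, every tool call the LLM makes (reading a file, running a command, searching code, querying an API) appends its full output to the context. In a typical two-hour coding session, tool output makes up more than 80% of the total context. JSON arrays (search results, database queries) often hold hundreds of records, build logs can contain thousands of lines of repetitive compiler output, and source files carry docstrings and comments that are irrelevant to the task at hand. This redundancy fills a 200K-token context window at a rate of several thousand tokens per turn.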

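Typical agentic session: share of tool-output tokens by content type

JSON arrays: 42% (search results, DB queries, API responses)
Source code: 25% (file reads, code search results)
Build logs: 18% (pytest, cargo build, npm test)
Plain text: 10% (docs, READMEs, config files)
Other: 5% (diffs, HTML, binary previews)

Total for a two-hour session: roughly 150K to 300K tokens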

ContentRouter: Smart Routing

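Headroom's core idea is that different kinds of content need different compression algorithms. ContentRouter is the entry-point dispatcher: it first identifies the content type (JSON array, source code, build log, grep search results, git diff, HTML page) using Magika (Google's open-source ML content-type detector) or a regex fallback, then routes each type to a dedicated compressor. JSON goes to SmartCrusher (statistical sampling), code to CodeCompressor (tree-sitter AST-aware; strips comments and docstrings), logs to LogCompressor (aggregates repeated lines, keeps errors), and plain text to Kompress (INT8-quantized ModernBERT inference).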

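Example

Input: [{"id": 1, "name": "Alice", "score": 95}, {"id": 2, ...}]

detect → SmartCrusher (compression ratio 60-90%)

Detection rule: the content parses as JSON and its top level is an array → SmartCrusher (statistical analysis + adaptive sampling)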
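The JSON-array rule above is easy to express directly. Below is a minimal regex fallback under stated assumptions: the function name `detect_content_type_fallback`, the returned string labels, and every pattern other than the JSON-array check are hypothetical heuristics, not Headroom's actual rules.

```python
import json
import re

# Hypothetical patterns; Headroom's real heuristics may differ.
_DIFF_RE = re.compile(r"^(diff --git |@@ -\d+(,\d+)? \+\d+(,\d+)? @@)", re.MULTILINE)
_LOG_RE = re.compile(r"^(ERROR|WARN|WARNING|INFO|FAILED|PASSED|\s*Compiling )", re.MULTILINE)
_CODE_RE = re.compile(r"^\s*(def |class |import |from \S+ import |fn |func |function )", re.MULTILINE)


def detect_content_type_fallback(content: str) -> str:
    """Classify tool output with cheap heuristics (labels are assumptions)."""
    stripped = content.strip()
    # Rule from the text: parses as JSON and the top level is an array.
    if stripped.startswith("["):
        try:
            if isinstance(json.loads(stripped), list):
                return "json_array"
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            pass
    if _DIFF_RE.search(content):
        return "git_diff"
    if len(_LOG_RE.findall(content)) >= 3:  # several log-style lines
        return "build_output"
    if _CODE_RE.search(content):
        return "source_code"
    return "plain_text"
```

Ordering matters in a heuristic like this: the strict JSON check runs first because it has almost no false positives, while the looser log and code patterns run last.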

SmartCrusher: Statistics-Driven JSON Compression

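Given a 500-record JSON array, SmartCrusher does not simply truncate it; it runs a full statistical analysis. For each field it computes the uniqueness ratio, the variance, and change points (a sliding window compares the mean before and after each position; a difference of more than 2 standard deviations marks a change point). It then uses the Kneedle algorithm to compute an adaptive K: it finds the "knee" of the information-density curve, which decides how many items to keep. The retained items are: the first and last K items (anchors); items containing error, exception, or failed (never dropped); numeric outliers (>2σ); items near change points; and items semantically relevant to the user's query (scored with BM25 or embeddings).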
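The adaptive-K step can be illustrated with a simplified Kneedle: normalize the curve to [0, 1] on both axes, then pick the point farthest above the diagonal. This is a sketch, not Headroom's `compute_optimal_k()`: it omits Kneedle's smoothing and sensitivity threshold, and the information-density curve used here (cumulative count of distinct field values as items are added) is an assumption.

```python
import json


def kneedle_knee(ys: list[float]) -> int:
    """Index of the knee of a concave, increasing curve (simplified Kneedle).

    x is the index 0..n-1; both axes are min-max normalized and the knee is
    the point with the largest gap y_norm - x_norm above the diagonal.
    """
    n = len(ys)
    if n < 3:
        return n - 1
    y_min, y_max = min(ys), max(ys)
    if y_max == y_min:  # flat curve: no new information after the first item
        return 0
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(ys):
        gap = (y - y_min) / (y_max - y_min) - i / (n - 1)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def adaptive_k(items: list[dict], min_k: int = 3) -> int:
    """Hypothetical density curve: cumulative distinct (field, value) pairs."""
    seen: set[tuple[str, str]] = set()
    curve = []
    for item in items:
        for key, value in item.items():
            seen.add((key, json.dumps(value, sort_keys=True)))
        curve.append(float(len(seen)))
    if not curve:
        return 0
    k = kneedle_knee(curve) + 1  # knee index -> number of items kept
    return min(len(items), max(min_k, k))
```

With five distinct statuses followed by fifteen repeats, the curve stops rising after the fifth item, so the knee lands there and K is 5.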

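Figure: 50 items → 12 kept (38 trimmed). Legend: head/tail anchors, outliers (>2σ), error entries, change points, query-relevant, trimmed.

Token reduction: 76%. Reversible via CCR retrieval.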
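A minimal sketch of how the retention rules above combine into one keep set, assuming a single numeric field, a case-insensitive substring match for the error words, and change points measured against the field's global standard deviation. Query relevance (BM25 or embeddings) is left out, and `select_keep_indices` and `change_points` are hypothetical names.

```python
import statistics

ERROR_WORDS = ("error", "exception", "failed")


def change_points(values: list[float], window: int = 5) -> set[int]:
    """Indices i where mean(values[i:i+window]) and mean(values[i-window:i])
    differ by more than 2 standard deviations of the whole series."""
    if len(values) < 2 * window:
        return set()
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return set()
    points = set()
    for i in range(window, len(values) - window + 1):
        before = statistics.fmean(values[i - window:i])
        after = statistics.fmean(values[i:i + window])
        if abs(after - before) > 2 * sigma:
            points.add(i)
    return points


def select_keep_indices(items: list[dict], field: str, k: int) -> list[int]:
    n = len(items)
    values = [float(item.get(field, 0)) for item in items]
    # 1. Head/tail anchors
    keep = set(range(min(k, n))) | set(range(max(0, n - k), n))
    # 2. Error entries are never dropped
    keep |= {i for i, item in enumerate(items)
             if any(word in str(item).lower() for word in ERROR_WORDS)}
    # 3. Numeric outliers (more than 2 sigma from the mean)
    if n >= 2:
        mu, sigma = statistics.fmean(values), statistics.pstdev(values)
        if sigma > 0:
            keep |= {i for i, v in enumerate(values) if abs(v - mu) > 2 * sigma}
    # 4. Change points in the field's values
    keep |= change_points(values)
    return sorted(keep)
```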

CacheAligner: Prefix Stabilization

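Both the Anthropic and OpenAI APIs offer prompt caching: if a request's message prefix matches that of an earlier request, the provider can reuse the already-computed key/value (KV) tensors for that prefix instead of recomputing them, saving roughly 90% of the prefill cost. But system prompts often embed dynamic content (the current time "2026-04-15T10:30:00", a request UUID, an API key) that differs on every request, so the prefix hash changes and the cache never hits. CacheAligner uses DynamicContentDetector to recognize 15+ kinds of dynamic patterns (UUIDs, JWTs, timestamps, hex hashes, high-entropy strings) and moves them out of the body of the system prompt into an appendix at the end, keeping the prefix byte-for-byte identical.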

Before: the prefix changes on every request

You are a coding assistant.
Current project: token-lab
Today is 2026-04-15T10:30:00
Request ID: a3f7b2c1-...
Use tools when needed.

✗ prefix hash: e7a2f... → changes on every request

CacheAligner

After: stable prefix + dynamic appendix

You are a coding assistant.
Current project: token-lab
Use tools when needed.

Dynamic Appendix:
date: 2026-04-15T10:30:00
request_id: a3f7b2c1-...

✓ prefix hash: c4d8a... → stable across requests

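Effect: Anthropic bills cache reads at 10% of the normal input price, so once the prefix is stable, the system prompt of a repeated conversation can be served almost entirely from cache.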
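The prefix-stabilization idea can be sketched with two of the patterns mentioned above (UUIDs and ISO-8601 timestamps). This is an illustration, not DynamicContentDetector: `align_prefix` is a hypothetical name, it moves whole lines rather than extracting key/value pairs, and it uses SHA-256 as the prefix hash.

```python
import hashlib
import re

# Hypothetical subset of dynamic patterns (the text mentions 15+).
DYNAMIC_PATTERNS = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "timestamp": re.compile(
        r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?\b"
    ),
}


def align_prefix(system_prompt: str) -> tuple[str, str]:
    """Move lines containing dynamic values into a trailing appendix.

    Returns (aligned_prompt, sha256 hex digest of the stable prefix).
    """
    stable, dynamic = [], []
    for line in system_prompt.splitlines():
        if any(p.search(line) for p in DYNAMIC_PATTERNS.values()):
            dynamic.append(line)
        else:
            stable.append(line)
    prefix = "\n".join(stable)
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    if dynamic:
        return prefix + "\n\nDynamic Appendix:\n" + "\n".join(dynamic), digest
    return prefix, digest
```

Two prompts that differ only in timestamp and request ID produce the same prefix hash, which is exactly the property prompt caching needs.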

CCR: Compressed Context Retrieval

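Traditional compression is one-way: once information is discarded, it cannot be recovered. Headroom's CCR (Compressed Context Retrieval) mechanism makes compression reversible. When SmartCrusher compresses 500 items down to 20, the original data is stored in CompressStore (SQLite, default TTL 5 minutes). A marker such as "[480 items compressed. hash=abc123. Call headroom_retrieve to expand.]" is appended to the compressed output, and a headroom_retrieve tool definition is injected for the LLM. If the LLM finds the 20 representative items insufficient, it can call headroom_retrieve(hash="abc123", query="error") to fetch the matching original items on demand.

Original tool output:

[{id: 1, ...}, {id: 2, ...}, ... 500 items]

→

Compressed:

[{id: 1, ...}, {id: 12, ...}, ... 20 items]

[480 items compressed. hash=abc123. Call headroom_retrieve to expand.]

The original 500 items are stored in CompressStore (SQLite, TTL 5 minutes).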
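A minimal sketch of a TTL-bounded store like the one described above, using Python's built-in `sqlite3`. The class and method names, the 12-character hash, and the substring-based `query` filter are assumptions; Headroom's CompressStore may key, expire, and search entries differently. The `now` parameters exist only to make expiry testable.

```python
from __future__ import annotations

import hashlib
import json
import sqlite3
import time


class CompressStore:
    """Sketch of a TTL-bounded SQLite store for CCR (hypothetical API)."""

    def __init__(self, path: str = ":memory:", ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS ccr (hash TEXT PRIMARY KEY, items TEXT, expires REAL)"
        )

    def cache(self, items: list, now: float | None = None) -> str:
        payload = json.dumps(items, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        expires = (time.time() if now is None else now) + self.ttl
        with self.db:  # commits the transaction on success
            self.db.execute(
                "INSERT OR REPLACE INTO ccr VALUES (?, ?, ?)", (key, payload, expires)
            )
        return key

    def retrieve(self, key: str, query: str | None = None,
                 now: float | None = None) -> list | None:
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT items FROM ccr WHERE hash = ? AND expires > ?", (key, now)
        ).fetchone()
        if row is None:
            return None  # unknown or expired hash
        items = json.loads(row[0])
        if query:
            q = query.lower()
            items = [it for it in items if q in json.dumps(it).lower()]
        return items
```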

Code Walkthrough

A walk through the Python source of five key functions, from the compress() entry point to CCR tool injection.

Entry Point: compress()

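Headroom's single-function API. It lazily initializes a singleton TransformPipeline using double-checked locking, extracts the user's query from the messages as context for relevance scoring, and then calls pipeline.apply() to run the full transform chain. On failure it returns the original messages instead of raising: as middleware, it must never block an LLM request because compression failed.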

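# headroom/compress.py
from __future__ import annotations

import threading
from typing import Any

# Lazily initialized singleton pipeline
_pipeline = None
_pipeline_lock = threading.Lock()

def compress(
    messages: list[dict],
    model: str = "claude-sonnet-4-5-20250929",
    model_limit: int = 200000,
    hooks: Any = None,
    config: CompressConfig | None = None,
) -> CompressResult:
    try:
        pipeline = _get_pipeline()  # double-checked locking, thread-safe

        # Extract the user's query as context for relevance scoring
        context = _extract_user_query(messages)

        # Run the full transform chain
        result = pipeline.apply(
            messages=messages, model=model,
            model_limit=model_limit,
            context=context,  # SmartCrusher uses this for relevance scoring
        )
        return result  # (packaging the result as a CompressResult is elided)
    except Exception:
        # Fail open: return the original messages so the LLM request is never blocked
        return CompressResult(messages=messages)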

ContentRouter.compress()

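The router runs type detection on the content of each message. _detect_content() uses the Magika ML detector first (accuracy >95%) and falls back to regex heuristics. The detection result is mapped to the ContentType enum and then dispatched to the matching compressor. Each compressor works independently, and its result carries routing metadata.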

# headroom/transforms/content_router.py

# Magika label → ContentType (built once at module level, not on every call)
_MAGIKA_TYPE_MAP = {
    "json": ContentType.JSON_ARRAY,
    "code": ContentType.SOURCE_CODE,
    "log":  ContentType.BUILD_OUTPUT,
    "diff": ContentType.GIT_DIFF,
    "text": ContentType.PLAIN_TEXT,
}

def _detect_content(content: str) -> DetectionResult:
    magika = _get_magika_detector()
    if magika is not None:
        result = magika.detect(content)
        content_type = _MAGIKA_TYPE_MAP.get(result.content_type.value)
        if content_type is not None:
            return DetectionResult(
                content_type=content_type,
                confidence=result.confidence,
            )
    # Magika unavailable or label unmapped: fall back to regex heuristics
    return detect_content_type(content)

# Routing: each type → a dedicated compressor
# JSON array     → SmartCrusher (statistical analysis)
# Source code    → CodeCompressor (AST-aware)
# Build log      → LogCompressor (aggregation + dedup)
# Search results → SearchCompressor (keeps relevant hits)
# Plain text     → Kompress (ModernBERT ML)
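The dispatch step can be pictured as a lookup table from content type to compressor, with each result carrying routing metadata. In this sketch the labels, `RoutedResult`, and the toy compressors are hypothetical; the log compressor only collapses consecutive duplicate lines, a much cruder version of what LogCompressor is described as doing.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RoutedResult:
    content: str
    content_type: str
    compressor: str
    metadata: dict = field(default_factory=dict)


def passthrough(text: str) -> str:
    return text


def collapse_repeats(text: str) -> str:
    """Toy log compressor: collapse runs of identical consecutive lines."""
    out: list[str] = []
    prev, count = None, 0
    for line in text.splitlines() + [None]:  # None flushes the final run
        if line == prev:
            count += 1
            continue
        if prev is not None:
            out.append(prev if count == 1 else f"{prev}  [x{count}]")
        prev, count = line, 1
    return "\n".join(out)


# Hypothetical registry: content-type label -> (compressor name, function)
COMPRESSORS: dict[str, tuple[str, Callable[[str], str]]] = {
    "build_output": ("LogCompressor", collapse_repeats),
    "plain_text": ("Passthrough", passthrough),
}


def route(content: str, content_type: str) -> RoutedResult:
    name, fn = COMPRESSORS.get(content_type, ("Passthrough", passthrough))
    compressed = fn(content)
    return RoutedResult(
        content=compressed,
        content_type=content_type,
        compressor=name,
        metadata={"original_chars": len(content), "compressed_chars": len(compressed)},
    )
```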

SmartCrusher._crush_array()

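The core compression flow. It first computes an adaptive K with compute_optimal_k() (the Kneedle algorithm finds the knee of the information-density curve), then queries TOIN (Tool Output Intelligence Network) for compression recommendations learned across users. If TOIN sees that a tool's outputs have a high historical retrieval rate, it automatically reduces how aggressively that tool's output is compressed.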

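# headroom/transforms/smart_crusher.py
import hashlib
import json

def _crush_array(self, items, query_context, tool_name, bias):
    # Stable string form of each item, used for K estimation and hashing
    item_strings = [json.dumps(item, sort_keys=True, default=str) for item in items]

    # 1. Kneedle algorithm computes an adaptive K (not hard-coded)
    adaptive_k = compute_optimal_k(
        item_strings, bias=bias,
        min_k=3,  # keep at least 3 items
    )

    # 2. Query TOIN (cross-user learning network); toin, store and
    #    tool_signature come from module/instance state (elided)
    toin_hint = toin.get_recommendation(tool_signature, query_context)
    if toin_hint.skip_compression:  # this tool's output is often retrieved
        return items  # do not compress

    # 3. Per-field statistical analysis
    analysis = self.analyzer.analyze_array(items)

    # 4. Build the set of indices to keep
    #    (each index set is derived from `analysis` and `adaptive_k`; elided)
    keep = set()
    keep |= anchor_indices        # first/last K items
    keep |= error_indices         # items containing error/exception/failed
    keep |= anomaly_indices       # numeric outliers (> 2σ)
    keep |= change_point_indices  # change points ± window
    keep |= relevant_indices      # high BM25/embedding scores

    # 5. CCR: store the originals under a content hash for later retrieval
    content_hash = hashlib.sha256("\n".join(item_strings).encode("utf-8")).hexdigest()
    ccr_hash = store.cache(items, key=content_hash)
    kept = [items[i] for i in sorted(keep)]
    return kept, ccr_hash  # ccr_hash goes into the "[N items compressed. hash=...]" marker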
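The TOIN signal can be approximated locally by tracking, per tool, how often compressed output is later retrieved. This is a single-process sketch, not the cross-user network: `RetrievalRateTracker`, the 0.5 threshold, and the 10-sample minimum are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Recommendation:
    skip_compression: bool
    retrieval_rate: float


class RetrievalRateTracker:
    """Per-tool retrieval rate as a local stand-in for the TOIN signal."""

    def __init__(self, threshold: float = 0.5, min_samples: int = 10):
        self.threshold = threshold        # hypothetical cut-off
        self.min_samples = min_samples    # ignore tools with too little history
        self.compressions = defaultdict(int)
        self.retrievals = defaultdict(int)

    def record_compression(self, tool: str) -> None:
        self.compressions[tool] += 1

    def record_retrieval(self, tool: str) -> None:
        self.retrievals[tool] += 1

    def get_recommendation(self, tool: str) -> Recommendation:
        n = self.compressions[tool]
        rate = self.retrievals[tool] / n if n else 0.0
        # Skip compression only once the rate is both high and well-sampled.
        skip = n >= self.min_samples and rate >= self.threshold
        return Recommendation(skip_compression=skip, retrieval_rate=rate)
```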

AnchorSelector.select_anchors()

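Anchors are not a fixed first-K/last-K; their positions are allocated dynamically according to the data pattern: search results → FRONT_HEAVY; logs → BACK_HEAVY; time series → BALANCED. The selector also looks for temporal-intent keywords in the user's query ("最近" / recent, "历史" / history) and shifts the weight distribution accordingly.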

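# headroom/transforms/anchor_selector.py
from enum import Enum

class AnchorStrategy(Enum):
    FRONT_HEAVY  = "front_heavy"   # search results (top entries matter most)
    BACK_HEAVY   = "back_heavy"    # logs (newest entries are most relevant)
    BALANCED     = "balanced"      # time series (need trends at both ends)
    DISTRIBUTED  = "distributed"   # general purpose (uniform sampling)

def select_anchors(self, items, max_items, pattern, query):
    query = query or ""  # tolerate a missing query

    # 1. Infer the strategy from the data pattern
    strategy = self.get_strategy_for_pattern(pattern)

    # 2. Compute the anchor budget (bounded below by min slots, above by max_items)
    budget = int(max_items * config.anchor_budget_pct)
    budget = min(max(config.min_anchor_slots, budget), max_items)

    # 3. Assign weights by strategy
    weights = self.get_base_weights_for_strategy(strategy)
    # FRONT_HEAVY → front=0.6, middle=0.15, back=0.25
    # BACK_HEAVY  → front=0.2, middle=0.15, back=0.65

    # 4. Adjust weights by query intent ("最近" = recent, "历史" = history)
    if "最近" in query or "latest" in query:
        weights.back += 0.15  # favor the tail (newest data)
    elif "历史" in query or "first" in query:
        weights.front += 0.15  # favor the head (earliest data)

    # 5. Split the budget across the three regions
    front_n = int(budget * weights.front)
    back_n = int(budget * weights.back)
    middle_n = max(0, budget - front_n - back_n)
    # middle_sample: middle_n items sampled from the middle region (elided)
    # items[-0:] would return the whole list, so slice the tail by absolute position
    back = items[len(items) - back_n:] if back_n > 0 else []
    return items[:front_n] + middle_sample + back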
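A self-contained version of the budget split, returning item indices. Assumptions: weights are integer percentages (FRONT_HEAVY and BACK_HEAVY from the excerpt, BALANCED invented for illustration, DISTRIBUTED omitted); boosted weights are renormalized so the three shares always sum to the budget; the middle region is sampled at evenly spaced positions. `allocate_anchor_indices` is a hypothetical name.

```python
from enum import Enum


class Strategy(Enum):
    FRONT_HEAVY = "front_heavy"
    BACK_HEAVY = "back_heavy"
    BALANCED = "balanced"


# (front, middle, back) weights in integer percent. FRONT_HEAVY/BACK_HEAVY follow
# the excerpt above; BALANCED is an assumed split for illustration.
BASE_WEIGHTS = {
    Strategy.FRONT_HEAVY: (60, 15, 25),
    Strategy.BACK_HEAVY: (20, 15, 65),
    Strategy.BALANCED: (40, 20, 40),
}


def allocate_anchor_indices(n: int, budget: int, strategy: Strategy,
                            query: str = "") -> list[int]:
    """Pick sorted anchor indices for a list of n items."""
    budget = min(budget, n)
    front_w, middle_w, back_w = BASE_WEIGHTS[strategy]
    q = query.lower()
    if "最近" in query or "latest" in q:      # "最近" = recent
        back_w += 15
    elif "历史" in query or "first" in q:     # "历史" = history
        front_w += 15
    total = front_w + middle_w + back_w        # renormalize after the boost
    front_n = budget * front_w // total
    back_n = budget * back_w // total
    middle_n = budget - front_n - back_n
    front = set(range(front_n))
    back = set(range(n - back_n, n))           # absolute range: back_n == 0 gives no items
    lo, hi = front_n, n - back_n
    middle = set()
    if middle_n > 0 and hi > lo:
        step = (hi - lo) / (middle_n + 1)      # evenly spaced picks in the gap
        middle = {lo + int(step * (j + 1)) for j in range(middle_n)}
    return sorted(front | middle | back)
```

Integer arithmetic keeps the split exact, and using an absolute index range for the tail means a zero back share yields no tail items instead of the whole list.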

CCR Tool Injection

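When compression happens, a headroom_retrieve tool definition is injected into the LLM's tools array, in either OpenAI or Anthropic format. The LLM passes the hash and an optional query parameter to filter the results, so it can fetch only the entries relevant to the current question rather than all of the original data.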

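# headroom/ccr/tool_injection.py

CCR_TOOL_NAME = "headroom_retrieve"

# Note the trailing spaces: adjacent string literals are concatenated as-is
_CCR_DESCRIPTION = (
    "Retrieve original uncompressed content. "
    "Use when you need more data than shown "
    "in compressed tool results. The hash is "
    "provided in markers like "
    "[N items compressed... hash=abc123]."
)

# JSON Schema for the tool arguments (shared by both providers)
_CCR_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "hash": {
            "type": "string",
            "description": "Hash from compression marker",
        },
        "query": {
            "type": "string",
            "description": "Filter results by query",
        },
    },
    "required": ["hash"],
}

def create_ccr_tool_definition(provider: str = "anthropic") -> dict:
    """Build the tool definition to inject into the LLM's tools array."""
    if provider == "anthropic":
        # Anthropic Messages API: name / description / input_schema
        return {
            "name": CCR_TOOL_NAME,
            "description": _CCR_DESCRIPTION,
            "input_schema": _CCR_INPUT_SCHEMA,
        }
    if provider == "openai":
        # OpenAI Chat Completions: {"type": "function", "function": {...}}
        return {
            "type": "function",
            "function": {
                "name": CCR_TOOL_NAME,
                "description": _CCR_DESCRIPTION,
                "parameters": _CCR_INPUT_SCHEMA,
            },
        }
    raise ValueError(f"unsupported provider: {provider!r}")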
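Injection should not add the tool twice when a request already carries it, for example on a later turn of the same conversation. Here is a sketch of an idempotent helper, assuming tools are matched by name; `inject_ccr_tool` is a hypothetical name, and it accepts both Anthropic-style (`name` at the top level) and OpenAI-style (`function.name`) entries.

```python
from __future__ import annotations


def inject_ccr_tool(tools: list[dict] | None, tool_def: dict) -> list[dict]:
    """Return a new tools list that contains tool_def exactly once."""

    def name_of(tool: dict) -> str | None:
        # Anthropic-style entries have "name"; OpenAI-style nest it under "function".
        return tool.get("name") or tool.get("function", {}).get("name")

    tools = list(tools or [])  # copy so the caller's list is not mutated
    if all(name_of(t) != name_of(tool_def) for t in tools):
        tools.append(tool_def)
    return tools
```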