Fast Byte Latent Transformer

Basic information

arXiv ID: 2605.08044v1
Authors: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al.
Published: 2026-05-08
Categories: cs.CL, cs.AI, cs.LG
PDF: arXiv PDF

Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT’s local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.

Core contributions

This paper introduces three methods that accelerate Byte Latent Transformer (BLT) generation: (1) BLT-D, which adds block-wise discrete diffusion training to enable parallel multi-byte generation and achieves the largest speedups (up to a 92% reduction in memory-bandwidth cost); (2) BLT-S, which reuses the existing local decoder for self-speculation, drafting bytes past patch boundaries with no quality loss; and (3) BLT-DV, which adds autoregressive verification to BLT-D drafts, recovering quality while still achieving up to an 81% cost reduction. All three methods achieve more than 50% lower estimated memory-bandwidth cost than BLT at the 1B and 3B scales.

Method overview

The BLT architecture has three parts: an entropy-based patcher dynamically groups bytes into variable-length patches; a local encoder and a large global Transformer operate on latent tokens; and a local decoder generates bytes autoregressively. BLT-D modifies the decoder to accept corrupted byte blocks and trains it with both next-byte prediction and masked-byte prediction. At inference, it initializes a block of [MASK] tokens and iteratively unmasks positions via semi-autoregressive diffusion. BLT-S lets the decoder draft bytes beyond the normal patch boundaries; the drafts are then verified by a full-model forward pass. BLT-DV drafts blocks with diffusion and verifies them autoregressively. All methods reuse the same model weights, so no separate draft model is needed.
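The block-wise unmasking loop described for BLT-D can be illustrated with a toy sketch. This is not the paper's implementation: the `toy_predict` function stands in for a real decoder, the confidence-ordered commit rule and the `per_step` parameter are assumptions made for illustration, and all names are hypothetical. It only shows why filling a block in parallel needs fewer passes than byte-by-byte decoding.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # stand-in for the [MASK] token

# Hypothetical predictor: given the prefix and the current block, return one
# (byte, confidence) guess for every position in the block.
Predictor = Callable[[bytes, List[Optional[int]]], List[Tuple[int, float]]]


def unmask_block(prefix: bytes, block_size: int,
                 predict: Predictor, per_step: int) -> Tuple[bytes, int]:
    """Fill one fully masked block, committing `per_step` positions per pass."""
    block: List[Optional[int]] = [MASK] * block_size
    passes = 0
    while any(b is MASK for b in block):
        guesses = predict(prefix, block)  # one parallel pass over the block
        passes += 1
        masked = [i for i, b in enumerate(block) if b is MASK]
        # Assumed rule: commit the most confident masked positions first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = guesses[i][0]
    return bytes(block), passes


def generate(prompt: bytes, n_blocks: int, block_size: int,
             predict: Predictor, per_step: int) -> Tuple[bytes, int]:
    """Semi-autoregressive loop: blocks left to right, parallel inside a block."""
    out, total_passes = prompt, 0
    for _ in range(n_blocks):
        block, passes = unmask_block(out, block_size, predict, per_step)
        out += block
        total_passes += passes
    return out, total_passes


TARGET = b"hello byte world"  # toy "ground truth" the fake decoder reproduces


def toy_predict(prefix: bytes,
                block: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Fake decoder: always guesses the target byte, less confident further right."""
    start = len(prefix)
    return [(TARGET[(start + i) % len(TARGET)], 1.0 / (i + 1))
            for i in range(len(block))]


text, passes = generate(b"", n_blocks=2, block_size=8,
                        predict=toy_predict, per_step=4)
print(text, passes)
```

In this toy setting the 16 bytes are produced in 4 passes instead of 16. A real decoder can guess wrong, which is where the quality cost of larger blocks comes from.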

Experimental results

At the 1B scale, on WMT translation and HumanEval code generation, BLT-D with a diffusion block size of 32 achieves a 92% reduction in decoder NFEs (number of function evaluations), with a moderate BLEU drop on translation. BLT-S achieves up to a 77% reduction in encoder and global-model calls with zero quality loss (its output is guaranteed to be identical to greedy decoding). BLT-DV with a block size of 16 achieves an 81% reduction with quality close to autoregressive BLT. Similar trends hold at the 3B scale. Likelihood evaluations confirm that diffusion training does not degrade perplexity, and a generation-diversity analysis shows that BLT-D produces more diverse outputs at high temperatures.
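The greedy-identical guarantee follows from how draft-and-verify decoding commits bytes. The sketch below is a generic greedy speculative loop, not BLT-S itself: `full_model` and `draft_model` are hypothetical toy functions, and the parallel verification pass is simulated by calling the full model on each prefix. Because every committed byte is the full model's own greedy choice, the output cannot differ from plain greedy decoding; the draft only affects how many verification passes are needed.

```python
from typing import Callable, List, Tuple

NextByte = Callable[[List[int]], int]


def greedy_decode(full: NextByte, prompt: List[int], n: int) -> List[int]:
    """Baseline: one full-model call per generated byte."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(full(seq))
    return seq[len(prompt):]


def speculative_decode(full: NextByte, draft: NextByte, prompt: List[int],
                       n: int, k: int) -> Tuple[List[int], int]:
    """Draft k bytes cheaply, then verify them in one (simulated) full pass."""
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < n:
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        verify_passes += 1  # a real system scores all k positions at once
        accepted: List[int] = []
        for guess in proposal:
            target = full(seq + accepted)
            accepted.append(target)  # always commit the full model's choice
            if target != guess:
                break  # first mismatch: discard the rest of the draft
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], verify_passes


def full_model(seq: List[int]) -> int:
    """Toy deterministic 'large' model (hypothetical)."""
    return (sum(seq) * 31 + 7 * len(seq)) % 256


def draft_model(seq: List[int]) -> int:
    """Toy draft: agrees with the full model except at every fifth position."""
    guess = full_model(seq)
    return guess if len(seq) % 5 else (guess + 1) % 256


prompt = [1, 2, 3]
baseline = greedy_decode(full_model, prompt, 20)
fast, passes = speculative_decode(full_model, draft_model, prompt, n=20, k=4)
print(fast == baseline, passes)
```

For simplicity the sketch omits the extra "bonus" byte that many speculative decoders take from the verifier when the whole draft is accepted.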

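Limitations and caveats

BLT-D's diffusion-based generation introduces a quality-efficiency trade-off: larger blocks are faster but lower in quality.
The memory-bandwidth cost estimates rest on idealized assumptions (weights fully resident in memory); actual speedups vary by hardware.
The methods are validated only on translation and code generation; performance on other tasks such as dialogue and reasoning is unknown.
BLT-S's draft length depends on patch boundaries; when drafts are too short, the speedup is limited.
All methods require modifying the training procedure on top of the BLT architecture and cannot be applied directly to a pretrained BLT model.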
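To make the caveat about idealized cost estimates concrete, here is a minimal sketch of one way such an estimate could be computed. It assumes that every forward pass reads the invoked module's weights from memory exactly once and that nothing else costs bandwidth. The call counts and weight sizes below are made-up illustrative inputs, not figures from the paper, and the paper's own cost model may differ.

```python
from dataclasses import dataclass


@dataclass
class PassCounts:
    """How often each module runs while generating a fixed number of bytes."""
    global_calls: int   # forward passes of the large global Transformer
    decoder_calls: int  # forward passes of the small local decoder


def bandwidth_cost(counts: PassCounts,
                   global_weight_bytes: float,
                   decoder_weight_bytes: float) -> float:
    """Idealized cost: each pass reads its module's weights once."""
    return (counts.global_calls * global_weight_bytes
            + counts.decoder_calls * decoder_weight_bytes)


def relative_reduction(baseline: float, variant: float) -> float:
    """Fraction of the baseline cost that the variant saves."""
    return 1.0 - variant / baseline


# Hypothetical inputs for generating 64 bytes (arbitrary weight units).
GLOBAL_W, DECODER_W = 900.0, 100.0
autoregressive = PassCounts(global_calls=16, decoder_calls=64)  # patch size 4
block_parallel = PassCounts(global_calls=4, decoder_calls=16)   # assumed counts

base = bandwidth_cost(autoregressive, GLOBAL_W, DECODER_W)
fast = bandwidth_cost(block_parallel, GLOBAL_W, DECODER_W)
print(f"estimated reduction: {relative_reduction(base, fast):.0%}")
```

An estimate of this kind ignores activation and KV-cache traffic, batching, and compute-bound regimes, which is one reason measured speedups on real hardware can diverge from it.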
