On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar¹, Irina Rish¹, Nicolas Flammarion², Francesco Croce²

¹ Mila - Quebec AI Institute  ² EPFL

Correspondence: rishika.bhagwatkar@mila.quebec


This is the first work to systematically study the robustness of discrete image tokenizers to adversarial attacks, and it shows that simple, task-agnostic training can make multimodal systems far safer.

Abstract

Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike that of CLIP encoders, their vulnerability to adversarial attacks has not been explored. This is the first work to study the topic. We first formulate attacks that aim to perturb the features extracted by discrete tokenizers and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, and inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. Although unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.

Proposed attack

Idea. Discrete tokenizers map an image to a sequence of visual tokens. If we can induce large changes in the tokenizer’s internal features (and therefore tokens), downstream systems built on top of those tokens—classification, retrieval, captioning, and VQA—can fail.

Unsupervised: no labels from downstream tasks are needed.
Inexpensive: attacking the tokenizer is cheaper than attacking the full downstream model.
Task-agnostic: the attack does not depend on the downstream architecture.

Overview of the tokenizer-based unsupervised, inexpensive, task-agnostic attack.

Attack overview

We craft an ℓ∞-bounded perturbation to maximize the mismatch between tokenizer features extracted from the clean image and the perturbed image, causing the produced token sequence to drift.
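The sketch below illustrates the idea on a toy problem and is not the implementation used in this work. It assumes a linear "encoder" (a random matrix `W`), a squared ℓ2 feature distance, and a single nearest-neighbor codebook lookup in place of a full token sequence. The names `encode`, `nearest_token`, and `feature_pgd` are hypothetical. With a real tokenizer, the analytic gradient would be replaced by automatic differentiation through the encoder.

```python
import random

def encode(W, x):
    # Toy linear encoder (assumption): features = W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_token(codebook, feats):
    # Vector quantization: index of the closest codebook entry.
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], feats))

def feature_pgd(W, x, eps, step, iters, rng):
    """Maximize ||f(x + delta) - f(x)||^2 subject to ||delta||_inf <= eps."""
    clean = encode(W, x)
    # Random start: the gradient of the distance is exactly zero at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    delta = [min(1.0, max(0.0, xi + di)) - xi for xi, di in zip(x, delta)]
    history = []  # feature distance at every iterate
    for _ in range(iters):
        adv_feats = encode(W, [xi + di for xi, di in zip(x, delta)])
        diff = [a - c for a, c in zip(adv_feats, clean)]
        history.append(sum(d * d for d in diff))
        # Analytic gradient for the linear toy encoder: 2 * W^T (f(x + delta) - f(x)).
        grad = [2 * sum(W[j][i] * diff[j] for j in range(len(W))) for i in range(len(x))]
        for i, g in enumerate(grad):
            d = delta[i] + step * ((g > 0) - (g < 0))       # signed ascent step
            d = max(-eps, min(eps, d))                       # project onto the l_inf ball
            delta[i] = min(1.0, max(0.0, x[i] + d)) - x[i]   # keep pixels in [0, 1]
    adv = [xi + di for xi, di in zip(x, delta)]
    history.append(sq_dist(encode(W, adv), clean))
    return adv, history

rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]         # 16 "pixels" -> 4 features
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]   # 8-entry toy vocabulary
x = [rng.uniform(0.2, 0.8) for _ in range(16)]
adv, history = feature_pgd(W, x, eps=4 / 255, step=1 / 255, iters=10, rng=rng)
print("feature distance:", history[0], "->", history[-1])
print("token:", nearest_token(codebook, encode(W, x)), "->", nearest_token(codebook, encode(W, adv)))
```

No labels and no downstream model appear anywhere in the loop, which is what makes this kind of attack unsupervised and task-agnostic. Whether the token actually flips in this toy setting depends on the random codebook.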

Defense

Robust tokenizers via unsupervised adversarial training. Inspired by robust CLIP-style training, we fine-tune only the tokenizer’s encoder. The objective encourages adversarial embeddings to stay close to clean embeddings, so small input perturbations do not significantly change the tokenizer’s internal representations (and thus the tokens).

Defense properties

Fully unsupervised
Task-agnostic

Robust tokenizer properties

Plug-and-play across downstream tasks
Robustness without fine-tuning the full model
Minimally degrades clean performance

Unsupervised adversarial training objective for robust tokenizers and key properties.

Defense overview

We adversarially fine-tune the tokenizer encoder so that feature representations are stable under bounded perturbations, improving robustness across different downstream systems.
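One way to write this objective down, as a sketch modeled on robust CLIP fine-tuning rather than the exact loss used here: let f_θ be the tokenizer encoder being fine-tuned, f_θ₀ a frozen copy of the original encoder, and ε the training radius. The squared ℓ2 distance, the use of the frozen encoder as the clean target, and the symbols themselves are assumptions made for illustration.

```latex
\min_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}}
\left[ \max_{\|\delta\|_\infty \le \varepsilon}
\left\| f_{\theta}(x + \delta) - f_{\theta_0}(x) \right\|_2^2 \right]
```

The inner maximization is a feature-space attack of the kind described in the attack overview, and the outer minimization needs only unlabeled images x drawn from a dataset 𝒟. No downstream labels or downstream model parameters enter the loss.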

Main results

Robust tokenizers → Robust Embedding Models. We evaluate robustness after swapping the original tokenizer encoder with our unsupervised adversarially fine-tuned version, while keeping the downstream model frozen. For FuseLIP (TiTok-based), robust tokenizers substantially increase adversarial robustness on both classification (Imagenette, Caltech101) and multimodal retrieval (OI-Crop, OI-Pos), and the training radius provides explicit control over the robustness–accuracy trade-off. For UniTok, the same tokenizer-only fine-tuning also improves robustness against end-to-end attacks across Imagenette, Caltech101, and ImageNet.

Table 1. Evaluation of FuseLIP on image classification and multimodal retrieval.

Tokenizer	Imagenette			Caltech101			OI-Crop			OI-Pos			Average
Tokenizer	clean	2/255	4/255	clean	2/255	4/255	clean	2/255	4/255	clean	2/255	4/255	clean	2/255	4/255
clean	93.6	2.6	0.0	74.4	0.6	0.0	71.8	7.4	0.8	69.2	5.4	1.4	77.3	4.0	0.6
AT^4/255	91.8	63.6	36.6	73.0	48.2	20.8	66.2	50.6	26.0	67.2	46.0	24.6	74.6	52.1	27.0
AT^8/255	89.6	69.0	48.8	72.4	51.6	32.8	62.0	48.8	35.8	64.8	51.2	35.6	72.2	55.2	38.3
AT^12/255	87.0	71.4	51.0	67.6	51.2	36.8	56.2	49.0	36.8	61.6	49.6	35.2	68.1	55.3	40.0
AT^16/255	83.4	66.6	50.0	61.2	47.6	37.4	50.0	47.2	35.8	59.4	48.8	39.2	63.5	52.6	40.6
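The robustness-accuracy trade-off in Table 1 can be read off with a few lines of arithmetic. The snippet below is only a reading aid: it copies the Table 1 numbers, recomputes the Average columns as plain means over the four datasets, and reports for each training radius how much average clean accuracy is lost and how much average accuracy at 4/255 is gained relative to the original tokenizer. The helper names are hypothetical.

```python
# Accuracies (%) copied from Table 1, per tokenizer, as (clean, 2/255, 4/255) triples
# for Imagenette, Caltech101, OI-Crop, OI-Pos, followed by the reported Average.
TABLE1 = {
    "clean":     [(93.6, 2.6, 0.0), (74.4, 0.6, 0.0), (71.8, 7.4, 0.8), (69.2, 5.4, 1.4), (77.3, 4.0, 0.6)],
    "AT^4/255":  [(91.8, 63.6, 36.6), (73.0, 48.2, 20.8), (66.2, 50.6, 26.0), (67.2, 46.0, 24.6), (74.6, 52.1, 27.0)],
    "AT^8/255":  [(89.6, 69.0, 48.8), (72.4, 51.6, 32.8), (62.0, 48.8, 35.8), (64.8, 51.2, 35.6), (72.2, 55.2, 38.3)],
    "AT^12/255": [(87.0, 71.4, 51.0), (67.6, 51.2, 36.8), (56.2, 49.0, 36.8), (61.6, 49.6, 35.2), (68.1, 55.3, 40.0)],
    "AT^16/255": [(83.4, 66.6, 50.0), (61.2, 47.6, 37.4), (50.0, 47.2, 35.8), (59.4, 48.8, 39.2), (63.5, 52.6, 40.6)],
}

def column_means(rows):
    """Mean of each column (clean, 2/255, 4/255) over the dataset rows."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def max_average_gap(table):
    """Largest gap between a reported Average entry and the recomputed mean."""
    gap = 0.0
    for *datasets, reported in table.values():
        for got, want in zip(column_means(datasets), reported):
            gap = max(gap, abs(got - want))
    return gap

def tradeoff(table, baseline="clean"):
    """Average clean accuracy lost and 4/255 accuracy gained vs. the original tokenizer."""
    base_clean, _, base_robust = table[baseline][-1]
    return {
        name: (base_clean - rows[-1][0], rows[-1][2] - base_robust)
        for name, rows in table.items()
        if name != baseline
    }

for name, (lost, gained) in tradeoff(TABLE1).items():
    print(f"{name}: clean -{lost:.1f}, robust@4/255 +{gained:.1f}")
print(f"max gap to reported averages: {max_average_gap(TABLE1):.3f}")
```

The recomputed means agree with the reported Average columns up to rounding to one decimal place. Larger training radii cost more clean accuracy, while the gain at 4/255 grows more slowly beyond 8/255.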

Table 2. Evaluation of UniTok on image classification.

Tokenizer	Imagenette			Caltech101			ImageNet			Average
Tokenizer	clean	2/255	4/255	clean	2/255	4/255	clean	2/255	4/255	clean	2/255	4/255
clean	99.2	0.0	0.0	85.7	0.0	0.0	67.3	0.0	0.0	84.1	0.0	0.0
AT^4/255	99.2	92.1	75.0	81.2	56.9	22.4	66.9	31.9	10.5	82.4	60.3	36.0
AT^8/255	97.8	91.5	82.7	77.4	63.5	43.9	58.3	40.3	23.6	77.8	65.1	50.1
AT^12/255	95.6	88.7	81.4	72.4	60.1	47.6	50.4	36.5	25.6	72.8	61.8	51.5
AT^16/255	92.7	86.3	79.6	65.3	57.5	44.6	42.3	32.1	23.6	66.7	58.7	49.3

Robust tokenizers → Robust Multimodal LLMs. We next study UniTok-MLLM by replacing only the image tokenizer with our robust UniTok variant. On VQA (VQAv2, OK-VQA, GQA), robust tokenizers yield large gains in adversarial accuracy under ℓ∞ attacks. On captioning, both unsupervised targeted (tokenizer-only) and supervised targeted (end-to-end) attacks can steer the original model toward the target caption, while the model with the robust tokenizer stays close to the correct description.

Targeted attacks on captioning

The examples below demonstrate targeted attacks: (i) an unsupervised targeted attack that matches the perturbed image’s tokenizer embeddings to a target image, and (ii) a supervised targeted end-to-end attack that optimizes toward a specific target caption. In both cases, the robust tokenizer prevents the model from switching to the target caption.

Unsupervised targeted attack

The attack minimizes the embedding distance between a perturbed input and a target image using only the tokenizer. The original UniTok-MLLM shifts toward the target caption, while the robust tokenizer preserves a correct, safe caption.
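In symbols, a sketch of this targeted variant could read as follows, where f is the tokenizer encoder, x the input image, x_tgt the target image, and ε the perturbation budget. The squared ℓ2 distance is an assumption; the text above specifies only an embedding distance.

```latex
\delta^\star \in \arg\min_{\|\delta\|_\infty \le \varepsilon}
\left\| f(x + \delta) - f(x_{\mathrm{tgt}}) \right\|_2^2
```

Compared with the untargeted attack, the sign flips from maximizing the distance to the clean features to minimizing the distance to the target's features.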

Supervised targeted attack

The end-to-end attack directly optimizes the image perturbation toward a chosen target caption. With the original UniTok tokenizer, the model can be forced to output the target caption; swapping in the robust tokenizer prevents this behavior and keeps captions aligned with the input image.
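A common way to formalize such an end-to-end targeted attack is to minimize the negative log-likelihood of the target caption under the full model. The formulation below is a sketch under that assumption: p_φ denotes the frozen multimodal model, y^tgt the target caption of length T, and the token-level cross-entropy is an illustrative choice rather than a confirmed detail of the experiments.

```latex
\delta^\star \in \arg\min_{\|\delta\|_\infty \le \varepsilon}
\; - \sum_{t=1}^{T} \log p_{\phi}\!\left( y^{\mathrm{tgt}}_{t} \mid y^{\mathrm{tgt}}_{<t},\; x + \delta \right)
```

Unlike the tokenizer-only attack, this objective requires gradients through the whole model, which is why it is the more expensive of the two.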

BibTeX


@misc{robust_tokenizers_2026,
  title        = {On the Adversarial Robustness of Discrete Image Tokenizers},
  author       = {Rishika Bhagwatkar and Irina Rish and Nicolas Flammarion and Francesco Croce},
  year         = {2026},
  note         = {Preprint},
}
