Frequency Autoregressive Image Generation with Continuous Tokens

Alibaba Group, DAMO Academy
teaser image.


🔥 Highlights

1. We propose the FAR paradigm, which leverages the spectral dependency of image data. FAR fits the causality requirement of AR models and preserves the spatial locality of image data, while being more sampling-efficient.


2. We delve into the instantiation of FAR with the continuous tokenizer, introducing a series of techniques to address the optimization challenges and improve the efficiency of both training and inference. FAR needs only 10 steps to generate an image.


3. We demonstrate the effectiveness and scalability of FAR through comprehensive experiments on the ImageNet dataset and further extend FAR to text-to-image generation.



Abstract

Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan "next-token prediction", inspired by the great success of this paradigm in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon lower-frequency ones to progressively construct a complete image. This design seamlessly fits the causality requirement for autoregressive models and preserves the unique spatial locality of image data. In addition, we delve into the integration of FAR and the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of the training and inference processes. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential for text-to-image generation.
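To make the idea of spectral dependency concrete, the following is a minimal sketch, not the FAR implementation. It decomposes a 2D array into progressively sharper low-pass versions, so that each level only adds higher-frequency detail on top of the previous one. The hard circular mask in the Fourier domain, the function name `frequency_levels`, the random test image, and the choice of 10 levels (echoing the 10 generation steps mentioned above) are all illustrative assumptions; the filtering actually used by FAR, and the space it operates in, may differ.

```python
import numpy as np


def frequency_levels(image: np.ndarray, num_levels: int) -> list[np.ndarray]:
    """Return num_levels low-pass versions of a 2D image, coarse to fine.

    Hypothetical helper: level k keeps the Fourier coefficients whose radial
    frequency is at most k / num_levels of the largest radius, so the last
    level reproduces the full image.
    """
    height, width = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every coefficient from the zero-frequency (DC) term,
    # which fftshift places at index (height // 2, width // 2).
    ys, xs = np.mgrid[:height, :width]
    radius = np.hypot(ys - height // 2, xs - width // 2)
    max_radius = radius.max()

    levels = []
    for k in range(1, num_levels + 1):
        # Assumed filter: a hard circular mask that grows with k.
        mask = radius <= max_radius * (k / num_levels)
        filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
        levels.append(filtered.real)
    return levels


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))  # stand-in for an image or latent map
levels = frequency_levels(image, num_levels=10)

# Each step contributes only the residual between consecutive levels;
# summing all residuals recovers the final, full-frequency level.
residuals = [levels[0]] + [b - a for a, b in zip(levels, levels[1:])]
```

In this toy setting, an autoregressive model would condition on `levels[k - 1]` to predict `levels[k]`. Every level has the full spatial resolution, which is one way to read the claim that frequency-wise regression preserves spatial locality.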

FAR Framework Overview

framework overview image.

Main Experiment Results

exp256 image.
exp512 image.

Visualization

framework overview image.

BibTeX

@article{yu2024an,
  author    = {Hu Yu and Hao Luo and Hangjie Yuan and Yu Rong and Feng Zhao},
  title     = {Frequency Autoregressive Image Generation with Continuous Tokens},
  journal   = {arXiv preprint arXiv:2503.05305},
  year      = {2025}
}