POLAR QUANTIZATION APPLIED TO KEY-VALUE VECTOR COMPRESSION IN CACHE MEMORIES OF LARGE LANGUAGE MODELS

Published on 25/06/2026 - ISBN: 978-65-272-2513-3

Title of the Work
POLAR QUANTIZATION APPLIED TO KEY-VALUE VECTOR COMPRESSION IN CACHE MEMORIES OF LARGE LANGUAGE MODELS
Authors
  • Igor Duarte
  • Hugo Hoffmann Borges
Submission Type
Abstract
Subject Area
Engineering, Technology, and Exact Sciences
Publication Date
25/06/2026
Country of Publication
Brazil
Publication Language
pt-BR
Work Page
https://www.even3.com.br/anais/congresso-internacional-de-ciencias-cic/1556019-polar-quantization-applied-to-key-value-vector-compression-in-cache-memories-of-large-language-models
ISBN
978-65-272-2513-3
Keywords
Artificial Intelligence, AI Efficiency, Optimization, Algorithm
Abstract
Quantizing KV cache vectors is a promising strategy for reducing memory overhead within context windows: it preserves the integrity of the model architecture while reducing the storage footprint of each vector. Conventional approaches, such as block-wise normalization, require storing auxiliary parameters that can exceed 1 bit per quantized value, which diminishes compression effectiveness. Integrating Large Language Models (LLMs) into applications presents intrinsic challenges of computational efficiency, memory utilization among them. In self-attention architectures such as Transformers, sequential token generation requires keeping the Key-Value (KV) cache vectors of all previously processed tokens in memory. This storage grows in proportion to both the context length and the model size, making it a significant memory bottleneck in LLM deployments. Reducing its size enlarges the context window a system can support under the same hardware constraints. To explore efficient alternatives, the present research implements the PolarQuant algorithm proposed by Han et al. (2025), which offers a theoretically grounded approach to this challenge. The method replaces the Cartesian representation of vectors with polar coordinates. The algorithm has three phases: (i) random preconditioning via Hadamard rotation, which preserves the norms and inner products of the original vectors while spreading the signal energy uniformly; (ii) a recursive polar transformation, applied to the randomized vectors; and (iii) Lloyd-Max quantization, derived from the analytical distribution of the angles following pre- and post-conditioning. A core characteristic of the method is that, after preconditioning, the data vectors follow well-established, mutually independent Gaussian distributions, with the angles concentrating progressively around π/4 as the level advances.
This obviates the need for explicit data normalization, reducing the memory overhead typically associated with conventional approaches. As a practical contribution, a Python implementation of the algorithm was developed to reproduce the compression and reconstruction pipeline described by Han et al. (2025). Experiments with Gaussian vectors of dimension d = 32,768, using 8 bits for level 0 and 4 bits for the remaining levels, show that the algorithm reduced a 131,072-byte input by more than 80%. Reconstruction yielded a mean squared error of approximately 0.0022.
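The three phases described in the abstract can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, a plain uniform angle quantizer stands in for the Lloyd-Max codebooks, and the storage estimate assumes a 32-bit float input (32,768 values × 4 bytes = 131,072 bytes) plus one 32-bit radius per vector.

```python
import math
import random


def fwht(values):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    y = list(values)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def precondition(x, signs):
    """Phase (i): random sign flips, then the Hadamard transform (an orthogonal map)."""
    return fwht([s * v for s, v in zip(signs, x)])


def undo_precondition(y, signs):
    """Inverse of precondition: the normalized transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


def polar_forward(x):
    """Phase (ii): recursively pair values into (radius, angle) until one radius is left."""
    levels, radii, level0 = [], list(x), True
    while len(radii) > 1:
        angles, next_radii = [], []
        for a, b in zip(radii[0::2], radii[1::2]):
            theta = math.atan2(b, a)
            if level0 and theta < 0.0:
                theta += 2.0 * math.pi  # level 0 angles span [0, 2*pi)
            angles.append(theta)  # higher levels pair radii, so angles lie in [0, pi/2]
            next_radii.append(math.hypot(a, b))
        levels.append(angles)
        radii, level0 = next_radii, False
    return radii[0], levels


def polar_inverse(radius, levels):
    """Rebuild the vector from the final radius and the per-level angles."""
    radii = [radius]
    for angles in reversed(levels):
        expanded = []
        for r, theta in zip(radii, angles):
            expanded.append(r * math.cos(theta))
            expanded.append(r * math.sin(theta))
        radii = expanded
    return radii


def quantize_uniform(angles, bits, upper):
    """Assumption: uniform bins on [0, upper], NOT Lloyd-Max. Returns bin midpoints."""
    bins = 1 << bits
    width = upper / bins
    return [(min(int(t / width), bins - 1) + 0.5) * width for t in angles]


def compressed_bytes(d, bits_level0=8, bits_other=4, radius_bytes=4):
    """Storage estimate: d/2 level-0 angles, d/2 - 1 higher-level angles, one radius."""
    return ((d // 2) * bits_level0 + (d // 2 - 1) * bits_other) / 8 + radius_bytes


rng = random.Random(0)
d = 1024  # smaller than the d = 32,768 of the abstract, to keep the demo fast
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

y = precondition(x, signs)
radius, levels = polar_forward(y)
quantized = [quantize_uniform(levels[0], 8, 2.0 * math.pi)]
quantized += [quantize_uniform(a, 4, math.pi / 2.0) for a in levels[1:]]

x_hat = undo_precondition(polar_inverse(radius, quantized), signs)
mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / d
print(f"MSE at d={d}: {mse:.4f}")
print(f"estimated bytes at d=32768: {compressed_bytes(32768)}")
```

Under these assumptions, compressed_bytes(32768) evaluates to 24,579.5 bytes, about 18.75% of 131,072 bytes, which is consistent with the reported reduction exceeding 80%. The MSE printed by this sketch is not expected to match the reported 0.0022, because the sketch uses a uniform quantizer and a smaller dimension.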
Event Title
Congresso Internacional de Ciências (CIC) – 2026
Title of the Event Proceedings
Congresso Internacional de Ciências (CIC)
Publisher
Even3
Publication Medium
Digital

How to cite

DUARTE, Igor; BORGES, Hugo Hoffmann. POLAR QUANTIZATION APPLIED TO KEY-VALUE VECTOR COMPRESSION IN CACHE MEMORIES OF LARGE LANGUAGE MODELS. In: Congresso Internacional de Ciências (CIC). Proceedings... Itaperuna (RJ) Afya Itaperuna - RJ, 2026. Available at: https://www.even3.com.br/anais/congresso-internacional-de-ciencias-cic/1556019-POLAR-QUANTIZATION-APPLIED-TO-KEY-VALUE-VECTOR-COMPRESSION-IN-CACHE-MEMORIES-OF-LARGE-LANGUAGE-MODELS. Accessed on: 26/06/2026
