      Transformer tricks: Precomputing the first layer

      Preprint

          Abstract

          This short paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, and PaLM). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model (such as Mistral-7B) is limited to 3% savings.
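          The abstract's numbers follow from a simple bound: since the trick touches one of the model's layers, the end-to-end savings cannot exceed roughly 1/num_layers (25% for a 4-layer model, about 3% for 32 layers). The sketch below illustrates one way the precomputation could look, under the assumption that, because RoPE applies positional information inside attention rather than adding it to the token embeddings, the first layer's input depends only on the token id, so its linear projections can be folded into per-token lookup tables built once, offline. Module names and sizes are illustrative, not taken from the paper.

# Minimal sketch of the precomputation idea; not the paper's code.
# Assumption: in a RoPE model, no positional term is added to the token
# embeddings, so the first layer's input depends only on the token id and
# the layer's projections can be precomputed per vocabulary entry.
# (A pre-attention normalization of the embedding would fold into the
# tables the same way; it is omitted here for brevity.)
import torch

vocab_size, d_model = 1000, 64

embed = torch.nn.Embedding(vocab_size, d_model)
w_q = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer Q projection
w_k = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer K projection
w_v = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer V projection

# Offline: push every vocabulary embedding through the first-layer
# projections once and store the results, one row per token id.
with torch.no_grad():
    q_table = w_q(embed.weight)   # (vocab_size, d_model)
    k_table = w_k(embed.weight)
    v_table = w_v(embed.weight)

# Online: the first layer's projection matmuls become table lookups.
# RoPE is then applied per position to the looked-up q and k, exactly as it
# would be to freshly computed projections.
token_ids = torch.tensor([3, 17, 42])
q, k, v = q_table[token_ids], k_table[token_ids], v_table[token_ids]

# Sanity check: the lookups match computing the projections on the fly.
assert torch.allclose(q, w_q(embed(token_ids)))
assert torch.allclose(v, w_v(embed(token_ids)))

          In this setup the first layer's per-token projection matmuls are replaced by memory lookups, which is where the slightly lower latency and cost-per-token quoted in the abstract would come from.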

          Author and article information

          Date: 20 February 2024 (preprint)
          arXiv: 2402.13388
          Record ID: 857d4b0d-3459-4006-af8a-1986ebe81a84
          License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
          5 pages, 2 figures
          Subject: cs.LG (Artificial intelligence)
