      Transformer tricks: Precomputing the first layer

      Preprint

          Abstract

          This short paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, and PaLM). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model (such as Mistral-7B) is limited to 3% savings.
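          The abstract's numbers follow from a simple bound: since the trick touches one of the model's layers, the end-to-end savings cannot exceed roughly 1/num_layers (25% for a 4-layer model, about 3% for 32 layers). The sketch below illustrates one way the precomputation could look, under the assumption that, because RoPE applies positional information inside attention rather than adding it to the token embeddings, the first layer's input depends only on the token id, so its linear projections can be folded into per-token lookup tables built once, offline. Module names and sizes are illustrative, not taken from the paper.

# Minimal sketch of the precomputation idea; not the paper's code.
# Assumption: in a RoPE model, no positional term is added to the token
# embeddings, so the first layer's input depends only on the token id and
# the layer's projections can be precomputed per vocabulary entry.
# (A pre-attention normalization of the embedding would fold into the
# tables the same way; it is omitted here for brevity.)
import torch

vocab_size, d_model = 1000, 64

embed = torch.nn.Embedding(vocab_size, d_model)
w_q = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer Q projection
w_k = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer K projection
w_v = torch.nn.Linear(d_model, d_model, bias=False)  # first-layer V projection

# Offline: push every vocabulary embedding through the first-layer
# projections once and store the results, one row per token id.
with torch.no_grad():
    q_table = w_q(embed.weight)   # (vocab_size, d_model)
    k_table = w_k(embed.weight)
    v_table = w_v(embed.weight)

# Online: the first layer's projection matmuls become table lookups.
# RoPE is then applied per position to the looked-up q and k, exactly as it
# would be to freshly computed projections.
token_ids = torch.tensor([3, 17, 42])
q, k, v = q_table[token_ids], k_table[token_ids], v_table[token_ids]

# Sanity check: the lookups match computing the projections on the fly.
assert torch.allclose(q, w_q(embed(token_ids)))
assert torch.allclose(v, w_v(embed(token_ids)))

          In this setup the first layer's per-token projection matmuls are replaced by memory lookups, which is where the slightly lower latency and cost-per-token quoted in the abstract would come from.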

          Author and article information

          Date: 20 February 2024 (preprint)
          arXiv: 2402.13388
          Record ID: 857d4b0d-3459-4006-af8a-1986ebe81a84
          License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
          5 pages, 2 figures
          Subject: cs.LG (Artificial intelligence)
