Prompt Caching

Cache prompt messages.

To save on inference costs, you can enable prompt caching on supported models.

Most providers automatically enable prompt caching, but note that some (see Anthropic below) require you to enable it on a per-message basis.

Anthropic Claude

Caching price changes:

  • Cache writes: charged at 1.25x the original input price

  • Cache reads: charged at 0.1x the original input price
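
For illustration, with a hypothetical model priced at $3 per million input tokens, cached tokens would cost $3.75 per million to write and $0.30 per million to read; the extra $0.75 per million spent on the cache write is already recovered by the $2.70 per million saved on the first cache read.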

Prompt caching with Anthropic requires the use of cache_control breakpoints. There is a limit of 4 breakpoints, and the cache expires within 5 minutes. It is therefore recommended to reserve the breakpoints for large bodies of static text, such as character cards, CSV data, RAG data, or book chapters. Note that there is also a minimum prompt size of 1024 tokens for caching to apply.

Click here to read more about Anthropic prompt caching and its limitations.

The cache_control breakpoint can only be inserted into the text part of a multipart message.

System message caching example:

{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. You know the following book very well:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}

User message caching example:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Given the book below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
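
If you are calling Anthropic's Messages API directly (rather than through a proxy that normalizes usage reporting), the usage object in the response reports cache activity via cache_creation_input_tokens and cache_read_input_tokens, which is useful for confirming that your breakpoints are being applied. A sketch with illustrative token counts:

{
  "usage": {
    "input_tokens": 21,
    "cache_creation_input_tokens": 18086,
    "cache_read_input_tokens": 0,
    "output_tokens": 393
  }
}

On the first request the cached prefix is counted under cache_creation_input_tokens; on subsequent requests within the TTL it appears under cache_read_input_tokens instead.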

OpenAI

Caching price changes:

  • Cache writes: no cost

  • Cache reads: charged at 0.25x or 0.50x the original input price, depending on the model

Click here to view OpenAI's cache pricing per model.

Prompt caching with OpenAI is automated and does not require any additional configuration. There is a minimum prompt size of 1024 tokens.
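
Because caching is automatic, the main thing to check is whether your prompts are actually hitting the cache. When calling OpenAI's Chat Completions API directly, cached tokens are reported under prompt_tokens_details in the usage object; a sketch of the relevant fragment, with illustrative token counts:

{
  "usage": {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    "prompt_tokens_details": {
      "cached_tokens": 1920
    }
  }
}

A nonzero cached_tokens value indicates that the leading portion of the prompt was served from the cache.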

Click here to read more about OpenAI prompt caching and its limitations.

Grok

Caching price changes:

  • Cache writes: no cost

  • Cache reads: charged at 0.25x the original input price

Click here to view Grok's cache pricing per model.

Prompt caching with Grok is automated and does not require any additional configuration.

Google Gemini

Implicit Caching

Gemini 2.5 Pro and 2.5 Flash models support implicit caching, which provides automatic caching similar to OpenAI's. Implicit caching works seamlessly: no manual setup or additional cache_control breakpoints are required.

Caching price changes:

  • No cache write or storage costs.

  • Cached tokens are charged at 0.25x the original input token cost.

Note that the TTL is on average 3-5 minutes, but will vary. Requests must contain a minimum of 1024 tokens for Gemini 2.5 Flash, or 2048 tokens for Gemini 2.5 Pro, to be eligible for caching.

Official announcement from Google

To maximize implicit cache hits, keep the initial portion of your message arrays consistent between requests. Push variations (such as user questions or dynamic context elements) toward the end of the prompt, as in the sketch below.
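
For illustration, a request shaped for implicit caching might look like the following, with the long, stable context first and the changing question last (the contents are placeholders reused from the examples above):

{
  "messages": [
    {
      "role": "system",
      "content": "You are a historian studying the fall of the Roman Empire. You know the following book very well: HUGE TEXT BODY"
    },
    {
      "role": "user",
      "content": "What triggered the collapse?"
    }
  ]
}

A subsequent request that keeps the same system message but swaps the final user message (for example, "Name all the characters in the above book") can then reuse the implicitly cached prefix.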
