Implicit Caching Is Now Supported In Gemini 2.5 Models

Google launched context caching in May 2024, letting developers explicitly cache reused context and save 75% on those input tokens. Today, the Gemini API adds implicit caching, a frequently requested feature.
Implicit caching using Gemini API
Implicit caching saves developers money by removing the need to construct a cache explicitly. A request to a Gemini 2.5 model that shares a common prefix with an earlier request can now trigger a cache hit, and the same 75% token discount is passed back to you dynamically.
To increase the likelihood of a cache hit, put content that varies from request to request, such as the user's question, at the end of the prompt. More implicit caching best practices are in the Gemini API documentation.
To make cache hits more frequent, Google also reduced the minimum request size for implicit caching to 1,024 tokens on Gemini 2.5 Flash and 2,048 tokens on 2.5 Pro.
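As a rough illustration, here is a minimal sketch using the google-genai Python SDK (the SDK choice, file name, and prompts are assumptions, not from the announcement). The large, stable document is sent first and the varying user question last, so successive requests share a long common prefix that implicit caching can match:

```python
from google import genai

# Assumes the GEMINI_API_KEY environment variable is set.
client = genai.Client()

# Large, stable content first: it is identical across requests,
# so it forms the shared prefix that implicit caching matches on.
with open("contract.txt") as f:  # hypothetical document
    document = f.read()

def ask(question: str) -> str:
    # The part that varies per request (the user's question) goes last.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[document, question],
    )
    return response.text

print(ask("What is the termination clause?"))
print(ask("Who are the parties to this agreement?"))
```

If the document alone is at least 1,024 tokens (Flash) or 2,048 tokens (Pro), the second call has a good chance of hitting the implicit cache.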
Gemini 2.5: Token discounts explained
To guarantee cost savings, consider the explicit caching API, which supports both Gemini 2.5 and 2.0 models. With Gemini 2.5 models, the usage information in the response includes cached_content_token_count, the number of tokens in the request that were cached and charged at the reduced price.
Preview models may change, and may have stricter rate limits, before they stabilise.
Context caching
Typical AI workflows feed models the same input tokens over and over. The Gemini API offers two caching mechanisms:
Implicit caching (automatic, cost savings not guaranteed)
Explicit caching (manual, cost savings guaranteed)
Gemini 2.5 models use implicit caching by default. If a request hits cached content, the cost savings are passed back to you automatically.
Explicit caching guarantees cost savings but requires more developer work.
Implicit cache
Implicit caching is enabled by default for all Gemini 2.5 models. If your request hits a cache, the cost savings are passed on automatically; no action is required on your part. This took effect on May 8, 2025. The minimum input token count for context caching is 1,024 on 2.5 Flash and 2,048 on 2.5 Pro.
To increase the chance of an implicit cache hit:
Put large, common content at the beginning of your prompt.
Send requests with similar prefixes within a short span of time.
The number of tokens that hit the cache is shown in the response object's usage_metadata field.
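Continuing the sketch above (still assuming the google-genai Python SDK), you can inspect the usage metadata after a request that shares a prefix with an earlier one:

```python
# Reuses the `client` and `document` from the previous sketch.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[document, "Summarise the key obligations."],
)

usage = response.usage_metadata
# cached_content_token_count is empty when no cache hit occurred.
print("prompt tokens:", usage.prompt_token_count)
print("cached tokens:", usage.cached_content_token_count)
```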
Explicit cache
With the Gemini API's explicit caching, you can send content to the model once, cache the input tokens, and reuse those cached tokens in subsequent requests. At certain volumes, using cached tokens is cheaper than passing the same tokens in repeatedly.
Cached tokens are stored for a set period before being automatically deleted; this duration is the cache's time to live (TTL). If unset, the TTL defaults to one hour. The cost of caching depends on the input token size and how long you want the tokens to persist.
This section assumes you have installed a Gemini SDK (or curl) and set up an API key, as shown in the quickstart.
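Below is a minimal end-to-end sketch, assuming the google-genai Python SDK; the model choice, display name, file name, and prompts are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Send the large content once and cache its input tokens.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="contract-cache",           # illustrative name
        system_instruction="You are a contract analyst.",
        contents=[open("contract.txt").read()],  # hypothetical document
        ttl="7200s",  # two hours; defaults to one hour if unset
    ),
)

# Later requests reference the cache instead of resending the tokens.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="List the payment terms.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)

# Delete the cache early if it is no longer needed; otherwise it
# expires automatically when the TTL elapses.
client.caches.delete(name=cache.name)
```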
#ImplicitCaching #GeminiAPI #Contextcaching #Explicitcaching #Gemini25 #Gemini25models #technology #TechNews #technologynews #news #govindhtech