Token Cache Hit Rate Testing

Overview

As generative AI models process user inputs, they segment each request and response into tokens, small units of text such as words, subwords, or symbols. Token-based computation enables cost control and performance optimization, and it also creates opportunities to improve model speed and reduce latency through intelligent caching.

The Token Cache is a mechanism that stores previously computed tokens, allowing the model to reuse prior context efficiently. This document explains how the token cache works, why it matters for enterprise-level AI deployments, and how to validate OneRouter’s direct, transparent connectivity through code-based benchmarking.

Token Cache Principles

How Token Caching Works

  1. Tokenization: When you send a prompt to a model, it is first tokenized. Each unique token is assigned a numeric ID. Example (simplified):

    "AI caching improves performance." 
    → [AI, caching, improves, performance, .]
  2. Incremental Computation: During inference, models build upon already computed states (hidden layers). If your next query shares a long prefix with a previous one, a cache lets the model skip redundant work.

  3. Cache Retrieval: The system stores key-value pairs for each transformer layer. When a repeated sequence appears, the model retrieves the pre-computed attention keys/values instead of recalculating them (a simplified sketch of this prefix reuse follows this list).

  4. Result:

    • Reduced per-request latency (fewer prompt tokens need to be recomputed).

    • Lower overall cost for repeated or streaming queries.

    • Consistent model output for repeated prefixes.
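
To make the reuse idea concrete, here is a minimal Python sketch of a prefix cache. It is a toy illustration only: the whitespace "tokenizer" and the cache structure are simplified assumptions, whereas real inference engines cache per-layer attention keys and values.

    # Toy illustration of prefix reuse; real engines cache per-layer attention
    # keys/values, but the bookkeeping idea is the same.

    def tokenize(text: str) -> list[str]:
        # Simplified whitespace "tokenizer", for illustration only.
        return text.split()

    class PrefixCache:
        def __init__(self) -> None:
            self._sequences: list[list[str]] = []  # previously computed token sequences

        def insert(self, tokens: list[str]) -> None:
            self._sequences.append(list(tokens))

        def longest_cached_prefix(self, tokens: list[str]) -> int:
            # Length of the longest prefix of `tokens` that was already computed.
            best = 0
            for seq in self._sequences:
                shared = 0
                for a, b in zip(seq, tokens):
                    if a != b:
                        break
                    shared += 1
                best = max(best, shared)
            return best

    cache = PrefixCache()
    cache.insert(tokenize("AI caching improves performance for chat assistants"))

    query = tokenize("AI caching improves performance for batch workloads")
    hit = cache.longest_cached_prefix(query)   # tokens the model can skip recomputing
    print(f"cached tokens reused: {hit}, tokens still to compute: {len(query) - hit}")

Running this prints 5 reused tokens and 2 remaining, which is the kind of saving a production KV cache delivers when consecutive requests share a long prompt prefix.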

Importance and Enterprise Value

The dimensions below summarize the value token caching delivers for enterprises:

  • Performance: Reduces average latency, enabling near real-time dialog systems and intelligent assistants.

  • Scalability: Reduces compute overhead, allowing large-scale deployments with lower GPU cost.

  • Consistency: Ensures stable responses for repeated prefix queries (e.g., ongoing chat contexts).

  • Cost Optimization: Minimizes redundant token charges, especially for recall-heavy use cases.

  • Sustainability: Lowers energy consumption through more efficient inference cycles.

Token caching directly enhances both the user experience (speed, responsiveness) and business efficiency (throughput per dollar spent).
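
For a rough sense of the cost effect, the arithmetic below assumes, purely for illustration, that 80% of a prompt's tokens are served from cache and that cached input tokens are billed at a quarter of the standard rate; actual discounts and hit rates vary by provider.

    # Back-of-the-envelope cost impact; the hit rate and discount are assumptions,
    # not any provider's published pricing.
    prompt_tokens = 10_000
    cached_fraction = 0.8        # share of prompt tokens served from cache
    cached_discount = 0.25       # cached tokens billed at 25% of the normal rate

    effective_tokens = (prompt_tokens * (1 - cached_fraction)
                        + prompt_tokens * cached_fraction * cached_discount)
    savings = 1 - effective_tokens / prompt_tokens
    print(f"effective billable input tokens: {effective_tokens:.0f} ({savings:.0%} saved)")

Under these assumptions the effective input bill drops from 10,000 to 4,000 tokens, a 60% saving on the prompt side alone.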

Validation - Token Cache Comparison Between OneRouter and Google AI Studio

OneRouter is a transparent AI service dispatcher that ensures zero obfuscation between the client request and the model endpoint. It routes traffic directly to the original model provider, guaranteeing data integrity, full transparency, and predictable latency patterns.

Unlike proxy APIs that encapsulate or alter payloads, OneRouter simply forwards context and token metrics, allowing clients to verify direct connectivity.

Below is a simple benchmarking approach to demonstrate OneRouter’s transparency and verify that token caching behaves identically to the original AI model endpoint.

Environment Setup

Requirements:

  • Python ≥ 3.9

  • requests or httpx

  • API access to both OneRouter and the direct model endpoint (Google AI Studio).

Account & API Keys Setup

To start using OneRouter, create an account and obtain your API key.

To start using Google AI Studio, create a project and obtain your API key.
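
For the benchmark below, both keys are assumed to be exported as environment variables; the variable names are illustrative, so adjust them to however your deployment stores secrets.

    import os

    # Illustrative variable names; store your keys however your deployment prefers.
    ONEROUTER_API_KEY = os.environ["ONEROUTER_API_KEY"]
    GOOGLE_AI_STUDIO_API_KEY = os.environ["GOOGLE_AI_STUDIO_API_KEY"]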

Benchmark Script Example
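
The sketch below measures the cache hit rate as cached prompt tokens divided by total prompt tokens, using the usage metadata each endpoint returns, and sends the same long prompt twice per endpoint so the second call can reuse the cached prefix. The OneRouter URL is a placeholder, and the model name and usage-field names (an OpenAI-compatible schema for OneRouter, usageMetadata fields for Google AI Studio) are assumptions to verify against each provider's current API reference.

    # Benchmark sketch: send the same long prompt twice to each endpoint and
    # compare the cached-token ratio reported for the second call.
    # The endpoint URL, model name, and usage-field names are assumptions; check
    # them against your OneRouter dashboard and the Google AI Studio API reference.
    import os

    import requests

    ONEROUTER_URL = "https://<your-onerouter-endpoint>/v1/chat/completions"  # placeholder
    GOOGLE_URL = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        "gemini-2.5-flash:generateContent"
    )

    # A long, repeated prefix gives the provider something worth caching.
    PROMPT = ("You are a support assistant for an enterprise knowledge base. " * 50
              + "Summarize the benefits of token caching in one sentence.")

    def call_onerouter(prompt: str) -> dict:
        # Assumes OneRouter exposes an OpenAI-compatible chat completions schema.
        resp = requests.post(
            ONEROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['ONEROUTER_API_KEY']}"},
            json={"model": "gemini-2.5-flash",  # model name is an assumption
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        resp.raise_for_status()
        usage = resp.json().get("usage", {})
        cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
        return {"prompt_tokens": usage.get("prompt_tokens", 0), "cached_tokens": cached}

    def call_google(prompt: str) -> dict:
        resp = requests.post(
            GOOGLE_URL,
            params={"key": os.environ["GOOGLE_AI_STUDIO_API_KEY"]},
            json={"contents": [{"parts": [{"text": prompt}]}]},
            timeout=60,
        )
        resp.raise_for_status()
        usage = resp.json().get("usageMetadata", {})
        return {"prompt_tokens": usage.get("promptTokenCount", 0),
                "cached_tokens": usage.get("cachedContentTokenCount", 0)}

    def cache_hit_rate(call, prompt: str) -> float:
        call(prompt)              # first call warms the provider-side cache
        second = call(prompt)     # second call should reuse the shared prefix
        if second["prompt_tokens"] == 0:
            return 0.0
        return second["cached_tokens"] / second["prompt_tokens"]

    if __name__ == "__main__":
        print(f"OneRouter cache hit rate:        {cache_hit_rate(call_onerouter, PROMPT):.2%}")
        print(f"Google AI Studio cache hit rate: {cache_hit_rate(call_google, PROMPT):.2%}")

If both endpoints report comparable cached-token counts for the same prompt, the routed path is adding no extra tokenization or payload rewriting, which is the transparency property described above.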

Comparing the usage metadata returned by each endpoint, requests sent through OneRouter and requests sent directly to Google AI Studio report almost identical token cache rates.

Key Takeaways

  1. Token Cache is the foundation for real-time AI efficiency.

  2. Enterprises benefit through optimized cost, speed, and consistent inference.

  3. OneRouter provides infrastructure-level transparency, ensuring every cached token and every response is derived directly from the authentic model endpoint.

  4. Verification via code tests is straightforward: identical token metrics confirm full transparency.
