
Prompt Caching

Supported Providers

For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:

"usage": {
"prompt_tokens": 2006,
"completion_tokens": 300,
"total_tokens": 2306,
"prompt_tokens_details": {
"cached_tokens": 1920
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
# ANTHROPIC_ONLY #
"cache_creation_input_tokens": 0
}
  • prompt_tokens: These are the non-cached prompt tokens (same as Anthropic, equivalent to Deepseek prompt_cache_miss_tokens).
  • completion_tokens: These are the output tokens generated by the model.
  • total_tokens: Sum of prompt_tokens + completion_tokens.
  • prompt_tokens_details: Object containing cached_tokens.
    • cached_tokens: Tokens that were a cache hit for this call (see the sketch after this list for how to read these fields from a response).
  • completion_tokens_details: Object containing reasoning_tokens.
  • ANTHROPIC_ONLY: cache_creation_input_tokens is the number of tokens written to the cache. (Anthropic charges for this.)
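
A minimal sketch of reading these fields from a LiteLLM response object (assuming a configured provider key; the nested detail objects may be missing or None depending on the provider):

from litellm import completion

# Any supported model works here; the call itself is just to obtain a usage object.
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

usage = response.usage
print("non-cached prompt tokens:", usage.prompt_tokens)
print("output tokens:", usage.completion_tokens)
print("total tokens:", usage.total_tokens)

if usage.prompt_tokens_details is not None:
    print("cache-hit tokens:", usage.prompt_tokens_details.cached_tokens)

# ANTHROPIC_ONLY: tokens written to the cache (Anthropic bills for these).
cache_writes = getattr(usage, "cache_creation_input_tokens", None)
if cache_writes is not None:
    print("cache-write tokens:", cache_writes)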

Quick Start

Note: OpenAI caching is only available for prompts of 1024 tokens or more.

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = ""

for _ in range(2):
    response = completion(
        model="gpt-4o",
        messages=[
            # System message: the long, repeated text forms the stable prefix
            # that OpenAI caches automatically (no cache_control marker needed).
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            # The second loop iteration reuses the prefix above, so its usage
            # reports cached_tokens > 0.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0

Anthropic Example

Anthropic charges for cache writes.

Specify the content to cache with "cache_control": {"type": "ephemeral"}.

If this is passed to any other LLM provider, it is ignored (a sketch of this follows the example below).

from litellm import completion 
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""

response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)
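
As noted above, the cache_control field is ignored by providers that don't support it. The following is a minimal sketch of that behavior, sending the same cache_control-annotated payload to an OpenAI model; the model choice and key setup are illustrative assumptions, not part of the original example.

# Minimal sketch (illustrative): the same cache_control-annotated messages sent
# to a provider that does not support the field; LiteLLM is documented to
# ignore it in that case.
import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = ""

response = completion(
    model="gpt-4o",  # hypothetical non-Anthropic target, for illustration
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},  # ignored by this provider
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)
print(response.usage)  # no error; caching fields populate per OpenAI's own rules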

Deepseek Example

Works the same as OpenAI.

from litellm import completion 
import litellm
import os

os.environ["DEEPSEEK_API_KEY"] = ""

litellm.set_verbose = True # 👈 SEE RAW REQUEST

model_name = "deepseek/deepseek-chat"
messages_1 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Qing Dynasty?",
    },
]

# Same conversation prefix as messages_1 with a different final question;
# the shared prefix is what the second call can read from cache.
messages_2 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "When did the Shang Dynasty fall?"},
]

response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=messages_2)

# Add any assertions here to check the response
print(response_2.usage)

Calculate Costs

The cost of cache-hit prompt tokens can differ from the cost of cache-miss prompt tokens.
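
As a rough illustration, the sketch below breaks the cost down by hand using hypothetical per-token rates; the real rates live in LiteLLM's maintained model cost map, and the response object is assumed to come from a call like the Anthropic example above.

# Rough illustration with hypothetical per-token rates; real prices come from
# LiteLLM's maintained model cost map. Field meanings follow the usage object
# described above: prompt_tokens are the non-cached prompt tokens.
input_rate = 3.00 / 1_000_000         # hypothetical $ per uncached input token
cached_input_rate = 0.30 / 1_000_000  # hypothetical $ per cache-hit input token
output_rate = 15.00 / 1_000_000       # hypothetical $ per output token

usage = response.usage  # response from the Anthropic example above
cached = (
    usage.prompt_tokens_details.cached_tokens
    if usage.prompt_tokens_details
    else 0
)

estimated_cost = (
    usage.prompt_tokens * input_rate          # cache-miss prompt tokens
    + cached * cached_input_rate              # cache-hit prompt tokens (cheaper)
    + usage.completion_tokens * output_rate   # generated tokens
)
print(f"estimated cost: ${estimated_cost:.6f}")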

Use the completion_cost() function to calculate the cost (it also accounts for prompt caching costs). See more helper functions

cost = completion_cost(completion_response=response, model=model)

Usage

from litellm import completion, completion_cost
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
    model=model,
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ],
)

print(response.usage)

cost = completion_cost(completion_response=response, model=model)

formatted_string = f"${float(cost):.10f}"
print(formatted_string)

Check Model Support

Check if a model supports prompt caching with supports_prompt_caching().

from litellm.utils import supports_prompt_caching

supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")

assert supports_pc

This checks our maintained model info / cost map.
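
To look at the underlying entry yourself, here is a minimal sketch; it assumes litellm.get_model_info() is available in your version and that the entry exposes a supports_prompt_caching flag plus cache-related cost keys (key names may vary between versions).

# Minimal sketch (assumptions noted above): inspect the maintained model info /
# cost map entry directly. Keys are read defensively because names may differ
# between LiteLLM versions.
import litellm

info = litellm.get_model_info(model="anthropic/claude-3-5-sonnet-20240620")

print("supports_prompt_caching:", info.get("supports_prompt_caching"))
print("cache_read_input_token_cost:", info.get("cache_read_input_token_cost"))
print("cache_creation_input_token_cost:", info.get("cache_creation_input_token_cost"))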