[BETA] Request Prioritization
Info
Beta feature. Use for testing only.
Prioritize LLM API requests during periods of high traffic.
- Adds the request to the priority queue
- Polls the queue to check if the request can be made (sketched below). Returns 'True':
  - if there are healthy deployments
  - OR if the request is at the top of the queue
- Priority - the lower the number, the higher the priority:
  - e.g. priority=0 > priority=2000
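A minimal sketch of that scheduling idea (illustrative only, not LiteLLM's internal scheduler): requests sit in a priority-ordered queue, and each one polls until a deployment is healthy or it reaches the front.

```python
import asyncio, heapq, itertools

# Illustrative priority queue of (priority, insertion_order, request_id).
# Lower priority numbers sort first; the counter keeps equal priorities FIFO.
_queue: list = []
_counter = itertools.count()

def add_to_queue(request_id: str, priority: int) -> None:
    heapq.heappush(_queue, (priority, next(_counter), request_id))

async def poll_until_ready(request_id: str, priority: int,
                           has_healthy_deployment, polling_interval: float = 0.03) -> None:
    """Wait until the request may be sent: a healthy deployment exists OR we're at the front."""
    add_to_queue(request_id, priority)
    while True:
        at_front = bool(_queue) and _queue[0][2] == request_id
        if has_healthy_deployment() or at_front:
            # remove our entry and proceed with the LLM call
            _queue.remove(next(item for item in _queue if item[2] == request_id))
            heapq.heapify(_queue)
            return
        await asyncio.sleep(polling_interval)
```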
Supported Router endpoints
- acompletion (/v1/chat/completions on the proxy)
- atext_completion (/v1/completions on the proxy)
Quick Start
```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    timeout=2, # timeout request if takes > 2s
    routing_strategy="usage-based-routing-v2",
    polling_interval=0.03, # poll queue every 30ms if no healthy deployments
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")
```
LiteLLM Proxy
To prioritize requests on the LiteLLM Proxy, add priority to the request.
- curl
- OpenAI SDK
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
            "role": "user",
            "content": "what is the meaning of the universe? 1234"
        }
    ],
    "priority": 0 👈 SET VALUE HERE
}'
```
```python
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "priority": 0 # 👈 SET VALUE HERE
    }
)

print(response)
```
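Since /v1/completions is supported as well, the same field can be passed for text completions. A minimal sketch using the same client (assumes the model name resolves to a deployment on your proxy):

```python
response = client.completions.create(
    model="gpt-3.5-turbo",
    prompt="this is a test request, write a short poem",
    extra_body={
        "priority": 0 # 👈 SET VALUE HERE
    }
)

print(response)
```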
Advanced - Redis Caching
Use Redis caching to do request prioritization across multiple instances of LiteLLM.
SDK
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    ### REDIS PARAMS ###
    redis_host=os.environ["REDIS_HOST"],
    redis_password=os.environ["REDIS_PASSWORD"],
    redis_port=os.environ["REDIS_PORT"],
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")
```
PROXY
```yaml
model_list:
  - model_name: gpt-3.5-turbo-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      mock_response: "hello world!"
      api_key: my-good-key

litellm_settings:
  request_timeout: 600 # 👈 Will keep retrying until timeout occurs

router_settings:
  redis_host: os.environ/REDIS_HOST
  redis_password: os.environ/REDIS_PASSWORD
  redis_port: os.environ/REDIS_PORT
```
```bash
$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000
```
```bash
curl -X POST 'http://0.0.0.0:4000/queue/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
            "role": "user",
            "content": "what is the meaning of the universe? 1234"
        }
    ],
    "priority": 0 👈 SET VALUE HERE
}'
```
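The same request from Python, as a minimal sketch using the requests library (assumes the proxy configured above is running locally on port 4000 with the key sk-1234):

```python
import requests

response = requests.post(
    "http://0.0.0.0:4000/queue/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-1234",
    },
    json={
        "model": "gpt-3.5-turbo-fake-model",
        "messages": [
            {"role": "user", "content": "what is the meaning of the universe? 1234"}
        ],
        "priority": 0,  # 👈 LOWER IS BETTER
    },
)

print(response.json())
```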