[BETA] Request Prioritization
Info
Beta feature. Use for testing only.
Prioritize LLM API requests during periods of high traffic.
- Adds the request to the priority queue
- Polls the queue to check if the request can be made (sketched below). Returns 'True':
  - if there are healthy deployments
  - OR if the request is at the top of the queue
- Priority - the lower the number, the higher the priority:
  - e.g. priority=0 > priority=2000
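A minimal sketch of that scheduling idea (illustrative only, not LiteLLM's internal scheduler): requests sit in a priority-ordered queue, and each one polls until a deployment is healthy or it reaches the front.

```python
import asyncio, heapq, itertools

# Illustrative priority queue of (priority, insertion_order, request_id).
# Lower priority numbers sort first; the counter keeps equal priorities FIFO.
_queue: list = []
_counter = itertools.count()

def add_to_queue(request_id: str, priority: int) -> None:
    heapq.heappush(_queue, (priority, next(_counter), request_id))

async def poll_until_ready(request_id: str, priority: int,
                           has_healthy_deployment, polling_interval: float = 0.03) -> None:
    """Wait until the request may be sent: a healthy deployment exists OR we're at the front."""
    add_to_queue(request_id, priority)
    while True:
        at_front = bool(_queue) and _queue[0][2] == request_id
        if has_healthy_deployment() or at_front:
            # remove our entry and proceed with the LLM call
            _queue.remove(next(item for item in _queue if item[2] == request_id))
            heapq.heapify(_queue)
            return
        await asyncio.sleep(polling_interval)
```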
Supported Router endpoints
- acompletion (/v1/chat/completions on the proxy)
- atext_completion (/v1/completions on the proxy)
Quick Start
```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    timeout=2, # timeout request if takes > 2s
    routing_strategy="usage-based-routing-v2",
    polling_interval=0.03, # poll queue every 30ms if no healthy deployments
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")
```
LiteLLM Proxy
To prioritize requests on the LiteLLM Proxy, add priority to the request.
- curl
- OpenAI SDK
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
            "role": "user",
            "content": "what is the meaning of the universe? 1234"
        }
    ],
    "priority": 0 👈 SET VALUE HERE
}'
```
```python
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ],
    extra_body={
        "priority": 0 # 👈 SET VALUE HERE
    }
)

print(response)
```
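Since /v1/completions is supported as well, the same field can be passed for text completions. A minimal sketch using the same client (assumes the model name resolves to a deployment on your proxy):

```python
response = client.completions.create(
    model="gpt-3.5-turbo",
    prompt="this is a test request, write a short poem",
    extra_body={
        "priority": 0 # 👈 SET VALUE HERE
    }
)

print(response)
```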
Advanced - Redis Caching
Use Redis caching to do request prioritization across multiple instances of LiteLLM.
SDK
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    ### REDIS PARAMS ###
    redis_host=os.environ["REDIS_HOST"],
    redis_password=os.environ["REDIS_PASSWORD"],
    redis_port=os.environ["REDIS_PORT"],
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")
```
PROXY
```yaml
model_list:
  - model_name: gpt-3.5-turbo-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      mock_response: "hello world!"
      api_key: my-good-key

litellm_settings:
  request_timeout: 600 # 👈 Will keep retrying until timeout occurs

router_settings:
  redis_host: os.environ/REDIS_HOST
  redis_password: os.environ/REDIS_PASSWORD
  redis_port: os.environ/REDIS_PORT
```
```bash
$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000
```
```bash
curl -X POST 'http://0.0.0.0:4000/queue/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
            "role": "user",
            "content": "what is the meaning of the universe? 1234"
        }
    ],
    "priority": 0 👈 SET VALUE HERE
}'
```
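The same request from Python, as a minimal sketch using the requests library (assumes the proxy configured above is running locally on port 4000 with the key sk-1234):

```python
import requests

response = requests.post(
    "http://0.0.0.0:4000/queue/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-1234",
    },
    json={
        "model": "gpt-3.5-turbo-fake-model",
        "messages": [
            {"role": "user", "content": "what is the meaning of the universe? 1234"}
        ],
        "priority": 0,  # 👈 LOWER IS BETTER
    },
)

print(response.json())
```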