VLLM

LiteLLM supports all models on VLLM.

Property	Details
Description	vLLM is a fast and easy-to-use library for LLM inference and serving. Docs
Provider Route on LiteLLM	`hosted_vllm/` (for an OpenAI-compatible server), `vllm/` (for vLLM SDK usage)
Provider Doc	vLLM ↗
Supported Endpoints	`/chat/completions`, `/embeddings`, `/completions`

Quick Start

Usage - litellm.completion (calling OpenAI compatible endpoint)

vLLM provides OpenAI-compatible endpoints. Here's how to call them with LiteLLM.

To use LiteLLM to call a hosted vLLM server, add the following to your completion call:

model="hosted_vllm/<your-vllm-model-name>"
api_base = "your-hosted-vllm-server"

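import litellm

# example conversation (placeholder content)
messages = [{"role": "user", "content": "Hey, how's it going?"}]

response = litellm.completion(
    model="hosted_vllm/facebook/opt-125m",  # pass the vllm model name
    messages=messages,
    api_base="https://hosted-vllm-api.co",
    temperature=0.2,
    max_tokens=80,
)

print(response)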

Usage - LiteLLM Proxy Server (calling OpenAI compatible endpoint)

Here's how to call an OpenAI-compatible endpoint with the LiteLLM Proxy Server

Configure the config.yaml

model_list:
  - model_name: my-model
    litellm_params:
      model: hosted_vllm/facebook/opt-125m  # add hosted_vllm/ prefix to route as OpenAI provider
      api_base: https://hosted-vllm-api.co      # add api base for OpenAI compatible provider

Start the proxy
```
$ litellm --config /path/to/config.yaml
```

Send a request to the LiteLLM Proxy Server

OpenAI Python v1.0.0+
curl

import openai
client = openai.OpenAI(
    api_key="sk-1234",             # pass litellm proxy key, if you're using virtual keys
    base_url="http://0.0.0.0:4000" # litellm-proxy-base url
)

response = client.chat.completions.create(
    model="my-model",
    messages = [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ],
)

print(response)

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "my-model",
    "messages": [
        {
        "role": "user",
        "content": "what llm are you"
        }
    ]
}'
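
If neither the OpenAI client nor curl is at hand, the same request can be assembled with Python's standard library. This is a minimal sketch, assuming the proxy address and placeholder key used in this section; the names `PROXY_URL` and `API_KEY` are illustrative. Building the body with `json.dumps` guarantees that the payload is valid JSON.

```python
import json
import urllib.request

PROXY_URL = "http://0.0.0.0:4000/chat/completions"  # proxy address used in this section
API_KEY = "sk-1234"  # placeholder LiteLLM proxy key used in this section

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "what llm are you"}],
}

request = urllib.request.Request(
    PROXY_URL,
    data=json.dumps(payload).encode("utf-8"),  # serialized body is always valid JSON
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending is left commented out so the sketch runs without a live proxy:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```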

Embeddings

SDK
PROXY

from litellm import embedding
import os

os.environ["HOSTED_VLLM_API_BASE"] = "https://:8000"

# use a distinct name so the imported embedding() function is not shadowed
response = embedding(model="hosted_vllm/facebook/opt-125m", input=["Hello world"])

print(response)

Configure config.yaml

model_list:
    - model_name: my-model
      litellm_params:
        model: hosted_vllm/facebook/opt-125m  # add hosted_vllm/ prefix to route as OpenAI provider
        api_base: https://hosted-vllm-api.co      # add api base for OpenAI compatible provider

Start the proxy

$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Test it!

curl -L -X POST 'http://0.0.0.0:4000/embeddings' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{"input": ["hello world"], "model": "my-model"}'

See examples for OpenAI SDK/Langchain/etc.

Send Video URL to VLLM

Example implementation from VLLM here

(Unified) Files Message
(VLLM-specific) Video Message

Use this to send a video URL to VLLM + Gemini in the same format, using the OpenAI files message type.

There are two ways to send a video URL to VLLM:

Pass the video URL directly

{"type": "file", "file": {"file_id": video_url}},

Pass the video data as base64

{"type": "file", "file": {"file_data": f"data:video/mp4;base64,{video_data_base64}"}}
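
As a rough sketch of the base64 option, the snippet below builds the `file_data` content part from raw video bytes using only the standard library. The helper name `video_file_part` and the placeholder bytes are illustrative assumptions, not part of LiteLLM; in practice the bytes would come from reading an MP4 file.

```python
import base64


def video_file_part(video_bytes: bytes, mime_type: str = "video/mp4") -> dict:
    """Build a unified 'file' content part carrying base64-encoded video data.

    Hypothetical helper for illustration; not a LiteLLM function.
    """
    video_data_base64 = base64.b64encode(video_bytes).decode("ascii")
    return {
        "type": "file",
        "file": {"file_data": f"data:{mime_type};base64,{video_data_base64}"},
    }


# Placeholder bytes; a real call would read them from a video file on disk.
part = video_file_part(b"not a real video")
print(part["file"]["file_data"][:22])  # the "data:video/mp4;base64," prefix
```

The resulting dictionary can be placed in the `content` list of a user message alongside a text part.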

SDK
PROXY

import os
from litellm import completion

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Summarize the following video"
            },
            {
                "type": "file",
                "file": {
                    "file_id": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
                }
            }
        ]
    }
]

# call vllm 
os.environ["HOSTED_VLLM_API_BASE"] = "https://hosted-vllm-api.co"
os.environ["HOSTED_VLLM_API_KEY"] = "" # [optional], if your VLLM server requires an API key
response = completion(
    model="hosted_vllm/qwen", # pass the vllm model name
    messages=messages,
)

# call gemini 
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"
response = completion(
    model="gemini/gemini-1.5-flash", # pass the gemini model name
    messages=messages,
)

print(response)

Configure config.yaml

model_list:
    - model_name: my-model
      litellm_params:
        model: hosted_vllm/qwen  # add hosted_vllm/ prefix to route as OpenAI provider
        api_base: https://hosted-vllm-api.co      # add api base for OpenAI compatible provider
    - model_name: my-gemini-model
      litellm_params:
        model: gemini/gemini-1.5-flash  # add gemini/ prefix to route as Google AI Studio provider
        api_key: os.environ/GEMINI_API_KEY

Start the proxy

$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Test it!

curl -X POST http://0.0.0.0:4000/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
    "model": "my-model",
    "messages": [
        {"role": "user", "content": 
            [
                {"type": "text", "text": "Summarize the following video"},
                {"type": "file", "file": {"file_id": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}}
            ]
        }
    ]
}'

Use this to send a video URL to VLLM in its native message format (video_url).

There are two ways to send a video URL to VLLM:

Pass the video URL directly

{"type": "video_url", "video_url": {"url": video_url}},

Pass the video data as base64

{"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_data_base64}"}}
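
Both variants of the native format differ only in what goes into `url`, so a small helper can cover them. The following is a sketch under stated assumptions: `video_url_part` is a hypothetical name, and byte input is assumed to be MP4 data.

```python
import base64
from typing import Union


def video_url_part(video: Union[str, bytes]) -> dict:
    """Build a vLLM-style 'video_url' content part (hypothetical helper).

    A str is treated as a URL and passed through unchanged; bytes are
    embedded as a base64 data URL, assuming the data is MP4.
    """
    if isinstance(video, bytes):
        encoded = base64.b64encode(video).decode("ascii")
        url = f"data:video/mp4;base64,{encoded}"
    else:
        url = video
    return {"type": "video_url", "video_url": {"url": url}}


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the following video"},
            video_url_part("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
        ],
    }
]
```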

SDK
PROXY

from litellm import completion

response = completion(
            model="hosted_vllm/qwen", # pass the vllm model name
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Summarize the following video"
                        },
                        {
                            "type": "video_url",
                            "video_url": {
                                "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
                            }
                        }
                    ]
                }
            ],
            api_base="https://hosted-vllm-api.co")

print(response)

Configure config.yaml

model_list:
    - model_name: my-model
      litellm_params:
        model: hosted_vllm/qwen  # add hosted_vllm/ prefix to route as OpenAI provider
        api_base: https://hosted-vllm-api.co      # add api base for OpenAI compatible provider

Start the proxy

$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Test it!

curl -X POST http://0.0.0.0:4000/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
    "model": "my-model",
    "messages": [
        {"role": "user", "content": 
            [
                {"type": "text", "text": "Summarize the following video"},
                {"type": "video_url", "video_url": {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}}
            ]
        }
    ]
}'

(Deprecated) for `vllm pip package`

Usage - `litellm.completion`

pip install litellm vllm

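import litellm

# example conversation (placeholder content)
messages = [{"role": "user", "content": "Hey, how's it going?"}]

response = litellm.completion(
    model="vllm/facebook/opt-125m",  # add a vllm prefix so litellm knows the custom_llm_provider==vllm
    messages=messages,
    temperature=0.2,
    max_tokens=80,
)

print(response)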

Batch Completion

from litellm import batch_completion

model_name = "facebook/opt-125m"
provider = "vllm"
messages = [[{"role": "user", "content": "Hey, how's it going"}] for _ in range(5)]

response_list = batch_completion(
            model=model_name, 
            custom_llm_provider=provider, # can easily switch to huggingface, replicate, together ai, sagemaker, etc.
            messages=messages,
            temperature=0.2,
            max_tokens=80,
        )
print(response_list)

Prompt Templates

For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.

What if we don't support a model you need? You can also specify your own custom prompt formatting, in case we don't have your model covered yet.

Does this mean you have to specify a prompt for all models? No. By default, we concatenate your message content to make a prompt (the expected format for Bloom, T-5, Llama-2 base models, etc.).

Default Prompt Template

def default_pt(messages):
    return " ".join(message["content"] for message in messages)
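
To make the default concrete, here is the same function applied to a short conversation. The sample messages are made up for illustration. Note that roles are dropped: only the `content` strings are joined, separated by single spaces.

```python
def default_pt(messages):
    # Join the content of every message with a single space; roles are ignored.
    return " ".join(message["content"] for message in messages)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hey, how's it going?"},
]

prompt = default_pt(messages)
print(prompt)  # You are a helpful assistant. Hey, how's it going?
```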

Code for how prompt templates work in LiteLLM

Models we already have Prompt Templates for

Model Name	Works for Models	Function Call
meta-llama/Llama-2-7b-chat	All meta-llama llama2 chat models	`completion(model='vllm/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")`
tiiuae/falcon-7b-instruct	All falcon instruct models	`completion(model='vllm/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")`
mosaicml/mpt-7b-chat	All mpt chat models	`completion(model='vllm/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")`
codellama/CodeLlama-34b-Instruct-hf	All codellama instruct models	`completion(model='vllm/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")`
WizardLM/WizardCoder-Python-34B-V1.0	All wizardcoder models	`completion(model='vllm/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")`
Phind/Phind-CodeLlama-34B-v2	All phind-codellama models	`completion(model='vllm/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")`

Custom prompt templates

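import litellm
from litellm import completion

# example conversation (placeholder content)
messages = [{"role": "user", "content": "Hey, how's it going?"}]

# Create your own custom prompt template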
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
            "system": {
                "pre_message": "[INST] <<SYS>>\n",
                "post_message": "\n<</SYS>>\n [/INST]\n"
            },
            "user": { 
                "pre_message": "[INST] ",
                "post_message": " [/INST]\n"
            }, 
            "assistant": {
                "pre_message": "\n",
                "post_message": "\n",
            }
        } # tell LiteLLM how you want to map the openai messages to this model
)

def test_vllm_custom_model():
    model = "vllm/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages)
    print(response['choices'][0]['message']['content'])
    return response

test_vllm_custom_model()
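
To show what the `pre_message` and `post_message` markers amount to, the sketch below wraps each message in the markers for its role and concatenates the results. It is an illustration only: `render_prompt` is a hypothetical function, and LiteLLM's actual formatting logic may differ in its details.

```python
def render_prompt(roles: dict, messages: list) -> str:
    """Wrap each message's content in its role's pre/post markers.

    Illustrative only; LiteLLM's real formatting logic may differ.
    """
    parts = []
    for message in messages:
        markers = roles.get(message["role"], {})  # unknown roles get no markers
        parts.append(
            markers.get("pre_message", "")
            + message["content"]
            + markers.get("post_message", "")
        )
    return "".join(parts)


roles = {
    "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n [/INST]\n"},
    "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
    "assistant": {"pre_message": "\n", "post_message": "\n"},
}

messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]

prompt = render_prompt(roles, messages)
print(repr(prompt))
```

Printing with `repr` makes the newlines inside the markers visible, which helps when checking a template against a model's expected format.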

Implementation code
