Using Audio Models

How to send and receive audio through a /chat/completions endpoint

Audio output from a model

Example of generating a human-like audio response to a prompt

LiteLLM Python SDK

import os
import base64
import litellm

os.environ["OPENAI_API_KEY"] = "your-api-key"

# OpenAI call (synchronous; inside an async function, use `await litellm.acompletion(...)` instead)
completion = litellm.completion(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Is a golden retriever a good family dog?"}],
)

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
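
The decode-and-write step above trusts that the response carries valid WAV data. As a minimal sketch of a more defensive variant, the helper below validates the base64 payload and checks for the RIFF/WAVE header that standard WAV files start with. The name `save_wav_from_base64` is hypothetical and not part of LiteLLM; the header check is an assumption that holds for ordinary WAV files.

```python
import base64
import binascii
from pathlib import Path


def save_wav_from_base64(b64_audio: str, path: str) -> int:
    """Decode base64 audio, sanity-check the WAV header, and write it to disk.

    Hypothetical helper (not a LiteLLM API). Returns the number of bytes written.
    """
    try:
        wav_bytes = base64.b64decode(b64_audio, validate=True)
    except binascii.Error as exc:
        raise ValueError("audio data is not valid base64") from exc

    # Standard WAV files begin with "RIFF", four size bytes, then "WAVE".
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("decoded data does not look like a WAV file")

    Path(path).write_bytes(wav_bytes)
    return len(wav_bytes)
```

With a response from the call above, this could be used as `save_wav_from_base64(completion.choices[0].message.audio.data, "dog.wav")`.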

LiteLLM Proxy Server

Define an audio model in config.yaml

model_list:
  - model_name: gpt-4o-audio-preview # OpenAI gpt-4o-audio-preview
    litellm_params:
      model: openai/gpt-4o-audio-preview
      api_key: os.environ/OPENAI_API_KEY 

Run the proxy server

litellm --config config.yaml

Test it using the OpenAI Python SDK

import base64
from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

print(completion.choices[0])

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)

Audio input to a model

LiteLLM Python SDK

import base64
import requests
import litellm

url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode("utf-8")

completion = litellm.completion(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this recording?"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded_string, "format": "wav"},
                },
            ],
        },
    ],
)

print(completion.choices[0].message)
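
The example above downloads a sample file, but the same message shape works for audio bytes from any source. The sketch below wraps the encoding and message construction in one function. `build_audio_message` is a hypothetical helper, not a LiteLLM function; it only reproduces the `input_audio` structure shown above.

```python
import base64


def build_audio_message(audio_bytes: bytes, prompt: str, audio_format: str = "wav") -> dict:
    """Build a user message with a text part and a base64-encoded input_audio part.

    Hypothetical helper (not a LiteLLM API); mirrors the message shape used above.
    """
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }


# Usage sketch (assumes a local file named recording.wav exists):
# with open("recording.wav", "rb") as f:
#     messages = [build_audio_message(f.read(), "What is in this recording?")]
```

The returned dict can be passed directly inside the `messages` list of `litellm.completion(...)`.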

LiteLLM Proxy Server

Define an audio model in config.yaml

model_list:
  - model_name: gpt-4o-audio-preview # OpenAI gpt-4o-audio-preview
    litellm_params:
      model: openai/gpt-4o-audio-preview
      api_key: os.environ/OPENAI_API_KEY 

Run the proxy server

litellm --config config.yaml

Test it using the OpenAI Python SDK

import base64
import requests
from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)


# Fetch the audio file and convert it to a base64 encoded string
url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)

Checking whether a model supports `audio_input` and `audio_output`

LiteLLM Python SDK

Use litellm.supports_audio_output(model="") -> returns True if the model can generate audio output

Use litellm.supports_audio_input(model="") -> returns True if the model can accept audio input

import litellm

assert litellm.supports_audio_output(model="gpt-4o-audio-preview") == True
assert litellm.supports_audio_input(model="gpt-4o-audio-preview") == True

assert litellm.supports_audio_output(model="gpt-3.5-turbo") == False
assert litellm.supports_audio_input(model="gpt-3.5-turbo") == False

LiteLLM Proxy Server

Define audio models in config.yaml

model_list:
  - model_name: gpt-4o-audio-preview # OpenAI gpt-4o-audio-preview
    litellm_params:
      model: openai/gpt-4o-audio-preview
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llava-hf          # Custom OpenAI compatible model
    litellm_params:
      model: openai/llava-hf/llava-v1.6-vicuna-7b-hf
      api_base: http://:8000
      api_key: fake-key
    model_info:
      supports_audio_output: True        # set supports_audio_output to True so /model/info returns this attribute as True
      supports_audio_input: True         # set supports_audio_input to True so /model/info returns this attribute as True

Run the proxy server

litellm --config config.yaml

Call /model_group/info to check whether your model supports audio

curl -X 'GET' \
  'https://:4000/model_group/info' \
  -H 'accept: application/json' \
  -H 'x-api-key: sk-1234'

Expected response

{
  "data": [
    {
      "model_group": "gpt-4o-audio-preview",
      "providers": ["openai"],
      "max_input_tokens": 128000,
      "max_output_tokens": 16384,
      "mode": "chat",
      "supports_audio_output": true, # 👈 supports_audio_output is true
      "supports_audio_input": true, # 👈 supports_audio_input is true
    },
    {
      "model_group": "llava-hf",
      "providers": ["openai"],
      "max_input_tokens": null,
      "max_output_tokens": null,
      "mode": null,
      "supports_audio_output": true, # 👈 supports_audio_output is true
      "supports_audio_input": true, # 👈 supports_audio_input is true
    }
  ]
}
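
Once the response above has been parsed into a Python dict, the audio flags can be filtered programmatically. The following is a minimal sketch: `audio_capable_groups` is a hypothetical helper, and it assumes only the `data`, `model_group`, `supports_audio_input` and `supports_audio_output` fields shown in the expected response.

```python
from __future__ import annotations


def audio_capable_groups(
    model_group_info: dict,
    need_input: bool = True,
    need_output: bool = True,
) -> list[str]:
    """Return the model groups whose audio flags satisfy the requested capabilities.

    Hypothetical helper; expects the parsed JSON body of /model_group/info.
    Missing or null flags are treated as "not supported".
    """
    matches = []
    for entry in model_group_info.get("data", []):
        has_input = bool(entry.get("supports_audio_input"))
        has_output = bool(entry.get("supports_audio_output"))
        if need_input and not has_input:
            continue
        if need_output and not has_output:
            continue
        matches.append(entry["model_group"])
    return matches
```

For the expected response shown above, the function would return both `gpt-4o-audio-preview` and `llava-hf`, since both entries set the two flags to true.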

Response format with audio

Below is an example JSON data structure for a message you might receive from a /chat/completions endpoint when sending audio input to a model.

{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null,
    "refusal": null,
    "audio": {
      "id": "audio_abc123",
      "expires_at": 1729018505,
      "data": "<bytes omitted>",
      "transcript": "Yes, golden retrievers are known to be ..."
    }
  },
  "finish_reason": "stop"
}

audio: If the audio output modality is requested, this object contains data about the model's audio response.
- audio.id: Unique identifier for the audio response.
- audio.expires_at: The Unix timestamp (in seconds) after which this audio response is no longer accessible on the server for use in multi-turn conversations.
- audio.data: Base64-encoded audio bytes generated by the model, in the format specified in the request.
- audio.transcript: Transcript of the audio generated by the model.
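
The fields described above can be pulled into a small typed structure before use. The sketch below assumes the choice is available as a plain dict shaped like the JSON example in this section; `AudioReply` and `parse_audio_reply` are hypothetical names, not part of LiteLLM or the OpenAI SDK.

```python
import base64
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioReply:
    audio_id: str
    expires_at: int      # Unix timestamp in seconds
    audio_bytes: bytes   # decoded from the base64 "data" field
    transcript: str

    def is_expired(self, now: Optional[float] = None) -> bool:
        """True once the server-side audio can no longer be referenced."""
        current = time.time() if now is None else now
        return current >= self.expires_at


def parse_audio_reply(choice: dict) -> Optional[AudioReply]:
    """Extract the audio object from a choice dict, or None if no audio is present."""
    audio = (choice.get("message") or {}).get("audio")
    if not audio:
        return None
    return AudioReply(
        audio_id=audio["id"],
        expires_at=int(audio["expires_at"]),
        audio_bytes=base64.b64decode(audio["data"]),
        transcript=audio.get("transcript", ""),
    )
```

Checking `is_expired()` before reusing `audio_id` in a later turn avoids referencing an audio response the server has already dropped.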
