Evaluate LLMs - MLflow Evals, Auto Eval

Using LiteLLM with MLflow

MLflow provides an API, mlflow.evaluate(), that helps you evaluate your LLMs: https://mlflow.org/docs/latest/llms/llm-evaluate/index.html

Prerequisites

pip install litellm

pip install mlflow

Step 1: Start the LiteLLM proxy from the CLI

LiteLLM lets you create an OpenAI-compatible server for any supported LLM. See the LiteLLM proxy documentation for more information.

$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000
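Because the proxy is OpenAI-compatible, any HTTP client should be able to talk to it. The following is a minimal sketch, assuming the proxy started above listens on http://0.0.0.0:8000 and serves the usual OpenAI-style /chat/completions route. The helper names build_chat_request and ask_proxy are hypothetical, and nothing is sent over the network unless ask_proxy is called.

```python
import json
import urllib.request

PROXY_URL = "http://0.0.0.0:8000"  # address printed by the proxy above


def build_chat_request(question, model="huggingface/bigcode/starcoder", base_url=PROXY_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",  # assumed route, OpenAI convention
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_proxy(question):
    """Send the request and return the reply text; requires a running proxy."""
    with urllib.request.urlopen(build_chat_request(question)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Only builds the request, so this runs even when no proxy is up.
request = build_chat_request("Write a hello world function in Python")
print(request.full_url)
```

Calling ask_proxy("...") once the proxy is up would perform the actual round trip.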

To create the proxy for other supported LLMs:

$ export AWS_ACCESS_KEY_ID=""
$ export AWS_REGION_NAME="" # e.g. us-west-2
$ export AWS_SECRET_ACCESS_KEY=""

$ litellm --model bedrock/anthropic.claude-v2

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

$ litellm --model huggingface/<your model name> --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

$ export ANTHROPIC_API_KEY=my-api-key

$ litellm --model claude-instant-1

Assuming you are running vLLM locally:

$ litellm --model vllm/facebook/opt-125m

$ litellm --model openai/<model_name> --api_base <your-api-base>

$ export TOGETHERAI_API_KEY=my-api-key

$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

$ export REPLICATE_API_KEY=my-api-key

$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

$ export PALM_API_KEY=my-palm-key

$ litellm --model palm/chat-bison

$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base

$ litellm --model azure/my-deployment-name

$ export AI21_API_KEY=my-api-key

$ litellm --model j2-light

$ export COHERE_API_KEY=my-api-key

$ litellm --model command-nightly
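Each provider above expects different environment variables, so it can help to check them before launching the proxy. The sketch below only encodes the variables shown in the commands above; the REQUIRED_ENV table and the missing_env helper are hypothetical and not part of LiteLLM.

```python
import os

# Environment variables each example above exports before starting the proxy.
# Keys are model-string prefixes (or full model names) taken from those commands.
REQUIRED_ENV = {
    "bedrock/": ["AWS_ACCESS_KEY_ID", "AWS_REGION_NAME", "AWS_SECRET_ACCESS_KEY"],
    "claude-instant-1": ["ANTHROPIC_API_KEY"],
    "together_ai/": ["TOGETHERAI_API_KEY"],
    "replicate/": ["REPLICATE_API_KEY"],
    "palm/": ["PALM_API_KEY"],
    "azure/": ["AZURE_API_KEY", "AZURE_API_BASE"],
    "j2-light": ["AI21_API_KEY"],
    "command-nightly": ["COHERE_API_KEY"],
}


def missing_env(model, environ=None):
    """Return the variables that are unset or empty for the given model string."""
    environ = os.environ if environ is None else environ
    for prefix, names in REQUIRED_ENV.items():
        if model.startswith(prefix):
            return [name for name in names if not environ.get(name)]
    return []  # no variables are listed above for this model


print(missing_env("azure/my-deployment-name", {"AZURE_API_KEY": "my-api-key"}))
# prints ['AZURE_API_BASE']
```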

Step 2: Run MLflow

Before running the evaluation, set openai.api_base to the LiteLLM proxy from Step 1.

openai.api_base = "http://0.0.0.0:8000"

import mlflow
import openai
import pandas as pd

openai.api_key = "anything"             # this can be anything, we set the key on the proxy
openai.api_base = "http://0.0.0.0:8000" # set api base to the proxy from step 1

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is the largest country",
            "What is the weather in sf?",
        ],
        "ground_truth": [
            "India is a large country",
            "It's cold in SF today"
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")
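The logged model uses a message template whose user turn contains the placeholder {question}. As a rough illustration of what filling that placeholder means for one row of the evaluation data, here is a sketch using plain string formatting. The render_messages helper is hypothetical, and MLflow's own substitution logic may differ in detail.

```python
SYSTEM_PROMPT = "Answer the following question in two sentences"

# Same message template as passed to mlflow.openai.log_model above.
TEMPLATE = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "{question}"},
]


def render_messages(template, question):
    """Return a copy of the template with {question} filled in."""
    return [
        {"role": msg["role"], "content": msg["content"].format(question=question)}
        for msg in template
    ]


# One row from the "inputs" column of eval_data.
messages = render_messages(TEMPLATE, "What is the largest country")
for msg in messages:
    print(f"{msg['role']}: {msg['content']}")
```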

MLflow output

{'toxicity/v1/mean': 0.00014476531214313582, 'toxicity/v1/variance': 2.5759661361262862e-12, 'toxicity/v1/p90': 0.00014604929747292773, 'toxicity/v1/ratio': 0.0, 'exact_match/v1': 0.0}
Downloading artifacts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1890.18it/s]
See evaluation table below:
                        inputs              ground_truth                                            outputs  token_count  toxicity/v1/score
0  What is the largest country  India is a large country   Russia is the largest country in the world in...           14           0.000146
1   What is the weather in sf?     It's cold in SF today   I'm sorry, I cannot provide the current weath...           36           0.000143
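The exact_match/v1 value of 0.0 reflects that neither model output is identical to its ground truth. A minimal sketch of such a metric, assuming plain string equality (MLflow's implementation may normalize or compare differently), could look like this. The output strings are copied from the table above and are truncated there as well.

```python
def exact_match(outputs, targets):
    """Fraction of outputs that are string-identical to their target."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not outputs:
        return 0.0
    hits = sum(out == tgt for out, tgt in zip(outputs, targets))
    return hits / len(outputs)


# Values as displayed in the evaluation table (outputs are truncated there).
outputs = [
    "Russia is the largest country in the world in...",
    "I'm sorry, I cannot provide the current weath...",
]
targets = ["India is a large country", "It's cold in SF today"]

print(exact_match(outputs, targets))  # prints 0.0
```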

Using LiteLLM with AutoEvals

AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices. https://github.com/braintrustdata/autoevals

Prerequisites

pip install litellm

pip install autoevals

Quick start

This code example uses the Factuality() evaluator from autoevals.llm to test whether an output is factually correct compared to an original (expected) value.

By default, AutoEvals uses gpt-3.5-turbo / gpt-4-turbo to evaluate responses.

See the AutoEvals documentation for the supported evaluators: translation, summarization, security evaluators, and more.

# autoevals imports
from autoevals.llm import Factuality
import litellm

# litellm completion call
question = "which country has the highest population"
response = litellm.completion(
    model = "gpt-3.5-turbo",
    messages = [
        {
            "role": "user",
            "content": question
        }
    ],
)
print(response)
# use the auto eval Factuality() evaluator
evaluator = Factuality()
result = evaluator(
    output=response.choices[0]["message"]["content"],       # response from litellm.completion()
    expected="India",                                       # expected output
    input=question                                          # question passed to litellm.completion
)

print(result)

Evaluation output from AutoEvals

Score(
    name='Factuality', 
    score=0, 
    metadata=
        {'rationale': "The expert answer is 'India'.\nThe submitted answer is 'As of 2021, China has the highest population in the world with an estimated 1.4 billion people.'\nThe submitted answer mentions China as the country with the highest population, while the expert answer mentions India.\nThere is a disagreement between the submitted answer and the expert answer.", 
        'choice': 'D'
        }, 
    error=None
)
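To act on a result like the one above, it can be useful to turn the score into a pass/fail verdict. The sketch below uses a hypothetical ScoreLike stand-in with the same fields as the printed Score, so it runs without autoevals installed. The 0.5 threshold is an assumption, not an AutoEvals default.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreLike:
    """Hypothetical stand-in with the same fields as the Score printed above."""
    name: str
    score: float
    metadata: dict = field(default_factory=dict)
    error: object = None


def summarize(result, threshold=0.5):
    """Return a one-line verdict; the threshold is an illustrative assumption."""
    if result.error is not None:
        return f"{result.name}: error ({result.error})"
    passed = result.score is not None and result.score >= threshold
    verdict = "pass" if passed else "fail"
    choice = result.metadata.get("choice", "n/a")
    return f"{result.name}: {verdict} (score={result.score}, choice={choice})"


example = ScoreLike(name="Factuality", score=0, metadata={"choice": "D"})
print(summarize(example))  # prints Factuality: fail (score=0, choice=D)
```

Since summarize only reads the name, score, metadata, and error attributes, it should also accept the object returned by the evaluator call above.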
