OpenAI Evals Using Phoenix

OpenAI Evals are used to evaluate LLM models and measure their accuracy. They help you compare your custom models with existing models, see how well your custom model performs, and make the necessary modifications and refinements.

If you are new to OpenAI Evals, I recommend going through the OpenAI Evals repo to get a taste of what Evals are actually like. It's like what Greg said here:

Role of Evals

From the OpenAI Evals repo: Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs.

Can you imagine writing an evaluation program for a complex model by yourself? You may spend hours creating an LLM model and have little room left to work on an evaluation program, since that can take more effort than creating the model itself. That is where the Evals framework helps: it lets you test LLM models to ensure their accuracy. You can use GPT-3.x or GPT-4 as the evaluator, based on your needs and on what your LLM model targets.

Building Evals

The OpenAI Evals repo has a good intro and detailed steps to create a custom eval for an arithmetic model (a minimal sketch of the sample format follows the list below):

  • Intro
  • Building an eval (using available options)
  • Building a custom eval
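
To make the arithmetic example more concrete, here is a minimal sketch of what the JSONL samples for such a custom eval typically look like. The file name, the system prompt, and the exact values are illustrative assumptions; see the "Building a custom eval" guide above for the authoritative format and the registry YAML entry that points at this file.

Python
import json

# A few arithmetic samples in the chat-style format that OpenAI Evals expects:
# each line holds an "input" (a list of chat messages) and an "ideal" answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with just the number."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with just the number."},
            {"role": "user", "content": "What is 7 * 6?"},
        ],
        "ideal": "42",
    },
]

# Write one JSON object per line (JSONL); the path is just an example.
with open("arithmetic_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")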

The links above pretty much cover what you need for running the available evals or your own custom evals. My take here is to help you use the Phoenix framework, which seems a bit easier than OpenAI Evals; Phoenix is actually built on top of the OpenAI Evals framework.

  • Phoenix Home
  • Repo
  • LLM evals documentation
  • How to
  • This LLM explanation will help you better understand.

Building Custom Evals

Building your own evals is the go-to way to compare your custom model with GPT-3.5 or GPT-4, so here are the steps for that.

Below are the steps that I followed and tested to evaluate my models:

Install Phoenix and related modules:

Python
=1" ipython matplotlib pycm scikit-learn tiktoken" data-lang="text/x-pythair force portalon">
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken

Make sure you have all imports covered.

Python
import os
from getpass import getpass
import matplotlib.pyplot as plt
import openai
import pandas as pd
# import phoenix libs
import phoenix.evals.templates.default_templates as templates
from phoenix.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

Prepare the data (or download the data sets). For example:

Python
df = download_benchmark_dataset(task="qa-classification", dataset_name="qa_generated_dataset")
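
Before sampling, it can help to peek at the downloaded dataframe so you know which columns you are working with. The column names mentioned in the comment below (question, context, sampled_answer, answer_true) are taken from the rename and evaluation steps later in this article; treat this as a quick sanity-check sketch rather than a guaranteed schema.

Python
# Quick look at the benchmark dataset before sampling.
print(df.shape)              # number of rows and columns
print(df.columns.tolist())   # expect columns such as question, context, sampled_answer, answer_true
print(df.head(3))            # first few rows for a visual check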

Set your OpenAI key.

Python
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

Prepare the data sets in the correct format expected by the evaluation prompt (input, reference, and output columns):

Python
# Number of rows to sample for the eval run; adjust to your needs.
N_EVAL_SAMPLE_SIZE = 100
df_sample = (
    df.sample(n=N_EVAL_SAMPLE_SIZE)
    .reset_index(drop=True)
    .rename(
        columns={
            "question": "input",
            "context": "reference",
            "sampled_answer": "output",
        }
    )
)
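
As a quick check that the rename produced the columns the Q&A evaluation template expects (input, reference, and output), you can inspect the sampled frame. This is just a sanity-check sketch:

Python
# Confirm the renamed columns match what the evaluation template expects.
print(df_sample[["input", "reference", "output"]].head(3))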

Set and load the model for running evals.

Python
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

Run your custom evals.

Python
rails = list(templates.QA_PROMPT_RAILS_MAP.values())
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

Evaluate the above predictions against the pre-defined labels.

Python
true_labels = df_sample["answer_true"].map(templates.QA_PROMPT_RAILS_MAP).tolist()
print(classification_report(true_labels, Q_and_A_classifications, labels=rails))

Create a confusion matrix and plot it to get a better picture.

Python
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
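
Depending on where you run this (a notebook versus a plain script), you may want to explicitly save or render the figure. A minimal sketch using standard matplotlib calls; the file name is just an example:

Python
# Optionally persist the confusion matrix figure, then display it.
plt.savefig("qa_confusion_matrix.png", dpi=150, bbox_inches="tight")
plt.show()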

Note: You can set the model to "gpt-3.5-turbo" in the model setup step above and run evals against GPT-3.5, or against any other model you want to use when evaluating your custom model.
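
For example, re-running the same evaluation scored by GPT-3.5 only requires swapping the model in the setup step; the template, rails, and dataframe stay the same. A minimal sketch that mirrors the code above:

Python
# Same evaluation, but scored by gpt-3.5-turbo instead of gpt-4.
model_35 = OpenAIModel(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
)
gpt35_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model_35,
    rails=rails,
    concurrency=20,
)["label"].tolist()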

Here is a helpful link to the Google Colab that I followed, where you can find good step-by-step instructions:

PS: The code and steps I have mentioned in this article are based on this Colab notebook.

Here is a good article by Aparna Dhinakaran (co-founder and CPO of Arize AI and a Phoenix contributor) about Evals and Phoenix.

Conclusion

I hope this article helped you understand how evals can be implemented for custom models. I would be happy if you got at least some insights about evals and some motivation to create your own! All the best with your trials and experiments.