Bringing Evals into the Langchain::Assistant #853
andreibondarev started this conversation in Ideas

It's time to introduce a lightweight way to run evals on the Langchain::Assistant execution output. Regardless of which metrics are being evaluated, I'd like to figure out a good DSL for how evals are integrated into the Langchain::Assistant.

Agent interactions generally follow this pattern: given a collection of AI agent inputs and corresponding ideal outputs, we should be able to run the AI agent through this dataset and compare its actual outputs against the ideal ones.

A few questions to consider:
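To make the dataset-and-compare loop concrete, here is a minimal sketch of what an eval run could look like. Everything eval-specific below (the dataset shape, the metric, the reporting) is a hypothetical proposal for discussion, not an existing Langchain::Evals API; the assistant calls themselves follow langchainrb's documented `Langchain::Assistant` interface (`add_message_and_run!`, `messages`).

```ruby
require "langchain"

llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])

# A dataset is a collection of inputs paired with ideal outputs.
dataset = [
  { input: "What is the capital of France?", ideal_output: "Paris" },
  { input: "What is 2 + 2?",                 ideal_output: "4" }
]

results = dataset.map do |example|
  # A fresh assistant per example keeps runs independent of each other.
  assistant = Langchain::Assistant.new(
    llm: llm,
    instructions: "Answer as concisely as possible."
  )
  assistant.add_message_and_run!(content: example[:input])
  actual = assistant.messages.last.content

  {
    input:  example[:input],
    ideal:  example[:ideal_output],
    actual: actual,
    # Placeholder metric: case-insensitive substring match. A real harness
    # would make this pluggable (exact match, embedding similarity,
    # LLM-as-judge, etc.).
    pass:   actual.to_s.downcase.include?(example[:ideal_output].downcase)
  }
end

pass_rate = results.count { |r| r[:pass] }.fdiv(results.size)
puts "Passed #{(pass_rate * 100).round(1)}% of #{results.size} examples"
```

Whatever surface syntax we settle on, the moving parts are probably these three: a dataset of input/ideal pairs, a pluggable metric, and a runner that executes the assistant once per example and aggregates scores. A more declarative DSL over the same parts might look like this (again, `Langchain::Evals` and its methods are hypothetical names, not part of the library):

```ruby
# Hypothetical declarative shape for the same eval run.
Langchain::Evals.run(assistant) do
  dataset "spec/fixtures/golden_qa.jsonl"   # input/ideal_output pairs
  metric :exact_match
  metric :llm_judge, model: "gpt-4o", threshold: 0.8
end
```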
Replies: 2 comments 2 replies
- @bborn I created this discussion thread to talk about how we could integrate the evals. Maybe we could flesh things out here before implementing?
- @bborn Take a glance: #855 (comment)