I would like to share my experience using Microsoft.Extensions.AI.Evaluation with Langfuse and discuss if there are improvements to be made.

## Langfuse Concepts

### Dataset
A dataset is a collection of inputs and expected outputs and is used to test your application.
Here is a dataset named "Test" that has 3 items. No outputs are defined for now because I'm going to use the generic MEAIE built-in evaluators.
### Dataset run
A dataset is run against some user-defined code, and scores can be pushed to traces associated with a dataset run item. An entire dataset run can also be scored.
Here is a dataset run named "run-name-5" with 3 dataset run items. One of the items has a score named "score-name" with the value 123.456 and a comment.
## MEAIE Integration
### Terminology
Here is the mapping from my understanding:

| Langfuse | MEAIE |
| --- | --- |
| Dataset | Scenario |
| Dataset run | Execution |
| Dataset run item | Iteration |
| Score | ScenarioRunResult |
I first thought I could have one execution per dataset run item, but I understand that `ScenarioRun.EvaluateAsync` should be called only once per execution × iteration pair. So I'm associating an iteration with a dataset run item, which is a little weird.
### Download dataset
MEAIE doesn't deal with that. In an NUnit `TestCaseSource`, I download all the dataset items.
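A minimal sketch of that `TestCaseSource` (the dataset-items endpoint, its query shape, and the shared `Client` instance are assumptions about the generated Langfuse client):

```csharp
private const string DatasetName = "Test";

// NUnit enumerates test cases synchronously, so the async download is blocked on.
// Pagination is ignored for brevity; endpoint and query parameter names are assumed.
private static IEnumerable<DatasetItem> TestCases()
{
    var page = Client.Api.Public.DatasetItems
        .GetAsync(r => r.QueryParameters.DatasetName = DatasetName)
        .GetAwaiter()
        .GetResult();
    return page!.Data!;
}
```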
### Run evaluation

From this dataset, I run this evaluation:

```csharp
[TestCaseSource(nameof(TestCases))]
public async Task Scenario(DatasetItem item)
{
    await using var scenarioRun = await ReportingConfiguration.CreateScenarioRunAsync(
        scenarioName: DatasetName,
        iterationName: item.Id!,
        additionalTags: []);

    ChatMessage[] messages =
    [
        new(ChatRole.System,
            """
            You're an AI assistant that can answer questions related to astronomy.
            Keep your responses concise and under 100 words.
            Use the imperial measurement system for all measurements in your response.
            """),
        // https://github.com/orgs/langfuse/discussions/9355
        new(ChatRole.User, ((UntypedString)item.Input!).GetValue()),
    ];

    ChatResponse response;
    ActivityTraceId traceId;
    using (var span = ActivitySource.StartActivity()!)
    {
        span.SetTag("gen_ai.input.messages", ((UntypedString)item.Input!).GetValue());
        response = await scenarioRun.ChatConfiguration!.ChatClient.GetResponseAsync(
            messages, new ChatOptions { Temperature = 0 });
        traceId = span.TraceId;
        span.SetTag("gen_ai.output.messages", response.Text);
    }

    EvaluationResult result = await scenarioRun.EvaluateAsync(
        messages, response, [new LangfuseEvaluationContext(traceId)]);

    // Assert some stuff for local dev, but what matters is the data written to Langfuse
    NumericMetric relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
    Assert.That(relevance.Interpretation!.Failed, Is.False, relevance.Reason);
    Assert.That(relevance.Interpretation.Rating, Is.AnyOf(EvaluationRating.Good, EvaluationRating.Exceptional));

    NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
    Assert.That(coherence.Interpretation!.Failed, Is.False, coherence.Reason);
    Assert.That(coherence.Interpretation.Rating, Is.AnyOf(EvaluationRating.Good, EvaluationRating.Exceptional));
}
```
Note that creating a span is very important, as the chat history is saved as-is, but only on the trace. So an OTEL SDK is set up somewhere that exports the traces to Langfuse.
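For reference, a minimal sketch of such a setup, assuming OTLP export over HTTP to Langfuse with basic auth (the environment variable names and service name are placeholders):

```csharp
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var publicKey = Environment.GetEnvironmentVariable("LANGFUSE_PUBLIC_KEY")!;
var secretKey = Environment.GetEnvironmentVariable("LANGFUSE_SECRET_KEY")!;

// Langfuse authenticates OTLP requests with base64("publicKey:secretKey").
var authHeader = Convert.ToBase64String(
    System.Text.Encoding.UTF8.GetBytes($"{publicKey}:{secretKey}"));

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(r => r.AddService("astronomy-evals")) // placeholder service name
    .AddSource(ActivitySource.Name) // the ActivitySource the test starts spans on
    .AddOtlpExporter(o =>
    {
        o.Endpoint = new Uri("https://cloud.langfuse.com/api/public/otel/v1/traces");
        o.Protocol = OtlpExportProtocol.HttpProtobuf;
        o.Headers = $"Authorization=Basic {authHeader}";
    })
    .Build();
```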
### Set up

`LangfuseReportingConfiguration` is defined as:

```csharp
public static class LangfuseReportingConfiguration
{
    public static ReportingConfiguration Create(
        LangfuseClient client,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration = null,
        bool enableResponseCaching = true,
        TimeSpan? timeToLiveForCacheEntries = null,
        IEnumerable<string>? cachingKeys = null,
        string? runName = null,
        Func<EvaluationMetric, EvaluationMetricInterpretation?>? evaluationMetricInterpreter = null,
        IEnumerable<string>? tags = null)
    {
        LangfuseResultStore resultStore = new(client);
        runName ??= $"{DateTime.Now:yyyyMMddTHHmmss}";
        return new ReportingConfiguration(
            evaluators,
            resultStore,
            chatConfiguration,
            responseCacheProvider: null,
            cachingKeys: null,
            executionName: runName,
            evaluationMetricInterpreter,
            tags);
    }

    private class LangfuseResultStore(LangfuseClient client) : IEvaluationResultStore
    {
        private readonly LangfuseClient _client = client;

        public async ValueTask WriteResultsAsync(
            IEnumerable<ScenarioRunResult> results,
            CancellationToken cancellationToken = default)
        {
            foreach (var result in results)
            {
                if (!result.EvaluationResult.TryGet<EvaluationMetric>(
                        LangfuseEvaluator.LangfuseMetricName, out var langfuseMetric))
                {
                    throw new InvalidOperationException(
                        $"A {nameof(LangfuseEvaluator)} must be added to the {nameof(LangfuseReportingConfiguration)}");
                }

                result.EvaluationResult.Metrics.Remove(LangfuseEvaluator.LangfuseMetricName);
                var langfuseContext = (LangfuseEvaluationContext)langfuseMetric
                    .Context![LangfuseEvaluationContext.LangfuseContextName];

                await _client.Api.Public.DatasetRunItems.PostAsync(
                    new CreateDatasetRunItemRequest
                    {
                        DatasetItemId = result.IterationName,
                        RunName = result.ExecutionName,
                        RunDescription = "",
                        TraceId = langfuseContext.TraceId.ToString(),
                    },
                    cancellationToken: cancellationToken);

                foreach (var metric in result.EvaluationResult.Metrics)
                {
                    await _client.Api.Public.Scores.PostAsync(
                        new CreateScoreRequest
                        {
                            Name = metric.Value.Name,
                            TraceId = langfuseContext.TraceId.ToString(),
                            DataType = ScoreDataType.CATEGORICAL,
                            Value = new CreateScoreValue { String = metric.Value.Interpretation!.Rating.ToString() },
                            Comment = metric.Value.Reason,
                            Metadata = metric.Value.Metadata == null
                                ? null
                                : new UntypedObject(metric.Value.Metadata.ToDictionary(
                                    m => m.Key,
                                    UntypedNode (m) => new UntypedString(m.Value))),
                        },
                        cancellationToken: cancellationToken);
                }
            }
        }
    }
}
```
Question: only `IEvaluationResultStore.WriteResultsAsync` seems to matter? For which use cases are the other methods?
### Pass trace id
To pass the trace id, I created a dummy evaluator:
```csharp
public class LangfuseEvaluationContext : EvaluationContext
{
    public static string LangfuseContextName => "Langfuse";

    public LangfuseEvaluationContext(ActivityTraceId traceId)
        : base(LangfuseContextName, "")
    {
        TraceId = traceId;
    }

    public ActivityTraceId TraceId { get; }
}

public class LangfuseEvaluator : IEvaluator
{
    public static string LangfuseMetricName => "Langfuse";

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var langfuseCtx = additionalContext?.OfType<LangfuseEvaluationContext>().FirstOrDefault()
            ?? throw new InvalidOperationException(
                $"A {nameof(LangfuseEvaluationContext)} must be added to {nameof(ScenarioRun)}.{nameof(EvaluateAsync)}");

        EvaluationMetric metric = new(LangfuseMetricName);
        metric.AddOrUpdateContext(langfuseCtx);
        return ValueTask.FromResult(new EvaluationResult(metric));
    }

    public IReadOnlyCollection<string> EvaluationMetricNames { get; } = [LangfuseMetricName];
}
```
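For context, a sketch of how this wires into `LangfuseReportingConfiguration.Create` above (the evaluator list and variable names are illustrative):

```csharp
// The dummy LangfuseEvaluator rides along with the real evaluators so that
// every ScenarioRunResult carries the trace id when WriteResultsAsync runs.
ReportingConfiguration = LangfuseReportingConfiguration.Create(
    langfuseClient,
    evaluators:
    [
        new LangfuseEvaluator(),
        new RelevanceEvaluator(),
        new CoherenceEvaluator(),
    ],
    chatConfiguration: new ChatConfiguration(chatClient),
    runName: "run-name-5");
```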
Problem 1: Is there a better way to do that? Ideally, I think I would like to pass arbitrary context to `ScenarioRun.EvaluateAsync` that would be available in `ScenarioRunResult`.