I would like to share my experience using Microsoft.Extensions.AI.Evaluation with Langfuse and discuss if there are improvements to be made.

## Langfuse Concepts

### Dataset
A dataset is a collection of inputs and expected outputs and is used to test your application.
Here is a dataset named "Test" that has 3 items. No outputs are defined for now because I'm going to use the generic MEAIE built-in evaluators.
### Dataset run
A dataset is run against some user-defined code, and scores can be pushed to traces associated with a dataset run item. An entire dataset run can also be scored.
Here is a dataset run named "run-name-5" with 3 dataset run items. One of the items has a score named "score-name" with the value 123.456 and a comment.
## MEAIE Integration
### Terminology
Here is the mapping from my understanding:

| Langfuse | MEAIE |
| --- | --- |
| Dataset | Scenario |
| Dataset run | Execution |
| Dataset run item | Iteration |
| Score | ScenarioRunResult |
I first thought I could have one execution per dataset run item, but I understand that `ScenarioRun.EvaluateAsync` should be called only once per execution × iteration pair. So I'm associating an iteration with a dataset run item, which is a little weird.
### Download dataset
MEAIE doesn't deal with that. In an NUnit `TestCaseSource`, I download all the dataset items.
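A minimal sketch of that `TestCaseSource` (the dataset-items endpoint, its query shape, and the shared `Client` instance are assumptions about the generated Langfuse client):

```csharp
private const string DatasetName = "Test";

// NUnit enumerates test cases synchronously, so the async download is blocked on.
// Pagination is ignored for brevity; endpoint and query parameter names are assumed.
private static IEnumerable<DatasetItem> TestCases()
{
    var page = Client.Api.Public.DatasetItems
        .GetAsync(r => r.QueryParameters.DatasetName = DatasetName)
        .GetAwaiter()
        .GetResult();
    return page!.Data!;
}
```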
### Run evaluation

From this dataset, I run this evaluation:

```csharp
[TestCaseSource(nameof(TestCases))]
public async Task Scenario(DatasetItem item)
{
    await using var scenarioRun = await ReportingConfiguration.CreateScenarioRunAsync(
        scenarioName: DatasetName,
        iterationName: item.Id!,
        additionalTags: []);

    ChatMessage[] messages =
    [
        new(ChatRole.System,
            """
            You're an AI assistant that can answer questions related to astronomy.
            Keep your responses concise and under 100 words.
            Use the imperial measurement system for all measurements in your response.
            """),
        // https://github.com/orgs/langfuse/discussions/9355
        new(ChatRole.User, ((UntypedString)item.Input!).GetValue()),
    ];

    ChatResponse response;
    ActivityTraceId traceId;
    using (var span = ActivitySource.StartActivity()!)
    {
        span.SetTag("gen_ai.input.messages", ((UntypedString)item.Input!).GetValue());
        response = await scenarioRun.ChatConfiguration!.ChatClient.GetResponseAsync(
            messages, new ChatOptions { Temperature = 0 });
        traceId = span.TraceId;
        span.SetTag("gen_ai.output.messages", response.Text);
    }

    EvaluationResult result = await scenarioRun.EvaluateAsync(
        messages, response, [new LangfuseEvaluationContext(traceId)]);

    // Assert some stuff for local dev, but what matters is the data written to Langfuse
    NumericMetric relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
    Assert.That(relevance.Interpretation!.Failed, Is.False, relevance.Reason);
    Assert.That(relevance.Interpretation.Rating, Is.AnyOf(EvaluationRating.Good, EvaluationRating.Exceptional));

    NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
    Assert.That(coherence.Interpretation!.Failed, Is.False, coherence.Reason);
    Assert.That(coherence.Interpretation.Rating, Is.AnyOf(EvaluationRating.Good, EvaluationRating.Exceptional));
}
```
Note that creating a span is very important, as the chat history is saved as-is, but only on the trace. So an OTEL SDK is set up somewhere that exports the traces to Langfuse.
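For reference, a minimal sketch of such a setup, assuming OTLP export over HTTP to Langfuse with basic auth (the environment variable names and service name are placeholders):

```csharp
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var publicKey = Environment.GetEnvironmentVariable("LANGFUSE_PUBLIC_KEY")!;
var secretKey = Environment.GetEnvironmentVariable("LANGFUSE_SECRET_KEY")!;

// Langfuse authenticates OTLP requests with base64("publicKey:secretKey").
var authHeader = Convert.ToBase64String(
    System.Text.Encoding.UTF8.GetBytes($"{publicKey}:{secretKey}"));

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(r => r.AddService("astronomy-evals")) // placeholder service name
    .AddSource(ActivitySource.Name) // the ActivitySource the test starts spans on
    .AddOtlpExporter(o =>
    {
        o.Endpoint = new Uri("https://cloud.langfuse.com/api/public/otel/v1/traces");
        o.Protocol = OtlpExportProtocol.HttpProtobuf;
        o.Headers = $"Authorization=Basic {authHeader}";
    })
    .Build();
```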
### Set up

`LangfuseReportingConfiguration` is defined as:

```csharp
public static class LangfuseReportingConfiguration
{
    public static ReportingConfiguration Create(
        LangfuseClient client,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration = null,
        bool enableResponseCaching = true,
        TimeSpan? timeToLiveForCacheEntries = null,
        IEnumerable<string>? cachingKeys = null,
        string? runName = null,
        Func<EvaluationMetric, EvaluationMetricInterpretation?>? evaluationMetricInterpreter = null,
        IEnumerable<string>? tags = null)
    {
        LangfuseResultStore resultStore = new(client);
        runName ??= $"{DateTime.Now:yyyyMMddTHHmmss}";
        return new ReportingConfiguration(
            evaluators,
            resultStore,
            chatConfiguration,
            responseCacheProvider: null,
            cachingKeys: null,
            executionName: runName,
            evaluationMetricInterpreter,
            tags);
    }

    private class LangfuseResultStore(LangfuseClient client) : IEvaluationResultStore
    {
        private readonly LangfuseClient _client = client;

        public async ValueTask WriteResultsAsync(
            IEnumerable<ScenarioRunResult> results,
            CancellationToken cancellationToken = default)
        {
            foreach (var result in results)
            {
                if (!result.EvaluationResult.TryGet<EvaluationMetric>(
                        LangfuseEvaluator.LangfuseMetricName, out var langfuseMetric))
                {
                    throw new InvalidOperationException(
                        $"A {nameof(LangfuseEvaluator)} must be added to the {nameof(LangfuseReportingConfiguration)}");
                }

                result.EvaluationResult.Metrics.Remove(LangfuseEvaluator.LangfuseMetricName);
                var langfuseContext = (LangfuseEvaluationContext)langfuseMetric
                    .Context![LangfuseEvaluationContext.LangfuseContextName];

                await _client.Api.Public.DatasetRunItems.PostAsync(
                    new CreateDatasetRunItemRequest
                    {
                        DatasetItemId = result.IterationName,
                        RunName = result.ExecutionName,
                        RunDescription = "",
                        TraceId = langfuseContext.TraceId.ToString(),
                    },
                    cancellationToken: cancellationToken);

                foreach (var metric in result.EvaluationResult.Metrics)
                {
                    await _client.Api.Public.Scores.PostAsync(
                        new CreateScoreRequest
                        {
                            Name = metric.Value.Name,
                            TraceId = langfuseContext.TraceId.ToString(),
                            DataType = ScoreDataType.CATEGORICAL,
                            Value = new CreateScoreValue { String = metric.Value.Interpretation!.Rating.ToString() },
                            Comment = metric.Value.Reason,
                            Metadata = metric.Value.Metadata == null
                                ? null
                                : new UntypedObject(metric.Value.Metadata.ToDictionary(
                                    m => m.Key,
                                    UntypedNode (m) => new UntypedString(m.Value))),
                        },
                        cancellationToken: cancellationToken);
                }
            }
        }
    }
}
```
Question: only `IEvaluationResultStore.WriteResultsAsync` seems to matter? For which use cases are the other methods?
### Pass trace id
To pass the trace id, I created a dummy evaluator:
```csharp
public class LangfuseEvaluationContext : EvaluationContext
{
    public static string LangfuseContextName => "Langfuse";

    public LangfuseEvaluationContext(ActivityTraceId traceId)
        : base(LangfuseContextName, "")
    {
        TraceId = traceId;
    }

    public ActivityTraceId TraceId { get; }
}

public class LangfuseEvaluator : IEvaluator
{
    public static string LangfuseMetricName => "Langfuse";

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var langfuseCtx = additionalContext?.OfType<LangfuseEvaluationContext>().FirstOrDefault()
            ?? throw new InvalidOperationException(
                $"A {nameof(LangfuseEvaluationContext)} must be added to {nameof(ScenarioRun)}.{nameof(EvaluateAsync)}");

        EvaluationMetric metric = new(LangfuseMetricName);
        metric.AddOrUpdateContext(langfuseCtx);
        return ValueTask.FromResult(new EvaluationResult(metric));
    }

    public IReadOnlyCollection<string> EvaluationMetricNames { get; } = [LangfuseMetricName];
}
```
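For context, a sketch of how this wires into `LangfuseReportingConfiguration.Create` above (the evaluator list and variable names are illustrative):

```csharp
// The dummy LangfuseEvaluator rides along with the real evaluators so that
// every ScenarioRunResult carries the trace id when WriteResultsAsync runs.
ReportingConfiguration = LangfuseReportingConfiguration.Create(
    langfuseClient,
    evaluators:
    [
        new LangfuseEvaluator(),
        new RelevanceEvaluator(),
        new CoherenceEvaluator(),
    ],
    chatConfiguration: new ChatConfiguration(chatClient),
    runName: "run-name-5");
```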
Problem 1: Is there a better way to do that? Ideally, I think I would like to pass arbitrary context to `ScenarioRun.EvaluateAsync` that would be available in `ScenarioRunResult`.