
feat(api): (1/n) datasets api clean up #1573


Merged: 44 commits into main, Mar 17, 2025

Conversation

yanxi0830
Contributor

@yanxi0830 commented Mar 11, 2025

PR Stack

Client SDK

CI

[screenshot] -- the test_rag_agent_with_attachments test is flaky and not related to this PR

Doc

[screenshot]

Client Usage

```
client.datasets.register(
    source={
        "type": "uri",
        "uri": "lsfs://mydata.jsonl",
    },
    schema="jsonl_messages",
    # optional
    dataset_id="my_first_train_data",
)

# quick prototyping / debugging
client.datasets.register(
    source={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```

Test Plan

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

@facebook-github-bot added the CLA Signed label Mar 11, 2025
@yanxi0830 changed the title from datasets api to feat(api): (1/n) datasets api clean up Mar 11, 2025
@yanxi0830 marked this pull request as ready for review March 11, 2025 21:58
@yanxi0830 changed the base branch from main to eval_api_updates March 11, 2025 22:00
:param schema: The schema format of the dataset. One of
- jsonl_messages: The dataset is a JSONL file with messages in column format
:param uri: The URI of the dataset. Examples:
- file://mydata.jsonl
Contributor

When would file:// make sense? When the server is running on the same machine as the client? I'd rather not have that.

Contributor Author

I believe the idea was to use file:// to indicate that the dataset will be coming from a File object uploaded via the Files API.

  • file://<file-id> -> loads data from the local file system via file ID
  • huggingface://<dataset-path> -> loads data from Hugging Face
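
For illustration, a hedged sketch of that first flow (the file-upload client call and its return shape are assumptions, not part of this PR):

```
# Hypothetical sketch: upload a local file through the Files API, then
# register a dataset that references it by file ID.
uploaded = client.files.create(file=open("mydata.jsonl", "rb"))  # assumed signature

client.datasets.register(
    source={
        "type": "uri",
        "uri": f"file://{uploaded.id}",  # file://<file-id>, per the proposal above
    },
    schema="jsonl_messages",
)
```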

Contributor Author
@yanxi0830 Mar 11, 2025

The other alternative is, instead of a single uri, to have different Data classes specifying the different ways we can create a dataset.

```
from enum import Enum
from typing import Annotated, Any, Dict, Literal, Optional, Union

from pydantic import BaseModel, Field


class DatasetType(Enum):
    huggingface = "huggingface"
    file = "file"


class HuggingfaceData(BaseModel):
    type: Literal[DatasetType.huggingface.value]
    dataset_id: str
    split: str


class FileData(BaseModel):
    type: Literal[DatasetType.file.value]
    file_id: str


Data = Annotated[
    Union[FileData, HuggingfaceData],
    Field(discriminator="type"),
]


async def register_dataset(
    self,
    schema: Schema,
    data: Data,
    metadata: Optional[Dict[str, Any]] = None,
    dataset_id: Optional[str] = None,
) -> Dataset: ...
```
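
With pydantic v2, a dict payload then parses into the right variant via the discriminator; a small usage sketch (payload values illustrative):

```
from pydantic import TypeAdapter

# The "type" field selects the variant of the discriminated union.
data = TypeAdapter(Data).validate_python(
    {"type": "huggingface", "dataset_id": "llamastack/simpleqa", "split": "train"}
)
assert isinstance(data, HuggingfaceData)
```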

Contributor Author
@yanxi0830 Mar 12, 2025

Synced with @hardikjshah offline. Going with the Union way, with data_reference renamed to data_source.

Contributor

Maybe just source, dropping the data_ prefix? This is inside a dataset anyways.

Contributor Author

Updated to source.

Contributor

I don't think we should have file:// as an allowed URI. If you want local files, you can compute the rows yourself and pass them as rows. Otherwise you are relying on the client SDK faithfully obeying this API (and we will not be able to keep this in sync across all clients without additional effort).

Contributor Author

> I don't think we should have file:// as an allowed URI.

@ashwinb Added examples of the URIs that are allowed:

- "https://mywebsite.com/mydata.jsonl"
- "lsfs://mydata.jsonl" (a file URL)
- "data:csv;base64,{base64_content}"

class Schema(Enum):
"""
Schema of the dataset. Each type has a different column format.
:cvar messages: The dataset contains messages used for post-training. Examples:
Contributor

A bit confused here. Does this dictate what the format of the data source should be? E.g., can I supply a CSV file of type uri as the data source?

Contributor Author
@yanxi0830 Mar 12, 2025

The Schema defines the column names and formats. For example:

  1. The messages schema indicates a dataset consisting of rows with a "messages" column containing List[Message]. These datasets can be used as input for finetuning.

```
[
    {"messages": List[Message]}
]
```

  2. The chat_completion_eval schema consists of a "messages" column and an "expected_answer" column. The "messages" column is used as input to generate "generated_answer", and the "expected_answer" column is used for scoring and computing metrics.
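
For instance, a chat_completion_eval row might look like this (values illustrative):

```
[
    {
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "expected_answer": "4",
    }
]
```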

The reason we need schema is to validate that the supplied dataset can be correctly parsed by different methods:

```
VALID_SCHEMAS_FOR_SCORING = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
        ColumnName.context.value: StringType(),
    },
]

VALID_SCHEMAS_FOR_EVAL = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.chat_completion_input.value: ChatCompletionInputType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.completion_input.value: CompletionInputType(),
    },
]
```

The DataSource represents where the data can be loaded from. A CSV file can be supplied in three ways:

  1. base64-encoded in the uri:

```
URIDataSource(
    uri="data:csv;base64,{base64_content}",
)
```

  2. uploaded via the /files API defined in https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py

  3. read the rows of the CSV file yourself and register the dataset via the call below (see also the sketch after this list):

```
client.datasets.create(
    source={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```
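
A hedged sketch of that third path, reading a local CSV into rows with the standard library (the file name and columns are illustrative):

```
import csv

# Parse the CSV into a list of {column: value} dicts, one per row,
# e.g. [{"input_query": "...", "expected_answer": "..."}, ...]
with open("mydata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

client.datasets.register(
    source={
        "type": "rows",
        "rows": rows,
    },
    schema="jsonl_messages",  # illustrative schema name
)
```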

Contributor Author

Synced with @ehhuang offline. purpose might be a better name than schema and less confusing for users, since we can infer the schema from the dataset. cc @hardikjshah for thoughts.

Contributor

Discussed offline:

  • Clarification: the purpose of the schema parameter is to validate that the input data has the expected format.
  • Perhaps a more apt name would be purpose, where in the documentation we lay out the expected schema for each purpose, and in the implementation of create we validate the input data against the expected schema (a sketch follows).
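
A minimal sketch of that validation, with the purpose names and column sets assumed for illustration (not the merged implementation):

```
# Hypothetical: map each purpose to the columns every row must carry,
# and check incoming rows at dataset-creation time.
EXPECTED_COLUMNS = {
    "eval/messages-answer": {"messages", "expected_answer"},
    "post-training/messages": {"messages"},
}

def validate_rows(purpose: str, rows: list) -> None:
    expected = EXPECTED_COLUMNS[purpose]
    for i, row in enumerate(rows):
        missing = expected - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns {missing} for purpose {purpose!r}")
```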

Contributor

What are all the valid schema names that we will have, given what we know right now?

  • JSONL with each line being `{"messages": [...]}` (tuning/messages)
  • JSONL with each line being `{"text": ...}` (tuning/text)
  • eval with a CSV (query / expected / output) (eval/question_answer)

Maybe we use this api/purpose style to name the various schemas? A sketch follows.
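
Sketched as an enum, purely to illustrate the naming proposal above:

```
from enum import Enum

class DatasetPurpose(str, Enum):
    tuning_messages = "tuning/messages"            # JSONL, each line {"messages": [...]}
    tuning_text = "tuning/text"                    # JSONL, each line {"text": ...}
    eval_question_answer = "eval/question_answer"  # CSV with query / expected / output
```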

# What does this PR do?
- as title
- uses "cursor" pagination scheme for iterrows


## Test Plan
<img width="1226" alt="image"
src="https://github.com/user-attachments/assets/3220eaac-7117-4d0a-b344-2bbb77a22065"
/>


…#1657)

# What does this PR do?
- We need to tag the DatasetIO class correctly with Datasets after the endpoint change


## Test Plan
**Before**
<img width="1474" alt="image" src="https://github.com/user-attachments/assets/48737317-28a3-4aa6-a1b5-e1ea680cef84" />

**After**
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/123322f0-a52f-47ee-99a7-ecc66c1b09ec" />

# What does this PR do?
- fix datasets API signature mismatch so that llama stack run can start


## Test Plan
```
llama stack run
```
<img width="626" alt="image"
src="https://github.com/user-attachments/assets/59072d1a-ccb6-453a-80e8-d87419896c41"
/>


# What does this PR do?

- fix dataset registration & iterrows
> NOTE: the URL endpoint is changed to datasetio due to flaky path routing


## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```
<img width="854" alt="image"
src="https://github.com/user-attachments/assets/0168b352-1c5a-48d1-8e9a-93141d418e54"
/>


# What does this PR do?
- as title


## Test Plan
CI

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
<img width="587" alt="image"
src="https://github.com/user-attachments/assets/4a25f493-501e-43f4-9836-d9802223a93a"
/>


@@ -37,22 +34,21 @@ class DatasetIO(Protocol):
# keeping for aligning with inference/safety, but this is not used
dataset_store: DatasetStore

@webmethod(route="/datasetio/rows", method="GET")
async def get_rows_paginated(
# TODO(xiyan): there's a flakiness here where setting route to "/datasets/" here will not result in proper routing
Contributor

nit: this is not flaky (which would mean it sometimes works and sometimes doesn't) -- I think maybe you just mean "wonkiness" or "sadness"

Contributor Author

Yeah, it actually does sometimes work and sometimes not if I set the route to /datasets. I suspect it may be due to the way we do the topological sort in our resolver.

start_index: Optional[int] = None,
limit: Optional[int] = None,
) -> IterrowsResponse:
"""Get a paginated list of rows from a dataset. Uses cursor-based pagination.
Contributor

I don't think this is using cursor-based pagination, because this is just an index we are returning? Maybe just avoid saying this is cursor-based?

Contributor Author

I'm basing the API off of this cursor-based pagination spec:

[screenshot]
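
A hedged usage sketch of that pagination style against this API; the client method name is an assumption, while start_index, limit, data, and next_start_index come from the diffs in this PR:

```
start_index = None  # None starts from the beginning
while True:
    response = client.datasets.iterrows(  # assumed client method name
        dataset_id="my_first_train_data",
        start_index=start_index,
        limit=100,
    )
    for row in response.data:
        ...  # process each row
    if response.next_start_index is None:
        break  # no more pages
    start_index = response.next_start_index
```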

assert len(df.columns) == len(self.dataset_def.dataset_schema)
# TODO: type checking against column types in dataset schema
return df

def load(self) -> None:
Contributor

Where do we do validation?

Contributor Author

We do validation at the eval API level, which will be added with the eval API implementation changes.

next_page_token=str(end),
return IterrowsResponse(
    data=rows,
    next_start_index=end if end < len(dataset_impl) else None,
)

async def append_rows(self, dataset_id: str, rows: List[Dict[str, Any]]) -> None:
Contributor

Should append_rows take a source, similar to dataset registration?
What is the use case where we need append?

Contributor Author

I think @dineshyv added the append_rows API, which is used internally for storing telemetry logs into a dataset.

But I don't feel it is useful to expose to users, so we can remove it and make it an internal API in the future.

},
purpose=DatasetPurpose.eval_messages_answer,
source=URIDataSource(
uri="huggingface://datasets/llamastack/simpleqa?split=train",
Contributor

Are we going to put all params in the uri, or put them in uri_params?
I know we went back and forth on this, but did we decide on the uri route?

Contributor Author
@yanxi0830 Mar 17, 2025

We decided on the uri route based on @ashwinb's feedback on not having a separate HuggingfaceDataSource.

@yanxi0830 merged commit 5287b43 into main Mar 17, 2025
11 checks passed
@yanxi0830 deleted the pr1573 branch March 17, 2025 23:55
yanxi0830 added a commit to meta-llama/llama-stack-client-python that referenced this pull request Mar 17, 2025
# What does this PR do?
- sync SDK


## Test Plan
- See meta-llama/llama-stack#1573


Labels: CLA Signed
7 participants