
feat(api): (1/n) datasets api clean up #1573


Merged: 44 commits into main, Mar 17, 2025

Conversation

yanxi0830
Contributor

@yanxi0830 commented Mar 11, 2025

PR Stack

Client SDK

CI

[screenshot] -- the test_rag_agent_with_attachments test is flaky and not related to this PR

Doc

[screenshot]

Client Usage

```
client.datasets.register(
    source={
        "type": "uri",
        "uri": "lsfs://mydata.jsonl",
    },
    schema="jsonl_messages",
    # optional
    dataset_id="my_first_train_data",
)

# quick prototyping / debugging
client.datasets.register(
    source={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```

Test Plan

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

@facebook-github-bot added the CLA Signed label Mar 11, 2025
@yanxi0830 changed the title from datasets api to feat(api): (1/n) datasets api clean up Mar 11, 2025
@yanxi0830 marked this pull request as ready for review March 11, 2025 21:58
@yanxi0830 changed the base branch from main to eval_api_updates March 11, 2025 22:00
:param schema: The schema format of the dataset. One of
- jsonl_messages: The dataset is a JSONL file with messages in column format
:param uri: The URI of the dataset. Examples:
- file://mydata.jsonl
Contributor

When would file:// make sense? When the server is running on the same machine as the client? I'd rather not have that.

Contributor Author

I believe the idea was to use file:// to indicate that the dataset will be coming from a File object uploaded via the Files API.

  • file://<file-id> -> loads data from the local file system via file ID
  • huggingface://<dataset-path> -> loads data from Hugging Face
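
For illustration, a hedged sketch of that first flow (the file-upload client call and its return shape are assumptions, not part of this PR):

```
# Hypothetical sketch: upload a local file through the Files API, then
# register a dataset that references it by file ID.
uploaded = client.files.create(file=open("mydata.jsonl", "rb"))  # assumed signature

client.datasets.register(
    source={
        "type": "uri",
        "uri": f"file://{uploaded.id}",  # file://<file-id>, per the proposal above
    },
    schema="jsonl_messages",
)
```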

Contributor Author
@yanxi0830 Mar 11, 2025

The other alternative is, instead of a single uri, to have different Data classes specifying the different ways we can create a dataset.

```
from enum import Enum
from typing import Annotated, Any, Dict, Literal, Optional, Union

from pydantic import BaseModel, Field


class DatasetType(Enum):
    huggingface = "huggingface"
    file = "file"


class HuggingfaceData(BaseModel):
    type: Literal[DatasetType.huggingface.value]
    dataset_id: str
    split: str


class FileData(BaseModel):
    type: Literal[DatasetType.file.value]
    file_id: str


Data = Annotated[
    Union[FileData, HuggingfaceData],
    Field(discriminator="type"),
]


async def register_dataset(
    self,
    schema: Schema,
    data: Data,
    metadata: Optional[Dict[str, Any]] = None,
    dataset_id: Optional[str] = None,
) -> Dataset: ...
```
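
With pydantic v2, a dict payload then parses into the right variant via the discriminator; a small usage sketch (payload values illustrative):

```
from pydantic import TypeAdapter

# The "type" field selects the variant of the discriminated union.
data = TypeAdapter(Data).validate_python(
    {"type": "huggingface", "dataset_id": "llamastack/simpleqa", "split": "train"}
)
assert isinstance(data, HuggingfaceData)
```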

Contributor Author
@yanxi0830 Mar 12, 2025

Synced with @hardikjshah offline. Going with the Union way, with data_reference renamed to data_source.

Contributor

Maybe just source, dropping the data_ prefix? This is inside a dataset anyways.

Contributor Author

Updated to source.

Contributor

I don't think we should have file:// as an allowed URI. If you want local files, you can compute the rows yourself and pass them as rows. Otherwise you are relying on the client SDK faithfully obeying this API (and we will not be able to keep this in sync across all clients without additional effort).

Contributor Author

> I don't think we should have file:// as an allowed URI.

@ashwinb Added examples of the URIs that are allowed:

- "https://mywebsite.com/mydata.jsonl"
- "lsfs://mydata.jsonl" (a file URL)
- "data:csv;base64,{base64_content}"

class Schema(Enum):
"""
Schema of the dataset. Each type has a different column format.
:cvar messages: The dataset contains messages used for post-training. Examples:
Contributor

A bit confused here. Does this dictate what the format of the data source should be? E.g., can I supply a CSV file of type uri as the data source?

Contributor Author
@yanxi0830 Mar 12, 2025

The Schema defines the column names and formats. For example:

  1. The messages schema indicates a dataset consisting of rows with a "messages" column containing List[Message]. These datasets can be used as input for finetuning.

```
[
    {"messages": List[Message]}
]
```

  2. The chat_completion_eval schema consists of a "messages" column and an "expected_answer" column. The "messages" column is used as input to generate "generated_answer", and the "expected_answer" column is used for scoring and computing metrics.
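
For instance, a chat_completion_eval row might look like this (values illustrative):

```
[
    {
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "expected_answer": "4",
    }
]
```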

The reason we need schema is to validate that the supplied dataset can be correctly parsed by different methods:

```
VALID_SCHEMAS_FOR_SCORING = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
        ColumnName.context.value: StringType(),
    },
]

VALID_SCHEMAS_FOR_EVAL = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.chat_completion_input.value: ChatCompletionInputType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.completion_input.value: CompletionInputType(),
    },
]
```

The DataSource represents where the data can be loaded from. A CSV file can be supplied in three ways:

  1. base64-encoded in the uri:

```
URIDataSource(
    uri="data:csv;base64,{base64_content}",
)
```

  2. uploaded via the /files API defined in https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py

  3. read the rows of the CSV file yourself and register the dataset via the call below (see also the sketch after this list):

```
client.datasets.create(
    source={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```
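
A hedged sketch of that third path, reading a local CSV into rows with the standard library (the file name and columns are illustrative):

```
import csv

# Parse the CSV into a list of {column: value} dicts, one per row,
# e.g. [{"input_query": "...", "expected_answer": "..."}, ...]
with open("mydata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

client.datasets.register(
    source={
        "type": "rows",
        "rows": rows,
    },
    schema="jsonl_messages",  # illustrative schema name
)
```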

Contributor Author

Synced with @ehhuang offline. purpose might be a better name than schema and less confusing for users, since we can infer the schema from the dataset. cc @hardikjshah for thoughts.

Contributor

Discussed offline:

  • Clarification: the purpose of the schema parameter is to validate that the input data has the expected format.
  • Perhaps a more apt name would be purpose, where in the documentation we lay out the expected schema for each purpose, and in the implementation of create we validate the input data against the expected schema (a sketch follows).
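
A minimal sketch of that validation, with the purpose names and column sets assumed for illustration (not the merged implementation):

```
# Hypothetical: map each purpose to the columns every row must carry,
# and check incoming rows at dataset-creation time.
EXPECTED_COLUMNS = {
    "eval/messages-answer": {"messages", "expected_answer"},
    "post-training/messages": {"messages"},
}

def validate_rows(purpose: str, rows: list) -> None:
    expected = EXPECTED_COLUMNS[purpose]
    for i, row in enumerate(rows):
        missing = expected - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns {missing} for purpose {purpose!r}")
```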

Contributor

What are all the valid schema names that we will have, given what we know right now?

  • JSONL with each line being `{"messages": [...]}` (tuning/messages)
  • JSONL with each line being `{"text": ...}` (tuning/text)
  • eval with a CSV (query / expected / output) (eval/question_answer)

Maybe we use this api/purpose style to name the various schemas? A sketch follows.
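
Sketched as an enum, purely to illustrate the naming proposal above:

```
from enum import Enum

class DatasetPurpose(str, Enum):
    tuning_messages = "tuning/messages"            # JSONL, each line {"messages": [...]}
    tuning_text = "tuning/text"                    # JSONL, each line {"text": ...}
    eval_question_answer = "eval/question_answer"  # CSV with query / expected / output
```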

# What does this PR do?
- as title
- uses "cursor" pagination scheme for iterrows


## Test Plan
<img width="1226" alt="image"
src="https://github.com/user-attachments/assets/3220eaac-7117-4d0a-b344-2bbb77a22065"
/>


…#1657)

# What does this PR do?
- We need to tag the DatasetIO class correctly with Datasets after the endpoint change


## Test Plan
**Before**
<img width="1474" alt="image" src="https://github.com/user-attachments/assets/48737317-28a3-4aa6-a1b5-e1ea680cef84" />

**After**
<img width="1508" alt="image" src="https://github.com/user-attachments/assets/123322f0-a52f-47ee-99a7-ecc66c1b09ec" />

# What does this PR do?
- fix datasets API signature mismatch so that llama stack run can start


## Test Plan
```
llama stack run
```
<img width="626" alt="image"
src="https://github.com/user-attachments/assets/59072d1a-ccb6-453a-80e8-d87419896c41"
/>


# What does this PR do?

- fix dataset registration & iterrows
> NOTE: the URL endpoint is changed to datasetio due to flaky path routing


## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```
<img width="854" alt="image"
src="https://github.com/user-attachments/assets/0168b352-1c5a-48d1-8e9a-93141d418e54"
/>


# What does this PR do?
- as title


## Test Plan
CI

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
<img width="587" alt="image"
src="https://github.com/user-attachments/assets/4a25f493-501e-43f4-9836-d9802223a93a"
/>


@@ -37,22 +34,21 @@ class DatasetIO(Protocol):
# keeping for aligning with inference/safety, but this is not used
dataset_store: DatasetStore

@webmethod(route="/datasetio/rows", method="GET")
async def get_rows_paginated(
# TODO(xiyan): there's a flakiness here where setting route to "/datasets/" here will not result in proper routing
Contributor

nit: this is not flaky (which would mean it sometimes works and sometimes doesn't) -- I think maybe you just mean "wonkiness" or "sadness"

Contributor Author

Yeah, it actually does sometimes work and sometimes not if I set the route to /datasets. I suspect it may be due to the way we do the topological sort in our resolver.

start_index: Optional[int] = None,
limit: Optional[int] = None,
) -> IterrowsResponse:
"""Get a paginated list of rows from a dataset. Uses cursor-based pagination.
Contributor

I don't think this is using cursor-based pagination, because this is just an index we are returning? Maybe just avoid saying this is cursor-based?

Contributor Author

I'm basing the API off of this cursor-based pagination spec:

[screenshot]
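
A hedged usage sketch of that pagination style against this API; the client method name is an assumption, while start_index, limit, data, and next_start_index come from the diffs in this PR:

```
start_index = None  # None starts from the beginning
while True:
    response = client.datasets.iterrows(  # assumed client method name
        dataset_id="my_first_train_data",
        start_index=start_index,
        limit=100,
    )
    for row in response.data:
        ...  # process each row
    if response.next_start_index is None:
        break  # no more pages
    start_index = response.next_start_index
```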

assert len(df.columns) == len(self.dataset_def.dataset_schema)
# TODO: type checking against column types in dataset schema
return df

def load(self) -> None:
Contributor

Where do we do validation?

Contributor Author

We do validation at the eval API level, which will be added with the eval API implementation changes.

next_page_token=str(end),
return IterrowsResponse(
    data=rows,
    next_start_index=end if end < len(dataset_impl) else None,
)

async def append_rows(self, dataset_id: str, rows: List[Dict[str, Any]]) -> None:
Contributor

Should append_rows take a source, similar to dataset registration?
What is the use case where we need append?

Contributor Author

I think @dineshyv added the append_rows API, which is used internally for storing telemetry logs into a dataset.

But I don't feel it is useful to expose to users, so we can remove it and make it an internal API in the future.

},
purpose=DatasetPurpose.eval_messages_answer,
source=URIDataSource(
uri="huggingface://datasets/llamastack/simpleqa?split=train",
Contributor

Are we going to put all params in the uri, or put them in uri_params?
I know we went back and forth on this, but did we decide on the uri route?

Contributor Author
@yanxi0830 Mar 17, 2025

We decided on the uri route based on @ashwinb's feedback on not having a separate HuggingfaceDataSource.

@yanxi0830 merged commit 5287b43 into main Mar 17, 2025
11 checks passed
@yanxi0830 deleted the pr1573 branch March 17, 2025 23:55
yanxi0830 added a commit to meta-llama/llama-stack-client-python that referenced this pull request Mar 17, 2025
# What does this PR do?
- sync SDK


## Test Plan
- See meta-llama/llama-stack#1573


Labels: CLA Signed
7 participants