# feat(api): (1/n) datasets api clean up #1573
## Conversation
```
:param schema: The schema format of the dataset. One of
- jsonl_messages: The dataset is a JSONL file with messages in column format
:param uri: The URI of the dataset. Examples:
- file://mydata.jsonl
```
when would `file://` make sense? when the server is running on the same machine as the client? i'd rather not have that.
I believe the idea was to use `file://` to indicate that the dataset will be coming from a File object uploaded via the files API.

- `file://<file-id>` -> loads data from the local file system via file ID
- `huggingface://<dataset-path>` -> loads data from Hugging Face
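To make the scheme-based dispatch concrete, here is a minimal sketch of how a provider could branch on the URI scheme; the `load_from_file_id` and `load_from_huggingface` helpers are hypothetical, not part of the actual API:

```python
from urllib.parse import urlparse


def resolve_data_source(uri: str):
    """Dispatch on the URI scheme to decide how the dataset is loaded."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        # file://<file-id>: look up a File object previously uploaded via the files API
        return load_from_file_id(parsed.netloc)  # hypothetical helper
    if parsed.scheme == "huggingface":
        # huggingface://<dataset-path>: fetch the dataset from Hugging Face
        return load_from_huggingface(parsed.netloc + parsed.path)  # hypothetical helper
    raise ValueError(f"unsupported URI scheme: {parsed.scheme}")
```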
The other alternative is, instead of a single `uri`, to have a different `Data` class to specify the different ways we can create a dataset:

```python
from enum import Enum
from typing import Annotated, Any, Dict, Literal, Optional, Union

from pydantic import BaseModel, Field


class DatasetType(Enum):
    huggingface = "huggingface"
    file = "file"


class HuggingfaceData(BaseModel):
    type: Literal[DatasetType.huggingface.value]
    dataset_id: str
    split: str


class FileData(BaseModel):
    type: Literal[DatasetType.file.value]
    file_id: str


Data = Annotated[
    Union[FileData, HuggingfaceData],
    Field(discriminator="type"),
]


async def register_dataset(
    self,
    schema: Schema,
    data: Data,
    metadata: Optional[Dict[str, Any]] = None,
    dataset_id: Optional[str] = None,
) -> Dataset: ...
```
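For illustration, registering a Hugging Face dataset under this proposal might look like the following; this is a sketch against the models above, not the final API:

```python
# Register a dataset whose rows come from a Hugging Face split. The `type`
# discriminator lets pydantic pick HuggingfaceData out of the Union.
dataset = await datasets_api.register_dataset(  # `datasets_api` is an assumed handle
    schema=Schema.messages,
    data=HuggingfaceData(
        type="huggingface",
        dataset_id="llamastack/simpleqa",
        split="train",
    ),
)
```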
Synced with @hardikjshah offline. Going with the Union way, with ~~`data_reference`~~ `data_source`.
maybe just `source`, and drop the `data_` prefix? This is inside of a dataset anyway.
Updated to `source`.
I don't think we should have `file://` as an allowed URI. If you want local files, you can calculate rows yourself and pass them as `rows`. Otherwise you are relying on the client SDK faithfully obeying this API (and we will not be able to keep this in sync on all clients without additional effort).
> I don't think we should have file:// as an allowed URI.

@ashwinb Added example URIs that are allowed:
- "https://mywebsite.com/mydata.jsonl"
- "lsfs://mydata.jsonl" --> a file url
- "data:csv;base64,{base64_content}"
```python
class Schema(Enum):
    """
    Schema of the dataset. Each type has a different column format.

    :cvar messages: The dataset contains messages used for post-training. Examples:
```
A bit confused here. Does this dictate what the format of the data source should be? E.g. can I supply a CSV file of type `uri` as the data source?
The `Schema` defines the column names and formats. For example, the `messages` schema indicates a dataset consisting of rows with a "messages" column containing `List[Message]`. These datasets can be used as input for finetuning.

```python
[
    {"messages": List[Message]}
]
```
The `chat_completion_eval` schema consists of a "messages" column and an "expected_answer" column. The "messages" column will be used as input to generate "generated_answer", and the "expected_answer" column will be used for scoring and computing metrics during eval.
The reason we need a schema is to validate that the supplied dataset can be correctly parsed by the different methods:
```python
VALID_SCHEMAS_FOR_SCORING = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.generated_answer.value: StringType(),
        ColumnName.context.value: StringType(),
    },
]

VALID_SCHEMAS_FOR_EVAL = [
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.chat_completion_input.value: ChatCompletionInputType(),
    },
    {
        ColumnName.input_query.value: StringType(),
        ColumnName.expected_answer.value: StringType(),
        ColumnName.completion_input.value: CompletionInputType(),
    },
]
```
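To make the validation step concrete, here is a minimal sketch of how a dataset's column schema might be checked against these lists; the `validate_dataset_schema` name is illustrative, not the actual implementation:

```python
def validate_dataset_schema(dataset_schema: dict, valid_schemas: list) -> None:
    """Raise if the column -> type mapping matches none of the accepted schemas."""
    if dataset_schema not in valid_schemas:
        raise ValueError(
            f"dataset schema {dataset_schema} does not match any expected schema in {valid_schemas}"
        )

# e.g. before scoring:
# validate_dataset_schema(dataset_def.dataset_schema, VALID_SCHEMAS_FOR_SCORING)
```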
The `DataSource` represents where the data can be loaded from. A CSV file can be supplied in 3 ways:

1. Base64 encoded in the `uri`:

```python
URIDataSource(
    uri="data:csv;base64,{base64_content}",
)
```

2. Uploaded via the `/files` API defined in https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py (see the sketch after this list).

3. Read the rows in the CSV file and register the dataset via:

```python
client.datasets.create(
    data_reference={
        "type": "rows",
        "rows": [
            {"messages": [...]},
        ],
    },
    schema="jsonl_messages",
)
```
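For completeness, option 2 might look roughly like this; the exact client method names for the files API are assumptions for illustration:

```python
# 1. Upload the CSV via the /files API (method name assumed).
with open("mydata.csv", "rb") as f:
    uploaded = client.files.create(file=f)

# 2. Register a dataset that references the uploaded file by ID.
client.datasets.create(
    data_reference={
        "type": "uri",
        "uri": f"file://{uploaded.id}",
    },
    schema="jsonl_messages",
)
```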
Synced with @ehhuang offline. `purpose` might be a better name compared with `schema`, and less confusing for users, since we could infer the schema from the dataset. cc @hardikjshah for thoughts.
Discussed offline:

- Clarification: The purpose of the `schema` parameter is to validate that the input data has the expected format.
- Perhaps a more apt name would be `purpose`, where in the documentation we lay out the expected schema for each `purpose`, and in the implementation of `create` we validate the input data against the expected schema.
What are all the valid schema names that we will have, given what we know right now?

- jsonl with each line being `{"messages": []}` ( tuning/messages )
- jsonl with each line being `{"text": ...}` ( tuning/text )
- eval with a csv ( query / expected / output ) ( eval/question_answer )

Maybe we use this `api/purpose` style to name the various schemas?
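Under that naming style, the purposes could be expressed as an enum along these lines; a sketch of the proposal, not a final definition:

```python
from enum import Enum


class DatasetPurpose(Enum):
    # jsonl, each line {"messages": [...]}
    tuning_messages = "tuning/messages"
    # jsonl, each line {"text": ...}
    tuning_text = "tuning/text"
    # csv with query / expected / output columns
    eval_question_answer = "eval/question_answer"
```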
# What does this PR do?

- as title
- uses "cursor" pagination scheme for iterrows

## Test Plan

![image](https://github.com/user-attachments/assets/3220eaac-7117-4d0a-b344-2bbb77a22065)
…#1657)

# What does this PR do?

- We need to tag DatasetIO class correctly with Datasets with the endpoint change

## Test Plan

**Before**

![image](https://github.com/user-attachments/assets/48737317-28a3-4aa6-a1b5-e1ea680cef84)

**After**

![image](https://github.com/user-attachments/assets/123322f0-a52f-47ee-99a7-ecc66c1b09ec)
# What does this PR do?

- fix datasets api signature mismatch so that llama stack run can start

## Test Plan

```
llama stack run
```

![image](https://github.com/user-attachments/assets/59072d1a-ccb6-453a-80e8-d87419896c41)
# What does this PR do?

- fix dataset registration & iterrows

> NOTE: the URL endpoint is changed to datasetio due to flaky path routing

## Test Plan

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```

![image](https://github.com/user-attachments/assets/0168b352-1c5a-48d1-8e9a-93141d418e54)
# What does this PR do?

- as title

## Test Plan

CI

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

![image](https://github.com/user-attachments/assets/4a25f493-501e-43f4-9836-d9802223a93a)
```
@@ -37,22 +34,21 @@ class DatasetIO(Protocol):
    # keeping for aligning with inference/safety, but this is not used
    dataset_store: DatasetStore

    @webmethod(route="/datasetio/rows", method="GET")
    async def get_rows_paginated(
    # TODO(xiyan): there's a flakiness here where setting route to "/datasets/" here will not result in proper routing
```
nit: this is not flaky (which means it sometimes works vs. sometimes not) -- I think maybe you just mean "wonkiness" or "sadness"
Yeah, it actually is the case that this sometimes works and sometimes does not work if I set the route to `/datasets`. I'm suspecting it may be due to the way we do topological sort in our resolver.
```python
        start_index: Optional[int] = None,
        limit: Optional[int] = None,
    ) -> IterrowsResponse:
        """Get a paginated list of rows from a dataset. Uses cursor-based pagination.
```
I don't think this is using cursor-based pagination because this is just an index we are returning? Maybe just avoid saying this is cursor-based?
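For what it's worth, consuming the index-based scheme from a client would look something like this; a sketch assuming the `iterrows` route and the `IterrowsResponse` fields shown in this PR:

```python
# Page through a dataset until next_start_index comes back as None.
start_index = 0
while start_index is not None:
    response = client.datasets.iterrows(
        dataset_id="my-dataset",
        start_index=start_index,
        limit=100,
    )
    for row in response.data:
        process(row)  # placeholder for per-row handling
    start_index = response.next_start_index
```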
```python
        assert len(df.columns) == len(self.dataset_def.dataset_schema)
        # TODO: type checking against column types in dataset schema
        return df

    def load(self) -> None:
```
Where do we do validation?
We do validation at the eval API level. It will happen here (llama-stack/llama_stack/providers/inline/eval/meta_reference/eval.py, lines 87 to 88 in cc48d9e):

```python
# TODO (xiyan): validate dataset schema
# dataset_def = await self.datasets_api.get_dataset(dataset_id=dataset_id)
```

which will be added with the eval API implementation changes.
```python
            next_page_token=str(end),
        return IterrowsResponse(
            data=rows,
            next_start_index=end if end < len(dataset_impl) else None,
        )

    async def append_rows(self, dataset_id: str, rows: List[Dict[str, Any]]) -> None:
```
Should `append_rows` take `source`, similarly to dataset registration? What is the use case where we need `append`?
I think @dineshyv added the `append_rows` API, which is used internally for storing telemetry logs into a dataset. But I don't feel it is useful to expose to users, so we can remove it and make it an internal API in the future.
```python
    },
    purpose=DatasetPurpose.eval_messages_answer,
    source=URIDataSource(
        uri="huggingface://datasets/llamastack/simpleqa?split=train",
```
Are we going to put all params in the `uri`, or put them in `uri_params`? I know we went back and forth on this, but did we decide on the URI route?
We decided on the `uri` route based on @ashwinb's feedback on not having a separate `HuggingfaceDataSource`.
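Since the parameters now ride in the URI, the provider presumably parses them back out; a sketch using only the standard library, with the helper name being illustrative:

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_uri(uri: str):
    """Split huggingface://datasets/<path>?split=... into dataset path and params."""
    parsed = urlparse(uri)
    path = (parsed.netloc + parsed.path).removeprefix("datasets/")
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    return path, params


# parse_hf_uri("huggingface://datasets/llamastack/simpleqa?split=train")
# -> ("llamastack/simpleqa", {"split": "train"})
```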
# What does this PR do?

- sync SDK

## Test Plan

- See meta-llama/llama-stack#1573