Description
Hello!
Feature Request overview
- Many example scripts use `http_get`, while we can load that data more smoothly with `datasets`.
Details
Many example scripts and some tests rely on `http_get` to download e.g. https://sbert.net/datasets/stsbenchmark.tsv.gz / https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz / askubuntu / TREC, etc., while this data is often also easily accessible on Hugging Face. We should be able to simplify a lot of these scripts considerably with `datasets` (and perhaps also `Dataset.map`/`Dataset.filter`, etc.).
For example
sentence-transformers/tests/test_train_stsb.py, lines 35 to 37 in 5bd3e61:

```python
sts_dataset_path = "datasets/stsbenchmark.tsv.gz"
if not os.path.exists(sts_dataset_path):
    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)
```
Instead, we can follow the steps I already took in 548e463 to update these:
sentence-transformers/examples/sentence_transformer/training/sts/training_stsbenchmark.py, lines 40 to 43 in 5bd3e61:

```python
# 2. Load the STSB dataset: https://huggingface.co/datasets/sentence-transformers/stsb
train_dataset = load_dataset("sentence-transformers/stsb", split="train")
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
test_dataset = load_dataset("sentence-transformers/stsb", split="test")
```
- Tom Aarsen