# Bittensor Subnet 5: Text Embedding Model

> This is a draft proposal for the next version of Bittensor Subnet 5: Text Embedding Model.

## Abstract

Bittensor Subnet 5's primary focus is the development of the world's best-performing and most generalizable text embedding model.

Leveraging an extensive Large Language Model (LLM)-augmented corpus for evaluation, miners are empowered to develop and deploy text embedding models that surpass current state-of-the-art (SOTA) performance.

These models will be accessible to users via the subnet's API.

## Objectives & Contributions

The primary objective of Subnet 5 is to train and serve the best and most generalizable text embedding models. Such models can power a wide range of downstream applications, including semantic search and natural language understanding.

Miners will be responsible for training models on an extensive corpus of textual data and for serving them with low latency and high throughput. These models will be used to generate high-quality embeddings for diverse text inputs.
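
As a minimal sketch of the miner's role, assuming the open-source `sentence-transformers` library and a placeholder model name (neither is mandated by the subnet), the following shows how a batch of texts might be embedded:

```python
# Hypothetical miner-side sketch (assumed library and model name):
# embed a batch of texts and return unit-length vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; a real miner would serve its own trained model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> np.ndarray:
    # normalize_embeddings=True returns unit vectors, so a plain dot
    # product downstream equals cosine similarity.
    return model.encode(texts, batch_size=32, normalize_embeddings=True)

if __name__ == "__main__":
    vectors = embed(["What is Bittensor?", "Subnet 5 serves text embeddings."])
    print(vectors.shape)  # (2, embedding_dim)
```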

Validators will conduct rigorous evaluations of the models using multiple benchmarks. Performance comparisons will be made against existing SOTA text embedding models to ensure continuous improvement and competitiveness.

Subnet users will gain access to cutting-edge, highly generalizable text embedding models that exceed SOTA performance. These models will be made publicly available through the validator API of Bittensor Subnet 5, facilitating widespread adoption and integration into various applications.

## Incentive Mechanism

Miners will receive a batch of texts and embed them.

For these text embeddings, validators hold pairwise relevance information and use it to evaluate the miners' embeddings via the contrastive learning (InfoNCE) loss:

```math
\mathcal{L}_\text{InfoNCE} = - \mathbb{E} \left[\log \frac{f(\mathbf{x}, \mathbf{c})}{\sum_{\mathbf{x}' \in X} f(\mathbf{x}', \mathbf{c})} \right]
```

where $f(\mathbf{x}, \mathbf{c}) = \exp{(\mathbf{x} \cdot \mathbf{c})}$ is an estimate of the density ratio $\frac{p(\mathbf{x} \mid \mathbf{c})}{p(\mathbf{x})}$, $\mathbf{c}$ is the target embedding, $\mathbf{x}$ is the positive sample, and $\mathbf{x}'$ are the negative samples in $X$.
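
To make the scoring concrete, here is a minimal PyTorch sketch of the InfoNCE loss above, under the assumption that each target's positive is its same-index sample and the rest of the batch serves as the negatives in $X$; the shapes and in-batch-negatives convention are illustrative, not the validator's exact implementation:

```python
# Minimal InfoNCE sketch with in-batch negatives (an assumption;
# not necessarily the validator's exact implementation).
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """x: (N, d) miner embeddings, c: (N, d) target embeddings.
    Row i of x is the positive for c_i; every other row plays the
    role of a negative sample x' in X."""
    logits = c @ x.T                  # logits[i, j] = c_i . x_j = log f(x_j, c_i)
    labels = torch.arange(x.size(0))  # the positive for c_i sits at column i
    # cross_entropy(logits, labels) = -mean_i log softmax(logits_i)[i],
    # which matches -E[log f(x, c) / sum_{x' in X} f(x', c)].
    return F.cross_entropy(logits, labels)

# Toy usage with unit-norm random embeddings.
x = F.normalize(torch.randn(8, 64), dim=-1)
c = F.normalize(torch.randn(8, 64), dim=-1)
print(info_nce(x, c).item())
```

A lower loss means the miner's embeddings rank each true pair above the negatives, which is exactly the behavior the expectation above rewards.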