* fix heading levels
* remove $ on command examples
* fix markdown coding errors: indenting and spaces in emphasis
Signed-off-by: David B. Kinder <[email protected]>
For evaluating models on coding tasks, or coding LLMs specifically, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line usage and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available for both completion (left-to-right) and insertion (FIM) modes.
@@ -104,6 +104,7 @@ python main.py \
     --batch_size 10 \
     --allow_code_execution
 ```
+
 #### function call usage
 ```python
 from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
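As a hedged illustration of the function-call path touched in this hunk, the sketch below wires the imported `BigcodeEvalParser` and `evaluate` together. The keyword arguments and the placeholder model objects are assumptions for illustration only; they are not taken from this diff and may not match the real `BigcodeEvalParser` signature.

```python
# Hedged sketch only: the keyword arguments below (user_model, tokenizer, tasks,
# batch_size, allow_code_execution) are assumed for illustration and may not
# match the real BigcodeEvalParser signature.
from transformers import AutoModelForCausalLM, AutoTokenizer

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

model_name = "bigcode/starcoderbase-1b"  # placeholder checkpoint for illustration
user_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = BigcodeEvalParser(
    user_model=user_model,        # assumed: an in-memory model object
    tokenizer=tokenizer,          # assumed: its matching tokenizer
    tasks="humaneval",            # one of the benchmarks listed above
    batch_size=10,                # mirrors --batch_size in the CLI hunk
    allow_code_execution=True,    # mirrors --allow_code_execution
)
results = evaluate(args)
print(results)
```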
evals/metrics/bleu/README.md: 4 additions & 24 deletions
@@ -1,28 +1,5 @@
----
-title: BLEU
-emoji: 🤗
-colorFrom: blue
-colorTo: red
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
-tags:
-- evaluate
-- metric
-description: >-
-  BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
-  Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
-  – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
-
-  Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
-  Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
-  Neither intelligibility nor grammatical correctness are not taken into account.
----
-
 # Metric Card for BLEU
 
-
 ## Metric Description
 BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
 ```
 
 ### Inputs
+
 -**predictions** (`list` of `str`s): Translations to score.
 -**references** (`list` of `list`s of `str`s): references for each translation.
--**tokenizer** : approach used for standardizing `predictions` and `references`.
+-**tokenizer** : approach used for standardizing `predictions` and `references`.
 The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
 This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
 
 The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index).
+
 -**max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`.
 -**smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`.
 
 ### Output Values
+
 -**bleu** (`float`): bleu score
 -**precisions** (`list` of `float`s): geometric mean of n-gram precisions,
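To tie the Inputs and Output Values above together, here is a small usage sketch with the Hugging Face `evaluate` library, assuming this metric card corresponds to the metric loaded as `evaluate.load("bleu")`:

```python
import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat is on the mat", "hello there general kenobi"]
references = [
    ["the cat sat on the mat"],                      # one or more references per prediction
    ["hello there general kenobi", "hello there !"],
]

# max_order and smooth are the parameters described under Inputs; per that section,
# a custom callable (e.g. nltk.tokenize.word_tokenize) could replace the default tokenizer.
results = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=4,
    smooth=False,
)
print(results["bleu"], results["precisions"])
```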