Commit 65a0a5b

doc: fix headings and indents (#109)

* fix heading levels
* remove $ on command examples
* fix markdown coding errors: indenting and spaces in emphasis

Signed-off-by: David B. Kinder <[email protected]>
1 parent 626d269 commit 65a0a5b

File tree

3 files changed: +27 / -46 lines changed


README.md

Lines changed: 16 additions & 15 deletions

````diff
@@ -69,27 +69,27 @@ results = evaluate(args)
 
 1. setup a separate server with [GenAIComps](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/lm-eval)
 
-```
-# build cpu docker
-docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
+```
+# build cpu docker
+docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
 
-# start the server
-docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
-```
+# start the server
+docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
+```
 
 2. evaluate the model
 
-- set `base_url`, `tokenizer` and `--model genai-hf`
+- set `base_url`, `tokenizer` and `--model genai-hf`
 
-```
-cd evals/evaluation/lm_evaluation_harness/examples
+```
+cd evals/evaluation/lm_evaluation_harness/examples
 
-python main.py \
---model genai-hf \
---model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
---tasks "lambada_openai" \
---batch_size 2
-```
+python main.py \
+--model genai-hf \
+--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
+--tasks "lambada_openai" \
+--batch_size 2
+```
 
 ### bigcode-evaluation-harness
 For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
@@ -104,6 +104,7 @@ python main.py \
 --batch_size 10 \
 --allow_code_execution
 ```
+
 #### function call usage
 ```python
 from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
````
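
The hunk above documents evaluating a model served by the GenAIComps lm-eval container through `--model genai-hf`. As an illustrative sketch only, the same invocation can be driven from a small Python script; `your_ip` is a placeholder for the host running the container, and the working directory assumes you start from the repository root, mirroring the README's `cd` step:

```python
import subprocess

# Placeholder: the host where the opea/lm-eval server from the README is listening on port 9006.
your_ip = "192.168.0.1"

# Run the exact CLI call shown in the README hunk above.
subprocess.run(
    [
        "python", "main.py",
        "--model", "genai-hf",
        "--model_args", f"base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3",
        "--tasks", "lambada_openai",
        "--batch_size", "2",
    ],
    cwd="evals/evaluation/lm_evaluation_harness/examples",  # matches the README's `cd` step
    check=True,  # raise CalledProcessError if the evaluation run exits non-zero
)
```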

evals/benchmark/stresscli/README.md

Lines changed: 7 additions & 7 deletions

````diff
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 ### Usage
 
 ```
-$ ./stresscli.py --help
+./stresscli.py --help
 Usage: stresscli.py [OPTIONS] COMMAND [ARGS]...
 
 StressCLI - A command line tool for stress testing OPEA workloads.
@@ -60,7 +60,7 @@ Commands:
 
 More detail options:
 ```
-$ ./stresscli.py load-test --help
+./stresscli.py load-test --help
 Usage: stresscli.py load-test [OPTIONS]
 
 Do load test
@@ -74,12 +74,12 @@ Options:
 
 You can generate the report for test cases by:
 ```
-$ ./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
+./stresscli.py report --folder /home/sdp/test_reports/20240710_004105 --format csv -o data.csv
 ```
 
 More detail options:
 ```
-$ ./stresscli.py report --help
+./stresscli.py report --help
 Usage: stresscli.py report [OPTIONS]
 
 Print the test report
@@ -101,7 +101,7 @@ You can dump the current testing profile by
 ```
 More detail options:
 ```
-$ ./stresscli.py dump --help
+./stresscli.py dump --help
 Usage: stresscli.py dump [OPTIONS]
 
 Dump the test spec
@@ -115,12 +115,12 @@ Options:
 
 You can validate if the current K8s and workloads deployment comply with the test spec by:
 ```
-$ ./stresscli.py validate --file testspec.yaml
+./stresscli.py validate --file testspec.yaml
 ```
 
 More detail options:
 ```
-$ ./stresscli.py validate --help
+./stresscli.py validate --help
 Usage: stresscli.py validate [OPTIONS]
 
 Validate against the test spec
````
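
As a minimal sketch of how the stresscli subcommands documented above fit together, the following Python driver runs the `report` and `validate` calls exactly as they appear in the README; the report folder and `testspec.yaml` are the literal examples from the diff, so substitute your own test run and spec file:

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Echo and run one stresscli invocation, failing fast on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Export a CSV report from an existing test run (folder path copied from the README example).
run([
    "./stresscli.py", "report",
    "--folder", "/home/sdp/test_reports/20240710_004105",
    "--format", "csv",
    "-o", "data.csv",
])

# Check that the current K8s and workload deployment complies with the test spec.
run(["./stresscli.py", "validate", "--file", "testspec.yaml"])
```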

evals/metrics/bleu/README.md

Lines changed: 4 additions & 24 deletions

````diff
@@ -1,28 +1,5 @@
----
-title: BLEU
-emoji: 🤗
-colorFrom: blue
-colorTo: red
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
-tags:
-- evaluate
-- metric
-description: >-
-BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
-Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is"
-– this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
-
-Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations.
-Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
-Neither intelligibility nor grammatical correctness are not taken into account.
----
-
 # Metric Card for BLEU
 
-
 ## Metric Description
 BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
 
@@ -48,17 +25,20 @@ This metric takes as input a list of predicted sentences and a list of lists of
 ```
 
 ### Inputs
+
 - **predictions** (`list` of `str`s): Translations to score.
 - **references** (`list` of `list`s of `str`s): references for each translation.
-- ** tokenizer** : approach used for standardizing `predictions` and `references`.
+- **tokenizer** : approach used for standardizing `predictions` and `references`.
 The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
 This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
 
 The default tokenizer is based on whitespace and regexes. It can be replaced by any function that takes a string as input and returns a list of tokens as output. E.g. `word_tokenize()` from [NLTK](https://www.nltk.org/api/nltk.tokenize.html) or pretrained tokenizers from the [Tokenizers library](https://huggingface.co/docs/tokenizers/index).
+
 - **max_order** (`int`): Maximum n-gram order to use when computing BLEU score. Defaults to `4`.
 - **smooth** (`boolean`): Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False`.
 
 ### Output Values
+
 - **bleu** (`float`): bleu score
 - **precisions** (`list` of `float`s): geometric mean of n-gram precisions,
 - **brevity_penalty** (`float`): brevity penalty,
````
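
The inputs and outputs listed in the metric card hunk above follow the Hugging Face `evaluate` conventions. A hedged sketch of exercising that contract, assuming the `evaluate` package is installed and the card is loaded as the standard `bleu` metric:

```python
import evaluate

# Load the BLEU metric described by the metric card above.
bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]            # list of str: translations to score
references = [[                                     # list of list of str: references per translation
    "the cat is sitting on the mat",
    "a cat sat on the mat",
]]

# max_order and smooth correspond to the optional inputs in the card.
results = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=4,
    smooth=False,
)

# Outputs match the card: bleu score, per-order precisions, brevity penalty.
print(results["bleu"], results["precisions"], results["brevity_penalty"])
```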
