Commit 77bb66c

Add README for codegen acc test. (#110)
Signed-off-by: Yao, Qing <[email protected]>
1 parent 52a540d commit 77bb66c

File tree

2 files changed: +95 −0 lines changed

evals/evaluation/bigcode_evaluation_harness/api_evaluator.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -16,6 +16,9 @@ def generate_text(self, task_name, intermediate_generations=None):
         dataset = task.get_dataset()
         # if args.limit is None, use all samples
         # if args.limit is used, make sure args.limit_start + args.limit <= len(dataset)
+
+        # TODO: Only running the entire task is supported for now; the limit or
+        # limit_start parameters will result in inaccurate results.
         n_tasks = min(self.args.limit, len(dataset) - self.args.limit_start) if self.args.limit else len(dataset)
         print(n_tasks)
         # when args.limit is None
```
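To see why the added TODO matters, here is a minimal sketch of how `limit` and `limit_start` shrink the evaluated slice. The numbers are hypothetical and the `Args` class merely stands in for the parsed CLI arguments:

```python
# Minimal sketch of the n_tasks computation above, with hypothetical values.
class Args:
    limit = 10       # hypothetical: user caps the run at 10 samples
    limit_start = 0  # hypothetical: start from the first sample

args = Args()
dataset_len = 164    # e.g. the size of the full HumanEval dataset

n_tasks = min(args.limit, dataset_len - args.limit_start) if args.limit else dataset_len
print(n_tasks)  # 10 -> only 10 of 164 problems are scored, so pass@1 is skewed
```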

examples/CodeGen/README.md

Lines changed: 92 additions & 0 deletions (new file)
# CodeGen Accuracy Evaluation

## Evaluation Framework

We evaluate accuracy with [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework for the evaluation of code generation models.

## Evaluation Steps
### Launch the CodeGen microservice

Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples/tree/main/CodeGen) and follow the guide to deploy the CodeGen megaservice.

Use a cURL command to test the CodeGen service and ensure that it has started properly:

```bash
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
curl $CODEGEN_ENDPOINT \
    -H "Content-Type: application/json" \
    -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
```
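If you prefer Python, the same smoke test can be done with `requests`. This is a minimal sketch; the endpoint and payload mirror the cURL call above, and `your_ip` is a placeholder to replace:

```python
import requests

# Same health check as the cURL call above; replace your_ip with the host IP.
CODEGEN_ENDPOINT = "http://your_ip:7778/v1/codegen"

response = requests.post(
    CODEGEN_ENDPOINT,
    headers={"Content-Type": "application/json"},
    json={
        "messages": "Implement a high-level API for a TODO list application. "
        "The API takes as input an operation request and updates the TODO list "
        "in place. If the request is invalid, raise an exception."
    },
    timeout=120,
)
response.raise_for_status()  # a 2xx status means the service is up
print(response.text)
```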
### Generation and Evaluation

To evaluate models on coding tasks, and coding LLMs in particular, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available, in both completion (left-to-right) and insertion (FIM) mode.

#### Command-line usage

```shell
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py --model Qwen/CodeQwen1.5-7B-Chat \
    --tasks humaneval \
    --codegen_url $CODEGEN_ENDPOINT \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
```
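For the function-call usage mentioned above, a minimal sketch follows. It assumes the package exposes a parser class and an `evaluate` entry point (`BigcodeEvalParser`, `evaluate`) under the module path shown in the diff; check the package source for the exact names and arguments:

```python
# Hypothetical sketch of function-call usage, mirroring the CLI flags above.
# BigcodeEvalParser and evaluate are assumed names; verify against the source.
from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

args = BigcodeEvalParser(
    model="Qwen/CodeQwen1.5-7B-Chat",
    tasks="humaneval",
    codegen_url="http://your_ip:7778/v1/codegen",  # i.e. $CODEGEN_ENDPOINT
    max_length_generation=2048,
    batch_size=1,
    save_generations=True,
    save_references=True,
    allow_code_execution=True,
)
results = evaluate(args)
print(results)
```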
***Note:*** Currently, the framework is designed to execute tasks in full. To ensure accurate results, avoid using the `limit` or `limit_start` parameters to restrict the number of test samples.
### Accuracy Result

Here is the tested result for your reference:

```json
{
  "humaneval": {
    "pass@1": 0.7195121951219512
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "Qwen/CodeQwen1.5-7B-Chat",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp32",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": true,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false,
    "codegen_url": "http://192.168.123.104:31234/v1/codegen"
  }
}
```
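With `"n_samples": 1`, pass@1 reduces to the fraction of problems whose single generated solution passes all unit tests; for the standard 164-problem HumanEval set, the score above corresponds to 118 solved problems (an inference from the reported value, not taken from the run log):

```python
# pass@1 with one sample per problem is simply solved / total.
solved = 118  # implied by the reported score
total = 164   # problems in the standard HumanEval set
print(solved / total)  # 0.7195121951219512
```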
