|
3 | 3 | ### MCP Server 🖥️
|
4 | 4 | - ✅ Python stdio server support
|
5 | 5 | - ✅ node.js stdio server support
|
| 6 | +- ✅ HTTP MCP server support |
| 7 | +- 🔲 Connecting MCP servers through a JSON config file as the standard approach |
6 | 8 |
|
7 | 9 | ### MCP Client 🤖
|
8 | 10 | - ✅ Stdio client implementation
|
|
19 | 21 | ### Evaluation 📊
|
20 | 22 | - ✅ Implement core evaluation metrics (accuracy, latency)
|
21 | 23 | - ✅ Create automated testing framework
|
| 24 | +- 🔲 Automatic deep evaluation |
| 25 | +- 🔲 Evaluating the implementation of MCP servers |
22 | 26 |
|
23 | 27 | ### Data Pipeline 🔄
|
24 | 28 | - ✅ Design unified data schema for all benchmarks
|
25 | 29 | - ✅ Implement data preprocessing tools
|
26 | 30 | - ✅ Add support for multiple data formats
|
27 | 31 |
|
28 |
| -### Benchmarks 🧪 |
29 |
| -- ✅ Airbnb MCP benchmark |
30 |
| -- ✅ Healthcare MCP benchmark |
31 |
| -- ✅ yahoo finance MCP benchmark |
32 |
| -- ✅ Sports benchmark |
33 |
| -- ✅ travel_assistant benchmark |
34 |
| -- ✅ File System benchmark |
35 |
| - |
36 | 32 | ### LLM Provider 🧠
|
37 | 33 | - ✅ OpenAI API integration (used for data generation and testing)
|
38 | 34 | - ✅ local vllm-based model
|
|
43 | 39 | - ✅ Data converter
|
44 | 40 | - ✅ Model evaluator
|
45 | 41 | - ✅ Report generator
|
46 |
| -- ✅ Auto end-to-end evaluation |
| 42 | +- ✅ Auto end-to-end evaluation |
| 43 | + |
| 44 | +### Front-end 🎨 |
| 45 | +- ✅ React application setup with TypeScript |
| 46 | +- ✅ Core navigation and routing |
| 47 | +- ✅ MCP server configuration interface |
| 48 | +- ✅ Chat client for MCP interactions |
| 49 | +- ✅ Task generation and verification UI |
| 50 | +- ✅ Model evaluation dashboard |
| 51 | +- ✅ Results and analytics pages |
| 52 | +- ✅ Data management interfaces |
| 53 | +- 🔲 Unifying the model config for all the pages and sharing the same component |
| 54 | +- 🔲 Saving any existing model config as a config file and supporting loading it again |
| 55 | + |
| 56 | +## Issues |
| 57 | +- Evaluating multiple models does not work |
| 58 | +- Analyze feature does not support skipping AI report generation |
| 59 | +- Selecting Judge Rubrics does not generate a report |