Objective
Parse the 185 existing UDB parameter YAML files and load them into a local, searchable vector database (ChromaDB). When we build LLM prompts later, this database lets us retrieve the most relevant existing parameters dynamically instead of hardcoding a few static examples.
Data Sources
spec/std/isa/param/*.yaml — 207 files total, 185 real parameters (22 are test mocks)
spec/std/isa/csr/*.yaml — 81 CSR definition files that reference parameters in their IDL code
spec/schemas/param_schema.json & schema_defs.json — the official schemas
Steps
1. Parse All Parameter YAML Files
Read each parameter file and extract: name, description, long_name, schema (value type), definedBy (which extension), and requirements (IDL constraints). Categorize them by value type:
| Schema Pattern | Type | Examples |
| --- | --- | --- |
| type: boolean | boolean | MISALIGNED_LDST, TRAP_ON_ILLEGAL_WLRL |
| type: integer, enum: [32, 64] | integer enum | MXLEN, SXLEN |
| type: string, enum: [...] | string enum | HW_MSTATUS_FS_DIRTY_UPDATE, LRSC_RESERVATION_STRATEGY |
| type: integer, minimum/maximum | integer range | NUM_PMP_ENTRIES (0–64), PHYS_ADDR_WIDTH |
| type: array, items: {enum} | array/set | VSXLEN, MTVEC_MODES |
| oneOf with when conditions | conditional | MTVAL_WIDTH, MTVEC_BASE_ALIGNMENT_DIRECT |
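The categorization by value type could be sketched roughly as follows. This is a minimal illustration, not the final implementation: `classify_schema` is a hypothetical helper, it assumes the `schema` block has already been loaded into a plain dict (for example with `yaml.safe_load`), and the fallback labels `integer`, `string`, and `other` are assumptions for schemas that match none of the listed patterns.

```python
from typing import Any, Mapping


def classify_schema(schema: Mapping[str, Any]) -> str:
    """Map a parameter's schema fragment to one of the value-type categories.

    Hypothetical helper: the category names follow the table in this step,
    while the fallbacks ("integer", "string", "other") are assumptions.
    """
    # Conditional schemas are checked first because a oneOf wrapper
    # usually carries no top-level "type" key of its own.
    if "oneOf" in schema:
        return "conditional"

    kind = schema.get("type")
    if kind == "boolean":
        return "boolean"
    if kind == "array":
        return "array/set"
    if kind == "integer":
        if "enum" in schema:
            return "integer enum"
        if "minimum" in schema or "maximum" in schema:
            return "integer range"
        return "integer"
    if kind == "string":
        return "string enum" if "enum" in schema else "string"
    return "other"


if __name__ == "__main__":
    print(classify_schema({"type": "integer", "enum": [32, 64]}))
```

Checking `oneOf` before `type` keeps conditional parameters from being misfiled under one of their branches; whether real parameter files need further patterns would have to be confirmed against the corpus.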
2. Extract CSR Cross-References
Scan all 81 CSR files and find which CSR fields reference which parameters in their IDL code. For example, mtvec.yaml references MTVEC_MODES inside MODE.sw_write(). Store these cross-references alongside the parameter data — this helps the search engine link related concepts.
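One possible way to collect these cross-references is a plain identifier scan over every string inside each CSR field. The sketch below assumes each CSR file parses to a dict with a top-level `fields` mapping whose values contain the IDL snippets as strings; that layout, and the helper names `iter_strings` and `find_param_refs`, are assumptions for illustration. A text scan can also produce false positives (for example a parameter name mentioned only in a description), so a real implementation might restrict it to IDL keys.

```python
import re
from typing import Any, Dict, Iterator, Mapping, Set

# Upper-case identifiers such as MTVEC_MODES; matches are later
# intersected with the set of known parameter names.
IDENT = re.compile(r"\b[A-Z][A-Z0-9_]*\b")


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value nested anywhere inside a parsed YAML node."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, Mapping):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def find_param_refs(
    csr_name: str, csr_doc: Mapping[str, Any], param_names: Set[str]
) -> Dict[str, Set[str]]:
    """Return {parameter name: {"csr.FIELD", ...}} for one parsed CSR file."""
    refs: Dict[str, Set[str]] = {}
    # Assumed layout: a top-level "fields" mapping of field name -> body.
    for field_name, field_body in (csr_doc.get("fields") or {}).items():
        for text in iter_strings(field_body):
            for ident in IDENT.findall(text):
                if ident in param_names:
                    refs.setdefault(ident, set()).add(f"{csr_name}.{field_name}")
    return refs


if __name__ == "__main__":
    doc = {"fields": {"MODE": {"sw_write(csr_value)": "MTVEC_MODES.includes(csr_value.MODE)"}}}
    print(find_param_refs("mtvec", doc, {"MTVEC_MODES", "MXLEN"}))
```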
3. Build the Vector Database
Convert each parameter into a plain-English summary (combining its name, description, type, extension, and CSR references). Load these into ChromaDB using all-MiniLM-L6-v2 sentence embeddings. Store extension info as filterable metadata.
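A rough sketch of the summary and loading step is shown below. The summary wording, the collection name `udb_params`, the metadata keys, and the helper names are all assumptions made for illustration; `definedBy` is passed through `str()` because its exact shape in the YAML files is not specified here. The ChromaDB imports are kept inside `load_into_chroma` so that the summary logic can be exercised without the database installed.

```python
from typing import Any, Dict, List, Mapping, Sequence


def build_summary(
    param: Mapping[str, Any], value_type: str, csr_refs: Sequence[str] = ()
) -> str:
    """Combine name, description, type, extension, and CSR references into one sentence-style text."""
    parts = [f"Parameter {param['name']} ({value_type})."]
    if param.get("long_name"):
        parts.append(f"{param['long_name']}.")
    if param.get("description"):
        # Collapse YAML block-scalar whitespace into single spaces.
        parts.append(" ".join(str(param["description"]).split()))
    if param.get("definedBy"):
        parts.append(f"Defined by extension: {param['definedBy']}.")
    if csr_refs:
        parts.append("Referenced by CSR fields: " + ", ".join(sorted(csr_refs)) + ".")
    return " ".join(parts)


def to_record(
    param: Mapping[str, Any], value_type: str, csr_refs: Sequence[str] = ()
) -> Dict[str, Any]:
    """Build the id, document, and filterable metadata for one parameter."""
    return {
        "id": param["name"],
        "document": build_summary(param, value_type, csr_refs),
        "metadata": {"extension": str(param.get("definedBy", "")), "value_type": value_type},
    }


def load_into_chroma(records: List[Dict[str, Any]], db_path: str):
    """Persist records into a local ChromaDB collection (requires chromadb and sentence-transformers)."""
    import chromadb
    from chromadb.utils import embedding_functions

    client = chromadb.PersistentClient(path=db_path)
    embed = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    # "udb_params" is a placeholder collection name.
    collection = client.get_or_create_collection(name="udb_params", embedding_function=embed)
    collection.upsert(
        ids=[r["id"] for r in records],
        documents=[r["document"] for r in records],
        metadatas=[r["metadata"] for r in records],
    )
    return collection
```

Using the parameter name as the document id and `upsert` rather than `add` would let the build script be rerun without creating duplicates, which seems convenient while the parser is still changing.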
4. Test Retrieval Accuracy
Write test queries to verify the database returns relevant results:
"legal values of mtvec mode field" → should return MTVEC_MODES
"misaligned memory access support" → should return MISALIGNED_LDST
"width of trap value register" → should return MTVAL_WIDTH
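These checks might be organized as a small table of query and expected-parameter pairs, scored by whether the expected parameter appears in the top results. The sketch below makes several assumptions: the cutoff `k=3` is arbitrary, document ids are taken to be parameter names, and `evaluate` and `chroma_query_fn` are hypothetical helpers. The query function is injected so the scoring logic can be tested without a database.

```python
from typing import Callable, List, Sequence, Tuple

# Query -> expected parameter, taken from the examples in this step.
CASES: List[Tuple[str, str]] = [
    ("legal values of mtvec mode field", "MTVEC_MODES"),
    ("misaligned memory access support", "MISALIGNED_LDST"),
    ("width of trap value register", "MTVAL_WIDTH"),
]

QueryFn = Callable[[str, int], Sequence[str]]


def evaluate(
    query_fn: QueryFn, cases: Sequence[Tuple[str, str]], k: int = 3
) -> List[Tuple[str, str, List[str]]]:
    """Return (query, expected, returned ids) for every case that misses the top k."""
    failures = []
    for query, expected in cases:
        ids = list(query_fn(query, k))
        if expected not in ids:
            failures.append((query, expected, ids))
    return failures


def chroma_query_fn(collection) -> QueryFn:
    """Adapt a ChromaDB collection to the (query, k) -> ids interface."""

    def run(query: str, k: int) -> Sequence[str]:
        result = collection.query(query_texts=[query], n_results=k)
        # One query text was sent, so the first id list holds its results.
        return result["ids"][0]

    return run


if __name__ == "__main__":
    # Stand-in query function; a real run would pass chroma_query_fn(collection).
    fake = lambda query, k: ["MTVEC_MODES", "MISALIGNED_LDST", "MTVAL_WIDTH"][:k]
    print(evaluate(fake, CASES))
```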
Deliverables
| File | Description |
| --- | --- |
| tools/llm-extraction/build_vector_db.py | Script to parse params and build the database |
| tools/llm-extraction/chroma_db/ | The local vector database |
| tools/llm-extraction/param_corpus.json | All parsed parameter data in JSON |
| tools/llm-extraction/test_retrieval.py | Retrieval accuracy tests |
Acceptance Criteria