[RAG Pipeline] Build UDB Parameter Knowledge Base #1769

@ankit-cybertron

Objective

Parse all 185 existing UDB parameter YAML files and load them into a local searchable database (ChromaDB). This database lets us dynamically find the most relevant existing parameters when building LLM prompts later — instead of hardcoding a few static examples.

Data Sources

  • spec/std/isa/param/*.yaml — 207 files total, 185 real parameters (22 are test mocks)
  • spec/std/isa/csr/*.yaml — 81 CSR definition files that reference parameters in their IDL code
  • spec/schemas/param_schema.json & schema_defs.json — the official schemas

Steps

1. Parse All Parameter YAML Files

Read each parameter file and extract: name, description, long_name, schema (value type), definedBy (which extension), and requirements (IDL constraints). Categorize them by value type:

| Schema Pattern | Type | Examples |
| --- | --- | --- |
| type: boolean | boolean | MISALIGNED_LDST, TRAP_ON_ILLEGAL_WLRL |
| type: integer, enum: [32, 64] | integer enum | MXLEN, SXLEN |
| type: string, enum: [...] | string enum | HW_MSTATUS_FS_DIRTY_UPDATE, LRSC_RESERVATION_STRATEGY |
| type: integer, minimum/maximum | integer range | NUM_PMP_ENTRIES (0–64), PHYS_ADDR_WIDTH |
| type: array, items: {enum} | array/set | VSXLEN, MTVEC_MODES |
| oneOf with when conditions | conditional | MTVAL_WIDTH, MTVEC_BASE_ALIGNMENT_DIRECT |
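
The categorization in the table above could be sketched roughly as follows. This is a minimal sketch, not the final `build_vector_db.py` logic: the function name `classify_schema` is hypothetical, it assumes the `schema` block has already been loaded into a plain dict (for example with a YAML loader), and it assumes conditional parameters expose `oneOf` at the top level of that block. Anything that does not match the six patterns falls into an `"other"` bucket so it can be reviewed by hand.

```python
from typing import Any


def classify_schema(schema: dict[str, Any]) -> str:
    """Map a parameter's schema block to one of the categories in the table.

    Hypothetical helper: assumes `schema` is the already-parsed `schema`
    mapping from a parameter YAML file.
    """
    # Conditional schemas are checked first because their branches may
    # themselves contain `type`, `enum`, and so on.
    if "oneOf" in schema:
        return "conditional"

    kind = schema.get("type")
    if kind == "boolean":
        return "boolean"
    if kind == "array":
        return "array/set"
    if kind in ("integer", "string") and "enum" in schema:
        return f"{kind} enum"
    if kind == "integer" and ("minimum" in schema or "maximum" in schema):
        return "integer range"

    # Anything unexpected is surfaced rather than silently miscategorized.
    return "other"
```

Counting how many of the 185 parameters land in `"other"` gives a quick check on whether the six patterns cover the corpus.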

2. Extract CSR Cross-References

Scan all 81 CSR files and find which CSR fields reference which parameters in their IDL code. For example, mtvec.yaml references MTVEC_MODES inside MODE.sw_write(). Store these cross-references alongside the parameter data — this helps the search engine link related concepts.
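
One simple way to approximate this step is a token scan: treat every upper-case identifier in a CSR file as a candidate and keep only those that match a known parameter name. The sketch below assumes the CSR files have already been read into a `{csr_name: file_text}` dict; `find_param_refs` and `build_xrefs` are hypothetical names. It works on whole-file text, so it records which CSR mentions a parameter but not which field or function (such as `MODE.sw_write()`); getting field-level detail would need a real walk over the YAML structure.

```python
import re
from collections import defaultdict

# Parameter names in the table above are upper-case identifiers with
# underscores and digits, so only tokens of that shape are considered.
IDENTIFIER = re.compile(r"\b[A-Z][A-Z0-9_]+\b")


def find_param_refs(text: str, param_names: set[str]) -> set[str]:
    """Return the known parameter names that appear as whole tokens in text."""
    return {token for token in IDENTIFIER.findall(text) if token in param_names}


def build_xrefs(csr_texts: dict[str, str], param_names: set[str]) -> dict[str, list[str]]:
    """Invert the scan: map each parameter to the sorted CSRs that mention it."""
    xrefs: dict[str, list[str]] = defaultdict(list)
    for csr_name in sorted(csr_texts):
        for param in sorted(find_param_refs(csr_texts[csr_name], param_names)):
            xrefs[param].append(csr_name)
    return dict(xrefs)
```

Filtering against the set of parsed parameter names keeps other upper-case tokens in the IDL, such as field names, out of the result.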

3. Build the Vector Database

Convert each parameter into a plain-English summary (combining its name, description, type, extension, and CSR references). Load these into ChromaDB using all-MiniLM-L6-v2 sentence embeddings. Store extension info as filterable metadata.
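
A rough sketch of this step, under stated assumptions: each parsed parameter is a dict with the hypothetical keys `name`, `category`, `description`, `extension`, and `csr_refs`, and the collection name `udb_params` is made up for illustration. The ChromaDB calls used here (`PersistentClient`, `get_or_create_collection`, `upsert`, and `SentenceTransformerEmbeddingFunction`) are part of the public ChromaDB API, but the exact layout of the final script may differ. ChromaDB metadata values have to be scalars, so the extension is stored as a single string.

```python
def summarize(param: dict) -> str:
    """Build the plain-English text that gets embedded for one parameter."""
    parts = [f"Parameter {param['name']} ({param['category']})."]
    if param.get("description"):
        # Collapse YAML block-scalar newlines into single spaces.
        parts.append(" ".join(param["description"].split()))
    if param.get("extension"):
        parts.append(f"Defined by extension {param['extension']}.")
    if param.get("csr_refs"):
        parts.append("Referenced by CSRs: " + ", ".join(param["csr_refs"]) + ".")
    return " ".join(parts)


def build_collection(params: list[dict], db_path: str = "tools/llm-extraction/chroma_db"):
    """Embed every parameter summary and persist it to a local ChromaDB."""
    # Imported lazily so summarize() can be used without chromadb installed.
    import chromadb
    from chromadb.utils import embedding_functions

    embed = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        name="udb_params",  # hypothetical collection name
        embedding_function=embed,
    )
    # upsert keeps reruns idempotent: existing ids are overwritten, not duplicated.
    collection.upsert(
        ids=[p["name"] for p in params],
        documents=[summarize(p) for p in params],
        metadatas=[
            {"extension": p.get("extension", ""), "category": p["category"]}
            for p in params
        ],
    )
    return collection
```

Queries can then be restricted to a single extension with a `where={"extension": ...}` filter. For the offline requirement, the embedding model has to be present in the local cache before the first run.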

4. Test Retrieval Accuracy

Write test queries to verify the database returns relevant results:

  • "legal values of mtvec mode field" → should return MTVEC_MODES
  • "misaligned memory access support" → should return MISALIGNED_LDST
  • "width of trap value register" → should return MTVAL_WIDTH
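
The checks above could be scored as a top-k hit rate: a query counts as a hit when the expected parameter appears among the first `k` results. The sketch below is one possible shape for `test_retrieval.py`, not a fixed design. The cutoff `k=3` and the helper names `hit_rate` and `chroma_search` are assumptions; the search function is passed in so the scoring logic can be exercised without a database.

```python
from typing import Callable

# Query -> expected parameter, taken from the list above.
EXPECTED = {
    "legal values of mtvec mode field": "MTVEC_MODES",
    "misaligned memory access support": "MISALIGNED_LDST",
    "width of trap value register": "MTVAL_WIDTH",
}


def hit_rate(
    search: Callable[[str, int], list[str]],
    expected: dict[str, str],
    k: int = 3,  # assumed cutoff; the issue does not fix a value
) -> float:
    """Fraction of queries whose expected parameter is in the top-k results."""
    hits = sum(1 for query, name in expected.items() if name in search(query, k))
    return hits / len(expected)


def chroma_search(collection) -> Callable[[str, int], list[str]]:
    """Adapt a ChromaDB collection to the (query, k) -> ids interface."""

    def search(query: str, k: int) -> list[str]:
        result = collection.query(query_texts=[query], n_results=k)
        # query() returns one list of ids per query text.
        return result["ids"][0]

    return search
```

With a built collection, `hit_rate(chroma_search(collection), EXPECTED)` yields a single number to compare against the accuracy threshold. Three queries are too few to measure 90% meaningfully, so the query set would need to grow.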

Deliverables

File Description
tools/llm-extraction/build_vector_db.py Script to parse params and build the database
tools/llm-extraction/chroma_db/ The local vector database
tools/llm-extraction/param_corpus.json All parsed parameter data in JSON
tools/llm-extraction/test_retrieval.py Retrieval accuracy tests

Acceptance Criteria

  • All 185 non-mock parameters parsed without errors
  • CSR cross-references extracted for parameters used in IDL code
  • More than 90% of test queries return the expected parameter among the top results
  • Runs entirely offline — no cloud API calls needed
