Fix regex in mdextractor to handle nested code blocks #4

@chigwell

Description

User Story
As a developer using mdextractor,
I want code blocks containing inner backticks to be extracted as single units
so that nested or complex markdown structures are parsed accurately.

Background
The current regex pattern in mdextractor/__init__.py uses non-greedy matching (.*?), which causes unintended splits when a code block contains inner ``` sequences. As a result, the test_nested_code_blocks unit test fails: it expects `["Outer inner end"]` but currently gets `["Outer", "end"]`. The pattern closes the block at the first closing backticks it encounters instead of matching the outermost pair.

Acceptance Criteria

  • Modify mdextractor/__init__.py to use a regex pattern that greedily matches entire fenced blocks, ignoring inner backticks.
  • Update test_nested_code_blocks in tests/test_mdextractor.py to validate that blocks like:
    Outer ```inner``` end
    
    are extracted as a single string, ["Outer ```inner``` end"].
  • Ensure existing tests (e.g., test_multiple_blocks, test_with_language_specifier) still pass after the regex update.
  • Validate edge cases: consecutive backticks outside code fences, malformed blocks, and mixed inline/block syntax.
  • Confirm extraction works for multi-line blocks with varying indentation and whitespace.
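
One possible direction for the criteria above, as a minimal sketch rather than the project's actual implementation: treat a run of three backticks as a fence only when it starts a line, so backticks in the middle of a line stay part of the block body. The function name extract_fenced_blocks and the pattern are hypothetical, and the sketch assumes both the opening and closing fences sit on their own lines.

```python
import re

# Hypothetical pattern (not the one currently in mdextractor/__init__.py).
# A fence only counts when it begins a line, so inner backticks that
# appear mid-line do not terminate the block.
FENCE_RE = re.compile(
    r"^```[^\n`]*\n"    # opening fence plus optional language specifier
    r"(.*?)"            # block body, matched lazily
    r"\n?^```[ \t]*$",  # closing fence alone on its own line
    re.DOTALL | re.MULTILINE,
)


def extract_fenced_blocks(text: str) -> list[str]:
    """Return the bodies of all line-anchored fenced blocks in text."""
    return FENCE_RE.findall(text)


nested = "```\nOuter ```inner``` end\n```"
multiple = "```python\na = 1\n```\ntext\n```\nb = 2\n```"

print(extract_fenced_blocks(nested))
print(extract_fenced_blocks(multiple))
```

The body stays lazy on purpose: a fully greedy `.*` with re.DOTALL would run from the first opening fence to the last closing fence in the document and merge separate blocks, which would conflict with test_multiple_blocks. Under this sketch an unclosed fence yields no match, and whether that is the desired behavior for malformed blocks is left open.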
