User Story
As a developer using mdextractor, I want code block extraction to preserve original leading/trailing whitespace (except newlines) so that valid code structures with intentional indentation or alignment aren’t altered.
Background
The current implementation of extract_md_blocks in mdextractor/__init__.py uses block.strip(), which removes all leading/trailing whitespace, including spaces and tabs. This is problematic for code blocks where whitespace is syntactically significant (e.g., Python indentation, YAML formatting). For example, a code block starting with def example(): loses its leading spaces, rendering it invalid. The regex pattern r"```(?:\w+\s+)?(.*?)```" already captures the content correctly, but the aggressive stripping erases meaningful data.
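The difference between the two calls can be seen in a small sketch. It uses the pattern quoted above; the `re.DOTALL` flag and the sample `markdown` string are assumptions made for illustration, and an untagged fence is used so that the optional language group does not come into play.

```python
import re

# Pattern quoted in the Background; re.DOTALL is an assumption so "." can span lines.
PATTERN = re.compile(r"```(?:\w+\s+)?(.*?)```", re.DOTALL)

# Hypothetical input: an untagged fence whose first line is indented by four spaces.
markdown = "```\n    def example():\n        return 1\n```"

raw = PATTERN.findall(markdown)[0]

stripped_all = raw.strip()           # current behaviour: drops the leading indentation
stripped_newlines = raw.strip("\n")  # proposed behaviour: drops only the boundary newlines

print(repr(stripped_all))
print(repr(stripped_newlines))
```

With `strip()` the first line loses its four spaces while the second line keeps its eight, so the relative indentation no longer matches the original. With `strip("\n")` only the newlines next to the fences are removed.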
Acceptance Criteria
- Update extract_md_blocks in mdextractor/__init__.py to use block.strip('\n') instead of block.strip().
- Add tests in tests/test_mdextractor.py verifying:
  - " code\n " becomes " code".
  - "\ncode\n" becomes "code".
  - " \n " → " ".
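A possible shape for these tests is sketched below. `strip_block` is a hypothetical stand-in for the per-block step inside extract_md_blocks, not the library's actual API, and the test names are illustrative. One caveat worth checking before the criteria are finalised: `str.strip('\n')` stops at the first non-newline character on each side, so inputs whose outermost characters are spaces, such as `" code\n "` and `" \n "`, come back unchanged. Producing `" code"` or `" "` from them would need logic beyond a plain `strip('\n')`.

```python
def strip_block(block: str) -> str:
    """Hypothetical stand-in for the per-block step inside extract_md_blocks."""
    return block.strip("\n")


def test_boundary_newlines_are_removed():
    assert strip_block("\ncode\n") == "code"


def test_indentation_is_preserved():
    assert strip_block("\n    code\n") == "    code"


def test_newline_behind_a_space_is_untouched():
    # The outermost characters are spaces, so strip("\n") removes nothing.
    assert strip_block(" code\n ") == " code\n "
    assert strip_block(" \n ") == " \n "
```

If the third test does not reflect the intended behaviour, either the expected values in the criteria or the proposed `strip('\n')` call would need to be adjusted.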