Skip to content

Protential ReDoS vulnerability in markdown2.py #633

@ShangzhiXu

Description

@ShangzhiXu

Describe the bug
Hi team, thanks for your great work! I think I found a small vulnerability that might lead to DDoS in the system
At line 1265 in markdown2.py
the regex _sorta_html_tokenize_re is vulnerable to ReDoS when it is used in
for token in self._sorta_html_tokenize_re.split(text):

To Reproduce
I have a test file here and can be run directly

import markdown2
import time

for repeat_count in range(0,5000,500):
    start_time = time.time()
    markdown_text = "<p m=\"1\"" * repeat_count + " "* repeat_count + " " * repeat_count + "</div"

    start_time = time.time()
    html_output = markdown2.markdown(markdown_text, extras=["markdown-in-html"])
    end_time = time.time()
    print(f"repeat_count/length: {repeat_count}/{len(markdown_text)} Time taken: {end_time - start_time:.6f} seconds")

The result is like this

repeat_count/length: 0/5 Time taken: 0.004103 seconds
repeat_count/length: 500/5005 Time taken: 3.075934 seconds
repeat_count/length: 1000/10005 Time taken: 23.623668 seconds
repeat_count/length: 1500/15005 Time taken: 78.709412 seconds
repeat_count/length: 2000/20005 Time taken: 186.460569 seconds
repeat_count/length: 2500/25005 Time taken: 361.830719 seconds
repeat_count/length: 3000/30005 Time taken: 625.041665 seconds
repeat_count/length: 3500/35005 Time taken: 991.112904 seconds

As we can see, the time consumption increase significantly, and only 30k chars can lead to the program hang for more than 16 mins.

Expected behavior
I think we can add a limit like replace \s* with \s{0,100} or maybe divide the regex into multiple sub-regex? Maybe it can help to solve the recursion problem.

For the regex _sorta_html_tokenize_re , the issue occurs due to the subregex

              (?:             # attributes
                    \s+                           # whitespace after tag
                    (?:[^\t<>"'=/]+:)?
                    [^<>"'=/]+=                   # attr name
                    (?:".*?"|'.*?'|[^<>"'=/\s]+)  # value, quoted or unquoted. If unquoted, no spaces allowed
                )*
                \s*/?>

The (...)* and \s* both try to eagerly match strings, leading to massive recusrion.

I tried to remove one of them from the regex, and the performance significantly improved

# remove \s*
repeat_count/length: 0/5 Time taken: 0.004091 seconds
repeat_count/length: 500/6005 Time taken: 0.080678 seconds
repeat_count/length: 1000/12005 Time taken: 0.326295 seconds
repeat_count/length: 1500/18005 Time taken: 0.712929 seconds
repeat_count/length: 2000/24005 Time taken: 1.258733 seconds
repeat_count/length: 2500/30005 Time taken: 2.067534 seconds
repeat_count/length: 3000/36005 Time taken: 2.911231 seconds
repeat_count/length: 3500/42005 Time taken: 3.897797 seconds
repeat_count/length: 4000/48005 Time taken: 5.017783 seconds
repeat_count/length: 4500/54005 Time taken: 6.552440 seconds
# remove `(...)*`
repeat_count/length: 0/5 Time taken: 0.004098 seconds
repeat_count/length: 500/6005 Time taken: 0.088631 seconds
repeat_count/length: 1000/12005 Time taken: 0.347793 seconds
repeat_count/length: 1500/18005 Time taken: 0.774725 seconds
repeat_count/length: 2000/24005 Time taken: 1.367911 seconds
repeat_count/length: 2500/30005 Time taken: 2.169432 seconds
repeat_count/length: 3000/36005 Time taken: 3.167773 seconds
repeat_count/length: 3500/42005 Time taken: 4.229510 seconds
repeat_count/length: 4000/48005 Time taken: 5.558357 seconds
repeat_count/length: 4500/54005 Time taken: 6.914844 seconds

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions