Skip to content

fix(message-extractor): always convert text from html for previews#10919

Open
frabera wants to merge 6 commits into
thunderbird:mainfrom
frabera:patch-1
Open

fix(message-extractor): always convert text from html for previews#10919
frabera wants to merge 6 commits into
thunderbird:mainfrom
frabera:patch-1

Conversation

@frabera

@frabera frabera commented Apr 20, 2026

Copy link
Copy Markdown

Resolves #10256
Resolves #8471

This solves html characters or code in notifications (#10256) and message previews (#8471)

Previously it only converted from html after checking the mimetype convertFromHtmlIfNecessary() but if the message contained html code or special characters in the plain text part they were rendered in the message previews and in notifications, polluting content.

I kept the function, let me know if there is a cleaner way to accomplish this. I tested it with messages containing special characters and also with html code, they are not shown.

@github-actions

This comment was marked as resolved.

@frabera frabera marked this pull request as draft April 20, 2026 22:17
@frabera frabera marked this pull request as draft April 20, 2026 22:17
@github-actions

This comment was marked as resolved.

2 similar comments
@github-actions

This comment was marked as resolved.

@github-actions

This comment was marked as resolved.

@frabera frabera marked this pull request as ready for review April 20, 2026 23:53
@github-actions

This comment was marked as resolved.

3 similar comments
@github-actions

This comment was marked as resolved.

@github-actions

This comment was marked as resolved.

@github-actions

This comment was marked as resolved.

@frabera

frabera commented Apr 21, 2026

Copy link
Copy Markdown
Author

I modified the logic by moving the parsing and cleaning parts in stripTextForPreview() because while it worked, the previous solution didn't parse the quotation header from forwarded emails and replies (it showed -----Original Message----- in the preview). This is caused somehow by parsing before extractUnquotedText().

In this way if a message contains only an html part, it is parsed twice. To avoid this a deeper modification is required, I'll wait for opinions from reviewers.

Here's an example returning a boolean from convertFromHtmlIfNecessary() if a parsing will be required:

 fun extractPreview(textPart: Part): String {
        val text = MessageExtractor.getTextFromPart(textPart, MAX_CHARACTERS_CHECKED_FOR_PREVIEW)
            ?: throw PreviewExtractionException("Couldn't get text from part")

        val (plainText, requiresParsingHtml) = convertFromHtmlIfNecessary(textPart, text)
        return stripTextForPreview(plainText, requiresParsingHtml)
    }

    private fun convertFromHtmlIfNecessary(textPart: Part, text: String): Pair<String, Boolean> {
        return if (isSameMimeType(textPart.mimeType, "text/html")) {
            HtmlConverter.htmlToText(text) to false
        } else {
            text to true
        }
    }

    private fun stripTextForPreview(text: String, parseHtml: Boolean): String {
        var intermediateText = text

        intermediateText = normalizeLineBreaks(intermediateText)
        intermediateText = stripSignature(intermediateText)
        intermediateText = extractUnquotedText(intermediateText)

        // try to remove lines of dashes in the preview
        intermediateText = intermediateText.replace("(?m)^----.*?$".toRegex(), "")
        // Remove horizontal rules.
        intermediateText = intermediateText.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")

        // If the textPart was plaintext, parse as HTML
        if (parseHtml) {
            intermediateText = HtmlConverter.htmlToText(intermediateText)
        }

        // Remove parsed HTML links/images "<url>"
        intermediateText = intermediateText.replace("<https?://\\S+>".toRegex(), " ")

        // URLs in the preview should just be shown as "..." - They're not
        // clickable and they usually overwhelm the preview
        intermediateText = intermediateText.replace("https?://\\S+".toRegex(), "...")
        // Don't show newlines in the preview
        intermediateText = intermediateText.replace('\n', ' ')
        // Collapse whitespace in the preview
        intermediateText = intermediateText.replace("\\s+".toRegex(), " ")
        // Remove any whitespace at the beginning and end of the string.
        intermediateText = intermediateText.trim()

        return if (intermediateText.length > MAX_PREVIEW_LENGTH) {
            intermediateText.substring(0, MAX_PREVIEW_LENGTH - 1) + ""
        } else {
            intermediateText
        }
    }

@rafaeltonholo rafaeltonholo added the report: include Include changes in user-facing reports. label Apr 21, 2026
@frabera frabera changed the title Always convert text to html in MessagePreviewExtractor Always convert text from html in MessagePreviewExtractor Apr 21, 2026
@wmontwe wmontwe requested review from wmontwe and removed request for rafaeltonholo April 24, 2026 10:08
@wmontwe wmontwe assigned wmontwe and unassigned rafaeltonholo Apr 24, 2026
@wmontwe

wmontwe commented Apr 29, 2026

Copy link
Copy Markdown
Member

@frabera Thanks for the fix. I think having the flag improves the handling. I would add some tests to check the behavior especially for forwarded emails, special characters and html in plain text.

This might help to check the logic works, maybe you could also add more variations, html mail with forwarded plain text or plain text with html forwarded email. Then test with special characters and html code in the plain text part.

    @Test
    fun extractPreview_forwardedMessage() {
        val text =
            """
            Here is the forwarded message:

            -----Original Message-----
            From: alice@example.com
            Sent: Monday, January 1, 2024 10:00 AM
            To: bob@example.com
            Subject: Hello

            This is the original content.
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/plain", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

    @Test
    fun extractPreview_htmlForwardedMessage() {
        val text =
            """
            <html>
            <body>
            Here is the forwarded message:<br>
            <br>
            -----Original Message-----<br>
            From: alice@example.com<br>
            Sent: Monday, January 1, 2024 10:00 AM<br>
            To: bob@example.com<br>
            Subject: Hello<br>
            <br>
            This is the original content.
            </body>
            </html>
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/html", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

I'm open for suggestions how to mark the content of the forwarded message. Or even threat this as a separate issue.

When working on the test, keep the line breaks as is, otherwise the tests will fail. Ideally the test emails need to be loaded from a file, instead of being part of the code that reformats the message in an incompatible way. We won't have time to fix this now, so we're open for contributions.

frabera and others added 2 commits June 10, 2026 11:14
…Extractor

This solves html characters or code in notifications and message previews (thunderbird#10256)
…reviewTextExtractorTest` to protect newlines from autoformatting changes
@wmontwe

wmontwe commented Jun 10, 2026

Copy link
Copy Markdown
Member

@frabera Thanks for your contribution. I updated the PR to a mergable state and added tests.

@wmontwe wmontwe requested review from rafaeltonholo and removed request for wmontwe June 10, 2026 09:26
@wmontwe wmontwe assigned rafaeltonholo and unassigned wmontwe Jun 10, 2026
wmontwe
wmontwe previously approved these changes Jun 10, 2026

@wmontwe wmontwe left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wmontwe wmontwe changed the title Always convert text from html in MessagePreviewExtractor fix(message-extractor): always convert text from html for previews Jun 10, 2026
@frabera

frabera commented Jun 10, 2026

Copy link
Copy Markdown
Author

@wmontwe sorry for leaving it in the previous state, thank you for fixing it. Just a note, in the original PR I didn't add the flag to avoid the double HTML parsing (#10919 (comment)), do you want me to modify the returned value of convertFromHtmlIfNecessary to a tuple as in the previous comment?

@wmontwe

wmontwe commented Jun 10, 2026

Copy link
Copy Markdown
Member

@frabera I think that makes sense. I just altered the structure a bit.

Use the flag to avoid double HTML conversion, but keep cleanup after parsing for text/plain that contains HTML:

if (parsePlainTextAsHtml) {
    intermediateText = HtmlConverter.htmlToText(intermediateText)
    intermediateText = stripLineBasedArtifacts(intermediateText)
}

You might need to change the test then.

@rafaeltonholo

Copy link
Copy Markdown
Member

@wmontwe and @frabera, please ping me once this PR is ready for testing. I was reviewing/testing it, but apparently, you have more changes to do.

@frabera

frabera commented Jun 10, 2026

Copy link
Copy Markdown
Author

@wmontwe @rafaeltonholo I added the flag to avoid double parsing and all the tests are passing. I think the PR is ready for review. On a side note, if this is the new behaviour maybe it's just easier to always parse the html part if present? I didn't look so deep in the code but I think that it tries to parse the plain text part and it falls back to html (as it would be the most logical thing to do in an ideal world where plain text parts are well-formed).

I tried looking at my emails with the new preview and there is some small inconsistency with spaces in emails where the html part is well formed but the plain text use normal spaces, maybe in the future will be more convenient to prefer the html now that the parsing is always performed regardless of body type?

@wmontwe

wmontwe commented Jun 10, 2026

Copy link
Copy Markdown
Member

@frabera I think the parsing needs to be reworked. It has gaps and also doesn't remove Markdown formattings.

This could be a dedicated effort. But for a patch this is already good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

report: include Include changes in user-facing reports.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Message preview in notification is buggy Preview (encoding?) sometimes wrong

3 participants