fix(message-extractor): always convert text from html for previews by frabera · Pull Request #10919 · thunderbird/thunderbird-android

frabera · 2026-04-20T21:56:51Z

Resolves #10256
Resolves #8471

This fixes HTML characters or code showing up in notifications (#10256) and message previews (#8471).

Previously, convertFromHtmlIfNecessary() converted from HTML only when the part's MIME type was text/html. If the plain text part contained HTML code or special characters, they were rendered as-is in message previews and notifications, polluting the content.

I kept the function; let me know if there is a cleaner way to accomplish this. I tested it with messages containing special characters and with HTML code, and neither shows up in the preview anymore.

frabera · 2026-04-21T11:24:34Z

I modified the logic by moving the parsing and cleaning steps into stripTextForPreview(). The previous solution worked, but it didn't strip the quotation header from forwarded emails and replies (the preview showed -----Original Message-----). This is somehow caused by parsing before extractUnquotedText().

With this change, a message that contains only an HTML part is parsed twice. Avoiding that requires a deeper modification, so I'll wait for opinions from reviewers.

Here's an example where convertFromHtmlIfNecessary() also returns a boolean indicating whether HTML parsing is still required:

    fun extractPreview(textPart: Part): String {
        val text = MessageExtractor.getTextFromPart(textPart, MAX_CHARACTERS_CHECKED_FOR_PREVIEW)
            ?: throw PreviewExtractionException("Couldn't get text from part")

        val (plainText, requiresParsingHtml) = convertFromHtmlIfNecessary(textPart, text)
        return stripTextForPreview(plainText, requiresParsingHtml)
    }

    private fun convertFromHtmlIfNecessary(textPart: Part, text: String): Pair<String, Boolean> {
        return if (isSameMimeType(textPart.mimeType, "text/html")) {
            HtmlConverter.htmlToText(text) to false
        } else {
            text to true
        }
    }

    private fun stripTextForPreview(text: String, parseHtml: Boolean): String {
        var intermediateText = text

        intermediateText = normalizeLineBreaks(intermediateText)
        intermediateText = stripSignature(intermediateText)
        intermediateText = extractUnquotedText(intermediateText)

        // try to remove lines of dashes in the preview
        intermediateText = intermediateText.replace("(?m)^----.*?$".toRegex(), "")
        // Remove horizontal rules.
        intermediateText = intermediateText.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")

        // If the textPart was plaintext, parse as HTML
        if (parseHtml) {
            intermediateText = HtmlConverter.htmlToText(intermediateText)
        }

        // Remove parsed HTML links/images "<url>"
        intermediateText = intermediateText.replace("<https?://\\S+>".toRegex(), " ")

        // URLs in the preview should just be shown as "..." - They're not
        // clickable and they usually overwhelm the preview
        intermediateText = intermediateText.replace("https?://\\S+".toRegex(), "...")
        // Don't show newlines in the preview
        intermediateText = intermediateText.replace('\n', ' ')
        // Collapse whitespace in the preview
        intermediateText = intermediateText.replace("\\s+".toRegex(), " ")
        // Remove any whitespace at the beginning and end of the string.
        intermediateText = intermediateText.trim()

        return if (intermediateText.length > MAX_PREVIEW_LENGTH) {
            intermediateText.substring(0, MAX_PREVIEW_LENGTH - 1) + "…"
        } else {
            intermediateText
        }
    }
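
The regex steps in stripTextForPreview() can be tried in isolation. The following is a minimal sketch that copies only those steps and leaves out signature stripping, quote extraction, and HTML conversion. The function name cleanUpForPreview and the constant SKETCH_MAX_PREVIEW_LENGTH are hypothetical; the real MAX_PREVIEW_LENGTH value is not shown in this thread.

```kotlin
// Hypothetical limit for illustration; not the project's MAX_PREVIEW_LENGTH.
const val SKETCH_MAX_PREVIEW_LENGTH = 40

// Mirrors only the regex cleanup from the snippet above (assumed helper name).
fun cleanUpForPreview(text: String): String {
    var result = text
    // Drop whole lines that start with four dashes, e.g. "-----Original Message-----".
    result = result.replace("(?m)^----.*?$".toRegex(), "")
    // Replace horizontal rules (30 or more of '-', '=' or '_') with one space.
    result = result.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")
    // Remove "<url>" remnants, then shorten remaining bare URLs to "...".
    result = result.replace("<https?://\\S+>".toRegex(), " ")
    result = result.replace("https?://\\S+".toRegex(), "...")
    // Collapse all whitespace, including newlines, and trim both ends.
    result = result.replace("\\s+".toRegex(), " ").trim()
    // Truncate so the result, including the ellipsis, fits the limit.
    return if (result.length > SKETCH_MAX_PREVIEW_LENGTH) {
        result.substring(0, SKETCH_MAX_PREVIEW_LENGTH - 1) + "…"
    } else {
        result
    }
}
```

For example, cleanUpForPreview("Hi\n-----Original Message-----\nSee https://example.com/a now") yields "Hi See ... now": the dash line is dropped, the URL is shortened, and the remaining whitespace is collapsed.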

wmontwe · 2026-04-29T08:58:33Z

@frabera Thanks for the fix. I think having the flag improves the handling. I would add some tests to check the behavior, especially for forwarded emails, special characters, and HTML in plain text.

The tests below might help to check that the logic works. You could also add more variations, such as an HTML mail with forwarded plain text or a plain text mail with a forwarded HTML email, and then test with special characters and HTML code in the plain text part.

    @Test
    fun extractPreview_forwardedMessage() {
        val text =
            """
            Here is the forwarded message:

            -----Original Message-----
            From: alice@example.com
            Sent: Monday, January 1, 2024 10:00 AM
            To: bob@example.com
            Subject: Hello

            This is the original content.
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/plain", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

    @Test
    fun extractPreview_htmlForwardedMessage() {
        val text =
            """
            <html>
            <body>
            Here is the forwarded message:<br>
            <br>
            -----Original Message-----<br>
            From: alice@example.com<br>
            Sent: Monday, January 1, 2024 10:00 AM<br>
            To: bob@example.com<br>
            Subject: Hello<br>
            <br>
            This is the original content.
            </body>
            </html>
            """.trimIndent()
        val part = MessageCreationHelper.createTextPart("text/html", text)

        val preview = previewTextExtractor.extractPreview(part)

        assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
    }

I'm open to suggestions on how to mark the content of the forwarded message, or we could even treat this as a separate issue.

When working on the tests, keep the line breaks as they are; otherwise the tests will fail. Ideally the test emails would be loaded from files instead of living in code, where the formatter rewrites the message in an incompatible way. We won't have time to fix this now, so we're open to contributions.
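
One way to make the line breaks in an inline fixture explicit is Kotlin's trimMargin(), which one of the commit titles in this PR also mentions. The sketch below is only an illustration: the function name forwardedMessageFixture is hypothetical, and whether this fully protects against the formatter depends on the project's formatter settings.

```kotlin
// Each content line starts with the margin prefix "|", so a blank line is
// written as a bare "|" and stays visible in the source.
fun forwardedMessageFixture(): String =
    """
    |Here is the forwarded message:
    |
    |-----Original Message-----
    |From: alice@example.com
    |
    |This is the original content.
    """.trimMargin()
```

trimMargin() removes the leading whitespace and the "|" prefix from every line and drops the blank first and last lines, so the fixture keeps exactly the six lines written above, including the two empty ones.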

wmontwe · 2026-06-10T09:25:53Z

@frabera Thanks for your contribution. I updated the PR to a mergeable state and added tests.

wmontwe

LGTM

frabera · 2026-06-10T11:12:42Z

@wmontwe Sorry for leaving it in the previous state, and thank you for fixing it. Just a note: in the original PR I didn't add the flag that avoids the double HTML parsing (#10919 (comment)). Do you want me to change the return value of convertFromHtmlIfNecessary to a Pair, as in the previous comment?

wmontwe · 2026-06-10T11:36:32Z

@frabera I think that makes sense. I just altered the structure a bit.

Use the flag to avoid double HTML conversion, but keep cleanup after parsing for text/plain that contains HTML:

if (parsePlainTextAsHtml) {
    intermediateText = HtmlConverter.htmlToText(intermediateText)
    intermediateText = stripLineBasedArtifacts(intermediateText)
}

You might need to change the test then.
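
To illustrate the control flow described here, below is a self-contained sketch. Everything in it is a stand-in: htmlToTextSketch is a naive replacement for HtmlConverter.htmlToText(), stripLineBasedArtifactsSketch assumes that stripLineBasedArtifacts() removes dash lines (its real body is not shown in this thread), and the conversions counter exists only to show that each path converts once.

```kotlin
// Counts conversions so the single-pass behavior can be observed.
var conversions = 0

// Hypothetical, naive stand-in for HtmlConverter.htmlToText().
fun htmlToTextSketch(html: String): String {
    conversions++
    return html
        .replace("(?i)<br\\s*/?>".toRegex(), "\n") // <br> becomes a newline
        .replace("<[^>]+>".toRegex(), "")          // drop remaining tags
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")                     // decode "&amp;" last
}

// Assumption: the line-based cleanup removes lines starting with "----".
fun stripLineBasedArtifactsSketch(text: String): String =
    text.replace("(?m)^----.*?$".toRegex(), "")

fun previewSketch(mimeType: String, text: String): String {
    val isHtml = mimeType.equals("text/html", ignoreCase = true)
    // text/html is converted up front; text/plain defers the conversion.
    var result = if (isHtml) htmlToTextSketch(text) else text
    val parsePlainTextAsHtml = !isHtml
    result = stripLineBasedArtifactsSketch(result)
    if (parsePlainTextAsHtml) {
        result = htmlToTextSketch(result)
        // Parsing can expose new dash lines, so the cleanup runs again.
        result = stripLineBasedArtifactsSketch(result)
    }
    return result.replace("\\s+".toRegex(), " ").trim()
}
```

With the plain text input "Fish &amp; chips<br>-----Original Message-----", the dash run only becomes its own line after the `<br>` is converted, which is why this sketch repeats the line-based cleanup after parsing.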

rafaeltonholo · 2026-06-10T13:00:33Z

@wmontwe and @frabera, please ping me once this PR is ready for testing. I was reviewing and testing it, but apparently you have more changes to make.

frabera · 2026-06-10T13:08:14Z

@wmontwe @rafaeltonholo I added the flag to avoid double parsing and all the tests are passing. I think the PR is ready for review. On a side note, if this is the new behaviour, maybe it would be easier to always parse the HTML part if present? I didn't look deeply into the code, but I think it tries the plain text part first and falls back to HTML (which would be the most logical thing to do in an ideal world where plain text parts are well-formed).

I looked at my emails with the new preview, and there are some small inconsistencies with spaces in emails where the HTML part is well formed but the plain text part uses normal spaces. Maybe in the future it will be more convenient to prefer the HTML part, now that parsing is always performed regardless of body type?

wmontwe · 2026-06-10T13:19:44Z

@frabera I think the parsing needs to be reworked. It has gaps and also doesn't remove Markdown formatting.

That could be a dedicated effort, but for a patch this is already good.

frabera requested a review from a team as a code owner April 20, 2026 21:56

frabera requested a review from rafaeltonholo April 20, 2026 21:56

frabera temporarily deployed to review April 20, 2026 21:56 — with GitHub Actions Inactive

frabera temporarily deployed to botmobile April 20, 2026 21:56 — with GitHub Actions Inactive

thunderbird-botmobile Bot assigned rafaeltonholo Apr 20, 2026

frabera mentioned this pull request Apr 20, 2026

Message preview in notification is buggy #10256

Open

2 tasks

frabera marked this pull request as draft April 20, 2026 22:17

frabera marked this pull request as ready for review April 20, 2026 23:53

frabera mentioned this pull request Apr 21, 2026

Preview (encoding?) sometimes wrong #8471

Open

2 tasks

rafaeltonholo added the report: include Include changes in user-facing reports. label Apr 21, 2026

frabera changed the title ~~Always convert text to html in MessagePreviewExtractor~~ Always convert text from html in MessagePreviewExtractor Apr 21, 2026

wmontwe requested review from wmontwe and removed request for rafaeltonholo April 24, 2026 10:08

wmontwe temporarily deployed to botmobile April 24, 2026 10:08 — with GitHub Actions Inactive

wmontwe assigned wmontwe and unassigned rafaeltonholo Apr 24, 2026

frabera and others added 2 commits June 10, 2026 11:14

fix(message-extractor): always convert text to html in MessagePreview…

c6d6cc4

…Extractor This solves html characters or code in notifications and message previews (thunderbird#10256)

test(message-extractor): replace trimIndent with trimMargin in `P…

0094460

…reviewTextExtractorTest` to protect newlines from autoformatting changes

wmontwe added 2 commits June 10, 2026 11:24

test(message-extractor): add tests for HTML parsing and URL removal i…

05684ae

…n preview extraction

fix(message-extractor): strip zero-width characters from preview text

60ebe8d

wmontwe force-pushed the patch-1 branch from 1c177c6 to 5d74694 Compare June 10, 2026 09:25

wmontwe requested review from rafaeltonholo and removed request for wmontwe June 10, 2026 09:26

wmontwe temporarily deployed to botmobile June 10, 2026 09:26 — with GitHub Actions Inactive

wmontwe assigned rafaeltonholo and unassigned wmontwe Jun 10, 2026

wmontwe previously approved these changes Jun 10, 2026

wmontwe changed the title ~~Always convert text from html in MessagePreviewExtractor~~ fix(message-extractor): always convert text from html for previews Jun 10, 2026

fix(message-extractor): improve handling of forwarded messages and en…

87bed2e

…hance text cleanup

wmontwe dismissed their stale review via 87bed2e June 10, 2026 11:17

wmontwe force-pushed the patch-1 branch from 5d74694 to 87bed2e Compare June 10, 2026 11:17

fix(message-extractor): avoid double parsing of html body

77375e0
