fix(message-extractor): always convert text from html for previews#10919
fix(message-extractor): always convert text from html for previews#10919frabera wants to merge 6 commits into
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
2 similar comments
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
3 similar comments
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
I modified the logic by moving the parsing and cleaning parts in In this way if a message contains only an html part, it is parsed twice. To avoid this a deeper modification is required, I'll wait for opinions from reviewers. Here's an example returning a boolean from fun extractPreview(textPart: Part): String {
val text = MessageExtractor.getTextFromPart(textPart, MAX_CHARACTERS_CHECKED_FOR_PREVIEW)
?: throw PreviewExtractionException("Couldn't get text from part")
val (plainText, requiresParsingHtml) = convertFromHtmlIfNecessary(textPart, text)
return stripTextForPreview(plainText, requiresParsingHtml)
}
private fun convertFromHtmlIfNecessary(textPart: Part, text: String): Pair<String, Boolean> {
return if (isSameMimeType(textPart.mimeType, "text/html")) {
HtmlConverter.htmlToText(text) to false
} else {
text to true
}
}
private fun stripTextForPreview(text: String, parseHtml: Boolean): String {
var intermediateText = text
intermediateText = normalizeLineBreaks(intermediateText)
intermediateText = stripSignature(intermediateText)
intermediateText = extractUnquotedText(intermediateText)
// try to remove lines of dashes in the preview
intermediateText = intermediateText.replace("(?m)^----.*?$".toRegex(), "")
// Remove horizontal rules.
intermediateText = intermediateText.replace("\\s*([-=_]{30,}+)\\s*".toRegex(), " ")
// If the textPart was plaintext, parse as HTML
if (parseHtml) {
intermediateText = HtmlConverter.htmlToText(intermediateText)
}
// Remove parsed HTML links/images "<url>"
intermediateText = intermediateText.replace("<https?://\\S+>".toRegex(), " ")
// URLs in the preview should just be shown as "..." - They're not
// clickable and they usually overwhelm the preview
intermediateText = intermediateText.replace("https?://\\S+".toRegex(), "...")
// Don't show newlines in the preview
intermediateText = intermediateText.replace('\n', ' ')
// Collapse whitespace in the preview
intermediateText = intermediateText.replace("\\s+".toRegex(), " ")
// Remove any whitespace at the beginning and end of the string.
intermediateText = intermediateText.trim()
return if (intermediateText.length > MAX_PREVIEW_LENGTH) {
intermediateText.substring(0, MAX_PREVIEW_LENGTH - 1) + "…"
} else {
intermediateText
}
} |
|
@frabera Thanks for the fix. I think having the flag improves the handling. I would add some tests to check the behavior especially for forwarded emails, special characters and html in plain text. This might help to check the logic works, maybe you could also add more variations, html mail with forwarded plain text or plain text with html forwarded email. Then test with special characters and html code in the plain text part. @Test
fun extractPreview_forwardedMessage() {
val text =
"""
Here is the forwarded message:
-----Original Message-----
From: alice@example.com
Sent: Monday, January 1, 2024 10:00 AM
To: bob@example.com
Subject: Hello
This is the original content.
""".trimIndent()
val part = MessageCreationHelper.createTextPart("text/plain", text)
val preview = previewTextExtractor.extractPreview(part)
assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
}
@Test
fun extractPreview_htmlForwardedMessage() {
val text =
"""
<html>
<body>
Here is the forwarded message:<br>
<br>
-----Original Message-----<br>
From: alice@example.com<br>
Sent: Monday, January 1, 2024 10:00 AM<br>
To: bob@example.com<br>
Subject: Hello<br>
<br>
This is the original content.
</body>
</html>
""".trimIndent()
val part = MessageCreationHelper.createTextPart("text/html", text)
val preview = previewTextExtractor.extractPreview(part)
assertThat(preview).isEqualTo("Here is the forwarded message: This is the original content.")
}I'm open for suggestions how to mark the content of the forwarded message. Or even threat this as a separate issue. When working on the test, keep the line breaks as is, otherwise the tests will fail. Ideally the test emails need to be loaded from a file, instead of being part of the code that reformats the message in an incompatible way. We won't have time to fix this now, so we're open for contributions. |
…Extractor This solves html characters or code in notifications and message previews (thunderbird#10256)
…reviewTextExtractorTest` to protect newlines from autoformatting changes
|
@frabera Thanks for your contribution. I updated the PR to a mergable state and added tests. |
|
@wmontwe sorry for leaving it in the previous state, thank you for fixing it. Just a note, in the original PR I didn't add the flag to avoid the double HTML parsing (#10919 (comment)), do you want me to modify the returned value of |
…hance text cleanup
|
@frabera I think that makes sense. I just altered the structure a bit. Use the flag to avoid double HTML conversion, but keep cleanup after parsing for text/plain that contains HTML: You might need to change the test then. |
|
@wmontwe @rafaeltonholo I added the flag to avoid double parsing and all the tests are passing. I think the PR is ready for review. On a side note, if this is the new behaviour maybe it's just easier to always parse the html part if present? I didn't look so deep in the code but I think that it tries to parse the plain text part and it falls back to html (as it would be the most logical thing to do in an ideal world where plain text parts are well-formed). I tried looking at my emails with the new preview and there is some small inconsistency with spaces in emails where the html part is well formed but the plain text use normal spaces, maybe in the future will be more convenient to prefer the html now that the parsing is always performed regardless of body type? |
|
@frabera I think the parsing needs to be reworked. It has gaps and also doesn't remove Markdown formattings. This could be a dedicated effort. But for a patch this is already good. |
Resolves #10256
Resolves #8471
This solves html characters or code in notifications (#10256) and message previews (#8471)
Previously it only converted from html after checking the mimetype
convertFromHtmlIfNecessary()but if the message contained html code or special characters in the plain text part they were rendered in the message previews and in notifications, polluting content.I kept the function, let me know if there is a cleaner way to accomplish this. I tested it with messages containing special characters and also with html code, they are not shown.