Xml Reader Rich Text #4007

Merged
oleibman merged 3 commits into PHPOffice:master from oleibman:issue4001
May 5, 2024
@oleibman oleibman commented May 1, 2024

Fix #4001. Thanks to @SlowFox71, who reported the problem and developed most of the solution. This PR adds Rich Text support to the XML reader. An Xml Spreadsheet stores Rich Text as Html tags that are children of the `ss:Data` tag and use a specific namespace. These can be parsed into a RichText object using the existing method `Helper/Html::toRichTextObject`. Two items need special attention.

First, for attributes like bold or italic, Excel uses the appropriate Html tag (e.g. `<B>`). However, for an attribute like color, Excel uses `<Font html:Color="#FF0000">`, with a prefix on the `Color` attribute. PhpSpreadsheet's Html parser cannot cope with the prefix, so the parser is changed to strip `html:` from attribute names for the Font tag.
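As a rough illustration of the prefix stripping described above, here is a minimal standalone sketch in plain PHP. It is not the code from this PR: the function name `stripHtmlPrefixFromFontAttributes` and the sample markup (including the `html:Size` attribute) are made up for the example, and the real parser may well do this at a different stage.

```php
<?php

/**
 * Hypothetical helper (not part of PhpSpreadsheet): remove the "html:"
 * prefix from attribute names inside <Font ...> opening tags only.
 */
function stripHtmlPrefixFromFontAttributes(string $markup): string
{
    $result = preg_replace_callback(
        '/<Font\b[^>]*>/i',
        static function (array $match): string {
            // Strip "html:" only where it introduces an attribute name,
            // i.e. after whitespace and before "name=".
            $tag = preg_replace('/(\s)html:(?=[A-Za-z-]+\s*=)/i', '$1', $match[0]);

            return $tag ?? $match[0];
        },
        $markup
    );

    // Fall back to the untouched input if the regex engine reports an error.
    return $result ?? $markup;
}

// Made-up sample input in the style of an Xml Spreadsheet rich text cell.
$input = 'Plain <B>bold</B> <Font html:Color="#FF0000" html:Size="12">red</Font>';
echo stripHtmlPrefixFromFontAttributes($input), PHP_EOL;
```

Tags other than Font are left alone in this sketch, which matches the scope stated above (attribute names for the Font tag only).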

Second, the example cited by the user used a `<BR />` to indicate a line break in the data. However, it appears that, at least some of the time, Excel will instead use `&#10;` to indicate a line break. The existing parser reduces each run of one or more whitespace characters in the text to a single space, so the line break represented by `&#10;` disappears. I am not sure why the existing code does this, but I do know that I am not willing to break it. Instead, I've added an optional boolean parameter `$preserveWhiteSpace` to `toRichTextObject`. If false (the default), the existing logic is used; if true, the whitespace substitution does not happen.
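The effect of the flag can be sketched in a few lines of standalone PHP. This only mimics the behavior described above; `normalizeText` is a hypothetical function, not the library's implementation, and the sample string is invented. It assumes the XML parser has already decoded `&#10;` into a line feed character.

```php
<?php

/**
 * Hypothetical stand-in for the parser's text handling: collapse runs of
 * whitespace to one space unless the caller asks to preserve them.
 */
function normalizeText(string $text, bool $preserveWhiteSpace = false): string
{
    if ($preserveWhiteSpace) {
        // New opt-in behavior: leave whitespace (including line feeds) alone.
        return $text;
    }

    // Existing default behavior: each whitespace run becomes a single space.
    return preg_replace('/\s+/u', ' ', $text) ?? $text;
}

// "&#10;" is the character reference for a line feed, so after XML
// decoding the cell text contains a literal "\n".
$decoded = "Line one\nLine two";

var_dump(normalizeText($decoded));       // line break collapsed to a space
var_dump(normalizeText($decoded, true)); // line break kept
```

With the library itself, the corresponding call would presumably pass `true` as the new `$preserveWhiteSpace` argument of `toRichTextObject`; the exact position of that argument is an assumption here.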

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

@oleibman oleibman added this pull request to the merge queue May 5, 2024
Merged via the queue into PHPOffice:master with commit 4a7fa14 May 5, 2024
@oleibman oleibman deleted the issue4001 branch May 5, 2024 04:31
Development

Successfully merging this pull request may close these issues.

XML-Reader: support rich text
