Skip to content

feat: Support rich table cell content #285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

cau-git
Copy link
Contributor

@cau-git cau-git commented May 6, 2025

Feature

This PR introduces changes to allow table cells to contain any rich content instead of a basic text string. To enable this, the TableCell model now inherits from NodeItem. In order to remain backward compatible, the TableCell.text field is still present (default = "") and the self_ref uses a relative JSON pointer ("0") to refer itself as long as the TableCell is constructed in isolation before belonging to a table in a doc (as in previous usage conventions). A new method TableCell.has_rich_content() reports if the table cell contains child nodes.

New methods in TableItem allow to modify a table's cells after adding the table to a DoclingDocument, which ensures proper reference handling of table cells:

  • TableItem.update_cell is used to insert or overwrite a cell in a table, which allows to use DoclingDocument.add_text and other APIs with the returned cell as parent argument.
  • TableItem.delete_cells allows to delete cells in a given row/col index range

Table cells with valid self_ref are created by the update_cell method and will look like this example: #/tables/0/data/table_cells/2

The serializers are updated to consider table cells with rich content to ensure the text field is not considered when children are present. Children are serialized like any other subtree in a document.

@cau-git cau-git requested a review from vagenas May 6, 2025 11:06
Copy link

mergify bot commented May 6, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Copy link

codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 64.60177% with 40 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/types/doc/document.py 63.30% 40 Missing ⚠️

📢 Thoughts on this report? Let us know!

Signed-off-by: Christoph Auer <[email protected]>
vagenas and others added 3 commits May 6, 2025 14:48
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
@cau-git cau-git changed the title feat: Make TableCell a NodeItem to support rich table cell content feat: Support rich table cell content May 6, 2025
@cau-git cau-git marked this pull request as ready for review May 6, 2025 19:03

doc = _construct_doc()

html_pred = doc.export_to_html()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to other tests (see here for example) we could simplify the test results removing the styling

Suggested change
html_pred = doc.export_to_html()
html_pred = doc.export_to_html(html_head="")

"""TableCell."""

bbox: Optional[BoundingBox] = None
row_span: int = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a variable here? I think this can be a method computing end_row_offset_idx-start_row_offset_idx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do something about this but it is unrelated to the changes of this PR, and we cannot break backwards-compatibility.


bbox: Optional[BoundingBox] = None
row_span: int = 1
col_span: int = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a variable here? I think this can be a method computing end_col_offset_idx-start_col_offset_idx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do something about this but it is unrelated to the changes of this PR, and we cannot break backwards-compatibility.

row_span: int = 1
col_span: int = 1
start_row_offset_idx: int
end_row_offset_idx: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check somewhere that end_*>start_*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can generally put a model validator for such things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants