Description
What?
When we produce the pure OCR text from the HOCR/PDFALTO extraction, we keep the HTML entity encoding. This hurts Views display because, internally, Twig cannot decode the entities and ends up double-encoding them.
In theory, I think this can be fixed here:
strawberry_runners/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php
Lines 355 to 356 in 9d3bf9e

    $page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
      PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';

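A minimal sketch of what the fix could look like, assuming the entities can simply be decoded once the tags are stripped. The function name `sbr_sketch_ocr_plain_text()` is hypothetical and does not exist in strawberry_runners. It repeats the two existing steps and then calls PHP's `html_entity_decode()`. Decoding happens after `strip_tags()` so that a decoded `&lt;` or `&gt;` is never mistaken for markup and removed.

```php
<?php

/**
 * Hypothetical helper (not part of strawberry_runners): builds plain text
 * from the miniOCR-style fulltext and decodes HTML entities.
 */
function sbr_sketch_ocr_plain_text(string $fulltext): string {
  // Same as the existing lines: put each <l> (line) element on its own line.
  $with_breaks = str_replace("<l>", PHP_EOL . "<l> ", $fulltext);
  // Same as the existing lines: drop all markup.
  $stripped = strip_tags($with_breaks);
  // Proposed extra step: decode entities only after the tags are gone.
  return html_entity_decode($stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

In the real processor the input would be `$output->searchapi['fulltext']`, still guarded by the existing `isset()` check.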
Basically, we don't want this:
If we fix this, the open question is how to remediate existing OCRs. One option: on reindex, detect whether the already cached plain text contains HTML entities, decode them, and update the cache, somewhere around here:
https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661
Alternatively, could this be a hook_update_N()?
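The "detect and decode" step could be sketched roughly as follows. This is only an illustration: `sbr_sketch_decode_cached_text()` is a hypothetical name, and how the cached value is actually read and written back in `StrawberryfieldFlavorDatasource` is not shown. One caveat, stated as an assumption: text that legitimately contains a literal entity string (for example a page that prints `&amp;`) would also be decoded.

```php
<?php

/**
 * Hypothetical helper: decodes HTML entities in cached plain text.
 *
 * @return array
 *   [string $text, bool $changed]. $changed tells the caller whether the
 *   cache entry needs to be rewritten.
 */
function sbr_sketch_decode_cached_text(string $cached): array {
  // Cheap pre-check: named (&amp;), decimal (&#39;) or hex (&#x27;) entity.
  $pattern = '/&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/';
  if (!preg_match($pattern, $cached)) {
    return [$cached, FALSE];
  }
  $decoded = html_entity_decode($cached, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // Unknown names such as &foo; are left alone, so compare before reporting.
  return [$decoded, $decoded !== $cached];
}
```

A caller would only rewrite the cache when the second element is `TRUE`, which keeps a reindex cheap for items that were already clean.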
@aksm what do you think? @alliomeria what do you think? @karomabiles what do you think?
