Pure Text extraction from HOCR is HTML entity encoded #81

@DiegoPino

What?

When we extract the pure OCR text from the HOCR/PDFALTO output, we keep the HTML entity encoding. This hurts Views display because, internally, Twig cannot decode the entities and will double-encode them.

I think (just a theory) this can be fixed here:

$page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
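
One possible shape for the fix, as a minimal sketch only: keep the existing str_replace()/strip_tags() step and decode the entities afterwards with html_entity_decode(). The helper name sketch_plain_text_from_fulltext() is hypothetical and not part of the module, and the flags (ENT_QUOTES | ENT_HTML5, UTF-8) are an assumption about what the OCR output needs.

```php
<?php
// Hypothetical helper, not part of the module: turns the value kept in
// $output->searchapi['fulltext'] into plain text with entities decoded.
function sketch_plain_text_from_fulltext(?string $fulltext): string {
  if ($fulltext === NULL || $fulltext === '') {
    return '';
  }
  // Same as the existing line: put every <l> on its own row, then drop tags.
  $text = strip_tags(str_replace("<l>", PHP_EOL . "<l> ", $fulltext));
  // New step: decode only after the tags are gone, so an encoded
  // "&lt;b&gt;" survives as literal text instead of being stripped as a tag.
  return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

Decoding after strip_tags() rather than before is deliberate: the reverse order would turn encoded angle brackets into real tags and silently remove OCR text.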

Basically, we don't want this:

[image]

The question, if we fix this, is how we remediate existing OCRs. One way would be to detect, on reindex, whether the already cached plain text has HTML entities, then decode it and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661

But it could also be a hook_update()?
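
The "detect and decode" step for already cached text could look roughly like this, whichever place ends up calling it. This is a sketch under assumptions: sketch_decode_cached_plain_text() is a hypothetical name, and how the cache is actually read and written in the datasource is not shown here. Detection is simply "decoding changes the string".

```php
<?php
// Hypothetical helper: returns the decoded text plus a flag telling the
// caller whether the cached copy contained entities and should be rewritten.
function sketch_decode_cached_plain_text(string $cached): array {
  $decoded = html_entity_decode($cached, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // If nothing changed, the cache entry is already clean.
  return [$decoded, $decoded !== $cached];
}

// Usage: only touch the cache when something was decoded.
[$text, $changed] = sketch_decode_cached_plain_text('Tom &amp; Jerry');
if ($changed) {
  // Assumption: the caller would write $text back to the cache here.
}
```

One caveat with this approach: OCR text that legitimately contains a literal entity string would be decoded too, so the check cannot tell "encoded by us" from "present on the page".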

@aksm, @alliomeria, @karomabiles, what do you think?

Labels

Post processor Plugins (The ones with a ->run() method), Solr Indexing (Putting things where they can be found), enhancement (New feature or request), help wanted (Extra attention is needed)
