Pure Text extraction from HOCR is HTML entity encoded

# What?

When we extract the plain OCR text from the HOCR/PDFALTO, we keep the HTML entity encoding. This hurts display in Views: internally, Twig does not decode the existing entities before escaping the value, so they end up double encoded (for example, `&amp;` is output as `&amp;amp;`).
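To make the double encoding concrete, here is a minimal sketch. It assumes the template layer escapes values with something equivalent to `htmlspecialchars()`; the sample string is made up for illustration.

```php
<?php

// Text as it is currently stored: entities coming from the HOCR are still encoded.
$stored = 'Smith &amp; Sons';

// Assumption: roughly what an autoescaping template layer does on output.
// The "&" of the existing entity gets escaped a second time.
$rendered = htmlspecialchars($stored, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $rendered, PHP_EOL; // Smith &amp;amp; Sons
```

A browser then shows the literal text `Smith &amp; Sons` instead of `Smith & Sons`.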

My theory (untested) is that this can be fixed here:
https://github.com/esmero/strawberry_runners/blob/9d3bf9ed2040856c1ec5dc9cb19a8a0d568481a5/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L355-L356
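A possible shape for that fix, as an untested sketch: keep the `strip_tags()`/`str_replace()` call from the linked lines and decode the entities once afterwards. The function name `plain_text_from_fulltext()` and the sample input are hypothetical, not part of the module.

```php
<?php

/**
 * Hypothetical helper: builds the plain text from the full text markup.
 *
 * Mirrors the strip_tags()/str_replace() call at the linked lines and adds
 * one decode step at the end.
 */
function plain_text_from_fulltext(?string $fulltext): string {
  if ($fulltext === null || $fulltext === '') {
    return '';
  }
  // Same as today: start each <l> line on its own line, then drop the tags.
  $text = strip_tags(str_replace("<l>", PHP_EOL . "<l> ", $fulltext));
  // New step: decode entities once, so "&amp;" is stored as "&".
  return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up input, only for illustration.
$sample = '<p><l>Smith &amp; Sons</l><l>caf&eacute; &quot;Luna&quot;</l></p>';
$plain = plain_text_from_fulltext($sample);
```

Decoding after `strip_tags()` means an encoded `&lt;b&gt;` survives as the literal text `<b>` rather than being removed as a tag, which seems right for plain text that gets escaped again on output.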

Basically, we don't want this:

![image](https://github.com/esmero/strawberry_runners/assets/6946023/ce55b727-8215-4e1b-ba20-dfcbaf5573e5)


The question, if we fix this, is how we remediate existing OCRs. One way would be to detect, on reindex, whether the already cached plain text has HTML entities, decode them and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661

It could also be a `hook_update_N()`?
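The "detect and decode" idea could look roughly like this. It is a sketch under assumptions: `remediate_cached_plain_text()` is a hypothetical name, and how the cached text is actually read and written back is left out.

```php
<?php

/**
 * Hypothetical remediation step for already cached plain text.
 *
 * Returns [text, changed]. The caller would rewrite the cache only when
 * "changed" is TRUE.
 */
function remediate_cached_plain_text(string $cached): array {
  $decoded = html_entity_decode($cached, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // If decoding changes nothing, the text carried no entities.
  return [$decoded, $decoded !== $cached];
}

[$text, $changed] = remediate_cached_plain_text('Smith &amp; Sons');
// $text is "Smith & Sons" and $changed is TRUE.

[$same, $untouched] = remediate_cached_plain_text('AT&T');
// A bare "&" is not an entity, so nothing changes here.
```

One caveat with doing this on every reindex: text that legitimately contains the characters `&amp;` would be decoded again on each pass. A one-off update hook, or a marker on the cache entry saying it was already decoded, would avoid that.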

@aksm, @alliomeria, @karomabiles: what do you think?





For reference, the code at the first permalink above (`OcrPostProcessor.php#L355-L356`):

	$page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
	PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
