Install Ansible:
Install Git:
For Debian or Ubuntu:
% sudo apt install -y -V git
For CentOS:
% sudo yum install -y git
Clone this repository:
% git clone https://github.com/ranguba/chupa-text-vagrant.git
% sudo mv chupa-text-vagrant /var/lib/chupa-text
Start the virtual machine. This takes a long time:
% cd /var/lib/chupa-text
% vagrant up
Install the systemd service file:
% sudo ln -fs \
/var/lib/chupa-text/usr/lib/systemd/system/chupa-text.service \
/usr/lib/systemd/system/chupa-text.service
% sudo systemctl daemon-reload
% sudo systemctl enable chupa-text
Run the ChupaText service:
% sudo systemctl start chupa-text
You can now use ChupaText via HTTP.
http://localhost:20080/ provides a form for text extraction that you can use from your Web browser.
http://localhost:20080/extraction.json is the Web API endpoint, with the following specification:
- HTTP Method: POST
- Content-Type: multipart/form-data
- Parameters: You must specify at least data or uri. You can specify both data and uri. In that case, uri is used as additional information.
  - data: Data to be extracted. Specifying its content-type helps because ChupaText then doesn't need to guess it, and a guessed content-type may be wrong.
  - uri: URI to be extracted.
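The specification above can also be followed from a program. The following Python sketch builds such a request with only the standard library. It is an illustration, not part of ChupaText: the names build_extraction_request and extract_file are hypothetical, the filename attached to the data part is an assumption modeled on a typical file upload, and the code has not been checked against a running server.

```python
import json
import os
import urllib.request
import uuid


def build_extraction_request(endpoint, data=None, filename="data",
                             content_type=None, uri=None, boundary=None):
    """Build (but do not send) a multipart/form-data POST request."""
    if data is None and uri is None:
        # The specification requires at least one of data or uri.
        raise ValueError("specify at least one of data or uri")
    boundary = boundary or uuid.uuid4().hex
    parts = []
    if data is not None:
        header = (f'--{boundary}\r\n'
                  f'Content-Disposition: form-data; name="data"; '
                  f'filename="{filename}"\r\n')
        if content_type:
            # Optional: spares the server from guessing the type.
            header += f'Content-Type: {content_type}\r\n'
        parts.append(header.encode("utf-8") + b"\r\n" + data + b"\r\n")
    if uri is not None:
        header = (f'--{boundary}\r\n'
                  f'Content-Disposition: form-data; name="uri"\r\n\r\n')
        parts.append(header.encode("utf-8") + uri.encode("utf-8") + b"\r\n")
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def extract_file(path, content_type=None,
                 endpoint="http://localhost:20080/extraction.json"):
    """Send a local file to the endpoint and return the parsed JSON."""
    with open(path, "rb") as f:
        request = build_extraction_request(
            endpoint,
            data=f.read(),
            filename=os.path.basename(path),
            content_type=content_type,
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, extract_file("/tmp/sample.pdf", "application/pdf") would send the file and return the decoded JSON response as a dictionary.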
Here is a curl command line that extracts text from the local PDF file at
/tmp/sample.pdf. The --form option sends the request as
multipart/form-data. data=@PATH means that the parameter name is
data and the parameter value is the content of
PATH. ;type=application/pdf specifies the content-type of the data
value:
% curl \
--form 'data=@/tmp/sample.pdf;type=application/pdf' \
http://localhost:20080/extraction.json
This Web API returns the following JSON:
{
  "mime-type": "application/pdf",
  "uri": "file:/home/chupa-text/chupa-text-http-server/sample.pdf",
  "path": "/tmp/sample-36-1ywy0xf.pdf",
  "size": 147159,
  "texts": [
    {
      "mime-type": "text/plain",
      "uri": "file:/home/chupa-text/chupa-text-http-server/sample.txt",
      "path": "/home/chupa-text/chupa-text-http-server/sample.txt",
      "size": 1012,
      "title": "",
      "created_time": "2015-01-22T15:54:11.000Z",
      "source-mime-types": [
        "application/pdf"
      ],
      "creator": "Adobe Illustrator CS3",
      "producer": "Adobe PDF library 8.00",
      "body": "This is sample PDF. ...",
      "screenshot": {
        "mime-type": "image/png",
        "data": "iVBORw...",
        "encoding": "base64"
      }
    }
  ]
}
In most cases, you're interested in the texts values. Each one includes
the extracted text in body and a screenshot in screenshot. The screenshot
has the following keys:
- mime-type: The MIME type of the data. Normally, this is image/png.
- data: The image data, encoded as described by encoding.
- encoding: Optional. If data is encoded with base64, this value is "base64". ChupaText must encode binary data this way because JSON is a text format and cannot carry binary data directly. If data isn't encoded, for example when it is text data such as SVG, this key doesn't exist.
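The keys above can be handled as in the following Python sketch, which reads each entry in texts and turns its screenshot into bytes. The function names decode_screenshot and collect_texts are hypothetical, and treating a missing screenshot key as "no screenshot" is an assumption, since the text above does not say whether it is always present.

```python
import base64


def decode_screenshot(screenshot):
    """Return the screenshot payload as bytes."""
    data = screenshot["data"]
    if screenshot.get("encoding") == "base64":
        # Binary image data (normally image/png) arrives base64-encoded.
        return base64.b64decode(data)
    # No "encoding" key: data is plain text such as SVG.
    return data.encode("utf-8")


def collect_texts(response):
    """Return a list of (body, screenshot MIME type, screenshot bytes)."""
    results = []
    for text in response.get("texts", []):
        screenshot = text.get("screenshot")
        if screenshot is None:
            # Assumption: tolerate entries that carry no screenshot.
            results.append((text.get("body", ""), None, None))
        else:
            results.append((
                text.get("body", ""),
                screenshot.get("mime-type"),
                decode_screenshot(screenshot),
            ))
    return results
```

Passing the parsed JSON response to collect_texts gives one tuple per extracted text. The image bytes can then be written to a file opened in binary mode.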
- Kouhei Sutou
<[email protected]>
LGPL 2.1 or later.
(Kouhei Sutou has a right to change the license including contributed patches.)