Python script for preprocessing floodX data into a uniform, readable format. Learn more about the project here: http://www.eawag.ch/en/department/sww/projects/floodx/. The following transformations are applied:
- Optical Character Recognition of display values
- Consolidation
- Sorting
- Formatting
- Offsetting
- Time shifting
- Removal of extreme values
- Segmentation of data by experiment
- Formatting for time series database
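A few of the transformations listed above (sorting, offsetting, time shifting, and removal of extreme values) can be sketched with pandas. This is a minimal illustration, not the script's actual implementation: the function name `preprocess`, its parameters, and the sample values are hypothetical.

```python
import pandas as pd


def preprocess(series, offset=0.0, time_shift=pd.Timedelta(0),
               lower=None, upper=None):
    """Sort, offset, time-shift and range-filter one sensor time series.

    All parameter names are assumptions made for this sketch.
    """
    out = series.sort_index()            # Sorting: order by timestamp
    out = out + offset                   # Offsetting: add a constant to values
    out.index = out.index + time_shift   # Time shifting: move all timestamps
    if lower is not None:                # Removal of extreme values (low side)
        out = out[out >= lower]
    if upper is not None:                # Removal of extreme values (high side)
        out = out[out <= upper]
    return out


# Hypothetical sample: unsorted readings with one implausible value
timestamps = pd.to_datetime([
    "2016-01-01 00:00:02",
    "2016-01-01 00:00:00",
    "2016-01-01 00:00:01",
])
raw = pd.Series([3.0, 1.0, 999.0], index=timestamps)
clean = preprocess(raw, offset=0.5,
                   time_shift=pd.Timedelta(seconds=10), upper=100.0)
print(clean)
```

Thresholds, offsets and time shifts would in practice come from per-sensor metadata rather than being hard-coded as in this sample.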
tesseract-ocr executable. Optical character recognition (OCR) is used to read sensor values from images. The pytesseract package (https://pypi.python.org/pypi/pytesseract) requires that tesseract-ocr is installed. The tesseract-ocr wiki provides useful information: https://github.com/tesseract-ocr/tesseract/wiki. You must be able to invoke the tesseract command as "tesseract". If you cannot, for example because tesseract is not in your PATH, change the "tesseract_cmd" variable at the top of 'tesseract.py'.
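To check whether the "tesseract" command is reachable before running the script, you can search the PATH from Python. The helper below is a hypothetical sketch (the name `find_on_path` is not part of this project or of pytesseract) and uses only the standard library, so it should work on both Python 2.7 and Python 3.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name` found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables usually carry an extension such as .exe
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)
    for directory in path.split(os.pathsep):
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None


location = find_on_path("tesseract")
if location is None:
    print("tesseract not found on PATH; set tesseract_cmd to its full path")
else:
    print("tesseract found at " + location)
```

If the search returns nothing, either add the tesseract installation folder to your PATH or set "tesseract_cmd" to the full path of the executable.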
Python 2.7.x is recommended. We recommend installing the Anaconda distribution: https://www.continuum.io/downloads
The following Python packages are required (os, glob, csv, re, datetime and tkinter are part of the Python standard library):
- pandas for working with time series
- os for working with filesystem
- Image or PIL for working with images
- pytesseract, a Python wrapper for tesseract-ocr
- glob for selecting files with wildcards
- csv for writing csv files
- re for using regular expressions
- datetime for working with datetime stamps
- tkinter (named Tkinter in Python 2) for creating GUIs
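One way to confirm that the third-party packages are available is to try importing them. The sketch below assumes the import names `pandas`, `PIL` and `pytesseract`; the helper `missing_modules` is hypothetical and not part of this project.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Third-party packages from the list above; the others ship with Python.
# "PIL" is the import name provided by both PIL and Pillow.
required = ["pandas", "PIL", "pytesseract"]
print("Missing packages: %s" % missing_modules(required))
```

An empty list means all three packages can be imported in the current environment.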
- Make sure the dependencies mentioned above are installed.
- Download and extract the following packages from the Zenodo data repository:
  - floodX Datasets (doi: ...)
  - floodX Datalogger Images (doi: ...)
- The floodX Datasets package contains the default version of this script in the code folder.
- If you wish to update the files in the code folder, you can replace them with the files of this GitHub repository.
- Open settings.py, and update the paths in metadata/metadata_ocr.csv to point to where you unpacked the floodX Datalogger Images package.
- If you have installed tesseract and can call it from the command line as "tesseract", but pytesseract still raises an error such as "system cannot find the file specified", try restarting your Python environment, since it may not yet have picked up the updated PATH.