floodx_data_preprocessing

Python script for preprocessing floodX data into a uniform, readable format. Learn more about the project here: http://www.eawag.ch/en/department/sww/projects/floodx/. The following transformations are applied:

Optical Character Recognition of display values
Consolidation
Sorting
Formatting
Offsetting
Time shifting
Removal of extreme values
Segmentation of data by experiment
Formatting for time series database
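
Several of the transformations listed above (sorting, offsetting, time shifting, removal of extreme values, and segmentation by experiment) can be illustrated with pandas. The following is only a minimal sketch and not the code shipped in this repository: the column names `datetime` and `value`, the function names `preprocess` and `segment`, and all parameters are hypothetical and chosen for illustration.

```python
import pandas as pd


def preprocess(df, offset=0.0, time_shift=pd.Timedelta(0), lower=None, upper=None):
    """Sort, offset, time-shift and filter a sensor time series.

    Assumes (hypothetically) a 'datetime' column and a numeric 'value' column.
    """
    out = df.copy()
    # Time shifting: correct for a logger clock that was ahead or behind.
    out["datetime"] = pd.to_datetime(out["datetime"]) + time_shift
    # Offsetting: add a constant correction to the measured value.
    out["value"] = out["value"] + offset
    # Sorting: order the records chronologically.
    out = out.sort_values("datetime")
    # Removal of extreme values: keep only values inside the allowed range.
    keep = pd.Series(True, index=out.index)
    if lower is not None:
        keep &= out["value"] >= lower
    if upper is not None:
        keep &= out["value"] <= upper
    return out[keep].reset_index(drop=True)


def segment(df, experiments):
    """Split a preprocessed series into one DataFrame per experiment.

    'experiments' is a list of (name, start, end) tuples; start is inclusive,
    end is exclusive.
    """
    return {
        name: df[(df["datetime"] >= start) & (df["datetime"] < end)]
        for name, start, end in experiments
    }
```

Keeping each transformation as a separate, commented step makes it easy to switch individual corrections on or off per sensor.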

Dependencies

Tesseract OCR

tesseract-ocr executable. Optical character recognition (OCR) is used to read sensor values from images. The pytesseract package (https://pypi.python.org/pypi/pytesseract) used for this requires that tesseract-ocr be installed. The tesseract-ocr wiki provides useful information: https://github.com/tesseract-ocr/tesseract/wiki. You must be able to invoke the tesseract command as "tesseract". If you cannot, for example because tesseract is not in your PATH, change the "tesseract_cmd" variable at the top of 'tesseract.py'.
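
To check whether the "tesseract" command can be found on your PATH before running the OCR step, something like the following may help. It is a small sketch using only the standard library; the helper name `find_tesseract` is hypothetical, and `shutil.which` requires Python 3.3 or later, so it will not run under the Python 2.7 recommended below.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_tesseract()
if path is None:
    # In this case the "tesseract_cmd" variable would need to hold the full path.
    print("tesseract was not found on PATH")
else:
    print("tesseract found at", path)
```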

Python

Python 2.7.x is recommended. We recommend installing the Anaconda distribution: https://www.continuum.io/downloads

Python packages

The following Python packages are required (os, glob, csv, re, datetime and tkinter ship with Python and need no separate installation):

pandas for working with time series
os for working with filesystem
Image or PIL for working with images
pytesseract for calling tesseract-ocr from Python
glob for selecting files with wildcards
csv for writing csv files
re for using regular expressions
datetime for working with datetime stamps
tkinter for creating GUIs
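
A quick way to see which of the packages listed above are missing from your environment is to try importing each one. This is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module names assume Python 3 (under Python 2.7 the GUI module is called `Tkinter`).

```python
import importlib

# Import names for the packages listed above; "PIL" is the import name of
# the imaging library.
REQUIRED = ["pandas", "os", "PIL", "pytesseract", "glob", "csv", "re",
            "datetime", "tkinter"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED))
```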

Executing the script

Make sure the dependencies mentioned above are installed.
Download and extract the following packages from the Zenodo data repository:
floodX Datasets (doi: ...)
floodX Datalogger Images (doi: ...)
The floodX Datasets package contains the default version of this script in the code folder.
If you wish to update the files in the code folder, you can replace them with the files from this GitHub repository.
Open settings.py and update the paths in metadata/metadata_ocr.csv so that they point to the location where you unpacked the floodX Datalogger Images package.
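
If many paths need to change, the update in the last step could be scripted rather than done by hand. The sketch below is an assumption-laden illustration only: the actual layout of metadata/metadata_ocr.csv is not described here, so the column name, the delimiter and the helper name `update_image_paths` are all hypothetical and must be adapted to the real file. Written for Python 3.

```python
import csv


def update_image_paths(csv_path, column, old_prefix, new_prefix, delimiter=","):
    """Replace old_prefix with new_prefix in one column of a CSV file.

    Returns the number of rows that were changed. The column name and the
    delimiter are assumptions about the file layout.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        fieldnames = reader.fieldnames
        rows = list(reader)

    changed = 0
    for row in rows:
        value = row[column]
        if value.startswith(old_prefix):
            row[column] = new_prefix + value[len(old_prefix):]
            changed += 1

    # Write the file back with the same header and column order.
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
    return changed
```

Make a backup copy of the CSV file before running anything like this, since the file is overwritten in place.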

Troubleshooting

If tesseract is installed and you can call it from the command line using "tesseract", but pytesseract still raises an error such as "system cannot find the file specified", try restarting your Python environment; it may not yet have picked up the updated PATH.
