Add Numeric Table Detection samples #48
Conversation
Some small changes. Moved the PDFs to an S3 bucket and reduced the dependency on SageMaker.
python/14-pdf-numeric-table.ipynb
Outdated
"file=\"./DemoTable.pdf\"\n",
"file_key=f\"idp/textract/demo/{os.path.basename(file)}\"\n",
"s3url=\"s3://\"+data_bucket+\"/\"+file_key\n",
"!aws s3 cp {file} {s3url} --only-show-errors"
I put the sample at s3://amazon-textract-public-content/code-samples/DemoTable.pdf and made it publicly readable. We can get rid of all the code that uploads to S3 and depends on SageMaker.
Simplification.
Perfect. I also added a local copy of the file so that the user can see it in the notebook.
"# Detect Tables in raw text\n",
"\n",
"In this example, a threshold of 4 digits among the 15 most frequently used characters works well for identifying whether a page consists mostly of numeric tables.\n",
"\n"
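For readers following the thread, the heuristic described in this cell could be sketched roughly as below. This is an illustration only, not the notebook's code: the function name `is_numeric_table_page`, its parameter names, and the choice to ignore whitespace when counting characters are all assumptions.

```python
from collections import Counter


def is_numeric_table_page(text, top_n=15, min_digits=4):
    """Hypothetical helper: True if at least `min_digits` of the `top_n`
    most frequent non-whitespace characters in `text` are digits."""
    # Count every character except whitespace (ignoring whitespace is an
    # assumption; the notebook may count characters differently).
    counts = Counter(ch for ch in text if not ch.isspace())
    # Keep only the characters themselves from the top-N (char, count) pairs.
    most_common = [ch for ch, _ in counts.most_common(top_n)]
    # Count how many of those frequent characters are digits.
    digit_count = sum(ch.isdigit() for ch in most_common)
    return digit_count >= min_digits


table_page = "Year Revenue Cost\n2019 1200 3400\n2020 5600 7800\n2021 9100 2300"
prose_page = "The quick brown fox jumps over the lazy dog in 2021."
print(is_numeric_table_page(table_page))
print(is_numeric_table_page(prose_page))
```

The defaults mirror the values quoted in the cell (4 of 15); other documents would likely need different thresholds.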
Please explain in more detail how this implementation works. This is the key point of the notebook/sample, so we want to be very clear about it. Provide a second implementation as well, so that the concept is well understood.
Added a second implementation focusing on each line.
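The second implementation itself is not shown in this thread. As a rough idea of what a per-line approach might look like, the sketch below flags a line when digits make up a large share of its non-whitespace characters. The function name `numeric_lines` and the `min_ratio` value of 0.5 are hypothetical and are not taken from the notebook.

```python
def numeric_lines(text, min_ratio=0.5):
    """Hypothetical helper: return the lines of `text` in which digits make up
    at least `min_ratio` of the non-whitespace characters."""
    flagged = []
    for line in text.splitlines():
        # Ignore whitespace so column padding does not dilute the ratio.
        chars = [ch for ch in line if not ch.isspace()]
        if not chars:
            continue  # skip blank lines
        ratio = sum(ch.isdigit() for ch in chars) / len(chars)
        if ratio >= min_ratio:
            flagged.append(line)
    return flagged


sample = "Quarterly results\n2019 1,200 3,400\nSee notes below\n2020 5,600 7,800"
print(numeric_lines(sample))
```

Working line by line makes it possible to pick out table rows on a page that mixes prose and numbers, at the cost of one more threshold to tune.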
]
},
{
"cell_type": "code",
Add a visualization of the newly created temp.pdf to show that it only contains tables.
Added
Issue #, if available:
Description of changes:
I added a static numeric table detection sample in a Jupyter notebook and corrected a missing link in the README.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.