Add Numeric Table Detection samples #48

Open
wants to merge 9 commits into
base: master
from

Conversation

jmalha
@jmalha jmalha commented Mar 22, 2023

Issue #, if available:

Description of changes:
I added a static numeric table detection sample in a Jupyter notebook and corrected a missing link in the README.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@schadem schadem left a comment

Some small changes: moved the PDFs to an S3 bucket and reduced the dependency on SageMaker.

Comment on lines 102 to 105
"file=\"./DemoTable.pdf\"\n",
"file_key=f\"idp/textract/demo/{os.path.basename(file)}\"\n",
"s3url=\"s3://\"+data_bucket+\"/\"+file_key\n",
"!aws s3 cp {file} {s3url} --only-show-errors"
Contributor

I put the sample at s3://amazon-textract-public-content/code-samples/DemoTable.pdf as publicly readable, so we can get rid of all the S3 upload and SageMaker code. That simplifies the notebook.
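
As a rough illustration of what replaces the upload step, the notebook could turn the public S3 URL directly into the `S3Object` structure that Textract calls in boto3 accept. This is only a sketch: the helper name `s3_url_to_document_location` is hypothetical and not part of the notebook, and no Textract call is made here.

```python
from urllib.parse import urlparse


def s3_url_to_document_location(s3_url: str) -> dict:
    """Split an s3:// URL into the Bucket/Name pair used for an S3Object.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    parsed = urlparse(s3_url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid s3:// URL: {s3_url!r}")
    return {"S3Object": {"Bucket": parsed.netloc, "Name": key}}


# The public sample mentioned in this comment.
location = s3_url_to_document_location(
    "s3://amazon-textract-public-content/code-samples/DemoTable.pdf"
)
print(location)
```

The resulting dictionary could then be passed as the document location of a Textract request, with no local file or upload involved.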

Author

Perfect. I also added a local copy of the file so that the user can see it in the notebook.

"# Detect Tables in raw text\n",
"\n",
"In this example, a threshold of 4 digits among the 15 most used characters works well to identify whether the page consists mostly of numeric tables.\n",
"\n"
Contributor

Please explain in more detail how this implementation works. This is the key point of the notebook/sample, so we want to be very clear about it. Please also provide a second implementation, so the concept is well understood.
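
To make the request concrete, here is one way the page-level heuristic from the markdown cell could be spelled out: count how often each character occurs in the page text, take the 15 most frequent, and flag the page when at least 4 of them are digits. This is a minimal sketch of my reading of the description, not the notebook's code. The function name and the choice to ignore whitespace are assumptions.

```python
from collections import Counter


def is_numeric_table_page(text: str, top_n: int = 15, min_digits: int = 4) -> bool:
    """Return True when at least `min_digits` of the `top_n` most frequent
    non-whitespace characters in `text` are digits.

    Assumption: whitespace is excluded before counting.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    most_common_chars = [ch for ch, _ in counts.most_common(top_n)]
    digit_hits = sum(ch.isdigit() for ch in most_common_chars)
    return digit_hits >= min_digits


prose_page = "The quick brown fox jumps over the lazy dog near the river bank"
table_page = "2021 1034 5567 8890\n2022 1187 6034 9021\nTotal 2221 11601 17911"

print(is_numeric_table_page(prose_page))
print(is_numeric_table_page(table_page))
```

The intuition is that in ordinary prose the most frequent characters are letters, while on a page dominated by numeric tables several digits crowd into the top of the frequency ranking.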

Author

Added a second implementation focusing on each line.
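
For readers without the notebook open, a line-focused variant might look roughly like the sketch below: score each line by the fraction of its characters that are digits, then call the page numeric when enough lines pass. The function names and both 0.5 thresholds are illustrative assumptions, not values taken from the notebook.

```python
def digit_fraction(line: str) -> float:
    """Fraction of non-whitespace characters in `line` that are digits."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch.isdigit() for ch in chars) / len(chars)


def is_mostly_numeric_lines(text: str, min_fraction: float = 0.5,
                            min_share: float = 0.5) -> bool:
    """Return True when at least `min_share` of the non-empty lines have a
    digit fraction of `min_fraction` or more.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    numeric = sum(digit_fraction(line) >= min_fraction for line in lines)
    return numeric / len(lines) >= min_share


page = "Revenue by year\n2021 1034 5567\n2022 1187 6034\n2023 1250 6410"
print(is_mostly_numeric_lines(page))
```

Compared with the page-level character ranking, a per-line score also makes it possible to tell which lines look like table rows, at the cost of two thresholds to tune.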

Comment on lines +248 to +251
]
},
{
"cell_type": "code",
Contributor

Add a visualization of the newly created temp.pdf to show that it only contains tables.

Author

Added

@jmalha jmalha marked this pull request as draft March 24, 2023 06:48
@jmalha jmalha marked this pull request as ready for review March 24, 2023 06:54
2 participants