BSC Analytics

Making PDFs Searchable Using AWS Textract and CloudSearch By Mahmood BSC Analytics

25 Jun 2024

If you have PDFs and want to make them searchable, provide the search functionality on your website or for internal use. This is what we are doing today using AWS Textract and CloudSearch.

Why Textract? Will it is better than your average PDF-to-text extractor. Textract uses Machine Learning (ML) and Optical Character Recognition (OCR).

Why CloudSearch? There is no specific reason other than making it easy to search. You can use many other databases like MySQL, Postgres, MongoDB, etc.

Start by cloning the Terraform repo that sets up all your resources.

git clone https://github.com/mahmoodr786/aws-pdf-search.git

cd aws-pdf-search

terraform init

terraform apply

The repo has no-nonsense straight-up Terraform with everything main.tf file. Update the file with your bucket name, or change names around. You will also find the Python code needed to aid the search and move extracted text from S3 to CloudSearch. Here is a basic diagram to help you understand flow visually.

NOTE: I’m using very open permission on the Terraform as this is just a POC and will be destroyed. Please adjust your IAM permission to the least privileges.

You start by uploading your PDFs in the S3 bucket in the pdfs folder/prefix. That will automatically trigger the Lambda function using S3 notification, and the Lambda will submit the PDF to AWS Textract. Once Textract is done, it will send a notification via SNS, which will trigger the same Lambda again, and the Python code will get all the files and lines of text from JSON files. Once it has all the text, it will be uploaded to CloudSearch. Finally, we use the same Lambda using function URL to search for PDFs that contain the keywords.

We are going to search for the keyword below from the Graviton2 Whitepaper.

https://u3gq2rmqwswz2kc6zjnzvqzz7y0uzbkm.lambda-url.us-east-1.on.aws/search?keyword=Review%20Ramp-up%20and

Once the Terraform is completed, you should see an S3 bucket create a prefix/folder called pdfs and begin uploading pdfs. The outputs folder is where Textract will dump the JSON files with text lines from PDF.