Document Text Extractor


A test of a distributed optical character recognition (OCR) service that clients can connect to through microservices.


Setting up a sample application with AWS Amplify was not as simple as I thought it would be. The framework makes setting up a DynamoDB GraphQL application straightforward, but unfortunately, to leverage additional backend services you often end up writing raw CloudFormation.

AWS Textract, on the other hand, went wonderfully and passed the optical character recognition test with flying colors. Not only was it able to extract text from the blurry documents provided, but it was also able to analyze the relationships between those pieces of text to identify fields and tables.


A React app frontend demonstrates a sample UI, hosted on S3 behind CloudFront using the AWS Amplify framework.

I also used Amplify to set up the AppSync GraphQL API and the DynamoDB schema, then used the Serverless Framework to deploy the AWS SNS topics and SQS queues.

Finally, S3 was set up to accept file uploads and send event notifications to SNS, prompting an AWS Lambda function to kick off the AWS Textract service using the AWS CLI.
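As a rough illustration of the "kick off Textract" step, the sketch below starts an asynchronous analysis job for an uploaded object. Note that it swaps in a boto3-style Textract client (the AWS SDK for Python) instead of the AWS CLI mentioned above. The function name and the ARNs in the comments are placeholders, not values from this project.

```python
def start_analysis(textract_client, bucket: str, key: str,
                   sns_topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous Textract analysis and return its job ID.

    `textract_client` is assumed to be a boto3 Textract client, e.g.
    boto3.client("textract"). It is passed in so the function can be
    exercised with a stub outside of AWS.
    """
    response = textract_client.start_document_analysis(
        # The document is read directly from the upload bucket.
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Ask for both table and form (key/value) analysis.
        FeatureTypes=["TABLES", "FORMS"],
        # Textract publishes the completion notice to this SNS topic,
        # assuming the given IAM role allows it to do so.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

Inside a Lambda handler this might be called as `start_analysis(boto3.client("textract"), bucket, key, topic_arn, role_arn)`, with the topic and role ARNs read from environment variables. The returned job ID is what a later function would use to fetch the finished results.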

In summary:

  1. User uploads a file
  2. S3 event notification sent to Upload Topic
  3. Upload Topic sends messages to Upload Queue
  4. Upload Queue triggers the AWS Textract Lambda function
  5. Lambda function tells Textract to analyze document in S3 (async)
  6. When Textract is done it sends a notification to the Textract Topic
  7. The Textract Topic forwards the message to the Textract Queue
  8. Another Lambda function uses the AWS CLI to get the finished job data and pushes that into another S3 bucket
  9. An S3 notification for this analyzed data bucket is sent to an Analyzed Data Topic
  10. The Analyzed Data Topic forwards to an Analyzed Data Queue
  11. The Analyzed Data Queue triggers another Lambda function, which parses the analyzed data from S3 and makes a POST request with the parsed data to the GraphQL endpoint provided by AWS AppSync
  12. The AppSync GraphQL server handles writing the data into DynamoDB
  13. The React frontend uses the GraphQL subscriptions API to listen for mutations happening on AppSync and re-renders with the changes to display the document analysis to the user
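The queue-triggered steps above all receive the same nested envelope: an SQS record whose body is an SNS notification, whose `Message` is in turn the S3 event. The sketch below shows one way to unwrap it and to build the JSON body for the AppSync POST in step 11. It assumes raw message delivery is disabled on the SNS subscriptions (the default). The `createDocument` mutation and its input fields are hypothetical stand-ins for whatever the Amplify-generated schema actually defines.

```python
import json
from typing import Iterator, Tuple
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for every S3 object referenced in an SQS event."""
    for sqs_record in event.get("Records", []):
        sns_envelope = json.loads(sqs_record["body"])    # SQS body holds the SNS envelope
        s3_event = json.loads(sns_envelope["Message"])   # SNS message holds the S3 event
        # S3 test events carry no "Records" key, so default to an empty list.
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            yield bucket, key


# Hypothetical mutation: the real name and input type depend on the schema.
CREATE_DOCUMENT = """
mutation CreateDocument($input: CreateDocumentInput!) {
  createDocument(input: $input) { id }
}
"""


def build_graphql_body(fields: dict, bucket: str, key: str) -> str:
    """Build the JSON body for a GraphQL POST request to the AppSync endpoint."""
    return json.dumps({
        "query": CREATE_DOCUMENT,
        "variables": {
            "input": {
                "sourceBucket": bucket,       # hypothetical field names
                "sourceKey": key,
                "fields": json.dumps(fields),  # parsed data sent as a JSON string
            }
        },
    })
```

The request itself would also need whatever authorization the AppSync API is configured with (an API key header or signed IAM request), which is left out here.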