
AWS Textract in Production: What Nobody Tells You

Published: March 2025
Read time: 5 min
Tags: AWS Textract, Document AI, Python, Production

AWS Textract is the right choice for production document processing. It handles complex layouts better than open-source OCR alternatives, integrates cleanly with the rest of the AWS ecosystem, and the managed service model means you are not maintaining OCR infrastructure. It is also more nuanced in production than the documentation suggests. This article covers the gaps between the Textract demo and what you actually encounter processing 500+ real-world documents monthly — contracts, insurance claims, medical records, and government forms.

What Textract Actually Does Well

Textract genuinely excels at standard document layouts — typed text in single- or double-column layouts, tables with clear borders, forms with labeled fields. On clean, digitally generated PDFs with standard layouts, accuracy exceeds 99%. For these document types, Textract is as good as it gets and significantly faster than managing your own OCR pipeline.

The Queries feature — where you ask Textract specific questions about a document rather than extracting everything — is underused and extremely useful for structured forms. Instead of extracting all text and then parsing it to find the effective date of a contract, you ask Textract directly: "What is the effective date?" and it returns the answer with a confidence score. For standardized forms this is faster and more accurate than general extraction followed by NLP parsing.
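A minimal sketch of what Queries look like through boto3, plus a helper that pairs each QUERY block in the response with its QUERY_RESULT answer. The bucket, key, and question values are illustrative; boto3 is imported lazily so the parsing helper can run without AWS dependencies.

```python
def parse_query_answers(response):
    """Map each query's text to its (answer text, confidence) pair."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    result = by_id[rid]  # a QUERY_RESULT block
                    answers[block["Query"]["Text"]] = (
                        result.get("Text"),
                        result.get("Confidence"),
                    )
    return answers


def ask_document(bucket, key, questions):
    """Submit Textract Queries against a document stored in S3."""
    import boto3  # lazy: the parser above stays usable without AWS deps

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": [{"Text": q} for q in questions]},
    )
    return parse_query_answers(response)
```

Usage would look like `ask_document("my-contracts", "acme-msa.pdf", ["What is the effective date?"])`, returning answer text and a confidence score you can threshold on.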

Where Textract Struggles in Production

Multi-column layouts with mixed content are the first production failure mode. A legal contract with a standard body but a two-column definitions section, footnotes, and marginal annotations will produce extraction output where the reading order is incorrect — content from different columns interleaved, footnotes appearing mid-sentence. The extracted text is technically accurate at the character level but semantically broken.

Handwritten content on printed forms is inconsistent. Textract's handwriting detection works well for clear, separated handwriting. Connected cursive, non-English handwriting, or handwriting on colored form fields produces significantly lower accuracy. If your document corpus includes handwritten sections — insurance claim supplements, medical intake forms, government applications — budget for post-processing and human review.

Scanned documents with skew, shadow, or low resolution are the most common source of Textract failures in production. A contract scanned at an angle, a mobile phone photo of a document taken in poor lighting, or a fax-quality scan will produce extraction errors that propagate through your entire downstream pipeline.

Pre-Processing Is Not Optional

The gap between demo accuracy and production accuracy on real-world documents is almost always closed by pre-processing, not by Textract configuration. Pre-processing the document before submitting it to Textract has more impact on accuracy than any other single decision.

Pre-processing steps that matter in production: deskewing scanned documents using OpenCV's Hough line transform to detect and correct rotation, denoising using Gaussian blur or bilateral filtering before OCR, contrast enhancement for faded or low-contrast documents, and resolution normalization — Textract performs best on documents at 150–300 DPI, and upsampling low-resolution scans before submission improves accuracy measurably.

For a contract processing system handling 500+ documents monthly, implementing these pre-processing steps reduced the Textract error rate from 8.3% to 1.2% on the same document corpus. The pre-processing pipeline added 200–400ms per document — a worthwhile trade-off.
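A sketch of that pipeline, assuming OpenCV and NumPy are available. The skew estimate takes the median deviation of near-horizontal Hough lines from 90°; the threshold values and filter parameters here are starting points, not tuned constants, and the heavy dependencies are imported lazily so the angle helper stands alone.

```python
import statistics


def skew_from_hough_thetas(thetas_deg):
    """Median deviation of near-horizontal lines from 90 deg in Hough space.

    A horizontal text line has theta ~= 90 deg; the deviation is the page skew.
    """
    deviations = [t - 90.0 for t in thetas_deg if 45.0 < t < 135.0]
    return statistics.median(deviations) if deviations else 0.0


def preprocess(path):
    """Deskew, denoise, and contrast-enhance a scan before Textract submission."""
    import cv2  # lazy: only the full pipeline needs OpenCV
    import numpy as np

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.bilateralFilter(img, 9, 75, 75)  # denoise while preserving edges
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    angle = 0.0
    if lines is not None:
        angle = skew_from_hough_thetas([np.degrees(l[0][1]) for l in lines])
    h, w = img.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, rot, (w, h),
                         flags=cv2.INTER_CUBIC, borderValue=255)
    img = cv2.equalizeHist(img)  # contrast enhancement for faded scans
    return img
```

Resolution normalization (upsampling to the 150–300 DPI range) can be bolted onto the same function with `cv2.resize` once you know the source scan's DPI.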

Confidence Scores and Human Review Routing

Every Textract extraction includes confidence scores at the word and block level. Most teams ignore these and treat all Textract output as equally reliable. This is a mistake. In production, use confidence scores to route documents to human review when needed. Documents where any block confidence score falls below your threshold — typically 85–90% for high-stakes applications — should be flagged for human verification before the extracted data enters downstream systems.

This confidence-based routing is how you maintain accuracy guarantees without manually reviewing every document. In our contract processing system, approximately 12% of documents are routed to human review based on confidence scores. The remaining 88% proceed to automated processing with 99%+ accuracy.
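The routing rule reduces to a few lines over the Textract response. A minimal sketch, with the 90% threshold as an illustrative default:

```python
REVIEW_THRESHOLD = 90.0  # typical for high-stakes applications (85-90%)


def min_block_confidence(response):
    """Lowest confidence across all blocks that carry a score."""
    scores = [b["Confidence"] for b in response["Blocks"] if "Confidence" in b]
    return min(scores) if scores else None


def route(response, threshold=REVIEW_THRESHOLD):
    """Return 'human_review' if any block falls below the threshold."""
    lowest = min_block_confidence(response)
    if lowest is not None and lowest < threshold:
        return "human_review"
    return "automated"
```

Logging the minimum confidence alongside the routing decision also gives you the data to tune the threshold against your observed review outcomes.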

Asynchronous Processing for Production Volume

The Textract synchronous APIs — DetectDocumentText and AnalyzeDocument — work for single-document processing and prototyping. For production volumes they create bottlenecks. Synchronous calls block while Textract processes the document, and at scale you are either rate-limited or maintaining a large number of concurrent connections.

Production document processing uses the Textract asynchronous API: StartDocumentAnalysis submits the job and returns a job ID immediately, your application continues processing other work, and a completion notification arrives via SNS when Textract finishes. This event-driven architecture scales to high document volumes without connection management complexity.

Combine asynchronous Textract with an SQS queue for document ingestion and an S3 event trigger for incoming documents. A new document arrives in S3, the S3 event triggers a Lambda function, the Lambda submits the document to the Textract async API, Textract completes and notifies SNS, and SNS triggers a processing Lambda with the job results. This pipeline scales horizontally without any changes as volume grows.
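The submission Lambda in that pipeline can be sketched as follows. The environment variable names are hypothetical placeholders for your own SNS topic and the IAM role Textract assumes to publish to it; boto3 is imported lazily so the event parsing stays testable without AWS dependencies.

```python
import os

# Hypothetical variable names -- wire these to your own SNS topic and
# the IAM role Textract uses to publish completion notifications.
TOPIC_ARN = os.environ.get("TEXTRACT_SNS_TOPIC_ARN", "")
ROLE_ARN = os.environ.get("TEXTRACT_SNS_ROLE_ARN", "")


def s3_object_from_event(event):
    """Pull bucket and key out of an S3 event notification record."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]


def handler(event, context):
    """Lambda entry point: submit the new S3 object to the Textract async API."""
    import boto3  # lazy: the event parser above runs without AWS deps

    bucket, key = s3_object_from_event(event)
    textract = boto3.client("textract")
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={"SNSTopicArn": TOPIC_ARN, "RoleArn": ROLE_ARN},
    )
    return {"JobId": job["JobId"]}
```

The second Lambda, triggered by SNS, calls GetDocumentAnalysis with the job ID and pages through the results.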

Cost Management at Scale

Textract pricing is per page. At low volumes this is negligible. At 500+ documents monthly with average document lengths of 15–20 pages, costs accumulate. Two optimizations matter at scale.

First, only run full AnalyzeDocument — which extracts tables, forms, and text — on documents that actually contain tables or forms. For plain-text documents, DetectDocumentText is significantly cheaper. Classify documents before Textract processing and use the appropriate API call.

Second, cache extraction results. If the same document is submitted multiple times — common in systems where users re-process or re-analyze documents — re-running Textract is wasteful. Store extraction results in S3 with the document hash as the key and return cached results for duplicate submissions.
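Both optimizations are small in code. A sketch, with the `textract-cache/` key prefix as an assumed convention:

```python
import hashlib


def cache_key(document_bytes):
    """Deterministic S3 key for a document's cached extraction result."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"textract-cache/{digest}.json"


def choose_api(has_tables_or_forms):
    """Route plain-text documents to the cheaper text-detection API."""
    return "AnalyzeDocument" if has_tables_or_forms else "DetectDocumentText"


def cached_result(s3_client, bucket, document_bytes):
    """Return the cached extraction JSON if present, else None."""
    try:
        obj = s3_client.get_object(Bucket=bucket, Key=cache_key(document_bytes))
        return obj["Body"].read()
    except s3_client.exceptions.NoSuchKey:
        return None
```

On a cache miss you run Textract, `put_object` the JSON under the same key, and every later duplicate submission is a single cheap S3 read instead of a per-page Textract charge.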

Integrating Textract Output With Downstream NLP

Textract output is structured JSON containing blocks of text with position information, confidence scores, and relationship data. Most downstream NLP pipelines expect plain text. The translation layer between Textract JSON and usable text is more work than it initially appears.

Preserve the structure. Do not flatten Textract output to a single text string — you lose table structure, section hierarchy, and reading-order information that downstream processing needs. Build a document model from the Textract JSON that preserves sections, tables as structured data, and form fields as key-value pairs. Feed the appropriate representation to each downstream component — plain text to your LLM for summarization, structured tables to your data extraction pipeline, form fields to your validation logic.
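The form-field half of that document model comes from walking KEY_VALUE_SET blocks and their CHILD/VALUE relationships. A minimal sketch of that traversal:

```python
def _child_text(block, by_id):
    """Concatenate the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                child = by_id[cid]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(blocks):
    """Build key-value pairs from Textract KEY_VALUE_SET blocks."""
    by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        if (block["BlockType"] != "KEY_VALUE_SET"
                or "KEY" not in block.get("EntityTypes", [])):
            continue
        key_text = _child_text(block, by_id)
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for vid in rel["Ids"]:
                    fields[key_text] = _child_text(by_id[vid], by_id)
    return fields
```

Tables work the same way (TABLE blocks relate to CELL blocks, which relate to WORDs), and the resulting dict feeds your validation logic directly while the structured tables go to the data extraction pipeline.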

By LTK Group Engineering Team | Bangalore, India
