Comparison

AWS Textract vs ShapeForge: Benchmarks on Real Invoices

📅 April 2, 2026 🕒 8 min read 📄 200-invoice benchmark

If you've ever Googled "aws textract alternative", you're probably tired of paying AWS's per-page pricing, wrestling with their SDK, or getting blank JSON back from scanned documents. We ran both APIs against 200 real-world invoices — typed, scanned, handwritten, and multi-column — and measured accuracy, latency, cost, and developer experience. Here's what we found.

Why developers look for Textract alternatives

AWS Textract is a capable service — it's been around since 2019, it handles a wide range of document types, and it has the institutional trust of being an AWS product. But it comes with friction that adds up fast once you start building real products on it:

ShapeForge is designed to solve exactly these problems: a single REST endpoint, no AWS account required, structured JSON out of the box, and first-class support for scanned images and handwriting.

The benchmark setup

We tested both APIs on 200 invoices drawn from real-world accounts payable workflows. The dataset breaks down as:

For each document we measured: (1) field extraction accuracy against a manually-verified ground truth, (2) end-to-end latency from API call to structured response, and (3) total API cost. All tests were run on April 1–2, 2026.

200 Invoices tested
12 Fields per invoice
2,400 Data points compared
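The post does not spell out how a field counts as "correct", so the following is only a sketch of one way field-level accuracy against ground truth could be scored. The `normalize` rule (trim, collapse whitespace, lowercase) and all function names are assumptions for illustration, not the benchmark's actual matching logic.

```javascript
// Hypothetical normalization: trim, collapse whitespace, lowercase.
// The benchmark's real matching rules are not stated in the post.
function normalize(value) {
  return String(value ?? "").trim().replace(/\s+/g, " ").toLowerCase();
}

// Count how many ground-truth fields the extraction matched after normalization.
function scoreInvoice(extracted, truth) {
  const fields = Object.keys(truth);
  const correct = fields.filter(
    (f) => normalize(extracted[f]) === normalize(truth[f])
  ).length;
  return { correct, total: fields.length };
}

// Aggregate accuracy = matched fields / all ground-truth fields.
function fieldAccuracy(pairs) {
  let correct = 0;
  let total = 0;
  for (const { extracted, truth } of pairs) {
    const s = scoreInvoice(extracted, truth);
    correct += s.correct;
    total += s.total;
  }
  return total === 0 ? 0 : correct / total;
}

// Made-up sample data, two fields per invoice instead of twelve.
const sample = [
  {
    truth: { vendor_name: "Acme Corp", total_amount: "4250.00" },
    extracted: { vendor_name: "ACME  Corp", total_amount: "4250.00" },
  },
  {
    truth: { vendor_name: "Globex", total_amount: "100.00" },
    extracted: { vendor_name: "Globex", total_amount: "1000.00" },
  },
];
const accuracy = fieldAccuracy(sample);
console.log(accuracy); // 3 of 4 fields match, so 0.75
```

With 200 invoices and 12 fields each, the same ratio would be taken over 2,400 field comparisons.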

Head-to-head comparison

Criteria | AWS Textract | ShapeForge
Accuracy (clean PDFs) | 96.4% | 97.1%
Accuracy (scanned images) | 79.3% | 91.8%
Accuracy (handwritten fields) | 61.2% | 78.4%
Accuracy (multi-column tables) | 83.7% | 93.2%
Median latency (p50) | 4.2s | 1.8s
p95 latency | 12.1s | 4.3s
Output format | Raw Blocks (requires parsing) | Structured JSON
AWS account required | Yes + S3 bucket + IAM | No
Integration complexity | 40–80 lines of code | 1 curl command
Cost per 1,000 pages | $15.00–$30.00+ | $5.00–$9.90
Zero-retention data policy | No | Yes
Free tier | 3,000 pages/mo (first 3 months) | 100 documents included
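The p50 and p95 rows are percentiles over per-document latencies. The post does not say which percentile method was used, so this sketch assumes the simple nearest-rank definition. The sample latencies are invented for illustration and are not benchmark data.

```javascript
// Nearest-rank percentile: sort ascending, take the value at rank ceil(p/100 * n).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Invented latencies in milliseconds, not taken from the benchmark.
const latenciesMs = [120, 340, 95, 410, 230, 180, 150, 900, 210, 260];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
console.log(p50, p95); // 210 900
```

A single slow outlier (900 ms here) leaves p50 untouched but dominates p95, which is why the table reports both.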

Accuracy deep-dive: where Textract falls short on scans

The 12.5-point accuracy gap on scanned images (79.3% vs 91.8%) is the number that matters most for real AP workflows. A significant share of the invoices coming into accounts payable departments are scanned — either because vendors fax them, employees photograph paper receipts, or older systems print-then-scan.

Textract's raw text extraction on low-DPI TIFFs was often technically accurate at the character level, but the spatial block layout reconstruction was fragile. Vendor names would bleed into address fields, line-item amounts would merge with tax lines, and totals from multi-column layouts would be silently omitted from the key-value response.

ShapeForge runs a semantic extraction pass on top of the raw OCR, which means even if the scanned text is slightly degraded, the model understands that a number near "Total" is the invoice total — not a line-item quantity. This produces structured JSON that's accurate end-to-end, not just at the character level.

Code comparison: 3 lines vs 60 lines

This is where the developer experience gap is most obvious. To extract the vendor name, invoice number, and total amount from a PDF with Textract, you need to:

  1. Upload the file to S3
  2. Start an async document analysis job (StartDocumentAnalysis)
  3. Poll for job completion (or configure SNS)
  4. Download and parse the Blocks response
  5. Reconstruct key-value pairs from KEY_VALUE_SET blocks
  6. Map the reconstructed pairs to your schema

Here's what that looks like for both:

AWS Textract (Node.js SDK)
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import {
  TextractClient,
  StartDocumentAnalysisCommand,
  GetDocumentAnalysisCommand
} from "@aws-sdk/client-textract";

// Small helper for the polling loop below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Step 1: Upload to S3 first (fileBuffer holds the PDF bytes)
const s3 = new S3Client({ region: "us-east-1" });
await s3.send(new PutObjectCommand({
  Bucket: "my-bucket",
  Key: "invoice.pdf",
  Body: fileBuffer
}));

// Step 2: Start async job
const textract = new TextractClient({ region: "us-east-1" });
const { JobId } = await textract.send(
  new StartDocumentAnalysisCommand({
    DocumentLocation: { S3Object: { Bucket: "my-bucket", Name: "invoice.pdf" } },
    FeatureTypes: ["FORMS", "TABLES"]
  })
);

// Step 3: Poll for completion
let res;
do {
  await sleep(2000);
  res = await textract.send(new GetDocumentAnalysisCommand({ JobId }));
} while (res.JobStatus === "IN_PROGRESS");

// Step 4: Reconstruct key-value pairs
const blocks = res.Blocks;
const kvSets = blocks.filter(b => b.BlockType === "KEY_VALUE_SET");
// ... 30+ more lines of block parsing
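Steps 4 and 5 in the list above, turning Blocks into key-value pairs, could look something like the following sketch. It relies on the documented Textract block shape: KEY blocks point to their VALUE block through a `VALUE` relationship, and both point to their `WORD` blocks through `CHILD` relationships. The function names and the sample blocks are hypothetical, hand-written for illustration rather than real Textract output, and the sketch ignores checkboxes and paginated responses.

```javascript
// Join the Text of all WORD children of a block.
function childText(block, byId) {
  const ids = (block.Relationships ?? [])
    .filter((r) => r.Type === "CHILD")
    .flatMap((r) => r.Ids);
  return ids
    .map((id) => byId.get(id))
    .filter((b) => b && b.BlockType === "WORD")
    .map((b) => b.Text)
    .join(" ");
}

// Walk every KEY block, follow its VALUE relationship, and collect the text.
function extractKeyValues(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const result = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes ?? []).includes("KEY");
    if (!isKey) continue;
    const valueIds = (block.Relationships ?? [])
      .filter((r) => r.Type === "VALUE")
      .flatMap((r) => r.Ids);
    const value = valueIds
      .map((id) => byId.get(id))
      .filter(Boolean)
      .map((v) => childText(v, byId))
      .join(" ");
    result[childText(block, byId)] = value;
  }
  return result;
}

// Hand-written sample in the shape of a Textract response, not real output.
const sampleBlocks = [
  { Id: "k1", BlockType: "KEY_VALUE_SET", EntityTypes: ["KEY"],
    Relationships: [
      { Type: "VALUE", Ids: ["v1"] },
      { Type: "CHILD", Ids: ["w1", "w2"] },
    ] },
  { Id: "v1", BlockType: "KEY_VALUE_SET", EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w3"] }] },
  { Id: "w1", BlockType: "WORD", Text: "Invoice" },
  { Id: "w2", BlockType: "WORD", Text: "Number:" },
  { Id: "w3", BlockType: "WORD", Text: "INV-2026-00847" },
];
const keyValues = extractKeyValues(sampleBlocks);
console.log(keyValues); // { 'Invoice Number:': 'INV-2026-00847' }
```

Mapping keys such as "Invoice Number:" onto your own schema (step 6) is still a separate job, since the key text is whatever appears on the page.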
ShapeForge
# That's it. One curl command.
curl -X POST \
  https://shapeforge-4vqx.polsia.app/api/parse \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "file=@invoice.pdf"

// Response is already structured JSON:
{
  "document_type": "invoice",
  "extracted_fields": {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2026-00847",
    "invoice_date": "2026-03-15",
    "total_amount": 4250.00,
    "currency": "USD",
    "due_date": "2026-04-14"
  },
  "tables": [...],
  "confidence": { "overall": 0.97 }
}

No S3 setup. No IAM policies. No async job polling. No block reassembly. You send a file, you get structured data back in under 2 seconds.
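For a like-for-like comparison with the Node.js Textract code, the same call can be sketched in JavaScript. This assumes Node 18 or later, where `fetch`, `FormData`, and `Blob` are built in. The endpoint URL, the bearer header, and the `file` form field come from the curl example; the helper names and the error handling are assumptions, since the post does not describe the API's error responses.

```javascript
// Endpoint and "file" field name are taken from the curl example;
// the helper names below are hypothetical.
const PARSE_URL = "https://shapeforge-4vqx.polsia.app/api/parse";

// Build the multipart request without sending it, which keeps it easy to inspect.
function buildParseRequest(apiKey, fileBytes, filename) {
  const form = new FormData();
  form.append("file", new Blob([fileBytes]), filename);
  return {
    url: PARSE_URL,
    options: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form, // fetch sets the multipart Content-Type and boundary itself
    },
  };
}

// Send the request and return the parsed JSON body.
async function parseInvoice(apiKey, fileBytes, filename) {
  const { url, options } = buildParseRequest(apiKey, fileBytes, filename);
  const response = await fetch(url, options);
  if (!response.ok) {
    // Assumption: non-2xx statuses signal failure; the post does not say.
    throw new Error(`parse failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling `parseInvoice(key, bytes, "invoice.pdf")` with the file's bytes would then resolve to an object shaped like the JSON response shown above.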

Pricing breakdown

AWS Textract pricing is tiered by feature and volume. To do a full invoice extraction (detect text + analyze forms + analyze tables), you pay for each feature separately. Estimated monthly cost at 10,000, 50,000, and 100,000 pages:

Service | 10k pages/mo | 50k pages/mo | 100k pages/mo
AWS Textract (Forms + Tables) | ~$150 | ~$675 | ~$1,300
ShapeForge (Growth plan) | $99 | $99 + overages | ~$350

At 10,000 pages/month, ShapeForge is approximately 34% cheaper than Textract's full-feature pricing ($99 vs ~$150). At 100k pages/month, the gap widens to roughly 73% (~$350 vs ~$1,300). And unlike Textract, ShapeForge has no per-feature surcharges: tables, forms, and raw text are all included in a single parse call.
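The savings figures follow directly from the pricing table. As a quick check, here is the arithmetic as a sketch, using only the approximate monthly costs quoted in the table. The 50k column is left out because the ShapeForge figure there is "$99 + overages" with no number given.

```javascript
// Percentage by which `alternative` undercuts `baseline`.
function percentCheaper(baseline, alternative) {
  return ((baseline - alternative) / baseline) * 100;
}

// Approximate monthly costs in USD, copied from the pricing table.
const monthly = [
  { volume: "10k", textract: 150, shapeforge: 99 },
  { volume: "100k", textract: 1300, shapeforge: 350 },
];

const savings = monthly.map((row) => ({
  volume: row.volume,
  percent: Math.round(percentCheaper(row.textract, row.shapeforge)),
}));
console.log(savings);
// [ { volume: '10k', percent: 34 }, { volume: '100k', percent: 73 } ]
```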

Scanned image support: the real differentiator

If your use case only involves clean, digitally-generated PDFs, Textract and ShapeForge are roughly comparable on accuracy (96.4% vs 97.1%). The gap is narrow enough that developer experience and pricing become the primary decision factors.

But if you're processing scanned documents — and most real-world AP workflows include them — the 12.5-point accuracy gap (79.3% vs 91.8%) is a business problem. At 79% accuracy across 12 fields, you're looking at roughly 2–3 incorrect fields per invoice. At scale that means human review queues, failed automations, and incorrect GL entries.
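The "2–3 incorrect fields" estimate is the expected number of wrong fields per invoice: fields per invoice times the error rate. A small sketch of that arithmetic, using the scanned-image accuracies from the comparison table and treating accuracy as a uniform per-field rate (a simplifying assumption):

```javascript
// Expected wrong fields = number of fields * (1 - per-field accuracy).
function expectedWrongFields(fieldsPerInvoice, accuracy) {
  return fieldsPerInvoice * (1 - accuracy);
}

const FIELDS_PER_INVOICE = 12;
const textractWrong = expectedWrongFields(FIELDS_PER_INVOICE, 0.793);
const shapeforgeWrong = expectedWrongFields(FIELDS_PER_INVOICE, 0.918);

console.log(textractWrong.toFixed(2));   // 2.48
console.log(shapeforgeWrong.toFixed(2)); // 0.98
```

In other words, on scanned invoices the benchmark's numbers imply about two and a half wrong fields per invoice for Textract versus about one for ShapeForge.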

ShapeForge handles scanned images natively via the same API endpoint. You send a JPG, PNG, or TIFF the same way you'd send a PDF:

Scanned image — same API, same response
curl -X POST \
  https://shapeforge-4vqx.polsia.app/api/parse \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "file=@scanned_invoice.jpg"
# Works identically for JPG, PNG, TIFF, PDF, multi-page PDFs

No separate OCR pipeline to configure. No separate pricing tier for images. No additional SDK or integration work. The same single endpoint handles everything — and the response format is identical regardless of input type.

The verdict

Bottom line

For new projects: ShapeForge wins on every dimension that matters for invoice parsing: accuracy on scans (+12.5 percentage points), latency (2.3x faster p50), integration complexity (3 lines vs 60+), and pricing (34% cheaper at 10k pages). The only reason to choose Textract today is if you're already deeply embedded in the AWS ecosystem and the migration cost outweighs these advantages.

If you're evaluating document parsing APIs for a new project, or you're tired of Textract's SDK complexity and want to decouple from AWS, the numbers here make a clear case for switching.

When Textract is still the right choice

We believe in honest comparisons. There are specific scenarios where sticking with Textract makes sense:

For everyone else — especially startups, indie developers, and mid-market finance teams building their first AP automation — ShapeForge is the faster path to production.

See it for yourself

Upload a real invoice and get structured JSON back in under 2 seconds. No AWS account. No SDK. No credit card.


100 free documents included with every new account.