PdfOcrExtraction@1

The PdfOcrExtraction@1 node extracts text and data from PDF files using Optical Character Recognition (OCR). It uses IronOCR to process PDF documents, including scanned documents, and can extract text, tables, and barcodes.

Adapter Prerequisites

Node Configuration

For the fields path, targetPath, targetValueWriteMode, and targetValueKind, see Overview.

transformations:
  - type: PdfOcrExtraction@1
    path: $.PdfContent # The path to the base64-encoded PDF content
    targetPath: $.ExtractedText # The path where the extracted text will be stored
    targetValueWriteMode: Overwrite # The target value write mode
    targetValueKind: Simple # The target value kind
    pageNumbers: # Specific page numbers to process (optional, if not set all pages will be processed)
      - 1
      - 2
      - 5
    language: "en" # OCR language code (e.g., 'en', 'de', 'fr', default: "en")
    extractTables: false # Whether to extract tables from the PDF (default: false)
    tablesOutputPath: $.Tables # Output path for extracted tables (default: $.Tables)
    extractBarcodes: false # Whether to extract barcodes from the PDF (default: false)
    barcodesOutputPath: $.Barcodes # Output path for extracted barcodes (default: $.Barcodes)
    includeConfidence: false # Whether to include OCR confidence score in output (default: false)
    confidenceOutputPath: $.Confidence # Output path for OCR confidence score (default: $.Confidence)
    continueOnError: false # Whether to continue processing if OCR extraction fails (default: false)

Configuration Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| pageNumbers | int[] | No | All pages | Specific page numbers to process |
| language | string | No | en | OCR language code |
| extractTables | bool | No | false | Whether to extract tables from the PDF |
| tablesOutputPath | string | No | $.Tables | Output path for extracted tables |
| extractBarcodes | bool | No | false | Whether to extract barcodes from the PDF |
| barcodesOutputPath | string | No | $.Barcodes | Output path for extracted barcodes |
| includeConfidence | bool | No | false | Whether to include OCR confidence score |
| confidenceOutputPath | string | No | $.Confidence | Output path for OCR confidence score |
| continueOnError | bool | No | false | Whether to continue if extraction fails |

Supported Languages

The node supports the following language codes for OCR:

  • en or english - English
  • de or german - German
  • fr or french - French
  • es or spanish - Spanish
  • it or italian - Italian
  • pt or portuguese - Portuguese
  • nl or dutch - Dutch
  • ru or russian - Russian
  • zh or chinese - Chinese (Simplified)
  • ja or japanese - Japanese
  • ko or korean - Korean
  • ar or arabic - Arabic

Input Requirements

  • The input must be base64-encoded PDF file content
  • Maximum file size: 1 MB (1,000,000 bytes)
  • The PDF content must be available at the specified path
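
As an illustration of these requirements, the sketch below prepares a request body on the client side. It assumes the 1,000,000-byte limit applies to the raw PDF bytes rather than to the base64 string (base64 grows the data by roughly one third), and the field name `PdfContent` is only an example that must match the node's `path`. The helper names are hypothetical.

```python
import base64

# Limit stated in the requirements above; assumed to apply to the raw PDF bytes.
MAX_PDF_BYTES = 1_000_000


def encode_pdf(pdf_bytes: bytes) -> str:
    """Return the base64 text for a PDF, refusing files over the size limit."""
    if len(pdf_bytes) > MAX_PDF_BYTES:
        raise ValueError(
            f"PDF is {len(pdf_bytes)} bytes; the limit is {MAX_PDF_BYTES} bytes"
        )
    return base64.b64encode(pdf_bytes).decode("ascii")


def build_payload(pdf_bytes: bytes, field: str = "PdfContent") -> dict:
    """Place the encoded PDF under the field that the node's `path` points to."""
    return {field: encode_pdf(pdf_bytes)}
```

With `path: $.PdfContent` in the node configuration, `build_payload(open("scan.pdf", "rb").read())` would produce a message body of the expected shape.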

Output Structure

The node can produce multiple outputs based on configuration:

  1. Main text output (targetPath): The extracted text from the PDF
  2. Tables (tablesOutputPath): Structured table data if extractTables is enabled
  3. Barcodes (barcodesOutputPath): Barcode data if extractBarcodes is enabled
  4. Confidence score (confidenceOutputPath): OCR confidence percentage if includeConfidence is enabled
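
The relationship between the flags and the output paths can be sketched as a small function. It takes a node configuration as a plain dictionary and returns the paths that would be written, using the defaults from the parameter table. This is an illustrative model of the documented behavior, not the node's implementation, and the function name is hypothetical.

```python
def expected_output_paths(node: dict) -> dict:
    """Return the output paths a node configuration is documented to populate."""
    outputs = {"text": node["targetPath"]}  # the main text output is always written
    if node.get("extractTables", False):
        outputs["tables"] = node.get("tablesOutputPath", "$.Tables")
    if node.get("extractBarcodes", False):
        outputs["barcodes"] = node.get("barcodesOutputPath", "$.Barcodes")
    if node.get("includeConfidence", False):
        outputs["confidence"] = node.get("confidenceOutputPath", "$.Confidence")
    return outputs
```

For a configuration with `extractTables: true` and no `tablesOutputPath`, the function reports `$.Tables`, which matches the documented default.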

Error Handling

  • If continueOnError is set to false (default), the pipeline will stop if OCR extraction fails
  • If continueOnError is set to true, errors will be logged but the pipeline will continue
  • Common errors include: missing PDF content, file size exceeded, invalid base64 encoding, or OCR processing failures
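
Several of the common errors can be caught before a message reaches the node. The sketch below is a hypothetical pre-check that a caller might run. The returned strings simply reuse the wording above and are not the node's actual error messages, and the size limit is assumed to apply to the decoded bytes.

```python
import base64
import binascii
from typing import Optional

MAX_PDF_BYTES = 1_000_000  # documented maximum; assumed to apply to decoded bytes


def check_pdf_field(message: dict, field: str) -> Optional[str]:
    """Return a description of the first problem found, or None if none is found."""
    value = message.get(field)
    if not value:
        return "missing PDF content"
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError, TypeError):
        return "invalid base64 encoding"
    if len(raw) > MAX_PDF_BYTES:
        return "file size exceeded"
    return None
```

OCR processing failures cannot be predicted this way, so `continueOnError` remains the mechanism for deciding how the pipeline reacts to them.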

Example: Basic Text Extraction

triggers:
  - type: FromHttpRequest@1
    path: /extract/pdf
    method: POST
transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfData
    targetPath: $.extractedContent
    language: "en"

Example: Extract Text with Tables and Confidence

transformations:
  - type: PdfOcrExtraction@1
    path: $.documentPdf
    targetPath: $.text
    language: "en"
    extractTables: true
    tablesOutputPath: $.tables
    includeConfidence: true
    confidenceOutputPath: $.ocrConfidence

Example: Process Specific Pages with Barcode Extraction

transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfContent
    targetPath: $.extractedText
    pageNumbers:
      - 1
      - 3
      - 5
    language: "de"
    extractBarcodes: true
    barcodesOutputPath: $.barcodes
    continueOnError: true

Performance Considerations

Note: OCR processing can be computationally intensive, especially for:

  • Large PDF files
  • Documents with many pages
  • High-resolution scanned images
  • Complex layouts with tables and graphics

Consider using pageNumbers to limit processing to specific pages when full document extraction is not required.