PdfOcrExtraction@1

The PdfOcrExtraction@1 node extracts text and data from PDF files using Optical Character Recognition (OCR). It uses IronOCR to process PDF documents, including scanned ones, and can extract text, tables, and barcodes.

Adapter Prerequisites

Mesh Adapter

Node Configuration

For fields path, targetPath, targetValueWriteMode, and targetValueKind, see Overview.

transformations:
  - type: PdfOcrExtraction@1
    path: $.PdfContent # The path to the base64-encoded PDF content
    targetPath: $.ExtractedText # The path where the extracted text will be stored
    targetValueWriteMode: Overwrite # The target value write mode
    targetValueKind: Simple # The target value kind
    pageNumbers: # Specific page numbers to process (optional; if not set, all pages are processed)
      - 1
      - 2
      - 5
    language: "en" # OCR language code, e.g. 'en', 'de', 'fr' (default: "en")
    extractTables: false # Whether to extract tables from the PDF (default: false)
    tablesOutputPath: $.Tables # Output path for extracted tables (default: $.Tables)
    extractBarcodes: false # Whether to extract barcodes from the PDF (default: false)
    barcodesOutputPath: $.Barcodes # Output path for extracted barcodes (default: $.Barcodes)
    includeConfidence: false # Whether to include OCR confidence score in output (default: false)
    confidenceOutputPath: $.Confidence # Output path for OCR confidence score (default: $.Confidence)
    continueOnError: false # Whether to continue processing if OCR extraction fails (default: false)

Configuration Parameters

Parameter	Type	Required	Default	Description
`pageNumbers`	int[]	No	All pages	Specific page numbers to process
`language`	string	No	`en`	OCR language code
`extractTables`	bool	No	false	Whether to extract tables from the PDF
`tablesOutputPath`	string	No	`$.Tables`	Output path for extracted tables
`extractBarcodes`	bool	No	false	Whether to extract barcodes from the PDF
`barcodesOutputPath`	string	No	`$.Barcodes`	Output path for extracted barcodes
`includeConfidence`	bool	No	false	Whether to include OCR confidence score
`confidenceOutputPath`	string	No	`$.Confidence`	Output path for OCR confidence score
`continueOnError`	bool	No	false	Whether to continue if extraction fails

Supported Languages

The node supports the following language codes for OCR:

en or english - English
de or german - German
fr or french - French
es or spanish - Spanish
it or italian - Italian
pt or portuguese - Portuguese
nl or dutch - Dutch
ru or russian - Russian
zh or chinese - Chinese (Simplified)
ja or japanese - Japanese
ko or korean - Korean
ar or arabic - Arabic
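
When the language value comes from user input or another system, it can help to normalize it before writing it into the node configuration. The following is a minimal client-side sketch, not part of the node itself: the alias table mirrors the list above, while the `normalize_language` helper and its case-insensitive matching are assumptions made for illustration.

```python
# Hypothetical helper: maps the documented aliases to their short codes.
# The alias table is copied from the list above; lowercasing the input is
# an assumption of this sketch, not documented node behavior.
LANGUAGE_ALIASES = {
    "en": "en", "english": "en",
    "de": "de", "german": "de",
    "fr": "fr", "french": "fr",
    "es": "es", "spanish": "es",
    "it": "it", "italian": "it",
    "pt": "pt", "portuguese": "pt",
    "nl": "nl", "dutch": "nl",
    "ru": "ru", "russian": "ru",
    "zh": "zh", "chinese": "zh",
    "ja": "ja", "japanese": "ja",
    "ko": "ko", "korean": "ko",
    "ar": "ar", "arabic": "ar",
}


def normalize_language(value: str) -> str:
    """Return the short OCR language code for a documented alias."""
    key = value.strip().lower()
    if key not in LANGUAGE_ALIASES:
        raise ValueError(f"Unsupported OCR language: {value!r}")
    return LANGUAGE_ALIASES[key]
```

Rejecting unknown values up front keeps an unsupported language from reaching the OCR step, where the failure would be harder to diagnose.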

Input Requirements

The input must be base64-encoded PDF file content
Maximum file size: 1 MB (1,000,000 bytes)
The PDF content must be available at the specified path
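
A sender can meet these requirements by encoding the PDF and checking its size before submitting it. The sketch below is illustrative only: the `build_payload` helper is hypothetical, the `PdfContent` field name is taken from the `path: $.PdfContent` configuration shown earlier, and it assumes the 1 MB limit applies to the raw (decoded) PDF bytes, which the text above does not state explicitly.

```python
import base64
import json

# Limit stated above. Assumption: it is measured on the raw PDF bytes,
# not on the longer base64 string.
MAX_PDF_BYTES = 1_000_000


def build_payload(pdf_bytes: bytes, field: str = "PdfContent") -> str:
    """Return a JSON body with the PDF base64-encoded under `field`."""
    if len(pdf_bytes) > MAX_PDF_BYTES:
        raise ValueError(
            f"PDF is {len(pdf_bytes)} bytes; the limit is {MAX_PDF_BYTES}"
        )
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({field: encoded})
```

With this body, a node configured with `path: $.PdfContent` would find the encoded PDF at the expected location.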

Output Structure

The node can produce multiple outputs based on configuration:

Main text output (targetPath): The extracted text from the PDF
Tables (tablesOutputPath): Structured table data if extractTables is enabled
Barcodes (barcodesOutputPath): Barcode data if extractBarcodes is enabled
Confidence score (confidenceOutputPath): OCR confidence percentage if includeConfidence is enabled

Error Handling

If continueOnError is set to false (default), the pipeline will stop if OCR extraction fails
If continueOnError is set to true, errors will be logged but the pipeline will continue
Common errors include: missing PDF content, file size exceeded, invalid base64 encoding, or OCR processing failures
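
Three of the common errors listed above (missing content, invalid base64, and size exceeded) can be caught before the document reaches the node. The following sketch shows one way to do that on the sending side. The `precheck_pdf_field` function and its return strings are hypothetical and do not reproduce the node's actual error messages; the size check again assumes the limit applies to the decoded bytes.

```python
import base64
import binascii
from typing import Optional

MAX_PDF_BYTES = 1_000_000  # limit from Input Requirements (assumed: decoded size)


def precheck_pdf_field(document: dict, field: str) -> Optional[str]:
    """Return a short problem description, or None if the field looks usable."""
    value = document.get(field)
    if not isinstance(value, str) or not value:
        return "missing PDF content"
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return "invalid base64 encoding"
    if len(raw) > MAX_PDF_BYTES:
        return "file size exceeded"
    return None
```

OCR processing failures cannot be predicted this way, so `continueOnError` remains the relevant setting for those.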

Example: Basic Text Extraction

triggers:
  - type: FromHttpRequest@1
    path: /extract/pdf
    method: POST
transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfData
    targetPath: $.extractedContent
    language: "en"

Example: Extract Text with Tables and Confidence

transformations:
  - type: PdfOcrExtraction@1
    path: $.documentPdf
    targetPath: $.text
    language: "en"
    extractTables: true
    tablesOutputPath: $.tables
    includeConfidence: true
    confidenceOutputPath: $.ocrConfidence

Example: Process Specific Pages with Barcode Extraction

transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfContent
    targetPath: $.extractedText
    pageNumbers:
      - 1
      - 3
      - 5
    language: "de"
    extractBarcodes: true
    barcodesOutputPath: $.barcodes
    continueOnError: true

Performance Considerations

OCR processing can be computationally intensive, especially for:

Large PDF files
Documents with many pages
High-resolution scanned images
Complex layouts with tables and graphics

Consider using pageNumbers to limit processing to specific pages when full document extraction is not required.
