PdfOcrExtraction@1
Node PdfOcrExtraction@1 is used to extract text and data from PDF files using Optical Character Recognition (OCR). This node uses IronOCR to process PDF documents, including scanned documents, and can extract text, tables, and barcodes.
Adapter Prerequisites
Node Configuration
For the fields path, targetPath, targetValueWriteMode, and targetValueKind, see Overview.
transformations:
  - type: PdfOcrExtraction@1
    path: $.PdfContent # The path to the base64-encoded PDF content
    targetPath: $.ExtractedText # The path where the extracted text will be stored
    targetValueWriteMode: Overwrite # The target value write mode
    targetValueKind: Simple # The target value kind
    pageNumbers: # Specific page numbers to process (optional, if not set all pages will be processed)
      - 1
      - 2
      - 5
    language: "en" # OCR language code (e.g., 'en', 'de', 'fr', default: "en")
    extractTables: false # Whether to extract tables from the PDF (default: false)
    tablesOutputPath: $.Tables # Output path for extracted tables (default: $.Tables)
    extractBarcodes: false # Whether to extract barcodes from the PDF (default: false)
    barcodesOutputPath: $.Barcodes # Output path for extracted barcodes (default: $.Barcodes)
    includeConfidence: false # Whether to include OCR confidence score in output (default: false)
    confidenceOutputPath: $.Confidence # Output path for OCR confidence score (default: $.Confidence)
    continueOnError: false # Whether to continue processing if OCR extraction fails (default: false)
Configuration Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
pageNumbers | int[] | No | All pages | Specific page numbers to process |
language | string | No | en | OCR language code |
extractTables | bool | No | false | Whether to extract tables from the PDF |
tablesOutputPath | string | No | $.Tables | Output path for extracted tables |
extractBarcodes | bool | No | false | Whether to extract barcodes from the PDF |
barcodesOutputPath | string | No | $.Barcodes | Output path for extracted barcodes |
includeConfidence | bool | No | false | Whether to include OCR confidence score |
confidenceOutputPath | string | No | $.Confidence | Output path for OCR confidence score |
continueOnError | bool | No | false | Whether to continue if extraction fails |
Supported Languages
The node supports the following language codes for OCR:
- en or english - English
- de or german - German
- fr or french - French
- es or spanish - Spanish
- it or italian - Italian
- pt or portuguese - Portuguese
- nl or dutch - Dutch
- ru or russian - Russian
- zh or chinese - Chinese (Simplified)
- ja or japanese - Japanese
- ko or korean - Korean
- ar or arabic - Arabic
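When configuration is generated programmatically, it can help to validate the language value before it reaches the node. The following is a minimal Python sketch built only from the list above; the helper name normalize_language and the case-insensitive matching are assumptions of this example, not documented behavior of the node.

```python
# Long name -> short code, taken from the supported-language list above.
LANGUAGE_ALIASES = {
    "english": "en", "german": "de", "french": "fr", "spanish": "es",
    "italian": "it", "portuguese": "pt", "dutch": "nl", "russian": "ru",
    "chinese": "zh", "japanese": "ja", "korean": "ko", "arabic": "ar",
}
SUPPORTED_CODES = set(LANGUAGE_ALIASES.values())


def normalize_language(value: str = "en") -> str:
    """Return the short code for a supported language, or raise ValueError.

    Hypothetical helper: lowercasing is a choice made here on the caller
    side; the node itself may or may not be case-insensitive.
    """
    key = value.strip().lower()
    if key in SUPPORTED_CODES:
        return key
    if key in LANGUAGE_ALIASES:
        return LANGUAGE_ALIASES[key]
    raise ValueError(f"Unsupported OCR language: {value!r}")
```

Calling normalize_language("German") yields the short code that can be written to the language field, while an unlisted value fails early instead of at OCR time.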
Input Requirements
- The input must be a base64-encoded PDF file content
- Maximum file size: 1 MB (1,000,000 bytes)
- The PDF content must be available at the specified path
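A caller has to produce the base64 string itself. The sketch below shows one way to do that in Python with the standard library. It assumes the 1,000,000-byte limit applies to the raw PDF bytes rather than to the encoded string, which this page does not specify; encode_pdf is a hypothetical helper name.

```python
import base64

# Limit stated in the requirements above. ASSUMPTION: it is measured on the
# raw PDF bytes, not on the (roughly one third larger) base64 string.
MAX_PDF_BYTES = 1_000_000


def encode_pdf(pdf_bytes: bytes) -> str:
    """Base64-encode PDF bytes after checking them against the size limit."""
    if not pdf_bytes:
        raise ValueError("PDF content is empty")
    if len(pdf_bytes) > MAX_PDF_BYTES:
        raise ValueError(
            f"PDF is {len(pdf_bytes)} bytes; the limit is {MAX_PDF_BYTES}"
        )
    return base64.b64encode(pdf_bytes).decode("ascii")
```

The returned string is what would be placed at the location named by path, for example the PdfContent field of the input document.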
Output Structure
The node can produce multiple outputs based on configuration:
- Main text output (targetPath): The extracted text from the PDF
- Tables (tablesOutputPath): Structured table data if extractTables is enabled
- Barcodes (barcodesOutputPath): Barcode data if extractBarcodes is enabled
- Confidence score (confidenceOutputPath): OCR confidence percentage if includeConfidence is enabled
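To make the relationship between flags and outputs concrete, the following Python sketch lists which paths a given node configuration is expected to write, using the defaults from the parameter table. The function name and the dictionary keys ("text", "tables", and so on) are illustrative labels of this example only; the shape of the data stored at each path is not described here.

```python
def expected_output_paths(node: dict) -> dict:
    """Map each enabled output to the path it should be written to.

    `node` is one PdfOcrExtraction@1 entry parsed into a dict. The default
    paths mirror the parameter table; the keys are hypothetical labels.
    """
    outputs = {"text": node["targetPath"]}
    if node.get("extractTables", False):
        outputs["tables"] = node.get("tablesOutputPath", "$.Tables")
    if node.get("extractBarcodes", False):
        outputs["barcodes"] = node.get("barcodesOutputPath", "$.Barcodes")
    if node.get("includeConfidence", False):
        outputs["confidence"] = node.get("confidenceOutputPath", "$.Confidence")
    return outputs
```

For a node with only targetPath set, the result contains the text entry alone, because all three optional outputs default to disabled.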
Error Handling
- If continueOnError is set to false (default), the pipeline will stop if OCR extraction fails
- If continueOnError is set to true, errors will be logged but the pipeline will continue
- Common errors include: missing PDF content, file size exceeded, invalid base64 encoding, or OCR processing failures
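Three of the common errors above can be detected before a document is sent to the pipeline. The Python sketch below is a client-side pre-check, not the node's own validation logic; precheck_pdf_content is a hypothetical name, and the size limit is assumed to apply to the decoded bytes.

```python
import base64
import binascii


def precheck_pdf_content(document: dict, field: str,
                         max_bytes: int = 1_000_000) -> list:
    """Return a list of problems found in document[field]; empty means OK.

    Covers missing content, invalid base64, and size exceeded. OCR
    processing failures can only be observed by the node itself.
    """
    content = document.get(field)
    if not content:
        return ["missing PDF content"]
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(content, validate=True)
    except (binascii.Error, ValueError):
        return ["invalid base64 encoding"]
    if len(raw) > max_bytes:
        return ["file size exceeded"]
    return []
```

Running this check before submission is most useful when continueOnError is false, since a single bad document would otherwise stop the pipeline.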
Example: Basic Text Extraction
triggers:
  - type: FromHttpRequest@1
    path: /extract/pdf
    method: POST
transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfData
    targetPath: $.extractedContent
    language: "en"
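A client for the configuration above might look like the following Python sketch. It only builds the request and does not send it. Two things are assumptions: that the JSON request body becomes the document root, so that the pdfData field is reachable as $.pdfData, and the base URL, which depends on the deployment. The function name build_extract_request is hypothetical.

```python
import base64
import json
import urllib.request


def build_extract_request(pdf_bytes: bytes, base_url: str) -> urllib.request.Request:
    """Build a POST request for the /extract/pdf trigger shown above.

    ASSUMPTION: the JSON body is exposed to the pipeline as the document
    root, so "pdfData" matches the node's path of $.pdfData.
    """
    body = json.dumps(
        {"pdfData": base64.b64encode(pdf_bytes).decode("ascii")}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/extract/pdf",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen once the actual host and any required authentication are known.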
Example: Extract Text with Tables and Confidence
transformations:
  - type: PdfOcrExtraction@1
    path: $.documentPdf
    targetPath: $.text
    language: "en"
    extractTables: true
    tablesOutputPath: $.tables
    includeConfidence: true
    confidenceOutputPath: $.ocrConfidence
Example: Process Specific Pages with Barcode Extraction
transformations:
  - type: PdfOcrExtraction@1
    path: $.pdfContent
    targetPath: $.extractedText
    pageNumbers:
      - 1
      - 3
      - 5
    language: "de"
    extractBarcodes: true
    barcodesOutputPath: $.barcodes
    continueOnError: true
Performance Considerations
Note: OCR processing can be computationally intensive, especially for:
- Large PDF files
- Documents with many pages
- High-resolution scanned images
- Complex layouts with tables and graphics
Consider using pageNumbers to limit processing to specific pages when full document extraction is not required.