PDF Document Loader
Enhanced: direct integration with Langfuse tracing.
PDF (Portable Document Format) is a file format developed by Adobe for presenting documents consistently across software platforms. This module loads and processes PDF files using pdf.js.
The loader can:
- Load single or multiple PDF files
- Split documents by page or file
- Support base64 encoded files
- Handle file storage integration
- Process content with text splitters
- Support legacy PDF versions
- Customize metadata extraction
Inputs
Required Parameters
- PDF File: The PDF file(s) to process (.pdf extension)
- Usage: Choose between:
  - One document per page
  - One document per file
Optional Parameters
- Text Splitter: A text splitter to process the extracted content
- Use Legacy Build: Whether to use the legacy pdf.js build
- Additional Metadata: JSON object with additional metadata
- Omit Metadata Keys: Comma-separated list of metadata keys to omit
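The two metadata options can be pictured as a merge followed by a filter. The sketch below is only an illustration: `applyMetadataOptions` and `MetadataMap` are hypothetical names, and the order of operations (merge the additional metadata first, then drop the omitted keys) is an assumption rather than the loader's confirmed behaviour.

```typescript
// Hypothetical helper, not the loader's actual code.
type MetadataMap = Record<string, unknown>;

function applyMetadataOptions(
  metadata: MetadataMap,
  additionalMetadataJson?: string,
  omitKeysCsv?: string
): MetadataMap {
  // Parse the user-supplied JSON object; a missing or blank value adds nothing.
  const extra: MetadataMap =
    additionalMetadataJson && additionalMetadataJson.trim().length > 0
      ? (JSON.parse(additionalMetadataJson) as MetadataMap)
      : {};

  // Split the comma-separated list, ignoring blanks and stray whitespace.
  const omit = new Set(
    (omitKeysCsv ?? "")
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );

  // Assumption: custom values override extracted ones, then keys are dropped.
  const merged: MetadataMap = { ...metadata, ...extra };
  const result: MetadataMap = {};
  for (const key of Object.keys(merged)) {
    if (!omit.has(key)) {
      result[key] = merged[key];
    }
  }
  return result;
}
```

For example, calling it with `{ source: "a.pdf", page: 1 }`, the JSON string `{"team":"docs"}`, and the omit list `page` would keep `source` and `team` and drop `page`.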
Outputs
- Document: Array of document objects containing metadata and pageContent
- Text: A single string concatenating the pageContent of all documents
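The relationship between the two outputs can be sketched as follows. `LoadedDocument` and `documentsToText` are hypothetical names, and the newline separator is an assumption, since the description above does not say how the pieces are joined.

```typescript
// Hypothetical shape of one item in the Document output.
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Build the Text output by joining every document's pageContent.
// Assumption: a newline separator; the real loader may join differently.
function documentsToText(docs: LoadedDocument[], separator = "\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}
```

The Text output discards metadata, so the Document output is the better choice when page numbers or sources are needed downstream.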
Features
- Multiple file support
- Page-level splitting
- Legacy version support
- Text extraction
- Metadata handling
- Error handling
- Memory-efficient processing
Processing Modes
Per Page Mode
- Each page becomes a document
- Preserves page numbers
- Individual page metadata
- Granular content access
Per File Mode
- Entire PDF as one document
- Combined content
- Single metadata set
- Memory efficient
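A minimal sketch of how the two modes differ, given the text of each page that has already been extracted. The names `buildDocuments`, `PageDocument`, and `UsageMode` are hypothetical; the 1-based page numbering and the blank-line separator used in per-file mode are assumptions.

```typescript
interface PageDocument {
  pageContent: string;
  metadata: { source: string; page?: number };
}

type UsageMode = "perPage" | "perFile";

function buildDocuments(
  pageTexts: string[],
  source: string,
  mode: UsageMode
): PageDocument[] {
  if (mode === "perPage") {
    // One document per page, each carrying its own page number (assumed 1-based).
    return pageTexts.map((text, index) => ({
      pageContent: text,
      metadata: { source, page: index + 1 },
    }));
  }
  // One document for the whole file: combined content, a single metadata set.
  return [{ pageContent: pageTexts.join("\n\n"), metadata: { source } }];
}
```

Per-page mode suits retrieval that needs to cite a page; per-file mode keeps the document count low when page boundaries do not matter.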
Document Structure
Each document contains:
- pageContent: Extracted text content
- metadata:
  - source: Original file path
  - pdf: PDF-specific metadata
  - page: Page number (in per-page mode)
  - Additional custom metadata
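The structure above could be typed roughly as follows, with a small runtime check for code that consumes the loader's output. `PdfDocument` and `isPdfDocument` are hypothetical names, and the exact contents of the `pdf` field are not specified here, so it is left as an open record.

```typescript
// Hypothetical typing of the documented structure.
interface PdfDocument {
  pageContent: string;
  metadata: {
    source: string;
    pdf?: Record<string, unknown>; // PDF-specific metadata, shape not specified
    page?: number; // present in per-page mode only
    [customKey: string]: unknown; // additional custom metadata
  };
}

// Runtime guard: checks only the fields described in the list above.
function isPdfDocument(value: unknown): value is PdfDocument {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { pageContent?: unknown; metadata?: unknown };
  if (typeof candidate.pageContent !== "string") return false;
  if (typeof candidate.metadata !== "object" || candidate.metadata === null) {
    return false;
  }
  const metadata = candidate.metadata as { source?: unknown; page?: unknown };
  if (typeof metadata.source !== "string") return false;
  return metadata.page === undefined || typeof metadata.page === "number";
}
```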
File Handling
Local Files
- Direct file loading
- Base64 encoded content
- Multiple file support
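For base64 encoded content, the bytes must be decoded before pdf.js can parse them. The sketch below assumes the input is either a bare base64 string or a data URI such as `data:application/pdf;base64,...`; the input format actually accepted by the loader is not stated here, and `decodeBase64Pdf` is a hypothetical helper. It relies on the global `atob`, which is available in browsers and recent Node.js versions.

```typescript
// Hypothetical helper: decode base64 PDF content into raw bytes.
function decodeBase64Pdf(input: string): Uint8Array {
  // Drop an optional data-URI prefix that ends in "base64,".
  const marker = "base64,";
  const markerIndex = input.indexOf(marker);
  const payload =
    markerIndex >= 0 ? input.slice(markerIndex + marker.length) : input;

  // atob yields a binary string with one character per byte.
  const binary = atob(payload.trim());
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Sanity check: a well-formed PDF begins with the "%PDF-" header.
  const header = "%PDF-";
  for (let i = 0; i < header.length; i++) {
    if (bytes[i] !== header.charCodeAt(i)) {
      throw new Error("Decoded content does not start with a PDF header");
    }
  }
  return bytes;
}
```

Checking the header early gives a clear error for invalid uploads instead of a parser failure later on.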
Storage Integration
- File storage system support
- Organization-based storage
- Chatflow-based storage
Notes
- Uses pdf.js for extraction
- Legacy version support
- Memory-efficient processing
- Error handling for invalid files
- Support for large PDFs
- Flexible output formats
- Metadata customization
- Text encoding handling