PDF Document Loader

🚀

Enhanced

Direct integration with Langfuse tracing

PDF (Portable Document Format) is a file format developed by Adobe for presenting documents consistently across software platforms. This module provides functionality to load and process PDF files using pdf.js.

This module provides a sophisticated PDF document loader that can:

  • Load single or multiple PDF files
  • Split documents by page or file
  • Support base64 encoded files
  • Handle file storage integration
  • Process content with text splitters
  • Support legacy PDF versions
  • Customize metadata extraction

Inputs

Required Parameters

  • PDF File: The PDF file(s) to process (.pdf extension)
  • Usage: Choose between:
    • One document per page
    • One document per file

Optional Parameters

  • Text Splitter: A text splitter to process the extracted content
  • Use Legacy Build: Whether to use legacy PDF.js build
  • Additional Metadata: JSON object with additional metadata
  • Omit Metadata Keys: Comma-separated list of metadata keys to omit

Outputs

  • Document: Array of document objects containing metadata and pageContent
  • Text: Concatenated string from pageContent of documents

Features

  • Multiple file support
  • Page-level splitting
  • Legacy version support
  • Text extraction
  • Metadata handling
  • Error handling
  • Memory-efficient processing

Processing Modes

Per Page Mode

  • Each page becomes a document
  • Preserves page numbers
  • Individual page metadata
  • Granular content access

Per File Mode

  • Entire PDF as one document
  • Combined content
  • Single metadata set
  • Memory efficient

Document Structure

Each document contains:

  • pageContent: Extracted text content
  • metadata:
    • source: Original file path
    • pdf: PDF-specific metadata
    • page: Page number (in per-page mode)
    • Additional custom metadata

File Handling

Local Files

  • Direct file loading
  • Base64 encoded content
  • Multiple file support

Storage Integration

  • File storage system support
  • Organization-based storage
  • Chatflow-based storage

Notes

  • Uses pdf.js for extraction
  • Legacy version support
  • Memory-efficient processing
  • Error handling for invalid files
  • Support for large PDFs
  • Flexible output formats
  • Metadata customization
  • Text encoding handling