Unstructured Folder Loader
Unstructured Folder Loader Node
The Unstructured Folder Loader uses Unstructured.io to load and process multiple documents from a folder. It provides advanced document parsing capabilities with extensive configuration options for OCR, chunking, and metadata extraction.
⚠️ .png and .heic files are not currently supported; support depends on an update to the unstructured dependency.
Features
- Batch processing of multiple documents
- Multiple processing strategies
- OCR support with 15+ languages
- Flexible chunking strategies
- Table structure inference
- XML processing options
- Page break handling
- Coordinate extraction
- Metadata customization
Configuration
API Setup
- Default API URL: http://localhost:8000/general/v0/general
- Can be configured via environment variable: UNSTRUCTURED_API_URL
- Optional API key authentication
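The lookup described above can be sketched in a few lines. This is a minimal illustration, not the node's actual code: the function name `resolve_api_settings`, the precedence order (explicit value, then environment variable, then default), and the `unstructured-api-key` header name are assumptions that should be checked against your deployment.

```python
import os

DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_settings(node_url=None, api_key=None, env=None):
    """Return (url, headers) for a request to the Unstructured API.

    Hypothetical helper: the precedence used here is an assumption.
    """
    env = os.environ if env is None else env
    # Explicit node setting wins, then the environment variable, then the default.
    url = node_url or env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL
    headers = {"accept": "application/json"}
    if api_key:
        # Assumed header name; only sent when a key is configured.
        headers["unstructured-api-key"] = api_key
    return url, headers
```

Passing `env` explicitly keeps the helper easy to test; in normal use it falls back to `os.environ`.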
Parameters
Required Parameters
- Folder Path: Path to the folder containing documents to process
Optional Parameters
Basic Configuration
- Unstructured API URL: API endpoint (default: http://localhost:8000/general/v0/general)
- Strategy: Processing strategy (default: auto)
  - hi_res: High resolution processing
  - fast: Quick processing
  - ocr_only: OCR-focused processing
  - auto: Automatic selection
- Encoding: Document encoding (default: utf-8)
OCR Options
- OCR Languages: Multiple language support including:
  - English (eng)
  - Spanish (spa)
  - Mandarin Chinese (cmn)
  - Hindi (hin)
  - Arabic (ara)
  - Portuguese (por)
  - Bengali (ben)
  - Russian (rus)
  - Japanese (jpn)
  - And more…
Processing Options
- Skip Infer Table Types: File types to skip table extraction (default: ["pdf", "jpg", "png"])
- Hi-Res Model Name: Model selection for hi_res strategy (default: detectron2_onnx)
  - chipper: Unstructured's in-house VDU model
  - detectron2_onnx: Facebook AI's fast object detection
  - yolox: Single-stage real-time detector
  - yolox_quantized: Optimized YOLOX version
- Coordinates: Extract element coordinates (default: false)
- Include Page Breaks: Include page break elements
- XML Keep Tags: Preserve XML tags
- Multi-Page Sections: Handle multi-page sections
Text Chunking Options
- Chunking Strategy: Text chunking method (default: by_title)
  - None: No chunking
  - by_title: Chunk by document titles
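- Combine Under N Chars: Minimum chunk size; sections shorter than this are combined with neighboring sections
- New After N Chars: Soft maximum chunk size; a new chunk is started once this length is reached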
- Max Characters: Hard maximum chunk size (default: 500)
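To make the interaction of the three size limits concrete, here is a simplified sketch of title-based chunking. It is not Unstructured's implementation: the `(kind, text)` element tuples, the function name, and the exact merge and split rules are assumptions chosen for illustration.

```python
def chunk_by_title(elements, combine_under_n_chars=0,
                   new_after_n_chars=None, max_characters=500):
    """Group (kind, text) elements into chunks; kind is "Title" or "Text".

    Simplified illustration only, not Unstructured's algorithm.
    """
    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    chunks, current = [], ""

    def flush():
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    for kind, text in elements:
        # A title opens a new chunk unless the current one is still too small.
        if kind == "Title" and len(current) >= combine_under_n_chars:
            flush()
        # Hard maximum: oversized text is split into pieces of max_characters.
        for start in range(0, len(text), max_characters):
            piece = text[start:start + max_characters]
            sep = "\n" if current else ""
            if len(current) + len(sep) + len(piece) > max_characters:
                flush()
                sep = ""
            current += sep + piece
            # Soft maximum: close the chunk once it is long enough.
            if len(current) >= new_after_n_chars:
                flush()
    flush()
    return chunks
```

Raising `combine_under_n_chars` merges short sections under consecutive titles into one chunk, while `max_characters` is never exceeded regardless of the other two settings.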
Metadata Options
- Source ID Key: Key for document source identification (default: source)
- Additional Metadata: Custom metadata as JSON
- Omit Metadata Keys: Keys to exclude from metadata
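One plausible way these three options combine is sketched below. The helper name `build_metadata` and the order of operations (extracted metadata first, then the source key, then custom JSON, then omissions) are assumptions, not the node's documented behavior.

```python
import json


def build_metadata(extracted, source, source_id_key="source",
                   additional_metadata="{}", omit_keys=()):
    """Merge metadata for one document (hypothetical helper)."""
    metadata = dict(extracted)
    # Store the document's source under the configured key.
    metadata[source_id_key] = source
    # Additional Metadata is supplied as JSON; it is parsed and merged here.
    metadata.update(json.loads(additional_metadata))
    # Omitted keys are dropped last, so they apply to every origin (assumed order).
    for key in omit_keys:
        metadata.pop(key, None)
    return metadata
```

For example, calling it with `additional_metadata='{"department": "legal"}'` and `omit_keys=["page_number"]` adds the department field and drops the page number from the result.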
Supported File Types
- Documents: .doc, .docx, .odt, .ppt, .pptx, .pdf
- Spreadsheets: .xls, .xlsx
- Text: .txt, .text, .md, .rtf
- Web: .html, .htm
- Email: .eml, .msg
- Images: .jpg, .jpeg (Note: .png and .heic currently unsupported)
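Before pointing the loader at a folder, it can help to check which files fall within the list above. The following sketch filters a folder by extension; the function name is hypothetical, and the non-recursive scan and case-insensitive match are assumptions rather than documented loader behavior.

```python
from pathlib import Path

# Extensions taken from the list above; .png and .heic are deliberately absent.
SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".ppt", ".pptx", ".pdf",
    ".xls", ".xlsx",
    ".txt", ".text", ".md", ".rtf",
    ".html", ".htm",
    ".eml", ".msg",
    ".jpg", ".jpeg",
}


def list_supported_files(folder_path):
    """Return supported files directly inside folder_path, sorted by path."""
    folder = Path(folder_path)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```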
Output Structure
Document Format
Each processed document includes:
- pageContent: Extracted text content
- metadata:
  - source: Document source identifier
  - Additional metadata from processing
  - Custom metadata (if specified)
Usage Examples
Basic Configuration
{
  "folderPath": "/path/to/documents",
  "strategy": "auto",
  "encoding": "utf-8"
}
Advanced Processing
{
  "folderPath": "/path/to/documents",
  "strategy": "hi_res",
  "hiResModelName": "detectron2_onnx",
  "ocrLanguages": ["eng", "spa", "fra"],
  "chunkingStrategy": "by_title",
  "maxCharacters": 500,
  "coordinates": true,
  "metadata": {
    "source": "company_docs",
    "department": "legal"
  }
}
Best Practices
- Choose appropriate strategy based on document quality and processing needs
- Configure OCR languages based on document content
- Adjust chunking parameters for optimal text segmentation
- Use appropriate hi-res model for your use case
- Consider memory usage when processing large folders
- Monitor API usage and response times
- Handle potential API errors in your workflow
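For the last point, one common pattern is to retry failed requests with exponential backoff. The sketch below is generic and not part of the loader: `call_with_retries` and its parameters are hypothetical, and `send` stands in for whatever function performs the API request in your workflow.

```python
import time


def call_with_retries(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            # In real use, catch only the errors worth retrying (timeouts, 5xx).
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the delay logic can be exercised without actually waiting.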
Notes
- Process multiple documents in batch
- Supports various file formats
- Memory-efficient processing
- Automatic metadata handling
- Flexible output formats
- Error handling for API responses
- Configurable processing options
This section is a work in progress. We appreciate any help you can provide in completing this section. Please check our Contribution Guide to get started.