Microsoft Word Document Loader
Microsoft Word is a word processor for creating and editing text documents. This module loads and processes Word documents using officeparser.
The loader can:
- Load Word documents
- Extract text content
- Split content into sections
- Handle page numbering
- Process metadata per section
- Support multiple section formats
- Handle various section separators
Inputs
Required Parameters
- Word File: The Word file(s) to process (.doc, .docx)
Optional Parameters
- Text Splitter: A text splitter to process the extracted content
- Additional Metadata: JSON object with additional metadata
- Omit Metadata Keys: Comma-separated list of metadata keys to omit
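The two metadata options can be illustrated with a minimal sketch. The helper name `applyMetadataOptions`, the merge order (additional metadata overrides the defaults), and the choice to apply the omit list last are assumptions made for this example; they are not taken from the loader's source.

```typescript
// Hypothetical helper, for illustration only.
type Metadata = Record<string, unknown>;

function applyMetadataOptions(
  base: Metadata,
  additionalJson?: string, // "Additional Metadata" as a JSON object string
  omitKeysCsv?: string     // "Omit Metadata Keys" as a comma-separated list
): Metadata {
  // Parse the optional JSON object; a missing value adds nothing.
  const additional: Metadata = additionalJson ? JSON.parse(additionalJson) : {};

  // Turn "a, b ,c" into the set {"a", "b", "c"}, dropping blank entries.
  const omit = new Set(
    (omitKeysCsv || "")
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );

  // Assumption: additional metadata wins over defaults with the same key.
  const merged: Metadata = { ...base, ...additional };

  // Copy every key that is not on the omit list.
  const result: Metadata = {};
  for (const key of Object.keys(merged)) {
    if (!omit.has(key)) {
      result[key] = merged[key];
    }
  }
  return result;
}

const example = applyMetadataOptions(
  { documentType: "word", pageNumber: 3 },
  '{"source": "handbook"}',
  "pageNumber"
);
// example is { documentType: "word", source: "handbook" }
```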
Outputs
- Document: Array of document objects containing metadata and pageContent
- Text: Concatenated string from pageContent of documents
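The relationship between the two outputs can be sketched as follows. The `LoadedDocument` interface and the newline separator are assumptions for this example; the text above only states that the Text output is a concatenation of each document's pageContent.

```typescript
// Hypothetical shape of one entry in the Document output.
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Build the Text output from the Document output.
// Assumption: sections are joined with a newline; the real separator may differ.
function toTextOutput(docs: LoadedDocument[], separator: string = "\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}
```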
Features
- Text extraction
- Section separation
- Metadata handling
- Error handling
- Memory-efficient processing
- Heuristic section detection
- Content filtering
Section Detection Methods
Pattern Recognition
The loader attempts to identify sections using common patterns:
- “Page X” markers
- “Section X” markers
- “Chapter X” markers
- Numbered sections (e.g., “1. ”, “2. ”)
- ALL CAPS headings
- Long underscore separators
- Long dash separators
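A minimal sketch of how the markers listed above could be detected line by line. The regular expressions, the thresholds (five or more underscores or dashes, at least four characters for an ALL CAPS heading), and the function names are illustrative assumptions; the loader's actual patterns are not shown in this document.

```typescript
// Illustrative patterns only, one per marker type listed above.
const SECTION_PATTERNS: RegExp[] = [
  /^Page\s+\d+\b/i,            // "Page X"
  /^Section\s+\d+\b/i,         // "Section X"
  /^Chapter\s+\d+\b/i,         // "Chapter X"
  /^\d+\.\s+\S/,               // numbered sections such as "1. Scope"
  /^[A-Z][A-Z0-9 ,:&'-]{3,}$/, // ALL CAPS heading (assumed minimum of 4 characters)
  /^_{5,}$/,                   // long underscore separator (assumed 5 or more)
  /^-{5,}$/,                   // long dash separator (assumed 5 or more)
];

// True when the trimmed line matches any of the marker patterns.
function isSectionBoundary(line: string): boolean {
  const trimmed = line.trim();
  return trimmed.length > 0 && SECTION_PATTERNS.some((pattern) => pattern.test(trimmed));
}

// Start a new section at every boundary line. The marker line is kept
// as the first line of the section it opens.
function splitByMarkers(text: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    if (isSectionBoundary(line) && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) {
    sections.push(current.join("\n"));
  }
  return sections;
}
```

In this sketch a text with no recognizable markers comes back as a single section, which is the point where the fallback mechanisms would take over.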
Fallback Mechanisms
If pattern recognition fails:
- Split by multiple newlines
- Split by double newlines
- Treat content as single section
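The fallback order above can be sketched as a chain of progressively weaker splits. This is an assumption-laden illustration: "multiple newlines" is read here as three or more consecutive line breaks, and the function name `splitWithFallback` is hypothetical.

```typescript
// Split on a separator, trim each piece, and drop pieces that end up empty.
function splitAndClean(text: string, separator: RegExp): string[] {
  return text
    .split(separator)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

function splitWithFallback(text: string): string[] {
  // 1. Multiple newlines (assumed to mean three or more line breaks).
  const byMultiple = splitAndClean(text, /(?:\r?\n){3,}/);
  if (byMultiple.length > 1) {
    return byMultiple;
  }

  // 2. Double newlines, i.e. ordinary paragraph breaks.
  const byDouble = splitAndClean(text, /(?:\r?\n){2}/);
  if (byDouble.length > 1) {
    return byDouble;
  }

  // 3. Nothing split the text: treat it as a single section.
  const whole = text.trim();
  return whole.length > 0 ? [whole] : [];
}
```

Each step is accepted only when it yields more than one section, so a document without blank lines always ends up as exactly one section.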
Document Structure
Each document contains:
- pageContent: Extracted text content from the section
- metadata:
  - documentType: “word”
  - pageNumber: Sequential section number
  - Additional custom metadata
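As a sketch of the structure described above, the following shows one way section strings could be turned into documents. The interface name, the function name, the 1-based numbering, and the rule that built-in keys take precedence over custom ones are assumptions for this example.

```typescript
// Hypothetical type mirroring the fields listed above.
interface WordSectionDocument {
  pageContent: string;
  metadata: {
    documentType: "word";
    pageNumber: number;
    [key: string]: unknown; // additional custom metadata
  };
}

function toDocuments(
  sections: string[],
  customMetadata: Record<string, unknown> = {}
): WordSectionDocument[] {
  return sections.map((content, index) => ({
    pageContent: content,
    metadata: {
      ...customMetadata,
      // Assumption: built-in keys are written last so they cannot be overridden.
      documentType: "word" as const,
      // Assumption: numbering starts at 1 and follows section order.
      pageNumber: index + 1,
    },
  }));
}
```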
Content Processing
- Empty sections are filtered out
- Leading/trailing whitespace removed
- Minimum content length validation
- Reasonable section count validation
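The four checks above could be combined roughly as follows. The thresholds and the recovery behavior (falling back to the whole text as one section when the split looks implausible) are assumptions for this sketch; the document does not state the loader's actual limits.

```typescript
// Hypothetical limits; the real values are not documented here.
interface CleaningLimits {
  minSectionLength: number; // minimum characters for a section to be kept
  maxSectionCount: number;  // above this, the split is treated as unreliable
}

function cleanSections(
  rawSections: string[],
  wholeText: string,
  limits: CleaningLimits = { minSectionLength: 10, maxSectionCount: 1000 }
): string[] {
  // Trim whitespace, then drop empty and too-short sections in one pass
  // (an empty string has length 0, so it fails any positive minimum).
  const cleaned = rawSections
    .map((section) => section.trim())
    .filter((section) => section.length > 0 && section.length >= limits.minSectionLength);

  // Assumed recovery: if nothing survives or the count is implausible,
  // return the whole text as a single section.
  if (cleaned.length === 0 || cleaned.length > limits.maxSectionCount) {
    const whole = wholeText.trim();
    return whole.length > 0 ? [whole] : [];
  }
  return cleaned;
}
```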
Metadata Attributes
Default attributes include:
- documentType: Type of document (string)
- pageCount: Number of pages/sections (number)
- Custom metadata from input
Notes
- Uses officeparser for extraction
- Handles various document formats
- Intelligent section detection
- Content validation
- Memory-efficient processing
- Error handling for invalid files
- Flexible output formats
- Robust fallback mechanisms