GitHub Document Loader

🚀 Enhanced: Direct integration with Langfuse tracing

GitHub is a platform for version control and collaboration. This module loads and processes content from GitHub repositories, both public and private.

The loader can:

  • Load content from GitHub repositories
  • Support private repository access
  • Process repositories recursively
  • Handle custom GitHub instances
  • Control concurrency and retries
  • Customize file filtering
  • Process content with text splitters

Inputs

Required Parameters

Optional Parameters

  • Connect Credential: GitHub API credentials (required for private repos)
  • Recursive: Whether to process subdirectories
  • Max Concurrency: Maximum number of concurrent file loads
  • Github Base URL: Custom GitHub base URL for enterprise instances
  • Github Instance API: Custom GitHub API URL for enterprise instances
  • Ignore Paths: Array of glob patterns for paths to ignore
  • Max Retries: Maximum number of retry attempts
  • Text Splitter: A text splitter to process the extracted content
  • Additional Metadata: JSON object with additional metadata
  • Omit Metadata Keys: Comma-separated list of metadata keys to omit
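To make Ignore Paths and Max Concurrency more concrete, here is a minimal sketch of how glob-based filtering and a bounded worker pool could fit together. It is not the node's actual implementation: `is_ignored`, `load_files`, and the `fetch` callback are hypothetical names, and Python's `fnmatch` semantics (where `*` also matches `/`) may differ from the glob dialect the loader really uses.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase


def is_ignored(path, ignore_patterns):
    """Return True if the path matches any of the ignore glob patterns."""
    return any(fnmatchcase(path, pattern) for pattern in ignore_patterns)


def load_files(paths, fetch, ignore_patterns=(), max_concurrency=2):
    """Fetch every non-ignored path with at most max_concurrency fetches in flight.

    `fetch` is a hypothetical callable that takes a path and returns its content.
    """
    kept = [p for p in paths if not is_ignored(p, ignore_patterns)]
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in input order, so zip pairs each path correctly.
        return dict(zip(kept, pool.map(fetch, kept)))
```

In this sketch a lower `max_concurrency` trades speed for fewer simultaneous requests, which can help when rate limits are a concern.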

Outputs

  • Document: Array of document objects containing metadata and pageContent
  • Text: Concatenated string from pageContent of documents
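The relationship between the two outputs can be sketched as follows. The `to_text` helper is hypothetical, and the newline separator is an assumption; the source does not state which separator the node uses when concatenating.

```python
def to_text(documents, separator="\n"):
    """Join the pageContent of each document into one string.

    The separator is an assumption made for this sketch.
    """
    return separator.join(doc["pageContent"] for doc in documents)


# Example Document output: a list of objects with pageContent and metadata.
documents = [
    {"pageContent": "# Title", "metadata": {"source": "README.md"}},
    {"pageContent": "print('hi')", "metadata": {"source": "src/main.py"}},
]
text = to_text(documents)
```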

Features

  • Public/private repo support
  • Enterprise instance support
  • Recursive directory processing
  • Concurrency control
  • Retry mechanism
  • Path filtering
  • Text splitting support
  • Metadata customization

Authentication Methods

Public Repositories

  • No authentication required
  • Rate limits apply
  • Limited to public content

Private Repositories

  • Requires a GitHub access token
  • Higher rate limits
  • Access to private content
  • Enterprise support
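As a rough illustration of the difference between the two modes, the sketch below builds request headers for the GitHub REST API with and without a token, and picks an API base URL. The helper names are hypothetical and `github.example.com` is a placeholder host; how the node itself issues requests is not described in this document.

```python
def build_headers(token=None):
    """Return REST API headers; the Authorization header is added only with a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get higher rate limits and can read private repos.
        headers["Authorization"] = f"Bearer {token}"
    return headers


def api_base(instance_api=None):
    """Use a custom API URL (e.g. an enterprise instance) or fall back to github.com."""
    return (instance_api or "https://api.github.com").rstrip("/")


public_headers = build_headers()
private_headers = build_headers("example-token")  # placeholder token value
enterprise_base = api_base("https://github.example.com/api/v3/")
```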

Document Structure

Each document contains:

  • pageContent: File content
  • metadata:
    • source: File path in repository
    • branch: Repository branch
    • commit: Commit hash
    • Additional custom metadata
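The structure above, combined with the Additional Metadata and Omit Metadata Keys inputs, might look like this in practice. This is a sketch under assumptions: `build_document` is a hypothetical helper, and applying the additional metadata before removing omitted keys is a guess at the order, not documented behavior.

```python
def build_document(content, source, branch, commit,
                   additional_metadata=None, omit_keys=""):
    """Assemble one document with default, additional, and omitted metadata."""
    metadata = {"source": source, "branch": branch, "commit": commit}
    # Assumed order: merge the extra JSON object first, then drop omitted keys.
    metadata.update(additional_metadata or {})
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    metadata = {k: v for k, v in metadata.items() if k not in omit}
    return {"pageContent": content, "metadata": metadata}


doc = build_document(
    content="# Title",
    source="README.md",
    branch="main",
    commit="abc123",  # placeholder hash
    additional_metadata={"team": "docs"},
    omit_keys="commit, branch",
)
```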

Notes

  • Supports both public and private repos
  • Enterprise GitHub instances supported
  • Rate limiting handled automatically
  • Exponential backoff for retries
  • Path filtering with glob patterns
  • Memory-efficient processing
  • Error handling for invalid repos
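The exponential backoff mentioned above can be illustrated with a generic retry helper. This is a sketch of the general technique, not the node's code: `retry_with_backoff`, `base_delay`, and the doubling factor are assumptions, and `max_retries` only loosely mirrors the Max Retries input.

```python
import time


def retry_with_backoff(operation, max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again.

    After max_retries retries have failed, the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # delays grow 1x, 2x, 4x, ...
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to test, since a fake sleep function can record the delays without actually waiting.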