Microsoft SharePoint

Microsoft SharePoint is a web-based collaboration and document management platform. This module provides a SharePoint document loader that ingests documents from both SharePoint Online (Microsoft 365) and SharePoint Server (on-premises / IaaS-hosted) into PebbleAgent’s vector stores for AI-powered search and retrieval.

Key capabilities:

  • SharePoint Online and SharePoint Server (2016+) support via the Environment selector
  • Load documents from standard SharePoint document libraries (folders and files)
  • Load items from SharePoint lists with rich metadata
  • Navigate specific folders within libraries
  • Filter by status, file type, date, size, and custom metadata fields
  • Handle large document sets (thousands of files) with batch processing and memory management
  • Five authentication methods covering cloud, hybrid, and on-premises deployments

Inputs

Required Parameters

Parameter | Description
Connect Credential | A SharePoint credential. The available credential types depend on your environment; see Authentication.
Environment | Choose your SharePoint deployment type: SharePoint Online (Microsoft 365, uses Graph API) or SharePoint Server (2016+ on-premises or IaaS-hosted, uses REST API). Defaults to SharePoint Online.
Mode | Choose how to load documents: Document Library (files and folders) or SharePoint List (structured table with metadata columns). Controls which fields appear below.

Mode-Dependent Parameters

Parameter | Shown When | Description
Library Name | Document Library mode | Display name of the library (e.g., Documents). Leave empty for the site’s default library.
Folder Path | Document Library mode | Path to a specific folder (e.g., /Reports/2024). Leave empty to load from the library root.
List Name | SharePoint List mode | Name of the SharePoint list (e.g., Master, ControlDocument).

Optional Parameters

Parameter | Description
Filter | Semicolon-separated key=value pairs to filter documents (e.g., RFCStatus=Current; fileTypes=pdf,docx). See Filtering.
Max Documents | Maximum number of documents to load (default: 500, max: 10,000).
Text Splitter | A text splitter node to chunk the extracted content.
Additional Metadata | JSON object with extra metadata to add to all documents.
Omit Metadata Keys | Comma-separated list of metadata keys to exclude. Use * to omit all default metadata.
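
To make the two metadata options concrete, the following is a minimal sketch of how they could combine. It is an illustration, not PebbleAgent's code: the function name `apply_metadata_options` is hypothetical, and the order of operations (omit first, then add) is an assumption.

```python
import json


def apply_metadata_options(default_metadata, additional_json="", omit_keys=""):
    """Return the metadata a document would carry after both options are applied."""
    omit = [key.strip() for key in omit_keys.split(",") if key.strip()]
    if "*" in omit:
        # "*" drops every default metadata key.
        result = {}
    else:
        result = {k: v for k, v in default_metadata.items() if k not in omit}

    # Assumption: Additional Metadata is applied after omission, so it always survives.
    if additional_json.strip():
        extra = json.loads(additional_json)
        if not isinstance(extra, dict):
            raise ValueError("Additional Metadata must be a JSON object")
        result.update(extra)
    return result


print(apply_metadata_options({"source": "a.pdf", "author": "Jane"}, '{"team": "HR"}', "author"))
```

Under these assumptions, omitting `author` and adding `{"team": "HR"}` leaves the document with `source` and `team` only.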

Additional Parameters

Parameter | Description
Batch Size | Documents processed per batch (default: 100, range: 10–500). Reduce for large files, increase for small files.

Outputs

Output | Description
Document | Array of document objects containing metadata and pageContent
Text | Concatenated string from pageContent of documents
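
The relationship between the two outputs can be sketched as follows. The field names `pageContent` and `metadata` come from the table above; the sample documents, the function name `to_text_output`, and the newline separator are assumptions made for illustration.

```python
# Document output: an array of objects, each with pageContent and metadata.
documents = [
    {"pageContent": "Leave policy, section 1", "metadata": {"source": "leave.pdf"}},
    {"pageContent": "Expense policy, section 1", "metadata": {"source": "expenses.docx"}},
]


def to_text_output(docs, separator="\n"):
    """Text output: the pageContent of every document joined into one string.

    The separator is an assumption; the source only says the text is concatenated.
    """
    return separator.join(doc["pageContent"] for doc in docs)


print(to_text_output(documents))
```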

Supported File Types

Format | Extensions
PDF | .pdf
Word | .docx, .doc
Excel | .xlsx, .xls
Text | .txt
CSV | .csv

Authentication

PebbleAgent supports five authentication methods across SharePoint Online and SharePoint Server environments. The credential you choose determines which Environment setting to use.

Environment | Credential | Best For
SharePoint Online | Device Code Flow | Quick setup, no admin needed
SharePoint Online | OAuth2 (Enterprise) | Centralised admin control
SharePoint Server | Windows NTLM | Simple domain auth, widest compatibility
SharePoint Server | Microsoft ADFS OAuth2 | Token-based with MFA support
SharePoint Server | Windows Kerberos | SSO with per-user permission preservation

SharePoint Online Credentials

Device Code Flow

A simple, secure OAuth2 method that requires no Azure AD app registration and no admin approval.

How it works:

  1. PebbleAgent displays a code (e.g., A1B2-C3D4) and a link
  2. You open a browser and go to microsoft.com/devicelogin
  3. You enter the code and sign in with your Microsoft account
  4. PebbleAgent receives permission to access your SharePoint files
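
The four steps above correspond to the OAuth 2.0 device authorization grant on the Microsoft identity platform. The sketch below shows the two requests involved and the polling loop, as an illustration rather than PebbleAgent's implementation. The tenant value, the client ID placeholder, and the caller-supplied `post` function (which sends a form body and returns the parsed JSON reply) are assumptions.

```python
import time
from urllib.parse import urlencode

AUTHORITY = "https://login.microsoftonline.com"
DEVICE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def device_code_request(tenant, client_id, scopes):
    """Step 1: URL and form body that ask Microsoft for a user code and link."""
    url = f"{AUTHORITY}/{tenant}/oauth2/v2.0/devicecode"
    return url, urlencode({"client_id": client_id, "scope": " ".join(scopes)})


def token_poll_request(tenant, client_id, device_code):
    """Steps 2-4: URL and form body used to poll while the user signs in."""
    url = f"{AUTHORITY}/{tenant}/oauth2/v2.0/token"
    body = urlencode(
        {"grant_type": DEVICE_GRANT, "client_id": client_id, "device_code": device_code}
    )
    return url, body


def wait_for_token(post, tenant, client_id, flow, sleep=time.sleep):
    """Poll until the user has entered the code, or the code expires.

    `flow` is the JSON reply to the device code request (device_code,
    expires_in, interval). `post(url, body)` is supplied by the caller.
    """
    deadline = time.monotonic() + flow["expires_in"]
    while time.monotonic() < deadline:
        url, body = token_poll_request(tenant, client_id, flow["device_code"])
        reply = post(url, body)
        if "access_token" in reply:
            return reply
        if reply.get("error") != "authorization_pending":
            raise RuntimeError(reply.get("error", "unknown error"))
        sleep(flow.get("interval", 5))
    raise TimeoutError("Device code expired; request a new one")
```

Passing `post` in as a parameter keeps the sketch free of any particular HTTP library and lets the polling logic be exercised without network access.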

Setting up the credential:

  1. Navigate to Credentials in PebbleAgent
  2. Click “+ Add Credential”
  3. Select “Microsoft SharePoint Device Code Authentication”
  4. Enter a descriptive name (e.g., “SharePoint – DocCentre”)
  5. Click “Save”
  6. Click the “Authenticate” button next to your credential
  7. PebbleAgent displays the device code and link — open the link, enter the code, and sign in
  8. Complete Multi-Factor Authentication if required by your organisation
  9. Click “Accept” on the permission request

The permission dialog shows “Azure CLI” as the app name. This is expected — PebbleAgent uses Microsoft’s public Azure CLI client ID to enable device code flow without requiring an Azure AD app registration.

OAuth2 (Enterprise)

Standard OAuth2 Authorization Code Flow for organisations that require a custom Azure AD app registration.

Prerequisites:

  • An Azure AD admin must create an app registration with the following:
    • Redirect URI configured for your PebbleAgent instance
    • API permissions: Sites.Read.All, Files.Read.All, User.Read, openid, offline_access
    • A client secret generated for the app

Setting up the credential:

  1. Navigate to Credentials in PebbleAgent
  2. Click “+ Add Credential”
  3. Select “Microsoft SharePoint OAuth2”
  4. Enter the following from your Azure AD app registration:
    • Authorization URL: https://login.microsoftonline.com/<tenantId>/oauth2/v2.0/authorize
    • Access Token URL: https://login.microsoftonline.com/<tenantId>/oauth2/v2.0/token
    • Client ID: From your app registration
    • Client Secret: From your app registration
    • Site URL: Your SharePoint site URL (e.g., https://contoso.sharepoint.com/sites/MySite)
  5. Click “Save” and complete the OAuth2 authorisation flow
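
As a rough illustration of how the two endpoint URLs and the listed permissions fit together in an Authorization Code Flow, the sketch below builds the link a user would be sent to. The tenant ID, client ID, redirect URI, and `state` value are placeholders, and the function names are hypothetical; PebbleAgent builds its own link when you complete step 5.

```python
from urllib.parse import urlencode

# Permissions listed in the prerequisites above.
SCOPES = ["Sites.Read.All", "Files.Read.All", "User.Read", "openid", "offline_access"]


def endpoint_urls(tenant_id):
    """Authorization URL and Access Token URL for a given tenant."""
    base = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0"
    return {"authorize": f"{base}/authorize", "token": f"{base}/token"}


def build_authorize_link(tenant_id, client_id, redirect_uri, state):
    """Link that starts the Authorization Code Flow in the user's browser."""
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": " ".join(SCOPES),
        "state": state,
    }
    return endpoint_urls(tenant_id)["authorize"] + "?" + urlencode(params)


# Placeholder values only; substitute those from your own app registration.
print(build_authorize_link("contoso-tenant", "my-client-id",
                           "https://pebbleagent.example.com/callback", "xyz"))
```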

Online Permissions

Both SharePoint Online methods request the same read-only permissions:

Permission | Microsoft Name | Description
Read files | Files.Read.All | Read any file you have permission to access in SharePoint
Read sites | Sites.Read.All | See which SharePoint sites you have access to
Maintain access | offline_access | Refresh the access token without re-authenticating

PebbleAgent can only access files and sites that the authenticated user already has permission to access.

Online Token Lifecycle

  • Access token expires every hour and is refreshed automatically
  • Refresh token is valid for 90 days and renews each time it is used
  • If the refresh token expires (90 days of inactivity), you will need to re-authenticate
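
The lifecycle rules above reduce to two time comparisons. The sketch below is an illustration under the stated lifetimes (one hour and 90 days); the function name and the three state labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ACCESS_TOKEN_LIFETIME = timedelta(hours=1)
REFRESH_TOKEN_LIFETIME = timedelta(days=90)


def credential_state(access_issued_at, refresh_last_used_at, now):
    """Classify a credential as valid, refreshable, or needing re-authentication."""
    if now - refresh_last_used_at >= REFRESH_TOKEN_LIFETIME:
        # 90 days of inactivity: the refresh token can no longer be used.
        return "reauthenticate"
    if now - access_issued_at >= ACCESS_TOKEN_LIFETIME:
        # The access token has expired but the refresh token can renew it.
        return "refresh"
    return "valid"


now = datetime.now(timezone.utc)
print(credential_state(now - timedelta(hours=2), now - timedelta(days=10), now))
```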

SharePoint Server Credentials

For on-premises or IaaS-hosted SharePoint Server 2016+, PebbleAgent connects via the SharePoint REST API instead of Microsoft Graph. Three authentication methods are available.
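
For orientation, the sketch below builds two well-known SharePoint REST endpoints from a site's Base URL: list items by list title, and files in a folder by server-relative URL. Whether PebbleAgent calls exactly these endpoints is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import quote


def odata_literal(value):
    """Quote a string for an OData URL: single quotes are doubled, spaces encoded."""
    return "'" + quote(value.replace("'", "''"), safe="'/") + "'"


def list_items_url(base_url, list_title):
    """Items of a SharePoint list, addressed by its title."""
    site = base_url.rstrip("/")
    return f"{site}/_api/web/lists/getbytitle({odata_literal(list_title)})/items"


def folder_files_url(base_url, server_relative_folder):
    """Files in a folder, addressed by its server-relative URL."""
    site = base_url.rstrip("/")
    folder = odata_literal(server_relative_folder)
    return f"{site}/_api/web/GetFolderByServerRelativeUrl({folder})/Files"


base = "https://sharepoint.contoso.local/sites/DocCentre"
print(list_items_url(base, "ControlDocument"))
print(folder_files_url(base, "/sites/DocCentre/Policies/Current"))
```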

Windows NTLM

The simplest option for SharePoint Server. Uses Windows domain credentials directly.

Setting up the credential:

  1. Navigate to Credentials in PebbleAgent
  2. Click “+ Add Credential”
  3. Select “Windows NTLM Authentication”
  4. Enter:
    • Base URL: Your SharePoint Server site URL (e.g., https://sharepoint.contoso.local/sites/DocCentre)
    • Domain: Your Windows domain (e.g., CONTOSO)
    • Username: Your Windows username (without domain prefix)
    • Password: Your Windows domain password
  5. Click “Save”
⚠️ NTLM does not support MFA. If your organisation requires multi-factor authentication, use ADFS OAuth2 or Kerberos instead.

Optional: Enable Allow Self-Signed Certificates (under Additional Parameters) if the server uses an internal CA or self-signed certificate.

Microsoft ADFS OAuth2

Token-based authentication via Active Directory Federation Services. Supports MFA and device code flow — no passwords stored in PebbleAgent.

Prerequisites:

  • ADFS deployed with OAuth2/OpenID Connect endpoints
  • Application registered as a relying party trust in ADFS (your ADFS admin can set this up)
  • ADFS admin provides: the metadata URL and a client ID

Setting up the credential:

  1. Navigate to Credentials in PebbleAgent
  2. Click “+ Add Credential”
  3. Select “Microsoft ADFS OAuth2”
  4. Enter:
    • Base URL: Your SharePoint Server site URL (e.g., https://sharepoint.contoso.local/sites/DocCentre)
    • ADFS Metadata URL: The OpenID Connect discovery endpoint (e.g., https://adfs.contoso.com/adfs/.well-known/openid-configuration)
    • Client ID: From your ADFS relying party trust registration
  5. Click “Save” and complete the ADFS device code flow authentication

Windows Kerberos

Kerberos Constrained Delegation preserves the full SSO chain and per-user permissions. It is the most secure option but requires the most AD admin setup.

Prerequisites:

  • AD admin registers a service principal name (SPN) for PebbleAgent
  • AD admin configures constrained delegation to the SharePoint service
  • AD admin generates and provides a keytab file
  • Keytab file placed on the PebbleAgent server at a known path

Setting up the credential:

  1. Navigate to Credentials in PebbleAgent
  2. Click “+ Add Credential”
  3. Select “Windows Kerberos Authentication”
  4. Enter:
    • Base URL: Your SharePoint Server site URL (e.g., https://sharepoint.contoso.local/sites/DocCentre)
    • Service Principal Name (SPN): The Kerberos SPN registered in AD (e.g., HTTP/pebbleagent.contoso.local)
    • Keytab File Path: Absolute path to the keytab file on the server (e.g., /etc/krb5/pebbleagent.keytab)
  5. Click “Save”

Server Authentication Comparison

Feature | NTLM | ADFS OAuth2 | Kerberos
MFA support | No | Yes | Yes (via AD)
Passwords stored | Yes (encrypted) | No (token-based) | No (keytab-based)
AD admin setup | None | Moderate | Most
Per-user permissions | Yes | Yes | Yes (strongest)
Self-signed certs | Supported | Supported | Supported

SharePoint Source Types

SharePoint has two ways to store documents. The Mode dropdown controls which fields appear and how the loader connects to SharePoint.

How to tell the difference: On your SharePoint site, click the gear icon and choose Site Contents. Each item is labelled as either “Document Library” or “List”.

Document Libraries

Standard file storage, similar to folders on your computer.

SharePoint Site
└── Document Library ("Documents")
    ├── Folder A
    │   ├── file1.pdf
    │   └── file2.docx
    └── Folder B
        └── file3.xlsx

Use when you have a standard document library with files organised in folders. The URL typically contains /Shared Documents/ or a similar path.

⚠️ Use the display name (what you see in the SharePoint UI), not the URL name. For example, use Documents instead of Shared Documents. You can find the display name in Library settings under the gear icon, or in Site Contents.
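
The display-name rule can be illustrated with the shape of data Microsoft Graph returns when listing a site's drives, where each drive has a `name` (the display name) and a `webUrl` (which contains the URL name). The sample data and the function `find_library` are hypothetical; this is a sketch of the matching logic, not PebbleAgent's code.

```python
def find_library(drives, display_name):
    """Return the drive whose display name matches exactly, or list what exists."""
    for drive in drives:
        if drive["name"] == display_name:
            return drive
    available = ", ".join(sorted(d["name"] for d in drives))
    raise LookupError(
        f"Document library not found: {display_name!r}. Available libraries: {available}"
    )


# Sample data in the shape of a Graph "list drives" response (values are made up).
drives = [
    {"name": "Documents", "webUrl": "https://contoso.sharepoint.com/sites/MySite/Shared%20Documents"},
    {"name": "Archive", "webUrl": "https://contoso.sharepoint.com/sites/MySite/Archive"},
]

print(find_library(drives, "Documents")["webUrl"])
```

Looking up "Shared Documents" in this sample raises an error, because that string appears only in the URL and not in the display name.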

SharePoint Lists

Database-like storage with rich metadata fields.

SharePoint Site
└── List ("ControlDocument")
    ├── Item 1 (Status: Current, Author: John)
    ├── Item 2 (Status: Archived, Author: Jane)
    └── Item 3 (Status: Draft, Author: Bob)

Use when your documents are stored in a list with custom metadata fields like status, author, or category. The URL typically contains /Lists/.

Folder Path

Navigate to a specific folder within a document library. Only available in Document Library mode.

Folder Path | Description
/Reports | Load from “Reports” folder
/Reports/2024 | Load from “2024” subfolder
/General/Business Development/Marketing | Deep folder path
(empty) | Load from library root (all documents)

  • Use forward slashes /
  • Match folder names exactly (case-sensitive)
  • Don’t include the library name in the path
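
The three rules above can be checked before a path is entered. The following is a minimal sketch with a hypothetical helper name; how PebbleAgent itself validates the field is not documented here.

```python
def normalize_folder_path(folder_path, library_name=""):
    """Tidy a Folder Path value and flag the common mistakes listed above."""
    path = folder_path.strip()
    if not path:
        return ""  # empty means the library root
    if "\\" in path:
        raise ValueError("Use forward slashes (/) in Folder Path")

    segments = [segment for segment in path.split("/") if segment]
    # Heuristic only: a folder that legitimately shares the library's name
    # would also trip this check.
    if library_name and segments and segments[0] == library_name:
        raise ValueError("Do not include the library name in Folder Path")

    # Case is preserved on purpose, because folder names are case-sensitive.
    return "/" + "/".join(segments)


print(normalize_folder_path("Reports/2024/"))
```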

Filtering

Filter documents using semicolon-separated key=value pairs in the Filter field.

Supported Filters

Filter | Example | Description
RFCStatus | RFCStatus=Current | Filter by status field (SharePoint lists)
fileTypes | fileTypes=pdf,docx | Only load specific file types
maxSize | maxSize=50MB | Skip files larger than limit
modifiedAfter | modifiedAfter=2024-01-01 | Only files modified after date
modifiedBefore | modifiedBefore=2024-12-31 | Only files modified before date

Combined Example

RFCStatus=Current; fileTypes=pdf,docx; maxSize=25MB; modifiedAfter=2024-01-01
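
To show how a filter string of this form breaks down, here is a small parsing sketch. It is an illustration, not PebbleAgent's parser: the function names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from datetime import date


def parse_size(text):
    """Convert a size such as '25MB' to bytes (binary units are an assumption)."""
    text = text.strip().upper()
    for suffix, factor in (("KB", 1024), ("MB", 1024**2), ("GB", 1024**3)):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def parse_filter(filter_text):
    """Split 'key=value; key=value' into a dict with typed values."""
    filters = {}
    for pair in filter_text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got {pair!r}")
        filters[key.strip()] = value.strip()

    if "fileTypes" in filters:
        filters["fileTypes"] = [
            t.strip().lower().lstrip(".") for t in filters["fileTypes"].split(",") if t.strip()
        ]
    if "maxSize" in filters:
        filters["maxSize"] = parse_size(filters["maxSize"])
    for key in ("modifiedAfter", "modifiedBefore"):
        if key in filters:
            filters[key] = date.fromisoformat(filters[key])
    return filters


print(parse_filter("RFCStatus=Current; fileTypes=pdf,docx; maxSize=25MB; modifiedAfter=2024-01-01"))
```

Keys the sketch does not recognise, such as RFCStatus, are kept as plain strings so that custom metadata fields pass through unchanged.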

Examples

Load from a Document Library (Online)

Load all documents from the default “Documents” library in SharePoint Online.

Field | Value
Environment | SharePoint Online
Mode | Document Library
Library Name | Documents

Leave Library Name empty to use the site’s default library.

Load from a Specific Folder

Load marketing materials from a nested folder.

Field | Value
Environment | SharePoint Online
Mode | Document Library
Library Name | Documents
Folder Path | /General/Business Development/Marketing

Load from a SharePoint List with Status Filter

Load only “Current” documents from a document control system.

Field | Value
Environment | SharePoint Online
Mode | SharePoint List
List Name | ControlDocument
Filter | RFCStatus=Current

Load PDFs Modified This Year

Field | Value
Environment | SharePoint Online
Mode | Document Library
Library Name | Documents
Folder Path | /Policies
Filter | fileTypes=pdf; modifiedAfter=2025-01-01

Large Document Set with Limits

Load up to 5000 documents from a large library in small batches.

Field | Value
Environment | SharePoint Online
Mode | Document Library
Library Name | Archive
Max Documents | 5000
Batch Size | 50
Filter | maxSize=25MB

Load from SharePoint Server (On-Premises)

Load policies from an on-premises SharePoint Server using NTLM authentication.

Field | Value
Credential | Windows NTLM Authentication
Environment | SharePoint Server
Mode | Document Library
Library Name | Policies
Folder Path | /Current
Filter | fileTypes=pdf,docx

Large-Scale Ingestion (1,000+ Documents)

For SharePoint sites with thousands of documents, split the work across multiple loader instances. Instead of one loader with Max Documents set to 10000, add several SharePoint loaders to the same Document Store, each scoped to a different subset.

Why Multiple Loaders?

  • Each loader processes independently — one failing doesn’t affect others
  • Each loader has its own status indicator in the Document Store table
  • Individual loaders can be refreshed without re-processing everything
  • Memory stays manageable (~500MB per 500-document loader)

Splitting Strategies

Strategy | Example Setting | Best For
By folder | Folder Path: /HR/Policies | Libraries with folder structure
By status | Filter: RFCStatus=Current | Lists with status fields
By file type | Filter: fileTypes=pdf | Mixed-format libraries
By date range | Filter: modifiedAfter=2025-01-01 | Incremental ingestion
By Max Documents | Max Documents: 500 (multiple loaders) | Simple numerical partitioning
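
When planning a split, it can help to write out the field values for each loader before creating them in the UI. The sketch below generates such a plan for the folder and date-range strategies. The dictionaries simply mirror the field names used in this page; they are not a PebbleAgent API, the function names are hypothetical, and whether the date filters include their boundary dates is not stated in this page.

```python
def loaders_by_folder(library_name, folder_paths, max_documents=500):
    """One loader per folder, each capped at max_documents."""
    return [
        {
            "Mode": "Document Library",
            "Library Name": library_name,
            "Folder Path": path,
            "Max Documents": max_documents,
        }
        for path in folder_paths
    ]


def loaders_by_year(library_name, first_year, last_year):
    """One loader per calendar year, using the modifiedAfter/modifiedBefore filters."""
    return [
        {
            "Mode": "Document Library",
            "Library Name": library_name,
            "Filter": f"modifiedAfter={year}-01-01; modifiedBefore={year}-12-31",
        }
        for year in range(first_year, last_year + 1)
    ]


for plan in loaders_by_year("Archive", 2023, 2025):
    print(plan)
```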

Monitoring Progress

Server logs show batch progress and memory usage:

[SharePoint] Processing batch 3/5 (items 201-300 of 500)
[SharePoint] Batch 3 complete. Memory: 450MB / 2048MB heap

If memory exceeds 85% of heap, processing stops gracefully with a warning.
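
The batch numbering in the log lines and the 85% stop rule can be sketched as below. This is an illustration of the arithmetic only: the function names, the `heap_used` callback, and the wording of the stop message are hypothetical, and memory values are taken to be in MB.

```python
def batch_ranges(total, batch_size):
    """Yield (batch_number, batch_count, first_item, last_item), 1-based and inclusive."""
    batch_count = -(-total // batch_size)  # ceiling division
    for index in range(batch_count):
        first = index * batch_size + 1
        last = min((index + 1) * batch_size, total)
        yield index + 1, batch_count, first, last


def process_in_batches(items, batch_size, heap_used, heap_limit, handle, threshold=0.85):
    """Process items batch by batch; stop early once memory passes the threshold.

    Returns the number of items processed.
    """
    total = len(items)
    processed = 0
    for number, count, first, last in batch_ranges(total, batch_size):
        if heap_used() > threshold * heap_limit:
            # Hypothetical message; the real warning text is not shown in this page.
            print(f"[SharePoint] Stopping before batch {number}/{count}: memory above {threshold:.0%} of heap")
            break
        print(f"[SharePoint] Processing batch {number}/{count} (items {first}-{last} of {total})")
        for item in items[first - 1:last]:
            handle(item)
        processed = last
    return processed
```

With 500 items and a batch size of 100, the third batch covers items 201-300, which matches the sample log line above.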

Vector Store Cleanup

Deleting a loader does NOT remove its embeddings from the vector store. To properly clean up:

  1. Configure a Record Manager when upserting to the vector store
  2. Use full cleanup mode to remove embeddings for documents no longer in the Document Store
  3. After deleting a loader, re-upsert with Record Manager to trigger cleanup

For ongoing updates, use modifiedAfter filters with incremental cleanup mode in the Record Manager so that updated documents are automatically replaced.

Security and Privacy

What Gets Stored

All credentials are stored encrypted at rest. The specific data varies by type:

Credential Type | Stored Data
Device Code Flow | Access token, refresh token, token expiry, user email, scopes
OAuth2 (Enterprise) | Client ID, client secret, access token, refresh token, tenant URLs
Windows NTLM | Domain, username, password, base URL
ADFS OAuth2 | Client ID, ADFS metadata URL, access token, base URL
Kerberos | SPN, keytab file path, base URL

PebbleAgent does not store device codes or Microsoft login passwords. SharePoint documents are only accessed when loading into a vector store.

Revoking Access

In PebbleAgent: Go to Credentials, find your SharePoint credential, and click “Delete”.

For SharePoint Online:

  • Device Code Flow: Go to account.microsoft.com/privacy/app-access, find “Azure CLI”, and click “Remove”.
  • OAuth2: An Azure AD admin can revoke application access or delete the app registration.

For SharePoint Server:

  • NTLM / Kerberos: Deleting the credential in PebbleAgent is sufficient. Optionally, an AD admin can disable the account or revoke the keytab.
  • ADFS OAuth2: An ADFS admin can revoke the relying party trust or the user’s session.

Troubleshooting

“Document library not found”

You may be using the URL name instead of the display name. Try Documents instead of Shared Documents. Leave the field empty to use the default library. The error message lists available libraries.

“Access denied” or “403 Forbidden”

Verify you can access the SharePoint site in a browser, check your credential is for the correct site URL, and re-authenticate if the token may have expired.

“Site not found”

Check the Site URL in your credential matches exactly. Include the full path (https://company.sharepoint.com/sites/MySite) without trailing slashes or paths beyond the site name.
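
A URL copied from the browser address bar often carries extra path segments. The sketch below trims such a URL back to the site itself. It assumes the site lives under the /sites/ or /teams/ managed path, which is common but not universal; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def trim_to_site_url(url):
    """Cut a pasted URL back to https://host/sites/<name> with no trailing slash."""
    parts = urlsplit(url.strip())
    segments = [segment for segment in parts.path.split("/") if segment]
    # Assumption: sites live under /sites/ or /teams/; other paths are left intact.
    if len(segments) >= 2 and segments[0].lower() in ("sites", "teams"):
        segments = segments[:2]
    path = "/" + "/".join(segments) if segments else ""
    return f"{parts.scheme}://{parts.netloc}{path}"


print(trim_to_site_url("https://company.sharepoint.com/sites/MySite/Shared%20Documents/Forms/AllItems.aspx"))
```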

No documents found

Try removing the folder path to load from the root, simplify or remove filters, and verify the folder exists in SharePoint. Folder paths are case-sensitive.

Documents loading slowly or memory errors

Reduce Max Documents and Batch Size (try 25–50). Add a maxSize=25MB filter to skip very large files. Use more specific folder paths. For very large sets, see Large-Scale Ingestion.

“Device code expired”

The code must be entered within 15 minutes. Click “Authenticate” again to get a new code. Prepare by opening microsoft.com/devicelogin beforehand.

“Need admin approval”

Your organisation requires admin consent for applications. Contact your IT department and ask them to enable user consent in Azure AD, or configure a custom OAuth2 credential with pre-approved permissions.

“Waiting for authentication…” never completes

Verify you clicked “Accept” in the browser and saw the “You’re all set!” message. If stuck, click “Cancel” and try again. If your server cannot reach login.microsoftonline.com, check outbound network/firewall rules.

Best Practices

  • Start small: Begin with Max Documents at 100 to verify configuration, then increase gradually
  • Be specific with paths: Target specific folders rather than loading entire libraries
  • Filter large libraries: Use fileTypes, maxSize, and modifiedAfter filters to narrow results
  • Monitor token expiration: Check credential status periodically and re-authenticate before the 90-day refresh token expiry
  • Use multiple loaders for scale: For 1,000+ documents, split across multiple loader instances in the same Document Store

FAQs

Why does the permission dialog say “Azure CLI”?
PebbleAgent uses Microsoft’s public Azure CLI client ID to enable device code flow without requiring an Azure AD app registration. This is standard practice and secure.

When should I use OAuth2 instead of Device Code Flow?
Use OAuth2 when your organisation requires a custom Azure AD app registration, needs centralised admin control over app permissions, or has disabled device code flow via conditional access policies.

Can I use one credential for multiple SharePoint sites?
Yes. One credential can access all SharePoint sites the authenticated user has permission to access. Configure multiple document stores or loaders pointing to different sites using the same credential.

What happens if I change my Microsoft password?
Your refresh token continues to work unless your organisation enforces re-authentication after password changes.

Can multiple users share one credential?
No. Each credential is tied to one Microsoft account. Different users should create their own credentials.

Does this work with SAML or Okta SSO?
Yes, as long as your identity provider federates with Azure AD. The device code flow redirects through your organisation’s SSO.

Can I authenticate from a headless server?
Yes. The server displays the code and you open the browser on any device (laptop, phone, etc.) to complete authentication. This is one of the key advantages of device code flow.

Does this work with on-premises SharePoint Server?
Yes. Set the Environment to “SharePoint Server” and use one of the three Server credential types (NTLM, ADFS OAuth2, or Kerberos). SharePoint Server 2016 and later are supported via the SharePoint REST API.

Which SharePoint Server auth method should I choose?
Start with NTLM for the simplest setup. Use ADFS OAuth2 if your organisation requires MFA or you prefer not to store passwords. Use Kerberos for the strongest security with full SSO chain preservation, though it requires the most AD admin setup.