| Metric | Value | Basis | Trend |
|---|---|---|---|
| Unstructured data share | 85% | of enterprise content (IDC, 2023) | ↑ unsearchable without AI processing |
| Docs per SMB | 5K–50K | across drives, email, and wikis | ↑ growing 30% per year |
| AI processing cost | $0.005 | per page at scale (2024 API rates) | ↓ 90% cheaper than 2022 |
| Manual search cost | $22/hr | avg knowledge worker salary | ↑ AI answers the same query in < 3s |
Where company knowledge actually lives
Before building anything, it is worth being honest about where your company's knowledge is distributed. Most companies have never done this audit and are surprised by the result.
Email threads contain decisions that were never documented elsewhere. Shared drives have folders untouched for two years that hold the original contracts. Slack history contains context that only three people remember. The founding employee's laptop has processes that no one else knows exist.
A document AI system does not require perfect organisation before you start. It requires knowing what you have and where it is. The Agency Company's onboarding process starts with a content audit: a two-hour exercise that typically surfaces more usable knowledge than clients expect.
Document types: what works, what needs preprocessing, what to skip
Different document types require different handling. The table below shows which types are ready for retrieval-augmented generation (RAG) as they are and which need processing first.
| Document type | RAG-ready as-is? | Preprocessing needed | Recommended action |
|---|---|---|---|
| PDF (text-based) | Yes | Minimal | Connect directly |
| Word / Google Docs | Yes | Export or API connection | Connect directly |
| Scanned PDFs | No | OCR required | Process first, then ingest |
| Email (Gmail/Outlook) | Partial | Thread parsing, deduplication | Selective ingestion by topic |
| Spreadsheets (.xlsx) | Partial | Flatten to rows or summarise | Structure before ingesting |
| Slack / Teams history | Partial | Thread grouping, noise filter | Filter by channel and date range |
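To make "flatten to rows" concrete, here is a minimal sketch of one way to turn a spreadsheet into text that can be indexed. It assumes the sheet has already been exported to CSV; the function name `flatten_rows`, the sheet label, and the sample data are all made up for illustration, not part of any specific product.

```python
import csv
import io


def flatten_rows(csv_text: str, sheet_name: str) -> list[str]:
    """Turn each data row into one self-describing line of text.

    Repeating the column name next to every value means a retrieved
    row still makes sense when it is read without the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        pairs = [
            f"{col}: {val.strip()}"
            for col, val in row.items()
            # Skip overflow cells (col is None) and empty cells.
            if col is not None and val and val.strip()
        ]
        if pairs:
            chunks.append(f"[{sheet_name}, row {row_number}] " + "; ".join(pairs))
    return chunks


# Illustrative sample data only.
sample = (
    "Client,Contract value,Renewal date\n"
    "Acme Ltd,12000,2025-03-01\n"
    "Birch & Co,,2025-06-15\n"
)

for chunk in flatten_rows(sample, "contracts"):
    print(chunk)
```

Each row becomes a standalone chunk, and empty cells are dropped rather than indexed as blanks. Whether row-level chunks or a written summary works better depends on the sheet, so treat this as a starting point.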
You do not need all your documents to be perfect before starting. You need a critical mass of accurate, current content — for most companies, that is roughly 20% of existing documents. The AI works with what it has. You add more content areas over time as the system demonstrates value.
The build process in plain terms
Identify your highest-value knowledge sources
The 20% of documents that answer 80% of the questions your team asks. Start there — not with everything.
Connect or upload those sources into a vector database
Your documents are indexed for semantic search. When a user asks a question, the system retrieves the relevant section before generating a response.
Configure access rules
Only the right people see the right content. Role-based access is set at the retrieval layer — not just the UI layer.
Deploy a conversational interface on top
Your team asks questions in plain language. The AI searches your documents, cites its source, and returns the answer.
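The retrieval and access steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: a bag-of-words count stands in for a real embedding model, an in-memory list stands in for the vector database, and the names `Chunk`, `embed`, `retrieve`, and the sample documents are hypothetical. The point it shows is the ordering: the role filter runs before ranking, so restricted content never reaches the answer step.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str               # kept so the answer can cite where it came from
    allowed_roles: frozenset  # roles permitted to retrieve this chunk


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, role: str, index: list, top_k: int = 1) -> list:
    """Filter by role first, then rank only what the user may see."""
    visible = [c for c in index if role in c.allowed_roles]
    q = embed(query)
    ranked = sorted(visible, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    return ranked[:top_k]


# Illustrative sample index only.
INDEX = [
    Chunk("Standard notice period for client contracts is 30 days.",
          "contracts/master-agreement.pdf", frozenset({"sales", "finance"})),
    Chunk("Salary bands for 2024 are reviewed each January.",
          "hr/salary-bands.docx", frozenset({"hr"})),
    Chunk("Expense claims are submitted by the 5th of each month.",
          "finance/expenses-policy.docx", frozenset({"sales", "finance", "hr"})),
]

for role in ("hr", "sales"):
    hits = retrieve("When are salary bands reviewed?", role=role, index=INDEX)
    print(role, "->", [hit.source for hit in hits])
```

With the same question, the `hr` role gets the salary document while the `sales` role cannot retrieve it at all, whatever the interface shows. A real deployment would typically apply the same filter as a metadata condition inside the vector database query.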
Updates are automatic for connected sources. When a document changes, it is re-indexed, so subsequent queries draw on the current version. There is no manual maintenance cycle unless you want to add entirely new content areas.
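One common way to keep an index current without reprocessing everything is to compare content fingerprints on each sync. The sketch below assumes that approach; it is not a description of how any particular product detects changes, and the function names and sample file paths are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(content).hexdigest()


def documents_to_reindex(current: dict[str, bytes], indexed: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content differs from the indexed version.

    `current` maps path -> latest content; `indexed` maps path -> the
    fingerprint recorded when the document was last indexed.
    """
    return sorted(
        path
        for path, content in current.items()
        if indexed.get(path) != fingerprint(content)
    )


# Illustrative sample data only.
indexed = {
    "policies/leave.md": fingerprint(b"20 days annual leave"),
    "policies/expenses.md": fingerprint(b"Submit claims by the 5th"),
}
current = {
    "policies/leave.md": b"25 days annual leave",          # edited
    "policies/expenses.md": b"Submit claims by the 5th",   # unchanged
    "policies/travel.md": b"Book economy class",           # new
}

print(documents_to_reindex(current, indexed))
```

Only the edited and the new document are selected, which keeps processing cost proportional to what actually changed.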
Sources
- IDC Data Age 2025: The Digitization of the World (idc.com)
- Gartner Market Guide for AI-Augmented Data Quality 2024 (gartner.com)
- OpenAI API pricing documentation (openai.com/pricing)