How AI Connects to Your Company Data: LLM, MCP, SQL Server, and the Knowledge Base Explained
Every organization asking “how do we use AI with our data” is really asking the same set of questions. How do we get our company knowledge into the LLM? How does it learn what our databases contain? Does it read our actual data or just the structure? What happens to our existing applications? How does MCP fit in? And where does SQL Server sit in all of this?
These questions do not have complicated answers. They have a straightforward architecture that most articles either overcomplicate with technical jargon or undercomplicate with vague diagrams. This article walks through the complete flow from the ground up, in plain language, with real SQL Server examples at every layer. By the end you will understand exactly how each piece connects and what your role as a DBA or data professional is in building and governing it.
This is the foundational article. Other SQLYARD AI articles go deep on individual components: MCP security, RAG pipeline design, vector indexes, agent patterns. This one explains how all of it fits together as a complete system. Read this first, then go deeper on whichever layer is most relevant to your situation.
- What the LLM Is and What It Cannot Do Alone
- The Knowledge Base: Your Company’s AI Memory
- What the LLM Actually Learns From (Metadata, Not Row Data)
- Step 1: Categorize Your Company Data
- Step 2: Classify Data Before Ingestion
- Step 3: The Ingestion Pipeline
- Step 4: Load Into SQL Server 2025
- Your Existing Applications Are Unaffected
- MCP: The Secure Bridge Between AI and SQL Server
- How a Question Becomes an Answer: The Complete Flow
1 What the LLM Is and What It Cannot Do Alone Beginner
A Large Language Model is a powerful reasoning and language engine. It can understand questions, write code, summarize documents, and generate intelligent responses. What it cannot do on its own is know anything about your organization.
An LLM trained on public internet data knows general SQL Server concepts. It does not know that your company calls its main database ProductionDB. It does not know that your Orders table has a special column called LegacyRefID that links to a 1998 system. It does not know your backup policy, your on-call runbook, or which stored procedures drive your most critical business processes. None of that is in its training data.
This is not a flaw. It is by design. Your company data is private and was never part of the LLM’s training. The solution is not to retrain the LLM on your data, which is enormously expensive and quickly becomes stale. The solution is to give the LLM access to your company knowledge at query time so it can answer questions grounded in your actual information rather than general knowledge.
That is what the knowledge base, the ingestion pipeline, SQL Server, and MCP are all designed to enable.
Think of the LLM as a brilliant consultant who just walked in the door. They are highly skilled and can reason through complex problems. But they know nothing about your company yet. The knowledge base is the briefing package you hand them before they start. MCP is the secure access badge that lets them query your live systems when they need current data. SQL Server is where both the briefing package and the live systems live.
2 The Knowledge Base: Your Company’s AI Memory Beginner
The knowledge base is the organized collection of your company’s knowledge stored in a format the LLM can retrieve from. It is not a single file or a single database table. It is a structured repository of documents, runbooks, policies, schema documentation, process guides, and institutional knowledge that has been processed into a form suitable for semantic search.
The key concept is that the LLM does not memorize your knowledge base. It retrieves from it at query time. When an employee asks a question, the system finds the most relevant pieces of your knowledge base, passes them to the LLM as context alongside the question, and the LLM generates an answer based on what it retrieved. This pattern is called Retrieval Augmented Generation, or RAG. The knowledge base is external to the model and can be updated at any time without retraining the model.
A 2026 enterprise RAG study found that when a system was restricted to high-quality curated content, hallucinations dropped to near zero. The same system fed unvetted data fabricated responses for 52% of out-of-scope questions. The quality of your knowledge base is the single most important factor in the quality of your AI system’s answers. This is not a technology problem. It is a data governance problem.
The most common enterprise LLM failure in 2026 is not technical. It is source data quality. The same document existing in three versions across SharePoint, email archives, and a shared drive means the AI retrieves whichever version is semantically closest to the query, which may be two years out of date. A RAG system is only as good as the knowledge it retrieves from. Garbage in, confident wrong answers out.
3 What the LLM Actually Learns From: Metadata, Not Row Data Beginner
This is the question that causes the most confusion. When people say “the AI learns from your data” what they usually mean is that the AI learns from your metadata, your schema, your documentation, and your descriptions of the data, not from the actual rows in your tables.
Your application’s actual transaction rows, customer records, and order history are not read by the LLM to make it smarter. They are queried at runtime through controlled MCP tools when a specific question requires live data. The LLM understands your data environment through documentation and schema structure, not through reading your actual data values.
What the LLM Learns From in SQL Server Terms
-- METADATA the LLM learns from (loaded into the knowledge base):
sys.tables -- your table names
sys.columns -- your column names, data types, nullability
sys.foreign_keys -- relationships between tables
sys.extended_properties -- descriptions you have written on tables and columns
sys.indexes -- your indexing strategy
Stored procedure -- names, parameters, what they do
documentation
Business glossary -- what "LegacyRefID" actually means
documentation
Process runbooks -- how to handle specific operational scenarios
Data dictionary -- definitions of domain-specific terms in your data
-- ACTUAL DATA the LLM does NOT learn from automatically:
-- Your actual rows in Orders, Customers, Transactions etc.
-- These are only accessed at query time through approved MCP tools
-- with specific permissions for specific questions
This distinction matters enormously for security. The LLM understanding that you have a table called dbo.Customers with a column called EmailAddress is different from the LLM having read every email address in that table. The metadata tells it the structure. The MCP tool controls whether and how it can query the actual values.
4 Step 1: Categorize Your Company Data Beginner
Before building anything technical, you need to decide what goes into the knowledge base. This is a business and governance decision, not a technology decision. Start by identifying the categories of knowledge your organization produces and needs to access.
Operational Knowledge
DBA runbooks, incident response procedures, maintenance checklists, escalation paths, on-call guides. The institutional knowledge that lives in people’s heads or scattered documents.
Technical Documentation
Database schema documentation, stored procedure descriptions, data dictionary, API documentation, system architecture guides, deployment procedures.
Business Policies
Data retention policies, compliance requirements, access control policies, data classification standards, regulatory requirements specific to your industry.
Historical Knowledge
Past incident reports and resolutions, post-mortems, architecture decision records, migration documentation, lessons learned from past projects.
Employee Knowledge
Subject matter expert guides, training materials, onboarding documentation, how-to guides written by experienced staff. The knowledge that leaves when people leave.
Live Data Access
This is NOT loaded into the knowledge base. Live operational data is accessed at query time through MCP tools with specific permissions. Kept completely separate from the knowledge base.
IEEE research from 2026 confirmed that adding structured metadata to every ingested document lifts RAG retrieval precision by over 9 percentage points. Every document loaded into your knowledge base should carry at minimum: owner, last verified date, classification level, department, and document type. Without last verified date you cannot detect stale content. Without classification you cannot enforce access controls at retrieval time.
5 Step 2: Classify Data Before Ingestion Intermediate
Every document must be classified before it enters the knowledge base. This is not optional. Organizations that expose unclassified data to the retrieval pipeline face what security researchers call “blast radius” expansion at inference time: sensitive information retrieved and surfaced to users who should not have access to it. The European Data Protection Board identified this as the primary LLM data leakage vector in 2025 guidance.
Use a four-level classification:
| Level | Definition | Access | Example |
|---|---|---|---|
| Public | Safe for all users including external | All users | General SQL Server best practices, public documentation |
| Internal | Safe for all employees | All employees | General company processes, standard runbooks |
| Confidential | Department or role level access | Specific roles or teams | DBA runbooks with server names, financial procedures |
| Restricted | Named individuals only | Explicit named access | Security configurations, credentials documentation, PII-adjacent data |
-- Store classification metadata with every chunk in SQL Server
CREATE TABLE dbo.KnowledgeBase (
ChunkID BIGINT IDENTITY(1,1) PRIMARY KEY,
DocumentName NVARCHAR(500) NOT NULL,
DocumentType NVARCHAR(50) NOT NULL,
Department NVARCHAR(100) NULL,
Classification NVARCHAR(20) NOT NULL
CHECK (Classification IN
('Public','Internal','Confidential','Restricted')),
Owner NVARCHAR(200) NOT NULL,
LastVerified DATE NOT NULL,
ChunkText NVARCHAR(MAX) NOT NULL,
Embedding VECTOR(1536) NOT NULL,
IsActive BIT NOT NULL DEFAULT 1,
CreatedAt DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
-- Row-level security enforces classification at retrieval time
-- Users only get results they are authorized to see
-- Classification at ingestion time makes this enforcement possible
6 Step 3: The Ingestion Pipeline Intermediate
The ingestion pipeline takes your raw documents and transforms them into vector embeddings stored in SQL Server. It has four stages: collect, chunk, embed, and load. Each stage has decisions that affect retrieval quality.
# Python ingestion pipeline for a SQL Server knowledge base
# pip install openai pyodbc python-docx PyPDF2
import openai
import pyodbc
import json
from datetime import date
# Configuration
CONN_STRING = (
"DRIVER={ODBC Driver 18 for SQL Server};"
"SERVER=your-server;"
"DATABASE=KnowledgeDB;"
"Trusted_Connection=yes;"
"Encrypt=yes;"
)
CHUNK_SIZE = 500 # tokens per chunk (approximately 375 words)
CHUNK_OVERLAP = 50 # tokens of overlap between chunks
client = openai.OpenAI()
def chunk_text(text: str) -> list[str]:
"""Split text into overlapping chunks."""
words = text.split()
chunks = []
start = 0
while start < len(words):
end = min(start + CHUNK_SIZE, len(words))
chunk = ' '.join(words[start:end])
chunks.append(chunk)
start += CHUNK_SIZE - CHUNK_OVERLAP
return chunks
def generate_embedding(text: str) -> list[float]:
"""Convert text to a vector embedding via OpenAI."""
response = client.embeddings.create(
model = "text-embedding-3-small",
input = text
)
return response.data[0].embedding
def ingest_document(
document_text: str,
document_name: str,
document_type: str,
classification: str,
department: str,
owner: str,
last_verified: date
):
"""
Full ingestion pipeline for one document.
Chunk, embed, and load into SQL Server with metadata.
"""
conn = pyodbc.connect(CONN_STRING)
cursor = conn.cursor()
chunks = chunk_text(document_text)
for i, chunk in enumerate(chunks):
embedding = generate_embedding(chunk)
cursor.execute("""
INSERT INTO dbo.KnowledgeBase
(DocumentName, DocumentType, Department, Classification,
Owner, LastVerified, ChunkText, Embedding)
VALUES (?, ?, ?, ?, ?, ?, ?, CAST(? AS VECTOR(1536)))
""",
document_name,
document_type,
department,
classification,
owner,
last_verified,
chunk,
str(embedding)
)
conn.commit()
conn.close()
print(f"Loaded {len(chunks)} chunks from: {document_name}")
# Example: ingest a DBA runbook
with open('suspect_database_runbook.txt', 'r') as f:
runbook_text = f.read()
ingest_document(
document_text = runbook_text,
document_name = "Suspect Database Recovery Runbook v2.1",
document_type = "runbook",
classification = "Confidential",
department = "Database Administration",
owner = "DBA Team",
last_verified = date(2026, 5, 1)
)
Chunking Strategy by Document Type
| Document Type | Chunk Strategy | Why |
|---|---|---|
| Runbooks and procedures | 300 to 500 tokens with 75 token overlap | Step sequences need enough context to be actionable without losing procedure continuity across chunks |
| Policy documents | 200 to 400 tokens with 50 token overlap | Policies have section-level meaning. Smaller chunks improve precision when retrieving specific rules |
| Schema documentation | One chunk per table description | Table context should stay together. Splitting a table description across chunks loses the relational meaning |
| Incident reports | Whole document if under 800 tokens | Incidents are self-contained narratives. Chunking loses the cause-effect relationship |
| Long technical articles | 400 to 600 tokens with 100 token overlap | Larger overlap preserves the argument being made across section boundaries |
7 Step 4: Load Into SQL Server 2025 Intermediate
SQL Server 2025 stores the knowledge base natively using the VECTOR data type and DiskANN index. This means no separate vector database (Pinecone, Chroma, Weaviate) is needed. Everything lives in one place with the governance, security, backup, and high availability features SQL Server already provides.
-- Complete knowledge base schema in SQL Server 2025
USE KnowledgeDB;
GO
CREATE TABLE dbo.KnowledgeBase (
ChunkID BIGINT IDENTITY(1,1) PRIMARY KEY,
DocumentName NVARCHAR(500) NOT NULL,
DocumentType NVARCHAR(50) NOT NULL,
Department NVARCHAR(100) NULL,
Classification NVARCHAR(20) NOT NULL
CHECK (Classification IN
('Public','Internal','Confidential','Restricted')),
Owner NVARCHAR(200) NOT NULL,
LastVerified DATE NOT NULL,
ChunkText NVARCHAR(MAX) NOT NULL,
Embedding VECTOR(1536) NOT NULL,
IsActive BIT NOT NULL DEFAULT 1,
CreatedAt DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
-- DiskANN vector index for fast semantic similarity search
CREATE VECTOR INDEX IX_KB_Embedding
ON dbo.KnowledgeBase (Embedding)
WITH (METRIC = 'cosine', TYPE = 'diskann');
-- Supporting index for classification-scoped retrieval
-- Applied before vector search to reduce search space
CREATE NONCLUSTERED INDEX IX_KB_Classification_Type
ON dbo.KnowledgeBase (Classification, DocumentType, IsActive)
INCLUDE (ChunkText, DocumentName, Department);
-- Verify what is in the knowledge base
SELECT
DocumentType,
Classification,
Department,
COUNT(*) AS ChunkCount,
COUNT(DISTINCT DocumentName) AS DocumentCount,
MIN(LastVerified) AS OldestVerified,
MAX(LastVerified) AS NewestVerified
FROM dbo.KnowledgeBase
WHERE IsActive = 1
GROUP BY DocumentType, Classification, Department
ORDER BY DocumentType, Classification;
8 Your Existing Applications Are Unaffected Beginner
This is the question every application owner and DBA asks before approving an AI deployment. The answer is clear: your existing applications connect to SQL Server exactly the same way they always have. Nothing changes for them.
Your .NET application, Java application, SSIS packages, reporting tools, and every other system using connection strings to reach SQL Server continues working identically. SQL Server does not know or care that an LLM is also connected through MCP. The two access paths are completely independent, use separate database accounts with separate permissions, and have no awareness of each other.
The key design principle is that the AI access layer is additive. You are adding a new controlled access path alongside the existing ones. You are not modifying existing access paths, changing schemas that applications depend on, or introducing any dependencies between the application layer and the AI layer.
9 MCP: The Secure Bridge Between AI and SQL Server Intermediate
The Model Context Protocol is the standard that connects AI tools to external systems including SQL Server. An MCP server sits between the AI client (Claude, ChatGPT, Amazon Q, your internal assistant) and SQL Server. It translates AI tool call requests into SQL queries, enforces permissions, logs every action, and returns results in a format the AI can use.
There are two separate MCP connections in a complete deployment: one to the knowledge base for document retrieval, and one to your operational data for live queries. They use separate accounts with separate permissions.
-- Two separate SQL accounts for two separate MCP connections
-- Account 1: Knowledge base retrieval (read the vector store)
CREATE LOGIN mcp_knowledge_reader
WITH PASSWORD = 'StrongPassword2026!';
USE KnowledgeDB;
CREATE USER mcp_knowledge_reader FOR LOGIN mcp_knowledge_reader;
GRANT SELECT ON dbo.KnowledgeBase TO mcp_knowledge_reader;
GRANT VIEW DATABASE STATE TO mcp_knowledge_reader;
-- Account 2: Operational data queries (read-only, scoped tables only)
CREATE LOGIN mcp_ops_reader
WITH PASSWORD = 'DifferentStrongPassword2026!';
USE ProductionDB;
CREATE USER mcp_ops_reader FOR LOGIN mcp_ops_reader;
-- Grant SELECT only on approved tables for AI queries
-- Never grant db_datareader -- too broad
GRANT SELECT ON dbo.OrderSummary TO mcp_ops_reader;
GRANT SELECT ON dbo.PerformanceStats TO mcp_ops_reader;
GRANT VIEW DATABASE STATE TO mcp_ops_reader;
-- Deny access to sensitive tables even within approved schemas
DENY SELECT ON dbo.CustomerPII TO mcp_ops_reader;
DENY SELECT ON dbo.PaymentDetails TO mcp_ops_reader;
DENY SELECT ON dbo.EmployeeRecords TO mcp_ops_reader;
For a complete production MCP server implementation for SQL Server including forbidden keyword blocking, row caps, query timeouts, and audit logging, see the SQLYARD article: MCP and SQL Server: What Every DBA Needs to Know.
10 How a Question Becomes an Answer: The Complete Flow Beginner
Here is exactly what happens from the moment an employee types a question to the moment they receive an answer grounded in your company’s actual knowledge and data.
The entire flow from question to answer typically takes 2 to 5 seconds in a well-designed deployment. The knowledge base retrieval and the live data query run in parallel where possible. The LLM synthesizes both into one coherent response.
11 Keeping the Knowledge Base Current Intermediate
A knowledge base that is not maintained becomes a liability. The AI will answer confidently from stale runbooks, superseded policies, and documentation written by people who left the company two years ago. Freshness is a governance problem, not a technology problem. The solution is treating knowledge base maintenance as an operational process, not a one-time setup task.
-- Monitor knowledge base freshness
-- Flag documents that have not been verified in over 90 days
SELECT
DocumentName,
DocumentType,
Department,
Owner,
LastVerified,
DATEDIFF(DAY, LastVerified, GETDATE()) AS DaysSinceVerified,
CASE
WHEN DATEDIFF(DAY, LastVerified, GETDATE()) > 180
THEN 'CRITICAL: Over 6 months old'
WHEN DATEDIFF(DAY, LastVerified, GETDATE()) > 90
THEN 'WARNING: Over 90 days old'
ELSE 'Current'
END AS FreshnessStatus
FROM (
SELECT DISTINCT
DocumentName,
DocumentType,
Department,
Owner,
LastVerified
FROM dbo.KnowledgeBase
WHERE IsActive = 1
) docs
WHERE DATEDIFF(DAY, LastVerified, GETDATE()) > 90
ORDER BY DaysSinceVerified DESC;
-- Deactivate outdated documents rather than deleting them
-- Keeps audit trail while removing from retrieval
UPDATE dbo.KnowledgeBase
SET IsActive = 0
WHERE DocumentName = 'Old Runbook v1.0'
AND LastVerified < '2025-01-01';
-- Re-ingest an updated document
-- Run the Python ingestion script with the new version
-- Then deactivate the old version
Recommended Verification Schedule by Document Type
- Runbooks and procedures: quarterly review, update whenever the underlying system changes
- Schema documentation: automated regeneration on every schema change via a DDL trigger or deployment pipeline
- Security and compliance policies: annual review minimum, immediately on regulatory changes
- Incident reports and resolutions: load once, no expiry (historical record)
- Employee knowledge guides: ownership-based review, flag when the owner leaves the organization
12 The DBA Role in the AI Data Layer Beginner
Every layer of this architecture is a data engineering and database administration problem. The DBA does not need to build the LLM. The DBA does need to own the data layer that makes the LLM useful and safe.
- Knowledge base schema design. The vector table, classification columns, freshness tracking, and access control structure are database design decisions.
- Vector index management. DiskANN indexes need monitoring and periodic maintenance just like B-tree indexes.
- MCP security governance. The dedicated SQL accounts, permission grants and denies, and audit configuration for MCP connections are DBA responsibilities.
- Data quality governance. Stale knowledge base content produces wrong answers. The DBA team is best positioned to build the monitoring queries and alerting that flag outdated content.
- Schema documentation quality. The LLM's ability to generate correct SQL against your operational databases depends on the quality of your schema documentation in the knowledge base. Meaningful table names, column descriptions, and extended properties are now directly tied to AI output quality.
- Performance monitoring. AI workloads generate distinctive query patterns. Vector searches, parallel embedding lookups, and high-frequency small queries from agent sessions need to be understood and indexed for appropriately.
The DBA who builds and owns this layer is the person making the AI deployment work safely and reliably. Developers building AI applications do not naturally think about data classification, stale content governance, least-privilege SQL accounts, or audit logging at the database layer. Those skills are exactly what makes a senior DBA or data architect uniquely valuable in 2026. The technology changed. The discipline did not.
13 The Complete Architecture Summary Beginner
Here is the complete picture from company data to employee answer. Every layer has a clear owner and a clear purpose.
| Layer | Owner | Technology | Key Decision |
|---|---|---|---|
| Data Sources | Business + DBA | Documents, SharePoint, exports | What knowledge matters most |
| Knowledge Base | DBA | SQL Server 2025 VECTOR tables | Classification and chunking strategy |
| MCP Servers | DBA + Security | FastMCP, Python | What the AI is allowed to access |
| AI Client | Application team | Claude, Amazon Q, custom | Which AI tool serves which users |
| Existing Apps | Application team | Unchanged | No changes needed |
References
- Atlan: How to Build an LLM Knowledge Base for Enterprise (April 2026)
- Atlan: Enterprise LLM Knowledge Base Architecture (April 2026)
- Atlan: Preparing Data for LLM Knowledge Bases: Governance and Readiness (April 2026)
- Microsoft Docs: Vector Support in SQL Server 2025
- Microsoft Docs: CREATE VECTOR INDEX (DiskANN)
- Model Context Protocol: Official Specification
- SQLYARD: MCP and SQL Server: What Every DBA Needs to Know
- SQLYARD: What Every Senior DBA Needs to Know About LLMs, RAG, Agents, and MCP
- SQLYARD: From Rows to Reasoning: Designing SQL Server Databases for AI Apps and Agents
- SQLYARD: SQL Server Index Tuning: From Basics to AI-Assisted Recommendations
Discover more from SQLYARD
Subscribe to get the latest posts sent to your email.


