How AI Connects to Your Company Data: LLM, MCP, SQL Server, and the Knowledge Base Explained

How AI Connects to Your Company Data: LLM, MCP, SQL Server, and the Knowledge Base Explained – SQLYARD

How AI Connects to Your Company Data: LLM, MCP, SQL Server, and the Knowledge Base Explained


Every organization asking “how do we use AI with our data” is really asking the same set of questions. How do we get our company knowledge into the LLM? How does it learn what our databases contain? Does it read our actual data or just the structure? What happens to our existing applications? How does MCP fit in? And where does SQL Server sit in all of this?

These questions do not have complicated answers. They have a straightforward architecture that most articles either overcomplicate with technical jargon or undercomplicate with vague diagrams. This article walks through the complete flow from the ground up, in plain language, with real SQL Server examples at every layer. By the end you will understand exactly how each piece connects and what your role as a DBA or data professional is in building and governing it.

This is the foundational article. Other SQLYARD AI articles go deep on individual components: MCP security, RAG pipeline design, vector indexes, agent patterns. This one explains how all of it fits together as a complete system. Read this first, then go deeper on whichever layer is most relevant to your situation.

1 What the LLM Is and What It Cannot Do Alone Beginner

A Large Language Model is a powerful reasoning and language engine. It can understand questions, write code, summarize documents, and generate intelligent responses. What it cannot do on its own is know anything about your organization.

An LLM trained on public internet data knows general SQL Server concepts. It does not know that your company calls its main database ProductionDB. It does not know that your Orders table has a special column called LegacyRefID that links to a 1998 system. It does not know your backup policy, your on-call runbook, or which stored procedures drive your most critical business processes. None of that is in its training data.

This is not a flaw. It is by design. Your company data is private and was never part of the LLM’s training. The solution is not to retrain the LLM on your data, which is enormously expensive and quickly becomes stale. The solution is to give the LLM access to your company knowledge at query time so it can answer questions grounded in your actual information rather than general knowledge.

That is what the knowledge base, the ingestion pipeline, SQL Server, and MCP are all designed to enable.

Think of the LLM as a brilliant consultant who just walked in the door. They are highly skilled and can reason through complex problems. But they know nothing about your company yet. The knowledge base is the briefing package you hand them before they start. MCP is the secure access badge that lets them query your live systems when they need current data. SQL Server is where both the briefing package and the live systems live.

2 The Knowledge Base: Your Company’s AI Memory Beginner

The knowledge base is the organized collection of your company’s knowledge stored in a format the LLM can retrieve from. It is not a single file or a single database table. It is a structured repository of documents, runbooks, policies, schema documentation, process guides, and institutional knowledge that has been processed into a form suitable for semantic search.

The key concept is that the LLM does not memorize your knowledge base. It retrieves from it at query time. When an employee asks a question, the system finds the most relevant pieces of your knowledge base, passes them to the LLM as context alongside the question, and the LLM generates an answer based on what it retrieved. This pattern is called Retrieval Augmented Generation, or RAG. The knowledge base is external to the model and can be updated at any time without retraining the model.

A 2026 enterprise RAG study found that when a system was restricted to high-quality curated content, hallucinations dropped to near zero. The same system fed unvetted data fabricated responses for 52% of out-of-scope questions. The quality of your knowledge base is the single most important factor in the quality of your AI system’s answers. This is not a technology problem. It is a data governance problem.

The most common enterprise LLM failure in 2026 is not technical. It is source data quality. The same document existing in three versions across SharePoint, email archives, and a shared drive means the AI retrieves whichever version is semantically closest to the query, which may be two years out of date. A RAG system is only as good as the knowledge it retrieves from. Garbage in, confident wrong answers out.

3 What the LLM Actually Learns From: Metadata, Not Row Data Beginner

This is the question that causes the most confusion. When people say “the AI learns from your data” what they usually mean is that the AI learns from your metadata, your schema, your documentation, and your descriptions of the data, not from the actual rows in your tables.

Your application’s actual transaction rows, customer records, and order history are not read by the LLM to make it smarter. They are queried at runtime through controlled MCP tools when a specific question requires live data. The LLM understands your data environment through documentation and schema structure, not through reading your actual data values.

What the LLM Learns From in SQL Server Terms

-- METADATA the LLM learns from (loaded into the knowledge base):
sys.tables          -- your table names
sys.columns         -- your column names, data types, nullability
sys.foreign_keys    -- relationships between tables
sys.extended_properties  -- descriptions you have written on tables and columns
sys.indexes         -- your indexing strategy
Stored procedure    -- names, parameters, what they do
  documentation
Business glossary   -- what "LegacyRefID" actually means
  documentation
Process runbooks    -- how to handle specific operational scenarios
Data dictionary     -- definitions of domain-specific terms in your data

-- ACTUAL DATA the LLM does NOT learn from automatically:
-- Your actual rows in Orders, Customers, Transactions etc.
-- These are only accessed at query time through approved MCP tools
-- with specific permissions for specific questions

This distinction matters enormously for security. The LLM understanding that you have a table called dbo.Customers with a column called EmailAddress is different from the LLM having read every email address in that table. The metadata tells it the structure. The MCP tool controls whether and how it can query the actual values.

4 Step 1: Categorize Your Company Data Beginner

Before building anything technical, you need to decide what goes into the knowledge base. This is a business and governance decision, not a technology decision. Start by identifying the categories of knowledge your organization produces and needs to access.

Operational Knowledge

DBA runbooks, incident response procedures, maintenance checklists, escalation paths, on-call guides. The institutional knowledge that lives in people’s heads or scattered documents.

Technical Documentation

Database schema documentation, stored procedure descriptions, data dictionary, API documentation, system architecture guides, deployment procedures.

Business Policies

Data retention policies, compliance requirements, access control policies, data classification standards, regulatory requirements specific to your industry.

Historical Knowledge

Past incident reports and resolutions, post-mortems, architecture decision records, migration documentation, lessons learned from past projects.

Employee Knowledge

Subject matter expert guides, training materials, onboarding documentation, how-to guides written by experienced staff. The knowledge that leaves when people leave.

Live Data Access

This is NOT loaded into the knowledge base. Live operational data is accessed at query time through MCP tools with specific permissions. Kept completely separate from the knowledge base.

IEEE research from 2026 confirmed that adding structured metadata to every ingested document lifts RAG retrieval precision by over 9 percentage points. Every document loaded into your knowledge base should carry at minimum: owner, last verified date, classification level, department, and document type. Without last verified date you cannot detect stale content. Without classification you cannot enforce access controls at retrieval time.

5 Step 2: Classify Data Before Ingestion Intermediate

Every document must be classified before it enters the knowledge base. This is not optional. Organizations that expose unclassified data to the retrieval pipeline face what security researchers call “blast radius” expansion at inference time: sensitive information retrieved and surfaced to users who should not have access to it. The European Data Protection Board identified this as the primary LLM data leakage vector in 2025 guidance.

Use a four-level classification:

LevelDefinitionAccessExample
Public Safe for all users including external All users General SQL Server best practices, public documentation
Internal Safe for all employees All employees General company processes, standard runbooks
Confidential Department or role level access Specific roles or teams DBA runbooks with server names, financial procedures
Restricted Named individuals only Explicit named access Security configurations, credentials documentation, PII-adjacent data
-- Store classification metadata with every chunk in SQL Server
CREATE TABLE dbo.KnowledgeBase (
    ChunkID         BIGINT          IDENTITY(1,1) PRIMARY KEY,
    DocumentName    NVARCHAR(500)   NOT NULL,
    DocumentType    NVARCHAR(50)    NOT NULL,
    Department      NVARCHAR(100)   NULL,
    Classification  NVARCHAR(20)    NOT NULL
                    CHECK (Classification IN
                        ('Public','Internal','Confidential','Restricted')),
    Owner           NVARCHAR(200)   NOT NULL,
    LastVerified    DATE            NOT NULL,
    ChunkText       NVARCHAR(MAX)   NOT NULL,
    Embedding       VECTOR(1536)    NOT NULL,
    IsActive        BIT             NOT NULL DEFAULT 1,
    CreatedAt       DATETIME2       NOT NULL DEFAULT SYSDATETIME()
);

-- Row-level security enforces classification at retrieval time
-- Users only get results they are authorized to see
-- Classification at ingestion time makes this enforcement possible

6 Step 3: The Ingestion Pipeline Intermediate

The ingestion pipeline takes your raw documents and transforms them into vector embeddings stored in SQL Server. It has four stages: collect, chunk, embed, and load. Each stage has decisions that affect retrieval quality.

Raw Documents (Word docs, PDFs, runbooks, schema exports, emails, SharePoint, Confluence) │ ▼ STAGE 1: COLLECT Gather source documents Assign metadata tags (owner, date, classification, department, type) Remove duplicates and stale versions │ ▼ STAGE 2: CHUNK Split into pieces the LLM can process in one context window slot (500 tokens per chunk typical, 50 token overlap between chunks to preserve sentence continuity) │ ▼ STAGE 3: EMBED Convert each chunk to a vector (list of 1,536 numbers representing semantic meaning) via an embedding model API (Azure OpenAI, OpenAI, local model via Ollama) │ ▼ STAGE 4: LOAD Store chunk text + embedding + metadata in SQL Server 2025 VECTOR table with DiskANN index for fast similarity search
# Python ingestion pipeline for a SQL Server knowledge base
# pip install openai pyodbc python-docx PyPDF2

import openai
import pyodbc
import json
from datetime import date

# Configuration
CONN_STRING = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-server;"
    "DATABASE=KnowledgeDB;"
    "Trusted_Connection=yes;"
    "Encrypt=yes;"
)
CHUNK_SIZE    = 500    # tokens per chunk (approximately 375 words)
CHUNK_OVERLAP = 50     # tokens of overlap between chunks

client = openai.OpenAI()

def chunk_text(text: str) -> list[str]:
    """Split text into overlapping chunks."""
    words  = text.split()
    chunks = []
    start  = 0
    while start < len(words):
        end   = min(start + CHUNK_SIZE, len(words))
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def generate_embedding(text: str) -> list[float]:
    """Convert text to a vector embedding via OpenAI."""
    response = client.embeddings.create(
        model = "text-embedding-3-small",
        input = text
    )
    return response.data[0].embedding

def ingest_document(
    document_text:   str,
    document_name:   str,
    document_type:   str,
    classification:  str,
    department:      str,
    owner:           str,
    last_verified:   date
):
    """
    Full ingestion pipeline for one document.
    Chunk, embed, and load into SQL Server with metadata.
    """
    conn   = pyodbc.connect(CONN_STRING)
    cursor = conn.cursor()
    chunks = chunk_text(document_text)

    for i, chunk in enumerate(chunks):
        embedding = generate_embedding(chunk)

        cursor.execute("""
            INSERT INTO dbo.KnowledgeBase
                (DocumentName, DocumentType, Department, Classification,
                 Owner, LastVerified, ChunkText, Embedding)
            VALUES (?, ?, ?, ?, ?, ?, ?, CAST(? AS VECTOR(1536)))
        """,
            document_name,
            document_type,
            department,
            classification,
            owner,
            last_verified,
            chunk,
            str(embedding)
        )

    conn.commit()
    conn.close()
    print(f"Loaded {len(chunks)} chunks from: {document_name}")

# Example: ingest a DBA runbook
with open('suspect_database_runbook.txt', 'r') as f:
    runbook_text = f.read()

ingest_document(
    document_text  = runbook_text,
    document_name  = "Suspect Database Recovery Runbook v2.1",
    document_type  = "runbook",
    classification = "Confidential",
    department     = "Database Administration",
    owner          = "DBA Team",
    last_verified  = date(2026, 5, 1)
)

Chunking Strategy by Document Type

Document TypeChunk StrategyWhy
Runbooks and procedures 300 to 500 tokens with 75 token overlap Step sequences need enough context to be actionable without losing procedure continuity across chunks
Policy documents 200 to 400 tokens with 50 token overlap Policies have section-level meaning. Smaller chunks improve precision when retrieving specific rules
Schema documentation One chunk per table description Table context should stay together. Splitting a table description across chunks loses the relational meaning
Incident reports Whole document if under 800 tokens Incidents are self-contained narratives. Chunking loses the cause-effect relationship
Long technical articles 400 to 600 tokens with 100 token overlap Larger overlap preserves the argument being made across section boundaries

7 Step 4: Load Into SQL Server 2025 Intermediate

SQL Server 2025 stores the knowledge base natively using the VECTOR data type and DiskANN index. This means no separate vector database (Pinecone, Chroma, Weaviate) is needed. Everything lives in one place with the governance, security, backup, and high availability features SQL Server already provides.

-- Complete knowledge base schema in SQL Server 2025
USE KnowledgeDB;
GO

CREATE TABLE dbo.KnowledgeBase (
    ChunkID         BIGINT          IDENTITY(1,1) PRIMARY KEY,
    DocumentName    NVARCHAR(500)   NOT NULL,
    DocumentType    NVARCHAR(50)    NOT NULL,
    Department      NVARCHAR(100)   NULL,
    Classification  NVARCHAR(20)    NOT NULL
                    CHECK (Classification IN
                        ('Public','Internal','Confidential','Restricted')),
    Owner           NVARCHAR(200)   NOT NULL,
    LastVerified    DATE            NOT NULL,
    ChunkText       NVARCHAR(MAX)   NOT NULL,
    Embedding       VECTOR(1536)    NOT NULL,
    IsActive        BIT             NOT NULL DEFAULT 1,
    CreatedAt       DATETIME2       NOT NULL DEFAULT SYSDATETIME()
);

-- DiskANN vector index for fast semantic similarity search
CREATE VECTOR INDEX IX_KB_Embedding
ON dbo.KnowledgeBase (Embedding)
WITH (METRIC = 'cosine', TYPE = 'diskann');

-- Supporting index for classification-scoped retrieval
-- Applied before vector search to reduce search space
CREATE NONCLUSTERED INDEX IX_KB_Classification_Type
ON dbo.KnowledgeBase (Classification, DocumentType, IsActive)
INCLUDE (ChunkText, DocumentName, Department);

-- Verify what is in the knowledge base
SELECT
    DocumentType,
    Classification,
    Department,
    COUNT(*)            AS ChunkCount,
    COUNT(DISTINCT DocumentName) AS DocumentCount,
    MIN(LastVerified)   AS OldestVerified,
    MAX(LastVerified)   AS NewestVerified
FROM dbo.KnowledgeBase
WHERE IsActive = 1
GROUP BY DocumentType, Classification, Department
ORDER BY DocumentType, Classification;

8 Your Existing Applications Are Unaffected Beginner

This is the question every application owner and DBA asks before approving an AI deployment. The answer is clear: your existing applications connect to SQL Server exactly the same way they always have. Nothing changes for them.

Your .NET application, Java application, SSIS packages, reporting tools, and every other system using connection strings to reach SQL Server continues working identically. SQL Server does not know or care that an LLM is also connected through MCP. The two access paths are completely independent, use separate database accounts with separate permissions, and have no awareness of each other.

EXISTING APPLICATIONS AI TOOLS VIA MCP (unchanged, same connection) (new separate path) │ │ │ Connection string │ MCP Protocol │ ADO.NET / ODBC │ JSON-RPC │ Existing SQL account │ Dedicated read-only account ▼ ▼ SQL Server (same instance) Serves both simultaneously Treats as completely separate sessions Separate permissions, separate audit logs

The key design principle is that the AI access layer is additive. You are adding a new controlled access path alongside the existing ones. You are not modifying existing access paths, changing schemas that applications depend on, or introducing any dependencies between the application layer and the AI layer.

9 MCP: The Secure Bridge Between AI and SQL Server Intermediate

The Model Context Protocol is the standard that connects AI tools to external systems including SQL Server. An MCP server sits between the AI client (Claude, ChatGPT, Amazon Q, your internal assistant) and SQL Server. It translates AI tool call requests into SQL queries, enforces permissions, logs every action, and returns results in a format the AI can use.

There are two separate MCP connections in a complete deployment: one to the knowledge base for document retrieval, and one to your operational data for live queries. They use separate accounts with separate permissions.

-- Two separate SQL accounts for two separate MCP connections

-- Account 1: Knowledge base retrieval (read the vector store)
CREATE LOGIN mcp_knowledge_reader
    WITH PASSWORD = 'StrongPassword2026!';
USE KnowledgeDB;
CREATE USER mcp_knowledge_reader FOR LOGIN mcp_knowledge_reader;
GRANT SELECT ON dbo.KnowledgeBase TO mcp_knowledge_reader;
GRANT VIEW DATABASE STATE         TO mcp_knowledge_reader;

-- Account 2: Operational data queries (read-only, scoped tables only)
CREATE LOGIN mcp_ops_reader
    WITH PASSWORD = 'DifferentStrongPassword2026!';
USE ProductionDB;
CREATE USER mcp_ops_reader FOR LOGIN mcp_ops_reader;

-- Grant SELECT only on approved tables for AI queries
-- Never grant db_datareader -- too broad
GRANT SELECT ON dbo.OrderSummary     TO mcp_ops_reader;
GRANT SELECT ON dbo.PerformanceStats TO mcp_ops_reader;
GRANT VIEW DATABASE STATE            TO mcp_ops_reader;

-- Deny access to sensitive tables even within approved schemas
DENY SELECT ON dbo.CustomerPII       TO mcp_ops_reader;
DENY SELECT ON dbo.PaymentDetails    TO mcp_ops_reader;
DENY SELECT ON dbo.EmployeeRecords   TO mcp_ops_reader;

For a complete production MCP server implementation for SQL Server including forbidden keyword blocking, row caps, query timeouts, and audit logging, see the SQLYARD article: MCP and SQL Server: What Every DBA Needs to Know.

10 How a Question Becomes an Answer: The Complete Flow Beginner

Here is exactly what happens from the moment an employee types a question to the moment they receive an answer grounded in your company’s actual knowledge and data.

Employee asks: “What is the procedure for a suspect database and are there any blocked sessions on ProductionDB right now?” │ ▼ AI CLIENT (Claude, Amazon Q etc.) LLM receives the question Decides it needs two things: 1. Knowledge base retrieval (procedure) 2. Live data query (blocked sessions) │ ┌──────────┴──────────┐ ▼ ▼ KNOWLEDGE BASE MCP OPERATIONAL MCP Convert question Call get_blocking_sessions to vector embedding tool on ProductionDB │ │ ▼ ▼ SQL Server: VECTOR_SEARCH SQL Server: SELECT from finds top 5 chunks from sys.dm_exec_requests suspect database runbook WHERE blocking_session_id > 0 │ │ └──────────┬──────────┘ │ ▼ LLM receives both: – Runbook chunks as context – Live blocking session data │ ▼ LLM generates answer: “The procedure for a suspect database is: 1. Run sp_resetstatus 2. Set EMERGENCY mode… [from your runbook] Currently on ProductionDB, session 54 is blocking 3 other sessions. It has been running for 8 minutes on dbo.Orders…” │ ▼ Employee receives a grounded, accurate, company-specific answer in plain English

The entire flow from question to answer typically takes 2 to 5 seconds in a well-designed deployment. The knowledge base retrieval and the live data query run in parallel where possible. The LLM synthesizes both into one coherent response.

11 Keeping the Knowledge Base Current Intermediate

A knowledge base that is not maintained becomes a liability. The AI will answer confidently from stale runbooks, superseded policies, and documentation written by people who left the company two years ago. Freshness is a governance problem, not a technology problem. The solution is treating knowledge base maintenance as an operational process, not a one-time setup task.

-- Monitor knowledge base freshness
-- Flag documents that have not been verified in over 90 days

SELECT
    DocumentName,
    DocumentType,
    Department,
    Owner,
    LastVerified,
    DATEDIFF(DAY, LastVerified, GETDATE())  AS DaysSinceVerified,
    CASE
        WHEN DATEDIFF(DAY, LastVerified, GETDATE()) > 180
        THEN 'CRITICAL: Over 6 months old'
        WHEN DATEDIFF(DAY, LastVerified, GETDATE()) > 90
        THEN 'WARNING: Over 90 days old'
        ELSE 'Current'
    END                                     AS FreshnessStatus
FROM (
    SELECT DISTINCT
        DocumentName,
        DocumentType,
        Department,
        Owner,
        LastVerified
    FROM dbo.KnowledgeBase
    WHERE IsActive = 1
) docs
WHERE DATEDIFF(DAY, LastVerified, GETDATE()) > 90
ORDER BY DaysSinceVerified DESC;

-- Deactivate outdated documents rather than deleting them
-- Keeps audit trail while removing from retrieval
UPDATE dbo.KnowledgeBase
SET    IsActive = 0
WHERE  DocumentName = 'Old Runbook v1.0'
AND    LastVerified < '2025-01-01';

-- Re-ingest an updated document
-- Run the Python ingestion script with the new version
-- Then deactivate the old version

Recommended Verification Schedule by Document Type

  • Runbooks and procedures: quarterly review, update whenever the underlying system changes
  • Schema documentation: automated regeneration on every schema change via a DDL trigger or deployment pipeline
  • Security and compliance policies: annual review minimum, immediately on regulatory changes
  • Incident reports and resolutions: load once, no expiry (historical record)
  • Employee knowledge guides: ownership-based review, flag when the owner leaves the organization

12 The DBA Role in the AI Data Layer Beginner

Every layer of this architecture is a data engineering and database administration problem. The DBA does not need to build the LLM. The DBA does need to own the data layer that makes the LLM useful and safe.

  • Knowledge base schema design. The vector table, classification columns, freshness tracking, and access control structure are database design decisions.
  • Vector index management. DiskANN indexes need monitoring and periodic maintenance just like B-tree indexes.
  • MCP security governance. The dedicated SQL accounts, permission grants and denies, and audit configuration for MCP connections are DBA responsibilities.
  • Data quality governance. Stale knowledge base content produces wrong answers. The DBA team is best positioned to build the monitoring queries and alerting that flag outdated content.
  • Schema documentation quality. The LLM's ability to generate correct SQL against your operational databases depends on the quality of your schema documentation in the knowledge base. Meaningful table names, column descriptions, and extended properties are now directly tied to AI output quality.
  • Performance monitoring. AI workloads generate distinctive query patterns. Vector searches, parallel embedding lookups, and high-frequency small queries from agent sessions need to be understood and indexed for appropriately.

The DBA who builds and owns this layer is the person making the AI deployment work safely and reliably. Developers building AI applications do not naturally think about data classification, stale content governance, least-privilege SQL accounts, or audit logging at the database layer. Those skills are exactly what makes a senior DBA or data architect uniquely valuable in 2026. The technology changed. The discipline did not.

13 The Complete Architecture Summary Beginner

Here is the complete picture from company data to employee answer. Every layer has a clear owner and a clear purpose.

LAYER 1: YOUR COMPANY DATA SOURCES Runbooks, policies, schema docs, incident reports, employee knowledge, process guides, training materials │ ▼ Classify, chunk, embed, load LAYER 2: SQL SERVER 2025 KNOWLEDGE BASE dbo.KnowledgeBase with VECTOR(1536) column DiskANN index for semantic search Row-level security for classification enforcement Freshness monitoring and staleness alerts │ │ │ ▼ ▼ LAYER 3a: KNOWLEDGE BASE LAYER 3b: OPERATIONAL DATA MCP SERVER MCP SERVER Read-only access to Read-only access to knowledge base chunks approved operational tables mcp_knowledge_reader mcp_ops_reader account account Forbidden write keywords blocked, row cap enforced, all queries audit logged │ │ └──────────┬──────────┘ ▼ LAYER 4: AI CLIENT Claude, Amazon Q, Copilot, your internal assistant Connects to both MCP servers Retrieves knowledge base context for runbook questions Queries live data for current operational questions Synthesizes both into plain English answers │ ▼ LAYER 5: YOUR EXISTING APPLICATIONS Unchanged, unaware of AI layer Same connection strings, same accounts, same stored procedures, same performance
LayerOwnerTechnologyKey Decision
Data SourcesBusiness + DBADocuments, SharePoint, exportsWhat knowledge matters most
Knowledge BaseDBASQL Server 2025 VECTOR tablesClassification and chunking strategy
MCP ServersDBA + SecurityFastMCP, PythonWhat the AI is allowed to access
AI ClientApplication teamClaude, Amazon Q, customWhich AI tool serves which users
Existing AppsApplication teamUnchangedNo changes needed

References


Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading