
How Sensitive Data Actually Enters AI Systems (And Why It's Hard to Stop)

Updated: Apr 14

There's a question most security and data leaders aren't asking clearly enough: exactly how does sensitive data enter an AI system?


Not in theory. In practice, through which pipelines, at which points, in which forms. The answer matters because it determines where governance must occur and why the controls most organizations have in place today aren't sufficient.


This blog maps the real paths: RAG pipelines, copilots, direct uploads, and shadow AI. Each one behaves differently, exposes data differently, and requires a different conversation about protection. Understanding them is the starting point for any serious AI data governance program.


The Myth: Training Datasets Are the Primary Risk


Early AI data concerns centered on model training: did the vendor use proprietary data, did the model memorize sensitive records? Those questions matter if you're a model provider. For enterprises deploying AI, they're largely the wrong starting point.


The copilots, assistants, and RAG-powered search tools your teams use weren't trained on your data. They connect to your data dynamically at runtime, retrieving, processing, and synthesizing it in response to queries. That runtime connection, and what flows through it unprotected, is where most enterprise exposure actually lives.


The distinction matters because it changes where governance has to act. You can't audit what the model learned. You can control what gets protected before the model ever accesses it.

The Real Entry Points: How Sensitive Data Gets Into AI Systems


1. RAG Pipelines

Retrieval-Augmented Generation (RAG) is the architecture behind most enterprise AI assistants. Rather than relying solely on pre-trained knowledge, RAG systems dynamically retrieve relevant documents at query time and inject them into the model's context window before generating a response.
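
To make the mechanics concrete, here's a minimal sketch of that retrieve-then-inject step. The search_index and call_llm functions are hypothetical stand-ins for a real vector search and model API, not any particular product's interface:

    # Minimal RAG sketch. `search_index` and `call_llm` are hypothetical
    # stand-ins for a real vector search and model API.

    def answer(query: str, search_index, call_llm, top_k: int = 5) -> str:
        # 1. Retrieve the documents most relevant to the query.
        docs = search_index(query, top_k=top_k)

        # 2. Inject the retrieved text into the model's context window.
        context = "\n\n".join(d["text"] for d in docs)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

        # 3. Generate a response grounded in whatever was retrieved --
        #    including any sensitive content the index could reach.
        return call_llm(prompt)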


Which sounds manageable, until you think through what "relevant documents" means in a poorly scoped deployment.


Your SharePoint libraries, internal wikis, and document repositories aren't static archives anymore. They're live AI inputs. Every document a RAG pipeline can reach is one the model can surface, summarize, and quote back to whoever's asking.


A gap that catches many teams off guard: access controls defined in backend systems don't always carry through to the retrieval layer. A user may be restricted from viewing a document directly while the RAG pipeline still retrieves and serves content derived from it. The document is restricted for the human. The model accesses it anyway.
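
One way to narrow this gap is to re-check the requesting user's entitlements at retrieval time rather than trusting the index. A minimal sketch, where user_can_read is a hypothetical lookup against the source system's ACLs:

    # Sketch of a retrieval-time permission filter. Without a check like
    # this, the pipeline serves whatever the index can reach.

    def filter_by_acl(docs: list[dict], user_id: str, user_can_read) -> list[dict]:
        # Drop any retrieved chunk the requesting user couldn't open directly.
        return [d for d in docs if user_can_read(user_id, d["source_doc_id"])]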


When sensitive data isn't protected at the content level (in the document itself, before it enters the knowledge base), it travels through the pipeline in a form the model can read and surface. That's not inevitable. It's an architecture choice, and it's one most deployments haven't made deliberately.


Common gaps in RAG deployments that widen this exposure:

  • Overly broad retrieval scope: configurations built for better answers without accounting for what "everything" actually includes

  • Access policies that don't consistently apply at retrieval time, leaving gaps between what users can see and what the pipeline can reach

  • Ranking logic that surfaces sensitive content because it matches common query terms

  • Cached context shared across sessions or users in high-traffic assistants


Financial records, legal documents, HR files, M&A materials: anything indexed in the knowledge base is part of the retrieval surface. That surface grows every time a new repository gets connected without a corresponding governance review.


2. Copilots and Plugins

Enterprise copilots operate with broad access to the systems they're embedded in. That's intentional. It's what makes them useful for synthesizing information across email, documents, calendars, and conversations.


It's also what makes the data exposure surface difficult to bound.


A copilot connected to email, Teams, SharePoint, and OneDrive can retrieve and synthesize across all of them at query time, including content a user was never intended to see, in systems where permissions are inconsistently applied or have drifted over time.


In June 2025, researchers documented EchoLeak (CVE-2025-32711), a zero-click vulnerability in Microsoft 365 Copilot. An attacker embedded hidden instructions in an email. Copilot ingested the prompt, retrieved sensitive data from OneDrive, SharePoint, and Teams, and exfiltrated it through trusted Microsoft domains with no user interaction required, no alert triggered, all activity moving through approved channels.


This is a prompt injection attack, a class of threat the OWASP Top 10 for LLM Applications ranks as its primary risk. The copilot was manipulated into retrieving and surfacing data it shouldn't have. Traditional perimeter controls saw nothing unusual because the activity looked legitimate.

What data-level protection addresses in this scenario is the retrieval surface itself. When sensitive spans within documents are protected at the content layer, encrypted selectively with decryption tied to authorized identity, a copilot that retrieves those documents under manipulation gets protected content. The injection succeeds in triggering retrieval. It doesn't succeed in surfacing readable sensitive data, because the protection travels with the content regardless of how it was retrieved.
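
As a rough illustration of that idea (not any vendor's actual implementation), here's a minimal Python sketch of span-level protection with decryption gated on identity. It uses Fernet from the cryptography library for brevity; the authorization check stands in for a real policy engine:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()
    fernet = Fernet(key)

    def protect_span(text: str, span: str) -> str:
        # Encrypt one sensitive span in place; the rest of the text stays readable.
        token = fernet.encrypt(span.encode()).decode()
        return text.replace(span, f"[ENC:{token}]")

    def reveal_span(token: str, caller: str, authorized: set[str]) -> str:
        # Decryption is tied to identity: a copilot retrieving this document
        # under a manipulated prompt gets the placeholder, not the plaintext.
        if caller not in authorized:
            return "[PROTECTED]"
        return fernet.decrypt(token.encode()).decode()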


This is different from preventing injection entirely. It's about limiting what a successful injection can actually expose.


Email threads, shared documents, meeting transcripts, internal communications: the full surface of whatever the copilot can retrieve at runtime is the exposure surface. Data-level protection reduces what's accessible within that surface, even when retrieval controls fail or are bypassed.


3. Direct File Uploads

This is the most direct ingestion path, and often the least visible to security teams.

Contracts go to a legal AI assistant. Financial models get uploaded to a data analysis tool. Customer records land in a summarization workflow. Source code goes into a coding assistant. This happens constantly, dozens of times a day across a typical organization, and nothing about the action looks unusual from any system's perspective.


The employee has legitimate access. The AI tool accepts uploads. Authorized all the way down.

The risk isn't authorization. It's protection. If encryption and access controls live only at the repository or system level, they don't travel with the file when it moves. A document uploaded to an external AI tool arrives without the governance context applied in the source system.


When protection is applied at the content layer, selectively encrypting sensitive fields, pages, or the document as a whole at or before the point the file enters an AI pipeline, that protection persists regardless of where the file goes. An AI tool receiving a protected document can process the unprotected context around sensitive spans, preserving utility, while the sensitive content itself remains inaccessible without authorized decryption.
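
A minimal sketch of what content-level protection before upload might look like, using a simple tokenization approach. The SSN pattern is purely illustrative; real deployments rely on much broader classifiers:

    import re
    import uuid

    SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def tokenize(text: str, vault: dict[str, str]) -> str:
        # Replace each sensitive match with an opaque token; the plaintext
        # stays behind in an access-controlled vault and never leaves.
        def swap(match: re.Match) -> str:
            token = f"tok_{uuid.uuid4().hex[:8]}"
            vault[token] = match.group(0)
            return token
        return SSN.sub(swap, text)

    vault: dict[str, str] = {}
    safe = tokenize("Employee 123-45-6789 approved the Q3 budget.", vault)
    # safe == "Employee tok_... approved the Q3 budget."
    # The AI tool still gets the surrounding context; the SSN does not travel.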


Employees authorized to view sensitive data aren't always authorized to transfer it. Data-level protection makes that distinction enforceable rather than aspirational.


4. Shadow AI

Shadow AI is the fastest-growing ingestion path, and the one with the least visibility.

In a single year, the number of employees using generative AI applications tripled, the volume of data sent to these tools increased sixfold, and data policy violations doubled. The average organization now sees 223 AI-related security incidents per month, and that figure captures only what's being detected.


76% of organizations consider shadow AI a definite or probable challenge. Workers paste proprietary code into ChatGPT. Marketing teams use unapproved generators. Developers link personal AI accounts to corporate repositories. None of it appears in monitoring. None of it is governed.


Banning AI tools doesn't resolve it. Research shows nearly half of employees continue using personal AI accounts after a ban is issued. The behavior continues. It becomes harder to see.

Shadow AI incidents cost organizations an average of $670,000 more than standard breaches. But the dollar figure undersells the structural problem: it's the continuous, invisible transfer of sensitive data into systems that exist entirely outside your governance framework, with no audit trail and no visibility into what left or where it went.


The only control that follows data into unauthorized systems is protection embedded in the data itself. If a document carries content-level protection applied at or before the point it left the source environment, an unauthorized AI tool receives content where the sensitive fields are already inaccessible, regardless of what the tool attempts to do with it.


5. AI Agents

The four paths above involve a human making a request. AI agents are different. Once deployed, they act autonomously, chaining actions across systems, querying data sources, and executing multi-step tasks without a human reviewing each step.


The governance gap this creates is specific. With a RAG pipeline, the retrieval scope is bounded by the query. With an agent, the scope is bounded by whatever access was provisioned during configuration. That scope is typically set once, based on what the agent might need; in practice, it's often broader than necessary and rarely revisited.
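
One concrete way to make agent scope explicit and reviewable is to model the agent as a non-human identity with a deny-by-default allowlist. A minimal sketch, with illustrative names:

    from dataclasses import dataclass, field

    @dataclass
    class AgentIdentity:
        name: str
        allowed_sources: set[str] = field(default_factory=set)

        def can_access(self, source: str) -> bool:
            # Deny by default: anything not explicitly provisioned is out of scope.
            return source in self.allowed_sources

    invoice_agent = AgentIdentity("invoice-agent", {"erp/invoices", "erp/vendors"})
    assert invoice_agent.can_access("erp/invoices")
    assert not invoice_agent.can_access("hr/salaries")  # out of scope, even if reachable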


Why Access Control Isn't the Same as AI Data Governance

The most common pushback when AI data exposure comes up: "We've locked down permissions. Only authorized users can reach sensitive systems."


That answer covers access. It doesn't cover what happens to data once it's retrieved, processed, embedded, or uploaded into an AI pipeline, and those are different governance problems that need different solutions at different points in the data lifecycle.


Access control governs who can reach data in authorized systems. AI data governance governs what protection travels with data as it moves through AI workflows: retrieval, chunking, embedding, inference, agent actions. The NIST AI Risk Management Framework treats these as distinct layers, and for good reason.


Consider what access control alone can't address: a user authorized to view a document can upload it to an external AI tool and the source system sees a legitimate access event. A RAG pipeline with appropriate read permissions may still retrieve and surface content in AI outputs that was never intended for that context. A copilot operating under a manipulated prompt retrieves data through channels that look authorized. In each case, the access event was legitimate. The governance gap is elsewhere.


The OWASP Top 10 for LLM Applications identifies Sensitive Information Disclosure (LLM02) and Excessive Agency (LLM06) as primary risks precisely because AI systems can surface and act on data in ways that access control alone doesn't constrain.


Why Protection Has to Travel With the Data

In many common architectures, once sensitive data has been processed into embeddings or incorporated into an AI system's context, the ability to meaningfully revoke or contain it becomes limited. Removing a source document from SharePoint doesn't always remove the embeddings derived from it from a vector database. Revoking a user's access doesn't necessarily prevent the model from surfacing outputs that reflect content it has already processed.


These aren't universal truths. Some systems handle this better than others. But they describe a real gap in the default architecture of most enterprise AI deployments today.


The more durable answer isn't better revocation mechanisms. It's protection that extends to the derived artifacts of the original data, including embeddings, so revocation becomes less critical: the data was never exposed in a form that could be misused.


Here's what that looks like in practice. When AI Guard processes a document at ingestion into a RAG pipeline, it applies selective encryption to sensitive spans (PII, PHI, IP, and financial data) while leaving the surrounding context intact so the model can still generate useful responses. When that protected document is vectorized, the embeddings inherit the protection. The vector database contains obfuscated representations rather than readable sensitive content. A retrieval that returns those embeddings surfaces protected content; the model can work with the non-sensitive context, but the sensitive spans remain inaccessible to unauthorized users or agents.
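
To illustrate the principle (not AI Guard's actual implementation), here's a toy Python example showing that vectors derived from protected text carry only the redacted content. The hash-based vectorizer is a stand-in for a real embedding model:

    import hashlib

    def toy_embed(text: str, dims: int = 64) -> list[float]:
        # Hash-based bag-of-words vector; a stand-in for a real embedding model.
        vec = [0.0] * dims
        for word in text.lower().split():
            h = int(hashlib.sha256(word.encode()).hexdigest(), 16)
            vec[h % dims] += 1.0
        return vec

    protected = "Patient [PROTECTED] was treated for [PROTECTED] at the Austin clinic."
    vector = toy_embed(protected)
    # The vector reflects only the redacted text that went in. Breaching the
    # vector store recovers placeholders, never the original PHI.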


This is what it means for protection to follow data through the AI lifecycle, not just sit at the source.


Governing Data At and Before AI Ingestion

Effective AI data governance requires controls applied at the right points across the data lifecycle: before and during AI ingestion, through retrieval, inference, and agent actions.


  1. Discover before you connect. Most organizations attach RAG pipelines and copilots to repositories without knowing what sensitive data those repositories contain. Automated discovery and classification, run before AI systems connect, maps what's sensitive, where it lives, and what policies apply. PII, PHI, and IP flowing into pipelines untagged create compliance blind spots from the start.

  2. Classify before you index. Not all data should be eligible for RAG indexing by default. Classification-informed indexing policies determine what enters the knowledge base and under what conditions, a governance decision most teams haven't made explicitly, which means the default is permissive.

  3. Apply protection at and before ingestion, at the content level. Selective encryption and tokenization applied at the document, page, or entity level, at or before the point data enters an AI pipeline, means protection travels with the data rather than staying behind in the source system. Sensitive fields remain protected through chunking, vectorization, and retrieval. The model works with what it's authorized to see.

  4. Extend protection to embeddings. Protection applied during ingestion should carry through to the vector representations derived from source data. Embeddings that inherit fine-grained data protection can't be reconstructed into readable sensitive content even if the vector database is breached or queried by an unauthorized agent.

  5. Enforce least-privilege access for users and AI agents. File-level access controls are too coarse for AI workflows. Fine-grained policies at the span, page, or entity level, enforced at runtime and aware of both user identity and AI agent identity, mean sensitive content is revealed only to authorized principals, even within the same document. AI agents should be treated as non-human identities with explicitly defined access scope, not as proxies for the users they serve.

  6. Build audit trails for AI-specific data flows. Standard access logs track who opened a file. They don't always capture what content a RAG pipeline retrieved to answer a specific query, what spans a copilot surfaced in a response, or what data an autonomous agent accessed across a multi-step workflow. Tamper-proof, span-level audit logging built for AI data flows provides the forensic depth that regulators require and incident response actually needs. A minimal sketch of such a record follows this list.

  7. Govern before deployment. Connecting AI systems to data repositories without first establishing classification, protection policies, and access controls means inheriting every gap in that data's governance posture, immediately and completely.
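
As referenced in step 6, here's a minimal sketch of what a span-level audit record might contain. Field names are illustrative; a production system would also sign or chain records to make them tamper-evident:

    import json
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class RetrievalAuditRecord:
        principal: str             # human user or agent identity
        query: str                 # the question that triggered retrieval
        source_doc_id: str         # which document the pipeline reached
        spans_surfaced: list[str]  # which protected spans were revealed, if any
        timestamp: str

    record = RetrievalAuditRecord(
        principal="agent:invoice-agent",
        query="summarize Q3 vendor payments",
        source_doc_id="sharepoint/finance/q3-payments.xlsx",
        spans_surfaced=[],  # nothing sensitive was decrypted for this principal
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    print(json.dumps(asdict(record)))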


What to Do Next

If your organization has deployed AI tools, copilots, RAG pipelines, agents, or connected assistants of any kind, data is already moving through those systems. The question is whether its protection moves with it.


Start by mapping the actual data flows. Not what your AI tools are authorized to access on paper, but what they retrieve and process in practice. Then ask whether the protection applied to sensitive data in your source systems travels with it into AI pipelines, or stays behind the moment the data moves.


If it stays behind, that's where the governance gap is.






FAQ

How does sensitive data get into AI systems? Through several paths that have nothing to do with model training: RAG pipelines that dynamically retrieve from document repositories at query time; copilots with broad runtime access to productivity systems; direct file uploads by employees to AI tools; and shadow AI tools operating outside IT visibility entirely. Runtime data access, not training, is where most enterprise exposure lives.


What is shadow AI and why does it matter for data governance? Shadow AI is any AI tool employees use without authorization from IT or security teams. These tools receive no monitoring, operate outside governance frameworks, and process whatever employees upload or paste. The average cost premium over a standard breach is $670,000. Banning tools doesn't stop the behavior; it moves it out of sight. The only control that follows data into unauthorized systems is protection embedded in the data itself before it left the governed environment.


Why don't access controls fully address AI data exposure? Because access control and AI data governance address different points in the data lifecycle. Access controls govern who can reach data in authorized systems. They don't govern what protection travels with data as it moves through AI pipelines, what gets embedded into vector stores, or what a copilot retrieves and surfaces at runtime. The OWASP Top 10 for LLMs specifically identifies these gaps under Sensitive Information Disclosure (LLM02) and Excessive Agency (LLM06).


What is a RAG pipeline and what are its data security risks? RAG (Retrieval-Augmented Generation) systems dynamically retrieve relevant documents at query time and inject them into a model's context window. The governance risk is that retrieval scope is often broader than intended, access policies don't always apply consistently at the retrieval layer, and sensitive content can surface in model outputs in ways that weren't anticipated when the system was designed. Without data-level protection applied at ingestion, the full retrieval surface is reachable by the model.


Why is it hard to contain sensitive data after it enters an AI system? In many architectures, once data has been processed into embeddings or incorporated into a model's context, the ability to meaningfully contain it is limited. Removing a source document doesn't always remove derived embeddings. This is why protection needs to travel with data through the AI lifecycle, applied at or before ingestion and extending to embeddings and agent interactions, rather than depending on source-system controls that don't follow the data as it moves.


What does data-level protection for AI actually mean? Applying selective encryption, tokenization, and access policies to sensitive content at or before the point it enters AI pipelines, and ensuring that protection persists through chunking, vectorization, retrieval, and inference. It includes fine-grained access controls enforced at runtime for both human users and AI agents, protection that extends to embeddings derived from source data, and span-level audit trails that capture AI-specific data flows rather than general file access events.

