Building a PDF RAG System: A Practical Guide to Retrieval-Augmented Generation
Learn how to build a complete PDF RAG system from scratch using Cloudflare Workers, Vectorize, and AI. This step-by-step guide covers everything from document processing to vector storage and response generation.
In the never-ending quest to make AI slightly less hallucination-prone than a college freshman during finals week, Retrieval-Augmented Generation (RAG) has emerged as our best hope. Think of RAG as strapping a fact-checking librarian to a creative writer's back—suddenly, those imaginative stories come with actual citations.
In this guide, I'll walk you through building a fully functional PDF RAG system that will make your documents actually useful instead of just digital paperweights. By the end, you'll have a system that lets users upload PDFs, ask questions that would normally send them into an existential crisis of endless scrolling, and get answers that are—wait for it—actually correct.
Want to skip the explanations and just grab the code? The complete repository is available on GitHub.
Understanding RAG: A Brief Overview
Before diving into code like a caffeinated programmer on a deadline, let's understand what RAG actually is and why it matters.
Retrieval-Augmented Generation (RAG) is essentially an intervention for LLMs with fact-checking issues. It's like hiring a research assistant for your creative writer—first, it retrieves relevant information from a knowledge base, then lets the LLM generate a response that's actually connected to reality.
This two-step process combines:
- Retrieval systems that find relevant information (so we're not just making stuff up)
- Generative AI that can turn that information into something readable by humans who don't speak robot
Why should you care? Well, here's what RAG brings to the table:
- Factual accuracy: Reduces those embarrassing hallucinations where your AI confidently invents historical events or technical specifications
- Up-to-date knowledge: The knowledge base can be updated without having to retrain the entire model (which costs more than my car)
- Domain adaptation: Systems can focus on specific domains without forcing the model to forget everything else it knows
- Transparency: We can actually see where the information came from, unlike the mysterious black box of "trust me bro" that traditional LLMs operate on
System Architecture Overview
Our PDF RAG system isn't rocket science, but if it were, here's how the rocket would be built:
- Document Processing Pipeline: Extracts text from PDFs without crying
- Vector Store: Turns text into math (seriously, that's what embeddings are)
- RAG Generator: Creates responses that won't make you facepalm
- Web Interface: So users don't have to learn command line wizardry
Here's a high-level architecture diagram that makes it look more complicated than it is:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ PDF Upload  │────▶│    Text     │────▶│   Vector    │────▶│   Vector    │
│  & Parsing  │     │  Chunking   │     │  Embedding  │     │   Storage   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   │
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│  Response   │◀────│   Context   │◀────│ Similarity  │◀────│    User     │
│ Generation  │     │  Building   │     │   Search    │     │    Query    │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
In our implementation, we'll use:
- Cloudflare Workers for serverless computing (because who wants to manage servers in 2025?)
- Cloudflare Vectorize for vector storage and similarity search (AKA "find me the text that kinda sounds like this question")
- Cloudflare AI for embeddings and inference (turning words into numbers and back again)
- unpdf for PDF text extraction (because PDFs are basically digital concrete blocks)
- Hono as our web framework (it's like Express but with fewer gray hairs)
Setting Up the Development Environment
Before we dive into code, let's set up our digital workshop. You'll need:
- A Cloudflare account with Workers and AI access (free tier works fine for experimenting, but don't try to process the entire Library of Congress)
- Node.js and npm installed (if you're reading this article, I'm assuming you've figured this out already)
- Wrangler CLI for Cloudflare Workers development (because deploying to the cloud should be easier than ordering pizza)
Let's start by setting up our project:
# Install Wrangler CLI (the magic wand for Cloudflare Workers)
npm install -g wrangler
# Create a new project (name it something more creative than "my-rag-app")
mkdir pdf-rag-system
cd pdf-rag-system
# Initialize a new Worker project (answer the prompts or just hit Enter like a lazy developer)
npx wrangler init
# Install dependencies (the real building blocks)
npm install hono unpdf
Next, configure your wrangler.jsonc file to enable AI and Vectorize. This is where we tell Cloudflare about our magical AI plans:
{
"$schema": "node_modules/wrangler/config-schema.json",
"name": "pdf-rag-system",
"main": "src/index.ts",
"compatibility_date": "2025-03-20",
"ai": {
"binding": "AI"
},
"vectorize": [
{"binding": "DOCUMENTS_INDEX", "index_name": "pdf-documents-index"}
]
}
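One thing the config alone won't do is create the index itself. Before deploying (or running locally), create it once with Wrangler. The bge-base-en-v1.5 model we'll use produces 768-dimensional embeddings, so the index dimensions need to match; double-check the exact flags against your Wrangler version, since this shows the shape of the command rather than gospel:
# Create the Vectorize index referenced in wrangler.jsonc (one-time setup)
npx wrangler vectorize create pdf-documents-index --dimensions=768 --metric=cosine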
Implementing the Document Processing Pipeline
Now for the real meat and potatoes: the document processing pipeline. This is where PDFs go to be transformed from unstructured chaos into something resembling useful information.
Our document processing consists of three main steps, like a digital assembly line:
- Extracting text from PDF files (aka fighting with the PDF format)
- Chunking the text into manageable segments (because context windows aren't bottomless pits)
- Generating embeddings for each chunk (turning words into vectors, because math > words)
Let's start with the document processor implementation, which I promise is less painful than a root canal:
// src/services/documentProcessor.ts
import { Document, DocumentChunk, EmbeddedChunk } from '../interfaces';
import { extractText } from 'unpdf';
export class DocumentProcessor {
private readonly ai: Ai;
constructor(ai: Ai) {
this.ai = ai;
}
/**
* Extract text from a PDF file using unpdf
* (or as I like to call it, "PDF wrestling")
*/
public async extractTextFromPDF(pdfData: ArrayBuffer): Promise<string> {
try {
// Use unpdf to extract text from the PDF
const result = await extractText(new Uint8Array(pdfData), { mergePages: true });
// Ensure textContent is a string
const textContent = Array.isArray(result.text) ? result.text.join(' ') : result.text;
return textContent || '';
} catch (error) {
console.error('Error extracting text from PDF:', error);
throw new Error('Failed to extract text from PDF. The document may be corrupted or password-protected.');
}
}
/**
* Process a PDF file and create a document
*/
public async processPDF(pdfData: ArrayBuffer, filename: string): Promise<Document> {
// Extract text from PDF
const content = await this.extractTextFromPDF(pdfData);
// Create a document ID from filename
const id = filename.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();
// Create a document from the extracted text
return {
id,
title: filename,
content,
source: 'PDF Upload',
timestamp: Date.now(),
};
}
/**
* Split a document into chunks by paragraphs with appropriate size limits
* (because even LLMs get indigestion from too much text at once)
*/
public chunkDocumentByParagraphs(document: Document, maxChunkSize: number = 1500): DocumentChunk[] {
// Split content into paragraphs (handling various newline patterns)
const paragraphs = document.content
.split(/\n\s*\n|\r\n\s*\r\n/)
.map((p) => p.trim())
.filter((p) => p.length > 0);
const chunks: DocumentChunk[] = [];
let currentChunk = '';
let currentChunkId = 0;
for (const paragraph of paragraphs) {
// If adding this paragraph would exceed the max size, create a new chunk
if (currentChunk.length + paragraph.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({
id: `${document.id}-chunk-${currentChunkId}`,
documentId: document.id,
content: currentChunk,
metadata: {
title: document.title,
source: document.source,
url: document.url,
position: currentChunkId,
},
});
currentChunkId++;
currentChunk = '';
}
// Add paragraph with a separator if needed
if (currentChunk.length > 0) {
currentChunk += '\n\n';
}
currentChunk += paragraph;
}
// Add the last chunk if it's not empty
if (currentChunk.length > 0) {
chunks.push({
id: `${document.id}-chunk-${currentChunkId}`,
documentId: document.id,
content: currentChunk,
metadata: {
title: document.title,
source: document.source,
url: document.url,
position: currentChunkId,
},
});
}
return chunks;
}
/**
* Default chunking method - uses paragraph-based chunking
*/
public chunkDocument(document: Document, chunkSize: number = 1500): DocumentChunk[] {
return this.chunkDocumentByParagraphs(document, chunkSize);
}
/**
* Generate embeddings for document chunks
* (where the magic of turning words into numbers happens)
*/
public async generateEmbeddings(chunks: DocumentChunk[]): Promise<EmbeddedChunk[]> {
const embeddedChunks: EmbeddedChunk[] = [];
// Process chunks in batches to avoid overloading the API
// (or facing the wrath of rate limits)
const batchSize = 10;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const texts = batch.map((chunk) => chunk.content);
// Generate embeddings using Cloudflare AI
const embeddings = await this.ai.run('@cf/baai/bge-base-en-v1.5', {
text: texts,
});
// Combine the chunks with their embeddings
for (let j = 0; j < batch.length; j++) {
embeddedChunks.push({
...batch[j],
embedding: embeddings.data[j],
});
}
}
return embeddedChunks;
}
}
Let's break down what's happening here, because I promise it's not just random code I copied from Stack Overflow:
- Text Extraction: We use the unpdf library to extract text from PDF documents, which is about as fun as trying to open a coconut with a plastic spoon.
- Chunking Strategy: We split the document into manageable chunks based on paragraphs, ensuring each chunk stays under a maximum size limit. This is crucial because:
  - Vector embeddings work best with cohesive text segments (not random word salad)
  - LLMs have context length limitations (they get confused easily, like my uncle at a smartphone store)
  - Smaller chunks allow for more precise retrieval (think surgical strikes instead of carpet bombing)
- Embedding Generation: We generate vector embeddings for each chunk using Cloudflare AI's @cf/baai/bge-base-en-v1.5 model. These embeddings are the secret sauce that lets us find relevant information later.
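A quick housekeeping note: the snippets import types like Document and DocumentChunk from ../interfaces but never show that file. Here's a minimal src/interfaces.ts that matches how the types are used throughout this guide; the field names are inferred from the code, so treat it as a sketch rather than the repo's exact definitions:
// src/interfaces.ts
export interface Document {
  id: string;
  title: string;
  content: string;
  source: string;
  url?: string;
  timestamp: number;
}

export interface DocumentChunk {
  id: string;
  documentId: string;
  content: string;
  metadata: {
    title: string;
    source?: string;
    url?: string;
    position?: number;
  };
}

export interface EmbeddedChunk extends DocumentChunk {
  embedding: number[];
}

export interface RetrievedChunk {
  chunk: DocumentChunk;
  score: number;
}

export interface RAGRequest {
  query: string;
  topK?: number;
}

export interface RAGResponse {
  query: string;
  answer: string;
  sourceChunks: RetrievedChunk[];
}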
Building the Vector Store
Next, let's implement the vector store component that will handle storage and retrieval of our document chunks. Think of this as a smart librarian who remembers not just where the books are, but what's actually in them:
// src/services/vectorStore.ts
import { EmbeddedChunk, RetrievedChunk } from '../interfaces';
export class VectorStore {
private readonly vectorize: VectorizeIndex;
private readonly ai: Ai;
constructor(vectorize: VectorizeIndex, ai: Ai) {
this.vectorize = vectorize;
this.ai = ai;
}
/**
* Store documents and their embeddings in Vectorize
* (like uploading books to a bookshelf that actually knows what's in them)
*/
public async storeChunks(embeddedChunks: EmbeddedChunk[]): Promise<void> {
// Insert chunks in batches because flooding databases is rude
const batchSize = 100;
for (let i = 0; i < embeddedChunks.length; i += batchSize) {
const batch = embeddedChunks.slice(i, i + batchSize);
await this.vectorize.upsert(
batch.map((chunk) => ({
id: chunk.id,
values: chunk.embedding,
metadata: {
documentId: chunk.documentId,
content: chunk.content,
title: chunk.metadata.title,
source: chunk.metadata.source || '',
url: chunk.metadata.url || '',
position: chunk.metadata.position || 0,
},
}))
);
}
}
/**
* Retrieve relevant chunks based on query similarity
* (the "find me stuff that sounds like what I'm asking" function)
*/
public async queryChunks(query: string, topK: number = 3): Promise<RetrievedChunk[]> {
// Generate embedding for the query (turn the question into math)
const embedding = await this.ai.run('@cf/baai/bge-base-en-v1.5', {
text: [query],
});
// Search for similar vectors (find text chunks that match the question's vibe)
const results = await this.vectorize.query(embedding.data[0], {
topK: topK,
returnMetadata: true,
});
// Convert results to RetrievedChunk format (make it usable)
return results.matches.map((match) => ({
chunk: {
id: match.id,
documentId: (match.metadata?.documentId as string) ?? '',
content: (match.metadata?.content as string) ?? '',
metadata: {
title: (match.metadata?.title as string) ?? '',
source: (match.metadata?.source as string) ?? '',
url: (match.metadata?.url as string) ?? '',
position: (match.metadata?.position as number) ?? 0,
},
},
score: match.score,
}));
}
}
The vector store is like that friend who always remembers exactly what you told them two years ago (and will happily remind you). It serves two primary functions:
- Storing document chunks: It persists text chunks and their embeddings in the vector database, like filing away documents in a cabinet that actually knows what's in each file.
- Retrieving relevant chunks: When a query comes in, it converts the question to an embedding and finds the most similar document chunks. It's like asking, "What parts of these documents are most likely to contain information about this question?"
Cloudflare Vectorize handles the heavy lifting of vector similarity search, so we don't have to implement cosine similarity from scratch (which would probably involve sacrificing a graphing calculator to the math gods).
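For the curious, that heavy lifting boils down to comparing vectors by the angle between them. Here's a minimal cosine similarity sketch in TypeScript; it's not how Vectorize is implemented internally, just the math it's doing on our behalf:
// Cosine similarity: 1 means "pointing the same way", 0 means "unrelated", -1 means "opposites"
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}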
Creating the RAG Generator
Next, let's implement the component that generates answers based on retrieved context. This is where we take our lovingly retrieved text chunks and transmute them into something that looks like an answer:
// src/services/ragGenerator.ts
import { RetrievedChunk } from '../interfaces';
export class RAGGenerator {
private readonly ai: Ai;
constructor(ai: Ai) {
this.ai = ai;
}
/**
* Generate a response using the LLM and retrieved chunks
* (where the AI pretends it knew this all along)
*/
public async generateResponse(query: string, retrievedChunks: RetrievedChunk[]): Promise<string> {
// Build context from retrieved chunks (the AI's cheat sheet)
const context = retrievedChunks.map((item) => item.chunk.content).join('\n\n');
// Create the prompt (the gentle art of telling the AI what to do)
const prompt = `
You are a helpful assistant that provides accurate information based on the given context.
Answer the following question based ONLY on the provided context. If the context doesn't
contain relevant information to answer the question, admit that you don't know rather than making up information.
CONTEXT:
${context}
QUESTION:
${query}
ANSWER:`;
// Generate response using Cloudflare AI (where the magic happens)
const response: any = await this.ai.run('@cf/meta/llama-3-8b-instruct', {
prompt: prompt,
max_tokens: 500,
});
return response.response;
}
}
The RAG generator performs several key functions that sound fancier than they actually are:
- Context Building: It combines the retrieved document chunks into a coherent context. It's like assembling puzzle pieces into a picture that hopefully makes sense.
- Prompt Engineering: It constructs a prompt that instructs the LLM to answer based only on the provided context. This is basically telling the AI, "Only use this information—I know you've 'read' the entire internet, but please forget all that for a minute."
- Response Generation: It uses an LLM (in this case, Llama 3) to generate a natural language response that hopefully resembles an answer to the question.
The prompt engineering is actually critical here. We explicitly tell the model to rely only on the provided context and to admit when it doesn't know, rather than hallucinating information. This reduces the risk of the model confidently stating that Abraham Lincoln invented the iPhone while riding a dinosaur.
Putting It All Together: The RAG Service
Now, let's create a service that orchestrates the entire RAG process, like a conductor coordinating an orchestra of AI components that hopefully won't play out of tune:
// src/services/ragService.ts
import { DocumentProcessor } from './documentProcessor';
import { VectorStore } from './vectorStore';
import { RAGGenerator } from './ragGenerator';
import { RAGResponse } from '../interfaces';
export class RAGService {
private readonly documentProcessor: DocumentProcessor;
private readonly vectorStore: VectorStore;
private readonly ragGenerator: RAGGenerator;
constructor(ai: Ai, vectorize: VectorizeIndex) {
this.documentProcessor = new DocumentProcessor(ai);
this.vectorStore = new VectorStore(vectorize, ai);
this.ragGenerator = new RAGGenerator(ai);
}
/**
* Process a PDF file - extract text, chunk, embed, and store
* (the full PDF-to-knowledge pipeline)
*/
public async processPDF(pdfData: ArrayBuffer, filename: string): Promise<{ documentId: string; chunkCount: number }> {
// Process the PDF file (fight with the format)
const document = await this.documentProcessor.processPDF(pdfData, filename);
// Chunk the document (slice and dice)
const chunks = this.documentProcessor.chunkDocument(document);
// Generate embeddings (math time!)
const embeddedChunks = await this.documentProcessor.generateEmbeddings(chunks);
// Store in vector database (filing everything away)
await this.vectorStore.storeChunks(embeddedChunks);
return {
documentId: document.id,
chunkCount: chunks.length,
};
}
/**
* Generate a response for a query
* (the "actually answer the question" part)
*/
public async generateResponse(query: string, topK: number = 3): Promise<RAGResponse> {
// Step 1: Retrieve relevant chunks (find the needle in the haystack)
const retrievedChunks = await this.vectorStore.queryChunks(query, topK);
// Step 2: Generate response (wave the magic wand)
const answer = await this.ragGenerator.generateResponse(query, retrievedChunks);
// Prepare and return response
return {
query,
answer,
sourceChunks: retrievedChunks,
};
}
}
The RAG service is the conductor of our AI orchestra, orchestrating the entire process:
- PDF Processing: It handles the end-to-end process of ingesting a PDF document, from raw bytes to searchable knowledge.
- Query Processing: It coordinates retrieval and generation to answer user queries, making sure the right information gets to the right place at the right time.
This service-oriented architecture provides a clean separation of concerns and makes the system maintainable—assuming you leave good comments for the poor developer who inherits your code (which might be future you, with no memory of how any of this works).
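To see how little the rest of the application needs to know about those internals, here's what using the service looks like. This is a hypothetical helper (the demo function, the filename, and the question are made up for illustration), not part of the repo:
import { RAGService } from './services/ragService';

// Hypothetical helper showing the only two calls the rest of the app ever makes
async function demo(env: { AI: Ai; DOCUMENTS_INDEX: VectorizeIndex }, pdfBytes: ArrayBuffer) {
  const rag = new RAGService(env.AI, env.DOCUMENTS_INDEX);

  // Ingest: raw bytes in, document ID and chunk count out
  const { documentId, chunkCount } = await rag.processPDF(pdfBytes, 'employee-handbook.pdf');
  console.log(`Stored ${chunkCount} chunks for ${documentId}`);

  // Query: question in, grounded answer plus the chunks it came from out
  const { answer, sourceChunks } = await rag.generateResponse('How many vacation days do new hires get?');
  console.log(answer, `(based on ${sourceChunks.length} chunks)`);
}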
Creating the API Endpoints
Now, let's create the API endpoints that will expose our RAG functionality to the outside world (or at least to anyone with internet access and your API URL):
// src/routes/api.ts
import { Hono } from 'hono';
import { RAGService } from '../services/ragService';
import { RAGRequest } from '../interfaces';
// Define environment bindings
interface Env {
AI: Ai;
DOCUMENTS_INDEX: VectorizeIndex;
}
// Create API router
export const apiRouter = new Hono<{ Bindings: Env }>();
// PDF ingestion endpoint (where PDFs go to get processed)
apiRouter.post('/ingest-pdf', async (c) => {
try {
// Parse the form data (what did the user send us?)
const formData = await c.req.formData();
const file = formData.get('file') as File | null;
if (!file) {
return c.json({ success: false, error: 'No file uploaded' }, 400);
}
// Check if it's a PDF (because we're not dealing with your vacation photos)
if (!file.name.toLowerCase().endsWith('.pdf')) {
return c.json({ success: false, error: 'Only PDF files are supported' }, 400);
}
// Read the file data (bytes, glorious bytes)
const fileArrayBuffer = await file.arrayBuffer();
// Initialize RAG service (our digital paper processor)
const ragService = new RAGService(c.env.AI, c.env.DOCUMENTS_INDEX);
// Process the PDF (where the magic happens)
const result = await ragService.processPDF(fileArrayBuffer, file.name);
// Return success response (celebrate!)
return c.json({
success: true,
message: `Processed PDF: ${file.name}`,
documentId: result.documentId,
chunkCount: result.chunkCount,
});
} catch (error: any) {
console.error('Error processing PDF:', error);
return c.json(
{
success: false,
error: error.message || 'An error occurred while processing the PDF',
},
500
);
}
});
// Query endpoint (where questions get answered)
apiRouter.post('/query', async (c) => {
try {
// Parse the request body (what's the question?)
const { query, topK = 3 } = await c.req.json<RAGRequest>();
if (!query || typeof query !== 'string') {
return c.json({ success: false, error: 'Query is required' }, 400);
}
// Initialize RAG service (our digital oracle)
const ragService = new RAGService(c.env.AI, c.env.DOCUMENTS_INDEX);
// Process the query (find and generate the answer)
const response = await ragService.generateResponse(query, topK);
// Return the response (the moment of truth)
return c.json(response);
} catch (error: any) {
console.error('Error processing query:', error);
return c.json(
{
success: false,
error: error.message || 'An error occurred while processing the query',
},
500
);
}
});
Our API has two main endpoints, which is probably two more than some "REST APIs" I've seen in production:
- /api/ingest-pdf: For uploading and processing PDF documents. This is where we take a PDF file, extract its text, chunk it, generate embeddings, and store everything in our vector database.
- /api/query: For asking questions about the uploaded documents. This is where the magic of RAG happens—we find relevant document chunks and generate an answer based on their content.
These endpoints interact with our RAG service to handle the respective operations, making our complex internal processing accessible through simple HTTP requests. Just don't try to upload your entire digital library at once unless you enjoy watching progress bars.
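One file these snippets don't show is the Worker entry point that wires the router up. Our wrangler.jsonc points at src/index.ts, so something along these lines ties everything together; treat it as a minimal sketch, since how you bundle the HTML page from the next section (inline string, text module import, or static assets) is up to you:
// src/index.ts
import { Hono } from 'hono';
import { apiRouter } from './routes/api';

interface Env {
  AI: Ai;
  DOCUMENTS_INDEX: VectorizeIndex;
}

// The UI from the next section, inlined as a string for simplicity
const uiHtml = `<!DOCTYPE html><html lang="en"><!-- full UI markup from the next section --></html>`;

const app = new Hono<{ Bindings: Env }>();

app.route('/api', apiRouter); // mount the RAG endpoints under /api
app.get('/', (c) => c.html(uiHtml)); // serve the single-page UI

export default app;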
Building the User Interface
To provide a user-friendly interface for our RAG system, we'll create a simple HTML page with JavaScript for client-side interactions. This way users don't need a computer science degree just to ask a question about their documents.
Here's a simplified version of our UI (I've omitted the full CSS to preserve your sanity):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>PDF RAG System</title>
<style>
/* CSS styles omitted for brevity and to protect your eyes */
</style>
</head>
<body>
<header>
<h1>PDF RAG System</h1>
</header>
<div class="container">
<div class="tab-container">
<div class="tabs">
<div class="tab active" data-tab="upload">Upload PDF</div>
<div class="tab" data-tab="query">Ask Questions</div>
<div class="tab" data-tab="about">About</div>
</div>
<div class="tab-content active" id="upload-tab">
<div class="panel">
<h2>Upload PDF Document</h2>
<div class="file-drop" id="fileDrop">
<p>Drag & drop your PDF file here or click to browse</p>
<input type="file" id="fileInput" accept=".pdf" style="display: none" />
</div>
<div class="flex">
<span id="fileName" class="grow"></span>
<button id="uploadBtn" disabled>Upload</button>
</div>
<div class="progress" id="uploadProgress">
<div class="progress-bar" id="progressBar"></div>
</div>
<div id="uploadResult" class="hidden">
<h3>Upload Result:</h3>
<pre id="resultText"></pre>
</div>
</div>
</div>
<div class="tab-content" id="query-tab">
<div class="panel">
<h2>Ask Questions About Your Documents</h2>
<div class="flex">
<input id="query" type="text" placeholder="Enter your question here..." class="grow" />
<button id="queryBtn">Ask</button>
</div>
<div id="answerSection" class="hidden">
<h3>Answer:</h3>
<div id="answer" class="answer-box"></div>
<h3>Sources:</h3>
<div id="sources"></div>
</div>
</div>
</div>
<div class="tab-content" id="about-tab">
<div class="panel">
<h2>About This RAG System</h2>
<p>This PDF RAG (Retrieval-Augmented Generation) system allows you to:</p>
<ul style="margin-left: 20px; margin-bottom: 15px">
<li>Upload PDF documents (yes, even those horrible scanned ones)</li>
<li>Extract and process text content (no more Ctrl+F nightmares)</li>
<li>Ask questions about your documents (like having a personal assistant who actually read the material)</li>
<li>Get AI-generated answers based on document content (not hallucinations from the void)</li>
</ul>
<!-- More about content -->
</div>
</div>
</div>
</div>
<script>
// JavaScript for tab handling, file uploads, and queries
// (If I included all of it, this article would be 20 pages long)
</script>
</body>
</html>
Our UI provides several key features that won't win design awards but will get the job done:
- A tab-based interface for different features, because nothing says "modern web app" like tabs (except maybe hamburger menus)
- A drag-and-drop file uploader for PDFs, which is much nicer than those "Click here to open a file browser that looks like it's from 1997" buttons
- A simple query interface for asking questions, complete with a text input field and a button (revolutionary design, I know)
- Display of answers along with source chunks for transparency, so users can see where the information came from and verify it wasn't just made up
The JavaScript to make all this work involves event listeners, fetch requests to our API endpoints, and DOM manipulation to update the UI with results. I've omitted it here because if I included every line of code, this article would be longer than the novel I've been "almost finished with" for the past five years.
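To give you the flavor without the extra 20 pages, here's roughly what the query half of that script looks like. It's a simplified sketch: the element IDs match the markup above, but loading states, error handling, and the upload flow are left out:
// Inside the <script> tag: send the question to /api/query and render the answer
document.getElementById('queryBtn').addEventListener('click', async () => {
  const query = document.getElementById('query').value.trim();
  if (!query) return;

  const res = await fetch('/api/query', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, topK: 3 }),
  });
  const data = await res.json();

  // Show the answer and list the source chunks it was built from
  document.getElementById('answer').textContent = data.answer;
  document.getElementById('sources').textContent = data.sourceChunks
    .map((s) => `${s.chunk.metadata.title} (score: ${s.score.toFixed(2)})`)
    .join('\n');
  document.getElementById('answerSection').classList.remove('hidden');
});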
Deployment and Testing
Let's deploy our application using Wrangler, Cloudflare's command-line tool that makes deploying to their network almost as easy as ordering a pizza (but with fewer cheese options):
# Deploy to Cloudflare Workers
npx wrangler deploy
After deployment, you can test the system by:
- Uploading a PDF document (preferably one that contains actual text and not just pictures of your cat)
- Asking questions related to the document content (try to be specific; "What is the meaning of life?" isn't going to work unless you uploaded a very philosophical PDF)
- Reviewing the answers and their sources (and marveling at how the AI didn't just make everything up)
If all goes well, you'll see answers that actually make sense and relate to your documents. If something goes wrong, well... that's what Stack Overflow is for.
Performance Optimization and Considerations
While our basic implementation works well enough to impress your non-technical friends, there are several areas you might want to optimize for a production system (you know, if you want it to actually work reliably):
1. Chunking Strategies: The Art of Text Butchery
The way you chunk documents significantly impacts retrieval quality. It's like cutting a pizza—do it wrong, and someone's getting just crust. Consider:
- Overlap between chunks: Adding overlap can help capture context that might be split between chunks, like making sure each slice has both pepperoni and cheese.
- Semantic chunking: Instead of fixed-size chunks, consider splitting at natural semantic boundaries. It's like cutting pizza along the topping divisions rather than in rigid triangles.
- Hierarchical chunking: Create chunks at different levels of granularity for more flexible retrieval. Think of it as having both big slices for the really hungry and smaller ones for the "just a taste" people.
Here's an example of implementing overlapping chunks:
public chunkDocumentBySentences(document: Document, maxChunkSize: number = 1000): DocumentChunk[] {
// Split content into sentences using a regex that respects punctuation
const sentenceRegex = /(?<=[.!?])\s+(?=[A-Z])/g;
const sentences = document.content.split(sentenceRegex);
const chunks: DocumentChunk[] = [];
let currentChunk = '';
let currentChunkId = 0;
let sentencesInChunk: string[] = [];
for (const sentence of sentences) {
// Skip empty sentences (the literary equivalent of "um")
if (!sentence.trim()) continue;
// If adding this sentence would exceed the max size, create a new chunk
if (currentChunk.length + sentence.length > maxChunkSize && currentChunk.length > 0) {
chunks.push({
id: `${document.id}-chunk-${currentChunkId}`,
documentId: document.id,
content: currentChunk,
metadata: {
title: document.title,
source: document.source,
url: document.url,
position: currentChunkId,
},
});
currentChunkId++;
// Create overlap by keeping the last 2 sentences (like a text DJ's crossfade)
const overlapSize = Math.min(2, sentencesInChunk.length);
const overlapSentences = sentencesInChunk.slice(-overlapSize);
currentChunk = overlapSentences.join(' ');
sentencesInChunk = [...overlapSentences];
}
// Add a space if needed (because running sentences together is rude),
// then append the sentence. Doing this outside the if/else means the
// sentence that triggered a split still lands in the new chunk instead of vanishing.
if (currentChunk.length > 0 && !currentChunk.endsWith(' ')) {
currentChunk += ' ';
}
currentChunk += sentence;
sentencesInChunk.push(sentence);
}
// Add the last chunk if it's not empty (we don't waste food here)
if (currentChunk.length > 0) {
chunks.push({
id: `${document.id}-chunk-${currentChunkId}`,
documentId: document.id,
content: currentChunk,
metadata: {
title: document.title,
source: document.source,
url: document.url,
position: currentChunkId,
},
});
}
return chunks;
}
2. Prompt Engineering: The Gentle Art of AI Manipulation
The prompt we provide to the LLM significantly impacts the quality of responses. It's like giving directions to a very intelligent but extremely literal alien who's never been to Earth before. Consider:
- More detailed instructions: Specify exactly how you want answers formatted, like telling someone not only to make a sandwich but also which bread to use and how thick to slice the tomatoes.
- Few-shot learning: Include examples of good responses, because "show, don't tell" works for AIs too.
- Handling uncertainty: Explicitly instruct the model how to handle ambiguous or missing information, so it doesn't resort to making things up like your uncle at Thanksgiving dinner.
Here's an improved prompt template that would make even the pickiest LLM behave:
const prompt = `
You are a helpful assistant that provides accurate information based on the given context.
Answer the following question based ONLY on the provided context.
CONTEXT:
${context}
QUESTION:
${query}
Instructions:
1. If the context contains the information, provide a comprehensive answer based solely on that information.
2. If the context doesn't contain enough information to fully answer the question, explain what specific information is missing.
3. If the question is completely unrelated to the context, state "I don't have information about that in the provided documents."
4. Do not use any knowledge outside of the provided context. I repeat, DO NOT use any knowledge outside of the provided context.
5. Cite specific parts of the context that support your answer.
6. Don't just make stuff up because you think it sounds right. That's how we get conspiracy theories.
ANSWER:`;
3. Handling Multiple Documents: The Digital Librarian
In a real-world scenario, users might upload multiple documents. Unless your RAG system can handle this, it's like having a library where you can only check out one book at a time. Consider:
- Document-level metadata: Store information about which document each chunk comes from, so you know whether to blame "Quarterly Financial Report" or "Company Holiday Party Guidelines" for that bizarre answer.
- Document filtering: Allow users to query specific documents, for when they definitely don't want answers from the company joke book (see the sketch after this list).
- Cross-document reasoning: Enable the model to synthesize information across documents, like connecting the dots between "Budget Shortfalls" and "Executive Bonus Structure."
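Since we already store documentId in each chunk's metadata, a conservative way to add document filtering, without leaning on any Vectorize features beyond what we've already used, is to over-fetch and filter the matches ourselves. A minimal sketch (the method name and the over-fetch factor are my own choices; Vectorize also offers native metadata filtering if you'd rather push this down into the index):
// An extra VectorStore method: retrieve chunks from one specific document only
public async queryChunksInDocument(query: string, documentId: string, topK: number = 3): Promise<RetrievedChunk[]> {
  // Over-fetch, because some of the nearest neighbors will belong to other documents
  const candidates = await this.queryChunks(query, topK * 5);

  // Keep only matches from the requested document, then trim back down to topK
  return candidates
    .filter((item) => item.chunk.documentId === documentId)
    .slice(0, topK);
}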
4. Error Handling and Edge Cases: Preparing for Users to Break Everything
Robust error handling is essential for a production system, because if there's a way to break your system, users will find it with the precision of a heat-seeking missile:
- PDF extraction failures: Handle corrupted or password-protected PDFs, because some people still think "Password123" is keeping their documents secure.
- Empty documents: Handle cases where text extraction yields little or no content, like when someone uploads a PDF that's just a single image (why do people do this?).
- Rate limiting: Implement throttling for API requests to avoid overloading services, or having one enthusiastic user bring down your entire system.
- Timeout handling: Gracefully handle timeouts in LLM calls or vector searches, because nothing says "professional application" like a spinning wheel of death (see the helper sketched below).
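As one concrete example for that last point, a tiny wrapper like the one below puts an upper bound on any of the await calls in our services. It's a sketch: the helper name, the 15-second default, and the usage line are arbitrary choices rather than something the repo ships with:
// Reject if a promise takes longer than `ms` milliseconds
function withTimeout<T>(promise: Promise<T>, ms: number = 15000, label: string = 'operation'): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out waiting for ${label} after ${ms}ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

// Usage: wrap the LLM call so a stuck request fails fast instead of hanging
// const answer = await withTimeout(this.ai.run('@cf/meta/llama-3-8b-instruct', { prompt }), 20000, 'LLM response');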
Conclusion
In this guide, we've built a complete PDF RAG system from scratch, leveraging Cloudflare's serverless infrastructure and AI capabilities. It's like we went from "I can't find anything in these PDFs" to "I have a robot that reads them for me" in just a few hundred lines of code.
Our system allows users to:
- Upload PDF documents (those digital artifacts that somehow still dominate business despite being invented in 1993)
- Process and store document content (converting the PDF chaos into searchable knowledge)
- Ask questions about their documents (without having to actually read them—the dream!)
- Receive accurate, contextually relevant answers (not the creative fiction that pure LLMs sometimes produce)
This implementation demonstrates the power of combining retrieval-based approaches with generative AI. By grounding the LLM's responses in the retrieved documents, we create a system that is more accurate, transparent, and trustworthy than a standard LLM alone. It's like giving the model a cheat sheet instead of asking it to remember everything it's ever seen.
As you continue to explore and build RAG systems, remember that the quality of your retrieval component is just as important as the generative model. Experiment with different chunking strategies, embedding models, and prompt engineering techniques to optimize the performance of your system. It's like fine-tuning an instrument—small adjustments can make a big difference in the final output.
While we've focused on PDFs in this implementation, the concepts and architecture can be extended to other document types or data sources. The core principles of chunking, embedding, retrieval, and generation apply across different RAG applications, whether you're processing legal contracts, technical documentation, or your extensive collection of cupcake recipes.
If you're looking to implement this in a production environment, consider adding authentication (because not everyone should have access to your top-secret margarita formula), user-specific document collections (so the marketing team doesn't get answers from the engineering team's technical specs), and more sophisticated error handling (because things will go wrong, I promise).
I hope this guide helps you build more intelligent, context-aware applications using RAG. The combination of powerful language models with targeted information retrieval opens up exciting possibilities for document analysis, question answering, and knowledge management systems. It's like having a team of eager interns who actually read all the documents and remember everything perfectly—except they never ask for college credit or complain about the coffee.
Happy building! And remember: if your RAG system starts providing answers that aren't in your documents, it's not being creative—it's hallucinating. Unlike in humans, that's a bug, not a feature.