Building MeroStudySathy: A Multi-Agent RAG System That Actually Teaches You

By Trilochan Sharma · March 2026 · 15 min read

Most AI study tools are wrappers around a search box.

You upload a PDF. You ask a question. It returns a chunk of text that was already in the document. Congratulations — you just paid API credits to ctrl+F.

That's not learning. That's retrieval. And retrieval without structure, without feedback, without progression — it's just a fancier way to skim.

I wanted to build something different. Something closer to sitting with a tutor who actually read the material and knows how to break it down. So I built MeroStudySathy — a multi-agent RAG system that turns static PDFs into structured, interactive learning experiences.

This post is the full technical breakdown. Architecture, data pipeline, agent design, RAG implementation, caching strategy, and every non-obvious decision along the way.

The Problem With Existing Approaches

Before getting into the system, it helps to understand exactly what's broken about the current generation of "AI study tools."

Problem 1: Pure Retrieval Is Not Teaching

Retrieval-Augmented Generation (RAG) at its most basic is a search problem. You embed a query, find the nearest chunks in your vector store, and pass them to an LLM as context.

This works for Q&A. It does not work for learning.

Learning requires:

Structure — what order should I encounter these concepts?
Scaffolding — build on what I already know
Active engagement — not reading, doing
Feedback — did I actually understand that?
Repetition — spaced, targeted, on weak areas

A RAG pipeline alone gives you none of this.

Problem 2: One Agent Can't Do Everything Well

Early prototypes had a single LLM call doing everything — analyzing the document, building a study plan, generating questions, evaluating answers. The output was unfocused and inconsistent.

The solution is specialization. Different cognitive tasks need different system prompts, different context shapes, and different output formats. A planner agent should think like a curriculum designer. A teacher agent should think like an educator. A practice agent should think like an examiner.

Problem 3: API Cost Is A Real Constraint

Repeated LLM calls on the same content are expensive. If you're studying a document over multiple sessions (which is the whole point), you'll burn through API credits fast.

This needs a caching layer that's intelligent enough to save full generated responses, not just chunks.

System Architecture Overview

MeroStudySathy is built around four specialized agents sitting on top of a RAG pipeline, with a response cache layer that eliminates repeat API costs.

┌──────────────────────────────────────────────────────────────┐
│                      YOUR PDF DOCUMENT                        │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │    PDF EXTRACTION    │
              │   (page-by-page)     │
              │     pdf-parse        │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │    TEXT CHUNKING     │
              │  1000 tok / 150 ovlp │
              │   semantic windows   │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  EMBEDDING PIPELINE  │
              │   batch size: 100    │
              │  provider-agnostic   │
              └──────────┬───────────┘
                         │
                         ▼
        ┌────────────────────────────────┐
        │       SQLITE VECTOR STORE      │
        │   chunks + embeddings + cache  │◄─── Response Cache Layer
        │        (local database)        │     (0 API cost on repeat)
        └───────────────┬────────────────┘
                        │
       ┌────────────────┼────────────────┐
       │                │                │
       ▼                ▼                ▼
  ┌─────────┐     ┌──────────┐     ┌──────────┐
  │ PLANNER │     │ TEACHER  │     │ PRACTICE │
  │  AGENT  │     │  AGENT   │     │  AGENT   │
  └────┬────┘     └────┬─────┘     └────┬─────┘
       │               │                │
       ▼               ▼                ▼
  Study Plan    Teaching Session   Quiz Questions
  (structured)  (7-part format)   (MCQ/Short/Why)
       │               │                │
       └───────────────┼────────────────┘
                       │
                       ▼
             ┌─────────────────┐
             │    EVALUATOR    │
             │      AGENT      │
             └────────┬────────┘
                      │
                      ▼
           Score + Feedback + Weak Topic ID
           Progress Tracking (SQLite)

The Data Pipeline: Step by Step

Phase 1: PDF Extraction

When a user uploads a PDF, the first step is text extraction using pdf-parse. The key design decision here is per-page extraction rather than treating the document as one blob.

PDF Upload
    │
    ├─→ Extract text per page (pdf-parse)
    │       page 1: "text content..."
    │       page 2: "text content..."
    │       page N: "text content..."
    │
    ├─→ Store raw text: /data/uploads/{document_id}.txt
    │
    └─→ Create document record in SQLite:
            id, filename, page_count, created_at

Why per-page? Because it preserves page numbers for citations. Every chunk retains a reference to its source page, which feeds the [Source X, Page Y] citations in the Teacher Agent's output. If you can't verify what the AI tells you against the original document, you can't trust it.

Phase 2: Chunking

Raw page text goes into a chunking pipeline before embedding.

Raw Text (per page)
    │
    ├─→ Split into chunks:
    │       target size: 1000 tokens
    │       overlap: 150 tokens
    │       strategy: sentence-boundary aware
    │
    ├─→ Each chunk tagged with:
    │       chunk_id, document_id, page_number,
    │       chunk_index, token_count, text
    │
    └─→ Store in SQLite chunks table

The 150-token overlap is important. Without it, concepts that span a natural chunk boundary get split: the first half of an explanation ends up in one chunk, the second half in the next, and retrieval then pulls incomplete context. With overlap, the text around each boundary appears in both neighboring chunks, so an explanation that straddles a boundary is more likely to be retrieved whole.

The 1000-token target balances two competing needs:

Too small: chunks lack enough context for the LLM to generate good explanations
Too large: retrieval becomes imprecise — you pull in too much irrelevant content

1000 tokens tends to be about 2-4 paragraphs, which maps naturally to a single concept or idea.
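A minimal sketch of sentence-boundary-aware chunking with overlap, to make the mechanics concrete. The names (`chunkPage`, `countTokens`) are hypothetical, and token counts are approximated by whitespace-separated words; a real tokenizer would give different numbers.

```typescript
// Sketch only: tokens are approximated by whitespace-separated words (assumption).
interface Chunk {
  pageNumber: number;
  chunkIndex: number;
  tokenCount: number;
  text: string;
}

const countTokens = (s: string): number =>
  s.split(/\s+/).filter(Boolean).length;

function chunkPage(
  pageText: string,
  pageNumber: number,
  targetTokens = 1000,
  overlapTokens = 150,
): Chunk[] {
  // Split after sentence-ending punctuation followed by whitespace.
  const sentences = pageText
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter(Boolean);

  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  const flush = (): void => {
    const text = current.join(" ");
    chunks.push({
      pageNumber,
      chunkIndex: chunks.length,
      tokenCount: countTokens(text),
      text,
    });
  };

  for (const sentence of sentences) {
    const n = countTokens(sentence);
    if (current.length > 0 && currentTokens + n > targetTokens) {
      flush();
      // Carry whole trailing sentences forward, up to overlapTokens.
      const carried: string[] = [];
      let carriedTokens = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        const t = countTokens(current[i]);
        if (carriedTokens + t > overlapTokens) break;
        carried.unshift(current[i]);
        carriedTokens += t;
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    current.push(sentence);
    currentTokens += n;
  }
  if (current.length > 0) flush();
  return chunks;
}
```

Because only whole sentences are carried over, the actual overlap can be smaller than 150 tokens when the last sentence of a chunk is long.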

Phase 3: Embedding Pipeline

Once chunks are stored, the embedding pipeline converts text to vectors.

Chunks (from SQLite)
    │
    ├─→ Batch into groups of 100
    │       (API rate limit management)
    │
    ├─→ Call embedding API:
    │       OpenAI: text-embedding-ada-002
    │       Google: text-embedding-004
    │
    ├─→ Receive float[] vectors
    │       typically 1536 dimensions (OpenAI)
    │       or 768 dimensions (Google)
    │
    └─→ Store in SQLite:
            chunk_id → vector (JSON serialized float[])

Why batch size 100? Most embedding APIs have rate limits measured in requests per minute and tokens per minute. Batching 100 chunks per call cuts the number of requests by 100x compared to embedding one chunk at a time, which keeps indexing well within the requests-per-minute limit. The total token count is unchanged, so the tokens-per-minute limit still applies to very large documents.
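The batching step can be sketched as follows. `EmbedFn` is a hypothetical stand-in for whichever provider call is configured; `toBatches` and `embedAll` are illustrative names, not the project's actual functions.

```typescript
// Hypothetical signature for a provider's batch embedding call.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive groups of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed all texts sequentially, one API call per batch.
async function embedAll(
  texts: string[],
  embedFn: EmbedFn,
  batchSize = 100,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const result = await embedFn(batch);
    if (result.length !== batch.length) {
      throw new Error("Provider returned a different number of vectors than texts");
    }
    vectors.push(...result);
  }
  return vectors;
}
```

With 250 chunks this makes three calls (100, 100, 50) instead of 250.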

Why SQLite for vectors? Dedicated vector databases (Pinecone, Chroma, Weaviate) offer approximate nearest-neighbor (ANN) search that stays fast at millions of vectors. But for a personal study tool with documents in the hundreds of pages, not millions, a brute-force cosine similarity scan over vectors stored in SQLite is fast enough, requires zero external dependencies, and keeps everything local. The operational simplicity is worth more than the marginal performance gain.

Phase 4: Retrieval (Query Time)

When an agent needs context from the document:

Query (string)
    │
    ├─→ Embed query using same model as chunks
    │
    ├─→ Load all chunk vectors from SQLite
    │
    ├─→ Compute cosine similarity:
    │       similarity = dot(q, c) / (|q| × |c|)
    │       for each chunk vector c
    │
    ├─→ Rank by similarity score descending
    │
    ├─→ Return top-K chunks (default K=5)
    │       with text, page_number, chunk_index
    │
    └─→ Format as context string:
            [Source 1, Page 3]: "chunk text..."
            [Source 2, Page 7]: "chunk text..."

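Cosine similarity vs dot product: Cosine similarity divides out vector magnitude, so ranking depends only on the direction of a vector in embedding space, not on its norm. For embeddings that are already unit-normalized (OpenAI's are), cosine similarity and dot product produce the same ranking. Normalizing explicitly keeps retrieval consistent across providers that may not guarantee this.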

K=5: Five chunks gives the LLM enough context to generate a substantive explanation without overwhelming the context window. For most sections of a technical document, five chunks covers the relevant material while keeping prompt size manageable.
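The query-time steps above (score, rank, take top-K, format with citations) can be sketched as below. The `StoredChunk` shape and function names are assumptions for illustration; vectors are assumed to be already parsed from their JSON-serialized form.

```typescript
// Hypothetical in-memory shape of a chunk row after JSON-parsing its vector.
interface StoredChunk {
  text: string;
  pageNumber: number;
  chunkIndex: number;
  vector: number[];
}

interface ScoredChunk {
  chunk: StoredChunk;
  score: number;
}

// similarity = dot(q, c) / (|q| * |c|)
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force scan: score every chunk, sort descending, keep the best k.
function topK(query: number[], chunks: StoredChunk[], k = 5): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Number sources by rank so the LLM can cite them as [Source X, Page Y].
function formatContext(results: ScoredChunk[]): string {
  return results
    .map((r, i) => `[Source ${i + 1}, Page ${r.chunk.pageNumber}]: "${r.chunk.text}"`)
    .join("\n");
}
```

The dimension check matters in a multi-provider setup: a 768-dimensional query vector cannot be compared against 1536-dimensional chunk vectors.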

The Response Cache Layer

This is the most impactful optimization in the system.

The insight: generated teaching sessions are deterministic enough to cache. If you're studying "Binary Search Trees" today and come back in a week, the teaching session for that section should be essentially the same. No reason to hit the API again.

User selects section: "Binary Search Trees"
    │
    ├─→ Check cache:
    │       key = hash(document_id + section_title + agent_type)
    │       lookup in SQLite response_cache table
    │
    ├─→ CACHE HIT
    │       return stored response immediately
    │       cost: 0 API calls, ~5ms
    │
    └─→ CACHE MISS
            ├─→ Build query from section title
            ├─→ Retrieve top-5 chunks (cosine similarity)
            ├─→ Format context with page citations
            ├─→ Stream LLM response
            ├─→ Store complete response in cache
            └─→ Return to user

The cache table schema:

CREATE TABLE response_cache (
  id          TEXT PRIMARY KEY,
  cache_key   TEXT UNIQUE NOT NULL,
  document_id TEXT NOT NULL,
  agent_type  TEXT NOT NULL,
  section     TEXT NOT NULL,
  response    TEXT NOT NULL,
  created_at  TEXT NOT NULL
);

Result: 60-80% reduction in API costs for typical study sessions. For exam prep — reviewing the same content repeatedly — savings are even higher.

This also changes user behavior. When revisiting a section costs API credits, users hesitate to review. When it's free, they review freely. That's pedagogically better. The cache is a product decision as much as a technical one.
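A sketch of the cache lookup, under stated assumptions: SHA-256 stands in for the unspecified `hash`, and a small `CacheStore` interface (hypothetical) stands in for the `response_cache` table so the example runs without a database. The key parts are JSON-encoded rather than concatenated directly, so that `"ab" + "c"` and `"a" + "bc"` cannot produce the same key.

```typescript
import { createHash } from "node:crypto";

// Hash the three key parts. JSON-encoding keeps the part boundaries unambiguous.
function cacheKey(documentId: string, sectionTitle: string, agentType: string): string {
  return createHash("sha256")
    .update(JSON.stringify([documentId, sectionTitle, agentType]))
    .digest("hex");
}

// Hypothetical stand-in for reads and writes against response_cache.
interface CacheStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

interface CachedResult {
  response: string;
  cacheHit: boolean;
}

// Return the stored response on a hit; otherwise generate, store, and return.
async function getOrGenerate(
  store: CacheStore,
  key: string,
  generate: () => Promise<string>,
): Promise<CachedResult> {
  const cached = store.get(key);
  if (cached !== undefined) {
    return { response: cached, cacheHit: true };
  }
  const response = await generate();
  // Only complete responses are stored; a failed generation throws before this line.
  store.set(key, response);
  return { response, cacheHit: false };
}
```

A plain `Map<string, string>` satisfies `CacheStore`, which is convenient for tests; in the real system the same two operations would be a `SELECT` and an `INSERT`.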

The Four Agents

Agent 1: Planner Agent

Job: Analyze the document and produce a pedagogically-ordered learning plan.

Input: Full document text (or representative chunks)

Output:

{
  "plan": [
    {
      "section_id": "s1",
      "title": "Introduction to Neural Networks",
      "summary": "Neurons, weights, activation functions",
      "estimated_minutes": 15,
      "prerequisites": [],
      "order": 1
    },
    {
      "section_id": "s2",
      "title": "Backpropagation",
      "summary": "Gradient computation, chain rule, weight updates",
      "estimated_minutes": 25,
      "prerequisites": ["s1"],
      "order": 2
    }
  ]
}

The critical thing the Planner does that a table of contents doesn't: it identifies prerequisite relationships and reorders sections accordingly. Backpropagation comes after forward propagation, regardless of how the PDF is structured.
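The post doesn't say whether the prerequisite ordering is checked in code or left entirely to the LLM. As one possible safeguard, here is a sketch that re-derives a valid order from the `prerequisites` field with a depth-first topological sort. The function name and the trimmed-down `PlanSection` type are hypothetical.

```typescript
// Subset of the plan entry fields shown above.
interface PlanSection {
  section_id: string;
  title: string;
  prerequisites: string[];
  order: number;
}

// Return sections so that every prerequisite appears before its dependents.
function orderByPrerequisites(sections: PlanSection[]): PlanSection[] {
  const byId = new Map<string, PlanSection>();
  for (const s of sections) byId.set(s.section_id, s);

  const state = new Map<string, "visiting" | "done">();
  const ordered: PlanSection[] = [];

  const visit = (s: PlanSection): void => {
    const st = state.get(s.section_id);
    if (st === "done") return;
    if (st === "visiting") {
      throw new Error(`Prerequisite cycle involving ${s.section_id}`);
    }
    state.set(s.section_id, "visiting");
    for (const id of s.prerequisites) {
      const dep = byId.get(id);
      if (!dep) throw new Error(`Unknown prerequisite ${id}`);
      visit(dep);
    }
    state.set(s.section_id, "done");
    ordered.push(s);
  };

  // Visit in the planner's suggested order so unrelated sections keep it.
  const suggested = sections.slice().sort((a, b) => a.order - b.order);
  for (const s of suggested) visit(s);

  // Renumber so `order` matches the final position.
  return ordered.map((s, i) => ({ ...s, order: i + 1 }));
}
```

A cycle or a reference to a missing section throws, which is a useful signal that the planner's JSON should be regenerated rather than shown to the user.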

Agent 2: Teacher Agent

Job: Deliver a structured teaching session for a given section.

Input: Section title + top-5 retrieved chunks with page citations

Output: A 7-part teaching session, streamed in real-time

Part	Purpose
1. Definition	Precise, unambiguous definition
2. Why It Matters	Real-world motivation
3. Core Theory	Mechanistic explanation
4. Examples	Concrete walkthroughs
5. Common Mistakes	What learners get wrong
6. Recap	Compressed key points
7. Next Steps	What comes next and why

Every claim includes a citation: [Source 3, Page 12]. Users can verify anything against the original document.
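Citations are only useful if they point at sources that were actually provided. A hypothetical post-check, not described in the original design, could parse the `[Source X, Page Y]` markers out of a response and flag any that don't match the retrieved chunks:

```typescript
interface Citation {
  source: number;
  page: number;
}

// Pull every [Source X, Page Y] marker out of a generated response.
function extractCitations(text: string): Citation[] {
  const pattern = /\[Source (\d+), Page (\d+)\]/g;
  const citations: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    citations.push({ source: Number(match[1]), page: Number(match[2]) });
  }
  return citations;
}

// sourcePages[i] is the page number of Source i+1 in the prompt context.
// Returns citations whose source index or page does not match.
function findInvalidCitations(text: string, sourcePages: number[]): Citation[] {
  return extractCitations(text).filter(
    (c) => sourcePages[c.source - 1] !== c.page,
  );
}
```

An empty result means every citation refers to a chunk that was in the context; it does not prove the cited chunk supports the claim.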

After the session, users can ask follow-up questions. These go through the same retrieval pipeline with a shorter, conversational prompt rather than the 7-part structure.

Agent 3: Practice Agent

Job: Generate questions that test understanding at multiple depths.

Three question types, three assessment depths:

MCQ — recall and recognition. Weakest measure but useful as a warm-up
Short answer — articulate the concept in your own words. Much stronger signal than MCQ
Conceptual "why" questions — reason about the system, not just recall facts. Hardest to fake, most valuable for actual learning

{
  "questions": [
    {
      "type": "mcq",
      "question": "Time complexity of search in a balanced BST?",
      "options": ["O(1)", "O(log n)", "O(n)", "O(n log n)"],
      "correct": 1
    },
    {
      "type": "short_answer",
      "question": "Why does inserting a sorted sequence into a BST result in O(n) search?",
      "key_concepts": ["degenerate tree", "linear chain", "worst case"]
    },
    {
      "type": "conceptual",
      "question": "Why do self-balancing trees exist and what problem do they solve?",
      "depth": "high"
    }
  ]
}
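LLM-generated JSON is worth validating before it reaches the UI. The sketch below models the three question shapes above as a discriminated union and checks each one; the type and function names are assumptions, and the real project may use a schema library instead.

```typescript
type Question =
  | { type: "mcq"; question: string; options: string[]; correct: number }
  | { type: "short_answer"; question: string; key_concepts: string[] }
  | { type: "conceptual"; question: string; depth: string };

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Parse the Practice Agent's raw output and reject malformed questions.
function parseQuestions(raw: string): Question[] {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const items = (data as { questions?: unknown }).questions;
  if (!Array.isArray(items)) {
    throw new Error("Expected a `questions` array");
  }

  return items.map((item: unknown, i: number): Question => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`Question ${i} is not an object`);
    }
    const q = item as Record<string, unknown>;
    const question = q.question;
    if (typeof question !== "string") {
      throw new Error(`Question ${i} has no text`);
    }
    if (q.type === "mcq") {
      const options = q.options;
      const correct = q.correct;
      if (!isStringArray(options) || typeof correct !== "number") {
        throw new Error(`MCQ ${i} is malformed`);
      }
      // `correct` is a zero-based index into `options`.
      if (!Number.isInteger(correct) || correct < 0 || correct >= options.length) {
        throw new Error(`MCQ ${i} has an out-of-range answer index`);
      }
      return { type: "mcq", question, options, correct };
    }
    if (q.type === "short_answer") {
      const keyConcepts = q.key_concepts;
      if (!isStringArray(keyConcepts)) {
        throw new Error(`Short answer ${i} is malformed`);
      }
      return { type: "short_answer", question, key_concepts: keyConcepts };
    }
    if (q.type === "conceptual") {
      const depth = q.depth;
      if (typeof depth !== "string") {
        throw new Error(`Conceptual question ${i} is malformed`);
      }
      return { type: "conceptual", question, depth };
    }
    throw new Error(`Question ${i} has unknown type`);
  });
}
```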

Agent 4: Evaluator Agent

Job: Score answers and generate actionable feedback.

{
  "score": 75,
  "correct_elements": [
    "Correctly identified that sorted insertion creates a linear chain",
    "Mentioned O(n) worst case"
  ],
  "missing_elements": [
    "Did not explain the mechanism (each node > all previous)",
    "Did not connect to linked list equivalence"
  ],
  "feedback": "Good understanding of the outcome, but the mechanism needs more depth...",
  "weak_topics": ["BST degenerate case", "worst case analysis"]
}

The weak_topics array feeds the progress tracking system. Over sessions, the system builds a profile of which concepts the user consistently struggles with and surfaces them for review.
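One simple way to build that profile is to count how often each topic shows up across evaluations. This is a sketch with hypothetical names; the actual tracking logic in `progress.ts` is not shown in this post and may weigh scores or recency as well.

```typescript
// Subset of the Evaluator output used for tracking.
interface Evaluation {
  score: number;
  weak_topics: string[];
}

interface WeakTopic {
  topic: string;
  count: number;
}

// Count topic occurrences across sessions and return the most frequent first.
function rankWeakTopics(evaluations: Evaluation[], limit = 5): WeakTopic[] {
  const counts = new Map<string, number>();
  for (const evaluation of evaluations) {
    for (const raw of evaluation.weak_topics) {
      // Normalize so "BST degenerate case" and "bst degenerate case" merge.
      const topic = raw.trim().toLowerCase();
      counts.set(topic, (counts.get(topic) ?? 0) + 1);
    }
  }
  const ranked: WeakTopic[] = [];
  counts.forEach((count, topic) => ranked.push({ topic, count }));
  // Ties are broken alphabetically to keep the output stable.
  return ranked
    .sort((a, b) => b.count - a.count || a.topic.localeCompare(b.topic))
    .slice(0, limit);
}
```

Exact-string matching is fragile because the LLM may phrase the same topic differently between sessions; constraining the Evaluator to pick from the plan's section titles would be one way around that.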

Multi-Provider LLM Architecture

User Settings (provider + encrypted API key)
    │
    ▼
┌───────────────┐
│  LLM Router   │
└───────┬───────┘
        │
┌───────┼───────┐
▼       ▼       ▼
OpenAI  Gemini  Claude
Client  Client  Client
        │
        ▼
Unified Response Interface
(streaming + non-streaming)

Each provider implements the same interface:

interface LLMClient {
  complete(prompt: string, options: CompletionOptions): Promise<string>;
  stream(prompt: string, options: CompletionOptions): AsyncGenerator<string>;
  embed(texts: string[]): Promise<number[][]>;
}

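Agents don't know which provider they're talking to. Switching from GPT-4 to Claude is a settings change, not a code change. The one constraint is embeddings: a document has to be queried with the same embedding model that indexed it, since vectors from different models are not comparable.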
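A sketch of how a router over that interface might look. `CompletionOptions` is not defined in this post, so its fields here are guesses, and `EchoClient` is a mock that stands in for a real provider SDK.

```typescript
// Assumed shape; the post does not define CompletionOptions.
interface CompletionOptions {
  temperature?: number;
  maxTokens?: number;
}

interface LLMClient {
  complete(prompt: string, options: CompletionOptions): Promise<string>;
  stream(prompt: string, options: CompletionOptions): AsyncGenerator<string>;
  embed(texts: string[]): Promise<number[][]>;
}

type Provider = "openai" | "gemini" | "claude";

// Mock client for illustration; a real one would wrap a provider SDK.
class EchoClient implements LLMClient {
  private readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  async complete(prompt: string): Promise<string> {
    return `[${this.name}] ${prompt}`;
  }

  async *stream(prompt: string): AsyncGenerator<string> {
    for (const word of prompt.split(" ")) yield word;
  }

  async embed(texts: string[]): Promise<number[][]> {
    // Placeholder vectors, not real embeddings.
    return texts.map((t) => [t.length]);
  }
}

// Maps the provider chosen in settings to a client instance.
class LLMRouter {
  private readonly clients = new Map<Provider, LLMClient>();

  register(provider: Provider, client: LLMClient): void {
    this.clients.set(provider, client);
  }

  resolve(provider: Provider): LLMClient {
    const client = this.clients.get(provider);
    if (!client) throw new Error(`No client registered for ${provider}`);
    return client;
  }
}
```

An agent would call `router.resolve(settings.provider).stream(prompt, options)` and never import a provider SDK directly.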

API keys are encrypted with AES-256-CBC using a machine-specific derived key. They are decrypted in memory only when needed and are never written to logs.
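A minimal sketch of AES-256-CBC encryption with Node's built-in `crypto` module. How the machine-specific secret is obtained is not described here, so `machineSecret` is a placeholder parameter; the use of scrypt for key derivation and the `salt:iv:ciphertext` hex layout are assumptions for illustration.

```typescript
import {
  createCipheriv,
  createDecipheriv,
  randomBytes,
  scryptSync,
} from "node:crypto";

// Derive a 32-byte key (AES-256) from the machine secret. scrypt is an assumption.
function deriveKey(machineSecret: string, salt: Buffer): Buffer {
  return scryptSync(machineSecret, salt, 32);
}

// Returns "salt:iv:ciphertext", each part hex-encoded.
function encryptApiKey(apiKey: string, machineSecret: string): string {
  const salt = randomBytes(16);
  const iv = randomBytes(16); // CBC needs a fresh 16-byte IV per encryption
  const cipher = createCipheriv("aes-256-cbc", deriveKey(machineSecret, salt), iv);
  const ciphertext = Buffer.concat([cipher.update(apiKey, "utf8"), cipher.final()]);
  return [salt, iv, ciphertext].map((b) => b.toString("hex")).join(":");
}

function decryptApiKey(payload: string, machineSecret: string): string {
  const parts = payload.split(":");
  if (parts.length !== 3) throw new Error("Malformed payload");
  const [salt, iv, ciphertext] = parts.map((p) => Buffer.from(p, "hex"));
  const decipher = createDecipheriv("aes-256-cbc", deriveKey(machineSecret, salt), iv);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Because the salt and IV are random, encrypting the same key twice yields different stored values, which is the intended behavior.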

Tech Stack Decisions

Layer	Choice	Reason
Framework	Next.js 14 App Router	Server Components + native streaming
Language	TypeScript	Type safety across agent I/O
Styling	Tailwind + shadcn/ui	Fast, consistent, accessible
Database	SQLite (better-sqlite3)	Zero config, local-first, sync API
Vector Store	SQLite cosine similarity	No external dependency
PDF	pdf-parse	Per-page text + page numbers
LLM	OpenAI / Gemini / Claude	Multi-provider flexibility
Encryption	AES-256-CBC	API key security at rest

Why not Postgres? Zero benefit at personal tool scale. SQLite's synchronous API is actually an advantage — no async/await for simple reads, no connection pooling.

Why not a dedicated vector DB? Pinecone and Chroma require cloud infra or a running local server. For a local-first tool, that's a non-starter. SQLite cosine similarity across ~500 chunks takes milliseconds.

Why Next.js App Router? Streaming from Server Components to the client is first-class. The Teacher Agent's response appears as it generates without WebSocket complexity.
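For the API route side, streaming comes down to returning a web-standard `Response` whose body is a `ReadableStream`. The sketch below adapts an async generator (the shape `LLMClient.stream` returns) into such a stream. `fakeTeacherStream` and `handleTeachRequest` are hypothetical; in a route file the handler would be exported under the HTTP method name.

```typescript
// Adapt an async generator of text pieces into a byte stream.
function toReadableStream(pieces: AsyncGenerator<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    // Called whenever the consumer is ready for more data.
    async pull(controller) {
      const { value, done } = await pieces.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // If the client disconnects, stop the generator so the LLM call can end.
    async cancel() {
      await pieces.return(undefined);
    },
  });
}

// Stand-in for the Teacher Agent's LLM stream.
async function* fakeTeacherStream(): AsyncGenerator<string> {
  yield "1. Definition\n";
  yield "A binary search tree is an ordered binary tree.\n";
}

// Hypothetical handler body for the teach route.
function handleTeachRequest(): Response {
  return new Response(toReadableStream(fakeTeacherStream()), {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

On the client, `response.body.getReader()` yields the pieces as they arrive, so the UI can append text without WebSockets.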

Project Structure

merostudysathy/
├── app/
│   ├── page.tsx                  # Upload + document list
│   ├── settings/                 # LLM provider config
│   ├── doc/[id]/                 # Learning interface
│   └── api/
│       ├── documents/            # Upload, list, delete
│       ├── plan/                 # Planner agent
│       ├── teach/                # Teacher agent (streaming)
│       ├── practice/             # Practice agent
│       ├── evaluate/             # Evaluator agent
│       └── progress/             # Progress tracking
│
├── lib/
│   ├── agents/
│   │   ├── planner.ts
│   │   ├── teacher.ts
│   │   ├── practice.ts
│   │   └── evaluator.ts
│   ├── llm/
│   │   ├── router.ts
│   │   ├── openai.ts
│   │   ├── gemini.ts
│   │   └── anthropic.ts
│   ├── rag/
│   │   ├── chunker.ts            # 1000 tok / 150 overlap
│   │   ├── embedder.ts           # Batch embedding pipeline
│   │   ├── retriever.ts          # Cosine similarity search
│   │   └── citations.ts          # [Source X, Page Y] formatting
│   └── storage/
│       ├── db.ts                 # SQLite schema + connection
│       ├── chunks.ts
│       ├── cache.ts              # Response cache layer
│       └── progress.ts           # Weak topic tracking
│
└── data/                         # Gitignored
    ├── tutor.db
    └── uploads/

What I Learned

Agents need tight scope. The single-agent prototype produced unfocused output. Four specialized agents with distinct system prompts produce dramatically better results. Separation of concerns applies to AI systems too.

Streaming is a product feature. Waiting for a full response before displaying anything feels broken. Users start reading while the response generates. This changes perceived speed significantly.

Caching changes user behavior. Free revisits encourage reviewing; costly revisits discourage it. The cache is not just an optimization: it enables a better pattern of use.

Local-first is a genuine differentiator. Private documents shouldn't require trusting a cloud backend. Zero telemetry, zero data collection — delete /data and everything's gone.

SQLite is underrated. I almost reached for Postgres and Pinecone out of habit. For a personal tool, the simplest database that works is the right database.

What's Next

RAPTOR Tree Indexing — build a hierarchy of summaries above raw chunks, so retrieval operates at the right abstraction level for each query type.

Spaced Repetition Scheduling — the Evaluator already tracks weak topics. SM-2 scheduling would surface them at optimal review intervals.

Ollama Support — fully offline operation, no API keys required.

Try It

git clone https://github.com/parnish007/merostudysathy.git
cd merostudysathy
npm install
npm run dev

Node.js 18+, one API key (OpenAI / Google / Anthropic). Everything local. No data leaves your machine.

Trilochan Sharma — CS student at Kathmandu University.
