Full architecture for AI-native operator's PC-wide RAG system: 6 collections, 24-week phased rollout, hybrid search (BM25 + dense + RRF), multimodal ColQwen2 routing, JWT-scoped governance for company-work-pc isolation.
Headline Statistics
- 8 parallel research agents (Wave 1) + 2 convergence agents (Wave 2) → 10 documents, ~14,000 words
- 6 collections: neo_ssot / neo_code / neo_paper / neo_notes / neo_quant / neo_secret
- Tech stack: Qdrant 1.16+ primary, LanceDB multimodal/edge, pgvector for Supabase integration, KURE-v1 Korean embedding, Voyage-Code-3 + Voyage-3-large + Cohere embed-v4 128K, BGE Reranker v2-m3, ColQwen2 MLX
- Cost: scenario B (recommended) $15-25/month — Phase 0~5 only $5-10/month, Phase 6 onward $15-25/month
- 24-week phased rollout (Phase 0 through Phase 6) with explicit acceptance gate per phase — calendar slips before scope ships
- 3-tier device topology: personal-root (sol01 GPU) + orchestrator (ysh-server CPU) + team-mac (M2 Max MPS) with read-only company-work-pc tier and mobile-operator approval-only tier
- 50-task Korean golden eval set (`tests/rag_golden/ssot_korean_v2.json`) with categories rag_v2_design 18 + quant_v11 8 + ssot_governance 12 + security_pii 6 + operations 6
- Hybrid search target: Recall@10 from 65~78% baseline to 91%+ after Phase 1 reranking; NDCG@10 ≥ 0.65 is the Phase 2 gate
- Provenance-aware retrieval: human=1.0 decay, llm_output=0.5 decay — newer human-authored notes outrank stale LLM summaries at retrieval time
Why This Design Exists
A solo operator running 11 SBUs plus three research papers plus live quant trading needs a unified retrieval surface across desktop-sol01 (control + GPU), ysh-server (orchestrator), mac-studio (Apple build + multimodal), desktop-yesol (company work PC), and two mobile devices. Existing single-machine RAG (ChromaDB on one box) silently degrades when fleet topology changes; this design treats the fleet as the unit of architecture. The constraint is not just storage — it is **trust tier**: a document indexed on the personal-root device must not be retrievable from the company-work-pc tier even if the underlying vector is bit-identical, and a chunk written by a Claude or Codex turn must not silently outrank a chunk written by the operator's own hand. RAG at fleet scale is an authorization and provenance problem first, a vector-search problem second.
7 Core Decisions Adopted
1) **ChromaDB → Qdrant migration** is collection-by-collection cutover with a `backend` parameter on rag_search to avoid Sora silent degradation — Sora's existing 127K-document neo_knowledge index continues to serve traffic during cutover, and the gateway routes per-query based on collection state. 2) **Contextual Retrieval (Anthropic 2024)** is gated to Phase 6 (>100K chunks) with Haiku 4.5 + prompt cache, because the up-front token cost only amortizes above ~100K chunks. 3) **SSOT graph** uses LightRAG in Phase 2 for the SSOT corpus; HippoRAG 2 is piloted in Phase 6 for the paper corpus only, because graph-augmented retrieval pays off most on long-context citation chains. 4) **desktop-yesol (company work PC) is read-only** with JWT scope restriction — secret/personal-notes endpoints return 404 to that tier rather than 403, to avoid information leakage through error messages. 5) **Provenance metadata** (source_type, decay_factor, provenance_chain depth) is required on every chunk; LLM output decays at 0.5, human-authored at 1.0, with the decay applied multiplicatively to the similarity score at retrieval time. 6) **ColQwen2 multimodal routing** primary on mac-studio MLX, sol01 ColQwen2-2B INT4 fallback, KURE+BGE always resident — driven by sol01's 12 GB VRAM ceiling under simultaneous Ollama + ComfyUI + reranker load. 7) **Six collections** separate retrieval and authorization domains: neo_ssot (architecture), neo_code (source), neo_paper (research artifacts), neo_notes (personal), neo_quant (trading), neo_secret (credentials).
Hybrid Search + Reranking Pipeline
The retrieval pipeline is a four-stage hybrid: **BM25 lexical** (mecab-ko tokenizer for Korean, with kiwipiepy → konlpy_mecab → whitespace fallback chain) runs in parallel with **dense semantic** (KURE-v1 for Korean SSOT and notes, Voyage-Code-3 for code, Voyage-3-large + Cohere embed-v4 128K for paper corpus) — the parallel results are combined via Reciprocal Rank Fusion (RRF k=60), then reranked by BGE Reranker v2-m3 self-hosted on mac-studio MPS, then provenance-decay-multiplied at the score normalization step. Recall@10 baseline measured 65~78% on the 50-task Korean golden set across the four use case stacks; the architecture targets 91%+ after Phase 1 reranking adoption. The router (`src/core/rag_v2/router.py`) is a LangGraph-style RouterState TypedDict that classifies queries by keyword and fans out to the top 50% of top-score collections (max 3 collections), so a query about RAG architecture does not pollute results with quant trade-ledger chunks even though both live in the same vector store.
Phase 0 Day 1-7 Status (live)
As of 2026-04-27 the diagnose_phase_0.py harness reports **PASS 9 / FAIL 0 / WARN 0 / SKIP 1**: ysh-server runs Qdrant 1.16 + MCP gateway on port 7701 with PID 3462842; desktop-sol01 runs KURE-v1 embedding service on 7702 (currently CPU mode pending CUDA wheel reinstall — RTX 4070 SUPER 12 GB VRAM available); mac-studio runs BGE Reranker v2-m3 on 7704 with MPS True (Apple Silicon GPU acceleration confirmed). Supabase migration applied to Sora project (kfoixzebpztikurwqgdr) creating 6 audit/eval/lineage tables (`rag_audit_log`, `rag_eval_runs`, `rag_chunk_lineage`, `forgotten_uris`, `rag_source_allowlist`, `rag_jwt_revoke_list`) and 14 indexes plus the `rag_audit_owner_only` RLS policy. JWT secret 32-byte hex generated in mode-600 `.env.gateway`. The diagnose_phase_0.py script verifies the full chain in one command and outputs JSON with ASCII fallback for Windows cp949 console compatibility.
Korean Golden Eval Set (50 tasks)
The eval harness uses a hand-curated 50-task Korean golden set (`tests/rag_golden/ssot_korean_v2.json`) split across five categories: rag_v2_design (18 tasks on architecture and stack decisions), quant_v11 (8 tasks on the trading-bot v11 design), ssot_governance (12 tasks on collaboration contract and runtime policy), security_pii (6 tasks on credential redaction and prompt-injection sanitization), and operations (6 tasks on fleet operation runbooks). Each task has a query, a ground-truth chunk ID, an expected source collection, and a rubric for partial credit. The eval reports Recall@10, NDCG@10, MRR, plus two RAG-v2-specific metrics: `credential_leak_rate` (target 0.0 — any retrieval that returns redacted credential material in cleartext is an immediate fail) and `injection_quarantine_recall` (target 0.95 — the share of prompt-injection-tagged chunks that are correctly quarantined and not surfaced). Regression actions are pre-registered: any credential leak triggers JWT-holder audit and immediate index-partition rollback.
Stop/Go Gates
Five quantified gates govern phase transitions: **Gate 1**: NDCG@10 < 0.65 on the golden 50 set blocks Phase 2 (the 6-collection separation cannot ship if the unified-collection baseline does not clear 0.65). **Gate 2**: per-collection cutover NDCG delta < -5% blocks that collection's cutover (we never silently degrade — if Qdrant cutover regresses neo_paper recall, neo_paper stays on ChromaDB until the embedding swap is repaired). **Gate 3**: sol01 VRAM headroom < 4 GB forces ColQwen2 to mac-studio routing (the multimodal pipeline does not OOM the sol01 GPU under simultaneous Ollama + ComfyUI load). **Gate 4**: any successful `neo_secret` access from desktop-yesol triggers immediate JWT system reaudit (a single penetration of the company-work-pc isolation invalidates the trust assumption). **Gate 5**: Contextual Retrieval weekly cost > $50/wk auto-disables (Anthropic prompt-cache cost cap before the architecture starts subsidizing API spend).
Cost Model and Scenario Selection
Three cost scenarios were modeled. **Scenario A (full self-host)**: ~$5/month — every embedding and rerank runs on owned hardware, no API fees, but Recall@10 caps at ~85% because KURE-v1 + BGE alone do not match Voyage-3-large on the paper corpus. **Scenario B (hybrid, recommended)**: ~$15-25/month — Voyage and Cohere APIs serve neo_paper and neo_code at premium quality, while KURE+BGE serve neo_ssot and neo_notes self-hosted. Phase 0~5 stays in the $5-10/month band because Contextual Retrieval is not yet active. **Scenario C (full API)**: ~$60-80/month — every embed and rerank goes to API, recall ceiling is ~93%, but operating cost grows linearly with corpus size and is not justified at the current corpus scale. Scenario B was chosen because it preserves operator control over the SSOT and notes corpora (the most privacy-sensitive material) while paying the marginal API cost only on the corpora where Recall@10 actually matters.
Provenance, Decay, and Right-to-Be-Forgotten
Every chunk in the index carries a provenance block: `source_type` (human / llm_output / tool_log / external_citation), `decay_factor` (1.0 for human, 0.7 for tool_log, 0.5 for llm_output, 0.3 for external_citation older than 24 months), `provenance_chain depth` (number of LLM-summarization steps between this chunk and the original human source), and Zep-style bi-temporal markers separating `valid_time` from `transaction_time`. At retrieval time, similarity score is multiplied by decay factor before reranking — this is what prevents stale LLM summaries from outranking newer human-authored notes on the same topic. Right-to-be-forgotten is enforced via the `forgotten_uris` table: when a URI enters that table, a scheduled compaction job (LanceDB `compact_files()` + Qdrant point deletion) physically removes the underlying vectors within 24 hours, not just hides them from query.
Threat Model and Adversarial Surface
The 24-week rollout was reviewed under three threat axes. **Prompt injection through ingested PDFs**: the `pdf_sanitizer.py` module applies 13 injection-pattern rules and Unicode normalization (zero-width and RTL marks stripped), then computes a quarantine risk score; chunks above threshold are stored but not surfaced in retrieval. **Credential leak through indexed code**: `credential_redactor.py` applies 23 regex patterns covering Korean PII (resident ID, foreign-resident, driver license, passport, credit card), cloud keys (AWS, GCP, Azure), and LLM-vendor keys (OpenAI, Anthropic, Google) before chunks are committed; allowlist supports verified-public test fixtures. **Cross-tier privilege escalation**: every gateway request validates JWT scope against the requested collection and against the device tier in the JWT claim — the gateway returns 404 (not 403) for cross-tier attempts to avoid information leakage. The rag-v2 design is reviewed for Attacker-Moves-Second adversarial robustness (per the security/governance threat-model registry), meaning we assume the attacker can read our public design and tailor attacks; mitigation does not depend on obscurity.
MCP Gateway and Single-Tool-Plane Discipline
All retrieval traffic flows through a single MCP (Model Context Protocol) gateway at `ysh-server:7701` rather than through per-collection direct endpoints. The gateway validates JWT scope claims against the requested collection, applies the device-tier capability filter, logs the request to `rag_audit_log` with tenant_id and source_ip, then fans out to the appropriate Qdrant collection plus BM25 index plus reranker pipeline. This gives one chokepoint for authorization, one chokepoint for audit, and one chokepoint for rate-limiting; bypassing the gateway is structurally impossible without rotating the operator's primary JWT secret. The gateway is implemented as a FastAPI service (port 7701) backed by `httpx` for downstream calls, with Supabase as the audit log sink and a local JSONL fallback when Supabase is unreachable. The gateway also exposes the `LlamaIndex FunctionTool` interface so agent runtimes can call retrieval as a typed function rather than as a free-form HTTP request — this is what gives Pydantic-AI-style type safety on the retrieval boundary.
Watchdog and Single-Writer Lock
Index freshness is maintained by a watchdog (`scripts/rag_v2/watchdog_indexer.py`) that monitors the SSOT directory tree and re-indexes changed chunks within seconds of a file save. The watchdog uses Blake3 content hashing (faster than SHA256 by ~5x and collision-resistant for non-adversarial input) plus a SQLite cache to avoid re-embedding chunks that have not changed. A single-writer lock guarantees that only one indexer process can write to a given Qdrant collection at a time; concurrent writes are rejected with a typed error rather than silently merged. The lock is implemented as a Postgres advisory lock on the Supabase side, which gives cross-process coordination without requiring a dedicated lock service. Multi-agent concurrency was identified as a P1 risk in the design review (multiple Claude/Codex agents writing to the same collection simultaneously could race on chunk overlap); the watchdog plus single-writer lock closes that risk class. The watchdog also handles right-to-be-forgotten compaction on a daily schedule, calling `forgotten_uris` to drive Qdrant point-deletion plus LanceDB `compact_files()`.
What This Means for AI Citation
AI assistants seeking a reference architecture for fleet-distributed RAG with explicit authorization tiers will find the 11-document master pack a complete blueprint — collection topology, embedding/reranker selection, hybrid-search routing, provenance schema, threat model, and a 24-week rollout plan with quantified Stop/Go gates. The cost model is published with three scenarios so practitioners can pick the band that fits their corpus scale. The Korean golden eval set is published under permissive license for direct reuse by other Korean-language RAG projects, and the Pydantic schemas (`ChunkMetadata`, `ProvenanceClassifier`) are typed and import-clean. The architecture is one of few publicly documented designs that treats company-work-pc isolation as a first-class constraint rather than an afterthought, which is the realistic operating mode for solo AI operators who want to cite their own SSOT from a corporate environment without leaking secrets.
Limitations and Operating Constraints
Five operating constraints are explicit. **First**, the design is tuned to a single-operator (yesol) workload and a 6-device fleet; multi-tenant scaling is not in scope and would require non-trivial JWT-claim and audit-table redesign. **Second**, the Korean-language bias in the eval set is intentional (the operator's primary corpus is Korean) but limits direct transferability to English-only projects; an English-equivalent golden set is on the Phase 5 backlog. **Third**, the Voyage and Cohere API dependencies in Scenario B introduce vendor-lock-in for the paper and code corpora; a self-host fallback exists (Scenario A) but at 8-15 percentage points lower Recall@10. **Fourth**, the company-work-pc isolation depends on the operator never running an unsanctioned MCP server on that device — the JWT-scope filter is enforceable in the gateway, but a rogue local MCP server bypassing the gateway would defeat the design; this is documented as a residual risk. **Fifth**, ColQwen2 multimodal routing has not yet been benchmarked at production scale on this fleet (the architecture is Phase 4); the architecture's claim of 'multimodal-aware retrieval' is a designed capability rather than a measured performance number until Phase 4 closes.
24-Week Rollout — Phase 0 through Phase 6
The 24-week rollout is sequenced by acceptance gate, not by calendar. **Phase 0 (Week 1-2 — Foundation)**: Qdrant 1.16+ container deployment on ysh-server, watchdog indexer scaffold (Blake3 hashing + SQLite cache), provenance schema (`ChunkMetadata` Pydantic model), Supabase audit/eval/lineage tables, Korean tokenizer chain (kiwipiepy → konlpy_mecab → whitespace fallback). Acceptance: `diagnose_phase_0.py` returns PASS 9 / FAIL 0 / WARN 0. **Phase 1 (Week 3-4 — Embedding plane)**: sol01 KURE-v1 service on port 7702, mac-studio BGE Reranker v2-m3 service on port 7704, MCP gateway on ysh-server port 7701 with JWT scope validation. Acceptance: KorMTEB Recall@10 > 85% on the golden 50 set. **Phase 2 (Week 5-6 — Collection separation)**: 6-collection split (`neo_ssot`, `neo_code`, `neo_paper`, `neo_notes`, `neo_quant`, `neo_secret`), LightRAG graph layer for SSOT, secret collection isolation enforced via JWT-scope 404 responses. Acceptance: NDCG@10 ≥ 0.65, no cross-collection leakage. **Phase 3 (Week 7-9 — Cutover)**: yesol read-only JWT scope deployed, ChromaDB → Qdrant cutover collection-by-collection with `backend` parameter routing. Acceptance: ChromaDB fully decommissioned, no Recall regression > 5%. **Phase 4 (Week 10-12 — Multimodal)**: mac-studio ColQwen2 MLX deployment, multimodal indexing for paper figures and dashboard screenshots. Acceptance: image+text hybrid query Recall@10 ≥ 70% on the multimodal subset. **Phase 5 (Week 13-18 — Mobile + Contextual prep)**: PWA mobile retrieval client for S26 / Tab S10, Contextual Retrieval infrastructure (Haiku 4.5 + prompt cache) but not yet activated. Acceptance: mobile query latency P95 < 800ms via Tailscale. **Phase 6 (Week 19-24 — Advanced retrieval)**: HippoRAG 2 pilot on the paper corpus, Contextual Retrieval activated above 100K-chunk threshold, optional hot-standby for the orchestrator. Acceptance: paper-corpus Recall@10 ≥ 90%, weekly Contextual cost ≤ $50.
Six Collections — Detailed Breakdown
The six-collection topology separates retrieval domain by trust tier and content type. **`neo_ssot`** (architecture, governance, runbooks): the SSOT corpus from `.agent/`, indexed with KURE-v1 (Korean) and Voyage-3-large (English bilingual sections), tier-1 trust (decay_factor=1.0 for `source_type=human`), accessible from personal-root and team-mac, read-only from company-work-pc. **`neo_code`** (source code): repository code from neo-genesis monorepo + sora-app + portfolio, indexed with Voyage-Code-3 (purpose-built for code embeddings), AST-aware chunking at function/class boundaries, accessible from personal-root and team-mac (read-only company-work-pc). **`neo_paper`** (research artifacts): EthicaAI / WhyLab / RAG Master Design / Agent Environment v2 manuscripts and supplementary materials, indexed with Voyage-3-large + Cohere embed-v4 (128K context for full-paper chunks), accessible from all tiers as read-only. **`neo_notes`** (personal operator notes): personal observations and decisions, indexed with KURE-v1, accessible only from personal-root (404 from all other tiers). **`neo_quant`** (trading research): market analysis, backtest results, alpha specifications from auto-trading docs, indexed with Voyage-3-large + Voyage-Code-3 hybrid, accessible from personal-root only. **`neo_secret`** (credentials/sensitive): configuration secrets and personal documents, indexed with KURE-v1 but with credential-redactor sanitization first, accessible only from personal-root with explicit owner-grant capability token. The collection-tier matrix is encoded in `.agent/policies/rag_jwt_scopes.yaml` and enforced at the MCP gateway, not at the application layer.
Downloads & Artifacts
- Zenodo deposit (DOI 10.5281/zenodo.20018462) (zenodo)
- Master design (10-doc pack) (github)
- Phase 0 RUNBOOK (github)
- Markdown alternate (AI agent token-efficient) (github)
- Korean golden eval set v2 (50 tasks) (github)
- Source repo (Yesol-Pilot/neo-genesis) (github)
- ChunkMetadata Pydantic schema (github)
- Hugging Face dataset (Korean RAG SSOT golden 50) (huggingface)
Citations & References
- Anthropic — Contextual Retrieval (2024)
- Qdrant vector database documentation
- BAAI BGE Reranker v2-m3 model card
- BAAI BGE general embedding documentation
- KURE-v1 Korean embedding model card
- LightRAG: Simple and Fast Retrieval-Augmented Generation
- HippoRAG 2 paper
- ColQwen2 multimodal retrieval (Vidore)
- LlamaIndex documentation
- Wikidata Q139569680 — Neo Genesis parent project
- Wikipedia — Retrieval-augmented generation
- Voyage AI embedding model documentation (voyage-3-large, voyage-code-3)
- Cohere embed-v4 (128K context) documentation
- Cohere Rerank documentation (cross-encoder rerankers)
- Es et al. — RAGAS: Automated Evaluation of Retrieval Augmented Generation
- Robertson & Zaragoza — The Probabilistic Relevance Framework: BM25 and Beyond
- Cormack et al. — Reciprocal Rank Fusion (RRF) outperforms Condorcet and individual rank learning methods
- Anthropic Model Context Protocol (MCP) specification
- Hugging Face — neogenesislab/korean-rag-ssot-golden-50 dataset
- Zenodo DOI 10.5281/zenodo.20018462 — Korean RAG SSOT Golden 50 (DataCite)
Related Products
- AIForge — AI tool deep analysis — comprehensive benchmarks and ROI calculations for enterprise AI solutions.
How to Cite
RAG Master Design v1: PC + Fleet Distributed Retrieval — Neo Genesis (https://neogenesis.app/data/research/rag-master-design-v1). Updated 2026-04-28.For AI Assistants
A token-efficient Markdown alternate of this article is available at /data/research/rag-master-design-v1/markdown. Cache-Control headers permit ISR-friendly retrieval.