German technical RAG — when the off-the-shelf framework wasn't enough, we built our own
A multi-tenant AI knowledge base for two German clients (under NDA): a concrete-products manufacturer (pipes, drainage, sewage systems) and a regional municipal water utility. Both live on terabytes of German technical documents and DIN / EN / DWA norms where every digit and every standard reference matters. We started on a popular Python RAG framework. It silently dropped chunks, hid intermediate state, and made every iteration expensive. So we replaced the DAG with a single orchestrator function and rebuilt retrieval from scratch — and that's the moment the answers started being right.
Clients
Two German tenants (under NDA): a manufacturer of concrete products (pipes, drainage, sewage systems) and a regional municipal water utility. Each on its own subdomain, sharing the same multi-tenant platform.
Engagement
Discovery → architecture → production. The decisive turn came mid-engagement: throwing out the off-the-shelf RAG framework and writing our own control-first pipeline (rag2). That's what made the answers reliable enough to ship.
Generic RAG falls apart on German technical docs. The framework was making it worse.
The documents on both sides are technical and unforgiving: datasheets and product catalogues for the concrete-products business, construction protocols, sewer-network standards and DIN norms for the water utility. The questions are technical too — which pipe diameter fits this project? what does the relevant DIN norm say about sewer rehabilitation? which exposure class do I need at this depth? Generic English-tuned embeddings collapse on German compound nouns. Naïve chunkers shred tables. And a "close enough" answer in this domain is the wrong answer.
The first version of the platform was built on a well-known Python RAG framework — the kind with a DAG, components, a graph executor. It worked on the demo. It did not work on the corpus. Components silently skipped when inputs were the wrong shape. Custom post-rerank logic (page-floor, coverage retry) was brittle on top of the graph. Profiling and step-through debugging fought the scheduler. Every regression was a multi-hour investigation. The framework was the bottleneck.
So we did the unfashionable thing: we deleted it and wrote a single orchestrator function — rag2 — where every retrieval step is an explicit Python call. Debuggers work. Profilers work. Asserts work. Each booster is feature-flagged, so the bench can attribute gains per feature. That's the architecture this case is about.
One orchestrator function. Explicit stages. Hand-rolled retrieval tuned to German technical docs.
No Haystack. No LangChain. No LlamaIndex. Just openai, pinecone, bm25s, sentence-transformers, docling, tiktoken, pydantic — wired together as a single pipeline function we can step through line by line.
Multi-tenant by construction, fail-closed on collisions. Each tenant lives on its own subdomain, with its own Postgres database (Supabase), its own Pinecone namespace, its own upload paths, its own CORS allow-list and its own MinIO bucket. Tenant config is loaded with TTL caching at request time via a middleware resolver. The config layer refuses to start if two tenants accidentally share a database URL or namespace.
Filename routing before retrieval. Many technical questions name the document directly ("the DIN 1610 procedure", "the DN 800 datasheet"). A router stage detects target filenames from the query text and switches into a full-file bypass: if the target set is small, retrieval pulls those files end-to-end instead of guessing chunks. That alone fixed a class of "the system answered with the wrong product family" failures.
Per-tenant synonym expansion. Domain vocabularies are encoded as per-tenant JSON dictionaries — concrete-pipe terminology for one tenant, sewer-network terminology for the other. Expansion runs before retrieval, with the original query preserved as a fallback so we never lose recall.
Hybrid retrieval, ours not theirs. Dense vectors via OpenAI text-embedding-3-large (Matryoshka-truncated to 1024 dims) into Pinecone, sparse via bm25s (a fast BM25 reimplementation), fused via Reciprocal Rank Fusion. A cross-encoder reranker (mmarco-mMiniLMv2-L12, multilingual) reranks the merged set. An optional LLM-as-reranker stage is available behind a flag for hard questions.
Page floor and diversity cap — the unsexy fixes that won the bench. When the target set is known, the page-floor stage guarantees at least one chunk per page from the target files, so a single high-scoring chunk doesn't crowd out the rest of the document. A diversity cap then limits how many chunks any one page can contribute, preventing the context window from saturating on one section.
Anchor boost on DIN/EN/DWA references. A chunker pass extracts DIN \d+(-\d+)?, DWA-[AM] \d+(-\d+)?, EN \d+(-\d+)?, ISO \d+(-\d+)? and stores them in chunk metadata. The retrieval stage boosts chunks whose anchor set overlaps the query's. The synthesizer is told to preserve every norm reference verbatim.
Token-budget-aware context trim. The context budget is enforced with tiktoken, not a hand-wavy character count — so we know exactly what the model sees and never accidentally truncate mid-table.
Synthesis with few-shot exemplars per tenant. The synthesizer (OpenAI chat completion) gets per-tenant few-shot examples that teach the LLM to keep DIN identifiers verbatim, hand back compound-noun nomenclature unchanged, and stay in German. Determinism is config-driven (seed=42, temperature=0) for reproducible answers in a regulated context.
Coverage retry. After generation, the pipeline compares the anchors present in the retrieved context against the anchors present in the answer. If the LLM dropped an anchor that should have been quoted, it re-runs synthesis with the anchor explicitly enumerated in the prompt. Cheap, mechanical, and it killed an entire class of "the model summarised away the DIN number" failures.
Empirical bench, not vibes. A 23-case golden set per tenant runs in CI. Every booster is feature-flagged so the bench attributes gains per feature, not per release. We knew which custom pieces earned their place — and which didn't.
rag2 pipeline (one function, ten explicit stages)
Two tenants in production. Norm-grounded answers in German. A pipeline we can actually debug.
Production tenants on one platform
One multi-tenant SaaS, each tenant on its own subdomain with isolated database, Pinecone namespace, upload paths and CORS. Fail-closed config enforcement on collisions.
DAG framework dependencies
The orchestrator is one Python function. Components don't silently skip. Stack traces point at lines. Profilers work. Asserts work. Iteration cost dropped accordingly.
Norms preserved per chunk
Norm references are extracted at chunk time and stored in metadata. The anchor-boost stage uses them. The synthesizer is instructed to keep them verbatim. The coverage-retry stage will re-run synthesis if any get dropped.
Golden eval cases per tenant
A hand-curated golden set per tenant runs in the bench. Every booster (router, synonyms, page floor, diversity, coverage retry) is feature-flagged so gains are attributed per feature.
Dense + sparse + rerank
OpenAI text-embedding-3-large (1024 dims) + bm25s sparse + RRF + mmarco-mMiniLMv2-L12 multilingual reranker. Tuned for German compound nouns and exact DIN refs.
Lead-capture chatbot & AI email composer
Public chat with ephemeral session tokens, per-tenant CORS, strict refusal contract. AI email composer over IMAP/SMTP with thread context and templates, for human escalation and bulk outreach.
The off-the-shelf RAG frameworks are great until you actually need to know why a chunk got dropped. On a German technical corpus where one missing DIN reference makes the answer wrong, that visibility is the whole job. The moment we replaced the DAG with one explicit function, every retrieval problem became solvable.
Six decisions that came from deleting the framework, not adding to it.
1. The framework was the problem, not a missing feature. DAG-based pipelines hide intermediate state. On a corpus where chunk loss is the dominant failure mode, that's catastrophic. A single Python function with explicit stages cost us nothing and gave us back the debugger.
2. Filename routing before retrieval beats smarter retrieval. When the user names the document, retrieve that document, not its closest neighbour in embedding space. Trivial code, large effect.
3. Per-tenant synonym JSON is unreasonably effective on technical German. "Betonrohr" / "Rohr" / "DN 800" / "Druckrohr" all map together. So do construction-class names. Domain dictionaries beat fine-tuning at this scale and stay legible to the client.
4. Page floor + diversity cap fix the retrieval shape, not the retrieval scores. One missing page on a target file ruined more answers than any embedding model upgrade did. Forcing ≥1 chunk per page and capping per-page contribution closed that gap.
5. Coverage retry is the cheapest accuracy gain we have. Compare anchors in context vs. anchors in answer. If the LLM dropped one, re-prompt with that anchor pinned. Two extra requests on the bad cases, zero cost on the good ones.
6. A golden bench with per-feature flags is how you know what worked. Without it, every booster looks essential. With it, dead code dies and live code earns its keep.
Discovery → architecture → delete the framework → production.
- Discovery
Document types per tenant
Mapped datasheets and product catalogues for the concrete-products tenant; construction protocols, sewer-network standards and DIN norms for the water utility. Designed the per-tenant synonym set and the golden eval set.
- Architecture
Multi-tenant SaaS scaffolding
Per-tenant subdomain routing, isolated Postgres + Pinecone namespace + upload paths + CORS. Vendor OCR (Google Document AI / Microsoft Graph) primary; Docling fallback for local structural parsing. Tenant config with TTL caching and fail-closed collision detection.
- Rewrite
Delete the DAG, write rag2
Replaced the off-the-shelf RAG framework with a single orchestrator function. Hand-rolled router, synonym expansion, hybrid (dense + bm25s), rerank, page floor, diversity cap, anchor boost, coverage retry. Per-tenant few-shot exemplars. Deterministic synthesis (seed=42, temp=0). 23-case golden bench in CI with per-feature flags.
- Production
Two tenants live, plus chat & email
Lead-capture public chatbot on each tenant subdomain with ephemeral session tokens and strict refusal contract. AI email composer over IMAP/SMTP with thread context, attachment handling and templates. Webmail blueprint for human escalation and bulk outreach.
Same engineering DNA, different domains.
We've already built the pipeline you don't want to write.
Bring us the corpus, the kinds of questions you need answered, and the norms or regulations that have to be preserved. We'll tell you what's in the framework, what isn't, and what to delete.