nvisia AI Lab · Prototype
Dark Factory
What if you handed AI your requirements and came back to working, tested, production-quality code?
Kevin Quon built a fully autonomous L4 code generation platform — in 8 days, using Claude Code. Drop in requirements as Word docs, PDFs, markdown, or YAML. Get back production-quality code, tests, and cross-browser E2E validation reports. No human intervention between phases.
What Is a Dark Factory?
In manufacturing, a "dark factory" runs lights-out: fully automated, no human operators required. We asked a simple but provocative question:
Can software development work the same way?
The AI Dark Factory is built on a single premise: a coordinated swarm of specialised agents with persistent memory produces higher-quality code than a single monolithic agent with a bigger context window. Every architectural choice flows from that premise.
We didn't just theorize about it. We built one — and it runs.

"Level 5 agentic engineering - agents build, test, validate, and iterate autonomously from specs. Humans drive via specifications only."
The 7-Phase Pipeline
Requirements → Production Code
01
Phase 1 — Ingest
Accepts any mix of document formats (Word, Excel, PDF, YAML, JSON, Markdown, transcripts). Parses each into discrete testable requirements. Runs a semantic deduplication pass to eliminate near-duplicate requirements before processing begins.
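A minimal sketch of what that deduplication pass can look like, assuming an injected embed() callable and a cosine-similarity cutoff; the 0.92 threshold is an illustrative number, not the system's actual setting:

```python
# Keep a requirement only if no already-kept requirement is a near-duplicate.
# embed() is a stand-in for a call to an embedding model; the threshold is illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe(requirements: list[str], embed, threshold: float = 0.92) -> list[str]:
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for req in requirements:
        vec = embed(req)
        if all(cosine(vec, v) < threshold for v in kept_vecs):
            kept.append(req)
            kept_vecs.append(vec)
    return kept
```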
02
Phase 2 — Spec Generation
Each requirement passes through a Planner decomposition step, then an architect → critic → refine loop scored by an independent LLM judge (DeepEval + GPT). Specs that score below threshold don't advance.
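The shape of that loop, sketched with the LLM calls injected as callables; the 0.8 threshold and the three-round cap are assumptions for illustration:

```python
from typing import Callable

def generate_spec(
    requirement: str,
    architect: Callable[[str], str],     # drafts a spec from the requirement
    critic: Callable[[str, str], str],   # critiques the draft
    refine: Callable[[str, str], str],   # revises the draft against the critique
    judge: Callable[[str, str], float],  # independent judge (DeepEval + GPT)
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> str | None:
    spec = architect(requirement)
    for _ in range(max_rounds):
        if judge(requirement, spec) >= threshold:
            return spec                  # passes the judge: spec advances
        spec = refine(spec, critic(requirement, spec))
    return None                          # below threshold: spec does not advance
```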
03
Phase 2b — Spec Reconciliation
A dedicated sandboxed pass strips phantom references, breaks circular dependencies, and flags uncovered requirements. An LLM-assisted pass then surfaces implicit dependencies the planner missed.
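One concrete piece of that pass, sketched under an assumed data shape (a spec-to-dependencies map): detecting a circular dependency chain with a depth-first search.

```python
# Return one dependency cycle if any exists, e.g. ["a", "b", "a"], else None.
def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    visiting: set[str] = set()   # nodes on the current DFS path
    done: set[str] = set()       # nodes already proven cycle-free

    def dfs(node: str, path: list[str]) -> list[str] | None:
        if node in visiting:
            return path[path.index(node):] + [node]  # back-edge: cycle found
        if node in done:
            return None
        visiting.add(node)
        for dep in deps.get(node, []):
            if (cycle := dfs(dep, path + [node])) is not None:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for spec in deps:
        if (cycle := dfs(spec, [])) is not None:
            return cycle
    return None
```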
04
Phase 3 — Knowledge Graph
Specs and requirements are persisted to Neo4j with IMPLEMENTS and DEPENDS_ON relationships. Simultaneously indexed into Qdrant so downstream Coder agents can pull semantically similar prior work as context.
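A hedged illustration of the graph half of that step, using the official neo4j Python driver and the relationship names above; the labels, property names, and connection details are assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def persist_spec(spec_id: str, requirement_id: str, depends_on: list[str]) -> None:
    with driver.session() as session:
        # Spec IMPLEMENTS Requirement
        session.run(
            "MERGE (s:Spec {id: $spec_id}) "
            "MERGE (r:Requirement {id: $req_id}) "
            "MERGE (s)-[:IMPLEMENTS]->(r)",
            spec_id=spec_id, req_id=requirement_id,
        )
        # Spec DEPENDS_ON other Specs
        for dep_id in depends_on:
            session.run(
                "MATCH (s:Spec {id: $spec_id}) "
                "MERGE (d:Spec {id: $dep_id}) "
                "MERGE (s)-[:DEPENDS_ON]->(d)",
                spec_id=spec_id, dep_id=dep_id,
            )
```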
05
Phase 4 — Swarm Orchestration
Four specialised agents — Planner, Coder, Reviewer, Tester — collaborate per feature. Features run in parallel within dependency layers. Each swarm is isolated; no shared mutable state between concurrent features.
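In code terms, the orchestration pattern is roughly the following, with run_swarm standing in for one feature's four-agent swarm and the concurrency bound of 4 chosen purely for illustration:

```python
import asyncio

async def run_layers(layers: list[list[str]], run_swarm, max_concurrent: int = 4) -> None:
    sem = asyncio.Semaphore(max_concurrent)  # configurable parallelism bound

    async def bounded(feature: str) -> None:
        async with sem:
            await run_swarm(feature)  # isolated swarm, no shared mutable state

    for layer in layers:  # a layer starts only after every earlier layer finishes
        await asyncio.gather(*(bounded(feature) for feature in layer))
```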
06
Phase 5 — Reconciliation
A single extended Claude Agent SDK invocation reviews the full output directory — catching cross-feature import breakage, inconsistent API shapes, and missing glue files that per-feature swarms structurally cannot detect.
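A minimal sketch of what that invocation could look like through the Python claude-agent-sdk's query() interface; the prompt, options, and output path are placeholders, not the project's actual configuration:

```python
import asyncio
from claude_agent_sdk import ClaudeAgentOptions, query

async def reconcile(output_dir: str) -> None:
    options = ClaudeAgentOptions(
        cwd=output_dir,
        permission_mode="acceptEdits",  # let the agent write fixes directly
    )
    prompt = (
        "Review the entire generated codebase for cross-feature issues: "
        "broken imports, inconsistent API shapes, and missing glue files. "
        "Fix what you find."
    )
    async for message in query(prompt=prompt, options=options):
        print(message)  # stream progress into the run log

asyncio.run(reconcile("./output"))
```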
07
Phase 6 — End-to-End Validation
A Playwright cross-browser smoke test suite runs the generated application in Chromium, Firefox, and WebKit. Results, failure screenshots, and a browsable HTML report are surfaced in the Run Detail view.
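A stripped-down version of that idea, using Playwright's sync API; the URL, the assertion, and the screenshot names are placeholders:

```python
from playwright.sync_api import sync_playwright

def smoke_test(url: str = "http://localhost:5173") -> None:
    with sync_playwright() as p:
        for browser_type in (p.chromium, p.firefox, p.webkit):
            browser = browser_type.launch()
            page = browser.new_page()
            page.goto(url)
            assert page.title() != "", f"{browser_type.name}: page rendered no title"
            page.screenshot(path=f"smoke-{browser_type.name}.png")  # failure evidence
            browser.close()

if __name__ == "__main__":
    smoke_test()
```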
Four Agents, One Feature
Every feature is handled by the same four specialised agents — each with a narrow role, a focused context, and a clear definition of done.
Planner
Entry point for every swarm. Reads the spec, recalls prior strategies and episode history, decides the opening move, and is the only agent that can terminate the swarm successfully.
Coder
Responsible for implementation. Runs two RAG queries before writing a single line, one for similar specs and one for similar code artifacts, to reuse proven patterns and avoid known mistakes; those two queries are sketched after this list.
Reviewer
Evaluates generated code against a DeepEval rubric. Writes Mistake nodes to memory when it finds problems, paired with Solution nodes when fixes are known. Future Coders on similar features inherit those lessons.
Tester
Writes unit tests, integration tests, and edge cases against the spec's acceptance criteria. Records test failures as Mistake nodes so future Testers on similar features start with that knowledge.
Specialisation is the point. A Reviewer that only evaluates catches problems a Coder cannot self-diagnose. A Planner that only reads specs and history stays focused. The swarm encodes the division of labour that high-performing engineering teams already use.
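The Coder's pre-flight retrieval, sketched against the qdrant-client API; the collection names and the idea of sharing one query vector across both lookups are assumptions for illustration:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def recall_context(spec_vector: list[float], top_k: int = 5) -> dict:
    # Query 1: specs similar to the one being implemented
    similar_specs = client.query_points(
        collection_name="specs", query=spec_vector, limit=top_k
    ).points
    # Query 2: code artifacts produced for similar specs
    similar_code = client.query_points(
        collection_name="code_artifacts", query=spec_vector, limit=top_k
    ).points
    return {"specs": similar_specs, "code": similar_code}
```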
A System That Gets Smarter With Every Run
Most agent systems treat memory as a convenience. Dark Factory treats it as load-bearing infrastructure.
Every agent reads from a shared Neo4j + Qdrant memory graph before its first action, and writes back after its last. Memory is split into two tiers:
Semantic Memory
Generalised lessons. Patterns (reusable code structures), Mistakes (known failure modes with root causes), Solutions (fixes paired to Mistakes), and Strategies (approach decisions). Lessons that prove useful accumulate relevance. Lessons that fail in practice decay and eventually prune themselves.
Episodic Memory
Specific past trajectories. After every feature swarm, the orchestrator synthesises what happened, what worked, and which prior memories influenced the outcome into a structured Episode. Future Planners can recall not just "what should I do?" but "what actually happened the last three times I was in this exact situation?"
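A toy illustration of that reinforce-and-decay bookkeeping; the boost and decay factors and the pruning floor are invented numbers, and the real system persists relevance in the Neo4j graph rather than in process memory:

```python
from dataclasses import dataclass

@dataclass
class MemoryNode:
    kind: str            # "Pattern" | "Mistake" | "Solution" | "Strategy"
    lesson: str
    relevance: float = 1.0

def update_relevance(node: MemoryNode, helped: bool) -> MemoryNode | None:
    node.relevance *= 1.2 if helped else 0.8        # reinforce lessons that worked
    return node if node.relevance >= 0.1 else None  # decayed past the floor: prune
```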

"Memory is earned, not configured. There are no manually defined rules or knowledge bases to maintain. The memory graph is entirely self-generated from agent observations during real runs. The first run starts with no memory. Every run after starts from a higher baseline."
L4 — Fully Autonomous / Explorer
Where Dark Factory sits on the Vellum agentic behavior scale — and why each trait is implemented, not claimed.
Dark Factory sits at L4, Fully Autonomous / Explorer, on the Vellum agentic behavior scale. Every trait below is implemented in the running system:
Persistent State Across Sessions
Neo4j procedural memory + Qdrant embeddings + eval history + run history all survive restarts.
Refines Execution from Feedback
DeepEval scores drive memory boosts and demotions, adaptive thresholds, and mid-run strategy overrides.
Parallel Execution
Concurrent feature swarms within each dependency layer, bounded by configurable limits.
Real-Time Adaptation
Layer-level strategy switches and cross-feature briefing within the same run.
Cross-Feature Reconciliation
Phase 5's extended Claude Agent SDK pass sees the full output and polishes the integration.
What Kevin Discovered Building This
Real findings from 8 days of autonomous code generation
nvisia's AI Lab set out to build a working Dark Factory prototype applied to a real internal problem: utilization forecasting. Leadership needed to understand workforce trends: who's engaged where, what's influencing it, and where gaps are emerging. The process was manual, fragile, and slow.
Our hypothesis: an AI-driven, spec-defined system could do this work autonomously.

The Mandate:
We gave our bench team one constraint: we do not write code. We do not validate code. We update specifications.

Everything had to flow from human-readable specs. The AI did the rest.
What We Explored:
Specialised agent swarms beat single large agents
Four small agents (Planner, Coder, Reviewer, Tester) with distinct roles outperform one monolithic agent with a bigger context window.
Memory is the multiplier
The system's procedural memory layer means each run benefits from everything prior runs learned. Patterns that work get reinforced; mistakes get recorded and avoided.
Spec quality is everything
Bad specs produce bad code. The pipeline's most critical input is well-structured, unambiguous requirements.
Cost is real but manageable
A simple requirements run costs a few dollars. A full production application could run into hundreds. Cost visibility is built in — every token and LLM call is tracked.
L5 autonomy isn't here yet
The current system is L4. True L5 would require self-directed goal generation and autonomous strategy invention; the project documents exactly what that would take.
"By pushing the boundary, we're learning the best way to do things that'll get you to wherever that boundary ends - because it'll end at a different point in the future."
Shaun Lovick, President, nvisia
Under the Hood
The technical stack powering autonomous code generation
Backend
FastAPI + Uvicorn (Python 3.12+)
Frontend
React 18 + Vite + TypeScript
Agent Runtime
LangGraph swarm + Claude Agent SDK
Knowledge Store
Neo4j (graph + procedural memory)
Vector Store
Qdrant (text-embedding-3-large, 3072 dimensions)
Metrics
Prometheus + Grafana + optional Postgres forensic store
E2E Validation
Playwright (Chromium, Firefox, WebKit)
Evaluation Judge
DeepEval with GPT (intentionally separate from the Anthropic models used for codegen)
Storage
Pluggable local, S3, or replicated (local + S3)
What This Delivers
Outcomes from the whitepaper — what the Dark Factory produces for the teams that run it.
Faster Time-to-Code
Requirements become working, tested code in a single invocation. Engineers supervise outcomes instead of typing the code themselves.
Continuous Improvement
Procedural memory persists across runs. The system remembers what worked, what broke, and how it was fixed — and applies those lessons automatically next time.
Predictable Quality
Every spec and artifact is evaluated by an independent LLM judge against an explicit rubric. No artifact ships without a score.
Full Cost Visibility
Every LLM call, token count, tool invocation, and eval result is recorded. Operators see exactly what a run cost and which features drove the spend; a sketch of that ledger appears after this list.
Auditable Decisions
Every agent action, handoff, tool call, and evaluation is logged, timestamped, and accessible in the Run Detail view. Every run leaves a complete forensic trail.
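For illustration, the cost ledger behind the Full Cost Visibility card can be as simple as the sketch below; the pricing table is a placeholder, not actual model rates:

```python
from collections import defaultdict
from dataclasses import dataclass, field

PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # placeholder $/1K tokens

@dataclass
class RunLedger:
    tokens: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        self.tokens[(feature, "input")] += input_tokens
        self.tokens[(feature, "output")] += output_tokens

    def run_cost(self) -> float:
        return sum(n / 1000 * PRICE_PER_1K[kind] for (_, kind), n in self.tokens.items())
```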
Honest Findings. Real Value.
The Dark Factory isn't ready for most organizations, but what we learned getting there absolutely is.

What we confirmed:
True Level 5 autonomy is not yet practical for core business systems. The models, tooling, and organizational readiness aren't there yet, and the ROI doesn't justify the risk today.
The journey to L5 exposed exactly what is achievable, and where the highest-value opportunities sit right now.
What we discovered along the way:
Spec-Driven Development works
When specifications are structured, versioned, and treated as living documents, AI agents produce dramatically more consistent and trustworthy code.
Roles are evolving
The traditional developer-engineer-QA pipeline is shifting. Product engineers who write precise intent are becoming the most valuable contributors in an AI-augmented team.
The PM gap is real
Existing SDD frameworks are engineering-centric. We forked OpenSpec to incorporate product-brief structure (KPIs, measurable outcomes, total addressable problem), making it accessible to non-engineers.
QA is transforming
It's no longer about testing endpoints. It's Visual Quality Assurance (VQA): did the system actually deliver the experience and outcome the business envisioned?
Specs as institutional knowledge
Version-controlled specs co-hosted with code don't go stale the way Confluence pages do. They become a durable, queryable record of intent.
What This Means for Your Organization
You don't need to build a Dark Factory.
But you should know what the path looks like.
Kevin built this as a science experiment. But what it demonstrates has real implications for how engineering organizations think about automation, roles, and velocity.
The organizations that will win in an AI-first world aren't the ones chasing Level 5 today; they're the ones systematically moving from L2 to L3 to L4, building the capability, the culture, and the specs to get there.

Key questions this experiment helps you answer:
1
Investment & ROI
How much should we invest in AI-assisted development - and what should we expect back?
2
Role Evolution
Which roles in our engineering organization need to evolve - and how?
3
Spec-Driven Workflow
What does a spec-driven workflow actually look like in practice?
4
Human Gates
Where are the human gates we should never remove — and where are we over-relying on human review?
5
Team Upskilling
How do we upskill our teams without disrupting delivery?
nvisia AI Symposium · Chicago · May 7, 2025
See the Experiment Live
Kevin Quon will be at the booth with the live system. Here's what you'll see:
Live pipeline run
Watch requirements transform into specs, code, and tests in real time
Agent memory dashboard
See the patterns, mistakes, and strategies the system has learned across runs
Run comparison
Side-by-side diffs of what changed between two pipeline runs
E2E test report
The actual Playwright validation output from a completed run
The architecture
Kevin will walk through the 7-phase pipeline and answer technical questions
Kevin Quon, nvisia AI Lab — on-site and ready to go deep.
"It should be fascinating to novices and experts alike — to actually see this kind of thing moving."
Prototype by Kevin Quon · nvisia AI Lab · linkedin.com/in/kwkwan00 · Built with Claude Code