The Metallurgy of Making
On the structural isomorphism between metalworking and autonomous software development
The Smith’s Problem
There is a moment in blacksmithing that separates the apprentice from the master.
The metal is at cherry red, pulled from the forge with tongs and placed on the anvil face. The smith has perhaps ninety seconds before the workpiece cools below forging temperature and must be returned to the fire. In that window, every blow of the hammer matters. The angle, force, and placement of each strike must be calibrated against a material that is simultaneously cooperating and resisting.
This is not an analogy for software development. It is software development, expressed in iron.
For the past two and a half years, I have been working with large language models: first at startups building agentic AI systems, then increasingly on my own. Over this time, I have been developing a thesis about how software should be made in an age when AI can do more and more of the making. That thesis crystallized into a development methodology I call The Foundry: an autonomous engineering harness that converts raw ideas into shipped, tested, monitored software. In just the last three weeks, it has produced hundreds of completed engineering issues across 35 milestones, a working product, and a comprehensive test suite.
It works because I stopped treating the software development process as a conveyor belt and started treating it as a forge.
Raw Ore: The Problem with Ideas
Every software project begins with ore.
Ore is valuable but unusable. A lump of hematite contains iron, but you cannot build a bridge from hematite. The iron is locked inside a matrix of silicon, oxygen, aluminum, and a dozen other elements that must be separated before the metal becomes workable. The raw material must be transformed, not once, but through a sequence of transformations, each one removing impurities and adding structure.
Ideas are the same. A braindump in a notes app is ore. It contains signal (genuine insight, valid requirements, real user needs) but it is locked inside a matrix of ambiguity, assumption, contradiction, and wishful thinking. You can build software from a braindump. The vibe coding movement of 2025 proved that much: prompt an LLM with a rough idea and code comes out. But you cannot build good software this way: software that is reliable, explainable, maintainable, and auditable. That requires the same progressive refinement that turns raw ore into structural steel. The distance between “it runs” and “it holds load” is the entire discipline of metallurgy.
Traditional software development recognizes this implicitly. We have requirements gathering, specification writing, task decomposition, and implementation: a pipeline of progressive refinement. But we treat these as administrative stages, not as metallurgical ones. We miss the physics of the process: the temperatures required, the timing constraints, the material properties that change at each stage, and most critically, the quality gates that prevent defective material from advancing.
In my system, each stage has a name drawn from metalworking. These names are not poetic license. They are structural descriptions of four transformations: braindump, specification, task decomposition, and implementation.

The Vein: Braindump
The first stage is The Vein, the mine where raw ideas are extracted.
In traditional prospecting, the miner reads gossans: oxidized, rust-stained outcrops that mark where ore deposits have been exposed to weathering. A gossan is a signal, but an ambiguous one. The miner must interpret incomplete signals from a system they cannot fully see, and discover whether their reading was correct only after significant investment.
My “Mining” process does the same work. It takes raw input (a braindump, a conversation transcript, a fragment of an idea) and extracts the signal. The output is an Obsidian note: a captured thought, tagged and cross-referenced, but still raw. The Vein is not where structure happens. It is where ideas are found and connected. Obsidian’s graph view draws the relationships between notes, revealing clusters of related thinking that might not be obvious in isolation, the way a geological survey maps connections between surface outcrops and the deposits below.
I chose Obsidian for this stage because mining needs a tool that is fast, local-first, and built for linking rather than formatting. What matters at the braindump stage is not document structure but connection: how does this idea relate to the last three? Where does this fragment fit in the larger picture? Any tool that supports freeform capture with tagging and cross-referencing (Logseq, Bear, even a well-organized folder of markdown files) would serve the same purpose. The key is that the output of this stage is raw and connected, not refined.
The critical property of mining is that you don’t know what you have until you extract it. Sometimes the vein pinches out. Sometimes the braindump contains no actionable signal at all. The Miner must report this honestly rather than fabricating richness where none exists.
The Furnace: Specification
What emerges from the mine is not yet workable metal. It is raw ore: useful material intermingled with waste. The next stage is The Furnace, the smelting operation that separates signal from noise and produces a workable specification.
Smelting works through reduction: heat and a chemical agent strip away everything that isn’t the desired metal. The lighter slag floats to the surface; the heavier metal sinks to the bottom. What was locked inside the ore is freed through the deliberate removal of what doesn’t belong.
My “Smelting” process performs the same operation on raw notes. It takes the Miner’s unstructured, ambiguous output and reduces it. Not in the colloquial sense of “making smaller,” but in the chemical sense: stripping away what doesn’t belong to reveal the essential structure underneath. Ambiguities are resolved. Contradictions are surfaced. Architectural decisions are made explicit, with alternatives documented and rationale provided. The output is a formal technical specification with Overview, Architecture, Implementation, Testing, and Cost sections.
I chose Notion for this stage because smelting needs a tool built for structured, rich documents. The specification is the most important artifact in the pipeline: it is where raw ideas become engineered plans. That requires headings, tables, embedded diagrams, and enough formatting capability to express architectural decisions clearly. Google Docs, Confluence, or any tool that supports collaborative structured documents would work. The key is that the output of this stage is formal and reviewable, not freeform.
The Smelter’s quality gate enforces the separation with deterministic checks: does the spec have architectural decisions with documented alternatives? Does the implementation define phases? Are scope boundaries explicitly stated? If impurities remain in the metal, the finished product will be brittle.
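A minimal sketch of how deterministic checks like these could be expressed, assuming a hypothetical in-memory `Spec` structure. The class names, fields, and failure messages are illustrative assumptions, not the Foundry's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    # Hypothetical shape of one architectural decision in a spec.
    title: str
    rationale: str
    alternatives: list[str] = field(default_factory=list)


@dataclass
class Spec:
    # Hypothetical, simplified view of a specification document.
    decisions: list[Decision]
    phases: list[str]
    out_of_scope: list[str]


def smelter_gate(spec: Spec) -> list[str]:
    """Return the list of gate failures; an empty list means the spec passes."""
    failures = []
    if not spec.decisions:
        failures.append("no architectural decisions recorded")
    for decision in spec.decisions:
        if not decision.alternatives:
            failures.append(f"decision '{decision.title}' documents no alternatives")
    if not spec.phases:
        failures.append("implementation defines no phases")
    if not spec.out_of_scope:
        failures.append("scope boundaries are not stated")
    return failures


# Example: one decision without alternatives, and no stated scope boundary.
draft = Spec(
    decisions=[Decision("Storage engine", "simplest to operate")],
    phases=["Phase 1: schema"],
    out_of_scope=[],
)
for failure in smelter_gate(draft):
    print("gate failure:", failure)
```

Because every check is a plain predicate over the artifact, the gate gives the same answer on every run, which is what separates it from the judgment asked of the human reviewer.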
The Smelter also identifies which of the system’s many prior specifications most closely resembles the current work, and uses it as an exemplar. This is institutional memory applied at the moment of creation, ensuring that each new specification inherits the structural patterns that have proven successful.
The Anvil: Task Decomposition
Forging is the most physically intimate stage of metalworking. The smith’s hands, through the hammer, directly impose form on material.
The operation requires three things: heat (the metal must be at working temperature), force (the hammer must deliver precisely calibrated blows), and resistance (the anvil must refuse to yield). Without any one of these, forging is impossible. Cold metal cracks. Uncontrolled force damages. Without the anvil’s immovable hardness, the hammer merely displaces rather than shapes.
My “Smithing” process works on The Anvil: the workspace where specifications are decomposed into discrete, independently workable tasks. Each issue is a hammer blow, a precisely scoped piece of work with acceptance criteria, priority, labels, and dependency relationships. The Smith does not implement anything. It shapes the specification into a form that can be implemented: a set of tasks with clear boundaries, clear ordering, and clear definitions of done.
I chose Linear for this stage because task decomposition needs a tool built for engineering workflow: dependencies between issues, priority levels, milestones, and the ability to visualize the shape of the work. Jira, GitHub Issues, or any tracker with dependency support would serve the same purpose. The key is that the output of this stage is atomic and sequenced: every task is independently actionable, and the dependency graph determines which tasks must be completed before others can begin.
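To make "atomic and sequenced" concrete, here is a small sketch that groups issues into waves using the standard library's `graphlib`. The issue identifiers and the `blocked_by` mapping are made up for illustration; a real tracker would supply this data through its own API.

```python
from graphlib import TopologicalSorter

# Hypothetical issues: each key maps to the set of issues that must finish first.
blocked_by = {
    "FND-1 schema": set(),
    "FND-2 api": {"FND-1 schema"},
    "FND-3 dashboard": {"FND-2 api"},
    "FND-4 docs": {"FND-1 schema"},
}


def work_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group issues into waves; issues within a wave do not block each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # marking them done unblocks their dependents
    return waves


for number, wave in enumerate(work_waves(blocked_by), start=1):
    print(f"wave {number}: {wave}")
```

A circular dependency surfaces as an error at `prepare()` rather than as a stalled task later, which is one way a decomposition's ordering can be checked before any implementation begins.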
The Smith’s quality gate ensures the shaping is sound. Every issue has a description. Every issue has a priority. Acceptance criteria exist. The scope is reasonable, because a workpiece that requires too many heats is degraded by the process itself.
The Alloy: Implementation
The final stage is The Alloy: the code that emerges from implementation and testing.
In metallurgy, an alloy is stronger than any of its component metals. Bronze is harder than either copper or tin alone. Steel is stronger than pure iron. The strength comes from the interaction between unlike elements. The alloy’s power is in its heterogeneity.
Software is the same. A production system is an alloy of code, configuration, tests, documentation, infrastructure, and operational knowledge. No single component is sufficient.
Code without tests is pure iron, strong in one direction, brittle under unexpected stress.
My “Forging” process implements issues, runs tests, handles architectural decisions, and ships pull requests. But the Forger is not merely an executor. It is a temperer.
Tempering is the trade-off at the heart of all metalworking. Hardened steel is extremely hard but extremely brittle. Hardness without toughness is useless. Tempering solves this through controlled reheating, deliberately trading some hardness for significantly more toughness.
My test suite is my tempering process. Unit and integration tests temper the deterministic pieces: API responses, data transformations, workflow orchestration. Evaluations temper the inference pieces: the quality of LLM reasoning, the accuracy of generated plans, the coherence of synthesized outputs. Together they apply controlled stress to every part of the codebase, confirming that each piece has reached the right balance of rigidity and resilience for its purpose.
The most sophisticated technique in tempering is differential tempering: providing different degrees of temper to different parts of the same piece. A blade’s edge remains very hard (holding a sharp cutting edge), while the spine becomes tough (absorbing impact without shattering).
My system does this naturally. The core engine is tempered hard: deterministic, heavily tested, breaking changes caught immediately. The dashboard is tempered tough: flexible, rapidly iterable, tolerant of experimentation. The API layer is tempered for resilience: retry policies, graceful degradation, circuit breakers. Different properties for different purposes, all part of the same alloy.
The Ravens: Thought and Memory at Every Gate
Between every stage of the Foundry, a Raven flies.
In Norse mythology, Odin keeps two ravens: Huginn (Thought) and Muninn (Memory). Each day they fly across the world, observing everything, and return to whisper what they have seen. Odin fears losing Thought, but he fears losing Memory more. Thought can be reconstructed. Memory, once lost, is gone.
In alchemical tradition, the raven symbolizes nigredo: decomposition, dissolution, the necessary destruction of existing structures so that new, purer forms can emerge.
My Ravens are messages that pause the pipeline at every stage transition and wait for human judgment. They carry structured intelligence: what was produced, what decisions were made, what uncertainties remain, and a link to the artifact. They offer three actions: Approve (advance), Revise (restart the current stage with feedback), or Discuss (open a thread for clarification). If no human responds within thirty minutes, the Raven defaults to approval, because a stalled pipeline is a cooling workpiece, and cold metal cracks.
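The approve-by-default behavior can be sketched in a few lines. This assumes human responses arrive on an in-process queue, which is a stand-in for whatever messaging channel actually carries the Raven; the `Action` enum and `await_raven` function are hypothetical names.

```python
import queue
from enum import Enum


class Action(Enum):
    APPROVE = "approve"   # advance to the next stage
    REVISE = "revise"     # restart the current stage with feedback
    DISCUSS = "discuss"   # open a thread for clarification


def await_raven(responses: queue.Queue, timeout_s: float = 30 * 60) -> Action:
    """Wait for a human response; fall back to APPROVE once the timeout lapses."""
    try:
        return responses.get(timeout=timeout_s)
    except queue.Empty:
        return Action.APPROVE


# Example with a short timeout so the demo finishes quickly.
inbox = queue.Queue()
inbox.put(Action.REVISE)
print(await_raven(inbox, timeout_s=0.1))  # a human answered in time
print(await_raven(inbox, timeout_s=0.1))  # nobody answered, so the default applies
```

The timeout is a parameter rather than a constant so that a riskier transition could, in principle, be given a longer window or no default at all.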
The Ravens carry both thought and memory simultaneously: the analytical summary of the current stage and the accumulated context of the pipeline run. But they also serve a deeper function. Each Raven gate is a nigredo: a controlled dissolution of the previous stage’s certainty. The Miner is certain its note captures the intent. The Smelter is certain its spec is architecturally sound. The Smith is certain its decomposition is complete. The Raven dissolves that certainty by exposing the work to a perspective that is not bound by the stage’s own assumptions.
Quality Gates: The Spark Test Before the Raven Flies
Before each Raven flies to the human, the work passes through an internal quality gate: an agent-to-agent validation that catches structural defects before they consume human attention.
Traditional smiths had their own quality tests. The spark test: reading the carbon content from the color and branching pattern of sparks on a grinding wheel. The bend test: observing whether the workpiece yields or snaps. The ring test: listening for a clear, sustained ring versus a dull thud. These tests are empirical, fast, and deterministic. They filter out defective work before it reaches the stage where expensive, judgment-intensive evaluation occurs.
My quality gates work the same way. Each stage has a checklist of deterministic checks: does the note have structure? Does the spec have architectural decisions? Do the issues have acceptance criteria? They verify that the material of each artifact is sound before the human is asked to evaluate its form.
On failure, the agent gets one retry with specific feedback. If it fails again, the Raven flies anyway with the gate failures highlighted. This is the equivalent of a smith presenting a piece to the master and saying:
“The ring test was dull. I’ve worked it twice but can’t clear the defect. Your judgment is needed.”
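The retry-once-then-escalate rule might look like the following sketch. It assumes each stage exposes a `produce` function that accepts the previous gate feedback and a `gate` function that returns a list of failures; both signatures are assumptions made for illustration.

```python
from typing import Callable


def run_stage(
    produce: Callable[[list[str]], str],
    gate: Callable[[str], list[str]],
    max_retries: int = 1,
) -> tuple[str, list[str]]:
    """Produce an artifact, retrying with gate feedback at most `max_retries` times.

    Returns the last artifact and any failures that remain. Remaining failures
    are not an error: they travel with the Raven so the human sees them.
    """
    feedback: list[str] = []
    for _ in range(max_retries + 1):
        artifact = produce(feedback)
        feedback = gate(artifact)
        if not feedback:
            break
    return artifact, feedback


def criteria_gate(text: str) -> list[str]:
    # Toy deterministic check: the issue text must mention acceptance criteria.
    return [] if "Acceptance criteria" in text else ["missing acceptance criteria"]


def responsive_agent(feedback: list[str]) -> str:
    # Stand-in for an agent that fixes its output when told what failed.
    return "Description\nAcceptance criteria: returns 200" if feedback else "Description"


def stubborn_agent(feedback: list[str]) -> str:
    # Stand-in for an agent that cannot clear the defect.
    return "Description only"


print(run_stage(responsive_agent, criteria_gate))  # passes on the retry
print(run_stage(stubborn_agent, criteria_gate))    # escalates with failures attached
```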
Economy of Heats
In blacksmithing, a “heat” is a unit of work: one cycle of heating, working, and cooling. Each heating cycle degrades the material. Scale forms. Carbon burns away. Crystal grains grow larger. A skilled smith completes the work in as few heats as possible, because each unnecessary cycle weakens the metal.
The same physics applies to software. Each revision, each rewrite, each “let’s try a different approach” compounds costs. Context is lost between iterations. Design decisions made for good reasons are forgotten and unmade. The codebase accumulates scale: dead code, orphaned configuration, comments that reference deleted features.
This is why my pipeline is designed for minimum viable heats. Each stage produces its artifact in a single pass whenever possible. The quality gates catch defects early. The Raven’s “Revise” action restarts the current stage from scratch rather than attempting incremental patching, because a fresh heat on clean metal produces better results than trying to salvage a cooling workpiece.
And there is a temperature beyond which the metal is permanently destroyed. In blacksmithing, it’s called burning. No subsequent treatment can repair burned steel. In software, the equivalent is the irreversible architectural decision: the choice of data model, the synchronization strategy, the runtime that permeates every module. My Forger sends a specific Raven type, the Architecture Decision Raven, precisely at these moments, because this is where the risk of burning is highest.
Extensibility Through Tooling
The London pattern anvil has a hardy hole: a square hole through the body into which specialized tools can be inserted. A hardy turns the anvil into a cutting station. A swage turns it into a shaping die. The anvil doesn’t change; it accepts new capabilities through a standard interface.
My harness works the same way. The stages and gates remain constant, but the tools at each stage are extensible through MCP servers (Model Context Protocol). When I need a new capability, I don’t redesign the pipeline. I forge a new tool and drop it into the hardy hole.
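The hardy-hole idea, a fixed stage that accepts new tools through one standard interface, can be illustrated without any particular protocol. The sketch below deliberately does not use the MCP SDK; `Tool`, `Stage`, and `WordCount` are hypothetical names chosen only to show the shape of the pattern.

```python
from typing import Protocol


class Tool(Protocol):
    """The 'square hole': anything with a name and a run method fits."""

    name: str

    def run(self, payload: str) -> str: ...


class Stage:
    """A pipeline stage whose behavior is fixed but whose tool set is open."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tools: dict[str, Tool] = {}

    def mount(self, tool: Tool) -> None:
        # Adding a capability never touches the stage's own logic.
        self._tools[tool.name] = tool

    def use(self, tool_name: str, payload: str) -> str:
        return self._tools[tool_name].run(payload)


class WordCount:
    # A trivial example tool; it satisfies Tool structurally, without inheritance.
    name = "word_count"

    def run(self, payload: str) -> str:
        return str(len(payload.split()))


anvil = Stage("anvil")
anvil.mount(WordCount())
print(anvil.use("word_count", "forge a new tool"))
```

In a real harness the tool would live behind a server process rather than in the same module, but the property being illustrated is the same: the stage depends only on the interface, never on the tool.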
Three Autonomous Loops: The Living Forge
A traditional forge is not autonomous. The smith must be present for every heat, every blow, every decision. My Foundry departs from the metaphor here, not to abandon it, but to extend it.
Imagine a forge that monitors its own output. After the blade is tempered and delivered, three processes continue:
The Inward Eye (Autonomous Product Engineer): Monitors production for defects, traces problems to their source, and opens pull requests. The forge inspecting its own work.
The Outward Eye (Ecosystem Intelligence): Scans the broader world (new model releases, library updates, research papers) and maps discoveries to the codebase. The forge watching the trade.
The User’s Eye (Persona-Driven Testing): Synthetic users interact with the deployed product from distinct perspectives. Testing the blade by cutting with it: not the controlled stress of the test suite, but the unpredictable stress of actual use.
These three loops transform the Foundry from a pipeline into a living system. The blade doesn’t just ship. It reports back. And the reports feed into the next cycle’s ore.
The Convergence
In February 2026, OpenAI published “Harness engineering: leveraging Codex in an agent-first world.” A three-person team shipped a million-line internal product with zero hand-written code, using what they called a “harness”: the system of prompts, tools, guardrails, and feedback loops that surrounds and directs the AI agents doing the actual coding.
The convergence is striking. Their “golden principles” correspond to my quality gates. Their “promote rule to code” escalation maps precisely to my three-tier escalation path. Their “linter-as-agent-prompt” principle is something I independently implemented.
But the metaphor reveals the divergence. Their system is reactive: agents write code, reviewers find problems, rules get added. It is a forge without a furnace. The smelting step is implicit, embedded in prompts rather than formalized as a distinct stage with its own quality gate.
My system traces the full arc. A defect in production triggers the Inward Eye, which may trace the issue to a specification ambiguity, which feeds back to the Smelter, which updates the exemplar, which changes how the next specification is written. The correction propagates backward through the pipeline, not just forward through constraint accumulation.
This is the difference between a forge that adds rules and a forge that refines its ore.
The Smith Chooses Who to Forge For
Throughout history, the smith has occupied a peculiar position: respected, needed, and pressured. Every ruler needs weapons. The smith’s capability is strategic, and power has always sought to capture it. The Norse legend of Wayland the Smith tells of a craftsman imprisoned by a king and forced to forge treasures for the crown. The lesson endures: power does not ask the smith for permission.
The forge I have built is powerful, and capability is morally neutral until the smith directs it. The same fire that forges plowshares forges chains. The same Foundry that builds tools for liberation can build tools for surveillance.
I choose who I forge for. I will build for people who are solving real problems, creating genuine value, expanding what is possible for those who lack the resources to build for themselves. I will resist any offer that asks me to trade principles for a contract.
The competitive advantage is not the pipeline definition (which is, after all, a collection of markdown files and configuration) but the accumulated intelligence within the pipeline: the specifications that serve as exemplars, the test suites that encode quality expectations, the autonomous loops that compound learning across every product shipped. This intelligence is mine to direct. And I choose to direct it toward the light.
Every smith’s first piece is a test of the forge itself. Not a commission, but a proof that the fire burns true and the anvil holds firm…
The first piece forged.
Matt Dionis, February 2026
The Foundry is not a product. It is a thesis: that software, like metal, must be transformed through a sequence of stages, each one removing impurities, adding structure, and testing for defects, before it is ready to bear load. The autonomous engineering harness is the realization of that thesis. The Ravens are its quality conscience. And the alloy, the shipped, tested, monitored, self-correcting product, is the proof.
If you’re building your own forge, or tearing one down to start over, I’d like to hear about it. This is the first in a series on AI-backed autonomous engineering. Subscribe to follow along.


