Every engineering team has an architecture diagram. Almost every engineering team has an architecture diagram that is wrong.
Not dramatically wrong — not missing entire subsystems or showing services that were decommissioned years ago. Just quietly, incrementally wrong. A dependency that was added three sprints ago and never documented. A service that was split in two but the diagram still shows one box. An external integration that moved from vendor A to vendor B. Small things that compound into a diagram that can mislead an engineer who trusts it.
This is one of the most universal problems in software engineering, and it is almost entirely caused by a structural mistake in how teams maintain documentation. Understanding the root cause makes the fix obvious.
Why Documentation Rots
Architecture diagrams go stale for one reason: they live outside the codebase. The diagram is in Confluence. The code is in Git. These are two separate systems with two separate workflows and two separate review processes. When an engineer adds a dependency between two services, they open a pull request for the code change. Nobody opens a pull request to update the diagram. The diagram update is a manual, optional, easy-to-skip step that happens in a completely different tool.
The more disconnected the documentation is from the code, the faster it drifts. Diagrams maintained in draw.io or Lucidchart degrade the fastest because the friction to update them is highest — you have to open a separate tool, find the right diagram, make the change, export, re-upload. Diagrams in Confluence are slightly better because at least they are in a shared location, but the update workflow is still entirely manual and entirely separated from the code review process.
Teams that recognize this problem often respond by adding “update the architecture diagram” to their definition of done. This slows the drift but does not stop it, because the step is easily forgotten under pressure and there is no automated check that enforces it.
Why It Matters More Than Teams Think
Teams that have lived with stale documentation long enough start to distrust it reflexively. They stop consulting the diagram and go straight to the source code. At that point the diagram has negative value: it adds no information and may actively mislead engineers who do not know to distrust it — typically new joiners.
Incident Response
A service is degraded. The on-call engineer needs to know what else might be affected. The architecture diagram shows three downstream consumers. There are actually five. The engineer focuses the response on the documented three and misses the other two, which are also silently failing. The incident takes longer to resolve and the full impact is not understood until after the fact.
New Engineer Onboarding
A new engineer spends their first week building a mental model of the system. They read the architecture documentation, internalize a picture of how the services relate, and start working from that picture. The picture is wrong in three places they do not know about. Two months later they make a change that causes an incident in a service that the diagram said was unrelated to what they were modifying.
Deprecation and Migration Planning
A team is deprecating a service. They look at the diagram to identify consumers that need to migrate. The diagram shows four consumers. There are actually seven. The deprecation is announced, four teams are notified, and three teams discover the problem only when their service starts failing after the deprecated service is shut down.
The Root Cause, Precisely
The problem is not that engineers are careless. The problem is that the architecture diagram is not part of the change. When you add a service dependency, the code change and the documentation change are two separate, disconnected artifacts. The code change gets reviewed and merged. The documentation change is optional and happens later, if at all.
The fix follows directly from the diagnosis: make the architecture documentation part of the code change. When adding a service dependency requires both a code change and a corresponding documentation change in the same pull request, the documentation stays current automatically, reviewed by the same people who review the code.
The YAML-as-Documentation Approach
The most practical implementation of this principle is to define your service architecture in YAML files that live in a Git repository. Each service gets a file. Dependencies are declared explicitly. The files are versioned, reviewed, and merged through the normal pull request workflow.
A minimal service definition looks like this:
id: checkout-service
name: Checkout Service
area: Commerce
kind: api
status: active
tech: [Node.js, PostgreSQL]
owner: commerce-team
depends_on:
- target: cart-service
- target: payment-service
- target: inventory-service
- target: notification-service
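This is not a new idea. Backstage, the open-source developer portal created at Spotify, takes the same approach for its software catalog: each component is described in a YAML descriptor file that is kept in version control, typically next to the code it describes. The field names differ from the example above, but the principle is the same. The insight that service definitions belong in version-controlled files has been proven at scale.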
What makes the approach work in practice is the review process. When an engineer adds target: fraud-detection to the depends_on list of their service, that change shows up in the pull request diff. The reviewer sees it. They can ask questions about it. The change is not buried in application code — it is explicit, visible, and intentional.
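As a rough illustration of what the reviewer is looking at, the sketch below compares two versions of a service definition and reports which dependencies were added or removed. It assumes the YAML has already been parsed into plain dictionaries (for example with PyYAML's `yaml.safe_load`); the function names and sample data are hypothetical.

```python
def dependency_targets(service):
    """Collect the set of target IDs declared in a service definition."""
    return {dep["target"] for dep in service.get("depends_on", [])}


def dependency_changes(before, after):
    """Return (added, removed) dependency IDs between two versions of one definition."""
    old, new = dependency_targets(before), dependency_targets(after)
    return sorted(new - old), sorted(old - new)


# Hypothetical before/after versions of the checkout-service definition.
before = {
    "id": "checkout-service",
    "depends_on": [{"target": "cart-service"}, {"target": "payment-service"}],
}
after = {
    "id": "checkout-service",
    "depends_on": [
        {"target": "cart-service"},
        {"target": "payment-service"},
        {"target": "fraud-detection"},
    ],
}

added, removed = dependency_changes(before, after)
print("added:", added)
print("removed:", removed)
```

A check like this could also post a short summary comment on the pull request, so that a new dependency is hard to miss even in a large diff.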
Rendering the YAML as a Live Graph
YAML files in a repository are already a significant improvement over a diagram in Confluence. They are versioned and reviewable, and every change to them passes through the same review as the code. But they are not easy to explore. Reading raw YAML to understand a complex dependency graph is not practical.
The YAML needs to be rendered into a visual representation. The requirements for this rendering layer are:
- Automatic layout — manually positioning 30 nodes is not a workflow that will be maintained
- Filtering — the full graph of a large system is not readable; you need to narrow to a relevant subset
- Hover tracing — following dependency chains by eye through a dense graph is slow and error-prone
- Shareable state — when you find a useful view, you need to share it without screenshots
- Zero backend — adding a server to maintain undermines the simplicity advantage
A static site generator that reads the YAML at build time and produces a self-contained HTML/JS application satisfies all of these. The build runs in CI on every merge. The output is deployed to a static host. The map is always current, always accessible, and costs nothing to run.
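A minimal sketch of such a build step might look like the following. It is not how any particular tool works: the node and edge field names, the page template, and the `build_graph` and `render_page` helpers are all assumptions made for illustration, and the service definitions are shown as already-parsed dictionaries. The interactive rendering itself (layout, filtering, hover tracing) would be done by client-side JavaScript reading the embedded data, which is omitted here.

```python
import json

# Hypothetical parsed service definitions.
SERVICES = [
    {"id": "checkout-service", "name": "Checkout Service", "owner": "commerce-team",
     "status": "active", "depends_on": [{"target": "cart-service"}]},
    {"id": "cart-service", "name": "Cart Service", "owner": "commerce-team",
     "status": "active", "depends_on": []},
]

PAGE_TEMPLATE = """<!doctype html>
<html>
<head><meta charset="utf-8"><title>Service map</title></head>
<body>
<div id="graph"></div>
<script id="graph-data" type="application/json">{payload}</script>
</body>
</html>
"""


def build_graph(services):
    """Flatten service definitions into the node and edge lists a renderer would consume."""
    nodes = [
        {"id": s["id"], "label": s.get("name", s["id"]),
         "owner": s.get("owner"), "status": s.get("status", "active")}
        for s in services
    ]
    edges = [
        {"source": s["id"], "target": dep["target"]}
        for s in services
        for dep in s.get("depends_on", [])
    ]
    return {"nodes": nodes, "edges": edges}


def render_page(graph):
    """Embed the graph as JSON in a single HTML page, so no backend is needed at runtime."""
    # Escape "</" so a string containing "</script>" cannot close the tag early.
    payload = json.dumps(graph, sort_keys=True).replace("</", "<\\/")
    return PAGE_TEMPLATE.format(payload=payload)


page = render_page(build_graph(SERVICES))
print(len(page) > 0)
```

In CI, the resulting string would be written to a file and published to the static host on every merge.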
Structuring the YAML for Maintainability
Use stable IDs, not display names, for dependency targets
If the depends_on list references service names as strings, renaming a service breaks all references to it silently. Use a kebab-case ID (user-service, payment-service) as the primary identifier and keep the human-readable name separate. IDs are stable across renames; display names are not.
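One way to enforce this convention is a small check that every ID and every dependency target is kebab-case, which catches the common mistake of pasting a display name into depends_on. The regular expression and the `invalid_ids` helper below are illustrative assumptions, not part of any standard.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, e.g. "user-service".
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def invalid_ids(services):
    """Return every service ID or dependency target that is not a kebab-case ID."""
    bad = []
    for service in services:
        candidates = [service["id"]]
        candidates += [dep["target"] for dep in service.get("depends_on", [])]
        bad.extend(c for c in candidates if not KEBAB_CASE.match(c))
    return bad


# Hypothetical catalog: user-service was renamed for humans, but its ID is unchanged,
# so the reference from checkout-service still resolves. The second target is a
# display name used by mistake.
services = [
    {"id": "user-service", "name": "Accounts Service", "depends_on": []},
    {"id": "checkout-service", "name": "Checkout Service",
     "depends_on": [{"target": "user-service"}, {"target": "Payment Service"}]},
]

print(invalid_ids(services))
```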
Include a status field from the start
Even if every service is currently active, adding a status field (active, deprecated, planned) from day one means you can track migrations and deprecations in the same system without inventing a new convention later.
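With a status field in place, questions about a migration become simple queries. As a sketch, the hypothetical helper below lists every active service that still depends on a deprecated one; the service names are invented for the example, and a missing status is treated as active by assumption.

```python
def deprecated_dependencies(services):
    """Return (consumer, deprecated_target) pairs for active services that still
    depend on a service marked as deprecated."""
    status_by_id = {s["id"]: s.get("status", "active") for s in services}
    findings = []
    for service in services:
        if status_by_id[service["id"]] != "active":
            continue
        for dep in service.get("depends_on", []):
            if status_by_id.get(dep["target"]) == "deprecated":
                findings.append((service["id"], dep["target"]))
    return sorted(findings)


# Hypothetical catalog in the middle of a billing migration.
services = [
    {"id": "legacy-billing", "status": "deprecated", "depends_on": []},
    {"id": "billing-v2", "status": "planned", "depends_on": []},
    {"id": "checkout-service", "status": "active",
     "depends_on": [{"target": "legacy-billing"}]},
    {"id": "invoice-service", "status": "active",
     "depends_on": [{"target": "legacy-billing"}, {"target": "billing-v2"}]},
]

for consumer, target in deprecated_dependencies(services):
    print(f"{consumer} still depends on deprecated {target}")
```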
Model external dependencies explicitly
Third-party services (Stripe, Twilio, SendGrid, Auth0) are real dependencies that affect system behavior during outages. Include them in the graph as external nodes. This makes it immediately visible which services are affected when a third-party provider has an incident, and it surfaces the true scope of vendor relationships to management and architects.
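Once external providers are nodes in the graph, "who is affected if this vendor goes down" is a traversal over reversed dependency edges. The sketch below assumes external nodes are ordinary entries marked with a hypothetical `kind: external` value; the `affected_by` helper and the sample catalog are illustrative.

```python
from collections import defaultdict, deque


def affected_by(services, failing_id):
    """Return every service that depends on failing_id, directly or transitively."""
    # Reverse index: target ID -> set of services that declare a dependency on it.
    consumers = defaultdict(set)
    for service in services:
        for dep in service.get("depends_on", []):
            consumers[dep["target"]].add(service["id"])

    # Breadth-first walk up the consumer chain.
    seen = set()
    queue = deque([failing_id])
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    seen.discard(failing_id)  # only relevant if the graph contains a cycle
    return sorted(seen)


# Hypothetical catalog with two external providers.
services = [
    {"id": "stripe", "kind": "external", "depends_on": []},
    {"id": "twilio", "kind": "external", "depends_on": []},
    {"id": "payment-service", "kind": "api", "depends_on": [{"target": "stripe"}]},
    {"id": "checkout-service", "kind": "api", "depends_on": [{"target": "payment-service"}]},
    {"id": "notification-service", "kind": "api", "depends_on": [{"target": "twilio"}]},
]

print(affected_by(services, "stripe"))
```

The same traversal answers the blast-radius question for an internal service during an incident.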
Assign ownership at the service level
Add an owner field and populate it consistently. This makes the graph filterable by team — the most common real-world query. It also makes ownership disputes resolvable by looking at the file rather than asking around.
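As a small sketch of what consistent ownership makes possible, the hypothetical helpers below group services by team and flag definitions where the owner is missing. The team and service names are made up for the example.

```python
from collections import defaultdict


def services_by_owner(services):
    """Group service IDs by their owner field, skipping definitions without one."""
    grouped = defaultdict(list)
    for service in services:
        if service.get("owner"):
            grouped[service["owner"]].append(service["id"])
    return {owner: sorted(ids) for owner, ids in grouped.items()}


def missing_owner(services):
    """Return the IDs of services whose owner field is absent or empty."""
    return sorted(s["id"] for s in services if not s.get("owner"))


# Hypothetical catalog.
services = [
    {"id": "checkout-service", "owner": "commerce-team"},
    {"id": "cart-service", "owner": "commerce-team"},
    {"id": "payment-service", "owner": "payments-team"},
    {"id": "report-exporter"},
]

print(services_by_owner(services))
print(missing_owner(services))
```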
Integrating This Into Your Pull Request Workflow
Colocate the YAML with the service code
If the service catalog YAML lives in a dedicated repository, updating it requires a separate pull request. Engineers will skip it. Put the YAML file for each service in the root of that service’s repository. When an engineer adds a dependency to the code, the YAML update is a one-line change in the same diff. Reviewers will notice if it is missing.
Add a linting step for the YAML schema
Validate the YAML against the schema in CI. If a service YAML references a depends_on target that does not exist in the catalog, the build should fail. This catches typos in dependency IDs and dependencies on services that have been removed without the consumers being updated.
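The reference check can be quite small. The sketch below assumes CI has already collected and parsed every service definition into a list of dictionaries; `find_dangling_targets`, `lint`, and the sample catalog (which contains a deliberate typo) are hypothetical.

```python
import sys


def find_dangling_targets(services):
    """Return (service_id, target) pairs where the target is not a known service ID."""
    known = {s["id"] for s in services}
    return sorted(
        (s["id"], dep["target"])
        for s in services
        for dep in s.get("depends_on", [])
        if dep["target"] not in known
    )


def lint(services):
    """Print one error per dangling reference and return a process exit code."""
    errors = find_dangling_targets(services)
    for service_id, target in errors:
        print(f"{service_id}: depends_on target '{target}' is not in the catalog",
              file=sys.stderr)
    return 1 if errors else 0


# Hypothetical catalog; "payment-servce" is a typo for "payment-service".
CATALOG = [
    {"id": "checkout-service",
     "depends_on": [{"target": "cart-service"}, {"target": "payment-servce"}]},
    {"id": "cart-service", "depends_on": []},
    {"id": "payment-service", "depends_on": []},
]

# A CI entry point would pass this value to sys.exit() so the build fails on errors.
exit_code = lint(CATALOG)
print("exit code:", exit_code)
```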
What Good Architecture Documentation Actually Enables
Faster and safer refactoring. Before moving or splitting a service, engineers can see the full downstream impact. Changes that look local in the codebase often have non-obvious ripple effects that are visible in the dependency graph before the change is made.
Honest incident scope assessment. During an incident, the dependency graph is the fastest way to understand blast radius. Teams with a reliable graph answer the “what else is affected” question in seconds. Teams without it spend the first 20 minutes of an incident reconstructing the dependency structure from memory.
Credible technical communication to non-engineers. When a manager asks why a change that looks simple is taking two weeks, an engineer can open the dependency graph and show the chain of services that need to be updated. That is a concrete explanation rather than a vague appeal to complexity.
Self-serve onboarding. New engineers can answer their own structural questions by exploring the graph rather than interrupting senior engineers. Teams that have deployed this consistently report significantly shorter time-to-first-contribution for new joiners.
Getting Started
The minimum viable implementation requires three things:
- A YAML schema: define the fields your service definitions will include. Start minimal: id, name, owner, status, depends_on. Add fields as you identify the need for them.
- A catalog location: a place in Git where the YAML files live. A dedicated repository or a directory in an existing monorepo both work, but colocating each file with its service code, as recommended above, keeps the update in the same pull request as the code change.
- A visualization layer: something that reads the YAML and renders it as an interactive graph with filtering, hover tracing, and shareable URLs.
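For the first item, the schema can start as nothing more than a list of required fields checked in CI. The sketch below is one possible starting point, using the minimal field set named above; the `missing_fields` helper and the sample definitions are hypothetical, and a team that outgrows it might move to a formal schema language instead.

```python
# The minimal field set suggested for a first version of the schema.
REQUIRED_FIELDS = ("id", "name", "owner", "status", "depends_on")


def missing_fields(definition):
    """Return the required fields that are absent from one service definition."""
    return [field for field in REQUIRED_FIELDS if field not in definition]


# Hypothetical parsed definitions: one complete, one incomplete.
complete = {
    "id": "cart-service",
    "name": "Cart Service",
    "owner": "commerce-team",
    "status": "active",
    "depends_on": [],
}
incomplete = {"id": "report-exporter", "name": "Report Exporter", "depends_on": []}

print(missing_fields(complete))
print(missing_fields(incomplete))
```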
The visualization layer is where most teams stall. Building a useful graph renderer from scratch is a non-trivial engineering project — the kind of internal tooling that gets started, gets 80% done, and never quite reaches daily use. For teams that want the practice without the build investment, Service Map provides the visualization layer as a self-hosted static application that deploys in under 15 minutes.
Whether you build it yourself or use an existing tool, the underlying practice is what matters: service definitions as code, reviewed alongside code changes, rendered into a graph the entire team can use. Once the documentation is part of the change, it stops drifting.
