SourceVision Analysis Deep Dive

How SourceVision analyzes your codebase: the pipeline, zone detection algorithm, findings system, and what you can configure.

The Analysis Pipeline

SourceVision runs a 6-phase pipeline. Each phase builds on the previous one's output.

| Phase | Name | What it does |
| --- | --- | --- |
| 1 | Inventory | Scans all files, classifies by type, language, and role |
| 2 | Imports | Builds the directed dependency graph from import/require statements |
| 3 | Classifications | Assigns architectural archetypes to files (route-handler, service, utility, etc.) |
| 4 | Zones | Detects architectural communities via Louvain algorithm, then optionally enriches with AI |
| 5 | Components | Catalogs React components with props, usage patterns, and route detection |
| 6 | Call Graph | Maps function/method definitions and call edges across files |

Run specific phases with --phase or --only:

sh
ndx analyze --phase zones .    # run only the zones phase
ndx analyze --only imports .   # run only the imports phase

Incremental Analysis

Each phase detects whether its inputs have changed since the last run. Unchanged files are cached, and phases with stable inputs can be skipped entirely. This makes re-analysis fast after small changes.

Zone Detection

Zones are the core architectural insight. SourceVision uses Louvain community detection on the import graph to discover natural clusters of tightly-related files.

How Louvain Works

  1. Graph construction -- The directed import graph is converted to an undirected weighted graph. Edge weight = number of imported symbols (minimum 1).

  2. Directory proximity -- Light edges (weight 0.2) are added between adjacent files in the same directory. This helps convention-based frameworks (like Next.js or Remix) where directory layout defines architecture but files may not import each other directly.

  3. Modularity optimization -- Louvain iteratively moves nodes between communities to maximize modularity. A resolution parameter (gamma) controls granularity: higher values produce smaller, tighter zones. Processing order is deterministic (sorted) for reproducibility.

  4. Post-processing -- Several refinement passes clean up the raw communities:

    • Bidirectional coupling merge: Pairs of zones with >40% shared edges are merged
    • Small zone absorption: Zones with fewer than 3 files are absorbed into their most-connected neighbor
    • Satellite merging: Zones with <=8 files and >30% external coupling are merged into their dominant neighbor
    • Large zone splitting: Oversized zones are subdivided using progressively higher resolution, with directory-based fallback
    • Zone count capping: If there are too many zones, the weakest-connected pairs are merged
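
Steps 1-3 can be illustrated with a minimal sketch. It builds the weighted undirected graph and evaluates the modularity score that Louvain tries to maximize; it does not perform the node moves or the post-processing passes. The function names, the reading of "adjacent" as neighbors in a sorted directory listing, and the choice to add the proximity weight on top of any import weight are assumptions made for illustration, not SourceVision's actual implementation.

```python
from collections import defaultdict
from pathlib import PurePosixPath

PROXIMITY_WEIGHT = 0.2  # light same-directory edge weight from step 2


def build_graph(imports, files):
    """imports: (importer, imported, symbol_count) tuples; files: all file paths."""
    weights = defaultdict(float)
    for src, dst, symbols in imports:
        if src != dst:
            # Direction is dropped: A->B and B->A accumulate on one edge.
            weights[tuple(sorted((src, dst)))] += max(1, symbols)
    by_dir = defaultdict(list)
    for path in files:
        by_dir[str(PurePosixPath(path).parent)].append(path)
    for siblings in by_dir.values():
        siblings.sort()  # assumption: "adjacent" means neighbors in sorted order
        for a, b in zip(siblings, siblings[1:]):
            weights[(a, b)] += PROXIMITY_WEIGHT
    return dict(weights)


def modularity(weights, community_of, gamma=1.0):
    """Modularity of a partition; gamma is the resolution parameter."""
    m = sum(weights.values())
    if m == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight fully inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for (a, b), w in weights.items():
        ca, cb = community_of[a], community_of[b]
        degree[ca] += w
        degree[cb] += w
        if ca == cb:
            internal[ca] += w
    return sum(internal[c] / m - gamma * (degree[c] / (2 * m)) ** 2 for c in degree)
```

Raising gamma increases the penalty on large communities, which is why higher values push the optimizer toward smaller, tighter zones.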

Zone Metrics

Each zone gets two metrics computed from the import graph:

  • Cohesion (0-1): Ratio of internal edges to all edges that touch the zone's files. Higher is better -- it means files in the zone import each other more than they import outside files.

  • Coupling (0-1): Ratio of external edges (those crossing the zone boundary) to all edges that touch the zone's files. Lower is better -- it means the zone is relatively independent.

Single-file zones have trivially perfect cohesion (1.0). Zones with fewer than 5 files have unreliable metrics and are treated as informational only.
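
As a rough sketch of these definitions, the function below counts edges per zone. It assumes unweighted edges, treats "total" as every edge touching at least one file in the zone, and reports coupling for single-file zones as computed; the function name and return shape are hypothetical and SourceVision's actual calculation may differ.

```python
def zone_metrics(edges, zone_of, zone):
    """edges: (file_a, file_b) pairs; zone_of: mapping of file path -> zone id."""
    files = [f for f, z in zone_of.items() if z == zone]
    internal = external = 0
    for a, b in edges:
        a_in = zone_of.get(a) == zone
        b_in = zone_of.get(b) == zone
        if a_in and b_in:
            internal += 1      # both endpoints inside the zone
        elif a_in or b_in:
            external += 1      # edge crosses the zone boundary
    total = internal + external
    coupling = external / total if total else 0.0
    # Single-file zones are given perfect cohesion, as noted above.
    cohesion = 1.0 if len(files) <= 1 or total == 0 else internal / total
    return {
        "cohesion": cohesion,
        "coupling": coupling,
        "reliable": len(files) >= 5,  # smaller zones are informational only
    }
```

For a three-file zone with three internal edges and one edge leaving the zone, this yields cohesion 0.75 and coupling 0.25, flagged as not reliable because the zone has fewer than 5 files.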

Risk Assessment

Zones are classified by their metric health:

| Risk Level | Condition |
| --- | --- |
| healthy | Both metrics within thresholds |
| at-risk | One metric outside thresholds |
| critical | Both metrics outside thresholds |
| catastrophic | Cohesion < 0.3 AND coupling > 0.7 |

Default thresholds: cohesion floor 0.4, coupling ceiling 0.6. Zone types override these defaults (see Configuration below).
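
The classification can be sketched as a small function. Two points are assumptions rather than documented behavior: a value exactly at a threshold counts as within it, and catastrophic is treated as a more severe form of critical (both thresholds breached and the fixed 0.3 / 0.7 cutoffs crossed). The function name is hypothetical.

```python
def classify_risk(cohesion, coupling, cohesion_floor=0.4, coupling_ceiling=0.6):
    """Map a zone's two metrics to a risk level using the given thresholds."""
    low_cohesion = cohesion < cohesion_floor
    high_coupling = coupling > coupling_ceiling
    if low_cohesion and high_coupling:
        # Assumption: catastrophic is only considered once both thresholds are breached.
        if cohesion < 0.3 and coupling > 0.7:
            return "catastrophic"
        return "critical"
    if low_cohesion or high_coupling:
        return "at-risk"
    return "healthy"
```

With the defaults, a zone at cohesion 0.36 and coupling 0.5 is at-risk, while one at cohesion 0.2 and coupling 0.8 is catastrophic.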

Findings

SourceVision produces findings from two sources: deterministic analysis (always runs) and AI enrichment (optional).

Finding Types

| Type | Description | Actionable? |
| --- | --- | --- |
| anti-pattern | Architectural violations, circular deps, high coupling | Yes |
| suggestion | Improvement recommendations, naming issues | Yes |
| move-file | Concrete file relocation proposals | Yes |
| observation | Metric descriptions ("Cohesion is 0.36") | No |
| pattern | Detected architectural patterns | No |
| relationship | Cross-zone dependency descriptions | No |

Use ndx recommend --actionable-only to show only the three actionable types (anti-pattern, suggestion, move-file).

Algorithmic Findings (Always Run)

These are deterministic and reproducible -- no AI involved:

  • Risk scoring: Zones with low cohesion or high coupling are flagged
  • God functions: Functions with unusually high outgoing call count
  • Tightly coupled modules: Files with excessive cross-zone call edges
  • Unused exports: Exported symbols with no incoming calls
  • Hub functions: Functions called from many locations
  • Fan-in hotspots: Files that receive calls from many other files
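
Two of these checks, god functions and hub functions, reduce to counting distinct neighbors in the call graph. The sketch below shows the idea; the function name and the cutoff values are hypothetical, since the thresholds SourceVision uses for "unusually high" are not stated here.

```python
from collections import defaultdict


def call_degree_outliers(call_edges, min_fan_out=10, min_fan_in=10):
    """call_edges: (caller, callee) pairs. Returns (god_functions, hub_functions)."""
    callees = defaultdict(set)  # caller -> distinct functions it calls
    callers = defaultdict(set)  # callee -> distinct functions that call it
    for caller, callee in call_edges:
        callees[caller].add(callee)
        callers[callee].add(caller)
    # Sorting keeps the output order stable across runs.
    god = sorted(f for f, out in callees.items() if len(out) >= min_fan_out)
    hubs = sorted(f for f, inc in callers.items() if len(inc) >= min_fan_in)
    return god, hubs
```

The same counting applied to exported symbols with an empty caller set would surface unused exports.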

AI Enrichment (Optional)

When AI enrichment runs, it makes multiple passes over the zone data:

| Pass | Focus | Finding Types |
| --- | --- | --- |
| 1 | Zone naming, descriptions, initial observations | observation |
| 2 | Cross-zone relationships, clean boundaries, leaky abstractions | pattern, relationship |
| 3 | Anti-pattern detection, tight coupling, missing abstractions | anti-pattern |
| 4 | Naming inconsistencies, risk areas, refactoring opportunities | suggestion |

Control enrichment with CLI flags:

sh
ndx analyze .              # default: 1-2 enrichment passes
ndx analyze --fast .       # skip AI enrichment entirely (algorithmic only)
ndx analyze --full .       # run all 4 enrichment passes
ndx analyze --per-zone .   # per-zone enrichment (smaller context per call)

The --per-zone mode sends each zone to the LLM individually (at most 3 concurrent calls) instead of batching 5 zones per call. It suits large codebases and budget-constrained runs because each call carries a smaller context.
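
The difference between the two modes can be sketched as follows. Here enrich_one stands in for whatever coroutine performs the LLM call; it and both function names are hypothetical, and only the batch size of 5 and the concurrency limit of 3 come from the description above.

```python
import asyncio


def batches(zone_ids, size=5):
    """Default mode: group zones into batches of `size` for one call each."""
    return [zone_ids[i:i + size] for i in range(0, len(zone_ids), size)]


async def enrich_per_zone(zone_ids, enrich_one, max_concurrent=3):
    """Per-zone mode: one call per zone, with at most `max_concurrent` in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def run(zone_id):
        async with gate:  # waits here while three calls are already running
            return await enrich_one(zone_id)

    # gather returns results in the same order as zone_ids.
    return await asyncio.gather(*(run(z) for z in zone_ids))
```

With 12 zones, the batched mode makes 3 calls of 5, 5, and 2 zones, while the per-zone mode makes 12 smaller calls, three at a time.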

Incremental Enrichment

SourceVision computes content hashes per zone. If a zone's files haven't changed since the last enrichment, the LLM call is skipped and previous names/descriptions are preserved. This makes re-enrichment cheap after small changes.
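
A minimal sketch of per-zone content hashing, assuming SHA-256 and a hash that covers both file paths and file contents. The actual hash function and inputs are not specified here, and both function names are hypothetical.

```python
import hashlib


def zone_hash(file_contents):
    """file_contents: mapping of path -> bytes for every file in the zone."""
    digest = hashlib.sha256()
    for path in sorted(file_contents):  # sorted so dict order cannot change the hash
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")            # separator between path and content hash
        digest.update(hashlib.sha256(file_contents[path]).digest())
    return digest.hexdigest()


def zones_needing_enrichment(current_hashes, previous_hashes):
    """Zones whose hash is new or differs from the last enrichment run."""
    return sorted(
        zone for zone, h in current_hashes.items()
        if previous_hashes.get(zone) != h
    )
```

Zones absent from the returned list keep their previous names and descriptions without another LLM call.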

Configuration

Zone Types

Annotate zones with an architectural role to apply type-specific risk thresholds:

sh
ndx config sourcevision.zones.types.my-zone domain .

Or in .n-dx.json:

json
{
  "sourcevision": {
    "zones": {
      "types": {
        "api-routes": "integration",
        "auth-core": "domain",
        "test-helpers": "test"
      }
    }
  }
}

Available types and their thresholds:

| Type | Cohesion Floor | Coupling Ceiling | Rationale |
| --- | --- | --- | --- |
| domain | 0.4 | 0.6 | Strict -- core business logic should be well-encapsulated |
| integration | 0.2 | 0.8 | Relaxed -- integration code naturally touches many things |
| test | 0.1 | 0.9 | Permissive -- test files import widely by design |
| infrastructure | 0.0 | 1.0 | No expectations -- config, build tooling, etc. |
| gateway | 0.1 | 0.9 | High coupling expected -- gateways bridge packages |
| orchestration | 0.2 | 0.8 | Wiring code -- moderate coupling expected |
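
To show how a type annotation changes the outcome for identical metrics, here is a small lookup sketch. The threshold values are copied from the table; the function name, and the fallback to the default 0.4 / 0.6 thresholds for zones with no type or an unrecognized one, are assumptions.

```python
# (cohesion floor, coupling ceiling) per zone type, from the table above
ZONE_TYPE_THRESHOLDS = {
    "domain": (0.4, 0.6),
    "integration": (0.2, 0.8),
    "test": (0.1, 0.9),
    "infrastructure": (0.0, 1.0),
    "gateway": (0.1, 0.9),
    "orchestration": (0.2, 0.8),
}
DEFAULT_THRESHOLDS = (0.4, 0.6)


def breached_metrics(zone_id, cohesion, coupling, zone_types):
    """zone_types: the zone id -> type mapping from sourcevision.zones.types."""
    zone_type = zone_types.get(zone_id)
    floor, ceiling = ZONE_TYPE_THRESHOLDS.get(zone_type, DEFAULT_THRESHOLDS)
    breached = []
    if cohesion < floor:
        breached.append("cohesion")
    if coupling > ceiling:
        breached.append("coupling")
    return breached
```

A zone with cohesion 0.3 and coupling 0.7 breaches both thresholds when typed as domain but none when typed as integration.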

Zone Pins

Override Louvain's zone assignment for specific files:

json
{
  "sourcevision": {
    "zones": {
      "pins": {
        "src/utils/special-helper.ts": "core-domain"
      }
    }
  }
}

Pins take precedence over algorithmic detection. Useful when Louvain misclassifies a file due to sparse import edges.
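
Conceptually, pins are an overlay on the detected assignment. The sketch below assumes pins for files that are not in the analysis are ignored; that behavior and the function name are illustrative guesses.

```python
from collections import defaultdict


def apply_pins(detected, pins):
    """detected, pins: mappings of file path -> zone id. Returns zone -> sorted files."""
    assignment = dict(detected)
    for path, zone in pins.items():
        if path in detected:        # assumption: unknown paths are skipped
            assignment[path] = zone  # the pin wins over the detected zone
    zones = defaultdict(list)
    for path in sorted(assignment):
        zones[assignment[path]].append(path)
    return dict(zones)
```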

Project Hints

Create .sourcevision/hints.md with context for the AI enrichment:

md
## Architecture Notes

- The `api/` directory follows REST conventions -- each file is one resource.
- `shared/` is a catch-all utility zone; low cohesion is expected and acceptable.
- Zone names should reflect business domains, not technical layers.

This file is included in every LLM enrichment prompt, helping the AI produce more accurate zone names and insights.

Custom Archetypes

SourceVision ships with 40+ built-in file archetypes (route-handler, service, utility, hook, etc.). Add custom ones or override per-file:

json
{
  "sourcevision": {
    "archetypes": {
      "custom": [
        {
          "id": "saga",
          "signals": ["*.saga.ts", "function* "],
          "description": "Redux saga file"
        }
      ],
      "overrides": {
        "src/legacy/weird-file.ts": "utility"
      }
    }
  }
}
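
How signals are evaluated is not spelled out above. One plausible reading, sketched below purely as an assumption, is that a signal matches when it either globs the file name or appears as a substring of the file's content, that any single matching signal is enough, and that per-file overrides are checked first. The function name is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def classify(path, content, custom, overrides):
    """Return an archetype id for a file, or None when nothing matches."""
    if path in overrides:
        return overrides[path]  # explicit per-file override wins
    name = PurePosixPath(path).name
    for archetype in custom:
        for signal in archetype["signals"]:
            # Assumed semantics: file-name glob OR content substring.
            if fnmatchcase(name, signal) or signal in content:
                return archetype["id"]
    return None
```

Under this reading, the saga archetype above would match both checkout.saga.ts by name and any file whose source contains a generator function declaration.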

Risk Justifications

If a zone is flagged as risky but you've accepted the trade-off, add a justification to downgrade findings to informational:

json
{
  "sourcevision": {
    "riskJustifications": [
      {
        "zone": "shared-utils",
        "reason": "Intentional catch-all -- 3 files, metrics unreliable at this size"
      }
    ]
  }
}

Zone type annotations (above) are preferred over justifications -- they're simpler and automatically apply the right thresholds.

Output Files

All output is written to .sourcevision/:

| File | Contents |
| --- | --- |
| manifest.json | Analysis metadata, module status, token usage |
| inventory.json | File catalog (path, size, role, category, language) |
| imports.json | Directed import graph, external dependencies, circular dependency detection |
| classifications.json | File-to-archetype mappings |
| zones.json | Zone boundaries, cohesion/coupling metrics, findings, enrichment metadata |
| components.json | React component catalog with props and usage edges |
| callgraph.json | Function/method definitions and call edges |
| llms.txt | Structured Markdown summary for LLM consumption |
| CONTEXT.md | Dense XML-tagged summary optimized for Claude |
| zones/{zone-id}/context.md | Detailed per-zone context |
| zones/{zone-id}/summary.json | Per-zone metadata and risk metrics |

Determinism

The algorithmic analysis (phases 1-6 without AI enrichment) is fully deterministic: the same codebase always produces identical output. All tie-breaking in the Louvain algorithm is lexicographic.

AI enrichment introduces variation because LLM responses are non-deterministic. Zone names may differ slightly across runs, but the underlying zone boundaries (file assignments) are stable.
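
Lexicographic tie-breaking can be illustrated with a one-line selection rule. This is a sketch of the general idea, with a hypothetical function name, not the project's code: when two candidate communities offer the same modularity gain, the one with the smaller id wins, so the result does not depend on iteration order.

```python
def best_community(gains):
    """gains: mapping of community id -> modularity gain for moving a node there."""
    # Highest gain first; equal gains fall back to the smallest id.
    return min(gains, key=lambda community: (-gains[community], community))
```

Calling it with {"b": 0.5, "a": 0.5, "c": 0.1} returns "a" regardless of the order in which the keys were inserted.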

Released under the Elastic License 2.0.