BitZoom

Deterministic property-similarity layout with hierarchical bit-shift aggregation.

Live demo · Scroll to zoom · Drag to pan · Click to select

The Problem

Classical graph layout (force-directed, spectral) optimizes for topological fidelity: connected nodes should be near each other. For property graphs, this is often wrong. Semantically similar nodes frequently share no edge. Two ransomware samples may never connect, but they should be near each other.

BitZoom inverts the priority: property similarity first, topology second. Proximity means "these nodes have similar properties," not "these nodes are connected."

BitZoom started as a side project: melker.sh (a terminal UI framework for document-based apps) needed a fast graph layout that clusters by properties rather than connectivity. No existing algorithm did this well, so we built one.

TL;DR — The Pipeline in 30 Seconds

Each node's properties (labels, categories, numeric fields) are broken into short string tokens. Those tokens are hashed through MinHash to produce a fixed-size signature per property group — a compact fingerprint where similar nodes naturally share more hash values. Each signature is then projected onto a 2D point using Gaussian random vectors, so similarity in property space becomes proximity on the canvas.

Multiple property groups (e.g., platform, kill chain, version) each produce their own 2D coordinate. A weighted blend merges them into a single position, with an optional topology term (alpha) that pulls connected nodes closer. The blended coordinates are quantized onto a 16-bit integer grid using Gaussian CDF normalization, which spreads nodes evenly while preserving relative distances.

Zoom levels come free: shifting grid coordinates right by one bit halves the resolution, collapsing nearby nodes into aggregate "supernodes." This gives you 14 stable, hierarchical zoom levels — from full overview down to individual nodes — with no re-layout at any step.

The Pipeline

Properties → tokenize → MinHash (k=128) → z-norm + project 2D → weighted blend + α → Φ(z) quantize → gx >> (16 − L)

Steps 1–3 run once at load. Steps 4–5 rerun on weight or α change. Step 6 is O(1) per node on zoom.

Step 1: Tokenization

Each node's properties are converted into token sets, partitioned by independent property groups. This separation is key: it enables interactive weight adjustment later.

Example node PlugX (type: malware, platform: Windows, family: RAT, kill_chain: c2, degree: 47) tokenizes by group: GROUP → group:malware; LABEL → label:plugx; PLATFORM → platform:Windows; STRUCTURE → deg:32+, leaf:false; NUMERIC (degree: 47) → c:40-60 (coarse, 5 bins), m:46-48 (medium, 50 bins), f:46.8-47.0 (fine, 500 bins). Nearby values share coarse tokens. Empty fields: 0 tokens (neutral).
Properties are tokenized into independent groups. Each group gets its own MinHash signature.

Numeric properties receive multi-resolution tokenization: three tokens at coarse (5 bins), medium (50 bins), and fine (500 bins) resolution. Nearby values share coarse tokens, producing smooth Jaccard similarity. Empty or undefined values emit zero tokens, resulting in a neutral projection rather than false clustering with other missing-data nodes.
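A minimal sketch of this multi-resolution scheme (the `tokenizeNumeric` helper and its token format are illustrative, not the actual API in bitzoom-pipeline.js):

```javascript
// Multi-resolution numeric tokenization (illustrative sketch).
// Emits one token per resolution; nearby values share the coarse token.
// Assumes the value range [min, max] is known up front.
function tokenizeNumeric(name, value, min, max) {
  if (value === undefined || value === null) return []; // empty field: 0 tokens
  const tokens = [];
  for (const [prefix, bins] of [["c", 5], ["m", 50], ["f", 500]]) {
    const t = (value - min) / (max - min);              // normalize to [0, 1]
    const bin = Math.min(bins - 1, Math.floor(t * bins));
    tokens.push(`${name}:${prefix}:${bin}`);
  }
  return tokens;
}
```

Two degrees of 46 and 48 land in the same coarse bin (and likely the same medium bin), so their token sets overlap even though the fine tokens differ.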

Step 2: MinHash Signatures

For each property group, the token set is hashed into a MinHash signature of length k=128 using universal hashing (k-mins variant; see Cohen 2016 for a survey of MinHash sketch constructions). The probability that two nodes share a hash minimum equals their Jaccard similarity:

P[ sig_A[j] = sig_B[j] ] = J(A, B) = |A ∩ B| / |A ∪ B|
Example: comparing two k=128 signatures position by position, 12 matches out of 16 inspected positions estimate Jaccard ≈ 12/16 = 0.75.
MinHash signatures encode set similarity as position-wise agreement. More matches = more similar properties.

Step 3: Gaussian Random Projection

Each 128-dimensional signature is z-score normalized (subtract mean, divide by standard deviation), then projected to 2D via a fixed 2×128 Gaussian random matrix R. Each property group uses an independently seeded matrix:

[ px, py ] = R · zscore(sig)

Z-score normalization is essential: MinHash values are integers in [0, 2^31) with a large common offset. Without normalization, the projection is dominated by this offset and all nodes project to nearly the same point. Normalization extracts the discriminative signal and scales it to unit magnitude.

These per-group 2D projections are computed once and stored permanently. No weight change or parameter adjustment ever alters them. They are the fixed anchors for everything that follows.
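A sketch of the normalize-and-project step (the Box-Muller sampling and seeding scheme here are assumptions; only the shape of the computation follows the text):

```javascript
// Z-score normalize a MinHash signature, then project to 2D with a fixed,
// seeded Gaussian random matrix (two rows: one per output coordinate).
function zscore(sig) {
  const n = sig.length;
  const mean = sig.reduce((s, v) => s + v, 0) / n;
  const sd = Math.sqrt(sig.reduce((s, v) => s + (v - mean) ** 2, 0) / n) || 1;
  return sig.map(v => (v - mean) / sd);
}

// Deterministic standard-normal samples: Box-Muller over a seeded LCG.
function gaussianRow(k, seed) {
  let s = seed;
  const rand = () => ((s = (s * 48271) % 2147483647) / 2147483647);
  return Array.from({ length: k }, () =>
    Math.sqrt(-2 * Math.log(rand() || 1e-12)) * Math.cos(2 * Math.PI * rand()));
}

function project2D(sig, seed) {
  const z = zscore(sig);
  const rx = gaussianRow(z.length, seed);     // row for px
  const ry = gaussianRow(z.length, seed + 1); // row for py
  const dot = r => r.reduce((s, v, i) => s + v * z[i], 0);
  return [dot(rx), dot(ry)];
}
```

Because the matrix is derived from a fixed seed, the same signature always projects to the same point, which is what makes the anchors permanent.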

Why does projecting 128D → 2D work?

It shouldn't, in the strict mathematical sense. The Johnson-Lindenstrauss lemma guarantees distance preservation only for target dimension O(log n / ε²), far more than 2. But BitZoom doesn't need distance preservation. It needs ordering preservation: if A is more similar to B than to C in 128D, then A should be closer to B than to C in 2D, most of the time.

Illustration: groups with high within-group Jaccard (web servers, J ≈ 0.85; databases, J ≈ 0.80; ML libraries) and low cross-group Jaccard (web ↔ db ≈ 0.15, web ↔ ML ≈ 0.10) project under the same R to separated 2D regions: similar signatures land close, dissimilar ones far apart.
This is an empirical heuristic, not a theorem. MinHash signatures with high Jaccard overlap share most hash values, so their z-normalized vectors tend to be correlated, and the fixed random projection tends to place correlated vectors nearer in 2D. Low Jaccard means different hash minimums, uncorrelated normalized vectors, and an effectively random 2D offset. For BitZoom's purposes (quantization into grid cells), a stable ordering signal is sufficient. Exact distance preservation is not needed and not provided.

Step 4: Unified Blend

The final position is a weighted combination of per-group anchors plus an optional topology smoothing signal:

px_i = (1 − α) · ( Σ_g w_g · p_g(i) ) / W + α · avg_{j ∈ N(i)} px_j,  where W = Σ_g w_g
α = 0: properties only (clusters by similarity). α = 0.5: blended (properties + topology). α → 1: topology dominates (oversmoothing).
The α parameter controls the balance between property similarity and topology. High α causes oversmoothing.

Because the blend is a convex combination of fixed anchors, changing a weight smoothly moves nodes along predictable paths. No rehashing, no reprojection. Just arithmetic.
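One blend pass over the fixed anchors could be sketched as follows (`blendPass` is a hypothetical helper; per the complexity table, the real pipeline runs several passes, default 5):

```javascript
// One blend pass: convex combination of fixed per-group anchors plus an
// optional topology smoothing term. anchors[g][i] = [x, y] projection of
// node i under group g; anchors are never mutated.
function blendPass(anchors, weights, alpha, prev, neighbors) {
  const W = weights.reduce((s, w) => s + w, 0) || 1;
  return prev.map((_, i) => {
    // Weighted anchor position for node i.
    let ax = 0, ay = 0;
    weights.forEach((w, g) => {
      ax += (w / W) * anchors[g][i][0];
      ay += (w / W) * anchors[g][i][1];
    });
    const nbrs = neighbors[i];
    if (alpha === 0 || nbrs.length === 0) return [ax, ay];
    // Topology term: average of neighbor positions from the previous pass.
    let sx = 0, sy = 0;
    for (const j of nbrs) { sx += prev[j][0]; sy += prev[j][1]; }
    return [
      (1 - alpha) * ax + alpha * (sx / nbrs.length),
      (1 - alpha) * ay + alpha * (sy / nbrs.length),
    ];
  });
}
```

Changing a weight only reweights the fixed anchors, so repeated calls with new weights move nodes along smooth, predictable paths.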

Property only (α=0). Clusters by role
With topology (α=0.75). Connected nodes pull together

Step 5: Grid Quantization

Continuous positions are quantized to uint16 grid coordinates (65536 × 65536). Two modes are available:

Gaussian (default): maps through Φ(z) using fixed boundaries (μ, σ) computed once from the dataset-tuned blend. This is a reasonable fit when blended coordinates are roughly bell-shaped, which often holds when multiple projected components contribute, but it is an approximation rather than a guarantee. In practice it tends to preserve density structure better than rank quantization: clusters remain tighter and sparse regions remain more spread out. Post-blend distributions can be multimodal or skewed, especially at high α or with dominant weights. The boundaries are fixed across weight changes; only the node positions shift.

gx = Φ( (px_i − μ) / σ ) × 65536

Rank: sorts all nodes by blended position and assigns grid coordinates by rank. Guarantees equal occupancy per bucket but destroys density information.

Either way, the output is 4 bytes per node: two uint16 values (gx, gy).
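A sketch of the Gaussian mode, using a standard erf approximation for Φ (the actual implementation may compute Φ differently; the clamp to uint16 is an assumption):

```javascript
// Gaussian CDF quantization to a uint16 grid (illustrative sketch).
function erf(x) {
  // Abramowitz & Stegun 7.1.26 approximation, max error ~1.5e-7.
  const sign = Math.sign(x) || 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

function phi(z) { return 0.5 * (1 + erf(z / Math.SQRT2)); }

// mu and sigma are computed once from the initial blend, then frozen.
function quantize(px, mu, sigma) {
  const g = Math.floor(phi((px - mu) / sigma) * 65536);
  return Math.min(65535, Math.max(0, g)); // clamp to uint16 range
}
```

A node at the mean lands in the middle of the grid; nodes far in the tails are compressed toward the edges, which is how density structure survives quantization.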

Step 6: Bit-Shift Zoom Hierarchy

This is the key trick. All 14 zoom levels derive from the stored uint16 coordinates via bit shifts:

cx = gx >> (16-L)
Example: gx = 0b10110110… → at L1, cx = 0b1 (2×2 grid); at L3, cx = 0b101 (8×8 grid); at L8, cx = 0b10110110 (256×256 grid). Bit-prefix containment: the L1 cell always contains the L3 cell, for every node, unconditionally.
Higher bits select coarser cells. Each cell at level L is always a sub-cell of its parent at level L-1.

This is O(1) per node and requires zero recomputation on zoom navigation. Nodes sharing a cell become a supernode; cross-cell edges become weighted supernode edges.
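The cell derivation and supernode grouping can be sketched as (helper names are illustrative):

```javascript
// Derive the aggregation cell of a node at zoom level L (1..14).
function cellAt(gx, gy, level) {
  const shift = 16 - level;
  return [gx >> shift, gy >> shift];
}

// Group nodes into supernodes by shared cell; cross-cell edges would then
// be merged into weighted supernode edges (omitted here).
function buildSupernodes(nodes, level) {
  const cells = new Map(); // packed cell key -> member node ids
  for (const { id, gx, gy } of nodes) {
    const [cx, cy] = cellAt(gx, gy, level);
    const key = (cx << 16) | cy; // cx, cy each fit in 16 bits
    if (!cells.has(key)) cells.set(key, []);
    cells.get(key).push(id);
  }
  return cells;
}
```

Because a level-L cell index is a bit prefix of every deeper level's index, zooming never moves a node across parent-cell boundaries.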

Zoom hierarchy. Arrow keys or scroll to change level

Rendering Pipeline

BitZoom renders in five layers, back to front. This ordering ensures that edges never obscure nodes, heatmaps provide context without hiding structure, and highlighted connections are always visible.

1. Normal edges: sampled, distance-faded cubic beziers. Adaptive cap: min(5000, nodes×3).
2. Heatmap: Gaussian splat (additive) or kernel density estimation. Covers edges softly.
3. Highlighted edges: selected/hovered node connections. No distance filter. Always visible above heatmap.
4. Node circles: size by members or edges. Opacity scales with importance (sqrt). Selected nodes glow.
5. Labels & counts: density-adaptive. Hidden at >150 visible, large-node-only at 50–150, all at <50. Always shown on hover.
Five rendering layers, back to front. Labels are always topmost — never hidden behind circles or heatmap.

Heatmap Visualization

Two heatmap modes reveal cluster density and color distribution at a glance, without requiring individual node inspection.

Gaussian Splat

Each node stamps a radial gradient blob using additive blending. Where nodes cluster, blobs overlap and intensify. The radius scales with node size (member count or degree).

Individual nodes With Gaussian splat
Gaussian splat: each node emits a radial gradient. Overlapping blobs brighten additively, revealing cluster density and dominant color.

Kernel Density Estimation

A grid-based KDE at 1/4 canvas resolution. For each pixel, nearby nodes contribute weight through a quartic (biweight-style) kernel: k(d) = (1 − d²/r²)². Colors blend by weighted average of contributing node group colors.

Per node: a radius-r kernel (1 − d²/r²)² → accumulate on 1/4-resolution grid → bilinear upscale to canvas.
KDE: each node's kernel writes weighted color into a low-res grid. The grid is bilinear-upscaled for a smooth continuous field showing density and dominant color.
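The accumulation step might be sketched as follows (color blending omitted; the helper name and grid-unit convention are assumptions):

```javascript
// Grid-based KDE accumulation with the kernel k(d) = (1 - d^2/r^2)^2.
// Node positions x, y and radius r are already in low-res grid units.
function accumulateKDE(nodes, gridW, gridH, r) {
  const weight = new Float32Array(gridW * gridH);
  for (const { x, y } of nodes) {
    // Only visit grid cells within the kernel's bounding box.
    const x0 = Math.max(0, Math.ceil(x - r)), x1 = Math.min(gridW - 1, Math.floor(x + r));
    const y0 = Math.max(0, Math.ceil(y - r)), y1 = Math.min(gridH - 1, Math.floor(y + r));
    for (let gy = y0; gy <= y1; gy++) {
      for (let gx = x0; gx <= x1; gx++) {
        const d2 = (gx - x) ** 2 + (gy - y) ** 2;
        if (d2 >= r * r) continue;            // outside kernel support
        weight[gy * gridW + gx] += (1 - d2 / (r * r)) ** 2;
      }
    }
  }
  return weight; // bilinear-upscaled to the canvas in a later step
}
```

The kernel falls smoothly to zero at radius r, so the upscaled field has no hard edges around isolated nodes.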

Seamless Zoom & Auto-Level

BitZoom maintains a logical zoom that can exceed the 1:1 grid scale. When logical zoom crosses a threshold, the aggregation level changes automatically while the visual scale stays continuous.

Auto-level switching via logical zoom: when zoom ≥ 2×, zoom /= 2 and the level steps up (finer: more, smaller supernodes); when zoom < 0.5×, zoom ×= 2 and the level steps down (coarser: fewer, larger supernodes). Visual scale stays continuous: renderZoom = max(1, zoom × 2^(level − base)). Stepping up halves zoom, but the exponent increases, so renderZoom stays the same. No visual jump.
The 4× hysteresis gap (0.5× to 2×) prevents oscillation. renderZoom compensates for level offset, keeping the visual scale perfectly continuous across transitions.
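The hysteresis loop can be sketched as (function names are illustrative):

```javascript
// Auto-level switching with hysteresis: logical zoom is folded back into
// [0.5, 2) by trading factors of 2 against the aggregation level.
function autoLevel(state, maxLevel = 14, minLevel = 1) {
  let { zoom, level } = state;
  while (zoom >= 2 && level < maxLevel) { zoom /= 2; level += 1; } // finer cells
  while (zoom < 0.5 && level > minLevel) { zoom *= 2; level -= 1; } // coarser cells
  return { zoom, level };
}

// Visual scale is invariant under level changes.
function renderZoom(zoom, level, base) {
  return Math.max(1, zoom * 2 ** (level - base));
}
```

Halving the zoom while incrementing the level leaves renderZoom unchanged, which is exactly why transitions produce no visible jump.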

Adaptive Density Rendering

As you zoom in, more detail appears naturally. The renderer counts visible supernodes (those within the viewport) and progressively reveals labels and counts:

>150 visible: circles only. 50–150 visible: counts on large nodes (e.g. 48, 92, 31). <50 visible: labels + counts on all (flask 48, postgresql 92, numpy 31, redis 12). Zooming in moves through these stages.
Density-adaptive rendering: zoomed out shows only opacity-scaled circles. Zooming in progressively reveals counts, then full labels.

Why This Matters

1. Near-linear preprocessing. No O(n²) force simulation. MinHash + projection is O(n × k × G) where k=128 and G is the number of property groups.
2. Interactive weight tuning. Changing a property weight moves nodes smoothly. No rehashing, no reprojection. Anchor recomputation is O(n × G); topology smoothing adds O(passes × (n + |E|)) when α > 0.
3. Instant zoom. All 14 aggregation levels from 4 stored bytes per node. Bit shifts, not recomputation.
4. Deterministic. Same data + same parameters = same layout. Every time. No random initialization, no convergence sensitivity.

Complexity

Phase | Cost
Tokenization | O(n × T)
MinHash signatures | O(n × k × G × T)
Gaussian projection | O(n × k × G)
Anchor recomputation | O(n × G)
Topology smoothing | O(passes × (n + |E|))
Quantization (rank) | O(n log n)
Quantization (Gaussian) | O(n)
Level construction | O(n + |E|) per level, lazy
Zoom-cell derivation | O(1) per node
Weight change | O(n × G + passes × (n + |E|)), no re-projection

Where n = nodes, |E| = edges, k = 128 (MinHash length), G = property groups, T = avg tokens per node, passes = blend iterations (default 5).

Known Limitations

Jaccard on tokens is crude for continuous or ordinal data. Multi-resolution tokenization helps but doesn't eliminate bin sensitivity.

2D projection doesn't preserve distances. Layout reflects coarse ordering, not metric geometry.

Quantization tradeoffs. Rank quantization destroys density (uniform occupancy by design). Gaussian quantization tends to preserve density better but uses fixed boundaries from the initial blend. Large weight changes can shift the distribution beyond the stored μ/σ, pushing nodes to grid extremes.

Low-entropy and undefined-value collapse is mitigated by an adaptive weight floor. Empty fields project to a neutral (0,0) point; low-entropy properties produce few distinct projections. The floor (10% of max weight, minimum 0.10) ensures zero-weight high-entropy groups always contribute spreading, preventing degenerate clustering and providing smooth transitions as weights change. With 3 zero-weight groups, the dominant group controls ~83% of layout. Trade-off: the floor spreads nodes using dimensions the user deprioritized, so the spreading axis does not reflect the user's stated preferences. When many nodes share the same value under a dominant group, some edge concentration still occurs.
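One way such a floor might be computed (`effectiveWeights` is a hypothetical helper; the exact rule in bitzoom-algo.js may differ):

```javascript
// Adaptive weight floor (illustrative): zero-weight groups still receive a
// minimum weight (10% of the max weight, at least 0.10) so high-entropy
// groups always contribute some spreading.
function effectiveWeights(weights, floorFrac = 0.1, floorMin = 0.10) {
  const maxW = Math.max(...weights, 0);
  const floor = Math.max(floorFrac * maxW, floorMin);
  return weights.map(w => Math.max(w, floor));
}
```

With one dominant group and several zeroed groups, the floored weights keep the blend from collapsing onto a single axis while the dominant group still controls most of the layout.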

Oversmoothing at high α with many passes. Well-connected components collapse toward their centroid. At high α, both axes are increasingly driven by the same smoothing operator, which can produce poorly separated, low-variance structure.

The individual components (MinHash, random projection, hierarchical grids, graph smoothing) are all established techniques. The contribution is their combination into a pipeline where each stage's output feeds naturally into the next, and where the stored intermediate representations support both efficient navigation and interactive parameter exploration.

Code

File | Role
Interactive viewer | Load datasets, adjust weights, explore zoom levels
bitzoom-algo.js | Pure algorithm functions: MinHash, projection, blend, quantization, level building
bitzoom-pipeline.js | SNAP parsers, graph building, tokenization, full projection pipeline
bitzoom-canvas.js | Standalone embeddable component: canvas, interaction, rendering
bitzoom-viewer.js | Viewer application: UI, workers, data loading, detail panel, URL state
bitzoom-renderer.js | 5-layer canvas rendering, heatmaps, edge sampling, hit testing
bitzoom-worker.js | Web Worker coordinator, fans out to projection sub-workers
SPEC.md | Algorithm specification: similarity model, blend formula, complexity, open questions
<bz-graph> | Web component: embed a graph with one HTML tag
GitHub | Source, datasets, tests, SNAP converters

Embed a graph with a single HTML element:

<script type="module" src="bz-graph.js"></script>

<bz-graph edges="data/karate.edges" nodes="data/karate.nodes"
          level="2" heatmap="density" legend
          weights="group:8" label-props="group">
</bz-graph>