BitZoom

Deterministic property-similarity layout with hierarchical bit-shift aggregation.

Live demo · Scroll to zoom · Drag to pan · Click to select

The Problem

Classical graph layout (force-directed, spectral) optimizes for topological fidelity: connected nodes should be near each other. For property graphs, this is often wrong. Semantically similar nodes frequently share no edge. Two ransomware samples may never connect, but they should be near each other.

BitZoom inverts the priority: property similarity first, topology second. Proximity means "these nodes have similar properties," not "these nodes are connected."

BitZoom started as a side project: melker.sh (a terminal UI framework for document-based apps) needed a fast graph layout that clusters by properties rather than connectivity. No existing algorithm did this well, so we built one.

TL;DR — The Pipeline in 30 Seconds

Each node's properties (labels, categories, numeric fields) are broken into short string tokens. Those tokens are hashed through MinHash to produce a fixed-size signature per property group — a compact fingerprint where similar nodes naturally share more hash values. Each signature is then projected onto a 2D point using Gaussian random vectors, so similarity in property space becomes proximity on the canvas.

Multiple property groups (e.g., platform, kill chain, version) each produce their own 2D coordinate. A weighted blend merges them into a single position, with an optional topology term (alpha) that pulls connected nodes closer. The blended coordinates are quantized onto a 16-bit integer grid using Gaussian CDF normalization, which spreads nodes evenly while preserving relative distances.

Zoom levels come free: shifting grid coordinates right by one bit halves the resolution, collapsing nearby nodes into aggregate "supernodes." This gives you 14 stable, hierarchical zoom levels — from full overview down to individual nodes — with no re-layout at any step.

The Pipeline

Properties → tokenize → MinHash (k=128) → z-norm + project 2D → weighted blend + α → Φ(z) quantize → gx >> (16 − L)

Steps 1–3 run once at load. Steps 4–5 rerun on weight or α change. Step 6 is O(1) per node on zoom.

Step 1: Tokenization

Each node's properties are converted into token sets, partitioned by independent property groups. This separation is key: it enables interactive weight adjustment later.

Example node PlugX (type: malware, platform: Windows, family: RAT, kill_chain: c2, degree: 47) tokenizes by group: GROUP → group:malware; LABEL → label:plugx; PLATFORM → platform:Windows; STRUCTURE → deg:32+, leaf:false; NUMERIC (degree: 47) → c:40-60 (coarse, 5 bins), m:46-48 (medium, 50 bins), f:46.8-47.0 (fine, 500 bins). Nearby values share coarse tokens. Empty fields: 0 tokens (neutral).
Properties are tokenized into independent groups. Each group gets its own MinHash signature.

Numeric properties receive multi-resolution tokenization: three tokens at coarse (5 bins), medium (50 bins), and fine (500 bins) resolution. Nearby values share coarse tokens, producing smooth Jaccard similarity. Empty or undefined values emit zero tokens, resulting in a neutral projection rather than false clustering with other missing-data nodes.
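A minimal sketch of this multi-resolution scheme (the `tokenizeNumeric` helper and its token format are illustrative, not the actual API in bitzoom-pipeline.js):

```javascript
// Multi-resolution numeric tokenization (illustrative sketch).
// Emits one token per resolution; nearby values share the coarse token.
// Assumes the value range [min, max] is known up front.
function tokenizeNumeric(name, value, min, max) {
  if (value === undefined || value === null) return []; // empty field: 0 tokens
  const tokens = [];
  for (const [prefix, bins] of [["c", 5], ["m", 50], ["f", 500]]) {
    const t = (value - min) / (max - min);              // normalize to [0, 1]
    const bin = Math.min(bins - 1, Math.floor(t * bins));
    tokens.push(`${name}:${prefix}:${bin}`);
  }
  return tokens;
}
```

Two degrees of 46 and 48 land in the same coarse bin (and likely the same medium bin), so their token sets overlap even though the fine tokens differ.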

Step 2: MinHash Signatures

For each property group, the token set is hashed into a MinHash signature of length k=128 using universal hashing (k-mins variant; see Cohen 2016 for a survey of MinHash sketch constructions). The probability that two nodes share a hash minimum equals their Jaccard similarity:

P[ sig_A[j] = sig_B[j] ] = J(A, B) = |A ∩ B| / |A ∪ B|
Example: comparing two k=128 signatures position by position, 12 matches out of 16 inspected positions estimate Jaccard ≈ 12/16 = 0.75.
MinHash signatures encode set similarity as position-wise agreement. More matches = more similar properties.

Step 3: Gaussian Random Projection

Each 128-dimensional signature is z-score normalized (subtract mean, divide by standard deviation), then projected to 2D via a fixed 2×128 Gaussian random matrix R. Each property group uses an independently seeded matrix:

[ px, py ] = R · zscore(sig)

Z-score normalization is essential: MinHash values are integers in [0, 2^31) with a large common offset. Without normalization, the projection is dominated by this offset and all nodes project to nearly the same point. Normalization extracts the discriminative signal and scales it to unit magnitude.

These per-group 2D projections are computed once and stored permanently. No weight change or parameter adjustment ever alters them. They are the fixed anchors for everything that follows.
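A sketch of the normalize-and-project step (the Box-Muller sampling and seeding scheme here are assumptions; only the shape of the computation follows the text):

```javascript
// Z-score normalize a MinHash signature, then project to 2D with a fixed,
// seeded Gaussian random matrix (two rows: one per output coordinate).
function zscore(sig) {
  const n = sig.length;
  const mean = sig.reduce((s, v) => s + v, 0) / n;
  const sd = Math.sqrt(sig.reduce((s, v) => s + (v - mean) ** 2, 0) / n) || 1;
  return sig.map(v => (v - mean) / sd);
}

// Deterministic standard-normal samples: Box-Muller over a seeded LCG.
function gaussianRow(k, seed) {
  let s = seed;
  const rand = () => ((s = (s * 48271) % 2147483647) / 2147483647);
  return Array.from({ length: k }, () =>
    Math.sqrt(-2 * Math.log(rand() || 1e-12)) * Math.cos(2 * Math.PI * rand()));
}

function project2D(sig, seed) {
  const z = zscore(sig);
  const rx = gaussianRow(z.length, seed);     // row for px
  const ry = gaussianRow(z.length, seed + 1); // row for py
  const dot = r => r.reduce((s, v, i) => s + v * z[i], 0);
  return [dot(rx), dot(ry)];
}
```

Because the matrix is derived from a fixed seed, the same signature always projects to the same point, which is what makes the anchors permanent.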

Why does projecting 128D → 2D work?

It shouldn't, in the strict mathematical sense. The Johnson-Lindenstrauss lemma guarantees distance preservation only for target dimension O(log n / ε²), far more than 2. But BitZoom doesn't need distance preservation. It needs ordering preservation: if A is more similar to B than to C in 128D, then A should be closer to B than to C in 2D, most of the time.

Illustration: groups with high within-group Jaccard (web servers, J ≈ 0.85; databases, J ≈ 0.80; ML libraries) and low cross-group Jaccard (web ↔ db ≈ 0.15, web ↔ ML ≈ 0.10) project under the same R to separated 2D regions: similar signatures land close, dissimilar ones far apart.
This is an empirical heuristic, not a theorem. MinHash signatures with high Jaccard overlap share most hash values, so their z-normalized vectors tend to be correlated, and the fixed random projection tends to place correlated vectors nearer in 2D. Low Jaccard means different hash minimums, uncorrelated normalized vectors, and an effectively random 2D offset. For BitZoom's purposes (quantization into grid cells), a stable ordering signal is sufficient. Exact distance preservation is not needed and not provided.

Step 4: Unified Blend

The final position is a weighted combination of per-group anchors plus an optional topology smoothing signal:

px_i = (1 − α) · ( Σ_g w_g · p_g(i) ) / W + α · avg_{j ∈ N(i)} px_j,  where W = Σ_g w_g
α = 0: properties only (clusters by similarity). α = 0.5: blended (properties + topology). α → 1: topology dominates (oversmoothing).
The α parameter controls the balance between property similarity and topology. High α causes oversmoothing.

Because the blend is a convex combination of fixed anchors, changing a weight smoothly moves nodes along predictable paths. No rehashing, no reprojection. Just arithmetic.
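One blend pass over the fixed anchors could be sketched as follows (`blendPass` is a hypothetical helper; per the complexity table, the real pipeline runs several passes, default 5):

```javascript
// One blend pass: convex combination of fixed per-group anchors plus an
// optional topology smoothing term. anchors[g][i] = [x, y] projection of
// node i under group g; anchors are never mutated.
function blendPass(anchors, weights, alpha, prev, neighbors) {
  const W = weights.reduce((s, w) => s + w, 0) || 1;
  return prev.map((_, i) => {
    // Weighted anchor position for node i.
    let ax = 0, ay = 0;
    weights.forEach((w, g) => {
      ax += (w / W) * anchors[g][i][0];
      ay += (w / W) * anchors[g][i][1];
    });
    const nbrs = neighbors[i];
    if (alpha === 0 || nbrs.length === 0) return [ax, ay];
    // Topology term: average of neighbor positions from the previous pass.
    let sx = 0, sy = 0;
    for (const j of nbrs) { sx += prev[j][0]; sy += prev[j][1]; }
    return [
      (1 - alpha) * ax + alpha * (sx / nbrs.length),
      (1 - alpha) * ay + alpha * (sy / nbrs.length),
    ];
  });
}
```

Changing a weight only reweights the fixed anchors, so repeated calls with new weights move nodes along smooth, predictable paths.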

Property only (α=0). Clusters by role
With topology (α=0.75). Connected nodes pull together

Step 5: Grid Quantization

Continuous positions are quantized to uint16 grid coordinates (65536 × 65536). Two modes are available:

Gaussian (default): maps through Φ(z) using fixed boundaries (μ, σ) computed once from the dataset-tuned blend. This is a reasonable fit when blended coordinates are roughly bell-shaped, which often holds when multiple projected components contribute, but it is an approximation rather than a guarantee. In practice it tends to preserve density structure better than rank quantization: clusters remain tighter and sparse regions remain more spread out. Post-blend distributions can be multimodal or skewed, especially at high α or with dominant weights. The boundaries are fixed across weight changes; only the node positions shift.

gx = Φ( (px_i − μ) / σ ) × 65536

Rank: sorts all nodes by blended position and assigns grid coordinates by rank. Guarantees equal occupancy per bucket but destroys density information.

Either way, the output is 4 bytes per node: two uint16 values (gx, gy).
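A sketch of the Gaussian mode, using a standard erf approximation for Φ (the actual implementation may compute Φ differently; the clamp to uint16 is an assumption):

```javascript
// Gaussian CDF quantization to a uint16 grid (illustrative sketch).
function erf(x) {
  // Abramowitz & Stegun 7.1.26 approximation, max error ~1.5e-7.
  const sign = Math.sign(x) || 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

function phi(z) { return 0.5 * (1 + erf(z / Math.SQRT2)); }

// mu and sigma are computed once from the initial blend, then frozen.
function quantize(px, mu, sigma) {
  const g = Math.floor(phi((px - mu) / sigma) * 65536);
  return Math.min(65535, Math.max(0, g)); // clamp to uint16 range
}
```

A node at the mean lands in the middle of the grid; nodes far in the tails are compressed toward the edges, which is how density structure survives quantization.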

Step 6: Bit-Shift Zoom Hierarchy

This is the key trick. All 14 zoom levels derive from the stored uint16 coordinates via bit shifts:

cx = gx >> (16-L)
Example: gx = 0b10110110… → at L1, cx = 0b1 (2×2 grid); at L3, cx = 0b101 (8×8 grid); at L8, cx = 0b10110110 (256×256 grid). Bit-prefix containment: the L1 cell always contains the L3 cell, for every node, unconditionally.
Higher bits select coarser cells. Each cell at level L is always a sub-cell of its parent at level L-1.

This is O(1) per node and requires zero recomputation on zoom navigation. Nodes sharing a cell become a supernode; cross-cell edges become weighted supernode edges.
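The cell derivation and supernode grouping can be sketched as (helper names are illustrative):

```javascript
// Derive the aggregation cell of a node at zoom level L (1..14).
function cellAt(gx, gy, level) {
  const shift = 16 - level;
  return [gx >> shift, gy >> shift];
}

// Group nodes into supernodes by shared cell; cross-cell edges would then
// be merged into weighted supernode edges (omitted here).
function buildSupernodes(nodes, level) {
  const cells = new Map(); // packed cell key -> member node ids
  for (const { id, gx, gy } of nodes) {
    const [cx, cy] = cellAt(gx, gy, level);
    const key = (cx << 16) | cy; // cx, cy each fit in 16 bits
    if (!cells.has(key)) cells.set(key, []);
    cells.get(key).push(id);
  }
  return cells;
}
```

Because a level-L cell index is a bit prefix of every deeper level's index, zooming never moves a node across parent-cell boundaries.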

Zoom hierarchy. Arrow keys or scroll to change level

Rendering Pipeline

BitZoom renders in five layers, back to front. This ordering ensures that edges never obscure nodes, heatmaps provide context without hiding structure, and highlighted connections are always visible.

1. Normal edges: sampled, distance-faded cubic beziers. Adaptive cap: min(5000, nodes×3).
2. Heatmap: Gaussian splat (additive) or kernel density estimation. Covers edges softly.
3. Highlighted edges: selected/hovered node connections. No distance filter. Always visible above heatmap.
4. Node circles: size by members or edges. Opacity scales with importance (sqrt). Selected nodes glow.
5. Labels & counts: density-adaptive. Hidden at >150 visible, large-node-only at 50–150, all at <50. Always shown on hover.
Five rendering layers, back to front. Labels are always topmost — never hidden behind circles or heatmap.

Heatmap Visualization

Two heatmap modes reveal cluster density and color distribution at a glance, without requiring individual node inspection.

Gaussian Splat

Each node stamps a radial gradient blob using additive blending. Where nodes cluster, blobs overlap and intensify. The radius scales with node size (member count or degree).

Individual nodes With Gaussian splat
Gaussian splat: each node emits a radial gradient. Overlapping blobs brighten additively, revealing cluster density and dominant color.

Kernel Density Estimation

A grid-based KDE at 1/4 canvas resolution. For each pixel, nearby nodes contribute weight through a quartic (biweight-style) kernel: k(d) = (1 − d²/r²)². Colors blend by weighted average of contributing node group colors.

Per node: a radius-r kernel (1 − d²/r²)² → accumulate on 1/4-resolution grid → bilinear upscale to canvas.
KDE: each node's kernel writes weighted color into a low-res grid. The grid is bilinear-upscaled for a smooth continuous field showing density and dominant color.
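The accumulation step might be sketched as follows (color blending omitted; the helper name and grid-unit convention are assumptions):

```javascript
// Grid-based KDE accumulation with the kernel k(d) = (1 - d^2/r^2)^2.
// Node positions x, y and radius r are already in low-res grid units.
function accumulateKDE(nodes, gridW, gridH, r) {
  const weight = new Float32Array(gridW * gridH);
  for (const { x, y } of nodes) {
    // Only visit grid cells within the kernel's bounding box.
    const x0 = Math.max(0, Math.ceil(x - r)), x1 = Math.min(gridW - 1, Math.floor(x + r));
    const y0 = Math.max(0, Math.ceil(y - r)), y1 = Math.min(gridH - 1, Math.floor(y + r));
    for (let gy = y0; gy <= y1; gy++) {
      for (let gx = x0; gx <= x1; gx++) {
        const d2 = (gx - x) ** 2 + (gy - y) ** 2;
        if (d2 >= r * r) continue;            // outside kernel support
        weight[gy * gridW + gx] += (1 - d2 / (r * r)) ** 2;
      }
    }
  }
  return weight; // bilinear-upscaled to the canvas in a later step
}
```

The kernel falls smoothly to zero at radius r, so the upscaled field has no hard edges around isolated nodes.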

Seamless Zoom & Auto-Level

BitZoom maintains a logical zoom that can exceed the 1:1 grid scale. When logical zoom crosses a threshold, the aggregation level changes automatically while the visual scale stays continuous.

Auto-level switching via logical zoom: when zoom ≥ 2×, zoom /= 2 and the level steps up (finer: more, smaller supernodes); when zoom < 0.5×, zoom ×= 2 and the level steps down (coarser: fewer, larger supernodes). Visual scale stays continuous: renderZoom = max(1, zoom × 2^(level − base)). Stepping up halves zoom, but the exponent increases, so renderZoom stays the same. No visual jump.
The 4× hysteresis gap (0.5× to 2×) prevents oscillation. renderZoom compensates for level offset, keeping the visual scale perfectly continuous across transitions.
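The hysteresis loop can be sketched as (function names are illustrative):

```javascript
// Auto-level switching with hysteresis: logical zoom is folded back into
// [0.5, 2) by trading factors of 2 against the aggregation level.
function autoLevel(state, maxLevel = 14, minLevel = 1) {
  let { zoom, level } = state;
  while (zoom >= 2 && level < maxLevel) { zoom /= 2; level += 1; } // finer cells
  while (zoom < 0.5 && level > minLevel) { zoom *= 2; level -= 1; } // coarser cells
  return { zoom, level };
}

// Visual scale is invariant under level changes.
function renderZoom(zoom, level, base) {
  return Math.max(1, zoom * 2 ** (level - base));
}
```

Halving the zoom while incrementing the level leaves renderZoom unchanged, which is exactly why transitions produce no visible jump.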

Adaptive Density Rendering

As you zoom in, more detail appears naturally. The renderer counts visible supernodes (those within the viewport) and progressively reveals labels and counts:

>150 visible: circles only. 50–150 visible: counts on large nodes (e.g. 48, 92, 31). <50 visible: labels + counts on all (flask 48, postgresql 92, numpy 31, redis 12). Zooming in moves through these stages.
Density-adaptive rendering: zoomed out shows only opacity-scaled circles. Zooming in progressively reveals counts, then full labels.

Why This Matters

1. Near-linear preprocessing. No O(n²) force simulation. MinHash + projection is O(n × k × G) where k=128 and G is the number of property groups.
2. Interactive weight tuning. Changing a property weight moves nodes smoothly. No rehashing, no reprojection. Anchor recomputation is O(n × G); topology smoothing adds O(passes × (n + |E|)) when α > 0.
3. Instant zoom. All 14 aggregation levels from 4 stored bytes per node. Bit shifts, not recomputation.
4. Deterministic. Same data + same parameters = same layout. Every time. No random initialization, no convergence sensitivity.

Complexity

Phase | Cost
Tokenization | O(n × T)
MinHash signatures | O(n × k × G × T)
Gaussian projection | O(n × k × G)
Anchor recomputation | O(n × G)
Topology smoothing | O(passes × (n + |E|))
Quantization (rank) | O(n log n)
Quantization (Gaussian) | O(n)
Level construction | O(n + |E|) per level, lazy
Zoom-cell derivation | O(1) per node
Weight change | O(n × G + passes × (n + |E|)), no re-projection

Where n = nodes, |E| = edges, k = 128 (MinHash length), G = property groups, T = avg tokens per node, passes = blend iterations (default 5).

Known Limitations

Jaccard on tokens is crude for continuous or ordinal data. Multi-resolution tokenization helps but doesn't eliminate bin sensitivity.

2D projection doesn't preserve distances. Layout reflects coarse ordering, not metric geometry.

Quantization tradeoffs. Rank quantization destroys density (uniform occupancy by design). Gaussian quantization tends to preserve density better but uses fixed boundaries from the initial blend. Large weight changes can shift the distribution beyond the stored μ/σ, pushing nodes to grid extremes.

Low-entropy and undefined-value collapse is mitigated by an adaptive weight floor. Empty fields project to a neutral (0,0) point; low-entropy properties produce few distinct projections. The floor (10% of max weight, minimum 0.10) ensures zero-weight high-entropy groups always contribute spreading, preventing degenerate clustering and providing smooth transitions as weights change. With 3 zero-weight groups, the dominant group controls ~83% of layout. Trade-off: the floor spreads nodes using dimensions the user deprioritized, so the spreading axis does not reflect the user's stated preferences. When many nodes share the same value under a dominant group, some edge concentration still occurs.
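One way such a floor might be computed (`effectiveWeights` is a hypothetical helper; the exact rule in bitzoom-algo.js may differ):

```javascript
// Adaptive weight floor (illustrative): zero-weight groups still receive a
// minimum weight (10% of the max weight, at least 0.10) so high-entropy
// groups always contribute some spreading.
function effectiveWeights(weights, floorFrac = 0.1, floorMin = 0.10) {
  const maxW = Math.max(...weights, 0);
  const floor = Math.max(floorFrac * maxW, floorMin);
  return weights.map(w => Math.max(w, floor));
}
```

With one dominant group and several zeroed groups, the floored weights keep the blend from collapsing onto a single axis while the dominant group still controls most of the layout.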

Oversmoothing at high α with many passes. Well-connected components collapse toward their centroid. At high α, both axes are increasingly driven by the same smoothing operator, which can produce poorly separated, low-variance structure.

The individual components (MinHash, random projection, hierarchical grids, graph smoothing) are all established techniques. The contribution is their combination into a pipeline where each stage's output feeds naturally into the next, and where the stored intermediate representations support both efficient navigation and interactive parameter exploration.

Code

File | Role
Interactive viewer | Load datasets, adjust weights, explore zoom levels
bitzoom-algo.js | Pure algorithm functions: MinHash, projection, blend, quantization, level building
bitzoom-pipeline.js | SNAP parsers, graph building, tokenization, full projection pipeline
bitzoom-canvas.js | Standalone embeddable component: canvas, interaction, rendering
bitzoom-viewer.js | Viewer application: UI, workers, data loading, detail panel, URL state
bitzoom-renderer.js | 5-layer canvas rendering, heatmaps, edge sampling, hit testing
bitzoom-worker.js | Web Worker coordinator, fans out to projection sub-workers
SPEC.md | Algorithm specification: similarity model, blend formula, complexity, open questions
<bz-graph> | Web component: embed a graph with one HTML tag
GitHub | Source, datasets, tests, SNAP converters

Embed a graph with a single HTML element:

<script type="module" src="bz-graph.js"></script>

<bz-graph edges="data/karate.edges" nodes="data/karate.nodes"
          level="2" heatmap="density" legend
          weights="group:8" label-props="group">
</bz-graph>