BitZoom

Optional WebGPU compute acceleration for MinHash projection and topology blending.

Overview

WebGPU acceleration is entirely optional. BitZoom works fully without it — the CPU + Web Worker path handles all dataset sizes. When WebGPU is available, BitZoom offloads the two most expensive pipeline stages to compute shaders: MinHash signature + projection and topology blending. Both are embarrassingly parallel — each node's computation is independent — making them ideal for GPU dispatch.

The entire GPU implementation lives in a single file: bitzoom-gpu.js (671 lines). No .wgsl files — both shaders are embedded as tagged template strings for zero-fetch deployment. The GPU path is a drop-in replacement for the CPU path: same inputs, same outputs, transparent fallback when WebGPU is unavailable.

WebGPU requires Chrome 113+, Edge 113+, or a recent Chromium-based browser. Firefox requires an explicit flag. Safari has partial support. BitZoom detects availability at runtime and falls back to CPU + Web Workers automatically.

Performance

GPU vs CPU timings measured with scripts/bench-gpu.js on an Intel integrated GPU (Arc A-series). Median of 5 runs after 2 warmup iterations (1 run for Amazon due to its size).

Projection (tokenize + MinHash + project)

| Dataset | Nodes | Groups | CPU | GPU | Speedup |
|---|---|---|---|---|---|
| Karate Club | 34 | 4 | 1.5ms | 18ms | 0.08x |
| Epstein | 364 | 5 | 20ms | 19ms | 1.1x |
| BitZoom Source | 433 | 10 | 24ms | 19ms | 1.2x |
| Synth Packages | 1,868 | 8 | 85ms | 22ms | 3.9x |
| MITRE ATT&CK | 4,736 | 10 | 358ms | 61ms | 5.9x |
| Amazon | 367,000 | 4 | 24.1s | 1.7s | 14.3x |

The GPU crossover is around 400 nodes. Below that, buffer allocation and dispatch overhead exceed the compute savings. At 367K nodes, the GPU is 14x faster. The GPU time includes CPU-side tokenization and hashing (unavoidable string work); the actual GPU dispatch accounts for a fraction of the total.

Blend (α=0.5, 5 passes)

| Dataset | Nodes | CPU | GPU | Speedup |
|---|---|---|---|---|
| Karate Club | 34 | 139µs | 13ms | 0.01x |
| Epstein | 364 | 0.8ms | 14ms | 0.06x |
| BitZoom Source | 433 | 1.3ms | 14ms | 0.10x |
| Synth Packages | 1,868 | 2.4ms | 17ms | 0.15x |
| MITRE ATT&CK | 4,736 | 13ms | 34ms | 0.40x |
| Amazon | 367,000 | 1.88s | 407ms | 4.6x |

The CPU blend is simple arithmetic over flat arrays — fast even at moderate sizes. The GPU blend pays ~13ms of fixed overhead (CSR adjacency build, 7 buffer allocations, 5 dispatches with ping-pong, readback) that dominates until the dataset is large enough for compute to overtake. The crossover is between 5K and 367K nodes. At Amazon scale, the GPU is 4.6x faster.

These numbers are hardware-dependent. Discrete GPUs with faster memory bandwidth will show higher speedups. The relative pattern — projection benefits first, blend benefits only at large scale — will hold across hardware.

Architecture

The GPU and CPU paths share the same data flow, diverging only at the compute-intensive stages:

Parse → Build Graph → Tokenize → Hash (uint32) → MinHash + Project → Blend → Quantize

The MinHash + Project and Blend stages run on the GPU; the remaining stages stay on the CPU main thread.

The split is deliberate. Tokenization involves string manipulation and hash-table lookups — serial, branchy work that GPUs handle poorly. MinHash and blend are pure arithmetic over flat arrays — exactly what compute shaders excel at. Quantization is O(n) and fast enough on CPU that the transfer overhead of an additional GPU pass would negate any speedup.

CPU (main thread): Tokenize → hashToken() → Uint32 arrays → upload 5 storage buffers
GPU (compute shaders): MinHash + Project (@workgroup_size(64)) → Topology Blend (ping-pong, 5 passes)
Readback: mapAsync → Quantize (CPU)
Persistent GPU state: device — GPUDevice (1 per session); pipeline, blendPipeline — compiled once
Per-operation buffers: created before dispatch, destroyed after readback — no persistent allocations between operations

The MinHash + Projection Shader

The first compute shader handles both MinHash signature computation and 2D Gaussian projection in a single dispatch. Each GPU thread processes one (node, property group) task.

Shader Bindings

| Binding | Type | Contents |
|---|---|---|
| 0 | storage&lt;read&gt; | Flat token array — all pre-hashed uint32 values concatenated |
| 1 | storage&lt;read&gt; | Task metadata — [offset, count, groupIdx] packed as 3 × u32 per task |
| 2 | storage&lt;read&gt; | Hash parameters — [A[0..127], B[0..127]] (256 i32 values) |
| 3 | storage&lt;read&gt; | Projection matrices — G groups × 2 × 128 floats |
| 4 | storage&lt;read_write&gt; | Output — 2 floats (px, py) per task |

Execution Flow

1. Empty check. If a task has zero tokens (undefined property value), output [0, 0] immediately. This matches the CPU's NaN-sentinel convention: neutral projection, no false clustering.
2. MinHash signature. For tasks with fewer than 12 tokens: standard k-hash MinHash (K=128 hash evaluations per token). For 12+ tokens: One Permutation Hashing with DOPH densification — one hash per token into K=128 bins, then fill empty bins from occupied donors using Fibonacci hashing (i * 2654435761).
3. Z-score normalization. Compute the mean and standard deviation of the 128-slot signature. Degenerate signatures (near-zero variance) short-circuit to [0, 0] — matching the CPU's fallback-std behavior.
4. Gaussian random projection. Dot the normalized signature with the group's 2×128 projection matrix to produce the final (px, py) pair. Each property group has a distinct matrix, seeded deterministically.
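The four steps can be sketched as a CPU reference (a minimal sketch: `A`, `B`, and `proj` stand in for the real hash parameters and projection matrix, a simplified 32-bit hash replaces the shader's modular hash, and only the &lt;12-token standard-MinHash path is shown):

```javascript
// CPU reference for one (node, group) task — mirrors the shader's steps:
// empty check → MinHash signature → z-score → 2D Gaussian projection.
// Math.imul gives a simplified 32-bit hash for brevity, not the real one.
const K = 128;

function projectTask(tokens, A, B, proj) {
  if (tokens.length === 0) return [0, 0];           // empty → neutral output
  // Standard k-hash MinHash (the shader's <12-token path).
  const sig = new Float64Array(K).fill(Infinity);
  for (const t of tokens)
    for (let k = 0; k < K; k++) {
      const h = (Math.imul(A[k], t) + B[k]) >>> 0;
      if (h < sig[k]) sig[k] = h;
    }
  // Z-score normalization over the 128-slot signature.
  let mean = 0;
  for (const s of sig) mean += s;
  mean /= K;
  let varSum = 0;
  for (const s of sig) varSum += (s - mean) ** 2;
  const std = Math.sqrt(varSum / K);
  if (std < 1e-9) return [0, 0];                    // degenerate → neutral
  // Gaussian random projection with the group's 2×K matrix.
  let px = 0, py = 0;
  for (let k = 0; k < K; k++) {
    const z = (sig[k] - mean) / std;
    px += proj[k] * z;
    py += proj[K + k] * z;
  }
  return [px, py];
}
```

Because MinHash takes a minimum over tokens, duplicate tokens cannot change the signature — the function sees token sets, not lists.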

The OPH Threshold

The crossover at 12 tokens is the same as the CPU. Below 12, standard MinHash is cheaper: tc × K hash evaluations vs OPH's tc hashes plus K densification probes. At tc = 12, K = 128: standard = 1,536 hashes; OPH = 12 + ~116 probes ≈ 128. The GPU benefits even more from OPH since the standard path's tc × K inner loop has poor warp utilization when token counts vary across threads in a workgroup.

The Topology Blend Shader

The second compute shader implements the iterative convex blend: (1-α) × property_anchor + α × neighbor_mean. This is the same Jacobi iteration as the CPU, where each pass reads from the previous pass's positions and writes new positions.

Shader Bindings

| Binding | Type | Contents |
|---|---|---|
| 0 | storage&lt;read&gt; | Property anchor X — precomputed weighted centroid per node |
| 1 | storage&lt;read&gt; | Property anchor Y |
| 2 | storage&lt;read&gt; | CSR adjacency offsets — [N+1] u32 values |
| 3 | storage&lt;read&gt; | CSR adjacency targets — flat neighbor index array |
| 4 | storage&lt;read&gt; | Position input — interleaved [px, py] from previous pass |
| 5 | storage&lt;read_write&gt; | Position output — interleaved [px, py] for this pass |
| 6 | uniform | Parameters — { alpha: f32, N: u32 } |
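In JavaScript terms, one pass of this Jacobi iteration looks like the following (a CPU sketch of each shader thread's logic; the isolated-node fallback to the node's own position is an assumption, not confirmed by the source):

```javascript
// One Jacobi blend pass: newPos[i] = (1-α)·anchor[i] + α·mean(neighbor pos).
// Reads only posIn (previous pass), writes only posOut (this pass) — the
// same discipline the ping-pong buffers enforce on the GPU.
function blendPass(anchorX, anchorY, adjOffsets, adjTargets, posIn, posOut, alpha) {
  const n = anchorX.length;
  for (let i = 0; i < n; i++) {
    const start = adjOffsets[i], end = adjOffsets[i + 1];
    // Isolated nodes fall back to their own position (an assumption here).
    let mx = posIn[2 * i], my = posIn[2 * i + 1];
    if (end > start) {
      mx = 0; my = 0;
      for (let e = start; e < end; e++) {
        const j = adjTargets[e];       // neighbor index from CSR targets
        mx += posIn[2 * j];
        my += posIn[2 * j + 1];
      }
      mx /= end - start;
      my /= end - start;
    }
    posOut[2 * i] = (1 - alpha) * anchorX[i] + alpha * mx;
    posOut[2 * i + 1] = (1 - alpha) * anchorY[i] + alpha * my;
  }
}
```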

CSR Adjacency

The CPU builds a compressed sparse row (CSR) representation of the adjacency list before uploading. This replaces the JavaScript object-of-arrays adjacency list with two flat typed arrays that the GPU can traverse efficiently:

adjOffsets: [0, 3, 5, 8, ...]   // node i's neighbors start at adjOffsets[i]
adjTargets: [2, 5, 7, 0, 4, ...] // flat list of neighbor indices

Node i's neighbors are adjTargets[adjOffsets[i] .. adjOffsets[i+1]). This is the standard graph-on-GPU representation: O(1) degree lookup, O(degree) neighbor iteration, cache-friendly sequential access.
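Building the two arrays from a plain adjacency list is a counting pass plus a fill pass (a sketch; BitZoom's actual internal adjacency structure may differ from the array-of-arrays assumed here):

```javascript
// Build CSR (offsets + targets) from an array-of-arrays adjacency list.
function buildCSR(adjList) {
  const n = adjList.length;
  // Prefix sum of degrees: offsets[i+1] = offsets[i] + degree(i).
  const offsets = new Uint32Array(n + 1);
  for (let i = 0; i < n; i++) offsets[i + 1] = offsets[i] + adjList[i].length;
  // Fill neighbor indices into one flat array.
  const targets = new Uint32Array(offsets[n]);
  for (let i = 0; i < n; i++) {
    let k = offsets[i];
    for (const j of adjList[i]) targets[k++] = j;
  }
  return { offsets, targets };
}
```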

Buffer Strategy

Ping-Pong Blend Passes

The blend shader runs 5 iterative passes. Each pass reads positions from the previous pass and writes new positions. A naive single-buffer approach would create a read-write race: thread A reading a neighbor's position while thread B overwrites it in the same pass.

The solution is two position buffers (A and B) with alternating bind groups:

Pass 0: read A → write B · Pass 1: read B → write A · Pass 2: read A → write B · …

Ping-pong: even passes read A/write B, odd passes read B/write A. No read-write races.

Two bind groups are created at setup, pre-wired with the swapped buffer assignments. Each pass selects the bind group for its parity. The final result lives in whichever buffer was last written: passes % 2 === 1 ? bufB : bufA.
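The bind-group selection reduces to pass parity, sketched here as a plain helper (hypothetical; the real code pre-creates two GPUBindGroup objects and indexes them by parity):

```javascript
// Ping-pong schedule: even passes read A/write B, odd passes read B/write A.
// Returns each pass's buffer roles plus which buffer holds the final result.
function pingPongPlan(passes) {
  const plan = [];
  for (let p = 0; p < passes; p++)
    plan.push(p % 2 === 0 ? { read: 'A', write: 'B' } : { read: 'B', write: 'A' });
  // Final positions live in whichever buffer the last pass wrote.
  return { plan, result: passes % 2 === 1 ? 'B' : 'A' };
}
```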

Buffer Lifecycle

Only the GPUDevice and the two compiled GPUComputePipeline objects persist across operations. All data buffers are created before dispatch and destroyed after readback. This prevents GPU memory leaks across dataset changes and keeps the memory footprint proportional to the active dataset.

All buffers enforce a 256-byte minimum size for GPU alignment requirements:

const size = Math.max(256, data.byteLength);
const buf = device.createBuffer({ size, usage, mappedAtCreation: true });

Overflow-Safe Modular Arithmetic

The MinHash universal hash function computes (a × token + b) mod p where p = 2^31 − 1. On the CPU, JavaScript's 64-bit floats give 53 bits of safe integer precision, and the code splits the token into two 16-bit halves to keep partial products under 2^47.

On the GPU, WGSL provides only u32 arithmetic — 32 bits. The product a × token can reach 2^62, far exceeding u32 range. The shader solves this by splitting both operands into 16-bit halves and reducing each partial product through the Mersenne fast-mod before accumulating:

fn mulMod(a: u32, b: u32) -> u32 {
  let bHi = b >> 16u;  let bLo = b & 0xFFFFu;
  let aHi = a >> 16u;  let aLo = a & 0xFFFFu;

  // Step 1: hi = (a * bHi) mod P
  let p1 = aHi * bHi;                       // < 2^32
  var hi = mersMod(p1 << 16u);              // low bits
  hi = mersMod(hi + (p1 >> 16u) * 2u);     // carry: 2^32 ≡ 2 (mod P)
  let p2 = aLo * bHi;
  hi = mersMod(hi + mersMod(p2));           // reduce before add

  // Step 2: (hi * 2^16 + a * bLo) mod P
  // ... same decomposition for the low half
  return r;
}

The key identity exploited throughout: for the Mersenne prime p = 2^31 − 1, 2^32 ≡ 2 (mod p). When a partial product overflows 32 bits, the carry bits (extracted via x >> 16) are multiplied by 2 and added back. Every intermediate sum is reduced by mersMod before the next addition, ensuring no u32 overflow at any step.

The CPU's mersMod uses float division (x / 0x80000000) | 0 instead of x >>> 31 because JavaScript bitwise operators truncate to 32 bits, losing high bits for values above 2^32. The GPU doesn't have this problem — WGSL u32 is truly 32-bit — so the shader uses standard bit shifts: (x & P) + (x >> 31u).
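For reference, the CPU-side version of this arithmetic can be sketched in JavaScript (function names and structure are illustrative, not lifted from bitzoom-gpu.js):

```javascript
const P = 0x7fffffff; // Mersenne prime 2^31 - 1

// Fast mod via the identity 2^31 ≡ 1 (mod P). Float division extracts the
// high bits, since JS bitwise ops would truncate x to 32 bits first.
function mersMod(x) {
  let r = x % 0x80000000 + ((x / 0x80000000) | 0);
  if (r >= P) r -= P;   // one conditional subtraction fully reduces
  return r;
}

// (a·token + b) mod P with token split into 16-bit halves, so every partial
// product stays under 2^47 — well inside float64's 53-bit integer range.
function hashMod(a, b, token) {
  const hi = token >>> 16, lo = token & 0xffff;
  // a < 2^31, hi < 2^16 → a·hi < 2^47; reduce it, then shift by 2^16.
  const t = mersMod(a * hi) * 0x10000 + a * lo + b; // < 2^48, still exact
  return mersMod(t);
}
```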

Precision: float32 vs float64

GPU compute uses f32 throughout. The CPU pipeline uses Float64Array for projections and blend. This precision gap has measurable but controlled consequences:

| Dataset | Nodes | Groups | Max Δ (projection) |
|---|---|---|---|
| Karate Club | 34 | 4 | 0.000031 |
| Epstein | 364 | 5 | 0.003945 |
| BitZoom Source | 433 | 10 | 0.000183 |
| MITRE ATT&CK | 4,736 | 10 | 0.000053 |

Maximum deltas are under 0.004 across all tested datasets. After Gaussian quantization into 65,536 grid cells, this produces zero cell-level mismatches: the float32 error is smaller than the quantization step size.

Rank Quantization Exception

Rank quantization sorts all node projections and assigns grid coordinates by rank order. Small float32 perturbations can swap the order of nodes with near-identical projections, causing visible cell jumps. For this reason, rank quantization forces the CPU path — the GPU is used only with Gaussian quantization, where the continuous CDF mapping absorbs small precision differences.

Integration

The GPU integrates at three levels:

Initialization

import { initGPU } from './bitzoom-gpu.js';

const gpuAvailable = await initGPU();
// Requests adapter + device, compiles both shader modules.
// Returns false on any failure — caller uses CPU path.

The initGPU() call is idempotent: subsequent calls return true immediately if the device is already acquired. Both compute pipelines (MinHash and blend) are compiled at init time, not on first use.

Projection (Full Pipeline)

import { computeProjectionsGPU } from './bitzoom-gpu.js';

// Same interface as CPU computeProjections:
const { projBuf, groupNames } = await computeProjectionsGPU(
  nodeArray, adjGroups, groupNames, hasEdgeTypes, extraPropNames, numericBins
);
// projBuf: Float64Array wrapping GPU Float32 results
// groupNames: unchanged

Tokenization and hashing happen on the CPU. The function packs all (node × group) tasks into flat typed arrays, dispatches one GPU call, and unpacks the results into the same projBuf format the CPU produces.
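The packing step might look like this (hypothetical helper; the input shapes `tokensPerTask` and `groupOfTask` are assumptions, not the real internal types):

```javascript
// Pack (node × group) tasks into the flat arrays the shader consumes:
// one concatenated token buffer plus [offset, count, groupIdx] metadata.
function packTasks(tokensPerTask, groupOfTask) {
  let total = 0;
  for (const toks of tokensPerTask) total += toks.length;
  const flatTokens = new Uint32Array(total);               // binding 0
  const meta = new Uint32Array(tokensPerTask.length * 3);  // binding 1
  let off = 0;
  tokensPerTask.forEach((toks, i) => {
    flatTokens.set(toks, off);
    meta[3 * i] = off;                 // offset into flatTokens
    meta[3 * i + 1] = toks.length;     // token count for this task
    meta[3 * i + 2] = groupOfTask[i];  // property group index
    off += toks.length;
  });
  return { flatTokens, meta };
}
```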

Blend (Weight Changes)

import { gpuUnifiedBlend } from './bitzoom-gpu.js';

// Drop-in for CPU unifiedBlend — same signature:
await gpuUnifiedBlend(
  nodes, groupNames, propWeights, smoothAlpha,
  adjList, nodeIndexFull, passes, quantMode, quantStats
);
// Modifies nodes[i].px, .py, .gx, .gy in place

The blend wrapper computes property anchors and builds CSR adjacency on the CPU, runs the iterative Jacobi passes on the GPU, reads back the positions, and quantizes on the CPU. The caller doesn't need to know which path ran.

Fallback Strategy

The GPU is never required. The system has two complete execution paths:

GPU available?
  yes → GPU main-thread path: computeProjectionsGPU + gpuUnifiedBlend
  no → CPU Web Worker path: runPipeline + unifiedBlend (up to 3 sub-workers)

Path A (GPU on): Parse and build graph on CPU main thread. Tokenize and hash on CPU. Project on GPU. Blend on GPU. Quantize on CPU. The viewer's GPU toggle (G button) controls this path.

Path B (GPU off or unavailable): Ship the entire pipeline to a Web Worker coordinator, which fans out to up to 3 projection sub-workers. Blend runs on CPU. This path existed before the GPU implementation and remains the default for browsers without WebGPU.

The toggle is live: the user can switch between GPU and CPU at any time. Switching to GPU re-projects the current dataset through computeProjectionsGPU; switching to CPU re-dispatches to the worker pool. Subsequent weight/alpha changes use whichever path is active.

Prior Art

GPU-accelerated MinHash implementations exist primarily in the CUDA ecosystem. We are not aware of any prior WebGPU/WGSL implementation of MinHash.

CUDA

The most established GPU MinHash projects run on NVIDIA hardware:

ProjectDescription
minhashcuda Weighted MinHash on multi-GPU CUDA. Reports 600–1000x speedup over numpy+MKL.
vminhash Vectorized MinHash with optional GPU support via CuPy.
cuLSH CUDA LSH library covering MinHash (Jaccard), random projection (cosine), and p-stable (Euclidean) hash families.
FED GPU-accelerated dataset deduplication framework using MinHash. Scaled to trillion-token deduplication.
RAPIDS cuML Includes MinHash-based approximate nearest neighbor search as part of NVIDIA's RAPIDS data science toolkit.

GPU MinHash also appears in bioinformatics, where CUDA implementations of the Mash algorithm accelerate genome similarity estimation over k-mer sets.

Other GPU APIs

No MinHash implementations were found for Vulkan compute, Metal, or OpenCL. A few OpenCL LSH implementations exist in bioinformatics, but they focus on the index lookup side rather than signature computation. Academic work on GPU locality-sensitive hashing and GPU LSH for beam search covers related ground but not MinHash specifically.

WebGPU

As of early 2026, no public WebGPU or WGSL MinHash implementation has been identified in open-source repositories, academic papers, or technical blog posts. The WebGPU compute ecosystem is still young — the spec reached 1.0 in 2023 and browser support remains incomplete — so most non-graphics compute work has focused on ML inference and physics simulation.

BitZoom's shader combines MinHash signature computation, OPH densification, z-score normalization, and Gaussian random projection in a single compute dispatch. The CUDA projects above typically separate these into distinct kernel launches.

Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| float32 precision only | Rank quantization produces visible cell jumps | GPU path disabled for rank quantization; Gaussian quantization unaffected |
| Buffers created per-operation | Allocation overhead on every weight change | Acceptable for interactive use (<5ms allocation for 5K nodes); persistent buffers would add complexity for marginal gain |
| No render pipeline | Rendering stays on Canvas 2D | Canvas 2D is sufficient for current dataset sizes; GPU rendering would help only for 100K+ visible nodes |
| Browser support | Firefox, older Safari excluded | Automatic CPU fallback; GPU is an acceleration, not a requirement |
| Single GPU dispatch per operation | Cannot overlap MinHash and blend | Blend depends on projection results; pipeline ordering is inherent |
