Optional WebGPU compute acceleration for MinHash projection and topology blending.
WebGPU acceleration is entirely optional. BitZoom works fully without it — the CPU + Web Worker path handles all dataset sizes. When WebGPU is available, BitZoom offloads the two most expensive pipeline stages to compute shaders: MinHash signature + projection and topology blending. Both are embarrassingly parallel — each node's computation is independent — making them ideal for GPU dispatch.
The entire GPU implementation lives in a single file: bitzoom-gpu.js (671 lines). No .wgsl files — both shaders are embedded as tagged template strings for zero-fetch deployment. The GPU path is a drop-in replacement for the CPU path: same inputs, same outputs, transparent fallback when WebGPU is unavailable.
MinHash + projection timings, GPU vs CPU, measured with scripts/bench-gpu.js on an Intel integrated GPU (Arc A-series). Median of 5 runs after 2 warmup iterations (1 run for Amazon due to its size).
| Dataset | Nodes | Groups | CPU | GPU | Speedup |
|---|---|---|---|---|---|
| Karate Club | 34 | 4 | 1.5ms | 18ms | 0.08x |
| Epstein | 364 | 5 | 20ms | 19ms | 1.1x |
| BitZoom Source | 433 | 10 | 24ms | 19ms | 1.2x |
| Synth Packages | 1,868 | 8 | 85ms | 22ms | 3.9x |
| MITRE ATT&CK | 4,736 | 10 | 358ms | 61ms | 5.9x |
| Amazon | 367,000 | 4 | 24.1s | 1.7s | 14.3x |
The GPU crossover is around 400 nodes. Below that, buffer allocation and dispatch overhead exceed the compute savings. At 367K nodes, the GPU is 14x faster. The GPU time includes CPU-side tokenization and hashing (unavoidable string work); the actual GPU dispatch accounts for a fraction of the total.
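For callers that want automatic path selection, the measured crossover suggests a simple size cutoff. The sketch below is hypothetical (BitZoom itself leaves the choice to the user's GPU toggle); the 400-node threshold comes from the benchmark table above.

```js
// Hypothetical heuristic: below ~400 nodes, buffer allocation and dispatch
// overhead exceed the GPU's compute savings, so stay on the CPU path.
const GPU_MIN_NODES = 400;

function pickProjectionPath(nodeCount, gpuAvailable) {
  if (!gpuAvailable) return 'cpu';
  return nodeCount >= GPU_MIN_NODES ? 'gpu' : 'cpu';
}
```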
Topology blend timings, measured the same way:

| Dataset | Nodes | CPU | GPU | Speedup |
|---|---|---|---|---|
| Karate Club | 34 | 139µs | 13ms | 0.01x |
| Epstein | 364 | 0.8ms | 14ms | 0.06x |
| BitZoom Source | 433 | 1.3ms | 14ms | 0.10x |
| Synth Packages | 1,868 | 2.4ms | 17ms | 0.15x |
| MITRE ATT&CK | 4,736 | 13ms | 34ms | 0.40x |
| Amazon | 367,000 | 1.88s | 407ms | 4.6x |
The CPU blend is simple arithmetic over flat arrays — fast even at moderate sizes. The GPU blend pays ~13ms of fixed overhead (CSR adjacency build, 7 buffer allocations, 5 dispatches with ping-pong, readback) that dominates until the dataset is large enough for compute to overtake. The crossover is between 5K and 367K nodes. At Amazon scale, the GPU is 4.6x faster.
The GPU and CPU paths share the same data flow (parse, tokenize, project, blend, quantize), diverging only at the two compute-intensive stages: MinHash projection and topology blending.
The split is deliberate. Tokenization involves string manipulation and hash-table lookups — serial, branchy work that GPUs handle poorly. MinHash and blend are pure arithmetic over flat arrays — exactly what compute shaders excel at. Quantization is O(n) and fast enough on CPU that the transfer overhead of an additional GPU pass would negate any speedup.
The first compute shader handles both MinHash signature computation and 2D Gaussian projection in a single dispatch. Each GPU thread processes one (node, property group) task.
| Binding | Type | Contents |
|---|---|---|
| 0 | storage<read> | Flat token array — all pre-hashed uint32 values concatenated |
| 1 | storage<read> | Task metadata — [offset, count, groupIdx] packed as 3 × u32 per task |
| 2 | storage<read> | Hash parameters — [A[0..127], B[0..127]] (256 i32 values) |
| 3 | storage<read> | Projection matrices — G groups × 2 × 128 floats |
| 4 | storage<read_write> | Output — 2 floats (px, py) per task |
Edge cases mirror the CPU path. A task with zero tokens writes [0, 0] immediately, matching the CPU's NaN-sentinel convention: neutral projection, no false clustering. Hash parameters are derived deterministically from the hash index (i * 2654435761). A zero-variance signature also writes [0, 0] — matching the CPU's fallback-std behavior.
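The CPU-side packing for bindings 0 and 1 can be sketched as follows. The helper name and the task shape are assumptions for illustration, not the bitzoom-gpu.js API:

```js
// Flatten per-(node, group) token lists into the two shader inputs:
// one concatenated token array, plus [offset, count, groupIdx] per task.
function packTasks(tasks) { // tasks: Array<{ tokens: Uint32Array, groupIdx: number }>
  let total = 0;
  for (const t of tasks) total += t.tokens.length;
  const tokenBuf = new Uint32Array(total);          // binding 0
  const meta = new Uint32Array(tasks.length * 3);   // binding 1
  let offset = 0;
  tasks.forEach((t, i) => {
    tokenBuf.set(t.tokens, offset);
    meta.set([offset, t.tokens.length, t.groupIdx], i * 3);
    offset += t.tokens.length;
  });
  return { tokenBuf, meta };
}
```

Each GPU thread then reads its triple from the metadata array to find its token slice.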
The 12-token crossover matches the CPU path. Below 12, standard MinHash is cheaper: tc × K hash evaluations vs OPH's tc hashes plus K densification probes. At tc = 12, K = 128: standard = 1,536 hashes; OPH = 12 + ~116 probes ≈ 128. The GPU benefits even more from OPH since the standard path's tc × K inner loop has poor warp utilization when token counts vary across threads in a workgroup.
The second compute shader implements the iterative convex blend: (1-α) × property_anchor + α × neighbor_mean. This is the same Jacobi iteration as the CPU, where each pass reads from the previous pass's positions and writes new positions.
| Binding | Type | Contents |
|---|---|---|
| 0 | storage<read> | Property anchor X — precomputed weighted centroid per node |
| 1 | storage<read> | Property anchor Y |
| 2 | storage<read> | CSR adjacency offsets — [N+1] u32 values |
| 3 | storage<read> | CSR adjacency targets — flat neighbor index array |
| 4 | storage<read> | Position input — interleaved [px, py] from previous pass |
| 5 | storage<read_write> | Position output — interleaved [px, py] for this pass |
| 6 | uniform | Parameters — { alpha: f32, N: u32 } |
The CPU builds a compressed sparse row (CSR) representation of the adjacency list before uploading. This replaces the JavaScript object-of-arrays adjacency list with two flat typed arrays that the GPU can traverse efficiently:
```js
adjOffsets: [0, 3, 5, 8, ...]    // node i's neighbors start at adjOffsets[i]
adjTargets: [2, 5, 7, 0, 4, ...] // flat list of neighbor indices
```
Node i's neighbors are adjTargets[adjOffsets[i] .. adjOffsets[i+1]). This is the standard graph-on-GPU representation: O(1) degree lookup, O(degree) neighbor iteration, cache-friendly sequential access.
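A minimal sketch of the CSR build, assuming the adjacency list has already been normalized to an array of neighbor-index arrays (the helper name is illustrative):

```js
// Convert a neighbor-list (array of arrays of node indices) into CSR form:
// offsets[i]..offsets[i+1] delimits node i's slice of the flat targets array.
function buildCSR(adj) {
  const n = adj.length;
  const offsets = new Uint32Array(n + 1);
  for (let i = 0; i < n; i++) offsets[i + 1] = offsets[i] + adj[i].length;
  const targets = new Uint32Array(offsets[n]);
  for (let i = 0; i < n; i++) targets.set(adj[i], offsets[i]);
  return { offsets, targets };
}
```

Both arrays upload directly as storage buffers (bindings 2 and 3 above).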
The blend shader runs 5 iterative passes. Each pass reads positions from the previous pass and writes new positions. A naive single-buffer approach would create a read-write race: thread A reading a neighbor's position while thread B overwrites it in the same pass.
The solution is two position buffers (A and B) with alternating bind groups:
Two bind groups are created at setup, pre-wired with the swapped buffer assignments. Each pass selects the bind group for its parity. The final result lives in whichever buffer was last written: passes % 2 === 1 ? bufB : bufA.
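The per-node update each pass applies is the same on both paths. A CPU reference sketch (hypothetical names; the isolated-node fallback of keeping the previous position is an assumption):

```js
// One Jacobi blend pass: out[i] = (1-alpha) * anchor[i] + alpha * neighborMean.
// pos/out are interleaved [px, py]; offsets/targets are the CSR adjacency.
function blendPass(out, pos, anchorX, anchorY, offsets, targets, alpha) {
  const n = anchorX.length;
  for (let i = 0; i < n; i++) {
    let mx = 0, my = 0;
    const start = offsets[i], end = offsets[i + 1];
    for (let j = start; j < end; j++) {
      mx += pos[targets[j] * 2];
      my += pos[targets[j] * 2 + 1];
    }
    const deg = end - start;
    if (deg > 0) { mx /= deg; my /= deg; }
    else { mx = pos[i * 2]; my = pos[i * 2 + 1]; } // isolated node: hold position
    out[i * 2] = (1 - alpha) * anchorX[i] + alpha * mx;
    out[i * 2 + 1] = (1 - alpha) * anchorY[i] + alpha * my;
  }
}
```

Because `out` and `pos` are distinct buffers, the pass is race-free when each iteration runs as one GPU thread per node — exactly the ping-pong arrangement described above.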
Only two objects persist across operations: the GPUDevice and the two compiled GPUComputePipeline objects. All data buffers are created before dispatch and destroyed after readback. This prevents GPU memory leaks across dataset changes and keeps the memory footprint proportional to the active dataset.
All buffers enforce a 256-byte minimum size for GPU alignment requirements:
```js
const size = Math.max(256, data.byteLength);
const buf = device.createBuffer({ size, usage, mappedAtCreation: true });
```
The MinHash universal hash function computes (a × token + b) mod p where p = 2³¹ − 1. On the CPU, JavaScript's 64-bit floats give 53 bits of safe integer precision, and the code splits the token into two 16-bit halves to keep partial products under 2⁴⁷.

On the GPU, WGSL provides only u32 arithmetic — 32 bits. The product a × token can reach 2⁶², far exceeding u32 range. The shader solves this by splitting both operands into 16-bit halves and reducing each partial product through the Mersenne fast-mod before accumulating:
```wgsl
fn mulMod(a: u32, b: u32) -> u32 {
  let bHi = b >> 16u; let bLo = b & 0xFFFFu;
  let aHi = a >> 16u; let aLo = a & 0xFFFFu;
  // Step 1: hi = (a * bHi) mod P
  let p1 = aHi * bHi;                   // < 2^32
  var hi = mersMod(p1 << 16u);          // low bits
  hi = mersMod(hi + (p1 >> 16u) * 2u);  // carry: 2^32 ≡ 2 (mod P)
  let p2 = aLo * bHi;
  hi = mersMod(hi + mersMod(p2));       // reduce before add
  // Step 2: (hi * 2^16 + a * bLo) mod P
  // ... same decomposition for the low half
  return r;
}
```
The key identity exploited throughout: for the Mersenne prime p = 2³¹ − 1, 2³² ≡ 2 (mod p). When a partial product overflows 32 bits, the carry bits (extracted via x >> 16) are multiplied by 2 and added back. Every intermediate sum is reduced by mersMod before the next addition, ensuring no u32 overflow at any step.
mersMod uses float division (x / 0x80000000) | 0 instead of x >>> 31 because JavaScript bitwise operators truncate to 32 bits, losing high bits for values above 2³². The GPU doesn't have this problem — WGSL u32 is truly 32-bit — so the shader uses standard bit shifts: (x & P) + (x >> 31u).
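A CPU-side float64 sketch of the same reduction (hypothetical helper names, not the bitzoom-gpu.js code) makes the identity easy to verify against exact BigInt arithmetic. It folds at 2³¹ ≡ 1 (mod p), which is the same identity the shader uses in the form 2³² ≡ 2:

```js
const P = 0x7fffffff; // Mersenne prime 2^31 - 1

// Fold x mod P using 2^31 ≡ 1 (mod P). Exact for integer x < 2^53.
function mersMod(x) {
  while (x > P) x = (x % 0x80000000) + Math.floor(x / 0x80000000);
  return x === P ? 0 : x;
}

// (a * b) mod P for a, b < 2^31 without leaving float64's 53-bit safe range:
// split b into 16-bit halves so each partial product stays below 2^47.
function mulMod(a, b) {
  const bHi = Math.floor(b / 0x10000);
  const bLo = b % 0x10000;
  const hi = mersMod(a * bHi);            // a * bHi < 2^46, exact
  return mersMod(hi * 0x10000 + a * bLo); // both terms < 2^47, sum < 2^48
}
```

Every intermediate stays an exact integer below 2⁵³, mirroring how the shader keeps every intermediate below 2³².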
GPU compute uses f32 throughout. The CPU pipeline uses Float64Array for projections and blend. This precision gap has measurable but controlled consequences:
| Dataset | Nodes | Groups | Max Δ (projection) |
|---|---|---|---|
| Karate Club | 34 | 4 | 0.000031 |
| Epstein | 364 | 5 | 0.003945 |
| BitZoom Source | 433 | 10 | 0.000183 |
| MITRE ATT&CK | 4,736 | 10 | 0.000053 |
Maximum deltas are under 0.004 across all tested datasets. After Gaussian quantization into 65,536 grid cells, this produces zero cell-level mismatches: the float32 error is smaller than the quantization step size.
Rank quantization sorts all node projections and assigns grid coordinates by rank order. Small float32 perturbations can swap the order of nodes with near-identical projections, causing visible cell jumps. For this reason, rank quantization forces the CPU path — the GPU is used only with Gaussian quantization, where the continuous CDF mapping absorbs small precision differences.
The GPU integrates at three levels:
```js
import { initGPU } from './bitzoom-gpu.js';

const gpuAvailable = await initGPU();
// Requests adapter + device, compiles both shader modules.
// Returns false on any failure — caller uses CPU path.
```
The initGPU() call is idempotent: subsequent calls return true immediately if the device is already acquired. Both compute pipelines (MinHash and blend) are compiled at init time, not on first use.
```js
import { computeProjectionsGPU } from './bitzoom-gpu.js';

// Same interface as CPU computeProjections:
const { projBuf, groupNames } = await computeProjectionsGPU(
  nodeArray, adjGroups, groupNames, hasEdgeTypes, extraPropNames, numericBins
);
// projBuf: Float64Array wrapping GPU Float32 results
// groupNames: unchanged
```
Tokenization and hashing happen on the CPU. The function packs all (node × group) tasks into flat typed arrays, dispatches one GPU call, and unpacks the results into the same projBuf format the CPU produces.
```js
import { gpuUnifiedBlend } from './bitzoom-gpu.js';

// Drop-in for CPU unifiedBlend — same signature:
await gpuUnifiedBlend(
  nodes, groupNames, propWeights, smoothAlpha,
  adjList, nodeIndexFull, passes, quantMode, quantStats
);
// Modifies nodes[i].px, .py, .gx, .gy in place
```
The blend wrapper computes property anchors and builds CSR adjacency on the CPU, runs the iterative Jacobi passes on the GPU, reads back the positions, and quantizes on the CPU. The caller doesn't need to know which path ran.
The GPU is never required. The system has two complete execution paths:
Path A (GPU on): Parse and build graph on CPU main thread. Tokenize and hash on CPU. Project on GPU. Blend on GPU. Quantize on CPU. The viewer's GPU toggle (G button) controls this path.
Path B (GPU off or unavailable): Ship the entire pipeline to a Web Worker coordinator, which fans out to up to 3 projection sub-workers. Blend runs on CPU. This path existed before the GPU implementation and remains the default for browsers without WebGPU.
The toggle is live: the user can switch between GPU and CPU at any time. Switching to GPU re-projects the current dataset through computeProjectionsGPU; switching to CPU re-dispatches to the worker pool. Subsequent weight/alpha changes use whichever path is active.
GPU-accelerated MinHash implementations exist primarily in the CUDA ecosystem. We are not aware of any prior WebGPU/WGSL implementation of MinHash.
The most established GPU MinHash projects run on NVIDIA hardware:
| Project | Description |
|---|---|
| minhashcuda | Weighted MinHash on multi-GPU CUDA. Reports 600–1000x speedup over numpy+MKL. |
| vminhash | Vectorized MinHash with optional GPU support via CuPy. |
| cuLSH | CUDA LSH library covering MinHash (Jaccard), random projection (cosine), and p-stable (Euclidean) hash families. |
| FED | GPU-accelerated dataset deduplication framework using MinHash. Scaled to trillion-token deduplication. |
| RAPIDS cuML | Includes MinHash-based approximate nearest neighbor search as part of NVIDIA's RAPIDS data science toolkit. |
GPU MinHash also appears in bioinformatics, where CUDA implementations of the Mash algorithm accelerate genome similarity estimation over k-mer sets.
No MinHash implementations were found for Vulkan compute, Metal, or OpenCL. A few OpenCL LSH implementations exist in bioinformatics, but they focus on the index lookup side rather than signature computation. Academic work on GPU locality-sensitive hashing and GPU LSH for beam search covers related ground but not MinHash specifically.
As of early 2026, no public WebGPU or WGSL MinHash implementation has been identified in open-source repositories, academic papers, or technical blog posts. The WebGPU compute ecosystem is still young — the spec reached 1.0 in 2023 and browser support remains incomplete — so most non-graphics compute work has focused on ML inference and physics simulation.
BitZoom's shader combines MinHash signature computation, OPH densification, z-score normalization, and Gaussian random projection in a single compute dispatch. The CUDA projects above typically separate these into distinct kernel launches.
| Limitation | Impact | Mitigation |
|---|---|---|
| float32 precision only | Rank quantization produces visible cell jumps | GPU path disabled for rank quantization; Gaussian quantization unaffected |
| Buffers created per-operation | Allocation overhead on every weight change | Acceptable for interactive use (<5ms allocation for 5K nodes); persistent buffers would add complexity for marginal gain |
| No render pipeline | Rendering stays on Canvas 2D | Canvas 2D is sufficient for current dataset sizes; GPU rendering would help only for 100K+ visible nodes |
| Browser support | Firefox, older Safari excluded | Automatic CPU fallback; GPU is an acceleration, not a requirement |
| Single GPU dispatch per operation | Cannot overlap MinHash and blend | Blend depends on projection results; pipeline ordering is inherent |