Think of this repo as “how you’d write a Unity ComputeShader-based k-means color quantizer”, just done in Rust + wgpu + WGSL instead of C# + HLSL.
Because my access to the actual source files is limited by the environment, I can't quote the exact shader code. But from the README and the references the author links (especially the WebGPU prefix-sum articles and the kmeans-colors crate), we can reconstruct how it works and how you'd approach the same thing in Unity.
Standard k-means for color quantization:
- Take all pixels as points in color space (often RGB or CIELAB).
- Choose k initial cluster centers (centroids).
- Repeat for N iterations or until convergence:
  - Assignment step: for each pixel, find the nearest centroid and assign the pixel to that cluster.
  - Update step: for each cluster, average all pixels assigned to it → new centroid color.
- Replace each pixel’s color with its cluster’s centroid color (and optionally dither).
On CPU, you just loop over all pixels and do this in nested loops. On GPU, you break it into a few compute passes.
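The CPU version of that loop can be sketched in plain Rust (a naive reference implementation of the textbook algorithm, not the repo's code; the `kmeans_step` name and `[f32; 3]` color representation are my own choices):

```rust
// Naive CPU k-means over RGB colors: one full iteration per call.
// Each pixel is a point in RGB space; `centroids` holds the k current centers.

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// One k-means iteration: assignment step + update step.
/// Returns the per-pixel cluster assignments.
fn kmeans_step(pixels: &[[f32; 3]], centroids: &mut Vec<[f32; 3]>) -> Vec<usize> {
    // Assignment: nearest centroid per pixel (squared Euclidean distance).
    let assignments: Vec<usize> = pixels
        .iter()
        .map(|&p| {
            (0..centroids.len())
                .min_by(|&a, &b| {
                    dist2(p, centroids[a]).partial_cmp(&dist2(p, centroids[b])).unwrap()
                })
                .unwrap()
        })
        .collect();

    // Update: average of all pixels assigned to each cluster -> new centroid.
    let k = centroids.len();
    let mut sums = vec![[0.0f32; 3]; k];
    let mut counts = vec![0usize; k];
    for (&p, &c) in pixels.iter().zip(&assignments) {
        for i in 0..3 {
            sums[c][i] += p[i];
        }
        counts[c] += 1;
    }
    for c in 0..k {
        if counts[c] > 0 {
            for i in 0..3 {
                centroids[c][i] = sums[c][i] / counts[c] as f32;
            }
        } // else: keep the old centroid for empty clusters
    }
    assignments
}
```

The GPU passes described below are exactly this loop with the assignment step parallelized per pixel and the update step per cluster.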
The project uses:
- Rust host code + wgpu (Rust's WebGPU implementation) to manage GPU resources.
- WGSL compute shaders to do the heavy lifting on the GPU.
For a Unity dev, the mapping is roughly:
| This project | Unity equivalent |
|---|---|
| Rust + wgpu | C# + ComputeShader, CommandBuffer |
| WGSL compute shader | HLSL compute shader (.compute) |
| Bind groups, layouts | Shader variables / SetBuffer, SetTexture |
| 2D sampled texture | Texture2D / RenderTexture |
| Storage / uniform buffer | ComputeBuffer / constant buffer |
The README notes it loads the image as a texture and is limited by GPU texture size (8192×8192). That's exactly like Unity: you upload a Texture2D to the GPU and have a shader read from it.
For k-means on colors you usually need these GPU-side resources:
- Input image texture
  - 2D texture (e.g. rgba8) containing the original image colors.
  - In Unity: Texture2D or RenderTexture bound as Texture2D<float4>.
- Per-pixel assignment buffer
  - One uint per pixel: which cluster this pixel belongs to.
  - In Unity: RWStructuredBuffer<uint> _Assignments;
- Cluster centers buffer
  - k colors (float3 or float4), the current centroids.
  - In Unity: RWStructuredBuffer<float4> _Centroids;
- Accumulation buffers for the update step
  - To compute new centroids you need, for each cluster:
    - Sum of all pixel colors in that cluster (float3/float4).
    - Count of pixels in that cluster (uint).
  - In Unity: RWStructuredBuffer<float4> _ClusterSums; and RWStructuredBuffer<uint> _ClusterCounts;
- Config / constants buffer
  - Image size, k, iteration count, etc.
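A CPU-side mirror of those resources can be sketched in Rust to make the sizes concrete (the struct and field names are illustrative, not the repo's actual identifiers; sizes assume rgba32float centroids and one u32 per pixel):

```rust
/// CPU-side mirror of the GPU buffers for a width*height image and k clusters.
/// Field names are hypothetical, chosen to match the Unity names in the text.
struct KMeansResources {
    assignments: Vec<u32>,       // one cluster index per pixel
    centroids: Vec<[f32; 4]>,    // k current centers (rgba)
    cluster_sums: Vec<[f32; 4]>, // per-cluster color accumulator
    cluster_counts: Vec<u32>,    // per-cluster pixel counter
}

impl KMeansResources {
    fn new(width: usize, height: usize, k: usize) -> Self {
        Self {
            assignments: vec![0; width * height],
            centroids: vec![[0.0; 4]; k],
            cluster_sums: vec![[0.0; 4]; k],
            cluster_counts: vec![0; k],
        }
    }

    /// Rough GPU memory footprint of these buffers, in bytes.
    /// The per-pixel assignment buffer dominates; everything else scales with k.
    fn gpu_bytes(&self) -> usize {
        self.assignments.len() * 4
            + (self.centroids.len() + self.cluster_sums.len()) * 16
            + self.cluster_counts.len() * 4
    }
}
```

This also shows why the 8192×8192 texture limit matters: the assignment buffer alone is 4 bytes per pixel, while the k-sized buffers are negligible.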
He also mentions reading about prefix sums, so there's probably at least one pass that uses a parallel scan (e.g. for compaction, dithering, or more efficient reductions), but you can implement a simpler version with atomics for cluster sums.
Idea: 1 thread per pixel. Read pixel color, loop over all k centroids, find the closest, write cluster index, atomically accumulate into that cluster’s sum & count.
Pseudo-HLSL (Unity compute style):

```hlsl
#pragma kernel AssignPixels

Texture2D<float4> _Input;
SamplerState _Sampler;
RWStructuredBuffer<float4> _Centroids;   // length = k
RWStructuredBuffer<uint> _Assignments;   // length = width * height
RWStructuredBuffer<float4> _ClusterSums; // length = k (see float-atomics note)
RWStructuredBuffer<uint> _ClusterCounts; // length = k
uint _Width, _Height, _K;

[numthreads(16, 16, 1)]
void AssignPixels(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= _Width || id.y >= _Height) return;

    uint index = id.y * _Width + id.x;
    float2 uv = (float2(id.xy) + 0.5) / float2(_Width, _Height);
    float3 color = _Input.SampleLevel(_Sampler, uv, 0).rgb;

    // Find the nearest centroid (squared Euclidean distance).
    uint bestIdx = 0;
    float bestDist = 1e30;
    for (uint i = 0; i < _K; i++)
    {
        float3 d = color - _Centroids[i].rgb;
        float dist = dot(d, d);
        if (dist < bestDist) { bestDist = dist; bestIdx = i; }
    }

    _Assignments[index] = bestIdx;

    // Accumulate cluster statistics.
    InterlockedAdd(_ClusterCounts[bestIdx], 1);
    // HLSL has no float InterlockedAdd on structured buffers; you'd normally
    // use fixed-point atomics on a RWByteAddressBuffer or group-shared reductions:
    // AtomicAdd(_ClusterSums[bestIdx].rgb, color); // pseudo-call
}
```
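The fixed-point workaround mentioned in that comment can be demonstrated on the CPU: scale each color channel to an integer, accumulate with integer atomics, and divide back out when reading. This is a sketch of the general technique (the `SCALE` of 256 and the function names are my choices, not the repo's):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Fixed-point atomic accumulation: the trick GPU kernels use when the
// shading language lacks float atomics. Each channel is scaled to an
// integer, added with a u32 atomic, and divided back out at the end.
const SCALE: f32 = 256.0;

/// Atomically add a color (in [0, 1] per channel) to a cluster's sum slot.
fn atomic_add_color(sums: &[[AtomicU32; 3]], cluster: usize, color: [f32; 3]) {
    for i in 0..3 {
        sums[cluster][i].fetch_add((color[i] * SCALE) as u32, Ordering::Relaxed);
    }
}

/// Read back the mean color of a cluster, undoing the fixed-point scaling.
fn read_mean_color(sums: &[[AtomicU32; 3]], cluster: usize, count: u32) -> [f32; 3] {
    let mut out = [0.0; 3];
    for i in 0..3 {
        out[i] = sums[cluster][i].load(Ordering::Relaxed) as f32 / SCALE / count as f32;
    }
    out
}
```

The precision cost is bounded by the scale factor (here 1/256 per add), which is usually acceptable for 8-bit color output.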
In WGSL, the author does the same thing: get global invocation ID, compute UV, sample the texture, loop over k cluster centers, write the index, and use atomics or parallel reduction to build cluster statistics.
Unity mental model: exactly like a GPGPU “per-pixel” pass where each thread handles one pixel.
Once every pixel is assigned, you compute the new centroid for each cluster:
```hlsl
#pragma kernel UpdateCentroids

RWStructuredBuffer<float4> _Centroids;
RWStructuredBuffer<float4> _ClusterSums;
RWStructuredBuffer<uint> _ClusterCounts;
uint _K;

[numthreads(64, 1, 1)]
void UpdateCentroids(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= _K) return;

    uint count = _ClusterCounts[i];
    if (count == 0) return; // or keep the old centroid

    float3 sum = _ClusterSums[i].rgb;
    float3 newCenter = sum / (float)count;
    _Centroids[i] = float4(newCenter, 1.0);
}
```
After this pass, the centroids are updated. Then you:
- Clear _ClusterSums and _ClusterCounts (another tiny compute kernel).
- Run Assign → Update → Clear in a loop for N iterations from the CPU side.
In Rust + wgpu this is done by submitting multiple compute passes per iteration; in Unity you'd use:

```csharp
for (int iter = 0; iter < iterations; ++iter)
{
    compute.SetInt("_Iteration", iter);
    compute.Dispatch(clearKernel, (k + 63) / 64, 1, 1);
    compute.Dispatch(assignKernel, (width + 15) / 16, (height + 15) / 16, 1);
    compute.Dispatch(updateKernel, (k + 63) / 64, 1, 1);
}
```
This matches how you’d design it in Unity even if the actual repo expresses it in Rust.
The README shows different modes:
- Simple “replace with k-means color” output.
- Dithered output.
- Palette generation.
These are all variations on a final compute pass:
- Simple recolor:

```hlsl
#pragma kernel ApplyPalette

RWTexture2D<float4> _Output;
StructuredBuffer<uint> _Assignments;
StructuredBuffer<float4> _Centroids;
uint _Width, _Height;

[numthreads(16, 16, 1)]
void ApplyPalette(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= _Width || id.y >= _Height) return;

    uint index = id.y * _Width + id.x;
    uint cluster = _Assignments[index];
    _Output[id.xy] = _Centroids[cluster];
}
```
- Dithering:
  - Same idea, but instead of a straight lookup, it uses a threshold matrix (Bayer / ordered dithering) or error diffusion to decide when to bump to a neighboring centroid, which is why the author links GPU dithering resources.
- Palette image output:
  - Another compute or CPU pass that draws swatches for each centroid into a small texture.
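The ordered-dithering variant can be sketched in Rust against a fixed palette (a minimal illustration of the technique, not necessarily how this repo implements it; a 2×2 Bayer matrix is used for brevity where real implementations typically use 4×4 or 8×8):

```rust
// Ordered (Bayer) dithering against a fixed palette: per pixel, instead of
// always taking the nearest color, a position-dependent threshold decides
// between the two nearest palette entries.
const BAYER2: [[f32; 2]; 2] = [[0.0, 0.5], [0.75, 0.25]];

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Pick a palette index for the pixel at (x, y) with ordered dithering.
fn dithered_index(color: [f32; 3], palette: &[[f32; 3]], x: usize, y: usize) -> usize {
    // Sort palette indices by distance to this color; keep the two nearest.
    let mut order: Vec<usize> = (0..palette.len()).collect();
    order.sort_by(|&a, &b| {
        dist2(color, palette[a]).partial_cmp(&dist2(color, palette[b])).unwrap()
    });
    let (first, second) = (order[0], order[1]);

    // How far the pixel sits between the two candidates (0 = exactly `first`).
    let d1 = dist2(color, palette[first]).sqrt();
    let d2 = dist2(color, palette[second]).sqrt();
    let t = d1 / (d1 + d2 + 1e-6);

    // The Bayer matrix supplies a per-position threshold; pixels near the
    // boundary between two palette colors alternate between them spatially.
    if t > BAYER2[y % 2][x % 2] { second } else { first }
}
```

Colors exactly on a palette entry always map to that entry; in-between colors flip between the two candidates in a fixed spatial pattern, which is what produces the characteristic dither texture.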
He explicitly cites reading material on prefix sums / scans in WGSL. That usually means one of these patterns is in play:
- Efficiently summing per-cluster contributions without hammering global atomics.
- Compaction / reordering operations (less likely needed for simple color k-means).
- Support for more advanced dithering / histogram steps.
In Unity/HLSL, you could:
- Use a workgroup-local prefix sum to reduce per-group sums, then atomically add one value per group to global buffers (classic “reduce in shared memory, then global atomic once per group” pattern).
- Implement the same scan algorithm as the WGSL examples he references, but in HLSL (same ideas; different syntax).
The important bit for you: k-means on GPU is basically “scan + reduce + update” repeated, and prefix sum is the standard building block to make that fast and parallel.
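The scan itself has a simple shape. Here it is serially in Rust, in the Hillis–Steele form the WebGPU prefix-sum articles describe (on the GPU every element is one thread and the two buffers are ping-ponged; the inner loop here just plays the role of the threads):

```rust
// Exclusive prefix sum (scan), Hillis–Steele style: log2(n) passes, each
// adding the element `offset` places back. This is the building block the
// GPU version parallelizes across a workgroup.
fn exclusive_scan(input: &[u32]) -> Vec<u32> {
    if input.is_empty() {
        return Vec::new();
    }
    let n = input.len();
    // Shift right by one so the result is exclusive (output[0] = 0).
    let mut buf: Vec<u32> = std::iter::once(0)
        .chain(input[..n - 1].iter().copied())
        .collect();

    let mut offset = 1;
    while offset < n {
        // Double-buffer: GPU implementations ping-pong between two buffers
        // so every "thread" reads the previous pass's values, not this one's.
        let prev = buf.clone();
        for i in offset..n {
            buf[i] = prev[i] + prev[i - offset];
        }
        offset *= 2;
    }
    buf
}
```

For example, scanning per-cluster counts gives each cluster's starting offset in a pixel-reordering buffer, which is the compaction use case mentioned above.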
If you wanted to replicate this repo in Unity:
- Load the image
  - Texture2D inputTex;
  - Optionally convert to linear color space.
- Allocate GPU resources

  ```csharp
  int pixelCount = width * height;
  var assignments = new ComputeBuffer(pixelCount, sizeof(uint));
  var centroids = new ComputeBuffer(k, sizeof(float) * 4);
  var sums = new ComputeBuffer(k, sizeof(float) * 4);
  var counts = new ComputeBuffer(k, sizeof(uint));
  var outputRT = new RenderTexture(width, height, 0, RenderTextureFormat.ARGBFloat);
  outputRT.enableRandomWrite = true;
  outputRT.Create();
  ```
- Initialize centroids
  - Random pixels, k-means++ on CPU, or even uniform sampling.
- Bind resources to the compute shader

  ```csharp
  var cs = yourComputeShader;
  cs.SetTexture(assignKernel, "_Input", inputTex);
  cs.SetBuffer(assignKernel, "_Centroids", centroids);
  cs.SetBuffer(assignKernel, "_Assignments", assignments);
  cs.SetBuffer(assignKernel, "_ClusterSums", sums);
  cs.SetBuffer(assignKernel, "_ClusterCounts", counts);
  cs.SetInt("_Width", width);
  cs.SetInt("_Height", height);
  // Similarly bind for updateKernel & applyKernel
  ```
- Iterate: clear → assign → update

  ```csharp
  for (int iter = 0; iter < iterations; ++iter)
  {
      cs.Dispatch(clearKernel, (k + 63) / 64, 1, 1);  // zero sums & counts
      cs.Dispatch(assignKernel, (width + 15) / 16, (height + 15) / 16, 1);
      cs.Dispatch(updateKernel, (k + 63) / 64, 1, 1); // recompute centers
  }
  ```
- Apply final palette to output texture

  ```csharp
  cs.Dispatch(applyKernel, (width + 15) / 16, (height + 15) / 16, 1);
  ```

- Use outputRT as your final color-quantized image.
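The "Initialize centroids" step above can be sketched in Rust with a deterministic k-means++-style seeding (this is the farthest-point variant, chosen here to keep the example dependency-free and testable; true k-means++ samples the next center with probability proportional to squared distance):

```rust
fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Deterministic seeding: start from the first pixel, then repeatedly take
/// the pixel farthest from every centroid chosen so far. Like k-means++,
/// this spreads the initial centers across the occupied color space.
fn init_centroids(pixels: &[[f32; 3]], k: usize) -> Vec<[f32; 3]> {
    let mut centroids = vec![pixels[0]];
    while centroids.len() < k {
        let next = pixels
            .iter()
            .max_by(|&&a, &&b| {
                // Each pixel's score is its distance to the NEAREST centroid.
                let da = centroids.iter().map(|&c| dist2(a, c)).fold(f32::MAX, f32::min);
                let db = centroids.iter().map(|&c| dist2(b, c)).fold(f32::MAX, f32::min);
                da.partial_cmp(&db).unwrap()
            })
            .unwrap();
        centroids.push(*next);
    }
    centroids
}
```

Seeding quality matters because plain k-means only converges to a local optimum; well-spread initial centers usually mean fewer iterations and a better palette.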
That’s essentially what kmeans-gpu is doing, just expressed with:
- Rust structs + wgpu pipeline setup,
- WGSL compute shaders instead of HLSL,
- A CLI front-end instead of Unity’s editor/runtime.
“They implemented k-means by loading the image as a GPU texture, then running several compute passes: one kernel that, for each pixel, finds the nearest cluster center and atomically accumulates per-cluster color sums and counts; another kernel that, for each cluster, divides sum by count to update the centroid; repeated for a few iterations. Finally, a pass replaces each pixel with its cluster color (with optional dithering). All of this is orchestrated from CPU code (Rust + wgpu), exactly like you’d orchestrate Unity ComputeShader dispatches from C#.”