Think of this repo as “how you’d write a Unity ComputeShader-based k-means color quantizer”, just done in Rust + wgpu + WGSL instead of C# + HLSL.
Because my access to the actual source files is limited by the environment, I can't quote the exact shader code. But from the README and the references the author links (especially the WebGPU prefix-sum articles and the kmeans-colors crate), we can reconstruct how it works and how you'd approach the same thing in Unity.
Standard k-means for color quantization:
- Take all pixels as points in color space (often RGB or CIELAB).
- Choose k initial cluster centers (centroids).
- Repeat for N iterations or until convergence:
  - Assignment step: for each pixel, find the nearest centroid and assign the pixel to that cluster.
  - Update step: for each cluster, average all pixels assigned to it → new centroid color.
- Replace each pixel’s color with its cluster’s centroid color (and optionally dither).
On CPU, you just loop over all pixels and do this in nested loops. On GPU, you break it into a few compute passes.
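The CPU version of that loop can be sketched in plain Rust (a naive reference implementation of the textbook algorithm, not the repo's code; the `kmeans_step` name and `[f32; 3]` color representation are my own choices):

```rust
// Naive CPU k-means over RGB colors: one full iteration per call.
// Each pixel is a point in RGB space; `centroids` holds the k current centers.

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// One k-means iteration: assignment step + update step.
/// Returns the per-pixel cluster assignments.
fn kmeans_step(pixels: &[[f32; 3]], centroids: &mut Vec<[f32; 3]>) -> Vec<usize> {
    // Assignment: nearest centroid per pixel (squared Euclidean distance).
    let assignments: Vec<usize> = pixels
        .iter()
        .map(|&p| {
            (0..centroids.len())
                .min_by(|&a, &b| {
                    dist2(p, centroids[a]).partial_cmp(&dist2(p, centroids[b])).unwrap()
                })
                .unwrap()
        })
        .collect();

    // Update: average of all pixels assigned to each cluster -> new centroid.
    let k = centroids.len();
    let mut sums = vec![[0.0f32; 3]; k];
    let mut counts = vec![0usize; k];
    for (&p, &c) in pixels.iter().zip(&assignments) {
        for i in 0..3 {
            sums[c][i] += p[i];
        }
        counts[c] += 1;
    }
    for c in 0..k {
        if counts[c] > 0 {
            for i in 0..3 {
                centroids[c][i] = sums[c][i] / counts[c] as f32;
            }
        } // else: keep the old centroid for empty clusters
    }
    assignments
}
```

The GPU passes described below are exactly this loop with the assignment step parallelized per pixel and the update step per cluster.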
The project uses:
- Rust host code + wgpu (Rust's WebGPU implementation) to manage GPU resources.
- WGSL compute shaders to do the heavy lifting on the GPU.
For a Unity dev, the mapping is roughly:
| This project | Unity equivalent |
|---|---|
| Rust + wgpu | C# + ComputeShader, CommandBuffer |
| WGSL compute shader | HLSL compute shader (.compute) |
| Bind groups, layouts | Shader variables / SetBuffer, SetTexture |
| 2D sampled texture | Texture2D / RenderTexture |
| Storage / uniform buffer | ComputeBuffer / constant buffer |
The README notes it loads the image as a texture and is limited by GPU texture size (8192×8192). That's exactly like Unity: you upload a Texture2D to the GPU and have a shader read from it.
For k-means on colors you usually need these GPU-side resources:
- Input image texture
  - 2D texture (e.g. rgba8) containing the original image colors.
  - In Unity: Texture2D or RenderTexture bound as Texture2D<float4>.
- Per-pixel assignment buffer
  - One uint per pixel: which cluster this pixel belongs to.
  - In Unity: RWStructuredBuffer<uint> _Assignments;
- Cluster centers buffer
  - k colors (float3 or float4), the current centroids.
  - In Unity: RWStructuredBuffer<float4> _Centroids;
- Accumulation buffers for the update step
  - To compute new centroids you need, for each cluster:
    - Sum of all pixel colors in that cluster (float3/float4).
    - Count of pixels in that cluster (uint).
  - In Unity: RWStructuredBuffer<float4> _ClusterSums; and RWStructuredBuffer<uint> _ClusterCounts;
- Config / constants buffer
  - Image size, k, iteration count, etc.
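A CPU-side mirror of those resources can be sketched in Rust to make the sizes concrete (the struct and field names are illustrative, not the repo's actual identifiers; sizes assume rgba32float centroids and one u32 per pixel):

```rust
/// CPU-side mirror of the GPU buffers for a width*height image and k clusters.
/// Field names are hypothetical, chosen to match the Unity names in the text.
struct KMeansResources {
    assignments: Vec<u32>,       // one cluster index per pixel
    centroids: Vec<[f32; 4]>,    // k current centers (rgba)
    cluster_sums: Vec<[f32; 4]>, // per-cluster color accumulator
    cluster_counts: Vec<u32>,    // per-cluster pixel counter
}

impl KMeansResources {
    fn new(width: usize, height: usize, k: usize) -> Self {
        Self {
            assignments: vec![0; width * height],
            centroids: vec![[0.0; 4]; k],
            cluster_sums: vec![[0.0; 4]; k],
            cluster_counts: vec![0; k],
        }
    }

    /// Rough GPU memory footprint of these buffers, in bytes.
    /// The per-pixel assignment buffer dominates; everything else scales with k.
    fn gpu_bytes(&self) -> usize {
        self.assignments.len() * 4
            + (self.centroids.len() + self.cluster_sums.len()) * 16
            + self.cluster_counts.len() * 4
    }
}
```

This also shows why the 8192×8192 texture limit matters: the assignment buffer alone is 4 bytes per pixel, while the k-sized buffers are negligible.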
He also mentions reading about prefix sums, so there's probably at least one pass that uses a parallel scan (e.g. for compaction, dithering, or more efficient reductions), but you can implement a simpler version with atomics for cluster sums.
Idea: 1 thread per pixel. Read pixel color, loop over all k centroids, find the closest, write cluster index, atomically accumulate into that cluster’s sum & count.
Pseudo-HLSL (Unity compute style):

```hlsl
#pragma kernel AssignPixels

Texture2D<float4> _Input;
SamplerState _Sampler;
RWStructuredBuffer<float4> _Centroids;   // length = k
RWStructuredBuffer<uint> _Assignments;   // length = width * height
RWStructuredBuffer<float4> _ClusterSums; // length = k (see float-atomics note)
RWStructuredBuffer<uint> _ClusterCounts; // length = k
uint _Width, _Height, _K;

[numthreads(16, 16, 1)]
void AssignPixels(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= _Width || id.y >= _Height) return;

    uint index = id.y * _Width + id.x;
    float2 uv = (float2(id.xy) + 0.5) / float2(_Width, _Height);
    float3 color = _Input.SampleLevel(_Sampler, uv, 0).rgb;

    // Find the nearest centroid (squared Euclidean distance).
    uint bestIdx = 0;
    float bestDist = 1e30;
    for (uint i = 0; i < _K; i++)
    {
        float3 d = color - _Centroids[i].rgb;
        float dist = dot(d, d);
        if (dist < bestDist) { bestDist = dist; bestIdx = i; }
    }

    _Assignments[index] = bestIdx;

    // Accumulate cluster statistics.
    InterlockedAdd(_ClusterCounts[bestIdx], 1);
    // HLSL has no float InterlockedAdd on structured buffers; you'd normally
    // use fixed-point atomics on a RWByteAddressBuffer or group-shared reductions:
    // AtomicAdd(_ClusterSums[bestIdx].rgb, color); // pseudo-call
}
```
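The fixed-point workaround mentioned in that comment can be demonstrated on the CPU: scale each color channel to an integer, accumulate with integer atomics, and divide back out when reading. This is a sketch of the general technique (the `SCALE` of 256 and the function names are my choices, not the repo's):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Fixed-point atomic accumulation: the trick GPU kernels use when the
// shading language lacks float atomics. Each channel is scaled to an
// integer, added with a u32 atomic, and divided back out at the end.
const SCALE: f32 = 256.0;

/// Atomically add a color (in [0, 1] per channel) to a cluster's sum slot.
fn atomic_add_color(sums: &[[AtomicU32; 3]], cluster: usize, color: [f32; 3]) {
    for i in 0..3 {
        sums[cluster][i].fetch_add((color[i] * SCALE) as u32, Ordering::Relaxed);
    }
}

/// Read back the mean color of a cluster, undoing the fixed-point scaling.
fn read_mean_color(sums: &[[AtomicU32; 3]], cluster: usize, count: u32) -> [f32; 3] {
    let mut out = [0.0; 3];
    for i in 0..3 {
        out[i] = sums[cluster][i].load(Ordering::Relaxed) as f32 / SCALE / count as f32;
    }
    out
}
```

The precision cost is bounded by the scale factor (here 1/256 per add), which is usually acceptable for 8-bit color output.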
In WGSL, the author does the same thing: get global invocation ID, compute UV, sample the texture, loop over k cluster centers, write the index, and use atomics or parallel reduction to build cluster statistics.
Unity mental model: exactly like a GPGPU “per-pixel” pass where each thread handles one pixel.
Once every pixel is assigned, you compute the new centroid for each cluster:
```hlsl
#pragma kernel UpdateCentroids

RWStructuredBuffer<float4> _Centroids;
RWStructuredBuffer<float4> _ClusterSums;
RWStructuredBuffer<uint> _ClusterCounts;
uint _K;

[numthreads(64, 1, 1)]
void UpdateCentroids(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= _K) return;

    uint count = _ClusterCounts[i];
    if (count == 0) return; // or keep the old centroid

    float3 sum = _ClusterSums[i].rgb;
    float3 newCenter = sum / (float)count;
    _Centroids[i] = float4(newCenter, 1.0);
}
```
After this pass, the centroids are updated. Then you:
- Clear _ClusterSums and _ClusterCounts (another tiny compute kernel).
- Run Assign → Update → Clear in a loop for N iterations from the CPU side.
In Rust + wgpu this is done by submitting multiple compute passes per iteration; in Unity you'd use:

```csharp
for (int iter = 0; iter < iterations; ++iter)
{
    compute.SetInt("_Iteration", iter);
    compute.Dispatch(clearKernel, (k + 63) / 64, 1, 1);
    compute.Dispatch(assignKernel, (width + 15) / 16, (height + 15) / 16, 1);
    compute.Dispatch(updateKernel, (k + 63) / 64, 1, 1);
}
```
This matches how you’d design it in Unity even if the actual repo expresses it in Rust.
The README shows different modes:
- Simple “replace with k-means color” output.
- Dithered output.
- Palette generation.
These are all variations on a final compute pass:
- Simple recolor:

```hlsl
#pragma kernel ApplyPalette

RWTexture2D<float4> _Output;
StructuredBuffer<uint> _Assignments;
StructuredBuffer<float4> _Centroids;
uint _Width, _Height;

[numthreads(16, 16, 1)]
void ApplyPalette(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= _Width || id.y >= _Height) return;

    uint index = id.y * _Width + id.x;
    uint cluster = _Assignments[index];
    _Output[id.xy] = _Centroids[cluster];
}
```
- Dithering:
  - Same idea, but instead of a straight lookup, it uses a threshold matrix (Bayer / ordered dithering) or error diffusion to decide when to bump to a neighboring centroid, which is why the author links GPU dithering resources.
- Palette image output:
  - Another compute or CPU pass that draws swatches for each centroid into a small texture.
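The ordered-dithering variant can be sketched in Rust against a fixed palette (a minimal illustration of the technique, not necessarily how this repo implements it; a 2×2 Bayer matrix is used for brevity where real implementations typically use 4×4 or 8×8):

```rust
// Ordered (Bayer) dithering against a fixed palette: per pixel, instead of
// always taking the nearest color, a position-dependent threshold decides
// between the two nearest palette entries.
const BAYER2: [[f32; 2]; 2] = [[0.0, 0.5], [0.75, 0.25]];

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Pick a palette index for the pixel at (x, y) with ordered dithering.
fn dithered_index(color: [f32; 3], palette: &[[f32; 3]], x: usize, y: usize) -> usize {
    // Sort palette indices by distance to this color; keep the two nearest.
    let mut order: Vec<usize> = (0..palette.len()).collect();
    order.sort_by(|&a, &b| {
        dist2(color, palette[a]).partial_cmp(&dist2(color, palette[b])).unwrap()
    });
    let (first, second) = (order[0], order[1]);

    // How far the pixel sits between the two candidates (0 = exactly `first`).
    let d1 = dist2(color, palette[first]).sqrt();
    let d2 = dist2(color, palette[second]).sqrt();
    let t = d1 / (d1 + d2 + 1e-6);

    // The Bayer matrix supplies a per-position threshold; pixels near the
    // boundary between two palette colors alternate between them spatially.
    if t > BAYER2[y % 2][x % 2] { second } else { first }
}
```

Colors exactly on a palette entry always map to that entry; in-between colors flip between the two candidates in a fixed spatial pattern, which is what produces the characteristic dither texture.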
He explicitly cites reading material on prefix sums / scans in WGSL. That usually means one of these patterns is in play:
- Efficiently summing per-cluster contributions without hammering global atomics.
- Compaction / reordering operations (less likely needed for simple color k-means).
- Support for more advanced dithering / histogram steps.
In Unity/HLSL, you could:
- Use a workgroup-local prefix sum to reduce per-group sums, then atomically add one value per group to global buffers (classic “reduce in shared memory, then global atomic once per group” pattern).
- Implement the same scan algorithm as the WGSL examples he references, but in HLSL (same ideas; different syntax).
The important bit for you: k-means on GPU is basically “scan + reduce + update” repeated, and prefix sum is the standard building block to make that fast and parallel.
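The scan itself has a simple shape. Here it is serially in Rust, in the Hillis–Steele form the WebGPU prefix-sum articles describe (on the GPU every element is one thread and the two buffers are ping-ponged; the inner loop here just plays the role of the threads):

```rust
// Exclusive prefix sum (scan), Hillis–Steele style: log2(n) passes, each
// adding the element `offset` places back. This is the building block the
// GPU version parallelizes across a workgroup.
fn exclusive_scan(input: &[u32]) -> Vec<u32> {
    if input.is_empty() {
        return Vec::new();
    }
    let n = input.len();
    // Shift right by one so the result is exclusive (output[0] = 0).
    let mut buf: Vec<u32> = std::iter::once(0)
        .chain(input[..n - 1].iter().copied())
        .collect();

    let mut offset = 1;
    while offset < n {
        // Double-buffer: GPU implementations ping-pong between two buffers
        // so every "thread" reads the previous pass's values, not this one's.
        let prev = buf.clone();
        for i in offset..n {
            buf[i] = prev[i] + prev[i - offset];
        }
        offset *= 2;
    }
    buf
}
```

For example, scanning per-cluster counts gives each cluster's starting offset in a pixel-reordering buffer, which is the compaction use case mentioned above.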
If you wanted to replicate this repo in Unity:
- Load the image
  - Texture2D inputTex;
  - Optionally convert to linear color space.
- Allocate GPU resources

  ```csharp
  int pixelCount = width * height;
  var assignments = new ComputeBuffer(pixelCount, sizeof(uint));
  var centroids = new ComputeBuffer(k, sizeof(float) * 4);
  var sums = new ComputeBuffer(k, sizeof(float) * 4);
  var counts = new ComputeBuffer(k, sizeof(uint));
  var outputRT = new RenderTexture(width, height, 0, RenderTextureFormat.ARGBFloat);
  outputRT.enableRandomWrite = true;
  outputRT.Create();
  ```
- Initialize centroids
  - Random pixels, k-means++ on CPU, or even uniform sampling.
- Bind resources to the compute shader

  ```csharp
  var cs = yourComputeShader;
  cs.SetTexture(assignKernel, "_Input", inputTex);
  cs.SetBuffer(assignKernel, "_Centroids", centroids);
  cs.SetBuffer(assignKernel, "_Assignments", assignments);
  cs.SetBuffer(assignKernel, "_ClusterSums", sums);
  cs.SetBuffer(assignKernel, "_ClusterCounts", counts);
  cs.SetInt("_Width", width);
  cs.SetInt("_Height", height);
  // Similarly bind for updateKernel & applyKernel
  ```
- Iterate: clear → assign → update

  ```csharp
  for (int iter = 0; iter < iterations; ++iter)
  {
      cs.Dispatch(clearKernel, (k + 63) / 64, 1, 1);  // zero sums & counts
      cs.Dispatch(assignKernel, (width + 15) / 16, (height + 15) / 16, 1);
      cs.Dispatch(updateKernel, (k + 63) / 64, 1, 1); // recompute centers
  }
  ```
- Apply final palette to output texture

  ```csharp
  cs.Dispatch(applyKernel, (width + 15) / 16, (height + 15) / 16, 1);
  ```

- Use outputRT as your final color-quantized image.
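The "Initialize centroids" step above can be sketched in Rust with a deterministic k-means++-style seeding (this is the farthest-point variant, chosen here to keep the example dependency-free and testable; true k-means++ samples the next center with probability proportional to squared distance):

```rust
fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Deterministic seeding: start from the first pixel, then repeatedly take
/// the pixel farthest from every centroid chosen so far. Like k-means++,
/// this spreads the initial centers across the occupied color space.
fn init_centroids(pixels: &[[f32; 3]], k: usize) -> Vec<[f32; 3]> {
    let mut centroids = vec![pixels[0]];
    while centroids.len() < k {
        let next = pixels
            .iter()
            .max_by(|&&a, &&b| {
                // Each pixel's score is its distance to the NEAREST centroid.
                let da = centroids.iter().map(|&c| dist2(a, c)).fold(f32::MAX, f32::min);
                let db = centroids.iter().map(|&c| dist2(b, c)).fold(f32::MAX, f32::min);
                da.partial_cmp(&db).unwrap()
            })
            .unwrap();
        centroids.push(*next);
    }
    centroids
}
```

Seeding quality matters because plain k-means only converges to a local optimum; well-spread initial centers usually mean fewer iterations and a better palette.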
That’s essentially what kmeans-gpu is doing, just expressed with:
- Rust structs + wgpu pipeline setup,
- WGSL compute shaders instead of HLSL,
- A CLI front-end instead of Unity’s editor/runtime.
“They implemented k-means by loading the image as a GPU texture, then running several compute passes: one kernel that, for each pixel, finds the nearest cluster center and atomically accumulates per-cluster color sums and counts; another kernel that, for each cluster, divides sum by count to update the centroid; repeated for a few iterations. Finally, a pass replaces each pixel with its cluster color (with optional dithering). All of this is orchestrated from CPU code (Rust + wgpu), exactly like you’d orchestrate Unity ComputeShader dispatches from C#.”