GPGPU 3D Game of Life in HTML5

TL;DR: See it here.

Ever wondered what it's like to have a processor with hundreds of cores? Wonder no more, because if you're reading this using a graphical browser like Google Chrome, you already own (at least) one of these powerful machines.

That machine I am talking about is, of course, your Graphics Processing Unit (GPU). It effortlessly pushes out millions of pixels, typically refreshing every 16ms. Compared to what the CPU does it is perhaps dumb work, but it is by no means a mean feat.

These GPUs achieve this through sheer numbers: extremely high parallelism across hundreds of cores, combined with purpose-built, high-bandwidth memory. A kind of broadband architecture, if you will.

Teaching a Relatively New Dog Newer Tricks

GPUs were never meant for general computation, but that does not mean they are incapable of such tasks. This is called GPGPU, or General-Purpose computing on Graphics Processing Units.

Diagram credit: MIT

By (ab)using the programmable stages of the graphical rendering pipeline, you can exploit massive hardware parallelism to achieve extraordinary speedups in your programs.

With the advent of WebGL, it is now possible to take the next evolutionary step: bringing GPGPU to the browser. This is not really a brand new idea, but what I hope to achieve in this article is to demystify the arcane details.

The Implementation

The original Conway's Game of Life (not to be confused with the board game by Milton Bradley)

The 3D Game of Life is a modification of Conway's original 1970 Game of Life, which is a "zero-player game" and a kind of cellular automaton. The 3D version adds a third dimension to the grid of cells and uses two rules that can be configured through the parameters r1, r2, r3 and r4:

  1. A new cell will appear if its number of live neighbours is at least r1 and at most r2 (as though through reproduction).
  2. A cell will die if its number of live neighbours is more than r3 (as though through suffocation) or less than r4 (as though through loneliness).
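For concreteness, the two rules can be sketched as a single update function in plain JavaScript (the function name and signature are my own, for illustration only; the real implementation lives in the shader below):

```javascript
// Next state of one cell under the two rules above.
// state: 0 (dead) or 1 (alive); neighbours: live-neighbour count (0..26).
function nextState(state, neighbours, r1, r2, r3, r4) {
    if (state === 0 && neighbours >= r1 && neighbours <= r2) {
        return 1; // rule 1: birth through reproduction
    }
    if (state === 1 && (neighbours > r3 || neighbours < r4)) {
        return 0; // rule 2: death through suffocation or loneliness
    }
    return state; // otherwise the cell is unchanged
}
```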

Encoding the Grid

We use textures to marshal data into the graphical rendering pipeline. We do this by simply encoding the state of each element in the grid as a color in a texture.

Illustration credit: GPU Gems 2

However, we immediately face a challenge: unlike desktop OpenGL, WebGL does not support 3D textures. Thankfully, emulating them is not difficult. We can simply store the 2D slice for each Z depth side-by-side. The result is a rectangle of all slices in one big 2D texture.

// First we get a gl context somehow. You can easily
// get one from a canvas dom element. Probably a good
// idea to read Mozilla's WebGL documentation if you
// don't know how
var gl = ...

// Our starting grid of size*size*size 1s and 0s.
var grid = [...];

// We encode our texture colors here
var pixels = [];  
for (var i=0; i<grid.length; i++) {  
    // In fixed point, 1.0 is encoded as 255 in unsigned 8-bit integer
    var x = grid[i] * 255;

    // A color is made of 4 components: r,g,b,a
    // For the sake of convenience we just splat x into all components
    // even though we only really need one.
    pixels.push(x, x, x, x);
}
var pixelArray = new Uint8Array(pixels);

// We set up 2 textures to do double buffer computation
for (var i=0; i<2; i++) {  
    // Our texture is a rectangle
    var w = size * size;
    var h = size;

    // Create textures of size w x h
    textures[i] = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, textures[i]);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, w, h, 0, gl.RGBA, gl.UNSIGNED_BYTE, pixelArray);

    // Disable interpolation as that will interfere with GPGPU
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);

    // The texture is NPOT (non-power-of-two), so we are forced to enable this:
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
}
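To make the slice layout explicit, the same addressing scheme can be captured in a small helper (a sketch with names of my own; it assumes the slices are laid out side-by-side along the texture's width, as described above):

```javascript
// Linear pixel index of cell (x, y, z) in the (size*size)-by-size
// texture, where the z-th slice occupies columns [z*size, (z+1)*size).
function flatIndex(x, y, z, size) {
    var s = x + z * size; // column in the big 2D texture
    var t = y;            // row in the big 2D texture
    return t * (size * size) + s;
}
```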

Writing the Kernel using GLSL

The programmable parts of the graphical rendering pipeline are mainly the vertex shader and the fragment shader. You program these components using a language called GLSL. Our goal is to 'trick' the renderer into doing our math for us.

The vertex shader doesn't do anything interesting. It simply passes on the information to the fragment shader.

// OpenGL ES requires us to specify the precision
precision highp float; 

// This attribute stores the per-vertex position
attribute vec2 aPos;

// This attribute stores the per-vertex texture coordinates
attribute vec2 aTexCoord;

// This varying passes the texture coordinate on to the
// fragment shader, interpolated across each triangle
varying vec2 vTexCoord;

void main(void) {  
   gl_Position = vec4(aPos, 0, 1);
   vTexCoord = aTexCoord;
}

The business happens in the fragment shader:

precision highp float;

// This sampler stores the input texture
uniform sampler2D uTexSamp;

// This uniform stores the size of the grid
uniform float size;  
varying vec2 vTexCoord;

// This computes the normalized (s, t) coordinate of the
// element at (x, y, z) in our emulated 3D texture
vec2 index(float x, float y, float z) {  
    float s = x + z * size;
    float t = y;

    return vec2(s / (size*size), t / size); 
}

// This helps us to read the element at (x, y, z).
//
// We use floating points because pretty much
// everything in the graphical rendering pipeline
// uses floats.
//
// Returns 0.0 if dead and 1.0 if alive.
float getElement(float x, float y, float z) {  
    if (x<0.0 || x>=size
        || y<0.0 || y>=size
        || z<0.0 || z>=size) {
        return 0.0; 
    }

    return texture2D(uTexSamp, index(x,y,z)).r;
}

void main(void) {  
    // First, we un-normalize our texture coordinates
    // to the full size in pixels (s, t)
    float s = vTexCoord.s * size*size;
    float t = vTexCoord.t * size;

    // Then, we decode (x, y, z) from (s, t)
    float x = mod(s, size);
    float y = t;
    float z = floor(s / size);

    // Now it is only a matter of counting all live
    // neighbouring elements. The loop is intentionally
    // unrolled for better performance.
    float current = getElement(x, y, z);
    float neighbours = 0.0;

    neighbours += getElement(x-1.0, y-1.0, z-1.0);
    neighbours += getElement(x,     y-1.0, z-1.0);
    neighbours += getElement(x+1.0, y-1.0, z-1.0);
    neighbours += getElement(x-1.0, y,     z-1.0);
    neighbours += getElement(x,     y,     z-1.0);
    neighbours += getElement(x+1.0, y,     z-1.0);
    neighbours += getElement(x-1.0, y+1.0, z-1.0);
    neighbours += getElement(x,     y+1.0, z-1.0);
    neighbours += getElement(x+1.0, y+1.0, z-1.0);

    neighbours += getElement(x-1.0, y-1.0, z);
    neighbours += getElement(x,     y-1.0, z);
    neighbours += getElement(x+1.0, y-1.0, z);
    neighbours += getElement(x-1.0, y,     z);
    neighbours += getElement(x+1.0, y,     z);
    neighbours += getElement(x-1.0, y+1.0, z);
    neighbours += getElement(x,     y+1.0, z);
    neighbours += getElement(x+1.0, y+1.0, z);

    neighbours += getElement(x-1.0, y-1.0, z+1.0);
    neighbours += getElement(x,     y-1.0, z+1.0);
    neighbours += getElement(x+1.0, y-1.0, z+1.0);
    neighbours += getElement(x-1.0, y,     z+1.0);
    neighbours += getElement(x,     y,     z+1.0);
    neighbours += getElement(x+1.0, y,     z+1.0);
    neighbours += getElement(x-1.0, y+1.0, z+1.0);
    neighbours += getElement(x,     y+1.0, z+1.0);
    neighbours += getElement(x+1.0, y+1.0, z+1.0);

    // We hardcode parameters r1,r2,r3,r4 for better performance.
    float r1 = 3.0;
    float r2 = 4.0;
    float r3 = 3.0;
    float r4 = 2.0;

    // Based on our 2 rules, we decide the outcome.
    // This is the direct literal translation into code.
    if (current < 1.0 && neighbours >= r1 && neighbours <= r2) {
        gl_FragColor = vec4(1.0);
    } else if (current > 0.0 && (neighbours > r3 || neighbours < r4)) {
        gl_FragColor = vec4(0.0);
    } else {
        gl_FragColor = vec4(current);
    }
}

The hardware scheduler in your GPU will automatically spawn a massive number of threads, each running a copy of these shaders to produce one pixel of the output texture.
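The coordinate juggling is the easiest place to slip up, so here is the shader's encode/decode logic mirrored in plain JavaScript for sanity checking (the function names are mine, not part of the shader):

```javascript
// Mirror of index() in the fragment shader: (x, y, z) -> (s, t),
// in whole pixels rather than normalized coordinates.
function encode(x, y, z, size) {
    return { s: x + z * size, t: y };
}

// Mirror of the start of main(): (s, t) -> (x, y, z).
function decode(s, t, size) {
    return { x: s % size, y: t, z: Math.floor(s / size) };
}
```

Round-tripping any in-range coordinate through encode and then decode should give it back unchanged.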

Setting up the Driver

Unfortunately, setting up the graphical rendering pipeline takes a lot of ceremony, so let's get it out of the way:

// Dimensions of our encoded textures
var w = size * size;  
var h = size;

// First we compile our GLSL program.
// You can easily find getShader() on Mozilla's documentation of WebGL
var program = gl.createProgram();  
gl.attachShader(program, getShader("shader-vs"));  
gl.attachShader(program, getShader("shader-fs"));  
gl.linkProgram(program);  
gl.useProgram(program);

// The viewport must be set to the dimensions of the textures
gl.viewport(0, 0, w, h);

// We pass in the vertices and texture coordinates that cover
// the entire viewport using 2 triangles. The coordinates are
// normalized appropriately.
var vertices = new Float32Array([  
    -1, -1, 
    1, -1, 
    -1, 1, 
    1, 1]);
var texCoords = new Float32Array([  
    0.0, 0.0, 
    1.0, 0.0, 
    0.0, 1.0, 
    1.0, 1.0]);
var texCoordOffset = vertices.byteLength;  
var buffer = gl.createBuffer();  
gl.bindBuffer(gl.ARRAY_BUFFER, buffer);  
gl.bufferData(gl.ARRAY_BUFFER, vertices.byteLength + texCoords.byteLength, gl.STATIC_DRAW);  
gl.bufferSubData(gl.ARRAY_BUFFER, 0, vertices);  
gl.bufferSubData(gl.ARRAY_BUFFER, texCoordOffset, texCoords);

// Here we link our attributes and uniforms to the shaders
var aPosLoc = gl.getAttribLocation(program, "aPos");  
gl.enableVertexAttribArray(aPosLoc);  
gl.vertexAttribPointer(aPosLoc, 2, gl.FLOAT, false, 0, 0);  
var aTexLoc = gl.getAttribLocation(program, "aTexCoord");  
gl.enableVertexAttribArray(aTexLoc);  
gl.vertexAttribPointer(aTexLoc, 2, gl.FLOAT, false, 0, texCoordOffset);  
gl.uniform1i(gl.getUniformLocation(program, "uTexSamp"), 0);  
gl.uniform1f(gl.getUniformLocation(program, "size"), size);

// We create an FBO so that we can divert the output to a texture
framebuffer = gl.createFramebuffer();  
framebuffer.width = size*size;  
framebuffer.height = size;

// Don't output to the screen, output to the FBO
gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);

// The main computation loop with a delay so that we can observe the results
var iter = 0;  
var delay = 1000; //ms  
setInterval(function() {  
    // Set the input texture
    gl.bindTexture(gl.TEXTURE_2D, textures[iter]);

    // Set the output texture
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, textures[1-iter], 0);
    gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

    // We can read the result here
    readPixels();

    // Ping-pong between 0 and 1 to swap input/output textures
    iter = 1-iter;
}, delay);
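The ping-pong pattern is worth seeing in isolation. This sketch, with a trivial stand-in for the draw call, shows why two buffers are needed: each step reads one buffer and writes the other, so a generation never overwrites its own input mid-computation.

```javascript
// A trivial "kernel" standing in for the shader pass.
function step(input) {
    return input.map(function (v) { return v + 1; });
}

var buffers = [[0, 0, 0], null]; // like textures[0] and textures[1]
var iter = 0;
for (var gen = 0; gen < 3; gen++) {
    buffers[1 - iter] = step(buffers[iter]); // read one, write the other
    iter = 1 - iter;                         // swap roles
}
// buffers[iter] now holds the result of 3 generations
```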

Decoding the Result

The decoding is simply the encoding in reverse. We read from the attached FBO and translate the pixels into a form we can use in JavaScript.

function readPixels() {  
    // Appropriately sized unsigned array
    var pixels = new Uint8Array(size*size*size*4);

    // This function blocks until the result is ready.
    // It is a kind of implicit CPU-GPU synchronisation
    gl.readPixels(0, 0, size*size, size, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
    for (var i=0; i<size*size*size; i++) {
        // We only need to read one of the r,g,b,a
        var elem = pixels[i*4];
        if (elem > 0) {
            // Element is alive
        } else {
            // Element is dead :'(
        }
    }   
}
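Filling in the loop body, decoding the byte array back into a plain grid of 0s and 1s might look like this (a sketch; decodePixels is my own name for it):

```javascript
// Convert the RGBA byte array from readPixels back into a flat
// array of cell states, reading only the red channel of each pixel.
function decodePixels(pixels) {
    var grid = new Array(pixels.length / 4);
    for (var i = 0; i < grid.length; i++) {
        grid[i] = pixels[i * 4] > 0 ? 1 : 0;
    }
    return grid;
}
```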

The Result

This is an actual live demonstration of the program in action:

[Interactive demo: generation count readout, generation delay of 1000ms; click and drag to rotate the camera, scroll to zoom.]

Performance Analysis

In the sequential version of the 3D Game of Life, for a grid of size n×n×n, it is obvious that the complexity of calculating one generation is O(n^3).

However, as the work is shared among a number of processors p, our parallel version works at O(n^3 / p) per generation. On a modern GPU, p is typically in the hundreds; even my 2012 Macbook Pro Retina's GPU has several hundred cores. Imagine how many there would be on a newer desktop GPU!

In fact, we could take it further: if we let the number of cores scale with the problem (in the limit, one thread per cell, so p = n^3), p cancels n^3 out of our order of growth entirely:

O(n^3 / p) = O(n^3 / n^3) = O(1)

Amazing! Effectively, we have an O(1)-per-generation solution; it's much more scalable now (assuming we could fit everything into memory, that is).

Benchmarks

I've benchmarked the program over a fixed grid size and number of generations on an AMD HD 7950, against a sequential CPU implementation. Only the last generation is output while the rest are silenced.

                        Time (s)
  AMD HD 7950           10.631
  2.67GHz Intel i7-920  17.833

That's a modest speedup of about 1.7 times on my desktop. It appears there is some overhead associated with setting up the graphical rendering pipeline.

The overhead is currently far too significant, so what we need to do is increase the computational workload of the program to make the overhead proportionally smaller. So, I've run the benchmark again with many more generations:

                        Time (s)
  AMD HD 7950           11.569
  2.67GHz Intel i7-920  173.518

That's a speedup of about 15 times! The GPU is barely any slower, but the CPU was unable to keep up this time round. This is akin to a race between an F1 car and a fighter jet: the F1 car is faster at drag distances, but once the fighter jet fully overcomes the overhead of warming up, the jet dominates hands down.

Now, considering that WebGL wasn't even made for this kind of computation, this is quite an achievement.

Possible Improvements

  1. We encoded the 3D texture into a long 2D rectangle. However, the maximum texture size in WebGL is typically only a few thousand pixels per side, so our size*size-wide strip reaches this limit rather easily. A better way is to tile the slices into a roughly square atlas, which could fit a considerably larger cube.

  2. We can't use the built-in texture wraparound for the Game of Life simulation, but we could implement manual wraparound in the shader.

  3. The texture packing is not very space-efficient (we use four bytes per cell but need only one bit), resulting in larger-than-necessary memory transfers between the CPU and GPU.
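To illustrate the first improvement: tiling the slices into a near-square atlas keeps both texture dimensions small. A sketch of the arithmetic (names are my own):

```javascript
// Dimensions of a near-square atlas holding the 'size' Z-slices of a
// size x size x size grid, versus the 1-row strip used in this article.
function atlasDims(size) {
    var cols = Math.ceil(Math.sqrt(size)); // slices per atlas row
    var rows = Math.ceil(size / cols);     // atlas rows needed
    return { width: cols * size, height: rows * size };
}
```

For size = 64 the strip layout needs a 4096 x 64 texture, while the square layout needs only 512 x 512; with a typical 4096-pixel limit on each side, the strip tops out at a 64-cube but the square atlas fits a 256-cube.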

Relevant Modules in NUS

Interested to harness the power of parallel computing?

You can learn about the concepts in the following modules. I highly recommend them:

  • CS3211 Parallel and Concurrent Programming (Sem 2 only)
  • CS3210 Parallel Computing (Sem 1 only)
  • CS4345 General-Purpose Computation on GPU (Sem 1 only)

I recommend taking CS3211 before CS3210 as the content of the former precedes the latter (because obviously, right /s). Also, you may take CS4345 alongside CS3210 without any problems (in fact, it might be beneficial to do so). Bear in mind, though, you might need CS3241 Computer Graphics as a prerequisite for CS4345.

Want to learn GPGPU on HTML5? Too bad. There's no such module! Instead, you should learn how to apply the concepts in the modules that are available. Application is a really useful industry skill.


I hope you've enjoyed this demonstration.

If you have any feedback/questions or if you noticed any mistakes in my article, please contact me at fazli[at]sapuan[dot]org.

Comment section for placebo effect only. Please use email to contact the author