Thursday, 4 June 2015

Dual Contouring with OpenCL

I've spent the last few weeks rewriting the core component of my engine, the voxel pipeline. This encapsulates the whole process from density function to a triangle mesh. In terms of functionality the pipeline is essentially the same, but now is (almost) entirely implemented using OpenCL. In fact the OpenGL/OpenCL interop is used so that the generated mesh lives and dies exclusively on the GPU.

The pipeline contains a number of distinct stages: generating the density field, discovering active voxels and finally constructing an octree from the voxels. The octrees are then used to produce another intermediate structure and that in turn produces the final mesh. In this post I'll cover the current pipeline implementation from density function to polygons, and typically showing the OpenCL kernels since those tend to be the interesting parts. Even if you're not familiar with OpenCL I think you should be able to follow the process.

Before we get started a quick caveat: this is my first OpenCL project, so don't think I'm presenting this as ideal implementation -- if you're familiar with OpenCL and spot something daft let me know!

Density Field

The first step in the pipeline is the density function. This a function which takes a 3D worldspace position as input and produces a single scalar value as output: negative density indicates that the point is inside some volume (e.g. a sphere, the terrain, a wall etc) and a positive density indicates that the point is outside all volumes defined by the density function. The surface lies where the density function is zero. In theory the density function is all that's required to construct the voxels but storing the field's data means it can be altered later.

The DensityField stores the Hermite data required for the voxel generation, that means that for every voxel edge that intersects with the surface the position of the intersection and the normal of the surface at that point must be stored. The DensityField struct is initialised by exhaustively checking every edge in the grid of voxels which are contained in the chunk volume. In my current implementation this is 64x64x64 voxels, but the Hermite field data needs to be 1 unit bigger in each dimension as each voxel needs to check all 12 edges of its bounding box. Similarly to a technique from this GPU Gems 3 article each is checked exactly once by checking 3 edges on each Hermite index, one for each axis.

The first step is to initialise the Hermite indices, to determine whether each end of each edge is inside or outside the volume. This data could be stored as a sparse array with any missing index mapping to 'air'/'empty' but that would require using a hashtable. Using a hashtable is viable in OpenCL but not without cost, apart from the time taken to construct the hashtable, each lookup (successful and otherwise) is more costly. For now the Hermite indices are stored in a 3D array flattened to 1D. The first kernel in the pipeline is responsible for filling the array.

kernel void GenerateDefaultField(
 read_only image2d_t permTexture,
 const int4 offset,
 const int defaultMaterialIndex,
 global int* field_materials)
 const int x = get_global_id(0);
 const int y = get_global_id(1);
 const int z = get_global_id(2);

 const float4 world_pos = { x + offset.x, y + offset.y, z + offset.z, 0 };
 const float density = DensityFunc(world_pos, permTexture);
 const int material = density < 0.f ? defaultMaterialIndex : MATERIAL_AIR;

 const int4 local_pos = { x, y, z, 0 };
 const int index = field_index(local_pos);

 field_materials[index] = material;

Each kernel will be invoked with a unique three dimensional ID, (x, y, z), which corresponds to one of the Hermite indices for the field, i.e. 0 <= x, y, z <= 65. This ID is used to both calculate the world space position of the index (by offsetting against the field's bounding box min and the flattened 1D index into the field_materials array.

(There are some problems with this approach which means that some edges won't be discovered with intricate implicit surfaces, I'm hoping to fix this and if/when I do I'll cover it in another post.)

With the Hermite indices calculated the edges of the field can be discovered by examining pairs of indices, marking the start & end of the edge. Due to storing an extra Hermite index along each axis (i.e. 65 values, not 64) each of the field edges can be checked by examining three edges starting from each of the Hermite indices. Note that to generate the edges for any given index the next set of data needs to be available, which means the Hermite indices is actually an array of 66x66x66 values, which allows 65 edges along each axis, which in turn allows 64 voxels along each axis (hence the indices from the kernel being <= 65 not just < 65).

kernel void FindFieldEdges(
 const int4 offset,
 global int* field_materials,
 global int* edgeOccupancy,
 global int* edgeIndices)
 const int x = get_global_id(0);
 const int y = get_global_id(1);
 const int z = get_global_id(2);
 const int4 pos = { x, y, z, 0 };
 const int index = (x + (y * HERMITE_INDEX_SIZE) + (z * HERMITE_INDEX_SIZE * HERMITE_INDEX_SIZE));
 const int edgeIndex = index * 3;

 const int CORNER_MATERIALS[4] = 
  field_materials[field_index(pos + (int4)(0, 0, 0, 0))],
  field_materials[field_index(pos + (int4)(1, 0, 0, 0))],
  field_materials[field_index(pos + (int4)(0, 1, 0, 0))],
  field_materials[field_index(pos + (int4)(0, 0, 1, 0))],

 const int hermiteIndex = pos.x | (pos.y << HERMITE_INDEX_SHIFT) | (pos.z << (HERMITE_INDEX_SHIFT * 2));

 for (int i = 0; i < 3; i++)
  const int e = 1 + i;
  const int signChange = 

  edgeOccupancy[edgeIndex + i] = signChange;
  edgeIndices[edgeIndex + i] = signChange ? ((hermiteIndex << 2) | i) : -1;

This kernel is also invoked with a 3D ID but this time the values fall in the 0 <= x,y,z <= 64 range which allows safe access to the field_materials buffer plus one unit on each axis. The ID is used as a 3D position which allows the position of the end of each edge to be calculated and the then converted into a flattened index via the field_index function. That in turn allows the material value to be looked up for both ends of all three edges (since each edge starts at the ID position).

Each edge can be uniquely identified by two pieces of information: the Hermite index it starts at, and which axis it belongs to. That information in encoded by first creating a unique hermiteIndex which is each of the local space coordinates (so in the [0,64] range) shifted and bitwise OR'd into an int. Each unique edge index can then be created by shifting the hermite index again and writing the axis index (0, 1, or 2) to the bottom two bits of the int. This allows the edges to be looked up later by constructing an appropriate ID.

The last thing to note is that although the kernel examines every edge in the field typically on a small number of edges will actually contain a sign change (i.e. the surface intersects). It's not possible to know which edges will contain the surface before hand so during the call the buffer edgeOccupancy is filled out. The buffer has 65 * 65 * 65 * 3 (823875) elements, one for each of the 3 edges on each of the Hermite edges. If the edge contains the surface then a 1 is written, otherwise a 0. This allows a common GPGPU technique of a Scan to be applied to the edgeOccupancy buffer which in turns allows only the valid edges from the edgeIndices buffer to be selected. That means that only the active edges need to be stored, not the whole buffer.

Now the edges that contain the surface have been discovered but not the actual data. For each edge the intersection position and the surface normal at that position need to be calculated.


kernel void FindEdgeIntersectionInfo(
 read_only image2d_t permTexture,
 const int4 worldSpaceOffset,
 global int* encodedEdges,
 global float4* edgeInfo)
 const int index = get_global_id(0);
 const int edge = encodedEdges[index];

 const int axisIndex = edge & 3;
 const int hermiteIndex = edge >> 2;

 const int4 local_pos =
  (hermiteIndex >> (VOXEL_INDEX_SHIFT * 0)) & VOXEL_INDEX_MASK,
  (hermiteIndex >> (VOXEL_INDEX_SHIFT * 1)) & VOXEL_INDEX_MASK,
  (hermiteIndex >> (VOXEL_INDEX_SHIFT * 2)) & VOXEL_INDEX_MASK,

 const int4 EDGE_END_OFFSETS[3] =
  (int4)(1, 0, 0, 0),
  (int4)(0, 1, 0, 0),
  (int4)(0, 0, 1, 0),

 const int4 world_pos = local_pos + worldSpaceOffset;
 const float4 p0 = convert_float4(world_pos);
 const float4 p1 = convert_float4(world_pos + EDGE_END_OFFSETS[axisIndex]);
 float minValue = FLT_MAX;
 float currentT = 0.f;
 float t = 0.f;
 for (int i = 0; i <= FIND_EDGE_INFO_STEPS; i++)
  const float4 p = mix(p0, p1, currentT);
  const float d = fabs(DensityFunc(p, permTexture));
  if (d < minValue)
   t = currentT;
   minValue = d;


 const float4 p = mix(p0, p1, t);

 const float h = 0.001f;
 const float4 xOffset = { h, 0, 0, 0 };
 const float4 yOffset = { 0, h, 0, 0 };
 const float4 zOffset = { 0, 0, h, 0 };

 const float dx = DensityFunc(p + xOffset, permTexture) - DensityFunc(p - xOffset, permTexture);
 const float dy = DensityFunc(p + yOffset, permTexture) - DensityFunc(p - yOffset, permTexture);
 const float dz = DensityFunc(p + zOffset, permTexture) - DensityFunc(p - zOffset, permTexture);

 const float3 normal = normalize((float3)(dx, dy, dz));
 edgeInfo[index] = (float4)(normal, t);

Since the edge buffer was compact this kernel is called with a 1D index. The encoding performed in the previous kernel is undone to recover the axis index and the Hermite index. The local position is reconstructed and used to calculate the worldspace end points of the edge. The actual process of finding the crossing position and normal is the same as in the CPU implementation so I won't go into it here. Once the kernel has finished running a new buffer edgeInfo has been created which contains the actual data. We now have two buffers, edgeIndices and edgeInfo, which together record all the data required to construct the voxels.

Voxel Generation

The voxel generation process is actually split into two parts, the first is voxel discovery:

kernel void FindActiveVoxels(
 global int* materials,
 global int* voxelOccupancy,
 global int* voxelEdgeInfo,
 global int* voxelPositions,
 global int* voxelMaterials)
 const int x = get_global_id(0);
 const int y = get_global_id(1);
 const int z = get_global_id(2);
 const int index = x + (y * VOXELS_PER_CHUNK) + (z * VOXELS_PER_CHUNK * VOXELS_PER_CHUNK);
 const int4 pos = { x, y, z, 0 };

 const int cornerMaterials[8] = 
  materials[field_index(pos + CHILD_MIN_OFFSETS[0])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[1])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[2])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[3])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[4])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[5])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[6])],
  materials[field_index(pos + CHILD_MIN_OFFSETS[7])],

 // record the on/off values at the corner of each voxel
 int cornerValues = 0;
 cornerValues |= (((cornerMaterials[0]) == MATERIAL_AIR ? 0 : 1) << 0);
 cornerValues |= (((cornerMaterials[1]) == MATERIAL_AIR ? 0 : 1) << 1);
 cornerValues |= (((cornerMaterials[2]) == MATERIAL_AIR ? 0 : 1) << 2);
 cornerValues |= (((cornerMaterials[3]) == MATERIAL_AIR ? 0 : 1) << 3);
 cornerValues |= (((cornerMaterials[4]) == MATERIAL_AIR ? 0 : 1) << 4);
 cornerValues |= (((cornerMaterials[5]) == MATERIAL_AIR ? 0 : 1) << 5);
 cornerValues |= (((cornerMaterials[6]) == MATERIAL_AIR ? 0 : 1) << 6);
 cornerValues |= (((cornerMaterials[7]) == MATERIAL_AIR ? 0 : 1) << 7);

 // record which of the 12 voxel edges are on/off
 int edgeList = 0;
 for (int i = 0; i < 12; i++)
  const int i0 = EDGE_MAP[i][0]; 
  const int i1 = EDGE_MAP[i][1]; 
  const int edgeStart = (cornerValues >> i0) & 1;
  const int edgeEnd = (cornerValues >> i1) & 1;

  const int signChange = (!edgeStart && edgeEnd) || (edgeStart && !edgeEnd);
  edgeList |= (signChange << i);

 voxelOccupancy[index] = cornerValues != 0 && cornerValues != 255;
 voxelEdgeInfo[index] = edgeList;
 voxelPositions[index] = CodeForPosition(pos, MAX_OCTREE_DEPTH);

 int materialIndex = FindDominantMaterial(cornerMaterials);
 voxelMaterials[index] = (materialIndex << 8) | cornerValues;

The 3D kernel ID is used as the position as before, but this time the position is for the voxels and so is the range is 0 <= x,y,z < 64. The voxel position is used as the min of the bounding box and so the 8 corners of the cube can be found by adding CHILD_MIN_OFFSETS. The corner materials are used to determine which corner points of the cube are "active" -- i.e. they have any material other than air/empty. The active edges for the voxel are recorded to save recalculating the values in the second kernel. As with the edges the voxelOccupancy is calculated to allow the generated buffer to be compacted. Along with the occupancy three other values are recorded for each voxel: the edgeList is stored to be used in the second kernel, the voxelPosition records an encoded version of the local 3D position via CodeForPosition and finally the "dominant" (i.e. most common) material is stored along with the cornerValues, which is used to generate the mesh data. The encoded position is an important part of building the octree, but I'll cover that later.

Once the active voxels have been discovered by the kernel and the buffers are compacted the second kernel can be executed to generate the leaf nodes in the octree for each voxel.

// the values here are indices into the CHILD_MIN_OFFSETS array
constant int EDGE_VERTEX_MAP[12][2] = 
 {0,4},{1,5},{2,6},{3,7}, // x-axis 
 {0,2},{1,3},{4,6},{5,7}, // y-axis
 {0,1},{2,3},{4,5},{6,7}  // z-axis

kernel void CreateLeafNodes(
 global int* voxelPositions,
 global int* voxelEdgeInfo,
 global float4* edgeData,
 global float4* vertexNormals,
 global QEFData* leafQEFs,
 global ulong* cuckoo_table,
 global ulong* cuckoo_stash,
 const uint   cuckoo_prime,
 global uint* cuckoo_hashParamsA,
 global uint* cuckoo_hashParamsB,
 const int    cuckoo_checkStash)
 const int index = get_global_id(0);

 const int encodedPosition = voxelPositions[index];
 const int4 position = PositionForCode(encodedPosition);
 const float4 position_f = convert_float4(position);

 const int edgeInfo = voxelEdgeInfo[index];
 const int edgeList = edgeInfo;

 float4 edgePositions[12], edgeNormals[12];
 int edgeCount = 0;

 for (int i = 0; i < 12; i++)
  const int active = (edgeList >> i) & 1;
  if (!active)

  const int e0 = EDGE_VERTEX_MAP[i][0];
  const int e1 = EDGE_VERTEX_MAP[i][1];

  const float4 p0 = position_f + convert_float4(CHILD_MIN_OFFSETS[e0]);
  const float4 p1 = position_f + convert_float4(CHILD_MIN_OFFSETS[e1]);
  // this works due to the layout EDGE_VERTEX_MAP, the first 4 elements are the X axis
  // the next 4 are the Y axis and the last 4 are the Z axis
  const int axis = i / 4;

  const int4 hermiteIndexPosition = position + CHILD_MIN_OFFSETS[e0];
  const int edgeIndex = (EncodeVoxelIndex(hermiteIndexPosition) << 2) | axis;

  const uint dataIndex = Cuckoo_Find(edgeIndex,
   cuckoo_table, cuckoo_stash, cuckoo_prime,
   cuckoo_hashParamsA, cuckoo_hashParamsB, cuckoo_checkStash);

  if (dataIndex != ~0U)
   const float4 edgeData = edgeData[dataIndex];
   edgePositions[edgeCount] = mix(p0, p1, edgeData.w);
   edgeNormals[edgeCount] = (float4)(edgeData.x, edgeData.y, edgeData.z, 0.f);


 QEFData qef;
 qef_create_from_points(edgePositions, edgeNormals, edgeCount, &qef);
 leafQEFs[index] = qef;

 float4 normal = { 0.f, 0.f, 0.f, 0.f };
 for (int i = 0; i < edgeCount; i++)
  normal += edgeNormals[i];
  normal.w += 1.f;
 normal /= normal.w;
 normal.w = 0.f;

 vertexNormals[index] = normal;

The kernel operates over the compacted voxel data buffers generated in the previous kernel, and so is called with a 1D index. The local 3D position is recovered via PositionForCode which undoes the encoding by CodeForPosition. The edgeList allows the inactive edges to be skipped which saves doing an calculation for those edges. The data for the edge was calculated in the FindEdgeIntersectionInfo kernel, but in order to create the appropriate edge ID some calculations are required.

EDGE_VERTEX_MAP maps each of the 12 edge IDs to the pairs of vertices which are the edge end points, giving two indices in the [0, 8] range which then are used to offset the voxel's bounding box min to find the local space positions of the two edge vertices. In order to lookup the normal and intersection data for the edge the unique ID must recreated. To calculate the axis value (i.e. 0, 1 or 2) we can take advantage of the EDGE_VERTEX_MAP layout, dividing the edge index, range [0, 11], by 4 gives the correct axis index (i.e. in the [0, 2] range). Using the first edge vertex ID the local space position of the edge allows that position to be encoded along with the axis index to produce the unique edgeIndex.

The Cuckoo_Find function is a hashtable lookup. That's probably worth a post of its own so I won't go into any detail here, the important point is the this function maps the unique edgeIndex to an offset into the edgeData buffer, which we generated and compacted earlier. That lookup should never fail, but since an out-of-bounds read in an OpenCL kernel can/will kill the video driver I've guarded against it anyway. Retrieving the edgeData allows the edgePositions and edgeNormals arrays to be filled out.

Once all the active edges have been processed the QEF data structure is initialised with the edge data, but is not solved. The vertex normal for the node is calculated however, to save looking up the edgeData table in a later kernel (e.g. where the position is calculated).

Octree Construction

The GPU octree is quite different from the standard CPU approach. Rather than storing the nodes as structs with 8 pointer nodes for the node's children, a unique identifier is assigned to each node encoding the node's position in the octree (i.e. the 'min' bounding box value used frequently in the kernels) and the node's depth in the octree. Any node in the octree can then be found via a hashtable lookup (assuming the unique identifier for the node can be calculated).

The technique I use for this is lifted straight from this paper Fast generation of pointerless octree duals. The key idea is recognising that the octree is regular we can describe the position of any node relative to its parent in terms of which side of each axis's dividing plane the child lies on. Or in other words, it can be represented as a binary choice in each of the three axes. In the CPU implementation I use this regular structure to calculate the position of a node's child n like this:

 const ivec3 childPos = parentPos + (halfParentSize * CHILD_MIN_OFFSETS[n]);

const ivec3 CHILD_MIN_OFFSETS[8] =
 ivec3(0, 0, 0),
 ivec3(0, 0, 1),
 ivec3(0, 1, 0),
 ivec3(0, 1, 1),
 ivec3(1, 0, 0),
 ivec3(1, 0, 1),
 ivec3(1, 1, 0),
 ivec3(1, 1, 1)

Those are 3D coordinates, but if you take a closer look you should see that they are also the bit patterns that make up the range [0, 7] -- that means that we can represent the choice of any of the possible children as a 3-bit integer. Since we can record the choice of child at each branch we can generate a code to represent a node of any size and any depth. The size is implicitly defined by the depth of the node and the depth is encoded as the number of 3-bit "choice"/child numbers are in the bit string. The bit string (or code) is therefore a unique identifier made by appending 3-bit numbers to a bit string.

This encoding has a number of handy properties. To find the code for a given node's parent we simply left shift the code by three bits, removing the last branch choice. Similarly to construct the code for a node's child, simply right shift the code 3 bits and bitwise OR the 3-bit number corresponding to the child. The depth any given node (and so its size) can be calculated by counting the number of 3-bit numbers encoded in the bit string, i.e. depth = msb(code) / 3.

uint CodeForPosition(int4 p, int nodeDepth)
 uint code = 1;
 for (int depth = MAX_OCTREE_DEPTH - 1; depth >= (MAX_OCTREE_DEPTH - nodeDepth); depth--)
  int x = (p.x >> depth) & 1;
  int y = (p.y >> depth) & 1;
  int z = (p.z >> depth) & 1;
  int c = (x << 2) | (y << 1) | z;
  code = (code << 3) | c;

 return code;

The node depth is required when calculating the code as multiple nodes (but all with different sizes) have the same position, e.g. at (0,0,0) there would be a size=64 node, a size=32, etc. The code is generated top-down from the root so the depth is used to make sure the number of iterations is correct, without this then two nodes with the same position but different sizes would generate identical codes. For each level of the octree the appropriate bit is extracted from the 3D position and then used to generate the 3-bit value. These are all appended to the code as children as explained before. One thing to note is that the root node has a special value of simply '1', which means that the root node is always the most significant bit in any bit string.

int4 PositionForCode(uint code)
 const int msb = MSB(code);
 const int nodeDepth = (msb / 3);

 int4 pos = { 0, 0, 0, 0 };
 for (int i = MAX_OCTREE_DEPTH - nodeDepth; i < MAX_OCTREE_DEPTH; i++)
  uint c = code & 7;
  code >>= 3;

  int x = (c >> 2) & 1;
  int y = (c >> 1) & 1;
  int z = (c >> 0) & 1;

  pos.x |= (x << i);
  pos.y |= (y << i);
  pos.z |= (z << i);

 return pos;

The position is retrieved by first calculating the depth of the node and then doing the reverse of the previous function: bits for each axis are extracted from the bit string and the appropriate bit in the 3D position variable's values are set.

The leaf nodes for the octree have already been generated so it makes sense to build the octree upward. The idea here is the same as in the CPU version, each level of the octree is processed in turn to produce the parent node level until the root node is created. Calculating the parent codes is very simple:

kernel void GenerateParentCodes(
 const int childOffset,
 global int* nodeCodes,
 global int* parentCodes) 
 const int index = get_global_id(0);
 const uint childCode = nodeCodes[childOffset + index];
 parentCodes[index] = childCode >> 3;

There is a problem, however. It is extremely likely that any given node on any level has a sibling node, which means that multiple nodes will generate the same parent code. The duplicate values need to be removed from the array. On the CPU that's not much of a problem -- e.g. use a std::unordered_set -- but on the GPU that type of processing can be a bit fiddly. I won't go into detail here but I implemented the technique from Guy Blelloch's very useful paper "Parallel Algorithms", which allows a new buffer of unique values to be generated. Once the unique values are generated they can be concatenated to the buffer containing all the octree node codes.

 // generate the codes for every octree node by processing each level at a time.
 // each node generates a parent code, but as some nodes may be siblings the parent
 // codes will contain duplicates. remove those and then write the unique parent
 // codes to the buffer, and then process the newly generated codes to get *THEIR* parents
 for (int depth = MAX_OCTREE_DEPTH; depth > 0; depth--)
  nodeDepthOffsets[MAX_OCTREE_DEPTH - depth] = childOffset;

  cl::Kernel genParentCodes(g_octreeProgram, "GenerateParentCodes");
  CL_CALL(genParentCodes.setArg(0, childOffset));
  CL_CALL(genParentCodes.setArg(1, d_nodeCodes));
  CL_CALL(genParentCodes.setArg(2, d_parentCodes));
  CL_CALL(ctx->queue.enqueueNDRangeKernel(genParentCodes, cl::NullRange, numGenerated, cl::NullRange));

  unsigned int numUniqueParentCodes = 0;
  cl::Buffer d_uniqueParentCodes = RemoveDuplicates(ctx->queue, d_parentCodes, numGenerated, &numUniqueParentCodes);

  LVN_ALWAYS_ASSERT("Node buffer too small!", (numOctreeNodes + numUniqueParentCodes) < MAX_OCTREE_NODES);

  cl::Kernel writeParentCodes(g_octreeProgram, "WriteParentCodes");
  CL_CALL(writeParentCodes.setArg(0, numOctreeNodes));
  CL_CALL(writeParentCodes.setArg(1, d_nodeCodes));
  CL_CALL(writeParentCodes.setArg(2, d_uniqueParentCodes));
  CL_CALL(ctx->queue.enqueueNDRangeKernel(writeParentCodes, cl::NullRange, numUniqueParentCodes, cl::NullRange));

  childOffset = numOctreeNodes;
  numGenerated = numUniqueParentCodes;
  numOctreeNodes += numUniqueParentCodes;

Once the octree codes are inserted into a hashtable we implicitly have the connectivity data for the tree, i.e. for any given node a hashtable lookup can determine if any of the node's children exist. That's important as every non-leaf node's data is constructed from its children. All the generate parent node codes are written to the same buffer so that at the end of the loop the d_nodeCodes buffer contains the code of every node in the octree.

Constructing the octree one level/depth at a time also means that the d_nodeCodes buffer is implicitly sorted by node size, since the node of each level are appended each iteration. To take advantage of this the nodeDepthOffsets array is constructed along with the buffer. That allows the first node of any size to be found and the number of nodes with that size. This means its simple to process each level of the octree as a contiguous block in the buffer.

struct GPUOctree
  : numNodes(0)
  , numDrawableNodes(0)
  for (int i = 0; i < MAX_OCTREE_DEPTH; i++)
   nodeDepthOffsets[i] = 0;

 cl::Buffer  d_nodeCodes;
 cl::Buffer  d_vertexPositions, d_vertexNormals, d_nodeMaterials;

 int    nodeDepthOffsets[MAX_OCTREE_DEPTH];  
 int    numNodes, numDrawableNodes;

The GPU octree is represented by 4 buffers: the octree node codes/unique ids, and the "draw info", a vertex position and normal and the material info. Not all the nodes in the octree actually generate the data for the latter three buffers so numDrawableNodes records the size of these buffers separately from the over total numNodes.

kernel void ConstructBranchNodes(
 const int nodeOffset,
 global uint* octreeNodeCodes,
 global int* octreeMaterials,
 global QEFData* qefs,
 global float4* vertexNormals,
 global ulong* cuckoo_table,
 global ulong* cuckoo_stash,
 const uint   cuckoo_prime,
 global uint* cuckoo_hashParamsA,
 global uint* cuckoo_hashParamsB,
 const int    cuckoo_checkStash)
 const int index = nodeOffset + get_global_id(0);
 const uint code = octreeNodeCodes[index];
 QEFData nodeQEF;

 float4 normal = { 0.f, 0.f, 0.f, 0.f };
 int midSign = MATERIAL_NONE;
 int materials[8];
 int signs[8];

 for (int i = 0; i < 8; i++)
  materials[i] = MATERIAL_NONE;
  signs[i] = -1;

  const uint childCode = (code << 3) | i;
  const uint childIndex = Cuckoo_Find(childCode,
   cuckoo_table, cuckoo_stash, cuckoo_prime,
   cuckoo_hashParamsA, cuckoo_hashParamsB, cuckoo_checkStash);

  if (childIndex != ~0)
   QEFData childQEF = qefs[childIndex];
   qef_add(&nodeQEF, &nodeQEF, &childQEF);   

   float4 n = vertexNormals[childIndex];
   normal.x += n.x;
   normal.y += n.y;
   normal.z += n.z;
   normal.w += 1.f;

   int childMaterial = octreeMaterials[childIndex];
   int childCorners = childMaterial & 0xff;
   midSign = (childCorners >> (7 - i)) & 1;
   signs[i] = (childCorners >> i) & 1;

   materials[i] = childMaterial >> 8;

 int corners = 0;
 for (int i = 0; i < 8; i++)
  if (signs[i] == -1)
   // Undetermined, use centre sign instead
   corners |= (midSign << i);
   corners |= (signs[i] << i);

 qefs[index] = nodeQEF;

 normal /= normal.w;
 normal.w = 0.f;
 vertexNormals[index] = normal;

 const int materialIndex = FindDominantMaterial(materials);
 octreeMaterials[index] = (materialIndex << 8) | corners;

This is basically a straight port of what Octree_Simplify does in the Dual Contouring sample, so I won't go into to much detail. The main difference is that rather than using recursion to examine the child node pointers, the child code is generated and a hashtable lookup is performed. As with GenerateParentCodes this kernel is called in a loop with each iteration generating the next iteration's data. Notice that index is calculated with an offset, this is because the data for the nodes is stored in a single buffer with each level being appended as a contiguous block.

If you've being paying attention you'll have noticed that the vertex position has not yet been calculated. In both the leaf and branch node generation the QEF was initialised, but not solved. There's one final kernel to complete the octree generation.

kernel void SolveQEFs(
 const float4 worldSpaceOffset,
 global QEFData* qefs,
 global float4* solvedPositions)
 const int index = get_global_id(0);

 QEFData qef = qefs[index];
 float4 pos = { 0.f, 0.f, 0.f, 0.f };
 qef_solve(&qef, &pos);
 solvedPositions[index] = (pos * LEAF_SIZE_SCALE) + worldSpaceOffset;

Mesh Generation

As with the CPU implementation once the octree is generated we have all the data necessary to generate the mesh: the octree nodes provide the connectivity data required to generate the triangles and the each drawable node has an associated vertex and normal. The octree structure is used to generate the meshes for each level of detail -- I'll start with the simplest case which is generated LOD0 which is producing a mesh using the leaf nodes of a single chunk.

This is actually a simpler case than the reference CPU implementation (and so my sample) solves. That version allows configurations where leafs of different sizes are used to generate the mesh. That enables the vertex clustering approach to simplifying the mesh but isn't required in this use case where the simplification is a separate process. Instead of recursively processing an octree we can simply iterate over every leaf node in the octree and process an edge along each axis.

constant int4 EDGE_NODE_OFFSETS[3][4] =
 { (int4)(0, 0, 0, 0), (int4)(0, 0, 1, 0), (int4)(0, 1, 0, 0), (int4)(0, 1, 1, 0) },  // x axis
 { (int4)(0, 0, 0, 0), (int4)(1, 0, 0, 0), (int4)(0, 0, 1, 0), (int4)(1, 0, 1, 0) },  // y axis
 { (int4)(0, 0, 0, 0), (int4)(0, 1, 0, 0), (int4)(1, 0, 0, 0), (int4)(1, 1, 0, 0) },  // z axis

kernel void GenerateMesh(
 global uint* octreeNodeCodes,
 global int* octreeMaterials,
 global int* meshIndexBuffer,
 global int* trianglesValid,
 global ulong* cuckoo_table,
 global ulong* cuckoo_stash,
 const uint   cuckoo_prime,
 global uint* cuckoo_hashParamsA,
 global uint* cuckoo_hashParamsB,
 const int    cuckoo_checkStash)
 const int index = get_global_id(0);
 const uint code = octreeNodeCodes[index];
 const int triIndex = index * 3;
 const int4 offset = PositionForCode(code);
 const int pos[3] = { offset.x, offset.y, offset.z };

 int nodeIndices[4] = { ~0, ~0, ~0, ~0 };
 for (int axis = 0; axis < 3; axis++)
  trianglesValid[triIndex + axis] = 0;

  // need to check that the position generated when the offsets are added wont exceed 
  // the chunk bounds -- if this happens rather than failing the octree lookup 
  // will actually wrap around to 0 again causing weird polys to be generated
  const int a = pos[(axis + 1) % 3];
  const int b = pos[(axis + 2) % 3];
  const int isEdgeVoxel = a == (VOXELS_PER_CHUNK - 1) || b == (VOXELS_PER_CHUNK - 1);
  if (isEdgeVoxel)

  nodeIndices[0] = index;
  for (int n = 1; n < 4; n++)
   const int4 p = offset + EDGE_NODE_OFFSETS[axis][n];
   const uint c = CodeForPosition(p, MAX_OCTREE_DEPTH);

   nodeIndices[n] = Cuckoo_Find(c,
    cuckoo_table, cuckoo_stash, cuckoo_prime,
    cuckoo_hashParamsA, cuckoo_hashParamsB, cuckoo_checkStash);

  if (nodeIndices[1] != ~0 &&
   nodeIndices[2] != ~0 &&
   nodeIndices[3] != ~0)
   const int bufferOffset = (triIndex * 6) + (axis * 6);
   const int trisEmitted = 
    ProcessEdge(&nodeIndices[0], octreeMaterials[index], axis, &meshIndexBuffer[bufferOffset]);
   trianglesValid[triIndex + axis] = trisEmitted;

There are two input params for the kernel: octreeNodeCodes allows the 3D position to be recovered and octreeMaterials contains the "corner values" which is used to determine which edges of the voxel are active. Two output buffers are generated: when a pair of triangles is generated their indices are written to indexBuffer and the trianglesValid buffer will contain a 1 for every edge that emitted a pair of triangles and 0 zero otherwise. As with other kernels this allows the indexBuffer to be compacted so that it only contains the valid triangle indices. Since each kernel looks at 3 edges, and each active edge will generate 6 indices there are a few offset calculated in the kernel to make sure each kernel writes to the correct location.

The isEdgeVoxel test is a bit odd looking, so I'll explain the problem. First have a look at the EDGE_NODE_OFFSETS arrays, and you'll notice that in the first the X value is always 0, in the second the Y value is always zero and in the third Z is always 0. That means there is no possibility for the active axis to exceed the chunk bounds, but the other axis's values will be offset by 1 which means "the other two" axes need to be checked for the out-of-bounds, not the active axis. To do that I use modulo arithmetic to wrap the indices back around to 0 and so always select from a valid axis.

When I initially wrote this code I was aware of generating an out-of-bounds position being a possibility but I didn't think it would cause any problems. I thought that CodeForPosition on an out of bounds position would generate some identifier that wasn't in the hashtable and so the lookup would fail. That turned out to not be the case though. CodeForPosition only looks at a specific set of bits when generating the code (i.e. MAX_OCTREE_DEPTH bits) and in the case of an out-of-bounds position the bits CodeForPosition does look at may well lead to an entirely valid code being produced. I.e. if the coord == 63 (11111b) and the offset position is 64 (100000b) then CodeForPosition only examines the bottom 6 bits which mean the same code would be generated for 64 as for 0. That in turn allows some of the lookups to succeed and weird wrapping triangles are generated which go from one end of the mesh to the other.

As before where the CPU implementation uses the child pointers on the octree nodes the GPU implementation uses a hashtable lookup. The EDGE_NODE_OFFSETS table specifies the offset positions of the 4 nodes that share each edge and these are used to calculate the code for each node, which enables the the lookup. Note that the 4 nodes being in the octree isn't the only precondition for generating the triangle pair, and ProcessEdge may yet fail to emit any triangles into the buffer. This needs to be taken into account so that trianglesValid only has a 1 written when the triangle pair is actually emitted.

int ProcessEdge(
 const int* nodeIndices,
 const int nodeMaterial,
 const int axis,
 global int* indexBuffer)
 const int edge = (axis * 4) + 3;
 const int c1 = EDGE_VERTEX_MAP[edge][0];
 const int c2 = EDGE_VERTEX_MAP[edge][1];

 const int corners = nodeMaterial & 0xff;
 const int m1 = (corners >> c1) & 1;
 const int m2 = (corners >> c2) & 1;

 const int signChange = (m1 && !m2) || (!m1 && m2);
 if (!signChange)
  return 0;

 // flip the winding depending on which end of the edge is outside the volume
 const int flip = m1 != 0;
 const uint indices[2][6] = 
  // different winding orders depending on the sign change direction
  { 0, 1, 3, 0, 3, 2 },
  { 0, 3, 1, 0, 2, 3 },

 indexBuffer[0] = nodeIndices[indices[flip][0]];
 indexBuffer[1] = nodeIndices[indices[flip][1]];
 indexBuffer[2] = nodeIndices[indices[flip][2]];
 indexBuffer[3] = nodeIndices[indices[flip][3]];
 indexBuffer[4] = nodeIndices[indices[flip][4]];
 indexBuffer[5] = nodeIndices[indices[flip][5]];

 return 1;

To understand the edge lookup calculation we need to go back to the CPU implementation:
const int processEdgeMask[3][4] = {{3,2,1,0},{7,5,6,4},{11,10,9,8}};

void ContourProcessEdge(OctreeNode* node[4], int dir, ContourContext& ctx)
 int minSize = INT_MAX;
 int minIndex = 0;
 int indices[4] = { -1, -1, -1, -1 };
 bool flip = false;
 bool signChange[4] = { false, false, false, false };

 for (int i = 0; i < 4; i++)
  if (node[i]->type != Node_Internal)
   const int edge = processEdgeMask[dir][i];
   const int c0 = edgevmap[edge][0];
   const int c1 = edgevmap[edge][1];
   const int corners = node[i]->drawInfo->materialInfo & 0xff;
   const int m0 = (corners >> c0) & 1;
   const int m1 = (corners >> c1) & 1;

   if (node[i]->size < minSize)
    minSize = node[i]->size;
    minIndex = i;
    flip = m1 != 1; 

   indices[i] = node[i]->drawInfo->index;
   signChange[i] = (m0 && !m1) || (!m0 && m1);

 if (!signChange[minIndex])

 if (!flip)
  ctx.buffer->triangles_[ctx.buffer->numTriangles_++] = MeshTriangle(indices[0], indices[1], indices[3]);
  ctx.buffer->triangles_[ctx.buffer->numTriangles_++] = MeshTriangle(indices[0], indices[3], indices[2]);
  ctx.buffer->triangles_[ctx.buffer->numTriangles_++] = MeshTriangle(indices[0], indices[3], indices[1]);
  ctx.buffer->triangles_[ctx.buffer->numTriangles_++] = MeshTriangle(indices[0], indices[2], indices[3]);

Remember that the CPU implementation deals with octrees where the drawable nodes may have different sizes, and so there is a bit of code to find the smallest node, and use that that's signChange flag to determine if a triangle pair should be emitted. The GPU implementation only deals with the case where the nodes are of uniform size, in that case only the first node's flip and signChange flags matter. That means the kernel can skip the loop altogether and just use the material info for the first node which is passed in directly via the nodeMaterial parameter. If you take a look at the processEdgeMask arrays you'll see that the first values in each of the arrays are 3, 7, and 11. Since the X, Y and Z axes map to 0, 1, and 2 respectively the edge index can be calculated simply by (axis * 4) + 3. Using the edge indices we can determine if the edge contains a sign change by checking the materials either end of the edge -- if they are both the same then the surface doesn't cross this particular edge and no triangles are emitted. If there is a sign change the emitting the triangles is just a matter of emitting the node indices to the indexBuffer in the correct winding order.

Once the GenerateMesh kernel is finished there are two more steps required. Firstly the indexBuffer is compacted using the triangleValid buffer. As before this allows a new buffer to be created containing only the indexBuffer values that were actually generated. The newly generated buffer contains the index buffer data to be sent to OpenGL to render the mesh. Currently I use an interleaved vertex buffer in my renderer, so the final step is to generate a new buffer and write the existing data to it in the interleaved layout expected by my renderer.

struct MeshVertex
 float4  position, normal, colour;

kernel void GenerateMeshVertexBuffer(
 global float4* vertexPositions,
 global float4* vertexNormals,
 global int* nodeMaterials,
 const float4 colour,
 global struct MeshVertex* meshVertexBuffer)
 const int index = get_global_id(0);
 const int material = nodeMaterials[index];
 meshVertexBuffer[index].position = vertexPositions[index];
 meshVertexBuffer[index].normal = vertexNormals[index];
 meshVertexBuffer[index].colour = (float4)(colour.x, colour.y, colour.z, (float)(material >> 8));

The colour is specified when making the high-level call to generate the mesh, and is determined by the chunk size. Only three components are required for that so I also write the material index to the W component. This allows for the appropriate texture to be selected in the fragment shader.

That covers the process of generating a mesh for LOD0. For higher levels of detail the mesh is generated by selecting some of the nodes from multiple octrees. At each level of detail the depth of the nodes selected from the octree decreases such that for LOD1 the size=2 nodes are selected, for LOD2 the size=4 nodes are selected and so on. As the size of the selected nodes increases so does the number of octrees used, LOD1 uses (up to) 8 octrees, LOD2 uses (up to) 64, etc. The net effect of this is that we can consider the nodes selected for the mesh to be the leaf nodes of a hypothetical octree, and as such use the existing GenerateMesh kernel to produce the mesh.

There is a problem with this approach however, because the nodes selected for any LOD other than 0 will not actually be leaf nodes. In fact the only information we have about the selected nodes is the code which is in terms of the chunk octree, not this hypothetical octree we're using to generate the mesh. In order for the GenerateMesh kernel to work the selected nodes will need to be "rebased" into the hypothetical octree.

kernel void UpdateNodeCodes(
 const int4 chunkOffset,
 const int selectedNodeSize,
 global uint* nodeCodes)
 const int index = get_global_id(0);
 const int code = nodeCodes[index];
 int4 position = PositionForCode(code);
 position /= selectedNodeSize;
 position += chunkOffset;
 nodeCodes[index] = CodeForPosition(position, MAX_OCTREE_DEPTH);

The solution is to recover the position from the selected node's code and then calculate the position the node would have relative to the new hypothetical octree. Firstly the position is divided by the selectedNodeSize. This remaps the index from the [0, 63] range to the [0, (64/selectedNodeSize) - 1] range. I.e. for LOD1 the nodes position values will be multiples of two, and the selected node size is two so the division changes a position like (12, 24, 36) to (6, 12, 18). The chunkOffset is a local space offset relative to the hypothetical octree for the octree that the nodes belong to. This remaps the positions from the [0, (64/selectedNodeSize) - 1] range back into the [0, 63] range as if the nodes were leaf nodes on a single octree, not branch nodes on multiple octrees. E.g. the (6, 12, 18) coordinate from before may end up being offset against (0, 32, 0) and so be remapped to (6, 44, 18).

In order to actually generate the mesh the rest of the data has to be selected from the octrees too. The whole process in run in a loop generating a new set of buffers is created appending the data from octree in contiguous chunks. The nodeDepthOffsets array created during the octree construction allows the nodes of the desired size to be easily selected from each octree.

 int bufferOffset = 0;
 for (auto& pair: octrees)
  GPUOctree& octree = pair.second;
  const int level = glm::log2(octreeCountPerAxis); // i.e. 2, 4, 8, ...
  const int nodeOffset = octree.nodeDepthOffsets[level];
  const int nodeCount = octree.nodeDepthOffsets[level + 1] - nodeOffset;

  CL_CALL(ctx->queue.enqueueCopyBuffer(octree.d_nodeCodes, nodeBuffer->d_nodeCodes, 
   sizeof(cl_int) * nodeOffset, sizeof(cl_int) * bufferOffset, sizeof(cl_int) * nodeCount));
  CL_CALL(ctx->queue.enqueueCopyBuffer(octree.d_nodeMaterials, nodeBuffer->d_nodeMaterials, 
   sizeof(cl_int) * nodeOffset, sizeof(cl_int) * bufferOffset, sizeof(cl_int) * nodeCount));
  CL_CALL(ctx->queue.enqueueCopyBuffer(octree.d_vertexPositions, nodeBuffer->d_vertexPositions, 
   sizeof(cl_float4) * nodeOffset, sizeof(cl_float4) * bufferOffset, sizeof(cl_float4) * nodeCount));
  CL_CALL(ctx->queue.enqueueCopyBuffer(octree.d_vertexNormals, nodeBuffer->d_vertexNormals, 
   sizeof(cl_float4) * nodeOffset, sizeof(cl_float4) * bufferOffset, sizeof(cl_float4) * nodeCount));

  const ivec3& offset = pair.first;
  cl_int4 d_offset = { offset.x * childRegionSize, offset.y * childRegionSize, offset.z * childRegionSize, 0 };

  // the nodes' codes need to be re-encoded as if they were leaf nodes in an octree 
  // covering the clipmap node's volume to allow the mesh generation to work
  int index = 0;
  cl::Kernel k_UpdateNodeCodes(g_octreeProgram, "UpdateNodeCodes");
  CL_CALL(k_UpdateNodeCodes.setArg(index++, d_offset)); 
  CL_CALL(k_UpdateNodeCodes.setArg(index++, clipmapNodeSize / CHUNK_SIZE)); 
  CL_CALL(k_UpdateNodeCodes.setArg(index++, nodeBuffer->d_nodeCodes)); 
  CL_CALL(ctx->queue.enqueueNDRangeKernel(k_UpdateNodeCodes, bufferOffset, nodeCount, cl::NullRange));

  bufferOffset += nodeCount;


At this point we run into the same problem I covered in my first blog post: there are gaps between the meshes. To fix this seam meshes need to be generated between the neighbouring chunks. There are a few problems with generating the seams on the GPU. Firstly the mesh generation only works when the nodes are of uniform size, this is often not the case when generating seams (since neighbouring clipmap nodes may have different sizes). Supporting non-uniform node sizes would complicate the mesh generation code so I would rather avoid that. Additionally the seam mesh nodes are often required even when the mesh itself is not updated, e.g. when a neighbouring node changes LOD the seam will be regenerated but using existing seam nodes for any nodes that didn't change LOD. Finding the seam nodes for a LOD0 node would only involve looking up one GPU octree but still be quite slow, when you consider LOD1 would require 8 octree lookup and LOD2 64 lookups then this gets quite unmanageable quickly. Finally there's an overhead involved in doing any OpenCL calls and large data sizes are required to offset this, seam meshes often contain only a small number of nodes (e.g. a few hundred) and so doing the mesh generation of the GPU would be slow even if it was possible.

To avoid these problems I instead export the seam nodes when a mesh is generated. The nodes are then associated with the clipmap node that the mesh belongs to. This means that the clipmap approach covered in my engine overview post functions just as before. The seam nodes already exist on the CPU so they can be gathered and used to construct new octrees which are then contoured using the same reference implementation from my Dual Contouring sample.

To export the seam nodes they first have to be found. That's actually fairly simple, just check if any of the node's min is 0 or VOXEL_PER_CHUNK-1.

kernel void FindSeamNodes(
 global uint* nodeCodes,
 global int* isSeamNode)
 const int index = get_global_id(0);
 const uint code = nodeCodes[index];
 int4 position = PositionForCode(code);
 int xSeam = position.x == 0 || position.x == (VOXELS_PER_CHUNK - 1);
 int ySeam = position.y == 0 || position.y == (VOXELS_PER_CHUNK - 1);
 int zSeam = position.z == 0 || position.z == (VOXELS_PER_CHUNK - 1);
 isSeamNode[index] = xSeam | ySeam | zSeam;
The isSeamNode buffer is used to generate a new buffer, selecting nodes from the octree. Once all the data is selected the GenerateMesh kernel can be used as before.

But wait, there's more!

There are some other things to cover but this post is already huge so I'm going to stop here. Some of the details not covered are pretty basic, like the OpenCL/OpenGL interop code so I won't bother with those. The glaring omission from this post is the mesh simplification on the GPU and that is a whole other can of worms so I'll cover that in a dedicated post. The GPU hashtable and remove duplicates implementations are also worth a post too.

Thanks for reading!


  1. Rock on guy! That was an excellent read. Thanks for the effort on this one, I'm looking forward to the next couple.

  2. So am I reading it correctly, that you're still Generating the Octree Nodes all the way down to size of 1, reguardless of the LOD? I know it would make you generate the Octree again every time you need to change LOD levels, rather than just regenerate the mesh using a different level. But Couldn't you speed it up by Creating the Octree, and stopping where size is 2, or 4 for the leaf nodes?

    1. Yeah that's right, so I still store a 3D grid of chunks, each chunk has its own density field (not stored in memory unless there's been an edit currently) and octree.

      Well, I think I know what you're getting at. I don't actually construct the octree down, its constructed upward from the size=1 nodes. All the other nodes are constructed from the data in those nodes not from the density field, so that's why I create full octree each time.

      I did think about doing something like you suggested, i.e. if I only need the size=2 nodes for a chunk mesh why bother constructing the size=1 nodes first? The problem I was worried about was sampling inconsistencies. E.g. you might have two size=1 edges that are (empty, full) (full, empty), if you sampled those as a single size=2 edge then you'd get (empty, empty) and so no edge. You could perhaps work around it by doing multiple samples along the edge but it still seemed somewhat dodgy to me.

  3. Ya, turns out you're exactly right! If you go to far down that road, you end up with a minecraft / blocky terrain on the outside. Those (empty / full), (full / empty) nodes end up taking huge and oddly enough SQUARE divots out of the terrain. Unless i just completely screwed up the impl. I knew that i would loose data, but I figured it wouldn't matter if the mountain was a little skinnier than it would be when you got closer. People are used to seeing LOD artifacts as you get closer to them. but WOW at the result. It would make mojang proud.

    It really does speed up the creation though (i'm still at the very beginning here, so i'm still creating octrees top down). I made a classic clipmap 4x4, and as you go out in rings, they doubled in size. In Unity you're biggest hog is the actual creation of the mesh. I found a mesh size i could live with (16) and ran it out with no seams so that it was 8x8 and it rendered up the scene in 4 seconds. after i applied the idea of stopping the octree at successive node depths, i was able to cover a 64 x 64 area with the same ammount of proccessing time.

    Not a total failure, but not a completely successful one either. I'll put the clipmaps on a side bar for now, and get back to Bashing my head against getting seams working.

    Looking forward to the next posting!

    1. Hello there. I spotted this blog about a week ago, and I too have been working on adapting a Unity implementation. While my adaptation thus far very much still resembles ngildea's original implementation, I have been making changes.

      One change is that I've moved quite a bit of the chunk generation process to a Compute Shader. However... Quite a bit of it is still on the CPU. Such as the upward Octree construction, and generating the mesh (the vertices, normals, and indices), and the assignment of the mesh to the mesh renderer/collider components.

      I will also note that I too thought it would be wise to replace the simplification system and instead just lower the voxel resolution of the chunk during generation. It works, and it's faster, but it's definitely something worth changing at some point down the line.

      I also decided to get whatever I could of the chunk generation process off the main thread, putting all that octree construction and mesh generation stuff into another thread.

      I actually tried seeing if I could get the Compute Shader stuff into it as well, but nothing from the Unity API agrees with being on any other thread besides the main thread. And I also tried seeing if I could run the Compute Shader on the main thread in parallel to the other thread... It might of been a mistake I made, but doing so would cause it to get stuck in a dead lock.

      But still... Even though I couldn't figure out how to use multi-threading to speed up chunk generation speed, it did still off-load some of the chunk generation off the main thread, which prevents chunk generation from lagging the game play as much.

      Also want to note that for the highest resolution chunk (64 * 64 * 64), it spends about 0.02 to 0.04 seconds in the GPU, 0.10 to 0.13 seconds on octree construction and mesh generation, and about 0.02 to 0.03 seconds on assigning the mesh to the renderer/collider components. So if I manage to do something about the octree construction and mesh generation, I should have chunk generation at a good speed.

      Though I wouldn't say it's that bad right now anyway. I have it generating a fairly large range of chunks with a (admittedly really poorly implemented) procedural generation system and a really basic mesh modification system that has each chunk store an array of primitive based modifications (spheres and boxes right now). Whenever you make a modification to a chunk, I have it regenerate the entire chunk. Which has a short but still noticeable delay.

      This is my latest screenshot of the project.

    2. Nicely Done! I haven't even tried shaders yet. Its like a whole different language. I'd love to take a peek if you don't mind? It seems like you're getting CRAZY speed ups using the compute shader.

      I have a stupidly complicated system to multithread the chunk creation, mostly to get around Unity. I first, in the terrain manager (main thread), instantiate the chunk class, which creates a mesh object, all the components, etc (very fast as long as you dont assign data to it yet). Then the chunk spins off a thread to do the density functions, and octree generation. After this is done, it adds itself to a queue in the meshbuilder (which runs on the main thread again). The mesh builder cycles through all the chunks in its queue and actually assigns the verts / tris / etc to the mesh obj that was created before hand. I have this limited so that it only runs on Nth frame, and only 1 per cycle, so that you don't get a bunch back to back and lag it out.

      Its convoluted, and stupid, but it pulls the entire creation off the main thread. The only lag I ever get is when it assigns the data to the mesh. If I can somehow limit that... That was where my idea for using a "set" size for the clipmaps came from. find the largest that was acceptable, and then use multiples of it, so that the ones twice as big still had the same ammount of verts and tris...

      Keep us updated on the progress, ya?

      Nick, you need a forum! ;-)

    3. Tuck: you might be able to mitigate the sampling problems by doing multiple samples along each edge to determine if its active rather than just checking the ends. Something like how the crossing point is found now, but record a maxD and minD and then if maxD >= 0 and minD < 0 the edge is active. That will also make it quite a bit slower :)

      Re: the forum, interesting idea. A Dual Contouring support group :P I post this stuff to which is good but a lot of the discussions are about minecraft style engines which don't have much overlap with DC.

      Cody: sounds like you're making good progress :) By far the biggest speed up was when I did all the density field querying on the GPU, the difference was night and day.

    4. You know there's a reddit for DC too?

    5. Yeah, I posted a couple of links to it. There's only 34 users though compared to 1600+ on voxelgamedev so its pretty dead.

  4. In your code, you're doing some bitwise stuff for generating the field edges. I think i've got the code figured out for the most part. but some of your global variables are throwing me. Is your HERMITE_INDEX_SIZE = 6 ? and what about VOXEL_INDEX_SHIFT & VOXEL_INDEX_MASK?

    1. I should have included those in the post, sorry! HERMITE_INDEX_SIZE is actually (voxels per chunk dimension) + 1 so 65 for me currently. That explains the next bit, VOXEL_INDEX_SHIFT needs to be 7 since the hermite index is in the range [0, 64] not [0,63]. VOXEL_INDEX_MASK needs to mask the bottom 7 bits then so its (1 << VOXEL_INDEX_SHIFT) - 1.

      Here's the full set of defines in case that's not clear:

      // Not sure if I want to use the nodes with size==1 for the min size
      const int LEAF_SIZE_LOG2 = 1;
      const int LEAF_SIZE_SCALE = 1 << LEAF_SIZE_LOG2;

      const int VOXELS_PER_CHUNK_LOG2 = 6;
      const int VOXELS_PER_CHUNK = 1 << VOXELS_PER_CHUNK_LOG2;

      const int CHUNK_SIZE = 1 << CHUNK_SIZE_LOG2;



      // each material index has 3 possible edges (X, Y and Z)

      const int FIELD_DIM = HERMITE_INDEX_SIZE + 1;

      const int VOXEL_INDEX_SHIFT = (CHUNK_SIZE_LOG2 + 1);
      const int VOXEL_INDEX_NUM = (1 << VOXEL_INDEX_SHIFT);
      const int VOXEL_INDEX_MASK = (VOXEL_INDEX_NUM - 1);

  5. Thought I'd pop in and update the progress on the Unity version I've been working on.

    I've switched to Compute Shaders in Unity to do the work on the GPU, although this is definately out of my comfort zone. c# and c++ I'm pretty familiar with. OpenCL and whatever the Compute Shaders are written in (its like a mashup so far as i can tell.. more like HLSL or the ilk). those are definately new territory. And your OpenCL code doesn't port nicely at all.. lol.

    To get the speed ups in Unity you're seeing, i'm really going to have to learn a lot more, so i can tweak it. Things like how many thread groups, of how many dimensions.. Right now i'm just concentrating on using 1 TG, at 1x1 just to get the code to compile and i'll work on speeding it up later. Also, I loose out a lot of the Unity stuff if i leave the mesh completely on the GPU. It works. and its fast.. but things like unity physics and such.. it doesn't have a collider, or mesh bounds to work with. So i basically have to use the shader to build the verts and tris, pull them back out of the GPU and build the mesh in Unity. Since the largest ammount of time is spent on the Octree, this will still be a huge boost in performance when its all working.. but it introduces some limitations like how much data can be passed between the CPU and GPU and at what rate.. so i'll likely have to recode to accept multiple chunk sizes so that i can test what size meshes are optimal...


    1. Ah.... My apologies Tuck. I'll admit my focus has wavered a little but that doesn't excuse my neglect in checking back on this blog.

      I'll be honest here that I didn't change anything really with how chunks load since I lasted posted here. All I changed is these things.

      The density function to now compute noise to generate up to 3 different kinds of terrain, rolling plains, something resembling a hilly/bumpy terrain, and a mountain terrain. I also found that making it generate a spherical planet was quite easy.

      And I also significantly sped up my procedural generation process. It was a rather big cause for lag before, but with my changes it's at least good enough for now.

      And I started though haven't bothered finishing putting the seam filling into the procedural generation process. I didn't bother doing it correctly, just wanted to put in something quick to let me see if it was working at all.

      At the moment I have a 16 * 16 * 16 chunk grid for loading in surrounding chunks. That is an 8 chunk radius around the camera.

      These are two more recent screenshots.

      I had to make the "planet" small in the one showing the planetiness of what I'm generating because I needed it small enough to fit generally into the chunk loading range.

      In terms of taking a peak at my implementation, my only question is how do you think it would be best for me to share it? I've never really shared code before. And in speaking of which, it would probably be a fair warning that I don't really comment my code very often, since I personally have never had to work beside another programmer, and always remember my own code very well.

    2. In case I wasn't particularly clear. What I mean by haven't changed the chunk loading is that I still don't have the octree generation/contouring steps in the GPU yet. Those are still happening on a secondary CPU thread.

  6. You said you were working in Compute shaders though? if you're not doing the octree generation what have you put in the shader? When I disected my impl down, it seemed the octree generation was about 80% of the time it took to generate the terrain and attach a mesh. Of course, my development computer is the one that i have with me, not my desktop, so its possible i'm using skewed info here...

    I started working on a compute shader of my own after you mentioned it before. I have a perlin noise generator working completely on the GPU now. Its rough, but it fills out a field density that I can work with on the next stage. Flipping from Top down to bottom up generation of octrees has been a huge undertaking.. (Nick, you must be a genious ;-) )

    As for sharing of code there are several options. You can use if you want to share a snippet, or a single class / shader etc.. If you want to share a unity package / source code of many files you can upload to, or All that's left is to drop a line wherever you'd like to share your code and tell people how to get to it. For example, my code that I shared with the blog here is located at

    1. Okay so I put it up on Github. It was my first time setting up a repository on Github, but I hope I haven't made too much of a fool of myself.

    2. Ah... I should note that what I have there is the full project. And should also say that a few scripts in there that I didn't bother deleting yet are actually useless right now.

      For one The QEF.compute doesn't actually have anything in it. I just sorta made that as I was first trying to start using Compute Shaders and sorta forgot to delete it.

      And the QEF.cs SVD.cs PerlinNoise.cs and SimpleNoiseGenerator are also useless since I moved all the QEF/SVD and noise stuff to the compute shader.

    3. *Facepalm* Oh and the DensityFunctions.cs is also useless.

    4. Actually, looking at my code from this perspective, I'm now realizing how much of an utter mess it is.

      You might notice I have 3 different functions in the octree class calling the compute shader stuff. The first one is actually my old
      non-multithreaded solution that, while I have at least been some what updating it as I changed the others, it probably doesn't actually
      work anymore. The other two are actually pretty much identical. The only difference is that one tells the secondary thread to "reload"
      the chunk rather than load it up like new. I really should have changed it up to a large portion of that code by now rather than having
      pretty much the exact same code copied like that.

    5. to re-use a large portion* did that portion get deleted, I swear I had re-use typed in there.

    6. Thats a whole bunch of work you've done there. Mine is not nearly as far along. I like the spheroid density function you're using. The compute shader is great reference though. I've been needing something to get me really going.. time to disect one that works and see where I need to be starting! Thanks for the upload!

  7. Hey Nick, any chance you can share your FindDominantMaterial function? OpenCL is really slow when you do branching logic.. I've tried several times to get a function out that does it without doing if statements, but i'm comming up short. I'm basically doing 2 nested for loops, where i do a "count = (currentMat == cornerMaterials[i] ? count : count++);" and then another where i check to see if count is bigger than prevCount, and if it is i re-assign the currentMat to the larger count.. just seems really convoluted when I stare at it, even if it does work..

    1. Sorry I don't remember seeing this comment and it's a year old but here's the code:

      It's not particularly great tbh, but it works.

  8. I think there's one glaring problem with your approach. Which is the dependency on having the density function available in two separate kernels. That creates some pretty substantial limitations in what you can do. Barring adding some sort of atlasing scheme of SDFs, you're basically not going to have triangle meshes as inputs ... you're definitely not multipassing the field construction either. (With some changes to FindIntersectionInfo it's probably possible though)

    Since I was only concerned about "works well as a design tool" and not "real-time in a game" I took a different approach.

    - First, I don't store material, I store the actual density value (I handle materials differently for my tool [particle spray paint]).

    - I generate the density field at higher resolution than I need in a single kernel (or multiple passes as needed against that 129x129x129 grid).

    - Then I scan the voxels and output their corners bitmask based on the density field (pretty much exactly as done in the CPU reference implementation) which is a 64x64x64 grid.

    This alone brings it down to 500ms on an Intel HD4000 in a debug build because CPU side can early out when building the octree.

    - For the last speed up I run one last kernel to calculate the edge intersection points, not much differently than you have only I'm working entirely off of the higher resolution density field instead of resampling density.

    Along with some restructuring (eventually just invoking delete on nodes became 11% of the time running) it was pretty easy to hit 250ms with only light work (which is as fast as I care to be because that's fast enough for user editing).

    If I cared to go faster I'd recursively divide and run a kernel to calculate early outs for octree nodes. (Eventually I will do that though, just for "padding")

    Again, that's 250ms on an Intel HD4000 in a 1st gen Surface Pro ... that's basically instant on "real" hardware.

    Compared to the 1.4 seconds I was stuck at in a CPU implementation a huge win for very little work. Probably could've been faster there had I moved everything to SIMD/SSE sooner than I did.


    For the density function I use the visitor pattern to visit my CSG tree and generate the shader code for the density function whenever the structural nature of the tree changes (ie a sphere is changed to a cube, something is added, changed to subtractive instead of additive, etc), the parameters and transforms of the shapes are written into a buffer. The density functions for each shape increment an index through that buffer as they're called.

    I don't have to rebuild the shader code that often since most changes are usually just sizes, or positional - and that data comes from the stream of float input data.

    On the CPU evaluating density was over 70% of my time. When I first moved to OpenCL my performance took a complete shit (trilinear filtering is expensive), so the summary is I took a different road of providing using OpenCL to help me find escape routes for the CPU side rather than actually porting and adapting things to OpenCL.

    I appreciate this detailed overview of what you've done. I hadn't considered the QEF to be worth moving to OpenCL but I might give it a try now, since looking at your code it looks easier than I had expected.

    Great post.

    1. Thanks for the feedback. The impl described here is out of date since I no longer construct octrees, instead I just create what would be the leaf nodes for the tree. This makes the whole thing much, much faster -- you can see this in the latest video.

      Storing the density value is probably a bad idea. It'll work some times for simple shapes but once you start introducing rotations etc I expect you'll start to get lots of artifacts. Here's some shapes to try with your renderer to see what I mean:

      The CSG tree thing sounds interesting, that's something I've not looked at currently and the main bottleneck in my impl is that I'm just brute forcing the density functions for all the shapes each time so this starts to get slow as the number of functions increases. mediaMolecule's Dreams engine solves that by reducing the number of operations required for each voxel to the absolute minimum but the implementation is (as I understand it) quite tricky when you start including things like the 'smooth min' function for blending shapes.

      I think No Man's Sky work around the multiple density functions problem (somewhat) by using a super elliopsid type algorithm, that's something you could maybe have a look at. That would let you control the shape of the objects as well as their size via params.

    2. And this post might have some more useful info if you've not read it:

    3. Yeap, I've got some issues. Not so bad on the 129^3 grid (ribbing along the edges), but if I drop it down to 65^3 the PaniqQ shape's edges turn into a saw toothed mess. Might have to give something else a whirl.

      Actually haven't run into problems with transforms that weren't grid related (which is still a problem in the CPU version), most of the issues I've run into have pertained to small shapes (< 15 units [on the grid]).

      Building the shader probably wouldn't be that bad using linking. Right now I just feed three sets of sources (density functions, the generated density function from the CSG tree, and the file for the kernel that builds the field), so it's not quick enough to consider outside of a GUI editor (around ~150-200ms). I'd assume that if just compiling a program for the generated density function and linking the three into a single program it wouldn't be that bad at all.

      While it's nice for the basic shapes, dealing with polygon objects (SDFs) is still doesn't fit in as naturally. It's tempting to just move everything over into actual voxels of the hermite data and just be done with it.

  9. Hi, I'm currently trying to write a GPU voxel engine for certain deformable objects in our game, and your code has been super useful (I'm still a bit fresh in GPU programming, but I'm getting there, super interesting!)

    I'm a bit stuck on a bit in your sample code and what it does exactly which is the function "EncodeVoxelIndex", it might be a tired brain, but I can't work out what this does with the bits from the hermite edge index, I see that the axis is added onto the end of the int, but how is the hermite edge index transformed?

    Thanks again, this has been a massive help!


    1. Hi, I'm glad it's been useful :)

      Here's the function since I never included it in the original post:

      uint EncodeVoxelIndex(int4 pos)
      uint encoded = 0;
      encoded |= pos.x << (VOXEL_INDEX_SHIFT * 0);
      encoded |= pos.y << (VOXEL_INDEX_SHIFT * 1);
      encoded |= pos.z << (VOXEL_INDEX_SHIFT * 2);

      return encoded;

      VOXEL_INDEX_SHIFT should be enough bits to hold the index range for your voxel grid, plus 1. E.g. if you have 64 voxels per chunk then you need 6 bits to hold the 0-63 range, but you should set VOXEL_INDEX_SHIFT to 7 which I'll explain below.

      So EncodeVoxelIndex is used to build part of the unique ID for each edge in the octree which is (voxel_index << 2) | axis_index. That means that the edges in the grid can all be a) uniquely identified and b) easily queried since the key for any edge can be built like this (useful for finding which edges are active in a voxel, for instance).

      There's a small problem with this though, which is why you need the "plus 1" above: for the last row of voxels (e.g. with one X, Y or Z coord = (VOXELS_PER_CHUNK - 1)) then we can't encode some of the edge indices as they would fall outside the range.

      E.g. say you have VOXELS_PER_CHUNK = 64 then the voxel at (63, 0, 0) would need to look at 12 edges, not all of which are inside the 0-63 range. Using an extra bit in the shift means we can encode edges at (64, ?, ?) etc.

      That's it really, the unique IDs are created for the edges so they can be queried later.