Rendering HUGE Amounts of Voxels II

A year ago I wrote a post with the same title. Now I am back on the project Monolith, and the 20ms frame time I got before was just unbearable. Observing that components consist of the same voxel patterns again and again helped a lot in implementing a new hybrid approach: geometry-shader-spawned cubes with ray-marched volume textures.

Ray Marching Voxel Textures

As before I use optimized geometry shader instancing for boxes. But now I draw boxes down to the component level only (a component is a functional unit in Monolith, e.g. a thruster or a laser). Below that level the geometry is static, so it is affordable to store dense information for it, i.e. volume textures.

The laser weapons in the front demonstrate the effect of neighbor-sensitive sampling

Visualizing volume textures with ray marching is common practice and straightforward. Since I wanted a blocky look, the ray can step from plane to plane in the uniform grid. The traversal can stop as soon as a non-zero entry is found in the texture. It is possible to march through the texture without maintaining a floating point ray position. To do so, the initial texel position is determined, and then in each iteration one of the three dimensions is increased or decreased depending on the sign of the ray direction. The right dimension to modify is the one that is closest to its next grid plane in ray direction. Besides changing the texture coordinate, the plane distances must be updated, which is a simple subtraction of the distance of the step. The dimension in which the step was done then has distance 0, so it is reset to the full projected distance between two planes.
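The traversal described above can be sketched on the CPU in C++. This is only an illustration, not the shader from Monolith: `VoxelGrid`, `Hit` and `marchGrid` are hypothetical names, the grid stands in for the volume texture, and the ray origin is assumed to lie inside the grid in voxel units.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical dense voxel grid standing in for the volume texture.
struct VoxelGrid {
    int size;                        // edge length in voxels
    std::vector<std::uint8_t> data;  // size^3 entries, 0 = empty

    std::uint8_t at(int x, int y, int z) const {
        return data[(z * size + y) * size + x];
    }
};

struct Hit { bool found; std::array<int, 3> voxel; };

// origin is given in voxel units and must lie inside the grid;
// dir need not be normalized but must not be the zero vector.
Hit marchGrid(const VoxelGrid& grid, std::array<float, 3> origin, std::array<float, 3> dir)
{
    assert(grid.size > 0);
    assert(dir[0] != 0.0f || dir[1] != 0.0f || dir[2] != 0.0f);
    const float inf = std::numeric_limits<float>::infinity();

    std::array<int, 3> pos, step;
    std::array<float, 3> full, dist;  // ray distance between two planes / to the next plane
    for (int i = 0; i < 3; ++i) {
        pos[i]  = static_cast<int>(std::floor(origin[i]));
        step[i] = dir[i] > 0.0f ? 1 : -1;
        if (dir[i] == 0.0f) { full[i] = dist[i] = inf; continue; }
        full[i] = 1.0f / std::fabs(dir[i]);
        const float toPlane = dir[i] > 0.0f ? (pos[i] + 1 - origin[i]) : (origin[i] - pos[i]);
        dist[i] = toPlane * full[i];
    }

    while (pos[0] >= 0 && pos[0] < grid.size &&
           pos[1] >= 0 && pos[1] < grid.size &&
           pos[2] >= 0 && pos[2] < grid.size) {
        if (grid.at(pos[0], pos[1], pos[2]) != 0) return {true, pos};

        // Step in the dimension whose next grid plane is closest.
        int axis = 0;
        if (dist[1] < dist[axis]) axis = 1;
        if (dist[2] < dist[axis]) axis = 2;

        // Subtract the step distance from all plane distances ...
        const float d = dist[axis];
        for (int i = 0; i < 3; ++i) dist[i] -= d;
        // ... and reset the stepped dimension to the full plane spacing.
        dist[axis] = full[axis];
        pos[axis] += step[axis];
    }
    return {false, pos};  // left the grid without hitting anything
}
```

Inside the loop only the integer position and the three plane distances are updated; a floating point ray position is needed only once for the setup.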

A fragment shader performs the ray marching. In addition to the ray marching explained above, a voxelMask is introduced. This mask changes the appearance of the component depending on its neighborhood. It is a code with one bit for each side and an additional bit for voxels that do not depend on any side. The texture contains a mask of the same kind, too. Hence, a simple logical AND decides whether a voxel is visible or not. Additionally, the geometry shader computes a LOD (mipmap level) depending on the view distance. Doing this in the fragment shader would cause artifacts, because a LOD seam could run through a single component, which would create little holes.
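The mask test can be written down in a few lines of C++. The bit assignment and the names `SideBit` and `isVoxelVisible` are assumptions for illustration; the post only states that there is one bit per side plus one side-independent bit and that a logical AND makes the decision.

```cpp
#include <cstdint>

// Hypothetical bit layout: one bit per cube side plus one bit for voxels
// that are shown regardless of the neighborhood.
enum SideBit : std::uint8_t {
    LEFT   = 1 << 0,
    RIGHT  = 1 << 1,
    BOTTOM = 1 << 2,
    TOP    = 1 << 3,
    FRONT  = 1 << 4,
    BACK   = 1 << 5,
    ALWAYS = 1 << 6
};

// texelMask: value sampled from the volume texture (0 = empty voxel).
// componentMask: mask of the component instance, derived from its neighborhood.
inline bool isVoxelVisible(std::uint8_t texelMask, std::uint8_t componentMask)
{
    return (texelMask & componentMask) != 0;
}
```

With such a layout a voxel tagged `ALWAYS` is visible whenever the component mask has that bit set, while a voxel tagged only with `RIGHT` appears only for components whose mask contains the `RIGHT` bit.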


Results - Incredible!

The gain is much larger than hoped. The new hybrid approach takes only a few milliseconds (3-4) on the notebook GPU GTX850M, where the old one took over 20ms. The new bottlenecks seem to be driver overhead (the GUI in the main menu also takes 3ms) and bad scheduling. Well, I immediately increased the component resolution from 8 to 16 voxels per edge, which ended in a bit more than 4ms for a really large number of voxels. Unfortunately the new approach is more sensitive to screen resolution, but even so it stays on the good side of 16ms. The image shows 62663 components (stone and water) with 16x16x16 voxels each. So the total theoretical number of voxels in the image is around 256,667,000 (a quarter gigavoxel), whereas the effective number is much smaller (there are only 1M pixels). However, the old render approach would have scheduled approximately 4 million voxels for the same image, which is much more than 62 thousand.

Approximately 250,000,000 virtual voxels (not all visible) in 4.45ms on a notebook GTX850M

Performance Optimal Vector Swizzling in C++

In a shader we can type vec3 v0 = v1.xxy * 2 and any other combination of x, y, z and w, depending on the length of the vector. The resulting vector need not have the same size as the source (in the example v1 could be a vec2) because components can be repeated. This is called swizzling and is really comfortable.
Vectors are everywhere in game projects, not only in the shaders. How can we get the same behavior in C++? Can we get it without losing performance? I wanted to understand if and how this can be done. There are two solutions available: the glm library from G-Truc and the CxxSwizzle library. I did not test the two libraries for their performance, but if you want swizzling you might take one of them instead of the header file I have written. Their advantage is that they implement more functions so far. However, I did not find explanations of how to solve the problem, so I will try to fill that gap.

Before we can start here are the problems to face:

  • (1) Access the elements in arbitrary order and count: v0.xxy + v1.xzy
  • (2) Write to a swizzled vector v1.yxwz = v0; where repeated elements are explicitly forbidden
  • (3) No memory overhead: a vec3 should have the size of 3 times its base type
  • (4) No computational overhead: a solution with multiple lines containing equivalent scalar operations should not be faster

First, there are two different possibilities to achieve the syntax v1.yxwz = v0; without brackets: macros and unions. You could also have a nested type, but then the expression would not return any address, and it is impossible to calculate things on the data of v without its address. In the case of macros you can hide functions like yxwz() which do whatever you want. The problem with functions is that they get complicated on the left-hand side, where we want them to return references to swizzlings. The example (2) should fill the vector v1 in a swizzled order and not compute things on some copy of v1. You might be able to solve that with template metaprogramming or explicit proxy objects. These are objects of another type containing a reference to the original type. Operators on them will always access the original elements in some type-dependent way. However, returning proxies might be too complicated for a compiler to optimize away. Furthermore, I do not like macros like x polluting all my namespaces!

The union Solution

In a union all members work on the same space. If each member has a different type and if there are operators for each we can do everything we want.

The types must be trivially copyable, otherwise it would not be possible to simply put them into a union. It is possible but not feasible to write that many types by hand, so we want the compiler to do this job by using templates.

The Swizzle-Proxy Template

The above class shows the basic idea of how to implement the operators for swizzling with exactly two elements: xx, xy, wx, ... . The template arguments A and B can be any index of elements in an underlying real vector. For the swizzle wx, A is 3 and B is 0, accessing two elements of a vec4.

Notice: the class itself does not have any members of its own! Instantiating it on its own would cause lots of access violations. Together with the union above, the this pointer becomes a pointer to m_data. That is why we can cast it in such an ugly way without fear.
Unfortunately, the compiler must create a new operator for each combination of swizzle types. This increases compile times heavily and cannot be avoided.
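To make the idea concrete, here is a reduced C++ sketch of a two-element proxy inside a union. It is not the code from the header: only `SwizzleProxy2`, `m_data`, `A` and `B` are names from the text, everything else (`Vec2`, `at`, the chosen operators) is an assumption. Formally, calling member functions on a non-active union member is not covered by the C++ standard, so the sketch relies on the behavior common compilers show in practice.

```cpp
#include <cstddef>

// Proxy for a swizzle of two elements. It has no data members: it is only
// meant to live inside the union below, where `this` points at m_data.
template <typename VectorType, std::size_t A, std::size_t B>
struct SwizzleProxy2 {
    float& at(std::size_t i)      { return reinterpret_cast<float*>(this)[i]; }
    float  at(std::size_t i) const { return reinterpret_cast<const float*>(this)[i]; }

    // Reading a swizzle produces a real vector.
    operator VectorType() const { return VectorType(at(A), at(B)); }

    // Writing stores into the selected slots of the original vector.
    SwizzleProxy2& operator=(const VectorType& rhs) {
        const float a = rhs.m_data[0], b = rhs.m_data[1];
        at(A) = a;
        at(B) = b;
        return *this;
    }

    // One operator instance is generated per combination of swizzle types.
    template <std::size_t C, std::size_t D>
    SwizzleProxy2& operator+=(const SwizzleProxy2<VectorType, C, D>& rhs) {
        const float a = rhs.at(C), b = rhs.at(D);  // read first, then write
        at(A) += a;
        at(B) += b;
        return *this;
    }
    // Note: assigning two proxies of the identical type is not handled here.
};

struct Vec2 {
    union {
        float m_data[2];
        SwizzleProxy2<Vec2, 0, 0> xx;
        SwizzleProxy2<Vec2, 0, 1> xy;
        SwizzleProxy2<Vec2, 1, 0> yx;
        SwizzleProxy2<Vec2, 1, 1> yy;
    };
    Vec2(float x, float y) : m_data{x, y} {}
};

// Requirement (3): the proxies must not add memory overhead.
static_assert(sizeof(Vec2) == 2 * sizeof(float), "unexpected padding");
```

With these definitions `Vec2 w = v.yx;` copies the elements in swapped order, and `v.yx = Vec2(5.0f, 6.0f);` writes 6 into the first and 5 into the second element of `v`.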

So far we can use the class the following way:

The second line would also compile but behave weirdly. It would add v2.z and v2.x to v1.x successively. To avoid that, we can cause the compiler to fail with the following trick:

Depending on how the indices are chosen, the return type is either SwizzleProxy2 as before or struct OperationNotAvailable, which is defined nowhere. In the second case the compiler cannot create the function and will give you an error message which contains "OperationNotAvailable" at some point.
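One way to express such a return type switch is std::conditional from <type_traits>. The following C++ sketch is an assumption about how it could look, not the original code; only `SwizzleProxy2` and `OperationNotAvailable` are names from the text, while `WriteResult`, `at` and `Vec2` are made up for the example.

```cpp
#include <cstddef>
#include <type_traits>

struct OperationNotAvailable;  // declared but intentionally never defined

template <typename VectorType, std::size_t A, std::size_t B>
struct SwizzleProxy2 {
    float& at(std::size_t i)      { return reinterpret_cast<float*>(this)[i]; }
    float  at(std::size_t i) const { return reinterpret_cast<const float*>(this)[i]; }

    // Writable only if no element is named twice (A != B).
    using WriteResult = typename std::conditional<
        (A != B), SwizzleProxy2&, OperationNotAvailable>::type;

    template <std::size_t C, std::size_t D>
    WriteResult operator+=(const SwizzleProxy2<VectorType, C, D>& rhs) {
        const float a = rhs.at(C), b = rhs.at(D);
        at(A) += a;
        at(B) += b;
        return *this;
    }
};

struct Vec2 {
    union {
        float m_data[2];
        SwizzleProxy2<Vec2, 0, 0> xx;
        SwizzleProxy2<Vec2, 0, 1> xy;
        SwizzleProxy2<Vec2, 1, 0> yx;
    };
    Vec2(float x, float y) : m_data{x, y} {}
};

// v.xy += w.yx;   // fine: writes to two distinct elements
// v.xx += w.xy;   // rejected by the compiler: the return type is incomplete
```

Because member functions of a class template are only instantiated when they are used, the mere existence of `xx` in the union does not break the build; the error appears only where somebody writes to it.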

To implement all the different operators for all SwizzleProxyX classes, I tried to create a template-based collection of common operator implementations. The problem was that the compiler failed to optimize everything, so we need to do that ourselves for each of the (four) proxy templates. So the old CommonVectorOperator class currently contains the array access operator [] only. To still reduce the work a little bit, I used macros for code generation. The macro is undefined at the end of the operator section, so that there are no unnecessary symbols visible from outside. Just have a look into the code of the complete SwizzleProxy2 class.

Remark: The scalar-vector operators are implemented as friend. This is a trick in C++ to avoid having such functions in the global namespace. The compiler can still find the function by ADL (argument dependent lookup). For each different template argument setup of the proxy class there is exactly one such operator.
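The friend trick works for any class template, so it can be shown without the swizzle machinery. In this small C++ sketch the template `Vec` is a hypothetical stand-in for a proxy class:

```cpp
#include <cstddef>

template <typename T, std::size_t N>
struct Vec {
    T data[N];

    // Defined inside the class, so it is not declared at namespace scope.
    // It is found only through argument dependent lookup on a Vec<T, N>
    // argument, and each instantiation of Vec gets exactly one such function.
    friend Vec operator*(T s, const Vec& v) {
        Vec r = v;
        for (std::size_t i = 0; i < N; ++i) r.data[i] = s * v.data[i];
        return r;
    }
};
```

An expression such as `2.0f * a` with `a` of type `Vec<float, 2>` then resolves to the friend of exactly that instantiation.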

You might have noticed that the template takes a VectorType argument. This is required in the implementation of the non-assigning operators such as a simple +. These must return a new copy, which is only possible if the real vector type is known.

The Final Vector Class

If the final class did not inherit from the proxy class, operations on normal vectors would not succeed. Instead it would be necessary to write additional operators which take vector-swizzle, vector-vector and swizzle-vector arguments, but fortunately inheritance is much easier.

Then the union is filled with all access patterns up to vec4. As you can see, there are 30 of them for a vec2. For a vec4 itself this number grows to 340, because there are four instead of two possible indices for each element.
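The two numbers follow from counting all patterns of length 1 to 4 over n element names, i.e. n + n^2 + n^3 + n^4. A short compile-time check in C++ (the function name `swizzleCount` is made up for this example):

```cpp
// Number of swizzle patterns with 1 to 4 elements when each element
// can be any of n component names.
constexpr int swizzleCount(int n)
{
    return n + n * n + n * n * n + n * n * n * n;
}

static_assert(swizzleCount(2) == 30,  "vec2: 2 + 4 + 8 + 16");
static_assert(swizzleCount(3) == 120, "vec3: 3 + 9 + 27 + 81");
static_assert(swizzleCount(4) == 340, "vec4: 4 + 16 + 64 + 256");
```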

Without the last constructor we would not be able to use all the nice swizzling stuff fluently. Calling move(position.zyx) would fail because .zyx is not a vector (assuming move expects a vector). The implicit cast generated through this constructor rounds off the whole implementation.

Full Header: swizzle.7z
Currently the implementation lacks functions like normalization... They might follow later.

Rendering Huge Amounts of Voxels

As it happens, I am trying to render many voxels in my latest project Monolith (the project page is still pending, but you can look for it on GitHub).

What is a Voxel? A voxel is a small volumetric element - it is a pixel in 3D. The nice thing about voxels is that their logical arrangement in a 3D texture allows simple algorithms, e.g. for destruction. That is why I want to use them.

What is many? Well, Monolith is a space strategy game based on voxels. Therefore spaceships and planets and solar systems and ... are made of voxels. To store the earth in volume elements of 1 m³ each we would need about 10^21 of them. Even if we store only one byte for each, this requires 962081 PB of RAM or disk space. So, many is as many as possible.

Which approaches do exist?

First of all, I decided that I want to render voxels directly in the form of boxes. There are people who don't like that, and yes, there are algorithms such as marching cubes to create continuous surfaces, but I considered real cubes the way to go. They allow much more dynamic updates because not every change causes a recomputation of complex meshes. The following ideas could serve this task:

  • Ray marching (see Gigavoxels [1])
    It is possible to use octrees - which are the natural representation of voxel hierarchies - to efficiently ray-cast a 3D texture. But it's not exactly what I want to do. All models in the game have their own dynamic transformation, hence there would be many relatively small 3D textures, and I would have to test which objects are hit by a ray before using a sparse octree traversal. This approach would probably not be realtime.
  • Static vertex buffers with many cubes
    One could create quads for each surface side of a cube. Drawing static meshes is just the fastest thing a GPU can do. Updating the information is somewhat tricky because we need to know the neighborhood. I assume this to be the fastest option anyway, but it requires plenty of space. It is possible, but not probable, that I will benchmark that later.
  • Instancing
    The advantage is that only little data per cube is required on the GPU. Since there is no need to store real vertices (only one real cube mesh and a lot of instance information), this would be efficient. To update a local change, only the instance information must be overwritten or masked, and bus traffic should be low.
    Indeed, I had a reference implementation for many instanced cubes, but I wanted to test if this can be outperformed.
  • Geometry shader
    This is what I did and I will explain it in detail.

My Geometry Shader Advanced Instancing

It is common practice to create billboards in the form of quads inside a geometry shader. Why not whole cubes? Depending on the vertex normals (smoothed or per face) this requires either 16 or 24 vertices. Both are possible but effectively too large to be efficient. On the other hand, we can decide if a cube is visible before any triangle is created. It happens that backface culling for a cube will always remove one of two opposite faces. Therefore my shader emits at most 12 vertices for 3 quads.
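Why at most three faces survive can be seen from a small C++ sketch. It assumes an axis-aligned cube and a view direction given in the cube's own axes; the names `Face` and `frontFaces` and the bit assignment are hypothetical.

```cpp
#include <cstdint>

// Hypothetical bit assignment for the six cube faces.
enum Face : std::uint8_t {
    NEG_X = 1 << 0, POS_X = 1 << 1,
    NEG_Y = 1 << 2, POS_Y = 1 << 3,
    NEG_Z = 1 << 4, POS_Z = 1 << 5
};

// (vx, vy, vz) points from the camera towards the cube center. A face is
// front-facing if its outward normal points against the view direction,
// so per axis at most one of the two opposite faces survives.
inline std::uint8_t frontFaces(float vx, float vy, float vz)
{
    std::uint8_t mask = 0;
    if (vx > 0.0f) mask |= NEG_X; else if (vx < 0.0f) mask |= POS_X;
    if (vy > 0.0f) mask |= NEG_Y; else if (vy < 0.0f) mask |= POS_Y;
    if (vz > 0.0f) mask |= NEG_Z; else if (vz < 0.0f) mask |= POS_Z;
    return mask;
}
```

Each set bit corresponds to one quad with 4 vertices, which gives the upper bound of 12 emitted vertices.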

Spanning the cube

For a single voxel-based model all cubes have the same size and rotation. One could compute some offset vectors relative to a voxel center in the shader, but since these vectors are equal for all instances they can also be precomputed. To avoid two vector-matrix transformations, the vectors must be added in object space (before transformation) or in projection space (after transformation of the center). I did the second one.


In projection space frustum culling is easy. The homogeneous coordinate must be inside the unit volume: $-1 \le x/w \le 1$ and $-1 \le y/w \le 1$. Multiplying this by $w$ (for $w > 0$) gives the simple equations: $-w \le x \le w$ and $-w \le y \le w$. The z coordinate equation depends on which clipping planes are used. These are different: [0,1] in DirectX and [-1,1] in OpenGL. Using this test on the center would cause voxels to disappear too early. So I added an offset and "made the frustum larger".
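A clip space test of this kind could look as follows in C++. It is a sketch that assumes the OpenGL convention for z and w > 0; how the offset enters the test is also an assumption (here it simply widens all bounds), and `Vec4` and `insideFrustum` are hypothetical names.

```cpp
struct Vec4 { float x, y, z, w; };

// c is the voxel center after multiplication with the projection matrix.
// offset enlarges the frustum so that a voxel whose center is just outside
// is still kept while parts of it may be visible.
inline bool insideFrustum(const Vec4& c, float offset)
{
    const float w = c.w + offset;
    return -w <= c.x && c.x <= w
        && -w <= c.y && c.y <= w
        && -w <= c.z && c.z <= w;   // OpenGL: -w <= z <= w (DirectX: 0 <= z <= w)
}
```

No division is needed, which is the point of multiplying the inequalities by w.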

Then I added backface culling. This got a bit more complicated. I did not find a way to do this in projection space directly. A direction in projection space can be transformed to view space with a single multiplication (assuming you have the same projection matrix as I have) with (1/mProjection[0][0], 1/mProjection[1][1], 1/mProjection[2][2]). As a last trick I added a mask for rejecting faces which are "inside". If two voxels are neighbors, the two sides they share are never visible.

Remarks: Of course you need to emit normals, texture coordinates or whatever else you need to colorize your voxel. I do lighting on a per-voxel basis to reduce the export bottleneck. Furthermore, the code above is only for two of the six sides. It must be replicated for the other two directions.


Geometry shaders aren't that fast, especially if the amount of data increases. I encoded the invisibility mask and the position inside a single 32 bit integer value. There are 6 bits for the mask (one per side) and 3*8 bits for the position; the remaining 2 bits are currently unused. This allows addressing 256^3 positions, which leads to the need for a chunked approach for larger models.
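Such an encoding can be sketched in C++ as follows. The post only gives the bit counts, so the order of the fields inside the integer is an assumption, as are the names `packVoxel`, `unpackVoxel` and `UnpackedVoxel`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bits 0-7 x, bits 8-15 y, bits 16-23 z,
// bits 24-29 side mask, bits 30-31 unused.
inline std::uint32_t packVoxel(std::uint32_t x, std::uint32_t y, std::uint32_t z,
                               std::uint32_t sideMask)
{
    assert(x < 256 && y < 256 && z < 256 && sideMask < 64);
    return x | (y << 8) | (z << 16) | (sideMask << 24);
}

struct UnpackedVoxel { std::uint32_t x, y, z, sideMask; };

inline UnpackedVoxel unpackVoxel(std::uint32_t v)
{
    return { v & 0xFFu, (v >> 8) & 0xFFu, (v >> 16) & 0xFFu, (v >> 24) & 0x3Fu };
}
```

With 8 bits per axis a single value can address 256^3 positions inside one model or chunk.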

Monolith uses a chunk size of only 32^3 because, besides the technical necessity, this allows efficient memory management as well as easy LOD and large-scale culling algorithms. More information might come in a future post.


The image was made with a GeForce 640M LE. It draws 32368 voxels per millisecond. Let me know if you achieve higher rates with another approach!

Each different colored area is its own voxel