Loose Grids

Spatial data structures are an important part of game engines, with uses in systems such as collision detection, rendering and gameplay. Octrees, BSP trees and bounding volume hierarchies are just a few of the options available. However, for many applications these complicated hierarchical structures aren’t a good choice. Tree structures often have poor cache performance and aren’t well suited to SIMD or other parallel processing techniques. They can also have considerable cost when adding, removing or updating the positions of objects. Instead, setting up your data in a way that allows brute-force processing with highly optimized code may be faster, even if you typically end up processing a lot more objects.

The most cache friendly data structure is an array (or possibly a structure of arrays). It also allows trivial updating of objects, but for some data sets, putting all objects into a single array may be too slow. Most likely, we can process a lot of items in an array very efficiently, but the total number of objects is an order of magnitude too high to make this practical.

The next simplest method is a regular grid, where each cell stores an array of the objects inside it. Typically you can make the cells pretty big (to amortize cache misses and maintain efficient processing over the contents), and for decently distributed objects we end up processing many fewer objects. However, regular grids have some drawbacks. The first drawback is that objects might straddle multiple grid cells and need special handling. A typical solution is to store the object in every cell it overlaps, but this makes the book-keeping more complicated and can affect your ability to process cells in parallel. The second drawback is that grids have finite size, so you need to either prevent any objects from moving outside the bounds of the grid, or resize the grid. Resizing can be ugly since it involves touching everything in the grid.

The first drawback can be solved by borrowing an idea from Thatcher Ulrich’s loose octree data structure. Like the loose octree, we expand each cell of our grid by a fixed amount so that each cell overlaps its neighbors. When inserting an object into the grid, we place the object into any one of the cells that contains it. As long as the object is smaller than the overlap size, we only need to store it in a single cell.

This overlap scheme affects how we traverse the grid when performing operations on the objects. If we want to find all of the objects within a sphere, we need to test all of the grid cells whose expanded versions overlap the sphere. This means we’ll end up testing more cells than in the regular grid, but remember we are assuming that processing the objects is relatively fast due to our efficient array layout.

This grid works nicely if all of the objects are smaller than the overlap size, but if an object exceeds it, it can still fall into multiple cells. In most applications it will probably be reasonable to impose some sort of maximum size on objects. However, this hard limit can be transformed into a “soft” one by creating an additional cell where any objects above this maximum size are placed. In the example of querying for objects in a sphere, after we test all of the cells overlapping the sphere, we also test this “overflow” cell.

The overflow cell also allows us to deal with objects moving outside the grid. Like with the soft limit of object size, we assume that most objects will be within our predefined grid bounds, and put any objects which violate that constraint in the overflow cell. Since the number of outliers stored in that cell is small, the additional testing should be negligible.
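To make the bookkeeping concrete, here’s a minimal sketch of a loose grid with an overflow cell. All names and sizes here are illustrative, not from our engine: objects smaller than the overlap size go into exactly one cell, oversized or out-of-bounds objects fall into the overflow cell, and a sphere (here, circle) query tests every cell whose expanded bounds overlap the query, plus the overflow cell.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative loose grid: one object list per grid cell, plus an
// overflow cell for objects that are too big or outside the grid bounds.
struct Object { float x, y, radius; };

struct LooseGrid {
    static constexpr int   kCells    = 8;     // cells per axis
    static constexpr float kCellSize = 10.0f;
    static constexpr float kOverlap  = 2.0f;  // max radius for normal cells

    std::vector<Object> cells[kCells][kCells];
    std::vector<Object> overflow;

    void Insert(const Object& o) {
        int cx = (int)std::floor(o.x / kCellSize);
        int cy = (int)std::floor(o.y / kCellSize);
        bool outside = cx < 0 || cy < 0 || cx >= kCells || cy >= kCells;
        if (outside || o.radius > kOverlap)
            overflow.push_back(o);        // "soft" limits go to the overflow cell
        else
            cells[cx][cy].push_back(o);   // stored in exactly one cell
    }

    // Count objects intersecting a circle. We test every cell whose
    // *expanded* bounds overlap the query, then the overflow cell.
    int QueryCircle(float qx, float qy, float qr) const {
        int count = 0;
        auto testCell = [&](const std::vector<Object>& v) {
            for (const Object& o : v) {
                float dx = o.x - qx, dy = o.y - qy;
                float r  = qr + o.radius;
                if (dx * dx + dy * dy <= r * r)
                    ++count;
            }
        };
        for (int cx = 0; cx < kCells; ++cx) {
            for (int cy = 0; cy < kCells; ++cy) {
                // Expand each cell by the overlap amount before testing.
                float minX = cx * kCellSize - kOverlap;
                float maxX = (cx + 1) * kCellSize + kOverlap;
                float minY = cy * kCellSize - kOverlap;
                float maxY = (cy + 1) * kCellSize + kOverlap;
                if (qx + qr >= minX && qx - qr <= maxX &&
                    qy + qr >= minY && qy - qr <= maxY)
                    testCell(cells[cx][cy]);
            }
        }
        testCell(overflow);
        return count;
    }
};
```

Insertion and removal stay cheap, and a query walks contiguous arrays, which is exactly the property that makes the brute-force processing fast.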

Loose grids may not be the best candidate for all situations, but if you are looking for an easy way to spatially partition a data set into manageable parts, it can be a good option. We’ve had success with this setup for frustum/occlusion culling, broadphase collision detection and gameplay queries in the engine for Natural Selection 2.

First person view model rendering

In a first-person game, the player’s hand or weapon is often represented by a model called the “view model” in the terminology established by first-person shooters. One important consideration is ensuring that this view model does not intersect the world geometry when the player is close to a wall. It’s not always possible to limit the player’s movement so that this doesn’t happen, so special rendering methods are used to draw the view model.

The most obvious method is to draw the world first, clear the z-buffer and then draw the view model. This has a few disadvantages. The biggest one is that at the end of the drawing process you don’t have a single z-buffer describing the entire scene.

This is important if you are using a deferred rendering process where the light volumes will need to be tested against the z-buffer. This method also requires an additional clear of the z-buffer and doesn’t take advantage of any early z-rejection for pixels blocked by the view model.

Fortunately there’s a different method which is very simple and avoids these problems. By manipulating the projection matrix, we can map the view model and the world into mutually exclusive parts of the z-buffer. First a little refresher on projection and the z-buffer (assuming Direct3D conventions, OpenGL is slightly different):

The result of transforming a point (x, y, z, 1) by the projection matrix is a homogeneous vector (x’, y’, z’, w’). The z-buffer stores the value z’/w’, which is in the range 0 to 1 after clipping. For your run-of-the-mill projection matrix, w’ = z, which is what you’re probably saving to the G-buffer if you’re doing deferred rendering. Our manipulation of the projection matrix should not affect x’, y’ or w’, so that everything else continues to work properly.

To map to a sub-region of the z-buffer, we want to apply another transformation that remaps z’/w’ from the range [0, 1] to [a, b]:

$z''/w'' = z'/w' \cdot (b - a) + a$

or, since we require that w” = w’:

$z'' = z' \cdot (b - a) + w' \cdot a$

Finally, this can be easily expressed as a matrix multiplication:

$\left[ {\begin{array}{c} x'' \\ y'' \\ z'' \\ w'' \\ \end{array} } \right] = \left[ {\begin{array}{c} x' \\ y' \\ z' \cdot (b - a) + w' \cdot a \\ w' \\ \end{array} } \right] = \left[ {\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & b - a & a \\ 0 & 0 & 0 & 1 \\ \end{array} } \right] \times \left[ {\begin{array}{c} x' \\ y' \\ z' \\ w' \\ \end{array} } \right]$

To compute the final projection matrix, just pre-multiply the original projection matrix P with this adjustment matrix M:

$\left[ {\begin{array}{c} x'' \\ y'' \\ z'' \\ w'' \\ \end{array} } \right] = M \times \left[ {\begin{array}{c} x' \\ y' \\ z' \\ w' \\ \end{array} } \right] = M \times P \times \left[ {\begin{array}{c} x \\ y \\ z \\ 1 \\ \end{array} } \right]$

Using a matrix that maps to [0, ε] for the view model and a matrix that maps to [ε, 1] for the world allows you to ensure they never overlap each other. Just keep in mind that you will be reducing the precision of your z-buffer this way, so try to make ε as small as possible.
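Here’s a small sketch of that adjustment in C++ (row-major matrices, column vectors, Direct3D conventions; the struct and function names are just for illustration). The endpoints of the remapped range are easy to verify: a point with z’/w’ = 0 lands at a, and one with z’/w’ = 1 lands at b.

```cpp
#include <cassert>
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };

// Row-major matrix times column vector: v'' = M * v'.
Vec4 Mul(const Mat4& M, const Vec4& v)
{
    float in[4]  = { v.x, v.y, v.z, v.w };
    float out[4] = { 0, 0, 0, 0 };
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += M.m[r][c] * in[c];
    return { out[0], out[1], out[2], out[3] };
}

// Adjustment matrix that remaps z'/w' from [0, 1] to [a, b] while
// leaving x', y' and w' untouched (the matrix from the derivation above).
Mat4 DepthRangeMatrix(float a, float b)
{
    return {{ { 1, 0, 0,     0 },
              { 0, 1, 0,     0 },
              { 0, 0, b - a, a },
              { 0, 0, 0,     1 } }};
}
```

Pre-multiplying the real projection matrix by DepthRangeMatrix(0, ε) for the view model and DepthRangeMatrix(ε, 1) for the world keeps the two in disjoint parts of the z-buffer.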

Lua ‘local’ performance

The ‘local’ keyword in Lua is used to restrict the visibility of a variable in the Lua compiler, however it also affects how the code is executed by the VM. Ultimately this can have a significant impact on how efficiently the variable is accessed. Here are 4 simple examples with big performance differences due to the use of local.

Example 1

function bar() end

function foo() bar() end

In this case ‘bar’ is a global variable. Global variables are stored in a hash table, so every time bar is called a lookup is performed. Although hash table lookups are well optimized in Lua, this is not a simple operation.

Example 2

local function bar() end

function foo() bar() end

Here ‘bar’ is an up value of ‘foo’. If bar is only being used by foo (or by other functions in this file), this approach is better than Example 1 since we’re not polluting the global namespace. From a performance standpoint it’s also better; in this case bar will be stored as an up value of foo, which allows it to be accessed directly through a pointer. The inner body of this function runs about 1.25x faster than Example 1.

Performance-tuned Lua code will often cache a frequently used global function in a local variable to achieve a similar benefit.

Example 3

function foo()
    local function bar() end
    bar()
end

If ‘bar’ is used only by foo, it may be tempting to make it a local variable of foo as shown here. This however has some significant performance issues. Although bar is a local variable of the function (which is efficiently accessed by index), it will be initialized every time the VM enters foo. This means a new function closure will be allocated every time foo is called, creating more garbage. The body of this function runs about 2.7x slower than Example 1.

Creating a new closure every time may seem unnecessary, however functions are mutable in Lua (by calling setfenv for example) so it’s required for correct semantics.

Example 4

do
    local function bar() end
    function foo() bar() end
end

Finally, if you really want to make ‘bar’ only accessible by ‘foo’, you can do it as shown in Example 4. Here bar will be an up value to foo, so the performance is the same as in Example 2.

Text pre-processing

Text pre-processing is a useful technique for adding features to systems that use text files as input. Probably the most common example of pre-processing is the C pre-processor. This is the mechanism that handles comments, #includes and #defines before C/C++ code is fed into the compiler. These are features that could be useful in many different types of text files, and Boost Wave is one C pre-processor implementation that’s readily available.

In addition to using a “generic” pre-processor we can also create very domain specific pre-processing functions. Here are two examples.

We use the D3DX FX system for our shaders. One problem you can run into with shaders is requiring all sorts of different combinations of parameters, like does this shader use normal mapping, skinning, etc. Writing all of these combinations out by hand can be tedious and the FX system doesn’t directly have anything to support the concept of shader permutations.

technique WriteDeferred[AlphaTest][NormalMap][Skinned]
{
    pass p0
    {
        VertexShader = compile vs_2_0 WriteDeferredVS(Skinned);
        PixelShader  = compile ps_2_0 WriteDeferredPS(AlphaTest, NormalMap);
#if AlphaTest
        AlphaTestEnable = True;
        AlphaFunc       = Greater;
        AlphaRef        = 128;
#endif
    }
}

With this enhanced syntax, we instruct the pre-processor to generate all combinations of the WriteDeferred technique with the AlphaTest, NormalMap, and Skinned options. This results in 2^3 = 8 combinations:

WriteDeferred
WriteDeferredAlphaTest
WriteDeferredNormalMap
WriteDeferredAlphaTestNormalMap
WriteDeferredSkinned
WriteDeferredAlphaTestSkinned
WriteDeferredNormalMapSkinned
WriteDeferredAlphaTestNormalMapSkinned

Here’s what it expands to after our pre-processor runs on it:

technique WriteDeferredNormalMap
#define AlphaTest 0
#define NormalMap 1
#define Skinned 0
{
    pass p0
    {
        VertexShader = compile vs_2_0 WriteDeferredVS(Skinned);
        PixelShader  = compile ps_2_0 WriteDeferredPS(AlphaTest, NormalMap);
#if AlphaTest
        AlphaTestEnable = True;
        AlphaFunc       = Greater;
        AlphaRef        = 128;
#endif
    }
}
#undef AlphaTest
#undef NormalMap
#undef Skinned

Note that our custom pre-processor generates instructions for the standard C pre-processor that the D3DX FX system understands. Generating pre-processor #defines based on the permutation allows us to customize the body of the technique for each configuration. We also use it to alter the compilation of vertex and pixel shaders using compile-time constants.
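The core of the expansion is simple. Here’s a hypothetical C++ sketch (not our actual pre-processor) that generates the 2^n technique names for n optional flags; each flag doubles the set of names, producing the same order as the list above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Expand a technique name with optional "[Flag]" suffixes: every flag
// doubles the list, yielding 2^n names for n flags. The real tool would
// also emit each technique body wrapped in the matching #define/#undef.
std::vector<std::string> ExpandTechniques(const std::string& base,
                                          const std::vector<std::string>& flags)
{
    std::vector<std::string> names(1, base);
    for (const std::string& flag : flags) {
        size_t count = names.size();
        for (size_t i = 0; i < count; ++i)
            names.push_back(names[i] + flag);  // copy of each name with the flag on
    }
    return names;
}
```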

This is a simple example, where each of the “parameters” is either on or off, but since we have complete control in our pre-processor, we can make things as complex as we like. Here’s another example from our system, which allows us to specify a set of values to choose from (parentheses mean the parameter always has one of the values, brackets mean the parameter is optional).

technique Fill(Color,Textured)[Add,Multiply][UseStencil]
{
    // ...
}

One caveat to modifying the input to the FX compiler this way is that the line numbers will be incorrect if it reports an error. This is easily solved by maintaining a separate mapping from old line numbers to new ones. Using this mapping, any error messages that reference line numbers can be fixed up before logging them.

Lua Profiling

Here’s a piece of Lua code that’s marked up for profiling:

function Actor:UpdateBoneCoords()

    PROFILE("Actor:UpdateBoneCoords")

    local model = Shared.GetModel(self.modelIndex)

    if (model ~= nil) then
        local poses = PosesArray()
        self:BuildPose(poses)
        return
    end

    self:UpdatePhysicsModelCoords()

end

We use this same method for instrumenting our C++ code for profiling. In C++, PROFILE is a #define which creates a C++ object on the stack. When this object is constructed profiling begins, and when it’s destroyed (because it goes out of scope) profiling ends. In a ship build, the PROFILEs are #defined to nothing to eliminate any overhead from the final game.
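As a sketch of the C++ side (simplified; the real macro records timings rather than names, and the identifiers here are illustrative), the scope object’s constructor and destructor bracket the profiled region:

```cpp
#include <cassert>
#include <vector>

// Stand-in event log; a real profiler would record timestamps here.
static std::vector<const char*> g_events;

// Scoped profiling: the constructor marks the start of the region and the
// destructor marks the end when the object goes out of scope.
class ScopedProfile {
public:
    explicit ScopedProfile(const char* name) : name_(name) {
        g_events.push_back(name_);   // profiling begins
    }
    ~ScopedProfile() {
        g_events.push_back(name_);   // profiling ends automatically
    }
private:
    const char* name_;
};

#if defined(SHIP_BUILD)
    #define PROFILE(name)                              // compiled out in ship builds
#else
    #define PROFILE(name) ScopedProfile profileScope(name)
#endif

void UpdateBoneCoords() {
    PROFILE("UpdateBoneCoords");
    // ... actual work ...
}   // scope ends here, so the destructor stops profiling
```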

In Lua, there’s no way to directly implement this because we don’t have control over an object’s lifetime. There’s also no built-in pre-processor, so there’s no way to completely eliminate the PROFILE markers. Both of these problems can be solved with a custom pre-processing step. Our pre-processor searches for the PROFILE token and inserts function calls at its location, and at every location where we exit the Lua block that contains it.

Here’s what the pre-processed output looks like:

function Actor:UpdateBoneCoords()

    __profile_start(10)

    local model = Shared.GetModel(self.modelIndex)

    if (model ~= nil) then
        local poses = PosesArray()
        self:BuildPose(poses)
        __profile_end(10); return
    end

    self:UpdatePhysicsModelCoords()

__profile_end(10); end

In addition to inserting the __profile_start and __profile_end calls, the string “Actor:UpdateBoneCoords” was inserted into a string table and its index was substituted. This allows the profiler to run with as little overhead as possible. Handling all of the different syntax cases in Lua is a little complex, but the implementation is only around 150 lines of code.

With the Lua pre-processor, it was possible to maintain the line numbers using Lua’s ability to have multiple statements on a single line. Keeping the line numbers identical eases error reporting and Decoda integration.

Error Checking with Dynamic Typing

One of the common complaints in our office about Lua is that the use of dynamic typing means that most problems in the code aren’t discovered until run-time.

Since the types of variables aren’t known until you’re actually running the program, you won’t know ahead of time if you called a function with the wrong number of arguments or if you tried to take the square root of a string.

The problem with this argument is that even when using a statically typed language (like C++), there are still all kinds of problems that won’t be discovered until run-time. For instance, an off-by-one error in a loop, using an uninitialized variable, or any type of logical error. So while compile-time reporting of type errors is helpful, it’s not sufficient.

The way to catch these types of problems at “compile time” is to have a suite of unit and functional tests that exercise all of the code’s features. These tests can be run every time you compile, check in or refactor to ensure that the code is working properly.

Writing unit tests takes programming time, while static type checking happens automatically. Fortunately, you don’t need to write explicit type checking tests, since the functional tests will do this implicitly.

Eliminating Exporting in an Asset Pipeline

An asset (like a model) undergoes a number of steps in the journey from an artist’s tool to real-time display in a game engine. These steps are known collectively as the asset pipeline. The pipeline for a typical game engine looks something like this:

The proprietary format on the left is the format that the content creation tool natively saves. For 3D Studio MAX this is the .MAX format. This is an undocumented binary format that relies on installed plug-ins to load, so it’s not really possible or practical for our own tools to read it directly. In order for our tools to load the asset it first must be converted into an intermediate format.

The intermediate format is either a standard format like COLLADA or a custom-defined format that is easy for our tools to read. This file is exported from the content creation tool and stores all of the information we could possibly desire about the asset, both now and in the future. Normally this export process is done by selecting Export from the tool’s menu and then choosing the name of the file to write. Since the intermediate format is designed to be general and easy to read, it isn’t optimized for space or fast loading in-game. To create the game-ready format, we compile the intermediate format into a game format.

The game format is an optimized format that is loaded by the engine at run-time. The game format file is generated by our tools from the intermediate format. The main reason we don’t directly export from 3D Studio MAX into our game format is that we may want to change it in the future. If the game format was created by our exporter then any change to the format or creation process would require re-exporting all of our assets. Having an intermediate format also allows us to have different versions of the game format that are targeted at different release platforms (PC, console, etc.).

One annoyance with this setup is that the proprietary file and the intermediate file have to be kept in sync — that is, whenever a change is made to the .MAX file, the artist needs to re-export the intermediate format for the change to appear in game. Typically the proprietary file and the intermediate file would both be kept in some sort of revision control system such as Perforce. In addition to needing to be re-exported, the files need to be checked in as well. While both of these steps seem easy, on a large team with tens of thousands of assets, mistakes will happen.

The ExportObject in 3D Studio MAX

A solution that we used at Iron Lore was to automatically export the intermediate format when the file was saved in 3D Studio MAX. This was done with a special plug-in. The plug-in provided an object called an ExportObject that the artists would place in the scene. The object looked like a floppy disk (how quaint!) and the only thing it really did was implement a custom save handler. As mentioned earlier, the proprietary .MAX format requires plug-ins to load it, and the reason for this is that each plug-in is responsible for saving its own data. In the case of the ExportObject, during the “save” it would export the entire scene to the intermediate format.

There are a few options for how to save this data. The first is to save directly to a new file on disk. The second is to write the data in-place in the .MAX file (i.e. where the plug-in is supposed to save its data). The third is to append the data to the end of the .MAX file.

The first option is unattractive because we still have two files that would need to be kept in sync in Perforce. The third option is the one that we used at Iron Lore and is possible because 3D Studio MAX will ignore data at the end of the file when loading a .MAX file. If I were implementing this system again today, I would choose the second option because it doesn’t rely on this quirk of the loader.

Once the data is auto-exported, it’s a simple matter to load it in our own tools. In the case of appending the data to the end of the .MAX file, I included the length of the intermediate data as the final 4 bytes. This makes it very simple to seek backwards and read only the required data.
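Here’s a sketch of that footer scheme in C++, using an in-memory buffer to stand in for the file on disk (the function names and format details are illustrative): the intermediate data is appended after the native contents, followed by its length as the final 4 bytes, so a reader only has to look at the end of the file.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Append intermediate data to the end of a "file", followed by a 4-byte
// length footer.
std::string AppendExportData(const std::string& maxFile, const std::string& data)
{
    std::string result = maxFile + data;
    uint32_t length = (uint32_t)data.size();
    result.append(reinterpret_cast<const char*>(&length), 4);  // length footer
    return result;
}

// Read the footer, then seek backwards to the start of the appended data.
std::string ReadExportData(const std::string& file)
{
    uint32_t length = 0;
    std::memcpy(&length, file.data() + file.size() - 4, 4);
    return file.substr(file.size() - 4 - length, length);
}
```

The native loader never sees the appended bytes, and our tools never need to parse the native data in front of them.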

If the data is written in-place, then a special token needs to be used to locate the appropriate block of data. This token is simply a string of bytes that is unlikely to appear anywhere else in the file and signifies the beginning of the data. If this string is long enough, the chance of it appearing elsewhere in the file is astronomically small. If this doesn’t seem robust, consider that there is roughly a 1 in 10^8 chance that you will be struck by lightning on any particular day; there is a 1 in 10^68 chance that an arbitrary 8-byte string will appear in a typical sized .MAX file.

Since the ExportObject is a persistent member of the scene, it can also store options that might be necessary for exporting. Because these options are stored along with the .MAX file, the file can be re-saved and exported without the operator having to know or re-input the correct options.

This system worked very well at Iron Lore — in the future I’ll describe our first approach, which was much more problematic! The only thing to keep in mind is that the export must be fast so that it doesn’t noticeably slow down the saving process for the artists.

Specular Color in Light Pre-pass Renderer

Light pre-pass and inferred lighting have some nice properties that make them popular alternatives to traditional deferred rendering. Like a deferred renderer, a light pre-pass renderer generates a G-buffer and then accumulates the lighting using screen space operations. A light pre-pass renderer differs from a deferred renderer by only computing part of the lighting equation in the passes. The full equation is then computed in a final pass.

Here is the basic equation for computing the lighting for a surface using the Blinn-Phong reflection model:

$D \sum C_i (N \cdot L_i) + S \sum C_i (N \cdot H_i)^G$

D is the diffuse reflectance of the surface, S is the specular reflectance, G is the glossiness and N is the surface normal. These are all properties of the material and are constant over all of the light passes. The light-dependent parameters are L (the direction to the light source), C (the color of the light source) and H (the half vector).

With a light pre-pass renderer, the calculations are performed as they are organized in this equation. First the renderer loops over all of the lights computing the summed terms. Then a final pass is performed in which the lighting results are combined with the material parameters D and S.

By doing the lighting this way you can reduce the width of the G-buffer, since you don’t need to store the diffuse or specular components (we do however need the glossiness). In the case of inferred lighting, you can also compute the lighting at a lower resolution than the final image while still having full resolution textures.

A light pre-pass renderer is typically implemented using an RGBA render target, where each pass accumulates the diffuse part of the equation in the RGB channels and the specular part in the A channel. While we would ideally like to store the diffuse and specular components in 3 channels each, this requires writing into multiple render targets. Currently, support for alpha blending into multiple render targets is uncommon, and the result would require more bandwidth and memory. Since the specular component in our original equation is modulated by the light color, we lose information when we store it in a single channel of the render target.

Accounting for that single specular channel, here’s what we’re computing in the lighting passes:

$\begin{aligned} {target}_{RGB} &= \sum C_i (N \cdot L_i) \\ {target}_A &= \sum {intensity}(C_i) (N \cdot H_i)^G \end{aligned}$

One addition is the intensity function. This is used to convert the light color into a monochrome value so that dim lights will have less pronounced specular highlights. A photometric function could be used, or we could treat the color as a vector and compute the length; we’ll return to this later.

One way to reconstruct the lighting in the final pass is to scale the diffuse component by the material’s diffuse reflectance and the specular component by the specular reflectance and add them together:

$D * {target}_{RGB} + S * {target}_A$

Because the specular lighting component is monochrome, we’re computing the lighting as if the light sources were all white; the highlight will pick up the color of the material, but it won’t be affected by the original color of the light. If all of your lights are white — like in an office building perhaps — this is fine. However, when you have intense colored lights the results can look a little strange:

Fortunately, there’s a simple approximation we can apply to get much better results. Since the diffuse lighting includes the color of the light source encoded in it, with a little manipulation we can extract it and apply it to the specular component.

In the case of a single light source, the RGB portion of the render target simply contains the light color multiplied by the diffuse attenuation term. If we treat the color as a vector, we can decompose it into a unit vector and a magnitude:

$\begin{aligned} {target}_{RGB} &= C_0 (N \cdot L_0) \\ &= \hat{C_0} \left \| C_0 \right \| (N \cdot L_0) \\ \hat{C_0} &= {normalize}({target}_{RGB}), \quad \mbox{for} \; (N \cdot L_0) \neq 0 \end{aligned}$

The final equation shows the normalized light color extracted from the render target. Note that the required condition will always be true where we want the specular highlight to appear. If we use the length of the light color as our intensity function during the lighting passes, then multiplying by the normalized color in the final pass gives us the correct result in the case of a single light source.

$D * {target}_{RGB} + S * {normalize}({target}_{RGB}) * {target}_A$
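In plain C++ (standing in for the shader code, with an assumed float3-style struct), the final combine looks like the following; using the length of the light color as the intensity function makes the single-light case reconstruct exactly.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float r, g, b; };

static float Length(const Float3& c) {
    return std::sqrt(c.r * c.r + c.g * c.g + c.b * c.b);
}

static Float3 Normalize(const Float3& c) {
    float len = Length(c);
    return { c.r / len, c.g / len, c.b / len };
}

// Final pass: D * targetRGB + S * normalize(targetRGB) * targetA.
// The specular term is re-tinted with the light color extracted from
// the accumulated diffuse lighting.
Float3 CombinePass(const Float3& D, const Float3& S,
                   const Float3& targetRGB, float targetA)
{
    Float3 tint = Normalize(targetRGB);
    return { D.r * targetRGB.r + S.r * tint.r * targetA,
             D.g * targetRGB.g + S.g * tint.g * targetA,
             D.b * targetRGB.b + S.b * tint.b * targetA };
}
```

For a single red light C0 = (1, 0, 0) with N·L = 0.5 and (N·H)^G = 0.25, the buffers hold targetRGB = (0.5, 0, 0) and targetA = 0.25, and the combine returns the same result as evaluating the original lighting equation directly.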

Here’s what the result looks like when we extract the light color from the diffuse and apply it to the specular:

As mentioned, this only holds for a single light source. When there are multiple light sources our calculations become an approximation. Thankfully, the results look quite good in real world scenarios with this method and there’s very little cost to including it in a light pre-pass or inferred renderer. Comparing with a reference image, it’s easy to see the discrepancies caused by this approximation, but when viewed on its own the artifacts are not very noticeable.

Blending Terrain Textures

One of the most common methods for texturing a height-map terrain is with multiple tiling layers. These texture layers are stacked on top of each other and blended together to create the final textured look. In addition to the tiling texture, each layer also has an opacity map which controls how much of the texture is blended in at any point on the terrain. Unlike the tiling texture, this opacity map is stretched over the entire terrain and is therefore very low detail.

Textured Terrain in Titan Quest

In a simple example, our terrain could have two layers: sand and grass. Anywhere the opacity map for the grass layer was painted 1 (opaque) we would just see grass. Where the opacity map was 0 we would see sand, and where it was 0.5 we’d see a mix of sand and grass. Note that the bottom layer doesn’t have an opacity map since the base layer must appear everywhere. This blending operation for two textures is implemented like this:

float4 blend(float4 texture0, float4 texture1, float layerOpacity)
{
    return lerp(texture0, texture1, layerOpacity);
}

The problem with this basic technique is that alpha blending a grass texture with 50% opacity over a sand texture doesn’t really look very good when you look at it closely:

The blended result on the bottom resembles green sand more than a patch of sandy grass. If the textures were more closely related this would work better, so this can be solved by adding additional artist-crafted transition textures. This however creates more work for the artists, and requires us to render an additional texture layer. Finally, for every combination of textures that we’re going to layer we’d need a different transition texture, and ultimately we might need several.

There is, however, what I consider to be a better and easier solution, which we successfully used on Titan Quest. The idea is that instead of drawing 50% of each texture at every pixel, we draw one texture at 50% of the pixels and the second texture at the other 50%. This is what it looks like when we do this:

To implement this, we store a mask in the alpha channel of each of the tiling textures. This mask uses the entire range of values and determines whether or not a texel will be displayed when the layer has a particular opacity. Specifically, a texel is displayed if the mask is less than the opacity value.  This new, splotchy blending technique is simply implemented like this:

float4 blend(float4 texture0, float4 texture1, float layerOpacity)
{
    if (texture1.a < layerOpacity)
    {
        return texture1;
    }
    else
    {
        return texture0;
    }
}

The result has a more natural look, and the splotching pattern can be tailored for each type of texture individually. While this grass texture uses a noisy pattern, textures with more structure — such as stones — could be set up so that the blending follows the structure. With stones, the alpha mask can be created with high values in the cracks and low values on the flat surfaces. When this is blended on top of a grass texture, the grass will first appear only in the cracks as the layer opacity is increased.

Because this is a binary decision, the result has hard edges and suffers from aliasing. It looks best if we smooth out the edges between the textures by doing a small amount of regular alpha blending between them. Here’s one way to implement it, although other methods can certainly be used (for Titan Quest we used a simpler method to fit within the instruction limits of ps1.0):

float4 blend(float4 texture0, float4 texture1, float layerOpacity)
{
    const float blendRange = 0.1;

    if (texture1.a < layerOpacity - blendRange)
    {
        return texture1;
    }
    else if (texture1.a > layerOpacity + blendRange)
    {
        return texture0;
    }
    else
    {
        float f = (texture1.a - layerOpacity + blendRange) / (2 * blendRange);
        return lerp(texture1, texture0, f);
    }
}

Here’s what the final result looks like:

Although these examples show two textures, any number of layers can be blended this way, just like with the basic technique. This type of blending also isn’t limited to terrain textures, and interesting effects can be produced when the layer opacity is animated. For example, you can see the same type of blending at work in this prototype of dynamic infestation for Natural Selection 2: