This is an article I wrote for Beyond3D.com back in 2003 about what later became known as DirectX 10. It was significant for me because it immediately became the single most viewed article on Beyond3D.com at the time, and I was still in high school when I wrote it.
While the next major revision for DirectX is not expected until Longhorn’s launch, Microsoft’s DirectX group has been briefing developers on what’s in store for DirectX Next with presentations at Microsoft Meltdown and other developer conferences. Recently, this presentation was made available to the public via Microsoft’s Developer Network. The intent of this article is to give a more thorough treatment of the features listed for inclusion with DirectX Next and hence explore the types of capabilities that DirectX Next may be offering.
While the first seven revisions of DirectX brought mostly evolutionary enhancements and additions focused on very specific graphics features, such as environment and bump mapping, DirectX8 broke the trend and introduced a number of new, general-purpose systems. Most notable was the programmable pipeline, where vertex and pixel shading were no longer controlled by simply tweaking a few parameters or toggling specific features on and off. Instead, with DirectX8 you were given a set number of inputs and outputs and were pretty much allowed to go nuts in between and do whatever you wished, as long as it was within the hardware’s resources. At least, this is how vertex shading worked. Pixel shading, on the other hand, was extremely limited, and really not even all that programmable. You could do a handful of vector operations on a handful of inputs (via vertex shader outputs, pixel shader constants, and textures) with one output, the frame buffer, but that’s about it. DirectX8.1 provided some aid in this respect, but it wasn’t until DirectX9 that everything really started to come into place.
DirectX9 introduced versions 2.0, 2.x, and 3.0 of the pixel and vertex shader programming models. 2.0 was intended as the minimum specification that all DirectX9 graphics processors had to be able to handle, 2.x existed for more advanced processors that could do a bit more than the specification called for, and 3.0 was intended for an entirely new generation of products (which should be available sometime in 2004). Pixel Shader 2.0 contained at least six times as many general-purpose registers as the DirectX8 models and squared the number of available instruction slots, bringing pixel shader functionality to almost the same level as vertex shaders were at in DirectX8. At the same time there has been an increasing need for texture lookups in the vertex shader, especially for applications wishing to make use of general-purpose displacement mapping. Consequently, with version 3.0 the pixel and vertex shading models have very similar capabilities, so much so in fact that having two separate hardware units for vertex and pixel shading is becoming largely redundant. As a result, all 4.0 shader models are identical in both syntax and feature set, allowing hardware to combine the different hardware units together into a single pool. This has the added benefit that any increase in shading power increases both vertex and pixel shading performance.
Unifying the shader models is great, as there are several instances where you want to do something at the vertex level that’s mostly a pixel level operation, or something at the pixel level that’s usually done at the vertex level – but what about the instruction limits on these shaders, or resource limits altogether? Wouldn’t it be great if you could create as many textures as you wanted, and write shaders of any length you wanted, without ever running into those nasty hard-locked limits? Sounds like wishful thinking, but it turns out that it’s not really that far off – there’s just one little problem with current graphics processors’ memory systems…
The memory system on most consumer graphics processors is set up in a way that is very computer graphics centric – pretty much all it sees are texture objects, vertices, triangles, and shaders. If you want to render some geometry with a texture map, but the texture is not currently located in video memory, that entire texture with all of its mip-maps must be loaded into video memory before the graphics processor can begin rendering. There are a number of problems with this method, however. For one, it is extremely wasteful – for a given frame you’re almost never going to need the entire texture object with all of its mip-maps – and, in fact, the mip-map level you need the least often is the one that consumes the most memory and bandwidth: the first level. This is especially true for games set in very large, open environments where the vast majority of textures on screen are located so far away that only the lowest resolution mip-map levels are required. All of this excess information wouldn’t be a problem if only one texture needed to be transferred to the graphics card, but in cases where you’re constantly spilling over to AGP/system memory things can start getting aggravating for the user. Bandwidth on modern graphics cards is roughly ten times that of the AGP bus, making it glaringly obvious to the user when their card has run out of available video memory – the frame rate of whatever application they’re using starts fluctuating sporadically.
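To put some rough numbers on the mip-map point above, here is a minimal sketch (assuming, purely for illustration, a 1024x1024 texture with 32-bit texels) showing that the base level alone accounts for about three quarters of a full mip chain’s footprint:

```cpp
#include <cstdio>

int main() {
    const int baseDim = 1024;   // assumed 1024x1024 base mip level
    const int bpp     = 4;      // 32-bit texels

    long long total = 0, level0 = 0;
    for (int dim = baseDim; dim >= 1; dim /= 2) {
        long long bytes = (long long)dim * dim * bpp;
        if (dim == baseDim) level0 = bytes;   // the first (largest) level
        total += bytes;                        // the whole mip chain
    }

    std::printf("level 0: %lld KB of %lld KB total (%.0f%%)\n",
                level0 / 1024, total / 1024, 100.0 * level0 / total);
    return 0;   // prints: level 0: 4096 KB of 5461 KB total (75%)
}
```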
One solution to this is to simply do all texturing directly from AGP memory. This would certainly remove the erratic performance problems of the above memory system, but only because now everything for every frame is going over the same, slow bus. In a sense this fixes the fluctuating problem by making everything equally slow, so clearly a better solution must be possible.
With graphics processors becoming increasingly similar to their general-purpose brethren, perhaps a better solution could be obtained by emulating how CPUs handle memory management. And, indeed, very similar problems were met as general-purpose processors evolved, since not all programs can always fit into CPU caches, and executing programs directly from system memory is far too slow. The solution back then was to change the focus away from physical memory constraints and instead use virtual memory. With virtual memory, the programmer no longer has to worry so much about the exact amount of cache or system memory and instead handles all memory allocations in the virtual address space, which is divided up into small pages (usually around 4KB in size). It’s then up to the implementation to manage each page and make sure the appropriate pages are in cache when they need to be, in system memory when they aren’t, or in a page file on the hard drive if there’s no other room.
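To make the page idea concrete, here is a minimal, purely illustrative sketch of demand-paged address translation with 4KB pages; none of the names correspond to any real driver or hardware interface:

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageSize  = 4096;   // 4KB pages, as in the CPU world
constexpr uint64_t kPageShift = 12;     // log2(kPageSize)

struct PageEntry {
    bool     resident = false;          // is this page currently in video memory?
    uint64_t physical = 0;              // where it lives if it is
};

class PageTable {
public:
    // Translate a virtual address, faulting the page in on a miss.
    uint64_t Translate(uint64_t virtualAddr) {
        const uint64_t pageIndex = virtualAddr >> kPageShift;
        const uint64_t offset    = virtualAddr & (kPageSize - 1);
        PageEntry& entry = table_[pageIndex];
        if (!entry.resident) {
            // Page fault: in a real implementation this would copy the 4KB
            // page in from AGP/system memory (or the page file on disk).
            entry.physical = AllocateVideoMemoryPage(pageIndex);
            entry.resident = true;
        }
        return entry.physical + offset;
    }

private:
    // Placeholder allocator; a real implementation would also evict pages
    // when video memory fills up.
    uint64_t AllocateVideoMemoryPage(uint64_t pageIndex) {
        return pageIndex * kPageSize;
    }
    std::unordered_map<uint64_t, PageEntry> table_;
};
```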
When extended to graphics processors, virtual video memory takes care of the stuttering performance problems nicely, since all textures, shaders, et al. are split up into small chunks that can be seamlessly transferred over the bus. A 4KB page, for example, equates to a 32x32 block of 32-bit texels, which is big enough that you probably won’t have to transfer many pages over the bus whenever a new texture becomes visible (i.e. it is unlikely that much more than a 32x32 texel region of the texture has been exposed for the current frame), but small enough that the transfer can be done with virtually no noticeable performance hit.
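As a quick sanity check of the 4KB figure and of how little actually has to move over the bus, here is a trivially small sketch (the 1024x1024 texture is again just an assumed example):

```cpp
#include <cstdio>

int main() {
    const int tileDim   = 32;   // texels per page side
    const int texelSize = 4;    // bytes per 32-bit texel
    std::printf("bytes per page: %d\n", tileDim * tileDim * texelSize);  // 4096

    // Pages in a single 1024x1024, 32-bit mip level:
    const int texDim = 1024;
    const int pagesPerSide = texDim / tileDim;                           // 32
    std::printf("pages per level: %d\n", pagesPerSide * pagesPerSide);   // 1024 pages (4MB)

    // A newly exposed 32x32-texel region touches at most four of those pages
    // (if it straddles page boundaries in both axes); that is 16KB at most,
    // versus the 4MB-plus of loading the whole texture up front.
    return 0;
}
```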
While no longer having to deal with instruction limits is a big gain in programmability, there is still a lot more needed if the goal is to make graphics processors more general-purpose. A major area that needs improvement is integer processing. Currently, almost everything done inside shaders is totally floating point (outside of static branching and the like), which is fine for most graphics operations, but it becomes a real problem when you start doing dynamic branching or wish to do a form of non-interpolatable memory lookup, such as when indexing a vertex buffer.
On current graphics processors, the only type of memory addressing you can do is a texture lookup, which uses floating-point values. If the address does not exactly align with a texel, either the nearest texel is taken (in the case of point sampling), or several texels are sampled and interpolated to obtain a value somewhere in between the closest texels. For textures this is fine, but it is clearly completely inadequate for general memory addressing, where contiguous blocks of memory may be completely unrelated to one another (and, hence, interpolating between them is completely meaningless). Luckily, Microsoft is including an entire integer instruction set in the 4.0 shader model for just these types of problems.
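The difference is easier to see in code. The sketch below is purely conceptual and not any real shader instruction set: the float-addressed path blends neighbouring texels, while the integer path fetches exactly one element, which is what indexing a vertex buffer (or any other structured block of memory) requires:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Texel { float r, g, b, a; };

Texel Lerp(const Texel& a, const Texel& b, float t) {
    return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
             a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
}

struct Texture2D {
    int width, height;
    std::vector<Texel> texels;

    // Floating-point addressing: neighbouring texels are blended together,
    // which is exactly what you want for image data.
    Texel SampleBilinear(float u, float v) const {
        float x = u * width - 0.5f, y = v * height - 0.5f;
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        float fx = x - x0, fy = y - y0;
        Texel t00 = Fetch(x0, y0),     t10 = Fetch(x0 + 1, y0);
        Texel t01 = Fetch(x0, y0 + 1), t11 = Fetch(x0 + 1, y0 + 1);
        return Lerp(Lerp(t00, t10, fx), Lerp(t01, t11, fx), fy);
    }

    // Integer addressing: element (x, y), exactly, no blending. Interpolating
    // between unrelated elements of a buffer would be meaningless.
    Texel Fetch(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return texels[y * width + x];
    }
};
```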
Virtual Video Memory also aids the resource problems in a number of ways. Firstly, AGP/system memory becomes a much more viable storage space than it was in the past due to overall better bandwidth and memory utilization. Secondly, since virtual memory works on a logical address space, the user of the graphics processor is free to consume as much memory as they want, as long as it is within the bounds of the virtual address space. It’s up to the implementation to efficiently map virtual memory to the physical constraints of the user’s computer. Applications consuming insane amounts of memory might not achieve the most optimal frame rates, but they will at least work. Virtual address spaces need not be small, either – 3Dlabs’ Wildcat VP uses virtual video memory and has a virtual address space of 16GB. Clearly, resources cease to be all that limited when you can consume 16GB of them.
When virtual memory is applied to shaders, things get a bit interesting. One problem with trying to use the old memory system with shaders of unbounded length is that the traditional memory system doesn’t necessarily see shaders as a collection of instructions, but rather as an abstract block of data that may or may not fit into the graphics processor’s instruction slots. Once the shader is loaded in, it’s executed on every vertex or pixel that follows until the shader is unloaded. Using this line of thinking, it’d seem the only way to achieve longer shaders would be to either continuously increase the number of instruction slots on graphics processors, or use some form of automatic multi-passing that would break shaders down into manageable chunks. There are obvious limitations to how many instruction slots can fit onto a graphics processor, so the first option is out. The second option does work, and can do so quite well with a proper implementation, but it turns out that the proper implementation is actually the same as what would be done in a virtual memory system.
A better solution presents itself with virtual video memory. Instead of treating shaders as unique entities of memory as has been done in the past, treat them just like everything else and split them up into pages too. Assuming the graphics processor has enough instruction slots to execute an entire page of shader instructions, you then just execute shaders as a series of pages – load in shader page 1, execute it, pause execution while the next page is loaded in, resume execution, and so on until the entire program has been executed. Instruction slots then become simply an L1 instruction cache, and everything starts working as it does on general-purpose processors.
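Here is a rough sketch of that execution model, with the instruction slots acting as an L1 instruction cache. The structures and the 256-slot page size are assumptions for illustration, not a description of any real hardware:

```cpp
#include <cstdint>
#include <vector>

struct Instruction { uint32_t opcode, dst, src0, src1; };
struct ShaderState { float registers[32][4]; };        // assumed register file

constexpr size_t kSlotsPerPage = 256;                  // assumed instruction slot count
using ShaderPage = std::vector<Instruction>;           // at most kSlotsPerPage entries

// Stub: decode and execute a single instruction against the register file.
void Execute(const Instruction& inst, ShaderState& state) { /* ... */ }

// In reality this is where execution pauses while the page is brought in
// over the bus; here it simply hands the page back.
const ShaderPage& LoadPage(const std::vector<ShaderPage>& shader, size_t i) {
    return shader[i];
}

// Run an arbitrarily long shader for one vertex or pixel, one page at a time.
void RunShader(const std::vector<ShaderPage>& shader, ShaderState& state) {
    for (size_t page = 0; page < shader.size(); ++page) {
        const ShaderPage& cache = LoadPage(shader, page);  // stall: load the page
        for (const Instruction& inst : cache)
            Execute(inst, state);                          // resume execution
    }
}
```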
Virtual Video Memory is what allows DirectX Next to claim “unlimited resources”, since there’s no longer any bound on the length of shaders, or on the amount of texture memory the user can consume, as long as everything fits into the graphics processor’s virtual address space. There is, however, always going to be a practical limit to these resources. For shaders, whenever a shader is longer than what can fit into the graphics processor’s L1 cache, performance is going to take a relatively sharp dive, as there’s going to have to be quite a bit of loading and unloading of pages, along with the waiting in between those stages, and all of this has to happen during the shading of each pixel. Likewise, there will be a relatively fast degradation of performance when textures start spilling over to system memory or, worse, hard disk space. To this end, Microsoft is including two new caps values for resource limits – one for an absolute maximum past which things will cease to work properly, and one for a practical limit past which applications will cease to operate in real-time (defined as 10fps at 640x480).
There are a number of interesting consequences to a unified shading model, some of which may not be immediately apparent. The most obvious addition is, of course, the ability to do texturing inside the vertex shader, and this is especially important for general-purpose displacement mapping, yet it need not be limited to that. A slightly less obvious addition is the ability to write directly to a vertex buffer from the vertex shader, allowing the caching of results for later passes. This is especially important when using higher-order surfaces and displacement mapping, allowing you to tessellate and displace the model once, store the results in a virtual memory vertex buffer, and simply do a lookup in all later passes.
But perhaps the most significant addition comes when you combine these two together with the virtual video memory mindset – with virtual video memory, writing to and reading from a texture becomes pretty much identical to writing to or reading from any other block of memory (ignoring filtering, that is). With this bit of insight, the General I/O Model of DirectX Next was born – you can now write any data you need to memory, to be read back at any other stage of the pipeline, or even in a later pass. This data need not be a vertex, pixel, or any other graphics-centric data – given access to both the current index and vertex buffer you could conceivably use this system to generate connectivity information for determining silhouette edges. In fact, you should be able to generate all shadow volumes, for all lights, completely on the GPU in a single pass, and then store each in memory to be rendered in later passes (along with the lights associated with the volumes), but there’s just one little problem with that – outside of tessellation, today’s graphics processors cannot create new triangles.
Actually, today’s graphics processors can create new triangles and, in fact, most do so in cases where line or point sprite primitives are used. Most consumer graphics processors are only capable of rasterizing triangles, which means all lines and point sprites must, at some point, be converted to triangles. Since both a line and a point sprite will end up turning into two triangles, which can mean anywhere from two to six times as many vertices (depending on the indexing method), it’s best if this is done as late as possible. This is relevant because, in essence, these are the exact same operations required for shadow volumes. All that’s required is to make this section of the pipeline programmable and a whole set of previously blocked scenarios becomes possible without relying on the host processor; Microsoft calls this the “Topology Processor”, and it should allow shadow volume and fur fin extrusions to be done completely on the graphics processor, along with proper line mitering, point sprite expansion and, apparently, single pass render-to-cubemap.
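For reference, this is essentially all that point sprite expansion amounts to; a programmable topology stage would generalise this kind of fixed expansion to things like shadow volume and fur fin extrusion. The sketch is conceptual, not any real API:

```cpp
#include <array>

struct Vec3 { float x, y, z; };

struct Quad {
    std::array<Vec3, 4> vertices;  // four corner vertices
    std::array<int, 6>  indices;   // two triangles
};

// Expand a single point sprite into a screen-aligned quad: one input vertex
// becomes four vertices and two triangles.
Quad ExpandPointSprite(const Vec3& center, float halfSize) {
    Quad q;
    q.vertices[0] = { center.x - halfSize, center.y - halfSize, center.z };
    q.vertices[1] = { center.x + halfSize, center.y - halfSize, center.z };
    q.vertices[2] = { center.x + halfSize, center.y + halfSize, center.z };
    q.vertices[3] = { center.x - halfSize, center.y + halfSize, center.z };
    q.indices = { 0, 1, 2,   0, 2, 3 };
    return q;
}
```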
Logically, the topology processor is separate from the tessellation unit. It is conceivable, however, that a properly designed programmable primitive processor could be used for both sets of operations.
Higher-order surfaces were first introduced to DirectX in version 8, and at first a lot of hardware supported them (nVidia in the form of RT-Patches, ATI in the form of N-Patches), but they were so limited and such a pain to use that very few developers took any interest in them at all. Consequently, all the major hardware vendors dropped support for higher-order surfaces and all was right in the world; until, that is, DirectX 9 came about with adaptive tessellation and displacement mapping. Higher-order surfaces were still a real pain to use, and were still very limited, but displacement mapping was cool enough to overlook those problems, and several developers started taking interest. Unfortunately, hardware vendors had already dropped support for higher-order surfaces, so even those developers that took interest in displacement mapping were forced to abandon it due to a lack of hardware support. To be fair, the initial implementation of displacement mapping was a bit Matrox-centric, so it is really no great surprise that there isn’t much hardware out there that supports it (even Matrox dropped support). With pixel and vertex shader 3.0 hardware on its way, hopefully things will start to pick back up in the higher-order surface and displacement mapping realm, but there’s still the problem of the limitations of all the current DirectX higher-order surface formulations.
It’d be great if hardware would simply support all the common higher-order surface formulations directly, such as Catmull-Rom, Bézier, and B-splines, along with subdivision surfaces, all the conics, and the rational versions of everything. It’d be even better if all of these could be adaptively tessellated. If DirectX supported all of these higher-order surfaces, there wouldn’t be much left standing in the way of their being used – you could import higher-order surface meshes directly from your favorite digital content creation application without all the problems of the current system. Thankfully, this is exactly what Microsoft is doing for DirectX Next. Combine that with displacement mapping and the new topology processor, and there’s no longer any real reason not to use these features (assuming, of course, that the hardware supports them).
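To give a feel for what a tessellator ultimately has to do with one of these formulations, here is the evaluation step for a single cubic Bézier curve using de Casteljau’s algorithm; a bicubic patch applies the same evaluation in two parameter directions, and a tessellator would run it at many parameter values to produce triangles. A minimal sketch, nothing more:

```cpp
#include <array>

struct Vec3 { float x, y, z; };

Vec3 Lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

// Evaluate a cubic Bezier curve defined by four control points at t in [0,1],
// by repeated linear interpolation (de Casteljau's algorithm).
Vec3 EvalCubicBezier(const std::array<Vec3, 4>& p, float t) {
    Vec3 a = Lerp(p[0], p[1], t), b = Lerp(p[1], p[2], t), c = Lerp(p[2], p[3], t);
    Vec3 d = Lerp(a, b, t),       e = Lerp(b, c, t);
    return Lerp(d, e, t);
}
```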
Not all enhancements come in the way of features – there is still quite a bit of overhead in the DirectX interfaces, especially where state changes are concerned. Currently, if you want to render a scene, everything must be split up into batches of geometry that all have the exact same textures, all reside in a contiguous block of the exact same vertex buffer, use exactly the same shaders, use the same transformation matrices, et cetera. Basically, if there’s anything at all different about a block of geometry, it has to be rendered separately. The problem with this is that every draw call incurs quite a bit of overhead, as the call has to go through the DirectX interfaces, through the driver, and eventually to the graphics processor. Some of this can only be solved via more efficient interfaces with the operating system and, apparently, Microsoft plans to do this with Longhorn.
Another way to reduce some of this overhead is to allow mesh instancing on the graphics processor. Mesh instancing is the process of taking a single mesh and creating several different instances of it with different transformations, textures, or even different displacement maps.
Actually, given the General I/O Model, it should be possible to put all visible textures and transformation matrices into an array accessible by the shaders, submit all geometry that shares the same shaders as one big batch, and let the shaders decide exactly which texture and transformation matrix goes with which set of geometry. This would drastically decrease the amount of CPU work required for rendering by moving most of the state management to the graphics processor.
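Here is a conceptual sketch of that idea; the structures and the PerInstanceVertex routine are hypothetical stand-ins for what a vertex shader would do with an instance index, not any real shader or API:

```cpp
#include <vector>

struct Matrix4 { float m[16]; };    // column-major 4x4 matrix
struct Vec3    { float x, y, z; };

struct InstanceState {
    Matrix4 worldTransform;         // per-instance transformation
    int     textureIndex;           // which entry of a texture array to sample
};

// Transform a point by a 4x4 matrix (assuming column-major storage, w = 1).
Vec3 Transform(const Matrix4& t, const Vec3& v) {
    return { t.m[0] * v.x + t.m[4] * v.y + t.m[8]  * v.z + t.m[12],
             t.m[1] * v.x + t.m[5] * v.y + t.m[9]  * v.z + t.m[13],
             t.m[2] * v.x + t.m[6] * v.y + t.m[10] * v.z + t.m[14] };
}

// Conceptually, this runs on the GPU for every vertex of every instance in
// one big batch: the shader, not the CPU, picks the per-instance state.
Vec3 PerInstanceVertex(const Vec3& position,
                       int instanceId,
                       const std::vector<InstanceState>& instances) {
    const InstanceState& s = instances[instanceId];
    return Transform(s.worldTransform, position);
    // Later, the pixel shader would use s.textureIndex to pick its texture.
}
```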
There are often times when you’d like to apply some calculations to an image, such as for digital grading, color correction, and/or tone mapping. With DirectX 9, however, you cannot read the current value of the pixel you’re overwriting in the frame buffer, because for most immediate mode renderers there’s no way of guaranteeing that the value has even been written yet. In practice, most graphics processors and drivers work fine when you read from the same texture that you’re rendering to; but because this is undefined behavior, it’s liable to break at any moment and hence developers are urged not to rely on this functionality. The alternative is to use two separate textures and alternate between the two, consuming twice as much memory. With DirectX Next, developers are finally getting the functionality they’ve been requesting for years now – direct access to the frame buffer (current pixel only) inside the pixel shader – maybe. Just because it’s now in the cards for DirectX Next doesn’t mean that it’s not still a problem for immediate mode renderers, and apparently, vendors of said renderers are requesting that this feature be dropped from the spec. Most likely this will end up being one of many optional features, which, unfortunately, means most developers are probably going to have to completely ignore it and do things the old way.
Had it been mandatory, however, you’d be able to do almost everything the current fixed function blenders do (such as accumulating lighting contributions from multiple lights in the frame buffer) completely in the pixel shader, without the overhead of an additional render target. In fact, it’s quite possible that some hardware vendors would’ve dropped their fixed function blenders and instead opted to emulate the functionality through pixel shading, in much the same way that ATI dropped their fixed function vertex processing and instead emulated the same functionality through vertex shaders. This would’ve meant there’d be more room in the transistor budget to improve those shader units and make everything a bit faster. It should be noted that tile-based deferred renderers have none of the problems that immediate mode renderers do in this department, so you can still bet on support from those hardware vendors.
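With frame buffer reads available, a lighting pass could conceptually look like the sketch below, with the blend expressed directly in the shader. Framebuffer and ShadeOneLight here are just illustrative stand-ins, not real API objects:

```cpp
#include <vector>

struct Color { float r, g, b; };
Color operator+(const Color& a, const Color& b) { return { a.r + b.r, a.g + b.g, a.b + b.b }; }

struct Framebuffer {
    int width, height;
    std::vector<Color> pixels;
    Color& At(int x, int y) { return pixels[y * width + x]; }
};

// Placeholder for a real per-pixel lighting calculation.
Color ShadeOneLight(int x, int y) { return { 0.1f, 0.1f, 0.1f }; }

// One lighting pass per light: read the pixel's current value, add this
// light's contribution, and write the result back, with no second render
// target and no fixed function blender involved.
void LightingPass(Framebuffer& fb) {
    for (int y = 0; y < fb.height; ++y)
        for (int x = 0; x < fb.width; ++x)
            fb.At(x, y) = fb.At(x, y) + ShadeOneLight(x, y);
}
```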
In fact, many of the major DirectX Next changes seem to be a perfect fit for tile-based deferred renderers. The new memory management model, which allows virtually unlimited resources, also allows the implementation of virtually unlimited geometry storage, which has always been a concern for deferred rendering implementations. Access to current frame buffer values from the pixel shader is also very easy to do on a tile-based deferred renderer, since all the data is already in the on-chip tile buffer, so there are no costly external memory accesses or expensive pipeline and back-end cache flushes required. The front-end changes to the vertex/geometry shader side are independent of the tile-based deferred rendering principle and are thus no more of a problem to support on tile-based deferred renderers than they are on immediate mode renderers. It will indeed be very interesting to see whether PowerVR, who have been lying pretty low lately, will take this opportunity to put out a really awesome DirectX Next implementation.
It will also be fairly interesting to see exactly how the proposed features develop – it’s unclear, for example, where exactly in the pipeline the topology processor will reside. It has also yet to be determined whether the new tessellator will be programmable, highly configurable, or purely fixed function.
Since the API is still in an extremely infantile state, much of this probably will not be known for some time, and much of what was discussed in this article will probably undergo several changes or be completely scrapped; but all in all the next few years of real-time computer graphics should prove to be quite interesting.
A big “thank you” goes out to Dave Baumann, Conor Stokes and anyone else who provided input – thanks, guys.