Workstation Graphics: AGP Cross Section 2004by Derek Wilson on December 23, 2004 4:14 PM EST
- Posted in
3Dlabs Wildcat Realizm 200 TechnologyFixed function processing may still be the staple of the workstation graphics world, but 3Dlabs isn't going to be left behind in terms of technology. The new Wildcat Realizm GPU supports everything that one would expect from a current generation graphics processor, including VS 2.0, and PS 3.0 level programmability. The hardware itself weighs in at about 150M transistors and is fabbed on a 130nm process. We're taking a look at the top end AGP SKU from 3Dlabs, but the highest end offering essentially places two of these GPUs on one board, connected by a vertex/scalability unit, but we'll take a look at that when we explore PCI Express workstations. The pipeline of the Wildcat Realizm 200 looks very much like what we expect a 3D pipeline to look.
When we start to peel away the layers, we see a very straightforward and powerful architecture. We'll start by explaining everything, except for the Vertex Array and the Fragment Array (which will get a little more detailed investigation shortly).
The Command Processor is responsible for keeping track of command streams coming into the GPU. Of interest is the fact that it supports the NISTAT and NICMD registers for isochronous AGP operation. This allows the card to support requests by applications for a constant stream of data at guaranteed intervals. On cards without isochronous AGP support, cards must be capable of handling arbitrarily long delays in request fulfillment based on the capabilities of the chipset. This feature is particularly interesting for real-time broadcast video situations in which dropped frames are not an option. Of course, it only works with application support, and we don't have a good test as to the impact of isochronous AGP operation either.
Visibility is computed via a Hierarchical Z algorithm. Essentially, large tiles of data are examined at a time. If the entire tile is occluded, it can all be thrown out. 3Dlabs states that their visibility algorithm is able to discard up to 1024 multi-samples per clock. This would fit the math for a 16x16 pixel tile with 4x multi-samples per pixel. This is actually 256 pixels, which is the same size block as ATI's Hierarchical Z engine. And keep in mind, these are maximal numbers; if the tile is partially occluded, only part of the data can be thrown out.
What 3Dlabs calls the Pixel Array, it is circuitry that takes care of AA, compositing, color conversion, filtering, and everything else that might need to happen to the final image before scan out. This is similar in function, for example, to the ROP in an NVIDIA part. The Wildcat Realizm GPU is capable of outputting multiple formats from the traditional to fp16 data. This means that it can handle things like 10-bit alpha blending, or 10-bit LUT for monitors that support it. On the basic hardware level, 3Dlabs defines this block as a 16x 16-bit floating point SIMD array. This means that 4 fp16 RGBA pixels can be processed by the Pixel Array simultaneously. In fact, all the programmable processing parts of the Wildcat Realizm are referred to terms of SIMD arrays. This left us with a little math to compute based on the following information and this layout shot:
Vertex Array: 16-way, high-accuracy 36-bit float SIMD array, DX9 VS 2.0 capable
Fragment Array: 48-way, 32-bit float component SIMD array, DX9 PS 3.0 capable
For the Vertex Array, we have a 16x 36-bit floating point SIMD array. Since there are two physical Vertex Shader Arrays in the layout of the chip as shown above, and we are talking about SIMD (single instruction multiple data) arrays, it stands to reason that each array can handle 8x 36-bit components at maximum. It's not likely that this is one large 8-wide SIMD block. If 3Dlabs followed ATI and NVIDIA, they could organize this as one 6-wide unit and one 2-wide unit, executing two vec3 operations on one SIMD block and two scalar operations on the other. This would give them two "vertex pipelines" per array. Without knowing the granularity of the SIMD units, or how the driver manages allocating resources among the units, it's hard to say exactly how much work can get done per clock cycle. Also, as there are two physically separate vertex arrays, each half of the vertex processing block is likely to share resources like registers and caches. It is important to note that the vertex engine here is 36-bits wide. The extra 4 bits, which are above and beyond what anyone else offers, actually delivers 32-bits of accuracy in the final vertex calculation. Performing operations at the same accuracy level as the data stored essentially builds in a level of noise to the result. This is because intermediate results of calculations are truncated to the accuracy of the stored data. This is a desirable feature to maintain high precision vertex accuracy, but we haven't been able to come up with a real world application that pushes other parts to a place where 32-bit vertex hardware breaks down and the 36-bit hardware is able to show a tangible advantage.
The big step for vertex hardware accuracy will need to be 64-bit. For CAD/CAM applications, a db of 64-bit values is kept. These double values are very important for manufacturing, but currently, graphics hardware isn't robust enough to display anything but single precision floating point data. Already high transistor counts would get unmanageable with current fab technology.
Other notable features of the vertex pipeline of the Wildcat Realizm products include support for 32 hardware fixed function hardware lights, and VS 2.0 level functionality with a maximum of 1000 instructions. Not supporting VS 3.0 while implementing full PS 3.0 support is an interesting choice for 3Dlabs. Right now, fixed function support is more important to the CAD/CAM market and arguably more important to the workstation market overall. But, geometry instancing could really help geometry limited applications when working with scenes full of multiple objects. Vertex textures might also be useful in the DCC market. Application support does push hardware vendors to include and exclude certain features, but we really would have liked to see full SM 3.0 support in the Wildcat Realizm product line.
The Fragment Array consists of three physical blocks that make up a 48-way 32-bit floating point SIMD array. This means that we 16x 32-bit components being processed in each of the three physical blocks at any given time. What we can deduce from this is that each of the three physical blocks share common register and cache resources and very likely operate on four pixels with strong locality of reference at a time. It's possible that 3Dlabs could organize the distribution of pixels over their pipeline in a similar manner to either ATI or NVIDIA, but we don't have enough information to determine what's going on at a higher level. We also can't say how granular 3DLab's SIMD arrays are, which means that we don't know just how much work they can get done per clock cycle. In the worst case, the Wildcat Realizm is equipped with 4x 4-wide SIMD units per physical fragment block. This would mean that operating on one component at a time would make 3 components idle while waiting for the fourth. It's much more likely that they implemented a combination of smaller units and are able to divide the work among them, as both ATI and NVIDIA have done. We know that 3Dlabs units are all vector units, which means that we are limited to combinations of 2-, 3-, and 4-wide vector blocks.
Unfortunately, without more information, we can't draw conclusions on DX9 co-issue or dual-issue per pipeline capabilities of the part. No matter what resources are under the hood, it's up to the compiler and driver to handle the allocation of everything that the GPU offers to the GLSL or HLSL code running on it. On a side note, it is a positive sign to see a third company confirm the major design decisions that we have seen both ATI and NVIDIA make in their hardware. With the 3Dlabs Wildcat Realizm 200 coming to market as a 4 vertex/12 pixel pipe architecture, it will certainly be exciting to see how things stack up in the final analysis.
The Fragment Array supports PS 3.0 and 256000 instruction length shader programs. The Fragment Array also supports 32 textures in one clock. This, again, seems to be a total limitation per clock. The compiler would likely distribute resources as needed. If we look at this in the same way that we look at the ATI or NVIDIA hardware, we'd see that we can access 2.6 textures per pixel. This could also translate to 2/3 of the components loading their own texture if needed, and if the driver supported it. 3Dlabs also talks about the Wildcat Realizm's ability to handle "cascaded dependant textures" in shaders. Gracefully handling a large number of dependent textures is useful, but it will bring shader hardware to a crawl. It's unclear how many depth/stencil operations that the Realizm is capable of in one pass, but there is a separate functional block for such processing shown on the image above.
One very important thing to note is that the Wildcat Realizm calculates all pixel data in fp32, but stores pixel data in an fp16 (5s10) format. This has the effect of increasing memory bandwidth from what it would be with fp32, while decreasing accuracy. The precentage by which percision is decreased depends on the data being processed and algorithms used. The fp32->fp16 and fp16->fp32 conversions are all done with zero performance impact between the GPU and memory. It's very difficult to test the accuracy of the fragment engine. We've heard that it comes out somewhere near 30-bit accuracy from 3Dlabs, and it very well could. We would still like to see an empirical test that could determine the absolute accuracy for a handful of common shaders before we sign off on anything. This is at least an interesting solution to the problem of balancing a 32-bit and 16-bit solution. Moving all that pixel data can really impact bandwidth in shader intensive applications, and we saw just how hard the impact can be with NVIDIA's original 32-bit NV30 architecture. We've heard that it is possible to turn off the 16-bit storage feature and have the Realizm GPU store full 32-bit precision data, but we have yet to see this option in the driver or as a tweak. Obviously, we would like to get our hands on a switch like this to evaluate both the impact on performance and image quality.
Another important feature to mention about the Wildcat Realizm is its virtual memory support. The Realizm supports 16GB of virtual memory. This allows big memory applications to swap pages out of local graphics memory to system RAM if needed. On the desktop side, this isn't something that we've seen a real demand or need for, but workstation parts definitely benefit from it. Speed is absolutely useful, but more important than speed is actually being able to visualize a necessary data set. There are data sets and workloads that just can't fit in local framebuffers. The Wildcat Realizm 200 has 512 MB of local memory, but if needed, it could have swapped paged out up to the size of our free idle system memory. The 3Dlabs virtualization system doesn't support paging to disk and isn't managed by windows, but the 3Dlabs driver. Despite the limitations of the implementation, the solution is elegant, necessary, and extraordinarily useful to those who need massive amounts of RAM available to their visualization system.
On the display side, the Wildcat Realizm 200 is designed to push the limits of 9MP panels, video walls, and multi-system displays. With dual 10-bit 400MHz RAMDACs, two dual-link DVI-I connections, and support for an optional Multiview kit with genlock and framelock capabilities, the Wildcat Realizm is built to drive vast numbers of pixels. The scope of this article is limited to single display 3D applications, but if there is demand, we may explore the capabilities of professional cards to drive extremely high resolutions and 2 or more monitors.
Even though it can be tough to sort through at times, this low level description of hardware is nicer than what we get from ATI and NVIDIA in some ways because we get a chance to see what the hardware is actually doing. The block diagram high level look that others provide us can be very useful in understanding what a pipeline does, but it obfuscates the differences in respective implementations. We would love to have a combination of the low level physical description of hardware that 3Dlabs has given us and high level descriptions that we get from ATI and NVIDIA. Of course, then we could go build our own GPUs and skip the middle man.