Stage 2: Vertex Processing

At the front of the 3D pipeline we have what is commonly referred to as one or more vertex engines. These "engines" are essentially collections of pipelined execution units, such as adders and multipliers. The execution units are replicated, with multiple adders, multipliers and so on, in order to exploit the fact that most of the data they work on is highly parallel in nature: every vertex can be transformed independently of every other vertex.
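
To see why this works, consider the bread-and-butter operation of any vertex engine: multiplying a vertex position by a 4x4 transformation matrix. The C sketch below is our own illustration with made-up type and function names, not a description of either vendor's hardware; it simply shows that each output component is an independent dot product, which is exactly what keeps multiple adders and multipliers busy at once.

    #include <stdio.h>

    typedef struct { float x, y, z, w; } Vec4;
    typedef struct { float m[4][4]; } Mat4;   /* row-major 4x4 transform */

    /* Transform one vertex: the output is four independent 4-element
     * dot products, so four multiply/add chains can run side by side.
     * This is the kind of work that rewards replicated execution units. */
    Vec4 transform(const Mat4 *m, Vec4 v)
    {
        Vec4 r;
        r.x = m->m[0][0]*v.x + m->m[0][1]*v.y + m->m[0][2]*v.z + m->m[0][3]*v.w;
        r.y = m->m[1][0]*v.x + m->m[1][1]*v.y + m->m[1][2]*v.z + m->m[1][3]*v.w;
        r.z = m->m[2][0]*v.x + m->m[2][1]*v.y + m->m[2][2]*v.z + m->m[2][3]*v.w;
        r.w = m->m[3][0]*v.x + m->m[3][1]*v.y + m->m[3][2]*v.z + m->m[3][3]*v.w;
        return r;
    }

    int main(void)
    {
        /* Scale a vertex by 2 in x, y and z. */
        Mat4 scale2 = {{ {2,0,0,0}, {0,2,0,0}, {0,0,2,0}, {0,0,0,1} }};
        Vec4 v = { 1.0f, 2.0f, 3.0f, 1.0f };
        Vec4 r = transform(&scale2, v);
        printf("(%g, %g, %g, %g)\n", r.x, r.y, r.z, r.w);  /* (2, 4, 6, 1) */
        return 0;
    }

Since no output component depends on another, the hardware is free to evaluate all four at once, and different vertices can occupy different pipeline stages at the same time.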

The functional units that make up these vertex engines are all 32-bit floating point units, regardless of whether we're talking about an ATI or an NVIDIA architecture. In terms of efficiency, ATI claims that its hardware rarely processes fewer than 4 vertex operations per clock cycle, while NVIDIA quotes a range of 3 to 4 operations per clock for the NV35.
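
To put those per-clock figures in perspective, peak vertex throughput is simply operations per clock multiplied by core clock speed. As a purely illustrative calculation (the 400MHz clock below is a round number of our choosing, not either vendor's actual spec):

    peak vertex ops/sec = ops per clock x core clock
                        = 4 x 400,000,000
                        = 1.6 billion vertex operations per second

A part sustaining only 3 ops per clock at that same hypothetical clock would manage 1.2 billion, which is why a seemingly small per-clock difference is worth arguing over.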

It's difficult to explain why the discrepancy exists without examining both architectures at a very low level, which, as we mentioned at the beginning of this article, is all but impossible given how closely both manufacturers guard their secrets.

An interesting difference between the graphics pipeline and the CPU pipeline is how common branches are in the code each one runs. As you will remember from our articles detailing the CPU world, branches occur quite frequently in CPU code (e.g. 20% of all integer instructions in x86 code are branches). A branch is any point in a program where a decision must be made, the outcome of which determines which instruction to execute next. For example, a generic branch looks like this in C-style pseudocode:

If "Situation A" then begin executing the following code

Else, if "Situation B" then execute this code

As you can guess, branches are far less common in the graphics world. Where they do appear, it is mostly in vertex processing, and especially in complex lighting algorithms. Wherever branches exist, you want to predict their outcome before they are evaluated in order to avoid costly pipeline stalls. Luckily, thanks to the high bandwidth memory subsystems that GPUs are paired with, as well as the limited role of branches in graphics code to begin with, the branch predictors in these GPUs don't have to be especially accurate. Whereas in the CPU world you need to be able to predict branches with ~95% accuracy, the requirements are nowhere near as stringent in the GPU world. NVIDIA insists that its branch predictor is significantly more accurate and efficient than ATI's; however, it is difficult to back up that claim with hard numbers.
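
For readers curious what a branch predictor actually does, the sketch below shows the classic 2-bit saturating counter scheme taught in CPU design courses. To be clear, this is a generic textbook technique of our choosing, not a description of what ATI or NVIDIA actually implement; neither vendor discloses those details.

    /* Classic 2-bit saturating counter branch predictor (textbook scheme,
     * not either vendor's actual design, which is undisclosed).
     * Counter states: 0 = strongly not-taken ... 3 = strongly taken. */
    #include <stdio.h>

    #define TABLE_SIZE 1024
    static unsigned char counters[TABLE_SIZE];  /* start strongly not-taken */

    /* Predict: the upper half of the counter range means "taken". */
    int predict(unsigned pc)
    {
        return counters[pc % TABLE_SIZE] >= 2;
    }

    /* Update after the branch resolves: nudge the counter toward the
     * actual outcome, saturating at 0 and 3 so a single surprise
     * doesn't flip a strongly-biased prediction. */
    void update(unsigned pc, int taken)
    {
        unsigned char *c = &counters[pc % TABLE_SIZE];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    int main(void)
    {
        /* A branch taken three times in a row trains the predictor. */
        unsigned pc = 0x400;
        for (int i = 0; i < 3; i++) update(pc, 1);
        printf("prediction: %s\n", predict(pc) ? "taken" : "not taken");
        return 0;
    }

The appeal of this scheme is its cheapness: one small table of 2-bit counters buys reasonable accuracy on biased branches, which, given how few branches graphics code contains, may well be all a GPU needs.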
