OpenGL may be implemented by any combination of hardware and software.
At the high end, hardware may implement virtually all of OpenGL, while at
the low end OpenGL may be implemented entirely in software. In between
are combination software/hardware implementations. More money buys more
hardware and better performance.
Entry-level workstation hardware and the recent PC 3-D hardware typically
implement point, line, and polygon rasterization in hardware but implement
floating point transformations, lighting, and clipping in software. This
is a good strategy since the bottleneck in 3-D rendering is usually
rasterization and modern CPUs have sufficient floating point performance
to handle the transformation stage.
OpenGL developers must remember that their application may be used on a
wide variety of OpenGL implementations. Therefore one should consider
using all possible optimizations, even those which have little return on
the development system, since other systems may benefit greatly.
From this point of view it may seem wise to develop your application on a
low-end system. There is a pitfall, however: some operations which are
cheap in software may be expensive in hardware. The moral is: test your
application on a variety of systems to be sure the performance is dependable.
One should consider multiprocessing in these situations. By assigning
rendering and computation to different threads they may be executed in
parallel on multiprocessor computers.
For many applications, supporting multiprocessing is just a matter of
partitioning the render and compute operations into separate threads
which share common data structures and coordinate with synchronization
primitives.
SGI's Performer is an example of a high level toolkit designed for this
purpose.
Complexity may refer to the geometric or rendering attributes of a database.
Here are a few examples.
Objects which are entirely outside of the field of view may be culled.
This type of high level cull testing can be done efficiently with bounding
boxes or spheres and can have a major impact on performance. Again, toolkits
such as Inventor and Performer have this feature.
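A minimal sketch of a bounding-sphere cull test follows. The struct plane and struct sphere types and the sphere_is_culled function are hypothetical names for illustration, not part of OpenGL or any toolkit. The planes are assumed to have unit-length normals pointing into the view volume.

```c
#include <stddef.h>

/* A plane a*x + b*y + c*z + d = 0 whose unit-length normal (a,b,c)
 * points toward the inside of the view volume (an assumption of
 * this sketch). */
struct plane { float a, b, c, d; };

/* Bounding sphere enclosing one object. */
struct sphere { float x, y, z, radius; };

/* Return 1 if the sphere lies entirely outside at least one plane,
 * so the object can be skipped; return 0 if it may be visible. */
int sphere_is_culled( const struct sphere *s,
                      const struct plane planes[], size_t num_planes )
{
   size_t i;
   for (i = 0; i < num_planes; i++) {
      /* signed distance from the sphere center to the plane */
      float dist = planes[i].a * s->x + planes[i].b * s->y
                 + planes[i].c * s->z + planes[i].d;
      if (dist < -s->radius) {
         return 1;
      }
   }
   return 0;
}
```

An application would call this once per object before issuing any OpenGL calls for it, so a culled object costs only a few multiplies per plane.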
Basically, one wants data structures which can be traversed quickly
and passed to the graphics library in an efficient manner. For example,
suppose we need to render a triangle strip. The data structure which
stores the list of vertices may be implemented with a linked list or an
array. Clearly the array can be traversed more quickly than a linked list.
The way in which a vertex is stored in the data structure is also significant.
High performance hardware can process vertices specified by a pointer more
quickly than those specified by three separate parameters.
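As a rough illustration, the vertices of a strip can be kept in one contiguous float array so that each vertex is addressable by a pointer. The struct strip type and the strip_* functions below are hypothetical names chosen for this sketch.

```c
#include <stdlib.h>

/* All vertices of one strip in a single block: x0 y0 z0 x1 y1 z1 ... */
struct strip {
   int num_vertices;
   float *coords;       /* 3 * num_vertices floats */
};

/* Allocate storage; returns 0 on success, -1 on failure. */
int strip_init( struct strip *s, int num_vertices )
{
   s->coords = malloc( 3 * (size_t) num_vertices * sizeof(float) );
   if (s->coords == NULL) {
      s->num_vertices = 0;
      return -1;
   }
   s->num_vertices = num_vertices;
   return 0;
}

/* Pointer to the three coordinates of vertex i.  The result has the
 * form expected by pointer-taking calls such as glVertex3fv. */
const float *strip_vertex( const struct strip *s, int i )
{
   return s->coords + 3 * i;
}

/* Release the storage. */
void strip_free( struct strip *s )
{
   free( s->coords );
   s->coords = NULL;
   s->num_vertices = 0;
}
```

Traversal is then a simple index loop with no pointer chasing, unlike a linked list where each node may live anywhere in memory.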
Our first attempt at rendering this information may be:
We can still do better, however. If we redesign the data structures used
to represent the city information we can improve the efficiency of drawing
the city points. For example:
In the following sections the techniques for maximizing performance,
as seen above, are explained.
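After each of the following techniques look for a bracketed list of symbols
which relates the significance of the optimization to your OpenGL
system: H (high-end hardware), L (low-end hardware), S (software
implementations), or all (probably beneficial on every implementation).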
Example:
This is a very bad construct. The following is much better:
Wrong:
Example:
Note that software implementations of OpenGL may actually perform
these operations faster than hardware systems. If you're developing
on a low-end system be aware of this fact. [H,L]
It may be worthwhile to experiment with different visuals to determine
if there's any advantage of one over another.
Synchronization hurts performance. Therefore, if you need to
render with both OpenGL and native window system calls try to
group the rendering calls to minimize synchronization.
For example, if you're drawing a 3-D scene with OpenGL and displaying
text with X, draw all the 3-D elements first, call glXWaitGL to
synchronize, then call all the X drawing functions.
Also, when responding to mouse motion events you should skip
extra motion events in the input queue.
Otherwise, if you try to process every motion event and redraw
your scene there will be a noticeable delay between mouse input
and screen updates.
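The idea of skipping extra motion events can be sketched independently of any window system. The struct event type, the event_type values and the last_motion_in_run function below are hypothetical; a real program would inspect its pending events with whatever calls its window system provides.

```c
#include <stddef.h>

enum event_type { EVENT_MOTION, EVENT_BUTTON, EVENT_EXPOSE };

/* Simplified stand-in for a window system event. */
struct event {
   enum event_type type;
   int x, y;              /* pointer position for motion events */
};

/* Given a motion event at queue[start], skip over any motion events
 * that directly follow it and return the index of the last one in
 * the run.  Only that event needs to trigger a redraw. */
size_t last_motion_in_run( const struct event queue[], size_t count,
                           size_t start )
{
   size_t i = start;
   while (i + 1 < count && queue[i + 1].type == EVENT_MOTION) {
      i++;
   }
   return i;
}
```

The run stops at the first non-motion event so that button presses and expose events are still handled in order.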
It can be a good idea to put a print statement in your redraw
and event loop function so you know exactly what messages are
causing your scene to be redrawn, and when.
Don't do this:
Do this:
Performance evaluation is a large subject and only the basics are covered here.
For more information see "OpenGL on Silicon Graphics Systems".
After bottlenecks have been identified the techniques outlined in
section 3 can be applied.
The process of identifying and reducing bottlenecks should be repeated
until no further improvements can be made or your minimum performance
threshold has been met.
Measure the performance of rendering in single buffer mode
to determine how far you really are from your target frame
rate.
1. Hardware vs. Software
2. Application Organization
At first glance it may seem that the performance of interactive OpenGL
applications is dominated by the performance of OpenGL itself. This may
be true in some circumstances but be aware that the organization of the
application is also significant.
2.1 High Level Organization
Multiprocessing
Some graphical applications have a substantial computational component
other than 3-D rendering. Virtual reality applications must compute
object interactions and collisions. Scientific visualization programs
must compute analysis functions and graphical representations of data.
Image quality vs. performance
In general, one wants high-speed animation and high-quality images in
an OpenGL application.
If you can't have both at once a reasonable compromise may be to render at
low complexity during animation and high complexity for static images.
If texturing is required during animation, use GL_NEAREST sampling and
glHint( GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST ).
Use glPolygonMode( GL_FRONT_AND_BACK, GL_LINE ) to inspect tessellation
granularity and reduce it if possible.
Level of detail management and culling
Objects which are distant from the viewer may be rendered with a reduced
complexity model. This strategy reduces the demands on all stages of the
graphics pipeline. Toolkits such as Inventor and Performer support this
feature automatically.
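For applications that manage level of detail themselves, the selection step can be as simple as a table of switch distances. This is only a sketch: the select_lod name and the switch_distance table are assumptions, and num_levels is assumed to be at least 1.

```c
#include <stddef.h>

/* Choose a model for an object at the given distance from the viewer.
 * Level 0 is the most detailed model.  switch_distance[] holds
 * num_levels - 1 ascending distances; level i is used up to and
 * including switch_distance[i], and the coarsest level is used
 * beyond the last entry. */
size_t select_lod( float distance, const float switch_distance[],
                   size_t num_levels )
{
   size_t level;
   for (level = 0; level + 1 < num_levels; level++) {
      if (distance <= switch_distance[level]) {
         return level;
      }
   }
   return num_levels - 1;
}
```

With switch distances of 10 and 50 and three models, an object 30 units away is drawn with the middle model.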
2.2 Low Level Organization
The objects which are rendered with OpenGL have to be stored in some sort
of data structure. Some data structures are more efficient than others
with respect to how quickly they can be rendered.
An Example
Suppose we're writing an application which involves drawing a road map.
One of the components of the database is a list of cities specified with
a latitude, longitude and name. The data structure describing a city
may be:
struct city {
float latitude, longitude; /* city location */
char *name; /* city's name */
int large_flag; /* 0 = small, 1 = large */
};
A list of cities may be stored as an array of city structs.
void draw_cities( int n, struct city citylist[] )
{
int i;
for (i=0; i < n; i++) {
if (citylist[i].large_flag) {
glPointSize( 4.0 );
}
else {
glPointSize( 2.0 );
}
glBegin( GL_POINTS );
glVertex2f( citylist[i].longitude, citylist[i].latitude );
glEnd();
glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
glCallLists( strlen(citylist[i].name),
GL_BYTE,
citylist[i].name );
}
}
This is a poor implementation for a number of reasons:
glPointSize is called for every loop iteration.
Only one point is drawn between glBegin and glEnd.
Here's a better implementation:
void draw_cities( int n, struct city citylist[] )
{
int i;
/* draw small dots first */
glPointSize( 2.0 );
glBegin( GL_POINTS );
for (i=0; i < n ;i++) {
if (citylist[i].large_flag==0) {
glVertex2f( citylist[i].longitude, citylist[i].latitude );
}
}
glEnd();
/* draw large dots second */
glPointSize( 4.0 );
glBegin( GL_POINTS );
for (i=0; i < n ;i++) {
if (citylist[i].large_flag==1) {
glVertex2f( citylist[i].longitude, citylist[i].latitude );
}
}
glEnd();
/* draw city labels third */
for (i=0; i < n ;i++) {
glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
glCallLists( strlen(citylist[i].name),
GL_BYTE,
citylist[i].name );
}
}
In this implementation we're only calling glPointSize twice
and we're maximizing the number of vertices specified between
glBegin and glEnd.
struct city_list {
int num_cities; /* how many cities in the list */
float *position; /* pointer to lat/lon coordinates */
char **name; /* pointer to city names */
float size; /* size of city points */
};
Now cities of different sizes are stored in separate lists.
Positions are stored sequentially in a dynamically allocated array.
By reorganizing the data structures we've eliminated the need for a
conditional inside the glBegin/glEnd loops.
Also, we can render a list of cities using the GL_EXT_vertex_array
extension if available, or at least use a more efficient version of
glVertex and glRasterPos.
/* indicates if server can do GL_EXT_vertex_array: */
GLboolean varray_available;
void draw_cities( struct city_list *list )
{
int i;
GLboolean use_begin_end;
/* draw the points */
glPointSize( list->size );
#ifdef GL_EXT_vertex_array
if (varray_available) {
glVertexPointerEXT( 2, GL_FLOAT, 0, list->num_cities, list->position );
glDrawArraysEXT( GL_POINTS, 0, list->num_cities );
use_begin_end = GL_FALSE;
}
else
#endif
{
/* no vertex arrays: fall back to glBegin/glEnd below */
use_begin_end = GL_TRUE;
}
if (use_begin_end) {
glBegin(GL_POINTS);
for (i=0; i < list->num_cities; i++) {
glVertex2fv( &list->position[i*2] );
}
glEnd();
}
/* draw city labels */
for (i=0; i < list->num_cities ;i++) {
glRasterPos2fv( &list->position[i*2] );
glCallLists( strlen(list->name[i]),
GL_BYTE, list->name[i] );
}
}
As this example shows, it's better to know something about efficient rendering
techniques before designing the data structures. In many cases one has to
find a compromise between data structures optimized for rendering and those
optimized for clarity and convenience.
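One way the reorganized lists might be built from the original array of city structs is sketched below. The build_city_list function is hypothetical. It assumes positions are stored as longitude/latitude pairs, matching the x,y order used in the drawing code, and that names are shared with the source array rather than copied.

```c
#include <stdlib.h>

struct city {
   float latitude, longitude;   /* city location */
   char *name;                  /* city's name */
   int large_flag;              /* 0 = small, 1 = large */
};

struct city_list {
   int num_cities;              /* how many cities in the list */
   float *position;             /* coordinate pairs, 2 floats per city */
   char **name;                 /* pointer to city names */
   float size;                  /* size of city points */
};

/* Fill 'out' with the cities whose large_flag equals 'large'.
 * Returns 0 on success, -1 if memory could not be allocated. */
int build_city_list( int n, const struct city citylist[], int large,
                     float size, struct city_list *out )
{
   int i, count = 0, k = 0;

   for (i = 0; i < n; i++) {
      if (citylist[i].large_flag == large) {
         count++;
      }
   }
   out->num_cities = 0;
   out->size = size;
   out->position = malloc( 2 * (size_t) count * sizeof(float) );
   out->name = malloc( (size_t) count * sizeof(char *) );
   if (count > 0 && (out->position == NULL || out->name == NULL)) {
      free( out->position );
      free( out->name );
      out->position = NULL;
      out->name = NULL;
      return -1;
   }
   for (i = 0; i < n; i++) {
      if (citylist[i].large_flag == large) {
         out->position[2 * k] = citylist[i].longitude;     /* x */
         out->position[2 * k + 1] = citylist[i].latitude;  /* y */
         out->name[k] = citylist[i].name;
         k++;
      }
   }
   out->num_cities = count;
   return 0;
}
```

The sorting cost is paid once when the database is loaded, while the drawing function benefits on every frame.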
3. OpenGL Optimization
There are many ways to improve OpenGL performance. The impact
of any single optimization can vary a great deal depending on the OpenGL
implementation.
Interestingly, items which have a large impact on software
renderers may have no effect on hardware renderers, and vice versa!
For example, smooth shading can be expensive in software but free in hardware,
while glGet* can be cheap in software but expensive in hardware.
3.1 Traversal
Traversal is the sending of data to the graphics system. Specifically, we
want to minimize the time taken to specify primitives to OpenGL.
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP,
GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP
require fewer vertices to describe an
object than individual line, triangle, or polygon primitives.
This reduces data transfer and transformation workload. [all]
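The saving is easy to quantify for triangles. The two helper functions below are illustrative only; they count how many vertices must be sent for n triangles drawn independently and as a single strip or fan.

```c
/* Vertices sent when n triangles are drawn as independent
 * GL_TRIANGLES: three per triangle. */
unsigned long independent_triangle_vertices( unsigned long n )
{
   return 3 * n;
}

/* Vertices sent when the same n triangles form one
 * GL_TRIANGLE_STRIP or GL_TRIANGLE_FAN: the first triangle needs
 * three vertices and each further triangle adds one. */
unsigned long strip_triangle_vertices( unsigned long n )
{
   return n > 0 ? n + 2 : 0;
}
```

For 100 triangles that is 300 vertices versus 102, so a long strip approaches one third of the traversal and transformation work.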
Replacing glVertex/glColor/glNormal calls with
the vertex array mechanism may be very beneficial. [all]
Use the pointer versions of glVertex, glColor, glNormal and glTexCoord.
The glVertex, glColor, etc. functions which take a pointer
to their arguments such as glVertex3fv(v) may be much
faster than those which take individual arguments such as
glVertex3f(x,y,z) on systems with DMA-driven graphics
hardware. [H,L]
If lighting is disabled don't call glNormal.
If texturing is disabled don't call glTexCoord, etc.
Avoid extraneous code between glBegin/glEnd.
/* slow: the lighting test is repeated for every vertex */
glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n; i++) {
if (lighting) {
glNormal3fv( norm[i] );
}
glVertex3fv( vert[i] );
}
glEnd();
/* better: the lighting test is made once, outside the loops */
if (lighting) {
glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n ;i++) {
glNormal3fv( norm[i] );
glVertex3fv( vert[i] );
}
glEnd();
}
else {
glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n ;i++) {
glVertex3fv( vert[i] );
}
glEnd();
}
Also consider manually unrolling important rendering loops to
maximize the function call rate.
3.2 Transformation
Transformation includes the transformation of vertices from
glVertex to window coordinates, clipping and lighting.
Avoid frequent changes to the GL_SHININESS
material parameter. [L,S]
glEnable/Disable(GL_NORMALIZE) controls whether
normal vectors are scaled to unit length before lighting. If you
do not use glScale you may be able to disable
normalization without ill effects. Normalization is disabled
by default. [L,S]
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP,
GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP
decrease traversal and transformation load.
glRect usage
If you have to draw many rectangles, consider using
glBegin(GL_QUADS) ... glEnd() instead. [all]
3.3 Rasterization
Rasterization is the process of generating the pixels which represent
points, lines, polygons, bitmaps and the writing of those pixels to the
frame buffer. Rasterization is often the bottleneck in software
implementations of OpenGL.
3.4 Texturing
Texture mapping is usually an expensive operation in both hardware and
software.
Only high-end graphics hardware can offer free to low-cost texturing.
In any case there are several ways to maximize texture mapping performance.
The GL_UNSIGNED_BYTE component format is typically the
fastest for specifying texture images.
Experiment with the internal texture formats offered by the
GL_EXT_texture extension. Some formats are faster than others
on some systems (16-bit texels on the Reality Engine, for
example). [all]
If the minification and magnification filters are both GL_NEAREST
or both GL_LINEAR then there's no reason OpenGL has to compute the
lambda value which determines whether to use minification
or magnification sampling for each fragment.
Avoiding the lambda calculation can be a good performance improvement.
The GL_DECAL or GL_REPLACE_EXT
functions for 3 component textures are a simple assignment of texel
samples to fragments, while GL_MODULATE multiplies each texel sample
by the incoming fragment color. [S,L]
Don't call glTexImage2D to repeatedly change the texture.
Use glTexSubImage2D or glCopyTexSubImage2D.
These functions are standard in OpenGL 1.1 and available as extensions
to 1.0.
3.5 Clearing
Clearing the color, depth, stencil and accumulation buffers can be
time consuming, especially when it has to be done in software.
There are a few tricks which can help.
Use glClear carefully [all]
Clear all relevant buffers with one glClear.
Wrong:
glClear( GL_COLOR_BUFFER_BIT );
if (stenciling) {
glClear( GL_STENCIL_BUFFER_BIT );
}
Right:
if (stenciling) {
glClear( GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );
}
else {
glClear( GL_COLOR_BUFFER_BIT );
}
If you don't need to clear the whole buffer, use glScissor()
to restrict clearing to a smaller area. [L]
int EvenFlag;
/* Call this once during initialization and whenever the window
* is resized.
*/
void init_depth_buffer( void )
{
glClearDepth( 1.0 );
glClear( GL_DEPTH_BUFFER_BIT );
glDepthRange( 0.0, 0.5 );
glDepthFunc( GL_LESS );
EvenFlag = 1;
}
/* Your drawing function */
void display_func( void )
{
if (EvenFlag) {
glDepthFunc( GL_LESS );
glDepthRange( 0.0, 0.5 );
}
else {
glDepthFunc( GL_GREATER );
glDepthRange( 1.0, 0.5 );
}
EvenFlag = !EvenFlag;
/* draw your scene */
}
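The arithmetic behind the code above can be checked without OpenGL. glDepthRange( n, f ) maps a normalized depth z in [0,1] to the window depth n + z*(f - n), so even frames write values in [0.0, 0.5] and odd frames write values in [0.5, 1.0] with near and far reversed. The sketch below uses hypothetical helper names and assumes every pixel is drawn in every frame, since a pixel left untouched keeps a stale depth value.

```c
/* Window depth produced by glDepthRange( range_near, range_far )
 * for a normalized depth z in [0,1]. */
double window_depth( double z, double range_near, double range_far )
{
   return range_near + z * (range_far - range_near);
}

/* Return 1 if a fragment of the current frame passes the depth test
 * against a value left behind by the previous frame.
 * even_flag selects the state used in display_func:
 *   even: GL_LESS,    range 0.0 .. 0.5
 *   odd:  GL_GREATER, range 1.0 .. 0.5
 * z_new and z_old are normalized depths in [0,1]. */
int new_frame_wins( int even_flag, double z_new, double z_old )
{
   if (even_flag) {
      double incoming = window_depth( z_new, 0.0, 0.5 );
      double stored   = window_depth( z_old, 1.0, 0.5 );
      return incoming < stored;
   }
   else {
      double incoming = window_depth( z_new, 1.0, 0.5 );
      double stored   = window_depth( z_old, 0.0, 0.5 );
      return incoming > stored;
   }
}
```

Within one frame the ordering is still correct: on even frames nearer fragments have smaller window depths and GL_LESS keeps them, while on odd frames nearer fragments have larger window depths and GL_GREATER keeps them. The price is that only half of the depth buffer's range is available to each frame.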
3.6 Miscellaneous
Calls such as glGetFloatv, glGetIntegerv, glIsEnabled,
glGetError, glGetString require a slow, round trip
transaction between the application and renderer.
Especially avoid them in your main rendering code.
glPushAttrib
glPushAttrib( GL_ALL_ATTRIB_BITS ) in
particular can be very expensive on hardware systems. This
call may be faster in software implementations than in hardware.
[H,L]
During development call glGetError inside your
rendering/event loop to catch errors. GL errors raised during
rendering can slow down rendering speed. Remove the
glGetError call for production code since it's a
"round trip" command and can cause delays. [all]
Use glColorMaterial instead of glMaterial
When a material property changes per vertex, glColorMaterial
may be faster than glMaterial. [all]
glDrawPixels
glDrawPixels often performs best with GL_UNSIGNED_BYTE
color components. [all]
Disable all unnecessary raster operations before calling
glDrawPixels. [all]
Avoid glPolygonMode for drawing many polygon outlines or vertex
points. Use glBegin with GL_POINTS, GL_LINES,
GL_LINE_LOOP or GL_LINE_STRIP
instead as it can be much faster. [all]
3.7 Window System Integration
The glXMakeCurrent call, for example, can be expensive
on hardware systems because the context switch may involve moving a
large amount of data in and out of the hardware.
The GLX_EXT_visual_rating extension can help you select
visuals based on performance or quality. GLX 1.2's visual
caveat attribute can tell you if a visual has a performance
penalty associated with it.
The glXWaitX and glXWaitGL functions synchronize OpenGL
rendering with native window system rendering.
After the OpenGL drawing, call glXWaitGL to synchronize, then call
all the X drawing functions.
3.8 Mesa-specific
Mesa is a free library which implements most of the OpenGL API in a
compatible manner. Since it is a software library, performance depends a
great deal on the host computer. There are several Mesa-specific features
to be aware of which can affect performance.
The MESA_RGB_VISUAL environment variable
can be used to determine the quickest visual by experimentation.
If a primitive is drawn in a single color, the glColor command
should be put before the glBegin call.
/* slower: color set inside glBegin/glEnd */
glBegin(...);
glColor(...);
glVertex(...);
...
glEnd();
/* faster: color set once, before glBegin */
glColor(...);
glBegin(...);
glVertex(...);
...
glEnd();
glColor[34]ub[v] are the fastest
versions of the glColor command.
4. Evaluation and Tuning
To maximize the performance of an OpenGL application one must be able
to evaluate the application to learn what is limiting its speed.
Because of the hardware involved it's not sufficient to use ordinary
profiling tools.
Several different aspects of the graphics system must be evaluated.
4.1 Pipeline tuning
The graphics system can be divided into three subsystems for the purpose
of performance evaluation: the CPU subsystem, the geometry subsystem,
and the rasterization subsystem.
At any given time, one of these stages will be the bottleneck. The
bottleneck must be reduced to improve performance.
The strategy is to isolate each subsystem in turn and evaluate changes
in performance.
For example, by decreasing the workload of the CPU subsystem one can
determine if the CPU or graphics system is limiting performance.
4.1.1 CPU subsystem
To isolate the CPU subsystem one must reduce the graphics workload while
preserving the application's execution characteristics.
A simple way to do this is to replace glVertex and glNormal
calls with glColor calls.
If performance does not improve then the CPU stage is the bottleneck.
4.1.2 Geometry subsystem
To isolate the geometry subsystem one wants to reduce the number of
primitives processed, or reduce the transformation work per primitive
while producing the same number of pixels during rasterization.
This can be done by replacing many small polygons with fewer large
ones or by simply disabling lighting or clipping.
If performance increases then
your application is bound by geometry/transformation speed.
4.1.3 Rasterization subsystem
A simple way to reduce the rasterization workload is to make your window
smaller. Other ways to reduce rasterization work are to disable per-pixel
processing such as texturing, blending, or depth testing.
If performance increases, your program is fill limited.
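The three experiments can be combined into a simple decision procedure. This is only a sketch: the classify function, the enum names and the idea of a fixed improvement threshold are assumptions for illustration, and the frame times are whatever your own timing code measures.

```c
enum bottleneck {
   BOTTLENECK_CPU,
   BOTTLENECK_GEOMETRY,
   BOTTLENECK_RASTERIZATION,
   BOTTLENECK_UNCLEAR
};

/* An experiment counts as an improvement when its frame time is
 * shorter than the baseline by more than the given fraction
 * (for example 0.1 for 10%), to allow for measurement noise. */
static int improved( double baseline_ms, double experiment_ms,
                     double threshold )
{
   return experiment_ms < baseline_ms * (1.0 - threshold);
}

/* Frame times in milliseconds:
 *   baseline_ms      unmodified application
 *   no_vertices_ms   glVertex/glNormal replaced by glColor
 *   less_geometry_ms fewer, larger polygons or lighting disabled
 *   small_window_ms  smaller window or per-pixel work disabled */
enum bottleneck classify( double baseline_ms, double no_vertices_ms,
                          double less_geometry_ms,
                          double small_window_ms, double threshold )
{
   if (!improved( baseline_ms, no_vertices_ms, threshold )) {
      return BOTTLENECK_CPU;
   }
   if (improved( baseline_ms, less_geometry_ms, threshold )) {
      return BOTTLENECK_GEOMETRY;
   }
   if (improved( baseline_ms, small_window_ms, threshold )) {
      return BOTTLENECK_RASTERIZATION;
   }
   return BOTTLENECK_UNCLEAR;
}
```

After fixing the reported bottleneck the measurements should be repeated, since the bottleneck usually moves to another stage.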
4.2 Double buffering
For smooth animation one must maintain a high, constant frame rate.
Double buffering has an important effect on this.
Suppose your application needs to render at 60Hz but is
only getting 30Hz. It's a mistake to think that you must
reduce rendering time by 50% to achieve 60Hz. The reason
is that the swap-buffers operation is synchronized to occur
during the display's vertical retrace period (at 60Hz for
example). It may be that your application is taking only
a tiny bit too long to meet the 1/60 second rendering time
limit for 60Hz.
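The effect can be worked out with a little arithmetic. The sketch below assumes that a buffer swap waits for the next vertical retrace and that the application does not start the next frame until the swap has happened; the synced_frame_rate name is hypothetical.

```c
/* Effective frame rate in Hz for a double buffered application whose
 * buffer swaps are synchronized to vertical retrace.
 *   render_us   time to render one frame, in microseconds
 *   refresh_hz  display refresh rate, e.g. 60
 * Each frame occupies a whole number of refresh periods. */
double synced_frame_rate( unsigned long render_us,
                          unsigned long refresh_hz )
{
   unsigned long period_us, periods;

   if (refresh_hz == 0) {
      return 0.0;
   }
   period_us = 1000000UL / refresh_hz;
   /* round the rendering time up to whole refresh periods */
   periods = (render_us + period_us - 1) / period_us;
   if (periods == 0) {
      periods = 1;
   }
   return (double) refresh_hz / (double) periods;
}
```

On a 60Hz display a frame that takes 16 milliseconds runs at 60Hz, but one that takes 17 milliseconds misses the retrace and drops to 30Hz, and one that takes 34 milliseconds drops to 20Hz. Shaving one millisecond off the 17 millisecond frame doubles the frame rate.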
4.3 Test on several implementations
The performance of OpenGL implementations varies a lot.
One should measure performance and test OpenGL applications
on several different systems to be sure there are no
unexpected problems.
Last edited on May 16, 1997 by Brian Paul.