Graphics Card Interfaces

Categories: Linux

Intro

This article gives a brief overview of how an operating system interfaces with a modern graphics card - ie what “API” cards expose to operating systems. The purpose is to give some context in which to understand both the user-space and kernel-space code for linux graphics. Warning: I’m not an expert in this area; these are notes-to-myself deduced from information available on the internet and some reading of linux-kernel source-code.

The following topics can be found below:

  • a little information about generating outputs to physical displays;
  • a very brief look at the linux graphics stack;
  • a look at the Command Stream used to configure a card (as interpreted by the card’s Command Processor);
  • an overview of AMD Southern Islands Compute Units (ie where vertex/pixel shading and GPGPU programs run)

If you’re trying to figure out how to configure your X server - you’ve found the wrong article. This is really only of interest to beginner kernel hackers (or maybe even interesting to nobody other than myself), or perhaps linux sysadmins who are curious about why userspace commands/config work the way they do.

If you’re not familiar with the overall architecture of the Linux graphics stack, then you should read this article first.

As with any attempt to summarize, there are areas below that are only approximately correct - this isn’t meant to be textbook length. However if you see any fundamental errors, please let me know!

It is assumed that you know how a PCI bus works, and roughly how Linux interacts with one; if not, you may wish to read this article first.

This article particularly describes an AMD “Southern Islands” graphics card (HD7000 series), attached via PCIe. However the general principles should apply to many cards, and at least partially to embedded graphics devices as well as PCIe ones.

AMD “Southern Islands” (aka Radeon HD7000) is the first AMD chip series to use AMD’s Graphics Core Next (GCN) architecture, which is based on SIMD instructions; earlier chips used VLIW instructions. The VLIW approach makes GPUs simpler (more logic is pushed into the compiler stage), but ended up not being a good match for modern graphics demands (esp. DX10 and later) nor for generic GPU-based computing. This GPU series was first released in late 2011.

Note that the AMD “Northern Islands” (Radeon HD6000 series) is still the “old” VLIW architecture, despite the name.

It appears that although the instruction-set has changed, the way that an OS configures the GPU is common to all AMD chips since the r600.

Frame Buffers

In the end, all graphics cards generate a “frame buffer” (aka “scan-out buffer”), ie a block of memory with (typically) 3 or 4 bytes per pixel, holding the RGB colour to show at that pixel on the screen.

For analogue outputs such as composite-video or component-video, card circuitry periodically reads each pixel in turn and generates an appropriate voltage output at the appropriate time. For digital outputs such as HDMI, the graphics card simply dumps that data to the output cable (with some minor transformations), and lets the display device map that data to appropriate voltages to drive the physical output.

Very simple graphics cards do no more - just provide a “frame buffer” and associated output circuitry, and let the operating system write RGB values into that buffer. Libraries in the operating system then provide more sophisticated APIs for 2-D graphics or 3D graphics, and use software algorithms running on the CPU to compute the color of pixels to be written to the framebuffer.
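
As a concrete illustration, here is a minimal sketch of this “dumb framebuffer” model using the Linux fbdev interface: the CPU computes every pixel itself and writes it into the mapped scan-out buffer. A 32bpp mode is assumed and error handling is omitted:

    /* Minimal sketch: CPU-side pixel painting via Linux fbdev (/dev/fb0).
     * Assumes a 32bpp mode; error handling omitted. */
    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/fb0", O_RDWR);
        struct fb_var_screeninfo var;
        struct fb_fix_screeninfo fix;
        ioctl(fd, FBIOGET_VSCREENINFO, &var);   /* resolution and depth */
        ioctl(fd, FBIOGET_FSCREENINFO, &fix);   /* buffer length and pitch */

        uint8_t *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

        /* paint the visible screen mid-grey, one pixel at a time */
        for (uint32_t y = 0; y < var.yres; y++) {
            uint32_t *row = (uint32_t *)(fb + y * fix.line_length);
            for (uint32_t x = 0; x < var.xres; x++)
                row[x] = 0x00808080;            /* XRGB8888 */
        }

        munmap(fb, fix.smem_len);
        close(fd);
        return 0;
    }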

More sophisticated cards provide various ways for an operating-system to offload the work of mapping graphics operations into pixel-colours from the CPU to the graphics card - but in the end the result is still a buffer full of pixel-values.

Of course modern cards provide multiple output options (eg component, HDMI, multiple monitors); the OS therefore also needs to support sending the appropriate commands to the card to select the right outputs. Cards also support different resolutions (ie changing the size of the scan-out buffer and the way its contents are mapped to pixels on the screen); again the OS needs to send commands to select the desired resolution.
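
On Linux, this output/mode selection is done via the KMS ioctls. A rough sketch using libdrm’s wrappers - assuming a single connected connector, an already-created framebuffer object fb_id, and no error handling:

    /* Sketch: select the first connector's preferred mode via libdrm/KMS.
     * Assumes one connected connector and an existing framebuffer fb_id. */
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    void set_first_mode(int fd, uint32_t fb_id)
    {
        drmModeRes *res = drmModeGetResources(fd);
        drmModeConnector *conn = drmModeGetConnector(fd, res->connectors[0]);

        /* modes[0] is normally the preferred resolution reported via EDID */
        drmModeSetCrtc(fd, res->crtcs[0], fb_id,
                       0, 0,                   /* x,y offset into the framebuffer */
                       &conn->connector_id, 1, /* drive this single connector */
                       &conn->modes[0]);

        drmModeFreeConnector(conn);
        drmModeFreeResources(res);
    }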

See here for more information about framebuffer drivers on Linux.

Graphics acceleration with modern AMD cards

A modern graphics card is basically a co-processor, ie it performs “offloading” of CPU work. The general process for rendering graphics is:

  • Allocate a buffer in which to write the output.
  • Allocate one or more buffers and fill them with input parameters, eg tuples of (x,y,z,normal,colour)
  • Allocate one or more buffers and fill them with CU instructions for the Compute Units of the GPU to execute
  • Fill a “command stream” buffer with CS instructions for configuring the GPU
  • Pass the “command stream” buffer to the GPU for execution.

When execution is complete, the output buffer contains a 2D pixel array of RGBA values, suitable for copying/composing to an appropriate area of the framebuffer. Asynchronously to this process, the card’s output circuitry is repeatedly turning the framebuffer contents into appropriate signals to the display device.
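
In pseudo-C, the sequence above looks roughly like this; all the gpu_* helpers are hypothetical stand-ins for the driver-specific allocate/upload/submit calls described later:

    /* Pseudo-C sketch of the offload sequence above; the gpu_* helpers
     * are hypothetical stand-ins for driver-specific calls. */
    #include <stddef.h>

    struct gpu_buf;                               /* opaque GPU buffer */
    struct gpu_buf *gpu_alloc(size_t size);
    void gpu_upload(struct gpu_buf *b, const void *data, size_t size);
    void gpu_submit(struct gpu_buf *cs);          /* hand the CS to the GPU */

    void render_frame(const void *verts, size_t vsz,
                      const void *shader, size_t ssz)
    {
        struct gpu_buf *out = gpu_alloc(1920 * 1080 * 4); /* RGBA output  */
        struct gpu_buf *vb  = gpu_alloc(vsz);    /* vertex data           */
        struct gpu_buf *sh  = gpu_alloc(ssz);    /* CU program (shader)   */
        struct gpu_buf *cs  = gpu_alloc(4096);   /* command stream        */

        gpu_upload(vb, verts, vsz);
        gpu_upload(sh, shader, ssz);
        /* the CS would point the GPU at vb/sh/out and end with a draw command */
        gpu_submit(cs);
    }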

These cards can also be used for things other than graphics; they are really CPUs with special math-oriented instruction sets, and can simply be viewed as such: they take data as input and generate data as output. Having the input be a 3D vertex graph plus texture maps, and the output be per-pixel colour values, is simply one application.

Quite how general-purpose the GPU is depends on how modern it is; the trend is clearly to move from GPU instructions with very graphics-specific operations to more generic mathematical instructions. In fact, modern GPUs usually have two quite distinct instruction sets: one is a “command stream” that configures the card and performs a limited set of 2D and 3D drawing operations, while the other performs mostly mathematical operations using SIMD (ie where each instruction operates on multiple data elements).

Cards can have internal memory, or can share the CPU’s main memory. When the RAM is on the card, loading the program, vertex graph and textures from the CPU into the card is slow, but computing the results and writing to the frame buffer is fast. When the RAM is “system memory”, the reverse is true. Possibly more importantly, separate graphics RAM allows the CPU to perform memory accesses while graphics rendering is in progress, without contention for the memory bus.

The Linux DRM kernel module provides device nodes under /dev/dri/* which can be used to access the GEM/TTM kernel modules; these allow user-space programs to allocate and access the buffers mentioned above: for “window backing buffers”, vertex graphs, textures, shader programs, and configuration commands. With this basic support in kernel-space, the rest of the logic for generating graphics can run in user-space; mapping from graphics API calls such as OpenGL or GTK+ into the appropriate data and GPU instructions for the specific installed video card is done via card-specific userspace libraries (eg Mesa-3D + gallium).
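
For instance, the simplest kind of buffer (a “dumb” buffer, suitable for CPU-side rendering) can be allocated and mapped through these device nodes as follows (error handling omitted):

    /* Allocating a CPU-mappable "dumb" buffer through a /dev/dri/* node.
     * DRM_IOCTL_MODE_CREATE_DUMB is the simplest DRM/GEM allocation ioctl;
     * compile against the kernel UAPI / libdrm headers. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <drm/drm.h>
    #include <drm/drm_mode.h>

    void *alloc_dumb(int fd, uint32_t w, uint32_t h, uint32_t *handle)
    {
        struct drm_mode_create_dumb creq;
        memset(&creq, 0, sizeof(creq));
        creq.width = w; creq.height = h; creq.bpp = 32;
        ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq); /* kernel picks pitch/size */
        *handle = creq.handle;                        /* GEM handle for the buffer */

        struct drm_mode_map_dumb mreq;
        memset(&mreq, 0, sizeof(mreq));
        mreq.handle = creq.handle;
        ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &mreq);    /* get a fake mmap offset */

        /* map the buffer into this process's address-space */
        return mmap(NULL, creq.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, mreq.offset);
    }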

Page Flips and VBlank

Above, it is mentioned that circuitry generates output signals to display devices by “scanning the framebuffer”. There are actually two buffers in use: one used by the output circuitry, and one currently being modified. When all drawing is ready, a “page flip” swaps the two. This ensures that the buffer that the output circuitry is using is “stable”, which prevents tearing and flicker.

Output devices also typically refresh pixels on the display starting from the top-left, and working line-by-line down to the bottom right. After refreshing a screen’s worth of pixels, the circuitry pauses until it is time to do the refresh again. The pause duration depends upon the configured screen “refresh rate”. The point in time when this pause starts is called the “vblank” (vertical blanking interval), and it is a good time to do a “page flip”.
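
Via the DRM API, a page flip is requested for the next vblank, and its completion is delivered as an event on the device fd. A sketch using libdrm:

    /* Sketch: queue a page flip for the next vblank and wait for the event.
     * drmModePageFlip() asks the kernel to swap scan-out to fb_id at vblank
     * and deliver a completion event on the DRM fd. */
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    static void flip_done(int fd, unsigned int frame, unsigned int sec,
                          unsigned int usec, void *data)
    {
        /* the previous front buffer is now free for drawing again */
    }

    void flip(int fd, uint32_t crtc_id, uint32_t fb_id)
    {
        drmModePageFlip(fd, crtc_id, fb_id, DRM_MODE_PAGE_FLIP_EVENT, NULL);

        drmEventContext ev = {
            .version = DRM_EVENT_CONTEXT_VERSION,
            .page_flip_handler = flip_done,
        };
        drmHandleEvent(fd, &ev);  /* blocks until the flip completes at vblank */
    }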

Programmable Components

There are actually two completely different instruction-sets in the GPU:

  • The “GFX Command Processor” supports an instruction-set which is used for:

    • configuring the card;
    • doing 2D immediate graphics operations;
    • initiating scene rendering or GP-GPU programs (ie starting the rendering pipeline working)
  • The “Compute Units” support an instruction-set for high-speed data processing (including “pixel shading”).

In addition, there are some tasks that are just better done in a fixed manner with customised hardware than by general-purpose computing instructions. Most cards (including Southern Islands) therefore have “fixed function units” that can do “tessellation, geometry and high-order surface processing” on input data, then pass the output to the Compute Units. A “rendering pipeline” consists of a sequence of fixed-function and programmable stages through which the “scene description” flows until it finally ends up as a flat 2D pixel array.

Documentation on recent graphics chips sometimes has a tendency to heavily document the Compute Units and almost completely skip over the Command Processor and fixed-function units. This is because the documentation is intended for people doing GPGPU computing, not for people writing graphics drivers. GPGPU programs can be loaded and executed using vendor-provided tools.

The Command Processor

In order to configure the graphics card, code running on the CPU can either

  1. do a sequence of writes to special memory addresses mapped to the graphics card “registers”, or
  2. pass a sequence of “CS instructions” to the graphics card which are then interpreted by the “command processor”.

According to the documentation, option (2) is preferred; option (1) is much slower, and is now recommended only for debugging. However in the current linux drivers, both are used depending on circumstance. Reading of card values is done via direct reads of the appropriate mapped memory addresses.

To pass CS instructions to the GPU, the CPU does some initialisation at startup - it allocates a smallish memory buffer in normal memory, and pokes some GPU registers to set up a pair of head/tail pointers to turn this memory into a circular buffer. It can then write instructions into this buffer and when ready update the GPU’s “tail” pointer register via the traditional “poke a value to a fixed address”. When the GPU is ready, it reads a block of instructions (usually everything that is in the buffer) in one go, usually using DMA. It then updates its “head” pointer register to indicate that there is now more free space in the buffer, and raises an interrupt so the CPU notices. Transferring instructions in batches like this is efficient with respect to memory access, and importantly also decouples the GPU and CPU somewhat (ie neither has to wait for the other so often). See radeon_ring.c:radeon_ring_write().
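
A conceptual sketch of the CPU side of this ring follows; the register offset and the way the pointers are mapped are hypothetical (see radeon_ring.c for the real code):

    /* Conceptual sketch of the CPU side of the command ring. RING_WPTR and
     * the mmio/ring pointers are hypothetical stand-ins. */
    #include <stdint.h>

    #define RING_SIZE_DW 1024              /* ring length in 32-bit words     */
    #define RING_WPTR    0x1234            /* hypothetical write-pointer reg  */

    extern volatile uint32_t *mmio;        /* mapped GPU registers (via BAR)  */
    extern uint32_t *ring;                 /* the shared circular buffer      */
    static uint32_t wptr;                  /* CPU's write position, in dwords */

    void ring_write(uint32_t v)
    {
        ring[wptr] = v;
        wptr = (wptr + 1) % RING_SIZE_DW;  /* wrap around */
    }

    void ring_commit(void)
    {
        /* tell the GPU where the new tail is; it will DMA everything between
         * its read pointer and this write pointer in one batch */
        mmio[RING_WPTR / 4] = wptr;
    }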

A single instruction in the command buffer can also instruct the GPU to load the contents of a separate “indirect buffer” (IB) full of commands, ie the CPU side can allocate a buffer and fill it with commands, then write a single command to the circular buffer to execute the prepared instructions. This avoids problems with a “circular buffer full” state being encountered at an inconvenient time.

The “programs” to be executed in the “Compute Units” are instead treated just like “data”, ie a buffer is allocated, the instructions are written into it, and then, as part of the “command stream”, the GPU is told which buffers to execute on the Compute Units.

Because the command stream can be used to configure/trigger DMA by the graphics card to/from addresses in kernel memory, any userspace application with the right to send arbitrary commands can take over the host machine. In addition, it is even possible in some cases to set graphics modes that will damage the card or the display. Therefore, KMS kernel drivers typically scan the command-stream for safety before passing it on to the card; for X native DDX drivers, X is simply “trusted” to get it right in userspace.

Scratch registers: there are 8 GPU registers that can be configured to be “replicated” to RAM via DMA. When anything in the GPU writes to one of these registers, the value is also pushed to memory. Particularly useful is that a command in the command-stream can write a magic value to one of these registers; polling the corresponding memory location from the CPU then reveals when that instruction has been completed. ?? Is this how the “fence” works?
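
A sketch of how a fence-like mechanism can be built on this; the packet emission is left schematic, since I don’t have the real packet format documented:

    /* Sketch of the scratch-register trick: a CS packet writes a sequence
     * number to a scratch register, the GPU mirrors it to system RAM, and
     * the CPU polls that location. Packet emission is schematic only. */
    #include <stdint.h>

    extern volatile uint32_t *scratch_mirror; /* RAM copy of the scratch regs */

    void emit_fence(uint32_t seq)
    {
        /* ... emit a CS packet meaning "write <seq> to scratch register 0" ... */
        (void)seq;
    }

    void wait_fence(uint32_t seq)
    {
        /* when the GPU executes the packet, seq appears in system memory */
        while (scratch_mirror[0] < seq)
            ; /* busy-poll; the real driver sleeps and waits for an interrupt */
    }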

Command Stream (CS) Instructions

There is unfortunately no documentation that I can find on the “command stream” used to configure southern islands chips, and kick off the rendering process. However from a look at the kernel driver code, it appears that this has changed little since even the r200 series of cards (unlike the programmable vertex/pixel shader parts which have now been replaced by Compute Units). And fortunately, the command stream is well documented for r500 cards. The only major difference appears to be an additional two “ring buffers” specifically for feeding GPGPU programs directly to the compute units. See radeon_cs.c:radeon_cs_get_ring().

Each CS instruction has a “type” and “length” field; type 0 is for “write to register” and type 3 is a ‘normal’ instruction with an embedded instruction-code. Types 1 and 2 are rarely used. The length field indicates how many additional words of parameters are associated with the instruction.
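
The header encoding, following the PACKET0/PACKET3 macros in the radeon kernel driver: bits 31:30 hold the type, bits 29:16 hold the count (number of following dwords, minus one), and the low bits hold either the starting register (as a dword offset) or the opcode:

    /* Building CS packet headers, per the radeon driver's PACKET0/PACKET3. */
    #include <stdint.h>

    static inline uint32_t packet0(uint32_t reg, uint32_t n)
    {
        return (0u << 30) | ((n & 0x3FFF) << 16) | ((reg >> 2) & 0xFFFF);
    }

    static inline uint32_t packet3(uint32_t op, uint32_t n)
    {
        return (3u << 30) | ((n & 0x3FFF) << 16) | ((op & 0xFF) << 8);
    }

    /* eg write 2 values starting at (hypothetical) register 0x1438:
     *   ring_write(packet0(0x1438, 1)); ring_write(a); ring_write(b);   */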

Examples of “CP” (Command Processor) instructions include:

  • (type-0 command): write the following N constant values from the command-stream into registers P..P+N
  • (type-1 command): write the following N constant values from the command-stream into a single register P
  • PAINT: paint a rectangle with a given brush
  • BITBLT: copy a rectangular area from elsewhere in frame buffer
  • POLYLINE: draw a sequence of connected lines
  • NEXTCHAR: render a font glyph at the “current location”
  • 3D_DRAW_VBUF – trigger the full 3d rendering path, ie load shader programs, textures, etc., and then render the scene from the current vertex buffer

The GPU has an internal space for storing “PVS (Programmable Vertex Shader) code”, ie shader programs. The command-stream loads the necessary instructions into this space, and sets registers to indicate the start-point. As noted earlier, the PVS instruction-set is completely different from the Command Stream instruction-set.

See the r5xx documentation sources in the “references” section at the end of this article for the full details.

Memory Management

Modern graphics cards often have large amounts of very fast RAM available.

For userspace applications running with 64-bit addressing, it is trivial to expose this memory - there is plenty of user address-space available, and gigabytes of graphics RAM can simply be mapped directly into the application’s address-space.

For 32-bit operating systems, things are really tricky. Basically, it is necessary for the application to map a buffer into its address-space when needed, and then to unmap it once finished, so that there is sufficient address-space available to map the next needed buffer. These mapping/unmapping operations have a significant performance impact, and therefore graphics on 64-bit systems runs significantly faster.

The primary purpose of the Linux GEM/TTM libraries is to allow user-space code to manage memory buffers shared with the graphics card in order to transfer data such as vertices and textures.

Here’s an interesting table from “GPU Gems Ch 30”, by Nvidia:

Table 30-1. Available Memory Bandwidth in Different Parts of the Computer System

Component                                       Bandwidth
GPU Memory Interface                            35 GB/sec
PCI Express Bus (x16)                           8 GB/sec
CPU Memory Interface (800 MHz Front-Side Bus)   6.4 GB/sec

Graphics Translation Table (GTT)

Some integrated graphics systems (ie those using normal system memory rather than dedicated memory) contain a GTT (Graphics Translation Table) aka GART (Graphics Address Remapping Table). This is a kind of simplified memory-management unit used to map memory-addresses in the GPU’s instruction stream to physical memory addresses on the system bus. One example is the modern series of Intel integrated graphics, as present on i3/i5/i7 chips; these are referred to by Intel as GEN graphics (Broadwell is GEN8, ie generation 8, of this design).

There is a very good document on the Intel GTT, which also helps in understanding some of the design of the GEM memory manager used by Linux to allocate memory. The referenced document talks frequently about PCI; AIUI although the Intel integrated graphics is actually on the same die as the CPU, it is also connected to the PCI bus (and appears as a PCI device).

While a “proper” MMU as present on standard CPUs can map large amounts of memory (due to having three or four “levels” of page-tables), GTTs (the intel ones at least) have far fewer table entries, meaning that there is a limit to the amount of memory that the CPU can make available to them. In particular, this means that when a user-space application passes large amounts of data to the graphics system, existing GTT mappings may need to be removed to allow the new data to be mapped in. Managing this is one of the responsibilities of the GEM memory manager. Note also that when multiple applications are concurrently performing graphics (“timesharing the GPU”), managing the GTT entries efficiently becomes even more complex. Intel Sandybridge graphics and later have a Per-Process Graphics Translation Table (PPGTT) which allows better “context switching” when processing graphics.
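
Conceptually, then, a GTT is just a single-level page table. A sketch - the 32-bit entry format here is made up, since real entry formats vary by GPU generation:

    /* Sketch: the GTT as a single-level page table, with a made-up
     * 32-bit entry format (real formats vary per GPU generation). */
    #include <stdint.h>

    #define GTT_ENTRY_VALID 0x1

    extern uint32_t *gtt;   /* the translation table itself, in system RAM */

    /* make system page 'phys' visible to the GPU at gpu_addr */
    void gtt_map(uint64_t gpu_addr, uint64_t phys)
    {
        gtt[gpu_addr >> 12] = (uint32_t)(phys & ~0xFFFull) | GTT_ENTRY_VALID;
    }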

An “aperture” is the section of a PCI graphics card’s onboard memory that has been mapped via a BAR so it is directly accessible from the CPU. As noted in a document referenced above, there appears to be no reason why a 64-bit system with discrete graphics on a PCIe bus should not support an extremely large aperture - but it appears that nobody does this. Instead, the settings “requested” by the graphics card (via the values it returns when the BARs are read) specify a reasonably small memory range.

According to the above document: “the aperture is a subset of the GTT which can be accessed through a PCI bar”. Intel integrated graphics has no “dedicated memory”, instead using memory on the system bus. The CPU configures the graphics system’s GTT so that it can access certain physical pages, and the CPU can potentially write directly to those same pages. However apparently that can lead to “memory coherence problems”; instead the CPU should only write to addresses mapped via the graphics BARs, for which the graphics system presumably gets notifications from the PCI controller so that it can flush caches etc.

DMA Units

A PCI device has the ability to become “bus master” and then read/write system memory. Some graphics systems have an inbuilt DMA module which allows them to set up asynchronous “fetches” of data from system memory into local card memory, or vice-versa. When DMA is initiated from the CPU side, then the only on-graphics-card memory that can be addressed is that which has been mapped via a PCI BAR (the “aperture”). However when DMA is initiated by the graphics card, it can address all of system memory, and all of the card memory.

Note however that allowing a PCI device to access all of system memory is somewhat of a security issue; some PCI controllers therefore contain an IOMMU through which bus-mastered read/writes are mapped so that the host system can keep control of the addresses that the graphics card can access.

The Intel integrated graphics uses a GTT or PPGTT whose mappings are under control of the CPU, thus allowing the CPU to limit the pages visible to code running on the GPU.

The Compute Units

Example Compute Unit Instructions

Just in case you are curious, it is interesting to see what kinds of instructions a Compute Unit offers. This information is really only needed by people writing code to implement accelerated graphics operations in X drivers or OpenGL libraries. Knowing something about programming with these instructions helps to understand why graphics drivers are large and complicated, and why some approaches rely on embedding a complete compiler like LLVM.

Here is a selection of interesting instructions from the AMD Southern Islands GPU:

Scalar (SISD) Instructions:

  • ADD, SUB, MUL, ABS, MIN, MAX - standard maths ops
  • AND, OR, XOR, SHL, SHR, NOT - standard bit ops
  • BFM, BFE - bitfield operations
  • CMP_EQ, CMP_GT, CMP_LT, CMP_GE, CMP_LE, BITCMP - set condition flag
  • CBRANCH - jump if condition flag set
  • CMOV - copy data if condition flag set
  • WQM : wholeQuadMode: if any bit in a group of 4 is set, set all bits
  • CountZeroBits, CountOneBits, FindLastBit
  • ICACHE_INV - invalidate instruction cache

Vector (SIMD) Instructions:

  • ADD(t,s1,s2) : lane[i].reg[t] = lane[i].reg[s1] + lane[i].reg[s2] for all i (ie perform t=s1+s2 on all lanes)
  • as above for SUB, MUL, MIN, MAX
  • SHR(t,s1,s2) : lane[i].reg[t] = lane[i].reg[s1] >> lane[i].reg[s2] for all i (ie perform t=s1>>s2 on all lanes)
  • AND/OR/XOR/BFM etc vectorised too
  • LDEXP: result = arg1 * 2^arg2 – ie C library function ldexp() in hardware
  • about a dozen types of float->int conversion operations
  • about a dozen types of int->float conversion operations
  • CEIL/FLOOR
  • RCP, RSQ – reciprocal value, reciprocal square root
  • SQRT, SIN, COS – trigonometry
  • CUBEID - compute a “face id” from a cubemap (result is an integer 0..5)
  • CubeMapS, CubeMapT, CubeMapMajorAxis
  • LERP - unsigned 8bit pixel average (linear interpolation)
  • MED - compute median of 3 values
  • SAD - sum of absolute differences
  • INTERP - vertex parameter interpolation with barycentric coordinates
  • IMAGE_SAMPLE, IMAGE_GATHER - read from image buffer, and store “processed” values in registers rather than whole image
  • copy lane[i] value to a scalar register
  • BUFFER_LOAD, BUFFER_STORE: transfer data between vector registers and main memory. Load/store ops explicitly control whether they want coherent data or not.
  • IMAGE_* (for texture maps and typed surfaces) : can compute “fragid”, slice, z and face_id values
  • SAMPLE_* : can compute sample_b, sample_c, sample_d, gather values, or “derivatives” (ie slopes of geometric faces)
  • EXPORT : stores RGBA data into memory, and optionally Z (depth)

Vector operations apply to N data values concurrently (“lanes”). The basic concept is that input data is broken up into groups of N values, and then they are fed through a “program” in one pass. When the program completes (S_ENDPGM is encountered), the next N values are fed through the same program, etc.

An example of a conditional vector operation is: “execute val=val*2 if val is odd”; see the “Vector fork/join” section below for how this is handled.

Compute Unit Hardware

A Southern Islands GPU:

  • Has 32 Compute Units (CUs), each of which has 64KB of private RAM - ie it is like a cluster of CPUs. Each CU also has its own L1 cache, and its own set of registers.

  • Each CU has 1 scalar and 4 vector units. Each vector unit can apply the same instruction to 64 values concurrently. A scalar instruction executes in one cycle; a vector instruction takes 4 cycles to cover all 64 values (16 lanes per cycle).

AMD’s terminology is a little unusual for those of us used to traditional programming; it is “data driven”, ie their view is that you take data and apply a program to it, rather than take a function and apply it to data. It is necessary to keep this in mind when reading AMD documentation. For example, when doing “pixel shading”, they describe this as taking a batch of 64 pixels, and applying the same program to each pixel. Note: their name for a GPU program is a “kernel” :-)

Wavefronts

A “wavefront” is an invocation of a SIMD “program” (kernel) together with up to 64 input values. A wavefront can have a mix of scalar and vector instructions, ie executes on either the scalar component of the CU, or one of its vector units. A wavefront has a single Program Counter (PC) value, ie exactly one of its instructions is executed at a time. Multiple “wavefronts” can be assigned to the same CU in which case they can interact (can see shared registers and shared memory); presumably a wavefront will block on a scalar instruction if some other wavefront on the same CU is using the scalar unit. However because there are 4 vector units in a CU, 4 wavefronts can potentially be running vector operations concurrently (each processing 64 values) on the same CU.

Each CU is allocated many more “wavefronts” than it has vector units to support them; this allows the CU to implement “hyperthreading”, ie when a wavefront blocks (due to memory access, or competition for the scalar unit, etc) then another of the allocated wavefronts can be executed in its place. This means that wavefronts may be completed in a different order than they were originally scheduled; for the rare cases where this matters, there are a number of synchronization instructions available. Each “kernel” (ie GPU program) must declare the amount of working memory and number of registers it intends to use; the dispatching unit can therefore distribute wavefronts (which reference kernels) across CUs in order to maximise the chances of parallel execution. This ability to schedule an alternate wavefront when an existing one “blocks” is in fact one of the important factors which drove the change from VLIW to GCN architecture; with VLIW, the compiler is responsible for figuring out when an operation might block, and for rearranging instructions to avoid it - but in some cases, behaviour can’t be predicted until runtime.
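
To make the resource-driven scheduling concrete, here is a back-of-envelope occupancy calculation. The limits used (10 wavefronts and 256 vector registers per SIMD) are the commonly quoted Southern Islands figures, so treat them as illustrative:

    /* Back-of-envelope: how many wavefronts fit on one SIMD, given the
     * number of vector registers each wavefront's kernel declares. */
    static int waves_per_simd(int vgprs_per_wave)
    {
        int limit_hw   = 10;                   /* architectural maximum */
        int limit_regs = 256 / vgprs_per_wave; /* register-file budget  */
        return limit_hw < limit_regs ? limit_hw : limit_regs;
    }
    /* a kernel using 32 VGPRs gets 8 waves to hide latency with;
     * one using 84 VGPRs gets only 3 */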

Vector fork/join:

Conditions are interesting when using vectors of data: an “if” statement may be true for some vector elements, and false for others! The standard way this is handled is for there to be an “exec” bit for each vector element; a test instruction is applied to all vector elements which sets the exec bit for that element iff the test is true for that element. Then when the next instruction is executed, all elements where the exec bit is false treat it as a NOOP. This way, the program counter remains identical for all vector elements. It is assumed that only a few instructions are executed before the “exec” bit is forced to true for all elements again.
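
Simulating that mechanism in C, for the earlier example “execute val=val*2 if val is odd” across 64 lanes:

    /* CPU-side simulation of the exec mask: "val = val*2 if val is odd",
     * applied across 64 lanes. */
    #include <stdint.h>

    void double_odd_lanes(uint32_t val[64])
    {
        uint64_t exec = 0;

        /* vector compare: set the exec bit for every lane where val is odd */
        for (int lane = 0; lane < 64; lane++)
            if (val[lane] & 1)
                exec |= 1ull << lane;

        /* vector multiply: lanes with a clear exec bit treat it as a NOOP */
        for (int lane = 0; lane < 64; lane++)
            if (exec & (1ull << lane))
                val[lane] *= 2;

        /* exec would now be restored to all-ones for subsequent instructions */
    }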

Implementing a conventional “if” statement is done via the “fork” command; effectively the program instructions are applied only to those data values which are “true”, up until end of program or “join” command is encountered. At that point, the program counter gets reset back to the “fork” point, and the instructions are repeated for the “false” data items. Clearly, the performance impact of this is significant - avoid where possible.

Other Items

An i2c bus is commonly used by graphics cards to communicate with (modern) attached displays to retrieve monitor parameters (eg EDID codes and resolutions).

References

Information in this article has been pulled from many sources. Here are a few of them:

Since this article was written, a number of useful resources have been published:

Unresolved Questions

  • Where is the CS documentation for Southern Islands?

  • for X userspace drivers, what maps BARs to addresses accessible via /dev/mem?

  • what is a “fence” in the graphics drivers? It is some kind of synchronization primitive between the writer and reader of a shared buffer, but I’m not sure of the exact details.

Footnotes

(none yet)