The Linux Sound Stack

Categories: Linux

Introduction

This article is about how sound gets output on a Linux system. It covers:

  • OSS
  • ALSA
  • PulseAudio

OSS (Open Sound System) basically defines a standard interface offered by sound-card-specific kernel drivers. Apps can then open any compatible sound-card node in /dev, and use the same API to interact with it. The API is of course based on read/write/ioctl system calls. The same API is implemented in BSD and other unix-like kernels.

Following problems between the Linux maintainers and the maintainers of OSS, the ALSA project was started as a replacement. ALSA similarly provides a set of kernel drivers with a standard API; this API was intended to improve on the original OSS API, though that is debated. ALSA drivers also provide an OSS-compatible interface for backwards compatibility. The ALSA project also provides a userspace library which implements significant functionality in userspace. However because ALSA drivers only support one “operation” at a time (eg playing a single stream) and the userspace part is a library (not a daemon process), effectively only one application can produce sound at a time.

PulseAudio provides a daemon process that multiple clients can connect to concurrently, and can therefore mix streams from multiple processes before forwarding to an ALSA sound driver. Client apps can talk to the pulseaudio daemon directly (via various libraries), or can use the original ALSA library with the result being forwarded to the pulseaudio daemon (rather than being sent directly to hardware).

Sound Hardware

This hardly needs mentioning, but:

  • sound hardware is almost always on a PCI bus (regardless of whether it is actually a plug-in board, or soldered to the motherboard). Even on Intel integrated GPUs, the sound system is PCI-addressable.
  • GPU cards usually also include a sound hardware component - and when the GPU can output HDMI, sound data can also be embedded in the HDMI stream.
  • sound hardware often combines multiple outputs (eg 2 speakers for stereo, 3 for systems with a dedicated bass speaker, or even 5 for “surround sound”) and potentially multiple inputs (microphones).
  • different sound hardware (both inputs and outputs) supports different sampling rates

Sound components can be divided into the following categories:

  • playback from pcm-format data
  • capture into pcm-format data
  • control aka mixer (setting volume-levels, switching between multiple possible inputs/outputs)
  • sequencer (MIDI interface)
  • timer (timing sources, used eg by the sequencer to schedule events)

PCM

Sound is usually stored in digital form (ie as a sequence of bytes) using “pulse code modulation” (PCM) representation.

In general, you can think of PCM as being a stream of bytes representing the absolute displacement of the speaker membrane measured at regular time intervals. Play sound into a microphone, and sample the displacement of the microphone’s diaphragm from its rest position at regular intervals - eg 8kHz. Scale the samples into the range 0..255 (ie centred on 128), and this is what a “pcm” audio node will return when read. When the input sound is a “pure tone”, plotting the bytes read on a graph will produce a sine wave (approximately); more complex sounds produce more erratic-looking sample values. Playing sound is just the reverse - take a sine wave, sample it at 8kHz, map the values into range 0..255 and send to the output device to play a “pure tone” - or send less-regular patterns for more complex sounds.

There are minor variants, including 16-bit sampling (usually as signed values, -32768..32767) and mu-law scaling. And a data-stream can potentially contain multiple channels of data, interleaved in various possible ways.
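
To make the description above concrete, here is a minimal C sketch (all values chosen for illustration) which fills a buffer with one second of a 440Hz “pure tone”, sampled at 8kHz as unsigned 8-bit PCM; later sections show how such a buffer can be handed to OSS, ALSA or pulseaudio:

    /* gcc pcm_tone.c -o pcm_tone -lm */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RATE 8000              /* samples per second (8kHz) */
    #define FREQ 440.0             /* tone frequency in Hz */

    int main(void) {
        static uint8_t buf[RATE];  /* one second of mono audio */
        for (int i = 0; i < RATE; i++) {
            double t = (double)i / RATE;
            /* sine wave scaled into 0..255, centred on 128 */
            buf[i] = (uint8_t)(128 + 127 * sin(2 * 3.14159265358979 * FREQ * t));
        }
        printf("first samples: %u %u %u %u\n", buf[0], buf[1], buf[2], buf[3]);
        return 0;
    }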

The primary alternative to this kind of digital sound is MIDI, in which a process selects an instrument then sends a sequence of notes; some device (whether a sound-card or an external device such as an attached electronic keyboard) then converts that into speaker displacements.

OSS

The Open Sound System (OSS) is the original sound architecture for Linux and many other unix-like systems. It defines a standard API for kernel devices that perform playback/capture/control/etc.

Under the original OSSv3, only one thread at a time may use a device (opening an already-open audio device returns EBUSY), and there are many other limitations.

After releasing OSSv3, the primary developer of OSS started releasing code under proprietary licences. While some unix-like distributions simply forked OSSv3 and continued development, under Linux it was instead replaced by ALSA. Eventually the same developer released code for an OSSv4 under an open licence; OSSv4 resolves many issues with OSSv3, but Linux had long since moved to ALSA instead.

According to the wikipedia article, some of the BSD family of operating systems have stayed with OSS as a general architecture, and significantly enhanced it.

OSS defines its API in the C header file sys/soundcard.h. However there is no OSS library; this header just provides suitable constants for passing directly to read/write/ioctl calls against the kernel devices.

According to the OSS programming manual, OSS provides the following /dev nodes:

  • /dev/mixer – select sources, set volume levels
  • /dev/sndstat – diagnostic info only
  • /dev/dsp*, /dev/audio* – reading these files records sound (in PCM format), writing to these files (in PCM format) produces sound; sometimes referred to as codecs
  • /dev/sequencer* – access to “synthesiser” features on card (not used much any longer AFAIK), and to connect to MIDI devices (eg external electronic keyboards).
  • /dev/midi* – lower-level access to MIDI devices
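
To illustrate how directly these nodes are driven, here is a hedged sketch (OSSv3-style; it assumes a /dev/dsp node exists, either from a real OSS driver or from ALSA’s OSS emulation) which configures the device and plays one second of the 8-bit tone from the PCM section:

    /* gcc oss_tone.c -o oss_tone -lm */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/soundcard.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("/dev/dsp", O_WRONLY);
        if (fd < 0) { perror("open /dev/dsp"); return 1; }

        int fmt = AFMT_U8, channels = 1, rate = 8000;
        /* each ioctl may adjust the value to what the hardware actually supports */
        ioctl(fd, SNDCTL_DSP_SETFMT, &fmt);
        ioctl(fd, SNDCTL_DSP_CHANNELS, &channels);
        ioctl(fd, SNDCTL_DSP_SPEED, &rate);

        uint8_t buf[8000];                     /* one second of a 440Hz tone */
        for (int i = 0; i < 8000; i++)
            buf[i] = (uint8_t)(128 + 127 * sin(6.28318530718 * 440.0 * i / 8000.0));

        write(fd, buf, sizeof buf);            /* writing PCM data produces sound */
        close(fd);
        return 0;
    }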

The Linux kernel originally shipped many OSS kernel drivers directly. ALSA-based systems instead have optional device drivers which emulate the OSS devices, so that the traditional /dev/dsp* and /dev/mixer* nodes still work (the native ALSA device nodes are found under /dev/snd instead).

In short, OSS provides very direct access to kernel drivers. That directness can be useful for specialised sound processing, but (at least in v3) it does not provide good multi-process support, and is therefore not a good fit for general desktop use.

The OSS programming manual mentioned above is actually a good resource for learning about sound devices, as the software interface maps so closely to real capabilities.

In the kernel source, directory /sound/oss contains many OSS-specific drivers. File /sound/core/sound_oss.c provides a way for other sound-drivers which have OSS-emulation support to register that; this in turn makes it possible for userspace to use mknod to expose that functionality as files under /dev. It appears however that most modern linux distros don’t bother to enable OSS-emulation; apparently there are not enough applications left that depend on it.

ALSA

ALSA provides a set of Linux kernel audio device drivers, replacing the original OSS ones. These drivers provide an ALSA-specific API to userspace.

The alsa-oss (aka aoss) components provide backwards compatibility with OSS; there are two approaches:

  • additional kernel drivers that emulate the old OSS kernel modules (ie can be used to back the expected nodes in /dev);
  • a library which can intercept OSS operations and redirect to appropriate ALSA equivalents.

The library approach is interesting, as there is no OSS userspace “library”; applications written for OSS perform direct read/write/ioctl operations against the OSS devices in /dev. The aoss library is loaded via $LD_PRELOAD, ie it is force-loaded into the OSS-based application. Presumably this library wraps the glibc open call to check whether the path refers to an OSS file, and if so inserts custom handling somehow.
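
The general $LD_PRELOAD interception pattern looks roughly like the sketch below. This is not the actual aoss source, just an illustration: a shared library whose open() wrapper finds the real glibc open via dlsym(RTLD_NEXT) and could divert /dev/dsp paths to ALSA (here it only logs them):

    /* Build as a shared library, eg: gcc -shared -fPIC -o libintercept.so intercept.c -ldl
       Then run:  LD_PRELOAD=./libintercept.so some-oss-app                                  */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <string.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...) {
        /* look up the "real" open provided by glibc */
        int (*real_open)(const char *, int, ...) =
            (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {            /* open() only takes a third arg with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        if (strncmp(path, "/dev/dsp", 8) == 0) {
            /* a real wrapper (like aoss) would return a handle backed by ALSA here;
               this sketch just logs the interception */
            fprintf(stderr, "intercepted open of %s\n", path);
        }
        return real_open(path, flags, mode);
    }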

Unlike OSS, ALSA provides a user-space library libasound (also known as “alsa-lib”) that applications are supposed to use instead of interacting directly with device files. Note that this is a library, not a daemon. The library is written to support multiple threads (fixing one problem with OSSv3), but as with OSS the kernel device drivers can still only be used by one process at a time.
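
For example, a minimal playback program built on libasound might look like this sketch, which opens the “default” device (exactly the device name that the configuration machinery described below resolves) and plays one second of the 8-bit tone:

    /* gcc alsa_tone.c -o alsa_tone -lasound -lm */
    #include <alsa/asoundlib.h>
    #include <math.h>
    #include <stdint.h>

    int main(void) {
        snd_pcm_t *pcm;
        if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
            return 1;

        /* ask for unsigned 8-bit mono at 8kHz; "1" allows libasound to resample,
           500000us is the requested latency */
        snd_pcm_set_params(pcm, SND_PCM_FORMAT_U8,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           1, 8000, 1, 500000);

        uint8_t buf[8000];                        /* one second of a 440Hz tone */
        for (int i = 0; i < 8000; i++)
            buf[i] = (uint8_t)(128 + 127 * sin(6.28318530718 * 440.0 * i / 8000.0));

        snd_pcm_writei(pcm, buf, sizeof buf);     /* frames == bytes for 8-bit mono */
        snd_pcm_drain(pcm);                       /* wait for playback to finish */
        snd_pcm_close(pcm);
        return 0;
    }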

It appears that ALSA provides a generic “sound processing pipeline” architecture. There are many different “plugins” available which transform or process sound data. An application (or config file) can then link these together into a pipeline, much like a shell pipeline (or gstreamer). The data flowing through a pipeline can be multichannel data.

The most important plugins are:

  • The hw plugin which actually communicates with real physical hardware
  • The plug plugin which can resample (ie convert between different sampling rates)
  • The file plugin which writes data to a file as well as passing it on to the next stage
  • The route plugin which allows multiple input channels to be “mixed” and then passed on to different output devices.

Some plugins will try to delegate their work to the hardware, ie query whether the physical device supports such functionality in hardware, and emulate it in software if necessary.

ALSA also uses the term “device” for a templated wrapper of a plugin. Usually a “device” is given a name derived from the name of the plugin it uses, but not always. The following ALSA “devices” are built on the above plugins:

  • The hw device directly maps to the hw plugin: it takes as parameter the physical (card,device) address to write data to. Obviously, this is always the last item in a pipeline.
  • The plug device wraps the plug plugin; it takes just one parameter: the next ALSA-device in the pipeline (presumably it “resamples” to match the properties supported by the next device).
  • The plughw device wraps both the plug and hw plugins; it takes one parameter: the (card,device) address needed by the hw plugin.
  • The file device wraps the file plugin, and is obviously the last item in a pipeline. It takes 2 parameters, being the filename and the format in which data should be written to the file.
  • The tee device wraps the file plugin, taking 3 parameters: the next ALSA-device in the pipeline and then the 2 parameters needed by the file plugin. As well as writing to the file, it also passes the data on.

When building a pipeline, each ALSA device is passed the name of the next ALSA-device in the pipeline as its first “parameter”.

Actual hardware is addressed as card/device/subdevice (eg the string “hw:0,0” means card 0, device 0), where a device is a functional component (ie feature) on the card.

The libasound library reads a set of config files each time an application uses libasound to open an ALSA “device”. The files are:

  • /etc/asound.conf
  • /usr/share/alsa/alsa.conf
  • /usr/share/alsa/alsa.conf.d/*
  • $HOME/.asoundrc

Entries in these files are usually merged together, although a config-file can use an exclamation-point to overwrite (ie discard existing settings for that item). In the config-files, each entry defines a “device”. The “type” attribute specifies the plugin. The “slave” attribute indicates the name of the next device in the pipeline. The result is that a user can manipulate the way that an application outputs (or inputs) sound by modifying the ALSA configuration files. When an application outputs sound to an app-specific ALSA “device”, then the user even has per-app control over sound processing for that app. However most applications simply open the default PCM device (defined by the pcm.default entry, or a pcm.!default override, in these files).
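
For example, a hypothetical ~/.asoundrc along these lines overrides the default device (note the exclamation-point) and defines a custom “dump” device built on the file plugin (the device name and file path are made up for illustration):

    # route the default device through the "plug" plugin to card 0, device 0
    pcm.!default {
        type plug
        slave.pcm "hw:0,0"
    }

    # a device that copies audio to a file while passing it on to the hardware
    pcm.dump {
        type file
        slave.pcm "plughw:0,0"
        file "/tmp/dump.raw"
        format "raw"
    }

An ALSA application told to use the “dump” device (eg “aplay -D dump something.wav”) would then have its output written to /tmp/dump.raw as well as played.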

ALSA Kernel Drivers

The “soundcore” kernel module creates the /proc/asound directory, which provides info about sound-related hardware (ALSA drivers register themselves with “soundcore”). The call to proc_mkdir which creates /proc/asound can be found in “/sound/core/info.c” (yes, the sound stuff is directly under the root of the kernel source tree, not under /drivers).

The device-nodes for ALSA drivers are traditionally found under /dev/snd:

  • control* nodes are used to switch input/output paths for devices
  • pcm* nodes are used to output and input audio (read/write operations); ioctl operations can be used to query/set the data format, etc.
  • seq is the ALSA sequencer device, used for MIDI event routing; it is a system-wide service rather than per-card, which is why its name carries no card/device info
  • timer exposes the ALSA timer interface: timing sources used for scheduling audio events (eg by the sequencer)

The “Cn” part of each device name indicates the “card”, and the “Dn” part indicates the device on that card; pcm node names additionally end in “p” (playback) or “c” (capture).
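
On a machine with a single card, the listing typically looks something like this (exact names depend on the hardware):

    $ ls /dev/snd
    controlC0  pcmC0D0c  pcmC0D0p  seq  timer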

The device-nodes are usually owned by root, and belong to the “audio” group. In simple systems, user accounts can manually be added to the “audio” group to grant them rights to access audio-related hardware. In more sophisticated systems, device rights are assigned to a single “active user” at a time by either (a) changing the group on these files, or (b) changing the membership of the audio group.

PulseAudio

Introduction

pulseaudio is a sound-system based around a daemon process that runs as some “active user” (unlike Xorg which typically runs as root). Other apps that generate sound talk to the current pulseaudio daemon (usually via a unix-domain socket). Pulseaudio then forwards sound to the /dev/snd/* devices using the same device-level API as ALSA. This central-daemon approach introduces some complexity but the daemon can merge sound from multiple applications before forwarding to the hardware, thus resolving one major limitation of ALSA.

The pulseaudio daemon is implemented as a small core of functionality plus a large set of extension modules that the user can add to the core to configure the desired behaviour (somewhat similar to the ALSA “plugins” approach). One example is that client/server communications protocols are implemented as modules.

Applications can integrate directly with pulseaudio via libpulse0. Alternatively, they can use libcanberra, which is a generic sound API supporting multiple back-ends - and of course there is a libcanberra-pulse back-end. Both these libraries support several ways of communicating with a pulseaudio daemon; the two primary ways are via a local unix-domain socket (ie a file), and via a TCP connection (ie a host:port address). Which approach is used can be defined via a configuration file or environment variable.
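
As an illustration, the “simple” API from libpulse makes basic playback short; the sketch below connects to the local pulseaudio daemon (using the normal discovery rules) and plays one second of the 440Hz tone:

    /* gcc pa_tone.c -o pa_tone $(pkg-config --cflags --libs libpulse-simple) -lm */
    #include <pulse/simple.h>
    #include <pulse/error.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        pa_sample_spec ss = { .format = PA_SAMPLE_U8, .rate = 8000, .channels = 1 };
        int err;

        /* NULL server means "use the normal discovery rules" (socket, $PULSE_SERVER, ...) */
        pa_simple *s = pa_simple_new(NULL, "tone-demo", PA_STREAM_PLAYBACK,
                                     NULL, "440Hz tone", &ss, NULL, NULL, &err);
        if (!s) { fprintf(stderr, "connect failed: %s\n", pa_strerror(err)); return 1; }

        uint8_t buf[8000];                 /* one second of a 440Hz tone */
        for (int i = 0; i < 8000; i++)
            buf[i] = (uint8_t)(128 + 127 * sin(6.28318530718 * 440.0 * i / 8000.0));

        pa_simple_write(s, buf, sizeof buf, &err);
        pa_simple_drain(s, &err);          /* wait until the daemon has played everything */
        pa_simple_free(s);
        return 0;
    }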

Integration with ALSA

There are a large number of applications that still use libasound (ie ALSA) - and for sophisticated programs, libasound is actually still a good option. As described above, ALSA applications read/write sound via named “pipelines”, and libasound builds these pipelines by reading configuration. This makes it possible to change the ALSA configuration files so that the final “sink” for a pipeline is the pulseaudio daemon itself - or the initial “source” when reading sound. The pulseaudio package usually installs extra config files under /usr/share/alsa/alsa.conf.d to rewire the default ALSA pipelines to use the “pulseaudio-alsa plugin” which reads/writes data via a unix-domain socket, just like standard pulseaudio clients do. ALSA’s config hooks are so flexible that the pipeline can even be rewired only when a process named pulseaudio is running. Running the “alsamixer” application will show the current pipeline; when pulseaudio integration is active then it will show “pulseaudio” as the “sound card”.

As noted above, ALSA kernel sound drivers are not designed to be used by multiple processes. AFAIK, the pulseaudio daemon generally opens all of them when it starts - or maybe only when actually playing sound. In either case, having pulseaudio and ALSA clients directly accessing the same drivers at the same time is not likely to end well. Theoretically, an app can open /dev/snd/* directly; if pulseaudio already has that device open then the open will fail, otherwise the app can play sounds and pulseaudio will fail (at least for that device). Alternatively, an app can use libasound and bypass the pulseaudio plugin, in which case libasound will also try to open specific /dev/snd/* devices directly - again failing if pulseaudio already has that device open, or else blocking pulseaudio from using that device.

In ALSA config files (esp. /etc/asound.conf), a block is of the form group.item {...}. Normally, blocks are merged; an exclamation-mark before the item name forces an override (ie only this block is used).

The ALSA library docs say that the “open” function can take a mode, block or nonblock, and that various functions in the library will then either block when the resource is not available, or return an error. I’m not sure whether this also applies to multiple apps opening the same device, or only to multiple threads within the same library.

There is also a direct OSS/pulseaudio integration tool: padsp (which uses $LD_PRELOAD to redirect /dev/dsp access to pulseaudio).

PulseAudio Configuration Files

pulseaudio looks for configuration files within the following directories:

  • ~/.config/pulse (formerly ~/.pulse).
  • /etc/pulse

Within these directories:

  • client.conf is for the libraries that connect to pulseaudio, eg libpulse0, libcanberra-pulse, or alsa-pulseaudio-plugin
  • daemon.conf is used to configure the background sound-daemon that runs as “the current active user”
  • default.pa is also used by the background daemon; this file contains module-configuration stuff.

PulseAudio Helper Programs

The “pactl” and “pacmd” programs talk to the pulseaudio daemon, and effectively do what could have been done via the pulseaudio config files at startup - ie add/remove modules. Changes made via pactl/pacmd are not persisted.
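
For example (using standard modules; the sink name is made up), a quick session might look like:

    # list currently loaded modules and available sinks
    pactl list short modules
    pactl list short sinks

    # load an extra module - here a null sink useful for testing - then remove it again;
    # load-module prints the index of the new module
    pactl load-module module-null-sink sink_name=test_sink
    pactl unload-module <index-printed-by-load-module>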

Client/Server communication

Server Setup

When the pulseaudio daemon starts up, it allocates a random “authentication cookie” and writes it into $HOME/.config/pulse/cookie. The daemon then initialises each module which defines a communications protocol.

The standard module “module-native-protocol-unix” allocates a unix-domain socket and listens on it for client connections. By default, the socket it creates is named $XDG_RUNTIME_DIR/pulse/native. The value of $XDG_RUNTIME_DIR is usually /run/user/{userid} (and /var/run is usually a symlink to /run).

Alternate modules exist, such as one that opens and listens on a TCP port - which allows applications on remote systems to play sound via this daemon.

By default, communications modules require the client to present a copy of the authentication cookie allocated by the daemon on startup; some communications modules support option “auth-anonymous=1” to bypass this requirement.

An x11-integration module exists which simply stores an (address, cookie) pair as a property associated with the root window of an X server; when a client is using the same X server then it can easily retrieve this information. Note however that this doesn’t help if the socket is a unix-domain socket, and the X client is remote!

Additional client/server sockets can be set up interactively via “pactl load-module module-native-protocol-unix socket=/path/to/new/socket”, ie adding another communications module to the existing set. A few other params are also available for this module; most useful is “auth-anonymous=1” (avoids needing the cookie available, but obviously not so secure!).

Client Init

By default, pulseaudio clients (presumably including the alsa-plugin) assume module-native-protocol-unix is available, and try to talk to the server via socket $XDG_RUNTIME_DIR/pulse/native. An alternate server can be specified via environment variable $PULSE_SERVER. The client always reads the “client.conf” config files as described in the section on configuration above.

When a client app has $DISPLAY available, then it connects to that X server and looks for a special property on the root window which contains an (address, cookie) pair placed there by a pulseaudio server; when present it connects to that server.

When the client config specifies “autospawn=yes” and the client cannot connect to any pulseaudio instance, then the client automatically starts the pulseaudio daemon. On systemd-based systems, systemd is instead usually configured (via socket activation) to start pulseaudio as soon as anything opens the standard pulse/native socket; the autospawn setting should therefore be set to no.
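
A client.conf combining the settings mentioned in this section might look like the following (the user id in the path is illustrative):

    # ~/.config/pulse/client.conf
    autospawn = no
    # point clients at a specific daemon socket instead of the default discovery rules
    default-server = unix:/run/user/1000/pulse/native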

As noted in the server section, the client needs to provide an “authentication cookie” unless the server created that communication channel with “auth-anonymous=1”.

See the pulseaudio FAQ for the full list of ways a client tries to connect to a server.

The command pactl list shows a lot of interesting information about the current pulseaudio daemon configuration.

paprefs allows setting of common preferences.

pavucontrol allows per-application volume control.

See also: osspd (the OSS proxy daemon) and padsp.

TODO: Are there ways to synchronize graphics and sound? Maybe not necessary, as sound doesn’t need to be keyed perfectly to 60fps graphics..

Other Sound-Related Libraries

SDL (libsdl) provides a sound API.

FMOD (libfmod) appears to also provide sound support.

gstreamer - a general multimedia pipeline framework rather than part of the sound stack itself; its audio sink elements typically output via pulseaudio or ALSA.

phonon - Qt sound API; really just a platform-independent wrapper around an underlying sound architecture

Generating Sound from a Container

When running an application (or distribution) within a Linux container, it is quite easy to get sound working:

  • use “pactl load-module …” to create a new unix-domain socket for pulseaudio (in a suitable directory). You will also need to use “auth-anonymous=1” - or to copy the current user’s ~/.config/pulse directory into the container so the cookie is available.
  • bind-mount the directory containing the pulseaudio socket into the container’s filesystem
  • in the container, set environment variable PULSE_SERVER to point to that socket
  • to support alsa-based apps within the container it is necessary to install the pulseaudio-alsa integration package (to get the pulseaudio alsa plugin), and it might be necessary to manually edit one of the ALSA config files to enable the plugin (the auto-detect might not work as the pulseaudio daemon is not in the container).

Note that it is not necessary to run a pulseaudio daemon process in the container, nor is it necessary to bind-mount /dev/snd; the client process within the container just needs to write to the pulseaudio socket.
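
Putting those steps together, a hedged sketch might look like the following; the paths are illustrative and “some-container-tool” stands in for whatever container runtime is being used:

    # on the host: create an extra pulseaudio socket for the container to use
    mkdir -p /tmp/pulse-container
    pactl load-module module-native-protocol-unix \
        auth-anonymous=1 socket=/tmp/pulse-container/native

    # start the container with that directory bind-mounted in
    # (generic "--bind" style option; the exact flag depends on the tool)
    some-container-tool run --bind /tmp/pulse-container:/run/pulse ...

    # inside the container: point clients at the socket and test
    export PULSE_SERVER=unix:/run/pulse/native
    paplay /usr/share/sounds/alsa/Front_Center.wav   # any wav file will do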
