
Overview

One of the fundamental duties of an operating-system kernel is to manage processes, ie executing applications. There are a few interesting features about processes in Linux…

Memory Namespaces and Open Files

The general definition of a process is a set of one or more threads of execution which share a set of resources. The most important resources are a virtual-memory address-space, and a set of open file-descriptors - though a Linux file-descriptor can actually be a handle to many things such as a source-of-events (eg inotify), or an open socket, in addition to simple data-files.

A thread is basically a datastructure with at least:

the address of the next instruction to be executed (ie a program-counter),
values for other CPU registers, and
a call-stack (indicating how it got to the instruction being executed, along with local-variables allocated along the way)

A kernel can “run” a thread at any time by loading the thread state into a CPU. In particular, the program-counter value then causes the appropriate instructions to be executed. A kernel can “suspend” a thread by saving the current state of a CPU into the thread’s datastructure - and then “run” some other thread instead. This is done by a kernel component called the “scheduler”.

Multiple threads associated with the same process see the same memory addresses and thus can “race” with each other when reading/writing those memory locations. It is the responsibility of the programmer to use synchronization primitives to avoid problems.

It is technically possible to share blocks of memory between threads in different processes, but it is not the default. It is also possible to share other resources such as file handles with other processes, and the Linux kernel (and the POSIX standard) provides several mechanisms to do this. A process which starts a new “child” process (via the “fork”, “clone” or related systemcall) can share some resources with the child, but unrelated processes share nothing by default.
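The difference between "threads share memory" and "processes do not" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names counter, bump, bump_in_thread and bump_in_child_process are made up for this example, and a pipe is used to report the child's view back to the parent.

```python
import os
import threading

counter = {"value": 0}  # lives in this process's address space


def bump():
    counter["value"] += 1


def bump_in_thread():
    """Run bump() in a second thread of this process; return the value seen here."""
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    return counter["value"]


def bump_in_child_process():
    """Run bump() in a forked child; return (value seen by child, value seen by parent)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: works on its own copy of the address space
        try:
            os.close(read_fd)
            bump()
            os.write(write_fd, str(counter["value"]).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    child_view = int(os.read(read_fd, 32))
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_view, counter["value"]


if __name__ == "__main__":
    print("after thread:", bump_in_thread())
    print("after fork (child, parent):", bump_in_child_process())
```

The thread's increment is visible to the caller, while the child's increment only changes the child's private copy; the parent's value is unchanged after the fork.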

Threads and Processes

On Linux, every “schedulable entity” (aka task or thread) is represented via a struct task_struct within the kernel, and each task_struct has a unique integer id which I will call the “task-id” (see later). Every task_struct also has an integer field (which I will call the “process-id”) which groups (labels) threads belonging to the same “process”. Systemcall clone creates a new task_struct with its unique task-id; one of the flags this call supports is CLONE_THREAD:

when CLONE_THREAD is set then the new task_struct has process-id=parent.process-id ie is labelled as “belonging to the same process” as the thread that created it.
when CLONE_THREAD is not set, then the new task_struct has process-id=task-id, ie is labelled as being the “primary thread” in a new process.

The “fork” and “vfork” systemcalls are simply wrappers around clone which always set CLONE_THREAD=0, ie always label the new “schedulable entity” as being a new process.

Confusingly (for historical reasons), the kernel sourcecode uses name “pid” (and datatype “pid_t”) for the unique task-id (ie unique task_struct identifier) and name “tgid” (thread-group-id) for the “process-id” while the ps command uses the name “tid” for the “task-id” and “pid” for the “process-id”. This can be seen by comparing the output of ps with the information in /proc/{N}/status for any non-primary thread in a multi-threaded process. The status-file shows task_struct.pid (ie the unique task_struct id) as “Pid:” and task_struct.tgid as “Tgid:”, while the “ps” command will (by default) output the “Tgid” value under column-heading “PID”. Actually, the “ps” command has the more sensible naming.
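The naming mismatch can be observed from inside a program. The following is a minimal sketch, assuming Python 3.8 or later on Linux with /proc mounted; the helper names read_status_ids and ids_seen_by_secondary_thread are invented for this example. It starts a secondary thread and compares what the status-file reports with what the usual library calls return.

```python
import os
import threading


def read_status_ids(pid, tid):
    """Return the (Pid, Tgid) values from /proc/<pid>/task/<tid>/status."""
    ids = {}
    with open(f"/proc/{pid}/task/{tid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("Pid", "Tgid"):
                ids[key] = int(value)
    return ids["Pid"], ids["Tgid"]


def ids_seen_by_secondary_thread():
    """Collect the task-id and the status-file ids from a non-primary thread."""
    result = {}

    def worker():
        tid = threading.get_native_id()  # the kernel's unique task-id
        result["tid"] = tid
        result["status"] = read_status_ids(os.getpid(), tid)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result


if __name__ == "__main__":
    info = ids_seen_by_secondary_thread()
    print("getpid():", os.getpid(), "task-id:", info["tid"])
    print("status file (Pid, Tgid):", info["status"])
```

For the secondary thread, “Pid:” in the status-file is the task-id, while “Tgid:” matches what getpid() returns.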

Note that a new task (task_struct) records the current thread group id (ie current process) and the parent thread group id (ie parent process), but ps and /proc do not show a parent thread id (tid). Information on exactly which thread created the current thread/task is therefore not visible there, but it is rarely relevant; for the purposes of this article only the identities of the thread groups (current and parent) are needed.

The clone systemcall has many other interesting flags that control which resources will be shared/isolated between the caller of clone (the “older” schedulable entity) and the new one created by the call to clone. This article doesn’t address any of these flags except the ones related to thread-group/parent-threadgroup/session/process-group/controlling-tty. See the man-page for the clone systemcall for full details of other flags.

It is interesting to look at the processes on a running Linux system using command

ps -e -L -o tid,pid,ppid,state,euid,egid,sid,pgid,tty,tpgid,command

where:

-e means “every process” (ie including those associated with other ttys than the one issuing the command)
-L means include “schedulable entities” where tid != pid, ie include non-primary threads
-o is followed by a list of information to print for each process
tid = the “schedulable entity” id, aka task-id or thread-id
pid = the process-id of the process (aka tgid or threadgroup-id)
ppid = the parent process id (aka parent threadgroup-id, aka tid of “primary thread” in parent thread-group)
state = the state of the process (eg running, stopped)
euid = the effective user-id associated with this process; column “ename” can be used to get the text representation instead of the number
egid = the effective primary user-group associated with this process (there can also be secondary-groups); column “egroup” can be used to get the text representation instead of the number
sid = session-id, ie the tid of the session leader
pgid = process group id, equivalent to the tid of the “process group leader” (note: related to session-management, not directly related to concept of parent process)
tty = controlling tty (terminal) name; “tname” is an alternative name for the same column
tpgid = id of the foreground process-group on the associated tty (tty-pgid), or -1 if there is no associated tty
command = the commandline of the process

Entries in the /proc filesystem provide a direct view of the “schedulable entities” tracked by the kernel scheduler using the task_struct datastructure. The “ps” command simply provides a nicely formatted/filtered view of this information; directory /proc/{N} provides access to information about tid=N, eg:

file /proc/{N}/status shows a subset of the info from the corresponding kernel task_struct (sadly, not including sid/pgid/tty/tpgid)
file /proc/{N}/stat also shows information from task_struct (including sid/pgid/tty/tpgid), but in less readable format
file /proc/{N}/statm shows some memory-consumption-related info about the task: size/resident/shared/text/0/data/0
file /proc/{N}/cmdline shows the commandline associated with the task, ie the complete argv array with NUL-separated entries
dir /proc/{N}/task contains one subdirectory per task with the same pid (ie all threads in the same thread-group, including the primary thread)
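As a rough sketch of how a tool like ps can obtain sid/pgid/tty/tpgid, the following Python reads the first few fields of /proc/{N}/stat. The function name read_job_control_fields and the dictionary keys are chosen for this example only; the field order (state, ppid, pgrp, session, tty_nr, tpgid after the parenthesised command name) is the one documented in the proc man-page.

```python
import os


def read_job_control_fields(pid="self"):
    """Parse the job-control related fields from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as stat:
        data = stat.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    fields = data[data.rindex(")") + 2:].split()
    state, ppid, pgrp, session, tty_nr, tpgid = fields[:6]
    return {
        "state": state,
        "ppid": int(ppid),
        "pgid": int(pgrp),
        "sid": int(session),
        "tty_nr": int(tty_nr),  # encoded device number, 0 if no tty
        "tpgid": int(tpgid),    # -1 if no tty
    }


if __name__ == "__main__":
    print(read_job_control_fields())
```

The values for the current process should agree with what getppid, getpgrp and getsid return.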

By default, “ps” omits entries representing secondary threads of a process, ie entries where tid!=pid. Omitting threads is normally desirable, but for this example it is interesting to include them (via option “-L”). The Xorg X server is an example of a process that has multiple threads, as are NetworkManager, rsyslogd, gdm3.

The kernel task_struct datastructure includes various fields to mark entries as belonging to various different “groups”. Usually, membership of such a group is defined by setting a task_struct field to the tid of the task_struct entry that is the “leader” (primary or first member) of the group.

For more information, see the source-code for Debian package procps, in particular file proc/readproc.[ch]. The kernel code responsible for providing the status/stat/statm files can be found in fs/proc; see functions proc_pid_status (file “status”), proc_pid_statm (file “statm”), proc_tgid_stat (file “/proc/{N}/stat”), proc_tid_stat (file “/proc/{N}/task/{N}/stat”).

Grouping Entities

The Linux kernel allows schedulable entities to be grouped in various ways. The general approach taken by Linux is to add a simple integer field to the task_struct to specify “group membership”; all entities with the same integer value in that field “belong to the same group”. The actual integer value used is usually the TID of the “primary thread” in the group.

The concept of a process has already been discussed above, and follows exactly this design: datastructure task_struct has an integer field named “tgid” which holds the TID of the “primary thread” in the thread-group. The primary thread has its own TID in this field (tgid=tid), secondary threads have their tgid set to the TID of the primary thread.

Sessions and Process Groups

This section discusses the meaning of the sid/pgid/tty/tname/tpgid fields from the ps output above.

When the primary way of interacting with a computer was via a text console, it was important to be able to switch between logical “jobs”, ie from within a “shell”, to suspend and later resume a process or set of processes. It is still moderately important, although not as critical as it once was, given graphical displays where multiple terminal windows (or tabs) can be open concurrently.

To support “job management”, two concepts were invented: sessions and process-groups (aka jobs), and support was added to the kernel for these. Despite their “primary driving use-case” for enhancing interactive text terminal management, these groups are implemented in a fairly generic way and so can potentially be used for other purposes.

The typical meaning of a “session” is a set of process-groups associated with the same terminal (tty) of which only one is active (“foreground”) at a time. A process-group is typically used to link a set of (one or more) processes which “belong together” in the sense that they should all be active (or suspended) at the same time. The first-created process-group within a session is the “leader”; this is the one that becomes “active” when the currently-active one is suspended.

As an example:

a text-mode shell is started;
the user types “yes | more” to tell the shell to start two “cooperating” child processes (aka a “job”)
the user types “ctrl-z” to “suspend the current job” (making the “leader group” active)
the user types “bg” to allow the two processes in the suspended job to resume execution as long as they don’t try to read from the terminal
the user types “top” to tell the shell to start another process
the user types “q” to tell the “top” process to exit
the user types “exit” to tell the shell to exit

This example will look something like this on the console:

$ PS1=">>" sh
>> yes | more
   ...
  ^Z  # ctrl-z; the shell displays "Stopped(SIGTSTP)"
>> bg  # the shell displays "Stopped(SIGTTOU)"
>> top
   q  # terminate top
>> exit
$

This example requires the shell and kernel to perform a number of operations. In overview, the steps are:

create session
bind session to a controlling terminal
create a process-group (in that session) and start new processes in it
handle ctrl-z by suspending one process-group and handing control of the terminal back to the shell
handle “bg” by resuming a process-group while preventing it from writing to the controlling terminal (see the SIGTTOU message)
create another process-group (in same session) and start new process (“top”) in it

The actual steps in more detail are as follows.

The shell process creates a new session when it starts, with itself in that session. It also creates a new process-group and again places itself in that group.

The shell then opens node “/dev/ttyN”; the kernel-mode tty-driver updates the task_struct to mark the session associated with the calling process as being “controlled” by the specified tty. A file-descriptor is then returned as usual.

The shell performs a blocking read of its STDIN filedescriptor, ie the tty file-descriptor opened above. The tty-driver checks whether the calling process is in a session (it is) and if so whether it is in the “active group” (it is); the read is therefore valid. However there is currently no data to read, so the calling thread is placed on a list of threads to wake when data is available, and then marked as “blocked”.

The string “yes|more” from the keyboard is processed by the tty-driver; it wakes all threads blocked on input (the shell), and the shell then receives the entered text (the read call returns with data).

The shell creates a new process-group within the same current session, and marks this new group as the “active process group” (aka “foreground group”) for the session. The shell then creates the pipe, and spawns the “yes” and “more” processes within the new process-group. The processes start running.

The shell does not go back to reading STDIN at this point: it is in the session but no longer in the session’s active (foreground) process-group, so a read would cause the tty-driver to queue a SIGTTIN for it and the shell itself would be stopped. Instead, a job-control shell waits (via waitpid) for the foreground job to exit or stop.

The user types ctrl-z, which is handled by the tty-driver. The driver queues a SIGTSTP for all processes in the currently active process-group of the session “controlled” by that specific tty; the default action for that signal is to stop the process, so the yes and more processes are marked as “not runnable”. The kernel does not choose a new foreground group by itself. The shell’s wait call returns with a status showing that the job has stopped; the shell then makes its own process-group the “active” (foreground) group again via tcsetpgrp and returns to reading from the tty. The shell is now once again ready to receive input as soon as keystrokes are entered.

The user types “bg” which causes the shell’s read operation to return that data. The shell sends a SIGCONT to every process in the relevant process-group, allowing yes/more to run again (ie making them eligible for CPU time). However their process-group is not marked as active; if any process in that group should try to read from the tty then the tty-driver would queue a SIGTTIN for it, stopping it again. This is basically the definition of a “background” process. Optionally, the same behaviour can be triggered when a background process writes to the tty (SIGTTOU); this is controlled by the terminal’s TOSTOP setting (“stty tostop”), ie the user can choose whether “background” processes should be able to scribble over the screen or not. Changing the terminal settings from the background triggers SIGTTOU regardless of TOSTOP. The message “Stopped(SIGTTOU)” in the example above indicates that exactly this is happening: the “more” process tried to write to (or reconfigure) the terminal, the tty-driver queued a SIGTTOU for it, the default action for that signal stopped the process, and the shell reported the stop.
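The stop/continue mechanics behind “bg” can be sketched without a terminal. The example below is a simplified illustration, not what any particular shell does: it uses SIGSTOP in place of the tty-generated SIGTSTP (which needs a real terminal), and the function name stop_and_continue_job is made up. It places a child in its own process-group, stops the group, observes the stop via waitpid, then resumes it with SIGCONT.

```python
import os
import signal
import time


def stop_and_continue_job():
    """Stop, resume and finally kill a one-process 'job'; return what waitpid reported."""
    pid = os.fork()
    if pid == 0:  # child: the "job"
        try:
            os.setpgid(0, 0)  # own process-group (parent does the same to avoid a race)
            while True:
                time.sleep(0.1)
        finally:
            os._exit(0)
    os.setpgid(pid, pid)

    os.killpg(pid, signal.SIGSTOP)              # stand-in for ctrl-z
    _, status = os.waitpid(pid, os.WUNTRACED)   # returns when the child stops
    stopped_by = os.WSTOPSIG(status) if os.WIFSTOPPED(status) else None

    os.killpg(pid, signal.SIGCONT)              # what "bg" or "fg" sends
    _, status = os.waitpid(pid, os.WCONTINUED)  # returns when the child resumes
    continued = os.WIFCONTINUED(status)

    os.killpg(pid, signal.SIGKILL)              # clean up the job
    _, status = os.waitpid(pid, 0)
    killed_by = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return {"stopped_by": stopped_by, "continued": continued, "killed_by": killed_by}


if __name__ == "__main__":
    print(stop_and_continue_job())
```

Signals are sent with killpg so that every process in the job is affected, which is why jobs are modelled as process-groups rather than single processes.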

Although the yes/more “job” has been allowed to resume, the “active group” is still the one including the shell, so when it tries again to read STDIN it is permitted (and blocks until data is present). When “top” is entered, the read returns with that data. The shell creates a new process-group within the same session, marks it as “active” and spawns the “top” process within that group. Top periodically polls for keyboard input (permitted, as it is in the active group) while the shell waits for the job to exit or stop rather than reading the terminal (a read from outside the active group would get it stopped with SIGTTIN).

There are now three process-groups in one session (shell, yes|more, top); one of these groups has multiple processes in it. The active group is the one containing “top”; no process in any other group within the same session can successfully read (the tty-driver would stop it with SIGTTIN if it tried). Thus when “q” is entered, it is the top process that sees it, and then exits.

When top exits, the last process in the active process-group is gone. The shell’s wait call returns, and the shell makes its own process-group (the “leader process group”) active again via tcsetpgrp.

And so on…

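The “Stopped(SIGTSTP)” and “Stopped(SIGTTOU)” messages are printed by the shell, which learns from the wait status of its children that a job has been stopped and by which signal; neither libc nor the stopped process prints them. If a child installs its own handlers for these signals then it is not stopped automatically, and no such message appears unless the handler stops the process itself.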

Creating New Sessions

A child process inherits the session-id of its parent. The systemcall setsid allows any process that is not already a process-group leader to become the “leader” of a new session, ie sets sid=tid.

The “session” feature can also be used to group sets of processes together for any desired purpose. Desktop environments (eg Gnome) typically include a “session manager” which is the first thing started, and it then starts a bunch of “core” processes such as a “per-session dbus instance”, a “per-session pulse-audio instance” and similar items. However because any process can reset its session-id, ensuring that all child-processes of a particular parent can be found regardless of what they do is best achieved via the control group subsystem instead (a much more modern and currently Linux-only feature).

As usual with “groups” of processes, the group identifier (sid) is equal to the TID of the “primary” schedulable entity in the group, ie the setsid call updates the caller’s task_struct so that sid=tid.
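A minimal sketch of the effect of setsid, written in Python: a forked child calls setsid and reports its ids back over a pipe. The function name ids_after_setsid and the report format are invented for this illustration.

```python
import os


def ids_after_setsid():
    """Fork a child that calls setsid(); return the ids it reports plus the parent's sid."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: not a process-group leader, so setsid is allowed
        try:
            os.close(read_fd)
            os.setsid()
            report = f"{os.getpid()} {os.getsid(0)} {os.getpgrp()}"
            os.write(write_fd, report.encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    child_pid, child_sid, child_pgid = map(int, os.read(read_fd, 64).split())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return {
        "forked_pid": pid,
        "child_pid": child_pid,
        "child_sid": child_sid,
        "child_pgid": child_pgid,
        "parent_sid": os.getsid(0),
    }


if __name__ == "__main__":
    print(ids_after_setsid())
```

After the call, the child's session-id and process-group-id both equal its own id, and its session differs from the parent's.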

In a multi-threaded process, a call to setsid from any thread affects all threads in the process, because session membership is an attribute of the process (thread-group) rather than of an individual thread. Doing that is probably poor style, though; it is better to call setsid early after fork, before spawning threads.

Creating New Process Groups

The pgid field (ie forming groups of processes) is usually used for a “pipeline” of processes, ie a sequence of processes joined via pipes such as “ls | grep foo | more”. In this case, the first process is normally the “group leader” and each of these processes has pgid = tid of the group-leader process. Note that these processes are normally “siblings” (ie all are children of a common parent, usually a shell).

Systemcall setpgid can be used to create a new group, ie update the task_struct of the caller so that pgid=tid. Interestingly (and unlike setsid), this call can also be used to move an existing process from its current process-group into another existing process-group (though the target group must be within the same session). Also, systemcall setsid automatically creates a new process-group. See the man-pages for more details.
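The way a shell might group the members of a pipeline can be sketched as follows. This is a simplified illustration, not taken from any real shell: spawn_pipeline_group is a hypothetical helper, the children merely block on a pipe instead of running real commands, and only the parent calls setpgid (real shells usually also call it in the child to avoid a race with exec).

```python
import os


def spawn_pipeline_group(count=2):
    """Fork `count` children into one new process-group led by the first child.

    Returns (pids, pgids) as observed by the parent while the children are alive.
    """
    release_r, release_w = os.pipe()
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:  # child: wait until the parent closes the pipe
            try:
                os.close(release_w)
                os.read(release_r, 1)
            finally:
                os._exit(0)
        leader = pids[0] if pids else pid
        os.setpgid(pid, leader)  # first child creates the group, others join it
        pids.append(pid)
    pgids = [os.getpgid(pid) for pid in pids]
    os.close(release_w)  # EOF on the pipe lets every child exit
    for pid in pids:
        os.waitpid(pid, 0)
    os.close(release_r)
    return pids, pgids


if __name__ == "__main__":
    print(spawn_pipeline_group(3))
```

Every child ends up with pgid equal to the id of the first child, which is therefore the group leader; the parent stays in its own group.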

The first-created process within a process-group is the group “leader”.

A shell must keep track of all the process-groups it creates within its own session, and the command “jobs” will output one line per process group.

Tracking the Foreground Group

The tpgid field specifies which of the process-groups in the tty’s associated session is currently active (aka “has focus” or “is in foreground”). Threads of any process in that group are allowed to read keystrokes from that terminal and write characters to it; a process in any other group of the session is stopped (SIGTTIN) when it tries to read from the terminal, and is also stopped when it tries to write (SIGTTOU) if the terminal’s TOSTOP flag is set. Library function tcsetpgrp (implemented as an ioctl on the tty) is used by the shell to specify the process-group with “focus” when a new process-group is started, or “fg” is run on an existing one.
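A process can query the foreground group of its own controlling terminal with tcgetpgrp, the counterpart of tcsetpgrp. The sketch below assumes that /dev/tty refers to the controlling terminal (as it does on Linux) and treats any failure to open it as "no controlling terminal"; the function names are made up for this example.

```python
import os


def foreground_group_of_controlling_tty():
    """Return the foreground process-group id of the controlling tty, or None."""
    try:
        fd = os.open("/dev/tty", os.O_RDONLY | os.O_NOCTTY)
    except OSError:
        return None  # no controlling terminal (eg daemon, cron job)
    try:
        return os.tcgetpgrp(fd)
    except OSError:
        return None
    finally:
        os.close(fd)


def in_foreground():
    """True if this process belongs to the terminal's foreground process-group."""
    foreground = foreground_group_of_controlling_tty()
    return foreground is not None and foreground == os.getpgrp()


if __name__ == "__main__":
    print("tpgid:", foreground_group_of_controlling_tty())
    print("in foreground:", in_foreground())
```

Run directly from an interactive shell this would normally report that the process is in the foreground; started with a trailing “&” it normally would not.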

The Controlling TTY

The expression “controlling TTY of a process” is sometimes used when talking about this kind of “job control”. This indicates that the tty driver controls the processes, in the sense of sending signals to them when keystrokes like “ctrl-z” are pressed, or telling the scheduler to mark them as unrunnable when a read() call is made while the associated process group is not active.

As described earlier, opening a /dev/ttyN file invokes the “open” logic of the kernel tty-driver which then (optionally) marks the session associated with the caller as being “controlled” by that TTY. The O_NOCTTY flag can be passed to open() to prevent that behaviour if desired. As mentioned earlier, setsid can be used to clear the controlling-tty for a threadgroup (process).

Using “setsid” to start a new session “clears the controlling tty” in the sense that the tty driver will no longer send signals to that process-group. However the process may still have STDIN/STDOUT file-descriptors pointing to that tty, and so can still read/write that device. If a “non-controlled” process tries to read data from the tty, then there will simply be a race with other readers; exactly one reader will get the keystrokes. Similarly if multiple processes in the active process group try to read from the TTY, they will race against each other; the job-control mechanism only guarantees exclusive access to one process group within the session, not to a single process.

The documentation for setsid specifies that the call:

fails if the caller is a process-group-leader - presumably because the group would otherwise be left without a leader
makes the caller a process-group-leader - because every group needs a leader
clears the “controlling terminal” - because the tty-driver is arbitrating exclusive-access for that tty between the process-groups in the previous session; it doesn’t make sense for it to do so for two sessions at a time.
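The first rule in the list above is easy to check. In this sketch (the function name setsid_as_group_leader is invented for illustration) a child first makes itself a process-group leader with setpgid and then attempts setsid, reporting the outcome over a pipe.

```python
import errno
import os


def setsid_as_group_leader():
    """Return 'ok' or the errno name produced when a group leader calls setsid()."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            os.setpgid(0, 0)  # become leader of a new process-group
            try:
                os.setsid()
                outcome = "ok"
            except OSError as exc:
                outcome = errno.errorcode.get(exc.errno, str(exc.errno))
            os.write(write_fd, outcome.encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    outcome = os.read(read_fd, 32).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return outcome


if __name__ == "__main__":
    print(setsid_as_group_leader())
```

The call is rejected with EPERM, which is why daemons conventionally fork once before calling setsid: the freshly forked child is never a group leader.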

As already noted, setsid will not close any file-descriptors, ie the caller may still have open file-descriptors pointing to the same tty - it just isn’t “controlled” by the tty any more, in the sense of being automatically signalled/suspended when it reads/writes that filedescriptor.
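Both halves of that statement can be sketched in Python. Here a pipe stands in for an inherited terminal file-descriptor (an assumption made so the example also works without a terminal), and the function name controlling_tty_after_setsid is made up. The child calls setsid, then tries to open /dev/tty, which refers to the controlling terminal of the caller.

```python
import os


def controlling_tty_after_setsid():
    """Return 'has-ctty' or 'no-ctty' as observed by a child that called setsid()."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            os.setsid()  # new session, no controlling terminal
            try:
                os.close(os.open("/dev/tty", os.O_RDWR))
                answer = b"has-ctty"
            except OSError:
                answer = b"no-ctty"
            # The inherited descriptor is unaffected by setsid and still works.
            os.write(write_fd, answer)
        finally:
            os._exit(0)
    os.close(write_fd)
    answer = os.read(read_fd, 32).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer


if __name__ == "__main__":
    print(controlling_tty_after_setsid())
```

The open of /dev/tty fails because the new session has no controlling terminal, yet the descriptor inherited from the parent remains usable.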

In the output of a “ps” command, the tty column indicates which “controlling terminal/tty” the task is associated with. A new task always inherits the tty field of its parent process (if any). A process is “disconnected” from the inherited tty when it calls setsid, or when its session-leader terminates.

Relevant Kernel Data Structures

The task_struct structure definition in the kernel (include/linux/sched.h) has the following datafields relevant for this article:

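struct list_head tasks – links this task into the kernel’s global list of processes (thread-group leaders); it is not the per-process thread list shown under /proc/N/task, for which see thread_group/thread_node below. There is a separate “children” property elsewhere..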
unsigned long jobctl – a bitmask of JOBCTL_* flags defined in the same file
pid_t pid – the “schedulable entity id” aka TID
pid_t tgid – the group of threads forming a process
task_struct parent – the “creating thread” or the “primary thread” in the parent process?
list_head children – pointers to child schedulable entities created with clone or fork
list_head sibling
task_struct group_leader – threadgroup leader??
pid_link pids[PIDTYPE_MAX] ??
list_head thread_group
list_head thread_node
u32 parent_exec_id
u32 self_exec_id
struct signal_struct *signal

A “signal_struct” is shared by all threads within a process (ie all task_structs with the same tgid) - ie holds “per-process” (aka per-threadgroup) information. Settings related to “signal handling” do fall into this category (hence the name) but this structure also holds many fields not related to signals at all. The following fields are relevant for this article:

int leader – boolean: is this thread-group the session group leader?
tty_struct *tty – which tty is this thread-group associated with?

The signal_struct has a boolean “session group leader” flag, and a “struct tty_struct *tty” field.

References

API documentation:

man 2 clone
man 2 fork
man 2 setpgid
man 2 prctl
man 7 credentials - has a good discussion of group and session ids
