Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 4
Chapter 1: Introduction and Overview

concept. Figure 1-1 provides a rough initial overview of the layers that make up a complete Linux system, and of some important subsystems of the kernel itself. Notice, however, that in practice the individual subsystems interact in a variety of additional ways that are not shown in the figure.

[Figure 1-1: High-level overview of the structure of the Linux kernel and the layers in a complete Linux system. Labels: Applications, C Library (userspace); System Calls, Core kernel, Device drivers, Networking, VFS, Filesystems, Memory mgmt, Process mgmt, Architecture specific code (kernel space); Hardware.]

1.3.1 Processes, Task Switching, and Scheduling

Applications, servers, and other programs running under Unix are traditionally referred to as processes. Each process is assigned an address space in the virtual memory of the CPU. The address spaces of the individual processes are totally independent, so the processes are unaware of each other: as far as each process is concerned, it has the impression of being the only process in the system. If processes want to communicate (to exchange data, for example), special kernel mechanisms must be used.

Because Linux is a multitasking system, it supports what appears to be concurrent execution of several processes. Since only as many processes as there are CPUs in the system can really run at the same time, the kernel switches (unnoticed by users) between the processes at short intervals to give the impression of simultaneous processing. Here, there are two problem areas:

1. The kernel, with the help of the CPU, is responsible for the technical details of task switching. Each individual process must be given the illusion that the CPU is always available. This is achieved by saving all state-dependent elements of the process before CPU resources are withdrawn and the process is placed in an idle state. When the process is reactivated, the exact saved state is restored. Switching between processes is known as task switching.

2. The kernel must also decide how CPU time is shared between the existing processes. Important processes are given a larger share of CPU time, less important processes a smaller share. The decision as to which process runs for how long is known as scheduling.

1.3.2 UNIX Processes

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process, which is responsible for further system initialization actions and for displaying the login prompt or (in more widespread use today) a graphical login interface. init is therefore the root from which all processes originate, more or less directly, as shown graphically
by the pstree program. init is the top of a tree structure whose branches spread further and further down.

wolfgang@meitner> pstree
init-+-acpid
     |-bonobo-activati
     |-cron
     |-cupsd
     |-2*[dbus-daemon]
     |-dbus-launch
     |-dcopserver
     |-dhcpcd
     |-esd
     |-eth1
     |-events/0
     |-gam_server
     |-gconfd-2
     |-gdm---gdm-+-X
     |           `-startkde-+-kwrapper
     |                      `-ssh-agent
     |-gnome-vfs-daemo
     |-gpg-agent
     |-hald-addon-acpi
     |-kaccess
     |-kded
     |-kdeinit-+-amarokapp---2*[amarokapp]
     |         |-evolution-alarm
     |         |-kinternet
     |         |-kio_file
     |         |-klauncher
     |         |-konqueror
     |         |-konsole---bash-+-pstree
     |         |                `-xemacs
     |         |-kwin
     |         |-nautilus
     |         `-netapplet
     |-kdesktop
     |-kgpg
     |-khelper
     |-kicker
     |-klogd
     |-kmix
     |-knotify
     |-kpowersave
     |-kscd
     |-ksmserver
     |-ksoftirqd/0
     |-kswapd0
     |-kthread-+-aio/0
     |         |-ata/0
     |         |-kacpid
     |         |-kblockd/0
     |         |-kgameportd
     |         |-khubd
     |         |-kseriod
     |         |-2*[pdflush]
     |         `-reiserfs/0
...

How this tree structure spreads is closely connected with how new processes are generated. For this purpose, Unix uses two mechanisms called fork and exec.

1. fork: Generates an exact copy of the current process that differs from the parent process only in its PID (process identification). After the system call has been executed, there are two processes in the system, both performing the same actions. The memory contents of the initial process are duplicated, at least in the view of the program. Linux uses a well-known technique called copy on write that makes the operation much more efficient by deferring the copy operations until either parent or child writes to a page; read-only accesses can be satisfied from the same page for both. A possible scenario for using fork is when a user opens a second browser window. If the corresponding option is selected, the browser executes a fork to duplicate its code and then starts the appropriate actions to build a new window in the child process.

2. exec: Loads a new program into an existing context and then executes it. The memory pages reserved by the old program are flushed, and their contents are replaced with new data. The new program then starts executing.

Threads

Processes are not the only form of program execution supported by the kernel. In addition to heavy-weight processes (another name for classical Unix processes) there are also threads, sometimes referred to as light-weight processes. Threads have also been around for some time. Essentially, a process may consist of several threads that all share the same data and resources but take different paths through the program code. The thread concept is fully integrated into many modern languages, Java for instance. In simple terms, a process can be seen as an executing program, whereas a thread is a program function or routine running in parallel to the main program. This is useful, for example, when Web browsers need to load several images in parallel. Without threads, the browser would have to execute several fork and exec calls to generate parallel instances; these would then be responsible for loading the images and making the received data available to the main program using some kind of communication mechanism. Threads make this situation easier to handle. The browser defines a routine to load images, and the routine is started as a thread with multiple strands (each with different arguments). Because the threads and the main program share the same address space, the received data automatically reside in the main program. There is therefore no need for any communication effort whatsoever, except to prevent the threads from stepping on each other's toes, for instance by accessing identical memory locations. Figure 1-2 illustrates the difference between a program with and without threads.

[Figure 1-2: Processes with and without threads. Panels: W/O Threads, With Threads. Labels: Address Space, Control Flow.]
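The contrast between fork, exec, and threads described above can be tried out from userspace. The following is a minimal sketch in Python, assuming a Linux (or other POSIX) system; the function names fork_and_modify, fork_and_exec, and thread_and_modify are hypothetical helpers introduced only for this illustration. It shows that a change made by a forked child does not appear in the parent, that exec replaces the program running in the child, and that a thread's change is directly visible to the main program.

```python
import os
import sys
import threading


def fork_and_modify(data):
    """Fork; the child appends to its own copy of data and reports it via a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it works on a private copy of the parent's memory
        # (which the kernel provides lazily via copy on write).
        try:
            os.close(read_fd)
            data.append("child")
            os.write(write_fd, ",".join(data).encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # leave at once, skipping the parent's cleanup handlers
    # Parent process: the pipe is the only channel carrying the child's data back.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_view = pipe.read().decode().split(",")
    os.waitpid(pid, 0)
    return child_view


def fork_and_exec(exit_code):
    """Fork, then replace the child's program with a fresh interpreter via exec."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable,
                     [sys.executable, "-c", f"raise SystemExit({exit_code})"])
        finally:
            os._exit(127)  # reached only if exec itself failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def thread_and_modify(data):
    """Make the same kind of change in a thread, which shares the address space."""
    worker = threading.Thread(target=data.append, args=("thread",))
    worker.start()
    worker.join()
    return data


if __name__ == "__main__":
    shared = ["parent"]
    print("child saw:", fork_and_modify(shared))
    print("parent after fork:", shared)
    print("exit code of exec'd program:", fork_and_exec(7))
    print("parent after thread:", thread_and_modify(shared))
```

The pipe in fork_and_modify stands in for the "special kernel mechanisms" that separate processes need in order to exchange data; the thread needs no such channel because it appends to the very same list object.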
Linux provides the clone method to generate threads. This works in a similar way to fork but allows precise control over which resources are shared with the parent process and which are created independently for the thread. This fine-grained distribution of resources extends the classical thread concept and allows for a more or less continuous transition between threads and processes.

Namespaces

During the development of kernel 2.6, support for namespaces was integrated into numerous subsystems. This allows different processes to have different views of the system. Traditionally, Linux (and Unix in general) uses numerous global quantities, for instance, process identifiers: Every process in the system is equipped with a unique identifier (ID), and this ID can be employed by users (or other processes) to refer to the process, by sending it a signal, for instance. With namespaces, formerly global resources are grouped differently: Every namespace can contain a specific set of PIDs, or can provide different views of the filesystem, where mounts in one namespace do not propagate into different namespaces.

Namespaces are useful, for example, to hosting providers: Instead of setting up one physical machine per customer, they can use containers implemented with namespaces to create multiple views of the system, where each seems to be a complete Linux installation from within the container and does not interact with other containers: They are separated and segregated from each other. Every instance looks like a single machine running Linux, but in fact, many such instances can operate simultaneously on a physical machine. This helps use resources more effectively. In contrast to full virtualization solutions like KVM, only a single kernel needs to run on the machine, and it is responsible for managing all containers.
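To see which namespaces a process belongs to, one can inspect the /proc filesystem. The sketch below is an illustration only and rests on an assumption that goes beyond the text: it relies on the /proc/<pid>/ns directory, which kernels considerably newer than the 2.6 series discussed here provide. The helper names namespace_ids and share_namespace are hypothetical. Two processes are in the same namespace of a given type when their links carry the same identifier.

```python
import os


def namespace_ids(pid="self"):
    """Map each namespace entry (e.g. "pid", "mnt") to its identifier link."""
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids  # no /proc, no namespace entries, or access denied
    for name in names:
        try:
            # Each entry is a symbolic link of the form "type:[inode number]".
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return ids


def share_namespace(ns_type, pid_a, pid_b):
    """Return True if both processes are in the same namespace of ns_type."""
    id_a = namespace_ids(pid_a).get(ns_type)
    id_b = namespace_ids(pid_b).get(ns_type)
    return id_a is not None and id_a == id_b


if __name__ == "__main__":
    for name, target in sorted(namespace_ids().items()):
        print(f"{name:20s} {target}")
```

On a system without these /proc entries, namespace_ids simply returns an empty dictionary, so the sketch degrades gracefully rather than failing.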
Not all parts of the kernel are yet fully aware of namespaces, and I will discuss to what extent support is available when we analyze the various subsystems.

1.3.3 Address Spaces and Privilege Levels

Before we start to discuss virtual address spaces, there are some notational conventions to fix. Throughout this book I use the abbreviations KiB, MiB, and GiB as units of size. The conventional units KB, MB, and GB are not really suitable in information technology because they represent decimal powers (10^3, 10^6, and 10^9), although the binary system is the ubiquitous basis of computing. Accordingly, KiB stands for 2^10, MiB for 2^20, and GiB for 2^30 bytes.

Because memory areas are addressed by means of pointers, the word length of the CPU determines the maximum size of the address space that can be managed. On 32-bit systems such as IA-32, PPC, and m68k, this is 2^32 bytes = 4 GiB, whereas on more modern 64-bit processors such as Alpha, Sparc64, IA-64, and AMD64, 2^64 bytes can be managed.

The maximal size of the address space is not related to how much physical RAM is actually available, and therefore it is known as the virtual address space. One more reason for this terminology is that every process in the system has the impression that it lives alone in this address space, and other processes are not present from its point of view. Applications do not need to care about other applications and can work as if they ran as the only process on the computer.

Linux divides virtual address space into two parts known as kernel space and userspace, as illustrated in Figure 1-3.
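The unit conventions and address space sizes above amount to a little arithmetic, sketched here in Python. The constant and function names (KIB, MIB, GIB, address_space_size) are chosen for this illustration only.

```python
# Binary units as used in the text: powers of two, not powers of ten.
KIB = 2**10
MIB = 2**20
GIB = 2**30


def address_space_size(word_bits):
    """Number of bytes addressable with pointers of the given width."""
    return 2**word_bits


if __name__ == "__main__":
    # A 32-bit pointer can address 2^32 bytes, which is exactly 4 GiB.
    print(address_space_size(32) // GIB, "GiB on a 32-bit system")
    # A 64-bit pointer can address 2^64 bytes, that is, 2^34 GiB.
    print(address_space_size(64) // GIB, "GiB on a 64-bit system")
    # The decimal "GB" (10^9 bytes) falls short of a GiB by this many bytes.
    print(GIB - 10**9, "bytes difference between 1 GiB and 10^9 bytes")
```

The last line makes the reason for the convention concrete: the decimal and binary interpretations of "giga" differ by roughly 7 percent, which is too much to ignore when sizing memory regions.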
[Figure 1-3: Division of virtual address space. Userspace extends from 0 to TASK_SIZE; kernel space extends from TASK_SIZE to 2^32 or 2^64, respectively.]

Every user process in the system has its own virtual address range that extends from 0 to TASK_SIZE. The area above (from TASK_SIZE to 2^32 or 2^64) is reserved exclusively for the kernel and may not be accessed by user processes. TASK_SIZE is an architecture-specific constant that divides the address space in a given ratio. In IA-32 systems, for instance, the address space is divided at 3 GiB so that the virtual address space for each process is 3 GiB; 1 GiB is available to the kernel because the total size of the virtual address space is 4 GiB. Although actual figures differ according to architecture, the general concepts do not. I therefore use these sample values in the further discussion.

This division does not depend on how much RAM is available. As a result of address space virtualization, each user process thinks it has 3 GiB of memory. The userspaces of the individual system processes are totally separate from each other. The kernel space at the top end of the virtual address space is always the same, regardless of the process currently executing.

Notice that the picture can be more complicated on 64-bit machines because these tend to use fewer than 64 bits to actually manage their huge principal virtual address space. Instead of 64 bits, they employ a smaller number, for instance, 42 or 47 bits. Because of this, the effectively addressable portion of the address space is smaller than the principal size. However, it is still larger than the amount of RAM that will ever be present in the machine, and is therefore completely sufficient. As an advantage, the CPU can save some effort because fewer bits are required to manage the effective address space than are required to address the complete virtual address space. In such cases, the virtual address space contains holes that are not addressable in principle, so the simple situation depicted in Figure 1-3 is not fully valid. We will come back to this topic in more detail in Chapter 4.

Privilege Levels

The kernel divides the virtual address space into two parts so that it is able to protect the individual system processes from each other. All modern CPUs offer several privilege levels in which processes can reside. There are various prohibitions in each level including, for example, execution of certain assembly language instructions or access to specific parts of virtual address space. The IA-32 architecture uses a system of four privilege levels that can be visualized as rings. The inner rings are able to access more functions, the outer rings fewer, as shown in Figure 1-4.

Whereas the Intel variant distinguishes four different levels, Linux uses only two different modes: kernel mode and user mode. The key difference between the two is that access to the memory area above TASK_SIZE (that is, kernel space) is forbidden in user mode. User processes are not able to manipulate or read the data in kernel space. Neither can they execute code stored there. This is the sole domain