Mauerer runc01.tex V2 - 09/04/2008 4:13pm Page 9

Chapter 1: Introduction and Overview

... of the kernel. This mechanism prevents processes from interfering with each other by unintentionally influencing each other's data.

Figure 1-4: Ring system of privilege levels (rings 0-3 on IA-32; Linux uses only kernel mode and user mode, with fewer privileges in user mode).

The switch from user to kernel mode is made by means of special transitions known as system calls; these are executed differently depending on the system. If a normal process wants to carry out any kind of action affecting the entire system (e.g., manipulating I/O devices), it can do this only by issuing a request to the kernel with the help of a system call. The kernel first checks whether the process is permitted to perform the desired action and then performs the action on its behalf. A return is then made to user mode.

Besides executing code on behalf of a user program, the kernel can also be activated by asynchronous hardware interrupts, and is then said to run in interrupt context. The main difference from running in process context is that the userspace portion of the virtual address space must not be accessed. Because interrupts occur at random times, a random userland process is active when an interrupt occurs, and since that process will most likely be unconnected with the cause of the interrupt, the kernel has no business with the contents of the current userspace.
When operating in interrupt context, the kernel must be more cautious than normal; for instance, it must not go to sleep. This requires extra care when writing interrupt handlers and is discussed in detail in Chapter 2. An overview of the different execution contexts is given in Figure 1-5.

Besides normal processes, there can also be kernel threads running on the system. Kernel threads are also not associated with any particular userspace process, so they also have no business dealing with the user portion of the address space. In many other respects, kernel threads behave much more like regular userland applications, though: in contrast to a kernel operating in interrupt context, they may go to sleep, and they are also tracked by the scheduler like every regular process in the system. The kernel uses them for various purposes that range from data synchronization of RAM and block devices to helping the scheduler distribute processes among CPUs, and we will frequently encounter them in the course of this book.

Notice that kernel threads can be easily identified in the output of ps because their names are placed inside brackets:

wolfgang@meitner> ps fax
  PID TTY      STAT   TIME COMMAND
    2 ?        S<     0:00 [kthreadd]
    3 ?        S<     0:00  \_ [migration/0]
    4 ?        S<     0:00  \_ [ksoftirqd/0]
    5 ?        S<     0:00  \_ [migration/1]
    6 ?        S<     0:00  \_ [ksoftirqd/1]
    7 ?        S<     0:00  \_ [migration/2]
    8 ?        S<     0:00  \_ [ksoftirqd/2]
    9 ?        S<     0:00  \_ [migration/3]
   10 ?        S<     0:00  \_ [ksoftirqd/3]
   11 ?        S<     0:00  \_ [events/0]
   12 ?        S<     0:00  \_ [events/1]
   13 ?        S<     0:00  \_ [events/2]
   14 ?        S<     0:00  \_ [events/3]
   15 ?        S<     0:00  \_ [khelper]
  ...
15162 ?        S<     0:00  \_ [jfsCommit]
15163 ?        S<     0:00  \_ [jfsSync]

Figure 1-5: Execution in kernel and user mode. Most of the time, the CPU executes code in userspace. When the application performs a system call, a switch to kernel mode is employed, and the kernel fulfills the request. During this, it may access the user portion of the virtual address space. After the system call completes, the CPU switches back to user mode. A hardware interrupt also triggers a switch to kernel mode, but this time, the userspace portion must not be accessed by the kernel.

On multiprocessor systems, many kernel threads are started on a per-CPU basis and are restricted to run on only one specific processor. This is represented by a slash and the number of the CPU appended to the name of the kernel thread.

Virtual and Physical Address Spaces

In most cases, a single virtual address space is bigger than the physical RAM available to the system. And the situation does not improve when each process has its own virtual address space. The kernel and CPU must therefore consider how the physical memory actually available can be mapped onto virtual address areas.

The preferred method is to use page tables to allocate virtual addresses to physical addresses. Whereas virtual addresses relate to the combined user and kernel space of a process, physical addresses are used to address the RAM actually available. This principle is illustrated in Figure 1-6.
The virtual address spaces of both processes shown in the figure are divided into portions of equal size by the kernel. These portions are known as pages. Physical memory is also divided into pages of the same size.
Figure 1-6: Virtual and physical addresses (the virtual address spaces of processes A and B mapped onto RAM page frames).

The arrows in Figure 1-6 indicate how the pages in the virtual address spaces are distributed across the physical pages. For example, virtual page 1 of process A is mapped to physical page 4, while virtual page 1 of process B is mapped to the fifth physical page. This shows that virtual addresses change their meaning from process to process.

Physical pages are often called page frames. In contrast, the term page is reserved for pages in virtual address space.

Mapping between virtual address spaces and physical memory also enables the otherwise strict separation between processes to be lifted. Our example includes a page frame explicitly shared by both processes. Page 5 of A and page 1 of B both point to the physical page frame 5. This is possible because entries in both virtual address spaces (albeit at different positions) point to the same page. Since the kernel is responsible for mapping virtual address space to physical address space, it is able to decide which memory areas are to be shared between processes and which are not.

The figure also shows that not all pages of the virtual address spaces are linked with a page frame. This may be either because the pages are not used or because the data have not yet been loaded into memory because they are not yet needed. It may also be that the page has been swapped out onto hard disk and will be swapped back in when needed.

Finally, notice that there are two equivalent terms to address the applications that run on behalf of the user. One of them is userland, and this is the nomenclature typically preferred by the BSD community for all things that do not belong to the kernel. The alternative is to say that an application runs in userspace.
It should be noted that the term userland will always mean applications as such, whereas the term userspace denotes not only applications, but additionally the portion of the virtual address space in which they are executed, in contrast to kernel space.

1.3.4 Page Tables

Data structures known as page tables are used to map virtual address space to physical address space. The easiest way of implementing the association between both would be to use an array containing an entry for each page in virtual address space. This entry would point to the associated page frame. But there is a problem. The IA-32 architecture uses, for example, 4 KiB pages; given a virtual address space of 4 GiB, this would produce an array with a million entries. On 64-bit architectures, the situation is much worse. Because each process needs its own page tables, this approach is impractical because the entire RAM of the system would be needed to hold the page tables.
As most areas of virtual address spaces are not used and are therefore not associated with page frames, a far less memory-intensive model that fulfills the same purpose can be used: multilevel paging.

To reduce the size of page tables and to allow unneeded areas to be ignored, the architectures split each virtual address into multiple parts, as shown in Figure 1-7 (the bit positions at which the address is split differ according to architecture, but this is of no relevance here). In the example, I use a split of the virtual address into four components, and this leads to a three-level page table. This is what most architectures offer. However, some employ four-level page tables, and Linux also adopts four levels of indirection. To simplify the picture, I stick to a three-level variant here.

Figure 1-7: Splitting a virtual address (the PGD, PMD, and PTE fields index the page global directory, page middle directory, and page table in turn; the offset then selects a byte within the page frame).

The first part of the virtual address is referred to as a page global directory, or PGD. It is used as an index in an array that exists exactly once for each process. Its entries are pointers to the start of further arrays called page middle directories, or PMDs.

Once the corresponding array has been found by reference to the PGD and its contents, the PMD is used as an index for the array. The page middle directory likewise consists of pointers to further arrays known as page tables or page directories.

The PTE (or page table entry) part of the virtual address is used as an index to the page table. Mapping between virtual pages and page frames is achieved because the page table entries point to page frames.

The last part of the virtual address is known as an offset. It is used to specify a byte position within the page; after all, each address points to a uniquely defined byte in address space.
A particular feature of page tables is that no page middle tables or page tables need be created for areas of virtual address space that are not needed. This saves a great deal of RAM as compared to the single-array method.

Of course, this method also has a downside. Each time memory is accessed, it is necessary to run through the entire chain to obtain the physical address from the virtual address. CPUs try to speed up this process in two ways:

1. A special part of the CPU known as a memory management unit (MMU) is optimized to perform referencing operations.
2. The addresses that occur most frequently in address translation are held in a fast CPU cache called a Translation Lookaside Buffer (TLB). Translation is accelerated because the address data in the cache are immediately available without needing to access the page tables and therefore the RAM.

While caches are operated transparently on many architectures, some require special attention from the kernel, which especially implies that their contents must be invalidated whenever the contents of the page tables have been changed. Corresponding calls must be present in every part of the kernel that manipulates page tables. If the kernel is compiled for an architecture that does not require such operations, it automatically ensures that the calls are represented by do-nothing operations.

Interaction with the CPU

The IA-32 architecture uses a two-level-only method to map virtual addresses to physical addresses. The size of the address space in 64-bit architectures (Alpha, Sparc64, IA-64, etc.) mandates a three-level or four-level method, and the architecture-independent part of the kernel always assumes a four-level page table. The architecture-dependent code of the kernel for two- and three-level CPUs must therefore emulate the missing levels with dummy page tables. Consequently, the remaining memory management code can be implemented independently of the CPU used.

Memory Mappings

Memory mappings are an important means of abstraction. They are used at many points in the kernel and are also available to user applications. Mapping is the method by which data from an arbitrary source are transferred into the virtual address space of a process. The address space areas in which mapping takes place can be processed using normal methods in the same way as regular memory. However, any changes made are transferred automatically to the original data source.
This makes it possible to use identical functions to process totally different things. For example, the contents of a file can be mapped into memory. A process then need only read the contents of memory to access the contents of the file, or write changes to memory in order to modify the contents of the file. The kernel automatically ensures that any changes made are implemented in the file.

Mappings are also used directly in the kernel when implementing device drivers. The input and output areas of peripheral devices can be mapped into virtual address space; reads and writes to these areas are then redirected to the devices by the system, thus greatly simplifying driver implementation.

1.3.5 Allocation of Physical Memory

When it allocates RAM, the kernel must keep track of which pages have already been allocated and which are still free in order to prevent two processes from using the same areas in RAM. Because memory allocation and release are very frequent tasks, the kernel must also ensure that they are completed as quickly as possible. The kernel can allocate only whole page frames. Dividing memory into smaller portions is delegated to the standard library in userspace. This library splits the page frames received from the kernel into smaller areas and allocates memory to the processes.