Linux Kernel Internals
6. The BIOS Bootstrap Loader function is invoked via int 0x19 with %dl containing the boot device 'drive number'. This loads track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).

1.4 Booting: bootsector and setup

The bootsector used to boot the Linux kernel could be either:

• Linux bootsector, arch/i386/boot/bootsect.S
• LILO (or other bootloader's) bootsector
• No bootsector (loadlin etc)

We consider here the Linux bootsector in detail. The first few lines initialize the convenience macros to be used for segment values:

    29 SETUPSECS = 4               /* default nr of setup-sectors */
    30 BOOTSEG   = 0x07C0          /* original address of boot-sector */
    31 INITSEG   = DEF_INITSEG     /* we move boot here - out of the way */
    32 SETUPSEG  = DEF_SETUPSEG    /* setup starts here */
    33 SYSSEG    = DEF_SYSSEG      /* system loaded at 0x10000 (65536) */
    34 SYSSIZE   = DEF_SYSSIZE     /* system size: # of 16-byte clicks */

(the numbers on the left are the line numbers of the bootsect.S file)

The values of DEF_INITSEG, DEF_SETUPSEG, DEF_SYSSEG and DEF_SYSSIZE are taken from include/asm/boot.h:

    /* Don't touch these, unless you really know what you're doing. */
    #define DEF_INITSEG     0x9000
    #define DEF_SYSSEG      0x1000
    #define DEF_SETUPSEG    0x9020
    #define DEF_SYSSIZE     0x7F00

Now, let us consider the actual code of bootsect.S:

    54         movw    $BOOTSEG, %ax
    55         movw    %ax, %ds
    56         movw    $INITSEG, %ax
    57         movw    %ax, %es
    58         movw    $256, %cx
    59         subw    %si, %si
    60         subw    %di, %di
    61         cld
    62         rep
    63         movsw
    64         ljmp    $INITSEG, $go
    65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
    66 # wouldn't have to worry about this if we checked the top of memory. Also
    67 # my BIOS can be configured to put the wini drive tables in high memory
    68 # instead of in the vector table. The old stack might have clobbered the
    69 # drive table.
    70 go:     movw    $0x4000-12, %di         # 0x4000 is an arbitrary value >=
    71                                         # length of bootsect + length of
    72                                         # setup + room for stack;
    73                                         # 12 is disk parm size.
    74         movw    %ax, %ds                # ax and es already contain INITSEG
    75         movw    %ax, %ss
    76         movw    %di, %sp                # put stack at INITSEG:0x4000-12.

Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000. This is achieved by:

1. set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
2. set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
3. set the number of 16-bit words in %cx (256 words = 512 bytes = 1 sector)
4. clear the DF (direction) flag in EFLAGS to auto-increment addresses (cld)
5. go ahead and copy 512 bytes (rep movsw)

This code intentionally does not use "rep movsd" (hint - .code16).

Line 64 jumps to the label "go:" in the newly made copy of the bootsector, i.e. in the segment 0x9000. This and the following three instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-12, i.e. %ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-12; the 12 is decimal). This is where the limit on setup size that we mentioned earlier comes from (see Building the Linux Kernel Image).

Lines 77-103 patch the disk parameter table for the first disk to allow multi-sector reads:

    77 # Many BIOS's default disk parameter tables will not recognize
    78 # multi-sector reads beyond the maximum sector number specified
    79 # in the default diskette parameter tables - this may mean 7
    80 # sectors in some cases.
    81 #
    82 # Since single sector reads are slow and out of the question,
    83 # we must take care of this by creating new parameter tables
    84 # (for the first disk) in RAM. We will set the maximum sector
    85 # count to 36 - the most we will encounter on an ED 2.88.
    86 #
    87 # High doesn't hurt. Low does.
    88 #
    89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
    90 # and gs is unused.
    91         movw    %cx, %fs                # set fs to 0
    92         movw    $0x78, %bx              # fs:bx is parameter table address
    93         pushw   %ds
    94         ldsw    %fs:(%bx), %si          # ds:si is source
    95         movb    $6, %cl                 # copy 12 bytes
    96         pushw   %di                     # di = 0x4000-12.
    97         rep                             # don't need cld -> done on line 66
    98         movsw
    99         popw    %di
    100        popw    %ds
    101        movb    $36, 0x4(%di)           # patch sector count
    102        movw    %di, %fs:(%bx)
    103        movw    %es, %fs:2(%bx)

The floppy disk controller is reset using BIOS service int 0x13, function 0 "reset FDC", and the setup sectors are loaded immediately after the bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using BIOS service int 0x13, function 2 "read sector(s)". This happens during lines 107-124:

    107 load_setup:
    108        xorb    %ah, %ah                # reset FDC
    109        xorb    %dl, %dl
    110        int     $0x13
    111        xorw    %dx, %dx                # drive 0, head 0
    112        movb    $0x02, %cl              # sector 2, track 0
    113        movw    $0x0200, %bx            # address = 512, in INITSEG
    114        movb    $0x02, %ah              # service 2, "read sector(s)"
    115        movb    setup_sects, %al        # (assume all on head 0, track 0)
    116        int     $0x13                   # read it
    117        jnc     ok_load_setup           # ok - continue
    118        pushw   %ax                     # dump error code
    119        call    print_nl
    120        movw    %sp, %bp
    121        call    print_hex
    122        popw    %ax
    123        jmp     load_setup
    124 ok_load_setup:

If loading failed for some reason (bad floppy, or someone pulled the diskette out during the operation) then we dump the error code and retry in an endless loop. The only way to get out of it is to reboot the machine, unless a retry succeeds, but usually it doesn't (if something is wrong it will only get worse).

If loading setup_sects sectors of setup code succeeded we jump to the label "ok_load_setup:".

Then we proceed to load the compressed kernel image at physical address 0x10000. This is done to preserve the firmware data areas in low memory (0-64K). After the kernel is loaded we jump to $SETUPSEG:0 (arch/i386/boot/setup.S). Once that data is no longer needed (e.g. no more calls to BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000 (physical addresses, of course). This is done by setup.S, which sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c}. This sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it.
Note that the old bootloaders (old versions of LILO) could only load the first 4 sectors of setup, so there is code in setup to load the rest of itself if needed. Also, the code in setup has to take care of various combinations of loader type/version vs zImage/bzImage and is therefore highly complex.

Let us examine the kludge in the bootsector code that allows a big kernel, also known as "bzImage", to be loaded. The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls BIOS to move data from low to high memory. This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the bootsector can use the lcall instruction to jump to it (inter-segment jump). The reason why it is in setup.S is simply because there is no more space left in bootsect.S (which is strictly not true - there are approx 4 spare bytes and at least 1 spare byte in bootsect.S but that is not enough, obviously). This routine uses BIOS service int 0x15 (ax=0x8700) to move to high memory and resets %es to always point to 0x10000, so that the code in bootsect.S doesn't run out of low memory when copying data from disk.

1.5 Using LILO as a bootloader

There are several advantages in using a specialized bootloader (LILO) over a bare-bones Linux bootsector:

1. Ability to choose between multiple Linux kernels or even multiple OSes.
2. Ability to pass kernel command line parameters (there is a patch called BCP that adds this ability to bare-bones bootsector+setup).
3. Ability to load much larger bzImage kernels - up to 2.5M vs 1M.

Old versions of LILO (v17 and earlier) could not load bzImage kernels. The newer versions (as of a couple of years ago or earlier) use the same technique as bootsect+setup of moving data from low into high memory by means of BIOS services. Some people (Peter Anvin notably) argue that zImage support should be removed. The main reason (according to Alan Cox) it stays is that there are apparently some broken BIOSes that make it impossible to boot bzImage kernels while loading zImage ones fine.

The last thing LILO does is to jump to setup.S and things proceed as normal.

1.6 High level initialisation

By "high-level initialisation" we consider anything which is not directly related to bootstrap, even though parts of the code to perform this are written in asm, namely arch/i386/kernel/head.S which is the head of the uncompressed kernel. The following steps are performed:

1. initialises segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18)
2. initialises page tables
3. enables paging by setting the PG bit in %cr0
4. zero-cleans BSS (on SMP, only the first CPU does this)
5. copies the first 2k of bootup parameters (kernel commandline)
6. checks CPU type using EFLAGS and, if possible, cpuid, able to detect 386 and higher
7. the first CPU calls start_kernel(), all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.

The init/main.c:start_kernel() is written in C and does the following:

1. takes a global kernel lock (it is needed so that only one CPU goes through initialisation)
2. performs arch-specific setup (memory layout analysis, copying boot command line again, etc.)
3. prints the Linux kernel "banner" containing the version, compiler used to build it etc. to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by "cat /proc/version".
4. initialises traps
5. initialises irqs
6. initialises data required for the scheduler
7. initialises time keeping data
8. initialises the softirq subsystem
9. parses boot commandline options
10. initialises the console
11. if module support was compiled into the kernel, initialises the dynamic module loading facility
12. if a "profile=" command line option was supplied, initialises profiling buffers
13. kmem_cache_init(), initialises most of the slab allocator
14. enables interrupts
15. calculates the BogoMips value for this CPU
16. calls mem_init() which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line
17. kmem_cache_sizes_init(), finishes slab allocator initialisation
18. initialises data structures used by procfs
19. fork_init(), creates uid_cache, initialises max_threads based on the amount of memory available and configures RLIMIT_NPROC for init_task to be max_threads/2
20. creates various slab caches needed for VFS, VM, buffer cache etc
21. if System V IPC support is compiled in, initialises the IPC subsystem. Note that for System V shm this includes mounting an internal (in-kernel) instance of the shmfs filesystem
22. if quota support is compiled into the kernel, creates and initialises a special slab cache for it
23. performs arch-specific "check for bugs" and, whenever possible, activates workarounds for processor/bus/etc bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly
24. sets a flag to indicate that a schedule should be invoked at the "next opportunity" and creates a kernel thread init() which execs execute_command if supplied via the "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panics with a suggestion to use the "init=" parameter
25. goes into the idle loop; this is an idle thread with pid=0

An important thing to note here is that the init() kernel thread calls do_basic_setup(), which in turn calls do_initcalls(), which goes through the list of functions registered by means of the __initcall or module_init() macros and invokes them. These functions either do not depend on each other or their dependencies have been manually fixed by the link order in the Makefiles. This means that, depending on the position of directories in the trees and the structure of the Makefiles, the order in which initialisation functions are invoked can change. Sometimes this is important, because you can imagine two subsystems A and B with B depending on some initialisation done by A. If A is compiled statically and B is a module then B's entry point is guaranteed to be invoked after A prepared all the necessary environment. If A is a module, then B is also necessarily a