Linux Kernel Internals
module so there are no problems. But what if both A and B are statically linked into the kernel? The order in which they are invoked depends on the relative entry point offsets in the ".initcall.init" ELF section of the kernel image. Rogier Wolff proposed introducing a hierarchical "priority" infrastructure whereby modules could let the linker know in what (relative) order they should be linked, but so far there are no patches available that implement this in a sufficiently elegant manner to be acceptable into the kernel. Therefore, make sure your link order is correct. If, in the example above, A and B work fine when compiled statically once, they will always work, provided they are listed sequentially in the same Makefile. If they don't work, change the order in which their object files are listed.

Another thing worthy of note is Linux's ability to execute an "alternative init program" by means of passing "init=" on the boot commandline. This is useful for recovering from an accidentally overwritten "/sbin/init" or for debugging the initialisation (rc) scripts and /etc/inittab by hand, executing them one at a time.

1.7 SMP Bootup on x86

On SMP, the BP goes through the normal sequence of bootsector, setup etc. until it reaches start_kernel(), and then goes on to smp_init() and especially arch/i386/kernel/smpboot.c:smp_boot_cpus(). smp_boot_cpus() loops over each apicid (up to NR_CPUS) and calls do_boot_cpu() on it. What do_boot_cpu() does is create (i.e. fork_by_hand) an idle task for the target cpu and write the eip of the trampoline code found in trampoline.S to the well-known locations defined by the Intel MP spec (0x467/0x469). Then it generates a STARTUP IPI to the target cpu, which makes this AP execute the code in trampoline.S.

The boot CPU creates a copy of the trampoline code for each CPU in low memory. The AP code writes a magic number in its own code, which is verified by the BP to make sure that the AP is executing the trampoline code.
The requirement that the trampoline code must be in low memory is enforced by the Intel MP specification.

The trampoline code simply sets the %bx register to 1, enters protected mode and jumps to startup_32, which is the main entry to arch/i386/kernel/head.S.

Now the AP starts executing head.S and, discovering that it is not a BP, skips the code that clears BSS and then enters initialise_secondary(), which just enters the idle task for this CPU. Recall that init_tasks[cpu] was already initialised by the BP executing do_boot_cpu(cpu).

Note that init_task can be shared but each idle thread must have its own TSS, which is why init_tss[NR_CPUS] is an array.

1.8 Freeing initialisation data and code

When the operating system initialises itself, most of the code and data structures used are never needed again. Most operating systems (BSD, FreeBSD etc.) cannot dispose of this unneeded information, thus wasting precious physical kernel memory. The excuse they use (see McKusick's 4.4BSD book) is that "the relevant code is spread around various subsystems and so it is not feasible to free it". Linux, of course, cannot use such excuses because under Linux "if something is possible in principle, then it is already implemented or somebody is working on it".
So, as I said earlier, the Linux kernel can only be compiled as an ELF binary, and now we find out the reason (or one of the reasons) for that. The reason related to throwing away initialisation code/data is that Linux provides two macros to be used:

  • __init - for initialisation code
  • __initdata - for data

These evaluate to gcc attribute specifiers (also known as "gcc magic") as defined in include/linux/init.h:

        #ifndef MODULE
        #define __init        __attribute__ ((__section__ (".text.init")))
        #define __initdata    __attribute__ ((__section__ (".data.init")))
        #else
        #define __init
        #define __initdata
        #endif

What this means is that if the code is compiled statically into the kernel (i.e. MODULE is not defined) then it is placed in the special ELF section ".text.init", which is declared in the linker map in arch/i386/vmlinux.lds. Otherwise (i.e. if it is a module) the macros evaluate to nothing.

What happens during boot is that the "init" kernel thread (function init/main.c:init()) calls the arch-specific function free_initmem(), which frees all the pages between addresses __init_begin and __init_end.

On a typical system (my workstation), this results in freeing about 260K of memory.

The functions registered via module_init() are placed in ".initcall.init", which is also freed in the static case. The current trend in Linux, when designing a subsystem (not necessarily a module), is to provide init/exit entry points from the early stages of design so that in the future the subsystem in question can be modularised if needed. An example of this is pipefs, see fs/pipe.c. Even if a given subsystem will never become a module, e.g. bdflush (see fs/buffer.c), it is still nice and tidy to use the module_init() macro against its initialisation function, provided it does not matter exactly when the function is called.

There are two more macros which work in a very similar manner, called __exit and __exitdata, but they are more directly connected to the module support and therefore will be explained in a later section.
1.9 Processing kernel command line

Let us recall what happens to the commandline passed to the kernel during boot.

  1. LILO (or BCP) accepts the commandline using BIOS keyboard services and stores it at a well-known location in physical memory, as well as a signature saying that there is a valid commandline there.
  2. arch/i386/kernel/head.S copies the first 2k of it out to the zeropage.
     Note that the current version (21) of LILO chops the commandline to 79 bytes. This is a nontrivial bug in LILO (when large EBDA support is enabled) and Werner promised to fix it sometime soon. If you really need to pass commandlines longer than 79 bytes then you can either use BCP or hardcode your commandline in the arch/i386/kernel/setup.c:parse_mem_cmdline() function.
  3. arch/i386/kernel/setup.c:parse_mem_cmdline() (called by setup_arch(), itself called by start_kernel()) copies 256 bytes from the zeropage into saved_command_line, which is displayed by /proc/cmdline. This same routine processes the "mem=" portion and makes appropriate adjustments to VM parameters.
  4. We return to the commandline in parse_options() (called by start_kernel()), which processes some "in-kernel" parameters (currently "init=" and environment/arguments for init) and passes each word to checksetup().
  5. checksetup() goes through the code in the ELF section ".setup.init" and invokes each function, passing it the word if it matches. Note that by returning 0 from the function registered via __setup(), it is possible to pass the same "variable=value" to more than one function, with "value" invalid to one and valid to another. Jeff Garzik commented: "hackers who do that get spanked :)" Why? Because this is clearly ld-order specific, i.e. a kernel linked in one order will have functionA invoked before functionB and another will have them in reversed order, with the result depending on the order.

So, how do we write code that processes the boot commandline?
We use the __setup() macro defined in include/linux/init.h:

        /*
         * Used for kernel command line parameter setup
         */
        struct kernel_param {
                const char *str;
                int (*setup_func)(char *);
        };

        extern struct kernel_param __setup_start, __setup_end;

        #ifndef MODULE
        #define __setup(str, fn) \
           static char __setup_str_##fn[] __initdata = str; \
           static struct kernel_param __setup_##fn __initsetup = \
           { __setup_str_##fn, fn }
        #else
        #define __setup(str,func) /* nothing */
        #endif

So, you would typically use it in your code like this (taken from the code of a real driver, BusLogic HBA drivers/scsi/BusLogic.c):

        static int __init BusLogic_Setup(char *str)
        {
                int ints[3];
                (void)get_options(str, ARRAY_SIZE(ints), ints);

                if (ints[0] != 0) {
                        BusLogic_Error("BusLogic: Obsolete Command Line Entry "
                                        "Format Ignored\n", NULL);
                        return 0;
                }
                if (str == NULL || *str == '\0')
                        return 0;
                return BusLogic_ParseDriverOptions(str);
        }

        __setup("BusLogic=", BusLogic_Setup);

Note that for modules __setup() does nothing, so code that wishes to process the boot commandline and can be either a module or statically linked must invoke its parsing function manually in the module initialisation routine. This also means that it is possible to write code that processes parameters when compiled as a module but not when it is static, or vice versa.

2. Process and Interrupt Management

2.1 Task Structure and Process Table

Every process under Linux is dynamically allocated a 'struct task_struct' structure. The maximum number of processes that can be created on a Linux system is limited only by the amount of physical memory present, and is equal to (see kernel/fork.c:fork_init()):

        /*
         * The default maximum number of threads is set to a safe
         * value: the thread structures can take up at most half
         * of memory.
         */
        max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;

which on the IA32 architecture basically means 'num_physpages/4'. So, for example, on a 512M machine you can create 32k threads, which is a considerable improvement over the 4k-epsilon limit of older (2.2 and earlier) kernels. Moreover, this can be changed at runtime using the KERN_MAX_THREADS sysctl(2) or simply using the procfs interface to kernel tunables:
        # cat /proc/sys/kernel/threads-max
        32764
        # echo 100000 > /proc/sys/kernel/threads-max
        # cat /proc/sys/kernel/threads-max
        100000
        # gdb -q vmlinux /proc/kcore
        Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
        #0  0x0 in ?? ()
        (gdb) p max_threads
        $1 = 100000

The set of processes on the Linux system is represented as a collection of 'struct task_struct' structures which are linked in two ways:

  1. as a hashtable, hashed by pid
  2. as a circular, doubly-linked list using p->next_task and p->prev_task pointers

The hashtable is called pidhash[] and is defined in include/linux/sched.h:

        /* PID hashing. (shouldnt this be dynamic?) */
        #define PIDHASH_SZ (4096 >> 2)
        extern struct task_struct *pidhash[PIDHASH_SZ];

        #define pid_hashfn(x)   ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))

The tasks are hashed by their pid value and the above hashing function is supposed to distribute the elements uniformly in their domain (0 to PID_MAX-1). The hashtable is used to quickly find a task by a given pid, using the find_task_by_pid() inline from include/linux/sched.h:

        static inline struct task_struct *find_task_by_pid(int pid)
        {
                struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

                for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
                        ;

                return p;
        }

The tasks on each hashlist (i.e. hashed to the same value) are linked by p->pidhash_next/pidhash_pprev, which are used by hash_pid() and unhash_pid() to insert a given process into the hashtable and to remove it. These operations are done under the protection of the rw spinlock called 'tasklist_lock', taken for WRITE.