Linux Kernel Internals The circular doubly-linked list that uses p->next task/prev_task is maintained so that one could go through all tasks on the system easily.This is achieved by for each tasko macro from include/linux/sched.h: tdefine for_each_task(p) (p p->next_task)I=sinit_task The tasklist_lock for READ.No using init task the beginning(and end)of the list-this is The modifier hashtable or/and the ss table links notably fork trace must take the tasklist lock for WRITE What is inte is that the writ ts on the local onu The reason for this is not trivial The send si interrupts while the readers don't need to Now that we understand how the task struct structures are linked together,let us examine the members of task struct.They loosely corresponds to the members of UNIX'struct proc'and'struct user combined together. The other versions of UNIX separated the task state information into part which should be kept memory-resident at all times(called 'proc structure'which includes process state,scheduling information ete.)and part which is only needed when the process is running(called'u area which includes file descriptor table,disk quota information etc.).The only reason for such ugly design was that memory was a very scarce resource.Modern operating systems(well,only Linux at the moment but others.e.g.FreeBSD seem to improve in this direction towards Linux)do not need such separation and therefore maintain process state in a kernel memory-resident data structure at all times. The taskstruct structure is declared in include/linux/sched.h and is currently 1680 bytes in size The state field is declared as volatile long state; /*-1 unrunnable,0 runnable,>0 stopped * #define TAS _UNINTERRUPTIBLE define TASK EXCLUSIVE Why is TASK_EXCLUSIVE defined as 32 and not 16?Because 16 was used up by TASK_SWAPPING and I forgot to shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x) 2.Process and Interrupt Management 14
The circular doubly−linked list that uses p−>next_task/prev_task is maintained so that one could go through all tasks on the system easily. This is achieved by for_each_task() macro from include/linux/sched.h: #define for_each_task(p) \ for (p = &init_task ; (p = p−>next_task) != &init_task ; ) The users of for_each_task() should take tasklist_lock for READ. Note that for_each_task() is using init_task to mark the beginning (and end) of the list − this is safe because the idle task (pid 0) never exits. The modifiers of the process hashtable or/and the process table links, notably fork, exit and ptrace must take the tasklist_lock for WRITE. What is more interesting is that the writers must also disable interrupts on the local cpu. The reason for this is not trivial. The send_sigio() walks the task list and thus takes tasklist_lock for READ and it is called from kill_fasync() in the interrupt context. This is why writers must disable the interrupts while the readers don't need to. Now that we understand how the task_struct structures are linked together, let us examine the members of task_struct. They loosely corresponds to the members of UNIX 'struct proc' and 'struct user' combined together. The other versions of UNIX separated the task state information into part which should be kept memory−resident at all times (called 'proc structure' which includes process state, scheduling information etc.) and part which is only needed when the process is running (called 'u area' which includes file descriptor table, disk quota information etc.). The only reason for such ugly design was that memory was a very scarce resource. Modern operating systems (well, only Linux at the moment but others, e.g. FreeBSD seem to improve in this direction towards Linux) do not need such separation and therefore maintain process state in a kernel memory−resident data structure at all times. The task_struct structure is declared in include/linux/sched.h and is currently 1680 bytes in size. The state field is declared as: volatile long state; /* −1 unrunnable, 0 runnable, >0 stopped */ #define TASK_RUNNING 0 #define TASK_INTERRUPTIBLE 1 #define TASK_UNINTERRUPTIBLE 2 #define TASK_ZOMBIE 4 #define TASK_STOPPED 8 #define TASK_EXCLUSIVE 32 Why is TASK_EXCLUSIVE defined as 32 and not 16? Because 16 was used up by TASK_SWAPPING and I forgot to shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x). Linux Kernel Internals 2.Process and Interrupt Management 14
Linux Kernel Internals The volatile in p->state declaration means it can be modified asynchronously(from interrupt handler): 1.TASK RUNNING means the task is"supposed to be"on the run queue.The reason it may not yet be on the runqueue is that marking task as TASK RUNNING and placing it on the runqueue is not atomic,however if you look at the queue under protection of runqueue_lock then every TASK RUNNING is on the runqucue.The converse is not true.Namely,drivers can mark themselves (or rather the process context they run in)as TASK_INTERRUPTIBLE(or UNINTERRUPTIBLE)and then call schedule()which removes it from the runqueue (unless there is a pending signal,in which case it is left on the runqucue).speaking not true because setting state=TASK_RUNNING and placing task on the runq by wake_up_process()is not atomic so you can see (very b riefly)TASK_RUNNING tasks not yet on the rung. TASK INTER means the ta sk is sleeping b in b of a timer TERRUPTIBLE same as except It cannot be woken up.TASK_ZOMBIE task has d but ha not had its status collected (wait()-ed for by th (natural or STOPPED task was stoppe er due to oo contro or di can be OF NIN with any othe ill be woken up alone instea em by w ing up a Task flags contain information about the process states which are not mutually exclusive /per process flags,defined below * Per process flags define PF_ALIGNWARN 0×00000001 define PF STARTING 0x0000000 0x00000040 define PF_SU 0x00000100 used super -user privileges+/ SIG 0×0000040 killed by a signal+/ define ting mm release tdefine PF_USEDFPU 0x00100000 /task used FPU this quantum (SMP)+/ The fields p->has_cpu.p->processor,p->counter,p->priority,p->policy and p->rt_priority are related to the scheduler and will be looked at later. The fields p- >mm and p->active_mm point to the process'address space described by mm_struct structure thi is to minimize ing address spac n the tas So,if we are scheduling-in the thread (v hich s no p next active_mm wi ill be set to th pre e m that was out v ame as pre mm if prev can b een threads if CLONE VM flag is passed to the clone(2)system call or by means o vfork(2)system cal 2.Process and Interrupt Management 6
The volatile in p−>state declaration means it can be modified asynchronously (from interrupt handler): 1. TASK_RUNNING means the task is "supposed to be" on the run queue. The reason it may not yet be on the runqueue is that marking task as TASK_RUNNING and placing it on the runqueue is not atomic, however if you look at the queue under protection of runqueue_lock then every TASK_RUNNING is on the runqueue. The converse is not true. Namely, drivers can mark themselves (or rather the process context they run in) as TASK_INTERRUPTIBLE (or UNINTERRUPTIBLE) and then call schedule() which removes it from the runqueue (unless there is a pending signal, in which case it is left on the runqueue). speaking not true because setting state=TASK_RUNNING and placing task on the runq by wake_up_process() is not atomic so you can see (very briefly) TASK_RUNNING tasks not yet on the runq. TASK_INTERRUPTIBLE means the task is sleeping but can be woken up by a signal or by expiry of a timer. TASK_UNINTERRUPTIBLE same as TASK_INTERRUPTIBLE, except it cannot be woken up. TASK_ZOMBIE task has terminated but has not had its status collected (wait()−ed for) by the parent (natural or by adoption). TASK_STOPPED task was stopped either due to job control signals or due to ptrace(2). TASK_EXCLUSIVE this is not a separate state but can be OR−ed to either one of the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. This means that when this task is sleeping on a wait queue with many other tasks, it will be woken up alone instead of causing "thundering herd" problem by waking up all the waiters. Task flags contain information about the process states which are not mutually exclusive: unsigned long flags; /* per process flags, defined below */ /* * Per process flags */ #define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */ /* Not implemented yet, only for 486*/ #define PF_STARTING 0x00000002 /* being created */ #define PF_EXITING 0x00000004 /* getting shut down */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ #define PF_SUPERPRIV 0x00000100 /* used super−user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ #define PF_SIGNALED 0x00000400 /* killed by a signal */ #define PF_MEMALLOC 0x00000800 /* Allocating memory */ #define PF_VFORK 0x00001000 /* Wake up parent in mm_release */ #define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */ The fields p−>has_cpu,p−>processor, p−>counter, p−>priority, p−>policy and p−>rt_priority are related to the scheduler and will be looked at later. The fields p−>mm and p−>active_mm point to the process' address space described by mm_struct structure and to the active address space if the process doesn't have a real one (e.g. kernel threads) − this is to minimize TLB flushes on switching address spaces when the task is scheduled out. So, if we are scheduling−in the kernel thread (which has no p−>mm) then its next−>active_mm will be set to the prev−>active_mm of the task that was scheduled−out which will be the same as prev−>mm if prev−>mm != NULL. The address space can be shared between threads if CLONE_VM flag is passed to the clone(2) system call or by means of vfork(2) system call. Linux Kernel Internals 2.Process and Interrupt Management 15
Linux Kernel Internals The fields p->exec domain and p->personality related to the personality of the task,i.e.to the way certain system calls behave in order to emulate"personality"of foreign flavours of UNIX. The field p->fs contains filesystem information,which under Linux means three pieces of information: 1.root directory's dentry and mountpoint 2.alternate ro ot directory's dentry and mountpoint 3.current working directory's dentry and mountpoint it can be shared between cloned tasks when The field p- file ains the file descriptor table.This also can be shared between tasks if CLONE FILES ed with clone(2)system call The field r and can be shared between cloned tasks by means of ed to lone(2)system call. 2.2 Creation and termination of tasks and kernel threads ocess"in different ways starting from "instance of a hich is produced by cloe(fork()syste calls"Under Linux.there are three kinds of processes: ·IdleThread ◆Kernel Threads ●User Tasks The idle thread is created at compile time for the first CPU and then it is"manually"created for each CPU by means of arch-specific fork_by_hand()in arch/i386/kernel/smpboot.c which unrolls fork system call by hand (on some archs).Idle tasks share one init_task structure but have a private TSS structure in per-CPU array init tss.Idle tasks all have pid=0 and no other task can share pid,i.e.use CLONE PID flag to clone(2). adare created usinemel thead(function which invokes tho one system call in kerne mode. user adar s space,i.e.p->mm=NULL be kit mm(),e.g via e() tion.Kemel t s can always access cat pid num eni the o privilees and camt be preemted theschedule rs in th low range.R or'srinimplies that the kemel threads User iorceated by means of clone(2)or fork(2)system calls,both of which internally invoke Let us understand what happe fork()ystem call Although the fork(2) all is architec diffe rstack and ,the actual 、do fork oes the ob i s po able nd is l cated a el/fork c 2.2 Creation and termination of tasks and kernel threads 16
The fields p−>exec_domain and p−>personality related to the personality of the task, i.e. to the way certain system calls behave in order to emulate "personality" of foreign flavours of UNIX. The field p−>fs contains filesystem information, which under Linux means three pieces of information: 1. root directory's dentry and mountpoint 2. alternate root directory's dentry and mountpoint 3. current working directory's dentry and mountpoint Also, this structure includes a reference count because it can be shared between cloned tasks when CLONE_FS flags are passed to the clone(2) system call. The field p−>files contains the file descriptor table. This also can be shared between tasks if CLONE_FILES is specified with clone(2) system call. The field p−>sig contains signal handlers and can be shared between cloned tasks by means of CLONE_SIGHAND flag passed to the clone(2) system call. 2.2 Creation and termination of tasks and kernel threads Different books on operating systems define a "process" in different ways, starting from "instance of a program in execution" and ending with "that which is produced by clone(2) or fork(2) system calls". Under Linux, there are three kinds of processes: • Idle Thread • Kernel Threads • User Tasks The idle thread is created at compile time for the first CPU and then it is "manually" created for each CPU by means of arch−specific fork_by_hand() in arch/i386/kernel/smpboot.c which unrolls fork system call by hand (on some archs). Idle tasks share one init_task structure but have a private TSS structure in per−CPU array init_tss. Idle tasks all have pid = 0 and no other task can share pid, i.e. use CLONE_PID flag to clone(2). Kernel threads are created using kernel_thread() function which invokes the clone system call in kernel mode. Kernel threads usually have no user address space, i.e. p−>mm = NULL because they explicitly do exit_mm(), e.g. via daemonize() function. Kernel threads can always access kernel address space directly. They are allocated pid numbers in the low range. Running at processor's ring 0 implies that the kernel threads enjoy all the io privileges and cannot be pre−empted by the scheduler. User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c:do_fork(). Let us understand what happens when a user process makes a fork(2) system call. Although the fork(2) system call is architecture−dependent due to the different ways of passing user stack and registers, the actual underlying function do_fork() that does the job is portable and is located at kernel/fork.c. Linux Kernel Internals 2.2 Creation and termination of tasks and kernel threads 16
Linux Kernel Internals The following steps are done 1.Local variable retval is set to-ENOMEM as it is the value ermo is set to if fork(2)fails to allocate a new task structure 2.if CLONE_PID is set in clone_flags then return an error(-EPERM)unless the caller is the idle thread(during boot only).So,normal user threads cannot pass CLONE PID to clone(2)and expect it to succeed.For fork(2)it is irrelevant as clone_flags is set to SIFCHLD-this is only relevant when do_fork()is invoked from sys_clone()which passes the clone_flags from the value requested from userspace 3.current vfork sem is initialised(it is later cleared in the child).This is used by sys_vfork( (vfork(2)system call,corresponds to clone_flags=CLONE_VFORK|CLONE_VMSIGCHLD)to make the parent sleep until the child does mm_release(for example as a result of execing another program or exit(2)-ing 4.A new task is allocated using arch-dependent alloc_task_struct()macro,on x86 it is just a at GFP ty. s the first reason why fork(2)system call may sleep.If this we return 5. cess t ure are co d into the new e,usin dbe replaced a memset?Later on,the fields s tha set to ra ake it a fact)the EAGAIN. verify if the eded RLIMIT NPROC soft lin -if so,fail with 8.If the sy ide number of tasks xceeds the value of the tunable max_thr ads.fail with -EAGAIN 9.If the bin ary being executed belongs to a modularised execution domain increment the 1oIFthebtnanYeTmgitccfeSriegooanoduarieabnaytoTmatincrcnenttecorcsponding module's reference count 11.The child is marked as 'has not execed'p->did_exec=0 12.The child is marked as'not-swappable'p->swappable =0 13.The child is put into'uninterruptible sleep'state p->state=TASK UNINTERRUPTIBLE (TODO: why is this done?I think it's not needed-get rid of it,Linus confirms it is not needed) 14.The child's p>flags are set according to the value of clone_flags,for the plain fork(2)it is pflags PF FORKNOEXEC 15.The childs pid p->pid is set using the fast algorithm in kemel/fork.c:get_pid((TODO:lastpid_lock k can be made redundant since get_pid()is always called under big kernel lock from do fork().also remove flags argument of get pid.patch sent to Alan on 20/06/2000-followup later). 16.The rest of the code in do fork()initialises the rest of child's task structure.At the very end.the child's task structure is hashed into pidhas ble and the ild is woken up (TODO s(p)sets p->state adds the pro oly didn'tn runq,therefore do_fork( Just SIGC LD ind sett g】 al t e po nen rg an b y mea fpretl(2) h ay u pointe pretl()isa bit silly-mea culpa.after ndries Brouwer updated the manpage it 2.2 Creation and termination of tasks and kernel threads 17
The following steps are done: 1. Local variable retval is set to −ENOMEM as it is the value errno is set to if fork(2) fails to allocate a new task structure 2. if CLONE_PID is set in clone_flags then return an error (−EPERM) unless the caller is the idle thread (during boot only). So, normal user threads cannot pass CLONE_PID to clone(2) and expect it to succeed. For fork(2) it is irrelevant as clone_flags is set to SIFCHLD − this is only relevant when do_fork() is invoked from sys_clone() which passes the clone_flags from the value requested from userspace 3. current−>vfork_sem is initialised (it is later cleared in the child). This is used by sys_vfork() (vfork(2) system call, corresponds to clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD) to make the parent sleep until the child does mm_release() for example as a result of execing another program or exit(2)−ing 4. A new task structure is allocated using arch−dependent alloc_task_struct() macro, on x86 it is just a gfp at GFP_KERNEL priority. This is the first reason why fork(2) system call may sleep. If this allocation fails we return −ENOMEM 5. All the values from current process' task structure are copied into the new one, using structure assignment *p = *current. Perhaps this should be replaced by a memset? Later on, the fields that should not be inherited by the child are set to the correct values 6. Big kernel lock is taken as the rest of the code would otherwise be non−reentrant 7. If the parent has user resources (a concept of UID, Linux is flexible enough to make it a question rather than a fact), then verify if the user exceeded RLIMIT_NPROC soft limit − if so, fail with −EAGAIN, if not, increment the count of processes by given uid p−>user−>count 8. If the system−wide number of tasks exceeds the value of the tunable max_threads, fail with −EAGAIN 9. If the binary being executed belongs to a modularised execution domain, increment the corresponding module's reference count 10. If the binary being executed belongs to a modularised binary format, increment the corresponding module's reference count 11. The child is marked as 'has not execed' p−>did_exec = 0 12. The child is marked as 'not−swappable' p−>swappable = 0 13. The child is put into 'uninterruptible sleep' state p−>state = TASK_UNINTERRUPTIBLE (TODO: why is this done? I think it's not needed − get rid of it, Linus confirms it is not needed) 14. The child's p−>flags are set according to the value of clone_flags, for the plain fork(2) it is p−>flags = PF_FORKNOEXEC. 15. The childs pid p−>pid is set using the fast algorithm in kernel/fork.c:get_pid() (TODO: lastpid_lock spinlock can be made redundant since get_pid() is always called under big kernel lock from do_fork(), also remove flags argument of get_pid, patch sent to Alan on 20/06/2000 − followup later). 16. The rest of the code in do_fork() initialises the rest of child's task structure. At the very end, the child's task structure is hashed into pidhash hashtable and the child is woken up (TODO: wake_up_process(p) sets p−>state = TASK_RUNNING and adds the process to the runq, therefore we probably didn't need to set p−>state to TASK_RUNNING earlier on in do_fork()). The interesting part is setting p−>exit_signal to clone_flags & CSIGNAL which for fork(2) means just SIGCHLD and setting p−>pdeath_signal to 0. The pdeath_signal is used when a process 'forgets' the original parent (by dying) and can be set/get by means of PR_GET/SET_PDEATHSIG commands of prctl(2) system call (You might argue that the way the value of pdeath_signal is returned via userspace pointer argument in prctl(2) is a bit silly − mea culpa, after Andries Brouwer updated the manpage it was too late to fix ;) Linux Kernel Internals 2.2 Creation and termination of tasks and kernel threads 17