A-12 Appendix A Assemblers, Linkers, and the SPIM Simulator

An assembler’s first pass reads each line of an assembly file and breaks it into its component pieces. These pieces, which are called lexemes, are individual words, numbers, and punctuation characters. For example, the line

    ble $t0, 100, loop

contains six lexemes: the opcode ble, the register specifier $t0, a comma, the number 100, a comma, and the symbol loop.

If a line begins with a label, the assembler records in its symbol table the name of the label and the address of the memory word that the instruction occupies. The assembler then calculates how many words of memory the instruction on the current line will occupy. By keeping track of the instructions’ sizes, the assembler can determine where the next instruction goes. To compute the size of a variable-length instruction, like those on the VAX, an assembler has to examine it in detail. Fixed-length instructions, like those on MIPS, on the other hand, require only a cursory examination. The assembler performs a similar calculation to compute the space required for data statements. When the assembler reaches the end of an assembly file, the symbol table records the location of each label defined in the file.

The assembler uses the information in the symbol table during a second pass over the file, which actually produces machine code. The assembler again examines each line in the file. If the line contains an instruction, the assembler combines the binary representations of its opcode and operands (register specifiers or memory address) into a legal instruction. The process is similar to the one used in Section 2.4 in Chapter 2. Instructions and data words that reference an external symbol defined in another file cannot be completely assembled (they are unresolved) since the symbol’s address is not in the symbol table. An assembler does not complain about unresolved references since the corresponding label is likely to be defined in another file.

The BIG Picture

Assembly language is a programming language. Its principal difference from high-level languages such as BASIC, Java, and C is that assembly language provides only a few, simple types of data and control flow. Assembly language programs do not specify the type of value held in a variable. Instead, a programmer must apply the appropriate operations (e.g., integer or floating-point addition) to a value. In addition, in assembly language, programs must implement all control flow with go tos. Both factors make assembly language programming for any machine (MIPS or 80x86) more difficult and error-prone than writing in a high-level language.

symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.
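The two passes described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any particular assembler is written: every instruction is taken to occupy one 4-byte word, data statements are ignored, and the second pass merely substitutes label addresses instead of encoding binary instructions. The names `lexemes`, `first_pass`, and `second_pass` are hypothetical.

```python
import re

# Hypothetical lexer: words, registers, and numbers, or single punctuation marks.
LEXEME = re.compile(r"[$.\w]+|[,:()]")

def lexemes(line):
    """Split one assembly line into lexemes, ignoring '#' comments."""
    return LEXEME.findall(line.split("#", 1)[0])

def strip_label(toks):
    """Return (label or None, remaining lexemes) for one line."""
    if len(toks) >= 2 and toks[1] == ":":
        return toks[0], toks[2:]
    return None, toks

def first_pass(lines, word_size=4):
    """Record each label's address, assuming fixed-length instructions."""
    symbols, address = {}, 0
    for line in lines:
        label, toks = strip_label(lexemes(line))
        if label is not None:
            symbols[label] = address
        if toks:                      # an instruction occupies one word
            address += word_size
    return symbols

def second_pass(lines, symbols, word_size=4):
    """Substitute label addresses; collect references left unresolved."""
    resolved, unresolved, address = [], [], 0
    for line in lines:
        _, toks = strip_label(lexemes(line))
        if not toks:
            continue
        operands = []
        for op in (t for t in toks[1:] if t != ","):
            if op.startswith("$") or op.lstrip("-").isdigit():
                operands.append(op)              # register or number
            elif op in symbols:
                operands.append(symbols[op])     # label defined in this file
            else:
                operands.append(None)            # defined elsewhere, perhaps
                unresolved.append((address, op))
        resolved.append((address, toks[0], operands))
        address += word_size
    return resolved, unresolved
```

Note that an unknown symbol is recorded rather than reported as an error, mirroring the text's point that the label is likely to be defined in another file.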
A.2 Assemblers A-13

Elaboration: If an assembler’s speed is important, this two-step process can be done in one pass over the assembly file with a technique known as backpatching. In its pass over the file, the assembler builds a (possibly incomplete) binary representation of every instruction. If the instruction references a label that has not yet been defined, the assembler records the label and instruction in a table. When a label is defined, the assembler consults this table to find all instructions that contain a forward reference to the label. The assembler goes back and corrects their binary representation to incorporate the address of the label. Backpatching speeds assembly because the assembler only reads its input once. However, it requires an assembler to hold the entire binary representation of a program in memory so instructions can be backpatched. This requirement can limit the size of programs that can be assembled. The process is complicated by machines with several types of branches that span different ranges of instructions. When the assembler first sees an unresolved label in a branch instruction, it must either use the largest possible branch or risk having to go back and readjust many instructions to make room for a larger branch.

backpatching A method for translating from assembly language to machine instructions in which the assembler builds a (possibly incomplete) binary representation of every instruction in one pass over a program and then returns to fill in previously undefined labels.
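A minimal sketch of backpatching, as described in the Elaboration above, might look like the following. It is a simplified model rather than a real assembler: the "binary representation" is just a list of `[opcode, target]` pairs, addresses are instruction indices, and the function name `assemble_one_pass` and its input format are assumptions made for illustration.

```python
def assemble_one_pass(program):
    """Assemble in one pass with backpatching.

    program: list of (label or None, opcode, target label or None).
    Returns (code, pending), where pending maps still-undefined labels
    to the indices of the instructions that reference them.
    """
    code = []       # in-memory representation: [opcode, target index]
    symbols = {}    # label -> instruction index
    pending = {}    # undefined label -> indices of forward references
    for index, (label, opcode, target) in enumerate(program):
        if label is not None:
            symbols[label] = index
            # Go back and fix every earlier forward reference to this label.
            for waiting in pending.pop(label, []):
                code[waiting][1] = index
        if target is None:
            code.append([opcode, None])
        elif target in symbols:
            code.append([opcode, symbols[target]])   # backward reference
        else:
            code.append([opcode, None])              # incomplete for now
            pending.setdefault(target, []).append(index)
    return code, pending
```

Because `code` must stay in memory until the end of the pass, the sketch also shows why the technique can limit the size of programs that can be assembled. It sidesteps the branch-range problem entirely by giving every target field the same size.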
Object File Format

Assemblers produce object files. An object file on UNIX contains six distinct sections (see Figure A.2.1):

■ The object file header describes the size and position of the other pieces of the file.
■ The text segment contains the machine language code for routines in the source file. These routines may be unexecutable because of unresolved references.
■ The data segment contains a binary representation of the data in the source file. The data also may be incomplete because of unresolved references to labels in other files.
■ The relocation information identifies instructions and data words that depend on absolute addresses. These references must change if portions of the program are moved in memory.
■ The symbol table associates addresses with external labels in the source file and lists unresolved references.
■ The debugging information contains a concise description of the way in which the program was compiled, so a debugger can find which instruction addresses correspond to lines in a source file and print the data structures in readable form.

Object file header | Text segment | Data segment | Relocation information | Symbol table | Debugging information

FIGURE A.2.1 Object file. A UNIX assembler produces an object file with six distinct sections.

The assembler produces an object file that contains a binary representation of the program and data and additional information to help link pieces of a program. This relocation information is necessary because the assembler does not know which memory locations a procedure or piece of data will occupy after it is linked with the rest of the program. Procedures and data from a file are stored in a contiguous piece of memory, but the assembler does not know where this memory will be located. The assembler also passes some symbol table entries to the linker. In particular, the assembler must record which external symbols are defined in a file and what unresolved references occur in a file.

text segment The segment of a UNIX object file that contains the machine language code for routines in the source file.

data segment The segment of a UNIX object or executable file that contains a binary representation of the initialized data used by the program.

relocation information The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

absolute address A variable’s or routine’s actual address in memory.

Elaboration: For convenience, assemblers assume each file starts at the same address (for example, location 0) with the expectation that the linker will relocate the code and data when they are assigned locations in memory. The assembler produces relocation information, which contains an entry describing each instruction or data word in the file that references an absolute address. On MIPS, only the subroutine call, load, and store instructions reference absolute addresses. Instructions that use PC-relative addressing, such as branches, need not be relocated.
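To illustrate how relocation information is used, here is a rough sketch of an object file assembled at location 0 and later moved. It is a deliberately simplified model: each word is stored as a whole Python integer rather than as an encoded MIPS instruction with an address field, and the `ObjectFile` class, the `(segment, index, target)` entry format, and the base addresses in the usage note are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectFile:
    """The six sections of the object file, in a toy representation."""
    header: dict = field(default_factory=dict)
    text: list = field(default_factory=list)        # one int per word
    data: list = field(default_factory=list)        # one int per word
    relocation: list = field(default_factory=list)  # (segment, index, target)
    symbols: dict = field(default_factory=dict)
    debug: dict = field(default_factory=dict)

def relocate(obj, text_base, data_base):
    """Return copies of text and data adjusted for the assigned bases.

    Each relocation entry says that word `index` of `segment` holds an
    absolute address inside `target`, computed as if the file started at 0.
    """
    bases = {"text": text_base, "data": data_base}
    segments = {"text": list(obj.text), "data": list(obj.data)}
    for segment, index, target in obj.relocation:
        segments[segment][index] += bases[target]
    return segments["text"], segments["data"]
```

For example, with `text=[8, 99, 4]`, `data=[0, 0, 12]`, and entries `("text", 0, "data")`, `("text", 2, "text")`, and `("data", 2, "data")`, calling `relocate(obj, 0x400000, 0x10000000)` adjusts the three listed words and leaves the word `99` alone, much as a PC-relative branch offset would be left untouched.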
Additional Facilities

Assemblers provide a variety of convenience features that help make assembler programs shorter and easier to write, but do not fundamentally change assembly language. For example, data layout directives allow a programmer to describe data in a more concise and natural manner than its binary representation. In Figure A.1.4, the directive

    .asciiz "The sum from 0 .. 100 is %d\n"

stores characters from the string in memory. Contrast this line with the alternative of writing each character as its ASCII value (Figure 2.21 in Chapter 2 describes the ASCII encoding for characters):

    .byte 84, 104, 101, 32, 115, 117, 109, 32
    .byte 102, 114, 111, 109, 32, 48, 32, 46
    .byte 46, 32, 49, 48, 48, 32, 105, 115
    .byte 32, 37, 100, 10, 0

The .asciiz directive is easier to read because it represents characters as letters, not binary numbers. An assembler can translate characters to their binary representation much faster and more accurately than a human. Data layout directives specify data in a human-readable form that the assembler translates to binary. Other layout directives are described in Section A.10 on page A-45.

String Directive

EXAMPLE

Define the sequence of bytes produced by this directive:

    .asciiz "The quick brown fox jumps over the lazy dog"

ANSWER

    .byte 84, 104, 101, 32, 113, 117, 105, 99
    .byte 107, 32, 98, 114, 111, 119, 110, 32
    .byte 102, 111, 120, 32, 106, 117, 109, 112
    .byte 115, 32, 111, 118, 101, 114, 32, 116
    .byte 104, 101, 32, 108, 97, 122, 121, 32
    .byte 100, 111, 103, 0
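The translation that the .asciiz directive asks the assembler to perform can be reproduced with a short Python sketch. The helper names `asciiz_bytes` and `as_byte_directives` are hypothetical, and escape sequences such as `\n` are handled here by Python's own string literal rather than by assembler logic.

```python
def asciiz_bytes(text):
    """Bytes stored for an .asciiz string: its ASCII codes plus a null byte."""
    return list(text.encode("ascii")) + [0]

def as_byte_directives(values, per_line=8):
    """Format byte values as .byte lines, eight values per line by default."""
    return [
        ".byte " + ", ".join(str(v) for v in values[i:i + per_line])
        for i in range(0, len(values), per_line)
    ]

if __name__ == "__main__":
    for line in as_byte_directives(asciiz_bytes("The sum from 0 .. 100 is %d\n")):
        print(line)
```

Running the script prints four .byte lines that can be compared against the hand-written listing for the string from Figure A.1.4.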
Macros are a pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions. Instead of repeatedly typing the same instructions every time they are used, a programmer invokes the macro and the assembler replaces the macro call with the corresponding sequence of instructions. Macros, like subroutines, permit a programmer to create and name a new abstraction for a common operation. Unlike subroutines, however, macros do not cause a subroutine call and return when the program runs since a macro call is replaced by the macro’s body when the program is assembled. After this replacement, the resulting assembly is indistinguishable from the equivalent program written without macros.

Macros

EXAMPLE

As an example, suppose that a programmer needs to print many numbers. The library routine printf accepts a format string and one or more values to print as its arguments. A programmer could print the integer in register $7 with the following instructions:

             .data
    int_str: .asciiz "%d"
             .text
             la   $a0, int_str   # Load string address
                                 # into first arg
             move $a1, $7        # Load value into
                                 # second arg
             jal  printf         # Call the printf routine

The .data directive tells the assembler to store the string in the program’s data segment, and the .text directive tells the assembler to store the instructions in its text segment.

However, printing many numbers in this fashion is tedious and produces a verbose program that is difficult to understand. An alternative is to introduce a macro, print_int, to print an integer:

             .data
    int_str: .asciiz "%d"
             .text
             .macro print_int($arg)
             la   $a0, int_str   # Load string address into
                                 # first arg
             move $a1, $arg      # Load macro's parameter
                                 # ($arg) into second arg
             jal  printf         # Call the printf routine
             .end_macro
             print_int($7)

The macro has a formal parameter, $arg, that names the argument to the macro. When the macro is expanded, the argument from a call is substituted for the formal parameter throughout the macro’s body. Then the assembler replaces the call with the macro’s newly expanded body. In the first call on print_int, the argument is $7, so the macro expands to the code

    la   $a0, int_str
    move $a1, $7
    jal  printf

In a second call on print_int, say, print_int($t0), the argument is $t0, so the macro expands to

    la   $a0, int_str
    move $a1, $t0
    jal  printf

What does the call print_int($a0) expand to?

formal parameter A variable that is the argument to a procedure or macro; replaced by that argument once the macro is expanded.
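The substitution step described above can be approximated by a small text-replacement routine. This sketch assumes the `.macro name($param)` and `.end_macro` syntax used in the example, one statement per line, and plain string replacement of each formal parameter (so a parameter whose name is a prefix of another would be mishandled). The function name `expand_macros` is hypothetical.

```python
def expand_macros(lines):
    """Replace each macro call with the macro's body, arguments substituted."""
    macros = {}      # name -> (formal parameters, body lines)
    output = []
    current = None   # (name, formals, body) while inside a definition
    for raw in lines:
        line = raw.strip()
        if line.startswith(".macro"):
            name, _, params = line[len(".macro"):].strip().partition("(")
            formals = [p.strip() for p in params.rstrip(")").split(",") if p.strip()]
            current = (name.strip(), formals, [])
        elif line == ".end_macro":
            macros[current[0]] = (current[1], current[2])
            current = None
        elif current is not None:
            current[2].append(line)            # still collecting the body
        else:
            name, paren, args = line.partition("(")
            if paren and name.strip() in macros:
                formals, body = macros[name.strip()]
                actuals = [a.strip() for a in args.rstrip(")").split(",")]
                for body_line in body:
                    for formal, actual in zip(formals, actuals):
                        body_line = body_line.replace(formal, actual)
                    output.append(body_line)
            else:
                output.append(line)            # ordinary line, copied as is
    return output
```

Because the expansion is purely textual, a call such as print_int($a0) yields a body whose second line is `move $a1, $a0`, placed after the `la` instruction that has already written to $a0.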