Lexical Analyzer CS308 Compiler Theory 1
Lexical Analyzer CS308 Compiler Theory 1
Lexical Analyzer Lexical Analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer doesn't return a list of tokens at one shot, it returns a token when the parser asks a token from it. token source Lexical to semantic Parser program Analyzer analysis getNextToken Symbol Table CS308 Compiler Theory 2
Lexical Analyzer • Lexical Analyzer reads the source program character by character to prod k uce tokens. • Normally a lexical analyzer doesn’t return a list of tokens at one shot, it t t k h th k t k f it it returns a token when the parser asks a token from it. CS308 Compiler Theory 2
Token Token represents a set of strings described by a pattern. Identifier represents a set of strings which start with a letter continues with letters and digits The actual string(newval)is called as lexeme. -Tokens:identifier,number,addop,delimeter,... Since a token can represent more than one lexeme,additional information should be held for that specific lexeme.This additional information is called as the attribute of the token. For simplicity,a token may have a single attribute which holds the required information for that token. For identifiers,this attribute is a pointer to the symbol table,and the symbol table holds the actual attributes for that token. ·Some attributes: -<id,attr> where attr is pointer to the symbol table <assgop, no attribute is needed(if there is only one assignment operator) -<num,val> where val is the actual value of the number. Token type and its attribute uniquely identifies a lexeme. Regular expressions are widely used to specify patterns. CS308 Compiler Theory 3
Token • Token represents a set of strings described by a pattern. – Identifier represents a set of strings which sta Identifier represents a set of strings which start with a letter continues with letters and digits with a letter continues with letters and digits – The actual string (newval) is called as lexeme. – Tokens: identifier, number, addop, delimeter, … • Since a token can represent more than one lexeme additional information should be Since a token can represent more than one lexeme, additional information should be held for that specific lexeme. This additional information is called as the attribute of the token. • For simplicity a token may have a single attribute which holds the required For simplicity, a token may have a single attribute which holds the required information for that token. – For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token. • Some attributes: – <id,attr> where attr is pointer to the symbol table – <assgop,_ > no attribute is needed (if there is only one assignment operator) no attribute is needed (if there is only one assignment operator) – <num,val> where val is the actual value of the number. • Token type and its attribute uniquely identifies a lexeme. • Regular expressions are widely used to specify patterns 3 Regular expressions are widely used to specify patterns. CS308 Compiler Theory
Input Buffering Why a compiler needs buffers? ·Buffer Pairs ·Sentinels m*c2eof forward lexemeBegin CS308 Compiler Theory 4
Input Buffering • Why a compiler needs buffers? • Buffer Pairs • Sentinels CS308 Compiler Theory 4
Terminology of Languages Alphabet:a finite set of symbols (ASCII characters) String Finite sequence of symbols on an alphabet Sentence and word are also used in terms of string -8 is the empty string -s is the length of string s. ● Language:sets of strings over some fixed alphabet -3 the empty set is a language. -(s}the set containing empty string is a language The set of well-formed C programs is a language -The set of all possible identifiers is a language. Operators on Strings: -Concatenation:xy represents the concatenation of strings x and y.s s =s &s=s -sh =sss..s(n times)s CS308 Compiler Theory 5
Terminology of Languages • Alphabet : a finite set of symbols (ASCII characters) • String : – Finite sequence of symbols on an alphabet – Sentence and ord are also sed in terms of string Sentence and word are also used in terms of string – ε is the empty string – |s| is the length of string s. • Language: sets of strings over some fixed alphabet – ∅ the empty set is a language. – { ε}h ii i i l } t he set containing empty string is a language – The set of well-formed C programs is a language – The set of all possible identifiers is a language. • Operators on Strings: – Concatenation: xy represents the concatenation of strings x and y. s ε = s ε s = s 5 – s n = s s s .. s ( n times) s 0 = ε CS308 Compiler Theory