948 IEEE TRANSACTIONS ON COMPUTERS,VOL.C-21,NO.9,SEPTEMBER 1972 Some Computer Organizations and Their Efectiveness MICHAEL J.FLYNN,MEMBER,IEEE Abstract-A hierarchical model of computer organizations is more "macroscopic"view,yet without reference to a developed,based on a tree model using request/service type re- particular user environment.Clearly,any such effort sources as nodes.Two aspects of the model are distinguished:logical and physical. must be sharply limited in many aspects;some of the General parallel-or multiple-stream organizations are examined more significant are as follows. as to type and effectiveness-especially regarding intrinsic logical 1)There is no treatment of I/O problems or 1/O as a difficulties. limiting resource.We assume that all programs of inter- The overlapped simplex processor (SISD)is limited by data est will either not be limited by I/O,or the I/O limita- dependencies.Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple- tions will apply equally to all computer memory con- data stream (SIMD)]are analyzed.In particular,a nesting type figurations.That is,the I/O device sees a "black box" explanation is offered for Minsky's conjecture-the performance of computer with a certain performance.We shall be con- a parallel processor increases as log Mf instead of M (the number of cerned with how the computer attained a performance data stream processors). potential,while it may never be realized due to I/O con- Multiprocessors(MIMD)are subjected to a saturation syndrome siderations. based on general communications lockout.Simplified queuing models indicate that saturation develops when the fraction of task time spent 2)We make no assessment of particular instruction locked out(L/E)approaches 1/n,where n is the number of proces- sets.It is assumed that there exists a(more or less)ideal sors.Resources sharing in multiprocessors can be used to avoid set of instructions with a basically uniform execution several other classic organizational problems time-except for data conditional branch instructions Index Terms-Computer organization,instruction stream,over- whose effects will be discussed. lapped,parallel processors,resource hierarchy. 3)We will emphasize the notion of effectiveness (or efficiency)in the use of internal resources as a criterion INTRODUCTION for comparing organizations,despite the fact that either TTEMPTS to codify the structure of a computer condition 1)or 2)may dominate a total performance have generally been from one of three points of assessment. Within these limitations,we will first attempt to view:1)automata theoretic or microscopic; 2)individual problem oriented;or 3)global or statisti- classify the forms or gross structures of computer sys- tems by observing the possible interaction patterns cal. between instructions and data.Then we will examine In the microscopic view of computer structure,rela- tionships are described exhaustively.All possible inter- physical and logical attributes that seem fundamental to achieving efficient use of internal resources (execution actions and parameters are considered without respect to their relative importance in a problem environment facilities,memory,etc.)of the system. Measurements made by using individual problem CLASSIFICATION:FORMS OF COMPUTING SYSTEMS yardsticks compare organizations on the basis of their Gross Structures relative performances in a peculiar environment.Such comparisons are usually limited because of their ad hoc In order to describe a machine structure from a nature. macroscopic point of view,on the one hand,and yet Global comparisons are usually made on the basis of avoid the pitfalls of relating such descriptions to a par- elaborate statistical tabulations of relative performances ticular problem,the stream concept will be used [1]. on various jobs or mixtures of jobs.The difficulty here Stream in this context simply means a sequence of items lies in the fact that the analysis is ex post facto and usu- (instructions or data)as executed or operated on by a ally of little consequence in the architecture of the sys- processor.The notion of“instruction”or“datum”is tem since the premises on which they were based (the defined with respect to a reference machine.To avoid particular computer analyzed)have been changed. trivial cases of parallelism,the reader should consider a The object of this paper is to reexamine the principal reference instruction or datum as similar to those used interactions within a processor system so as to generate a by familiar machines (e.g.,IBM 7090).In this descrip- tion,organizations are categorized by the magnitude Manuscript received February 26,1970;revised May 25,1971, (either in space or time multiplex)of interactions of and January 21,1972.This work was supported by the U.S.Atomic their instruction and data streams.This immediately Energy Commission under Contract AT(11-1)3288. gives rise to four broad classifications of machine or- The author is with the Department of Computer Science,The Johns Hopkins Cniversity,Baltimore,Md.21218. ganizations
948 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-21, NO. 9, SEPTEMBER 1972 Some Computer Organizations and Their Effectiveness MICHAEL J. FLYNN, MEMBER, IEEE Abstract-A hierarchical model of computer organizations is more "macroscopic" view, yet without reference to a developed, based on a tree model using request/service type re- particular user environment. Clearly, any such effort sources as nodes. Two aspects of the model are distinguished: logical m ustb shr limitedtim anyasc somefoft and andphysical. ~~~~~~~mustbe sharply limited In many aspects; some of the physical. mr infcn r sflos General parallel- or multiple-stream organizations are examined more significant are as follows. as to type and effectiveness-especially regarding intrinsic logical 1) There is no treatment of I/O problems or I/O as a difficulties. limiting resource. We assume that all programs of interThe overlapped simplex processor (SISD) is limited by data est will either not be limited by I/O, or the I/O limitadependencies. Branching has a particularly degenerative effect. * w The parallel processors [single-instruction stream-multiple- tionsawill ap eul to all c e me mory con- data stream (SIMD)J are analyzed. In particular, a nesting type figurations That is, the I/O device sees a "black box" explanation is offered for Minsky's conjecture-the performance of computer with a certain performance. We shall be cona parallel processor increases as log M instead of M (the number of cerned with how the computer attained a performance data stream processors). potential, wThile it may never be realized due to I/O conMultiprocessors (MIMD) are subjected to a saturation syndrome siderations. based on general communications lockout. Simplified queuing models We t t indicate that saturation develops when the fraction of task time spent 2) \\e make no assessment of particular instruction locked out (L/E) approaches 1!, where n is the number of proces- sets. It is assumed that there exists a (more or less) ideal sors. Resources sharing in multiprocessors can be used to avoid set of instructions with a basically uniform execution several other classic organizational problems. time-except for data conditional branch instructions Index Terms-Computer organization, instruction stream, over- whose effects will be discussed. lapped, parallel processors, resource hierarchy. 3) We will emphasize the notion of effectiveness (or efficiency) in the use of internal resources as a criterion INTRODUCTION for comparing organizations, despite the fact that either ATTEIMPTS to codify the structure of a computer condition 1) or 2) may dominate a total performance lhave generally been from one of three points of assessment. view: 1) automata theoretic or microscopic; Witlhin these limitations, we will first attempt to .. . ..' ~~~~~~~~classify the forms or gross structures of computer sys- 2) individual problem oriented; or 3) global or statisti- y g y cal. tems by observing the possible interaction patterns In the microscopic view of computer structure,rela-between instructions and data. Then we will examine In the microscopic view of computer structure, rela- phscladogaltriuetatemfnaetl tionshis are escribd exliustivel. All ossibl inter physical and logical attributes that seem fundamental tionships are described exhlaustively. All possible inter-.. . to achieving efficient use of internal resources (execution actions and parameters are considered without respect fclte,mmr,ec)o h ytm to their relative importance in a problem environment. Measurements made by using individual problem CLASSIFICATION: FORMS OF COMPUTING SYSTEMS yardsticks compare organizations on the basis of their Gross Structures relative performances in a peculiar environment. Suclh comparisons are usually limited because of tlheir ad hoc In order to describe a machine structure from a nature. macroscopic point of view, on the one lhand, and yet Global comparisons are usually made on the basis of avoid the pitfalls of relating such descriptions to a parelaborate statistical tabulations of relative performances ticular problem, the stream concept will be used [1]. on various jobs or mixtures of jobs. The difficulty hiere Stream in this context simply means a sequence of items lies in the fact that the analysis is ex post facto and usu- (instructions or data) as executed or operated on by a ally of little consequence in the arclhitecture of the sys- processor. The notion of "instruction" or "datum" is tem since the premises on wvliclh they were based (the defined with respect to a reference machine. To avoid particular computer analyzed) have been changed. trivial cases of parallelism, the reader should consider a Tnhe object of thliS paper is to reexamine thle principal reference instruction or datum as similar to those used interactions withlin a processor system so as to generate a by familiar machines (e.g., IBMi 7090). In this description, organizations are categorized by thle magnitude Manscrpt ecive Feruay 6, 970 reisd My 2, 171 (either in space or time multiplex) of interactions of and January 21, 1972. This work was supported by the U.. S. Atomic their instruction and data streams. Thils immlediately Energy Commission under Conltract AT (11-1) 3288. gvsrs ofu ra lsiiain fmcln r The author is with the Department of Computer Science, The giersetforbadcsiiainsfmcheoJohns Hopkins UJniversity, Baltimore, M/d. 21218. ganizations
FLYNN:COMPUTER ORGANIZATIONS 949 1)The single-instruction stream-single-data stream anticipate a hierarchy of requests,where the transition organization (SISD),which represents most conven- between the initiating level p and the next level tasks tional computing equipment available today. R:is viewed as the logical role of each of the Ri,while 2)The single-instruction stream-multiple-data actual service is through a combination of a tree of stream (SIMD),which includes most array processes, lower level logical requests and eventual physical ser- including Solomon [2](Illiac IV). vice at the leaves of the tree 3)Multiple-instruction stream-single-data stream Thus consider type organizations (MISD),which include specialized f(x,T)→(y,*) streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives where f is a functional mapping (in a mathematical thereof.The plug-board machines of a bygone era were sense)of argument x into result y.Here,x,yEB,where a degenerate form of MISD wherein the instruction B is the set of values defined by the modulo class of the streams were single instructions,and a derived datum arithmetic (logical and physical notions of arithmetic (SD)passed from program step i to program step i+1 should be identical).Ther and r*indicate logical time (MI). or precedence;T is a Boolean control variable whose 4)Multiple-instruction stream-multiple-data stream validity is established by one or more predecessor logical (MIMD),which include organizations referred to as functionsf. "multiprocessor."Univac [3],among other corpora- The requirement for a r control variable stems from tions,has proposed various MIMD structures. the need for specification of the validity of f and its These are qualitative notations.They could be quan-argument x.Notice that two tasksf,f are directly tified somewhat by specifying the number of streams of dependent if either r:=1 implies r;*=1 or rj=1 implies each type in the organization or the number of instruc- r:*=1.This precedence specification may be performed tion streams per data stream,or vice versa.But in order implicitly by use of a restrictive convention (e.g.,by to attain a better insight into the notions of organiza- strict sequence)-that the physical time t at which the tion,let us formalize the stream view of computing.ith task control variable r:becomes valid,t(:=1),has Consider a generalized system model consisting of a (r:=1)>t(ri1*=1),for all ior by explicit control of reguestor and a server.For our purposes we will con- T and r*. sider the requestor as synonymous with a program(or That there can be no general way of rearranging the user)and the server as synonymous with the resources, request sequence f(or finding an alternate sequence both physical and logical,which process the elements of g)such that the precedence requirement vanishes,is a the program.Note that,for now,the distinction be- consequence of the composition requirement f(g(x)), tween a program and a resource is not precise;indeed, intrinsic to the definition of a computable function. under certain conditions a resource may be a program That is,f cannot be applied to an argument until g(x) and vice versa.Then a problem can be defined as a has been completed. stream (or sequence)of requests for service (or re- The notion of an explicit precedence control has been sources).Since each request is,in general,a program and formalized by Leiner 16 and others by use of a prece- can also specify a sequence of requests,we have a re- dence matrix. quest hierarchy. Given an unordered set of requests (tasks) Let p be a program.A program is defined simply as a f1<jn,an nXn matrix is defined so that:ai=1 request for service by a structured set of resources.P if we require t(;=1)2t(*=1),i.e.,task f must be specifies a sequence of other(sub)requests:Ri,R2,R3,completed before f;can be begun.Otherwise,=0. R,···,Ru,called tasks.While the tasks appear here The matrix M so defined identifies the initial priority. as a strictly ordered set,in general the tasks will have a By determining M2(in the conventional matrix product more complex control structure associated with them. sense),secondary implicit precedence is determined. Each request,again,consists of a sequence of sub- This process is continued until requests (the process terminates only at the combi- MP+1=0: national circuit level).Regardless of level,any request R:is a bifurcated function having two roles: The fully determined precedence matrix H is defined as R=f,f H=M+M2+M3+···+MP,P≤ the logical role of the requestor f and the combined where is the Boolean union operation:(a+b) logical and physical role of a server f. =a十b: 1)Logical Role of Reguestor:The logical role of the Thus HI defines a scheduling of precedence among the requestor is to define a result given an argument and to n tasks.At any moment of logical time(T),perhaps a define a control structure or precedence among the other set of tasksf;k either a,=0,for all j,or if a=1, tasks directly defined by the initiating program P.We then 7,*=1 are independently executable
FLYNN: COMPUTER ORGANIZATIONS 949 1) The single-instruction stream-single-data stream anticipate a hierarchy of requests, where the transition organization (SISD), which represents most conven- between the initiating level P and the next level tasks tional computing equipment available today. { Ri } is viewed as the logical role of each of the Ri, while 2) The single-instruction stream-multiple-data actual service is through a combination of a tree of stream (SIMD), which includes most array processes, lower level logical requests and eventual physical serincluding Solomon [2] (Illiac IV). vice at the leaves of the tree. 3) Multiple-instruction stream-single-data stream Thus consider type organizations (M ISD), which include specialized fAx, r) (y r* streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives where fil is a functional mapping (in a mathematical thereof. The plug-board machines of a bygone era were sense) of argument x into result y. Here, x, y&B, where a degenerate form of MISD wherein the instruction B is the set of values defined by the modulo class of the streams were single instructions, and a derived datum arithmetic (logical and physical notions of arithmetic (SD) passed from program step i to program step i+1 should be identical). The r and T* indicate logical time (MI). or precedence; r is a Boolean control variable whose 4) Multiple-Instruction stream-multiple-data stream validity is established by one or more predecessor logical (MIMD), which include organizations referred to as functions {fii }. "multiprocessor." Univac [3], among other corpora- The requirement for a r control variable stems from tions, has proposed various 'M\IIMD structures. the need for specification of the validity of fil and its These are qualitative notations. They could be quan- argument x. Notice that two tasks fil, fj1 are directly tified somewhat by specifying the number of streams of dependent if either 7i= 1 implies Tj* = 1 or rj = 1 implies each type in the organization or the number of instruc- ri* = 1. This precedence specification may be performed tion streams per data stream, or vice versa. But in order implicitly by use of a restrictive convention (e.g., by to attain a better insight into the notions of organiza- strict sequence)-that the physical time t at which the tion, let us formalize the stream view of computing. ith task control variable Tr becomes valid, t(r = 1), has Consider a generalized system model consisting of a t(Tr = 1) . t(Ti1* = 1), for all i-or by explicit control of requestor and a server. For our purposes we will con- r and r*. sider the requestor as synonymous with a program (or That there can be no general way of rearranging the user) and the server as synonymous with the resources, request sequence fil (or finding an alternate sequence both physical and logical, which process the elements of gi') such that the precedence requirement vanishes, is a the program. Note that, for now, the distinction be- consequence of the composition requirement f(g(x)), tween a program and a resource is not precise; indeed, intrinsic to the definition of a computable function. under certain conditions a resource may be a program That is, f cannot be applied to an argument until g(x) and vice versa. Then a problem can be defined as a has been completed. stream (or sequence) of requests for service (or re- The notion of an explicit precedence control has been sources). Since each request is, in general, a program and formalized by Leiner [16] and others by use of a prececan also specify a sequence of requests, we have a re- dence matrix. quest hierarchy. Given an unordered set of requests (tasks) Let P be a program. A program is defined simply as a {fjl' 1 j <n}, an n Xn matrix is defined so that: ai3= 1 request for service by a structured set of resources. P if we require t(Tj = 1) > t(ri* = 1), i.e., task fi must be specifies a sequence of other (sub) requests: R1, R2, R3, completed beforefj can be begun. Otherwise, aij,=0. R4, - , RnsR called tasks. While the tasks appear here The matrix M so defined identifies the initial priority. as a strictly ordered set, in general the tasks will have a By determining M2 (in the conventional matrix product more complex control structure associated with them. sense), secondary implicit precedence is determined. Each request, again, consists of a sequence of sub- This process is continued until requests (the process terminates only at the combinational circuit level). Regardless of level, any request MP±= 0. Ri is a bifurcated function having two roles: The fully determined precedence matrix H is defined as R- fuly Hd=eM+rM2i+eM3 +p.e.e.n+eMa, P < n the logical role of the requestor f si and the combined where + is the Boolean union operation: (a +b)~ logical and physical role of a serverfiv. a+b. 1) Logical Role of Requestor: The logical role of the Thus H defines a scheduling of precedence among the requestor is to define a result given an argument and to n tasks. At any moment of logical time (ri), perhaps a define a control structure or precedence among thle other set of tasks {fk'; k |eithler ack = 0, for all j, or if aj = 1, tasks directly defined by the initiating program P. We then Tj*-=1 } are independently executable
950 IEEE TRANSACTIONS ON COMPUTERS,SEPTEMBER 1972 define a memory hierarchy (space)or a set of primitive operations.Note that since partitions may not be unique,resulting tree structures may also differ.Also note that while leaves of the tree must be requests for physical resources,higher level nodes may also be if all inferior nodes are physical. Since physical time is associated with a physical activity,at the terminal nodes we have fix"..(SXS,)→S, where t is the initiation time and i is the completion terminal node time.Initiation is conditioned on the availability of Fig.1.Service hierarchy. operational resource v and the validity of both source operands (=1).When the operation is complete,the result is placed in a specified sink location,and the con- Since "task"and "instruction"are logically equivalent trol variabler*is set to "1." requests for service,the preceding also applies to the The advantage of this model is that it allows a per- problem of detecting independent instructions in a se- spective of a system at a consistent hierarchical level of quence.In practice,tasks are conditionally issued.This control structure.One may replace the subtree with its limits the use of the precedence computation to the equivalent physical execution time (actual or mean “while running”or dynamic environment. value)and mean operational and spatial resource re- 2)Service Role of Reguest:Thus far we have been quirements if task contention is a factor.The net effect concerned with precedence or task control at a single is to establish a vector space with a basis defined by the level.The service for node R:is defined by fi',which is a reference of the observer.The vector of operations O subtree structure(see Fig.1)terminating in the physical may appear as physical resources at a particular level k resources of the system,i.e.,the physically executable in the tree,but actually may represent a subtree of primitives of the system.Thusf:defines the transition requests-similarly with the storage vector S.The con- from P to R:and among Ri,(2=l,···,z),at a given trol structure defined by the set of request functions level,while f,"defines the substructure under node Ri. f1<i will determine the program structure as Thus f"is actually a hierarchy of subrequests terminat- an interaction of resources on OXS.The reader may ing in a primitive physical service resource. note the similarity,in hierarchy treatment at least These terminal nodes are defined as a request for service lying within the physical resource vocabulary of between the preceding model and the general model of Bell and Newell 28. the system,i.e., Implicit in the foregoing statement is the notion that f..≡v8∈V a physical resource exists in space-time,not space or time alone.For example,N requests may be made for where V is the set of available physical resources.Note an "add"operational resource-these may be served by that the request is generally for any element in a par-N servers each completing its operation in time T or ticular resource class rather than a specific resource v. equivalently by one resource that operates in time The elements are usually of two types:operational T/N.Two parameters are useful in characterizing or storage.A storage resource is a device capable of physical resources [1]-latency and bandwidth.La- retaining a representation of a datum after the initial fency is the total time associated with the processing excitation is removed.The specification of a device is (from excitation to response)of a particular data unit usually performed by coordinate location,contents,or at a phase in the computing process.Bandwidth is an implication.An operational resource is a combinational expression of time-rate of occurrence.In particular, circuit primitive (e.g.,add,shift,transfer,...)that operational bandwidth would be the number of operand performs a(usually)binary mapping SXS-S;S the set pairs processed per unit time. storage (space)resource. If,as a hierarchical reference point,we choose opera- Strictly speaking,since a request for a physical re- tions and operands as used by,for example,an IB\I source v is a request for accessing that resource,there is 7090,we can explore arrangements of,and interactions no "request for storage resource"but rather a request to between,familiar physical resources.The IB\7090 an allocation algorithm to access or modify the memory itself has a trivial control tree.In particular,we have map.Thus a storage resource has operational charac-the SISD-single operational resource operating on a teristics if we include the accessing mechanism in the single pair of storage resources.The multiple-stream storage resource partition. organizations are more interesting,however,as well as Partitions are defined on physical resources which two considerations:1)the latency for interstream com-
950 IEEE TRANSACTIONS ON COMPUTERS, SEPTEMBER 1972 p define a memory hierarchy (space) or a set of primitive operations. Note that since partitions may not be 4^t/ \ \'n unique, resulting tree structures may also differ. Also R2 R R note that while leaves of the tree must be requests for R * physical resources, higher level nodes may also be if all Aft inferior nodes are physical. R11 RIm Since physical time is associated with a physical activity, at the terminal nodes we have <~~~~~~~~~~~~~~~~~~~ fijk .(S X SI lb) --+SI tf fill-v where tb is the initiation time and tf is the completion 0 terminal node time. Initiation is conditioned on the availability of Fig. 1. Service hierarchy. operational resource v and the validity of both source operands (T= 1). When the operation is complete, the resuLIt is placed in a specified sink location, and the con- Since "task" and "instruction" are logically equivalent trol variable * is set to "1." requests for service, the preceding also applies to the The advantage of this model is that it allows a perproblem of detecting independent instructions in a se- spective of a system at a consistent hierarchical level of quence. In practice, tasks are conditionally issued. This control structure. One may replace the subtree with its limits the use of the precedence computation to the equivalent physical execution time (actual or mean "while running" or dynamic environmaent. value) and mean operational and spatial resource re- 2) Service Role of JRequest: Thus far wve have been quirements if task contention is a factor. The net effect concerned with precedence or task control at a single is to establislh a vector space with a basis defined by the level. The service for node Ri is defined byfi:, which is a reference of the observer. The vector of operations O, subtree structure (see Fig. 1) terminating in the physical may appear as physical resources at a particular level k resources of the system, i.e., the physically executable in the tree, but actually may represent a subtree of primitives of the system. Thus f'il defines the transition requests-similarly with the storage vector Sk. The confrom P to Ri and among Rs, (i= 1, n), at a given trol structure defined by the set of request functions level, while fiv defines the substructure under node Ri. {fkil 1 .i.n } will determine the program structure as ThusfiP is actually a hierarchy of subrequests terminat- an interaction of resources on Ok>Sk. The reader may ing in a primitive physical service resource. note the similarity, in hierarchy treatment at least These terminal nodes are defined as a request for between the preceding model and the general model of service lying within the plhysical resource vocabulary of Bell and Ntewell [28 ] . the system, i.e., Implicit in the foregoing statement is the notion that ^ A -tav C V a physical resource exists in space-time, not space or IJik. time alone. For example, N requests may be miiade for where V is the set of available plhysical resources. Note an "add" operational resource-thlese mayv be served 'by that the request is generally for any element in a par- N servers each comlpleting its operation in timiie T or ticular resource class ratlher than a specific resource v. equivalentlvy by one resource that operates in timle The elements {v } are usually of two types: operational T N. Two paramiieters are useful in clharacterizing or storage. A storage resource is a device capable of plhysical resources [1 ]-latency and bandwidth. Laretaining a representation of a datum after tlhe initial tency is tl-he total tinme associated witlh the processing excitation is removed. The specification of a device is (from excitation to response) of a particular data unit usually perfornmed by coordinate location, contents, or at a phase in thle computing process. Bandwidth is an implication. An operational resource is a combinational expression of time-rate of occurrence. In particular, circuit prinmitive (e.g., add, shift, transfer, . ) that operational bandwidth would be the numnber of operand performs a (usually) binary mapping SXS- S;S the set pairs processed per unit time. storage (space) resource. If, as a hierarchical reference point, we choose operaStrictly speaking, since a request for a physical re- tions and operands as used by, for example, an IBM\I source v is a request for accessing that resource, there is 7090, we can explore arrangements of, and interactions no "request for storage resource" but rather a request to between, familiar phlysical resources. The IBMI 7090 an allocation algorithm to access or modify the memory itself haas a trivial control tree. In particular, we have map. Thus a storage resource-~has operational chlarac- the SISD-single operational resource operating on a teristics if we include the accessing mechanism in the single pair of storage resources. The multiple-stream storage resource partition. organizations are more interesting, however, as well as Partitions are defined on phlysical resources which two considerations: 1) the latency for interstream com-
FLYNN:COMPUTER ORGANIZATIONS 951 munication;and 2)the possibilities for high computa- tional bandwidth within a stream. Interstream Communications There are two aspects of communications:operational resource accessing of a storage item(OXS)and storage itJ- to storage transfer (SXS).Both aspects can be repre- i+J+1i sented by communications matrices each of whose entry ti is the time to transfer a datum in the jth storage re- Fig.2.Stream inertia. source to the ith resource (operational or storage).The operational communication matrix is quite useful for MIMD organizations,while the storage communica- perf.max L·△t tions matrix is usually more interesting for SIMD organizations.An(OXO)matrix can also be defined for This is illustrated in Fig.2.Successive instructions are describing MISD organizations. offset in this example by At time units. An alternate form of the communications matrix, called the connection matrix,can also be developed for System Classification the square matrix cases.This avoids possibly large or Then to summarize,a technology independent infinite entries possible in the communications matrix macroscopic specification of a large computing system (when interstream communications fail to exist).The would include:1)the number of instruction streams and reciprocal of the normalized access time ta/t(assuming the number of data streams-the "instruction"and ti is the minimum entry for a row)is entered for the "data"unit should be taken with respect to a convenient access time of an element of the ith data storage re- reference;2)the appropriate communications (or con- source by the jth operational or storage resource di.The nection)matrices;and 3)the stream inertia factor J and minimum access time (resolution)is 1.If a particular the number of time units of instruction execution la- item were inaccessible,there would be a zero entry. tency L. Notice that in comparing parallel organizations to the serial organization,the latter has immediate access to EFFECTIVENESS IN PERFORMING THE corresponding data.While it appears that under certain COMPUTING PROCESS conditions an element expression can be zero due to Resolution of Entropy lack of communication between resources,in practice this does not occur since data can be transferred from Measures of the effectiveness are necessarily problem one stream to another in finite time,however slow.Usu- based.Therefore,comparisons between parallel and ally such transfers occur in a common storage hierarchy. simplex organizations frequently are misleading since such comparisons can be based on different problem Stream Inertia environments.The historic view of parallelism in prob- lems is probably represented best by Amdahl [6]and is It is well known that the action of a single-instruction shown in Fig.3.This viewpoint is developed by the ob- stream may be telescoped for maximum performance by servation that certain operations in a problem environ- overlapping the various constituents of the execution of ment must be done in an absolutely sequential basis. an individual instruction [4].Such overlapping usually These operations include,for example,the ordinary does not exceed the issuing of one instruction per in-housekeeping operations in a program.In order to struction decode resolution time At.This avoids the achieve any effectiveness at all,from this point of view, possibly exponentially increasing number of decision parallel organizations processing N streams must have elements required in such a decoder [1],[5].A recent substantially less than 1/NX100 percent of absolutely study [13]provides an analysis of the multiple-instruc- sequential instruction segments.One can then proceed tion issuing problem in a single-overlapped instruction to show that typically for large N this does not exist in stream.In any event,a certain number of instructions conventional programs.A major difficulty with this in a single-instruction stream are being processed during analysis lies in the concept of "conventional programs" the latency time for one instruction execution.This since this implies that what exists today in the way of number may be referred to as the confluence factor or programming procedures and algorithms must also exist inertia factor J of the processor per individual instruc- in the future.Another difficulty is that it ignores the tion stream.Thus the maximum performance per in-possibility of overlapping some of this sequential pro- struction stream can be enhanced by a factor J.If the cessing with the execution of "parallel"tasks. average instruction execution time is L.At time units, To review this problem from a general perspective, the maximum performance per stream would be consider a problem in which Ni words each of p bits
FLYNN: COMPUTER ORGANIZATIONS 951 munication; and 2) the possibilities for high computa- i Lht- - L. tional bandwidth within a stream. IV.t, Interstream Communications i + J - IThere are two aspects of communications: operational resource accessing of a storage item (OXS) and storage i+J to storage transfer (S XS). Both aspects can be repre- - sented by communications matrices each of whose entry tij is the time to transfer a datum in the jth storage re- Fig. 2. Stream inertia. source to the ith resource (operational or storage). The operational communication matrix is quite useful for J MIMD organizations, while the storage communica- perfLma t tions matrix is usually more interesting for SIMD organizations. An (OXO) matrix can also be defined for This is illustrated in Fig 2 Successive instructions are describing MISD organizations. offset in this example by At time units. An alternate form of the communications matrix, System Classification called the connection matrix, can also be developed for the square matrix cases. This avoids possibly large or Then to summarize, a technology independent infinite entries possible in the communications matrix macroscopic specification of a large computing system (when interstream communications fail to exist). The would include: 1) the number of instruction streams and reciprocal of the normalized access time ti/ti, (assuming the number of data streams-the "instruction" and tii is the minimum entry for a row) is entered for the "data" unit should be taken with respect to a convenient access time of an element of the ith data storage re- reference; 2) the appropriate communications (or consource by the jth operational or storage resource dii. The nection) matrices; and 3) the stream inertia factor J and minimum access time (resolution) is 1. If a particular the number of time units of instruction execution laitem were inaccessible, there would be a zero entry. tency L. Notice that in comparing parallel organizations to the serial organization, the latter has immediate access to corresponding data. While it appears that under certain COMPUTING PROCESS conditions an element expression can be zero due to Resolution of Entropy lack of communication between resources, in practice Measures of the effectiveness are necessarily problem this does not occur since data can be transferred from based. Therefore, comparisons between parallel and one stream to another in finite time, however slow. Usu- . . . a ally such transfers occur in a common storage hierarchy. sipe orizons frequen ar ding since such comparisons can be based on different problem Stream Inertia environments. The historic view of parallelism in problems is probably represented best by Amdalhl [6] and is It is well known that the action of a single-instruction shown in Fig. 3. This viewpoint is developed by the obstream may be telescoped for maximum performance by servation that certain operations in a problem environoverlapping the various constituents of the execution of ment must be done in an absolutely sequential basis. an individual instruction [4]. Such overlapping usually These operations include, for example, the ordinary does not exceed the issuing of one instruction per in- housekeeping operations in a progranm. In order to struction decode resolution time At. This avoids the achieve any effectiveness at all, from this point of view, possibly exponentially increasing number of decision parallel organizations processing N streams must hlave elements required in such a decoder [1], [5]. A recent substantially less than 1/NX100 percent of absolutely study [13] provides an analysis of the multiple-instruc- sequential instruction segments. One can then proceed tion issuing problem in a single-overlapped instruction to show that typically for large N this does not exist in stream. In any event, a certain number of instructions conventional programs. A major difficulty with this in a single-instruction stream are being processed during analysis lies in the concept of "conventional programs" thle latency time for one instruction execution. This since this implies that wvhat exists today in the way of number may be referred to as the confluence factor or programming procedures and algorithms must also exist inertia factor J of the processor per individual instruc- in the future. Anothaer difficulty is that it ignores the tion stream. Thlus the maximum performance per in- possibility of overlapping some of this sequential prostruction stream can be enhanced by a factor J. If the cessing with the execution of "parallel" tasks. average instruction execution time is L-At time units, To reviewr this problem from a general perspective, the maximum performance per stream would be consider a problem in which N1 words each of p bits
952 IEEE TRANSACTIONS ON COMPUTERS,SEPTEMBER 1972 Thus from this most general point of view there is little. difference in the resolution of entropy in space or time. The reader will note,however,that space resolution is not necessarily as effcient as time sequences.In fact,the N/2 number of time sequence operations Ne is a linear func- tion of input size n: N:=k+2 % 2xt洛骆2 while Muller 32 has shown that for combinational Fig.3."Parallelism"in problems. circuits of arbitrary functional basis a measure for the number of circuits required N.is serve as input data.The program or algorithm operates 2n 2n on this input data and produces output results corres- k1一≤N。≤k2 ponding to N2 words each of p bits.If we presume that as a maximum each of the input bits could affect each where n is the number of input lines.See Cook [33]and of the bits in the output,then there are NiXp bits of Cook and Flynn 34 for a more general and complete uncertainty or entropy which specify an output bit. treatment of space-time functional measures.Later in Thus a table-oriented solution algorithm (i.e.,generate this paper we will discuss similar inefficiencies in parallel a table of solutions-one solution for each input combi- SIMD processors. nation)requires 2DNentries and uses N2 bits per entry. The total number of bits required is pN22pN. Recursive Functions We could generalize this for variable length input and Substantial difficulties arise when the preceding "gen- output by treating Na and Ni as the maximum number eral point of view"is reduced to the specific.In particu- of words required to specify output or input informa-lar,the class of functions for which arbitrary "space- tion.Note that this imposes a limit on the size of output time"transformations can be developed is not equiva- strings,and hence the table algorithm is too restrictive lent to the class of all recursive(computable)functions. for generalized computable functions.However,within Recursion (in Kleene's sense [7])is based on the appli- this restricted class the entropy O to be resolved by an cation of the composition operation on a finite set of algorithm is initial functions.This composition operation 8]is the pN2≤Q≤pN22pw1. association of functions h(Xi,Xg,···,Xn)with F(X,··,X,g1(X,··,X)g2(X,··,X, The lower bound also needs special interpretation for ···,gn(Xi,···,Xn,so that(Xi,···,Xm) trivial algorithms (e.g.,strings of identical ones);the =F(g1(Xi,···,Xn),···,gn(X,···,Xn).Clearly, pN2 bits were assumed to be independently derived. the application of F on the results of g:is a sequential Hence the N2 bits must be reduced by their depen- operation.Any general attempts to resolve the func- dency.(See Kolmogarov [10]for an alternate notion on tional entropy without such sequential (time)depen- the specification of Q.) dence leads to "recursion"based on infinite sets of initial In any event,O bits of uncertainty must be resolved. functions. This resolution may be performed in space or time. Bernstein [9]has developed three (rather strong) Typically it is resolved in space by combinatorial logic. sufficient conditions for two programs to operate in Each decision element may resolve between zero and parallel based on the referencing of partitions of storage. one bit of information depending upon the associated As in the preceding discussion,these conditions specify probabilities of the binary outcomes.Usually a useful limitations on the interactions between programs. repertoire or vocabulary of logical decisions is available Notice that none of the foregoing serves to limit the to an instruction stream.Let an element of the vocabu-capability of a processor organized in a parallel fashion lary consist of M decisions.Depending upon the type of to perform computation,but rather serves as a limit operation to be performed,these execution unit decision on the efficiency of such an operation.Also note that the elements resolve less than M bits of information;thus composing function F may induce similar inefficiencies Q≤m'MN, in a confluent simplex processor (depending on the nature of F and the last g:to be evaluated).Such per- where m'is the number of data stream execution units formance degradation will be discussed later.The com- and N.is the number of time operations that were used position mechanism that causes a problem here is an (i.e.,number of sequential instructions).In order to interprogram action,while the stream inertia difficulty retire the complete algorithm,of course,a sequence of occurs more prominently in purely intraprogram condi- operations is performed to execute the instruction tional actions.Indeed,there are techniques for the stream;each operation in the sequence resolves or ex- elimination of branches in simple programs by use of ceeds the required amount of entropy for a solution.Boolean test variables (0 or 1)operating multiplica-
952 IEEE TRANSACTIONS ON COMPUTERS, SEPTEMBER 1972 Thus from this most general point of view there is little. . difference in the resolution of entropy in space or time. .4 [\ The reader will note, however, that space resolution is not necessarily as efficient as time sequences. In fact, the NlR' \^|t/2 number of time sequence operations Nt is a linear function of input size n: _ _ _ _ Nt = k-n too °%of ptublem requzefxq iv sequent4a7 executw# while Muller [32] has shown that for combinational Fig. 3. "Parallelism" in problems. circuits of arbitrary functional basis a measure for the number of circuits required N, is serve as input data. The program or algorithm operates 2" 2" on this input data and produces output results corres- k1-. Ns . k2 - ponding to N2 words each of p bits. If we presume that as a maximum each of the input bits could affect each where n is the number of input lines. See Cook [33] and of the bits in the output, then there are NjXp bits of Cook and Flynn [34] for a more general and complete uncertainty or entropy which specify an output bit. treatment of space-time functional measures. Later in Thus a table-oriented solution algorithm (i.e., generate this paper we will discuss similar inefficiencies in parallel a table of solutions-one solution for each input combi- SIMD processors. nation) requires 2PN1 entries and uses pN2 bits per entry. The total number of bits required is pN22PN1. We could generalize this for variable length input and Substantial difficulties arise when the preceding "genoutput by treating N2 and N1 as the maximum number eral point of view" is reduced to the specific. In particuof words required to specify output or input informa- lar, the class of functions for which arbitrary "spacetion. Note that this imposes a limit on the size of output time" transformations can be developed is not equivastrings, and hence the table algorithm is too restrictive lent to the class of all recursive (computable) functions. for generalized computable functions. However, within Recursion (in Kleene's sense [7]) is based on the applithis restricted class the entropy Q to be resolved by an cation of the composition operation on a finite set of algorithm is initial functions. This composition operation [8] is the pN2 Q < pN2.2PN1. association of functions h(X1, X2, , Xn) with - - ~~~~~~~~F(Xlt* X,), gl(XI,* Xn), g2(X17 . , Xn) i The lower bound also needs special interpretation for , gn(Xl, , Xn), so that h(X1, , Xj) trivial algorithms (e.g., strings of identical ones); the =F(gi(Xj, Xn), , gn(Xl, Xn)). Clearly, pN2 bits were assumed to be independently derived. the application of F on the results of gi is a sequential Hence the pN2 bits must be reduced by their depen- operation. Any general attempts to resolve the funcdency. (See Kolmogarov [10] for an alternate notion on tional entropy without such sequential (time) depenthe specification of Q.) dence leads to "recursion" based on infinite sets of initial In any event, Q bits of uncertainty must be resolved. functions. This resolution may be performed in space or time. Bernstein [9] has developed three (rather strong) Typically it is resolved in space by combinatorial logic. sufficient conditions for two programs to operate in Each decision element may resolve between zero and parallel based on the referencing of partitions of storage. one bit of information depending upon the associated As in the preceding discussion, these conditions specify probabilities of the binary outcomes. Usually a useful limitations on the interactions between programs. repertoire or vocabulary of logical decisions is available Notice that none of the foregoing serves to limit the to an instruction stream. Let an element of the vocabu- capability of a processor organized in a parallel fashion lary consist of M decisions. Depending upon the type of to perform computation, but rather serves as a limit operation to be performed, these execution unit decision on the efficiency of such an operation. Also note that the elements resolve less than M bits of information; thus composing function F may induce similar inefficiencies Q<m'MAT, - in a confluent simplex processor (depending on the ~~~~~~nature of F and the last gi to be evaluated). Such perwhere m' is the number of data stream execution units formance- degradation will be discussed later. The comand N1 is the number of time operations that were used position mechanism that causes a problem here is an (i.e., number of sequential instructions). In order to interprogram action, while the stream inertia diffculty retire the complete algorithm, of course, a sequence of occurs more prominently in purely intraprogram condioperations is performed to execute thae instruction tional actions. Indeed, there are techniques for the stream; eachl operation in the sequence resolves or ex- elimination of branches in simple programs by use of ceeds thae required amount of entropy for a solution. Boolean test variables (0 or 1) operating multiplica-