Revision 2.0 2.3 Performance Considerations Deep pipelining capability allows the A.G.P.to achieve a total memory READ throughput equal to that which is possible for memory WRITE transactions.This capability,coupled with optional higher transfer rates and address de-multiplexing,allows a full order of magnitude increase in memory read throughput over today's PCI implementations.However,many typical desktop platforms will not have sufficient total memory performance to allow full utilization of the A.G.P.capabilities,and ultimately platform issues are likely to predominate in the upper performance limit deliverable through the A.G.P.This makes it very difficult to provide as part of this interface specification a single set of performance guarantees or targets the graphics designers can depend upon.It is clear that in order to optimize a graphics design for the most effective use of the A.G.P.,it will need to be targeted at a specific subset of platforms and/or corelogic devices. In an attempt to provide the best possible information for such an optimization,this interface specification defines a few common performance parameters that may be of general interest,and recommends that corelogic vendors and/or OEMs provide these parameters for their systems to respective graphics IHVs. The following are the basic parameters that each corelogic set and/or system implementation should provide as a performance baseline for IHVs targeting that platform: Guaranteed Latency:a useable worst case A.G.P.memory access latency via the high priority queue,as measured from the clock on which the request(REQ#)signal is asserted until the first clock of data transfer. Assumptions:no outstanding A.G.P.Requests(pipeline empty)and no waitstates or control flow asserted by the graphics master-master is ready to transfer data on any clock(inserting n clocks of control flow may delay response by more than n clocks). Typical Latency:the typical A.G.P.memory access latency via the low priority queue,as measured from the clock on which the request(REQ#)signal is asserted until the first clock of data transfer. Assumptions:no outstanding A.G.P.Requests(pipeline empty)and no waitstates or control flow asserted by the graphics master-master is ready to transfer data on any clock(inserting n clocks of control flow may delay response by more than n clocks). Mean bandwidth:deliverable A.G.P.memory bandwidth via the low priority queue,averaged across~10 ms (one frame display time). Assumptions:no accesses to the high priority queue;graphics master maintains optimal pipeline depth ofx; average access length of y;no waitstates or control flow asserted by the graphics master. 2 Memory read throughput on PCI is about half of memory write throughput,since memory read access time is visible as wait states on this unpipelined bus. 26
Revision 2.0 26 2.3 Performance Considerations Deep pipelining capability allows the A.G.P. to achieve a total memory READ throughput equal to that which is possible for memory WRITE2 transactions. This capability, coupled with optional higher transfer rates and address de-multiplexing, allows a full order of magnitude increase in memory read throughput over today’s PCI implementations. However, many typical desktop platforms will not have sufficient total memory performance to allow full utilization of the A.G.P. capabilities, and ultimately platform issues are likely to predominate in the upper performance limit deliverable through the A.G.P. This makes it very difficult to provide as part of this interface specification a single set of performance guarantees or targets the graphics designers can depend upon. It is clear that in order to optimize a graphics design for the most effective use of the A.G.P., it will need to be targeted at a specific subset of platforms and/or corelogic devices. In an attempt to provide the best possible information for such an optimization, this interface specification defines a few common performance parameters that may be of general interest, and recommends that corelogic vendors and/or OEMs provide these parameters for their systems to respective graphics IHVs. The following are the basic parameters that each corelogic set and/or system implementation should provide as a performance baseline for IHVs targeting that platform: • Guaranteed Latency: a useable worst case A.G.P. memory access latency via the high priority queue, as measured from the clock on which the request (REQ#) signal is asserted until the first clock of data transfer. Assumptions: no outstanding A.G.P. Requests (pipeline empty) and no waitstates or control flow asserted by the graphics master - master is ready to transfer data on any clock (inserting n clocks of control flow may delay response by more than n clocks). • Typical Latency: the typical A.G.P. memory access latency via the low priority queue, as measured from the clock on which the request (REQ#) signal is asserted until the first clock of data transfer. Assumptions: no outstanding A.G.P. Requests (pipeline empty) and no waitstates or control flow asserted by the graphics master - master is ready to transfer data on any clock (inserting n clocks of control flow may delay response by more than n clocks). • Mean bandwidth: deliverable A.G.P. memory bandwidth via the low priority queue, averaged across ~10 ms (one frame display time). Assumptions: no accesses to the high priority queue; graphics master maintains optimal pipeline depth of x ; average access length of y ; no waitstates or control flow asserted by the graphics master. 2 Memory read throughput on PCI is about half of memory write throughput, since memory read access time is visible as wait states on this unpipelined bus
Revision 2.0 2.4 Platform Dependencies Due to the close coupling of the A.G.P.and main memory subsystem,there are several behaviors of the A.G.P.that will likely end up being platform dependent.While the objective of any specification is to minimize such differences,it is apparent that several are probable among A.G.P.corelogic and platform implementations.This should not,however,have as much impact as it would in other buses for two reasons: The A.G.P.is a point-to-point connection,intended for use by a 3D graphics accelerator only,and, Due to performance issues(Section 2.3),an A.G.P.graphics master will likely need to be optimized to a specific subset of platform or corelogic implementations anyway The purpose of this section is to identify by example some of the areas where behavioral differences are likely,and accordingly establish the scope of this interface specification. As one example of potential variation,consider the two corelogic architectures shown in Figure 2-3.An integrated approach,typical of desktop and volume systems,is shown on the left,and a symmetric multiprocessor partitioning, typical of MP servers,is shown on the right. Sys Sys Proc roc Proc Mem Mem MP Bus Gfx AGP Chipset Sys Br.] B Accel Mem Gfx AGP PCI PCI Accel _FB 阿阿阿 LFB Figure 2-3:Different Corelogic Architectures The following items are examples of areas where behavioral differences between these or other implementations could well occur: GART Implementation For various reasons,different systems may opt for different GART table implementations and layouts. This,however,is not visible since the actual table implementation is abstracted to a common API by the HAL or miniport driver supplied with the corelogic. Coherency with Processor Cache Due to the high potential access rate on the A.G.P.,it is not advisable from a performance perspective to snoop all accesses.Selective snooping in an integrated corelogic architecture presents serious queue 3 This means that active communication can only occur between two A.G.P.agents that reside on the interface, where one agent is referred to as the A.G.P.target and the other the A.G.P.master.The simplest implementation is to only have two devices attached to the bus.Attaching more than two devices to the interface is not precluded as long as there is only one active master and one active target.Any other device must not respond to or interfere with the interface operation.When more than two devices are attached to the interface,the system designer is responsible to ensure that all requirements of this interface specification are met,since the component and/or add-in card designer has no control how the devices are used. 27
Revision 2.0 27 2.4 Platform Dependencies Due to the close coupling of the A.G.P. and main memory subsystem, there are several behaviors of the A.G.P. that will likely end up being platform dependent. While the objective of any specification is to minimize such differences, it is apparent that several are probable among A.G.P. corelogic and platform implementations. This should not, however, have as much impact as it would in other buses for two reasons: • The A.G.P. is a point-to-point3 connection, intended for use by a 3D graphics accelerator only, and, • Due to performance issues (Section 2.3), an A.G.P. graphics master will likely need to be optimized to a specific subset of platform or corelogic implementations anyway. The purpose of this section is to identify by example some of the areas where behavioral differences are likely, and accordingly establish the scope of this interface specification. As one example of potential variation, consider the two corelogic architectures shown in Figure 2-3. An integrated approach, typical of desktop and volume systems, is shown on the left, and a symmetric multiprocessor partitioning, typical of MP servers, is shown on the right. Proc. LFB Gfx Accel Chipset I/O I/O I/O AGP PCI Proc. Proc. Sys Mem Sys Mem Sys Mem MP Bus I/O I/O I/O PCI Br. Br. LFB Gfx Accel AGP Figure 2-3: Different Corelogic Architectures The following items are examples of areas where behavioral differences between these or other implementations could well occur: GART Implementation For various reasons, different systems may opt for different GART table implementations and layouts. This, however, is not visible since the actual table implementation is abstracted to a common API by the HAL or miniport driver supplied with the corelogic. Coherency with Processor Cache Due to the high potential access rate on the A.G.P., it is not advisable from a performance perspective to snoop all accesses. Selective snooping in an integrated corelogic architecture presents serious queue 3 This means that active communication can only occur between two A.G.P. agents that reside on the interface, where one agent is referred to as the A.G.P. target and the other the A.G.P. master. The simplest implementation is to only have two devices attached to the bus. Attaching more than two devices to the interface is not precluded as long as there is only one active master and one active target. Any other device must not respond to or interfere with the interface operation. When more than two devices are attached to the interface, the system designer is responsible to ensure that all requirements of this interface specification are met, since the component and/or add-in card designer has no control how the devices are used
Revision 2.0 management problems,while the MP bus in an MP corelogic architecture could well deal with selective snoops very easily.As a result,processor cache snooping on A.G.P.accesses is chipset dependent,and may not be counted upon in general.The exact coherency requirements for this interface specification are as follows: Processor cache coherency must be guaranteed by the corelogic for all PCI transactions (transactions initiated with the FRAME#signal)regardless of which bus segment(A.G.P.or PCI) they originate on.This is consistent with normal PCI operation.Coherency means that at the moment a PCI/A.G.P.memory request is serviced(completed),the subject data are fully consistent with any valid contents of the processor caching mechanism(excluding processor write combining buffers)for any of the target locations in the access For A.G.P.transactions (transactions initiated with the PIPE#signal or on the de-multiplexed address bus,SBA port),there are no specific coherency requirements,and behavior is chipset dependent.It is entirely possible that some implementations will return stale read data,or allow write data to be overwritten,e.g.,by a processor cache write back.Other implementations may provide full coherency support.For this reason,any device driver managing an A.G.P.device is required to insure that A.G.P.transactions targeted at cacheable memory are safe.In practice,this caution applies mostly to A.G.P.transactions outside the GART address range,since memory allocated inside the GART address range by the normal A.G.P.memory allocation procedure will be of a cache type(e.g.,WC)consistent with the hardware's native ability to provide coherent access.In general,an A.G.P.device driver may determine the level of chipset coherency support via the chipset ID available in the Microsoft*Windows*95/Windows NT*registry. Note:Options for supporting A.G.P.coherent access to cacheable memory (ie.,snooping required)are under consideration.Specific requirements on this topic are subject to change/elaboration. Bus-to-Bus Traffic Capability Bus masters on either the A.G.P.or the PCI bus will routinely access system memory.However,it is possible to also address(PCI)targets on the other bus or port,which effectively requires a PCI-to-PCI bridge in the integrated chipset.Pushing WRITES through this bridge is fairly simple,whereas pushing READS through requires a complete bridge implementation,and it is not clear this would ever be utilized. Therefore,this interface specification only requires support for memory WRITEs between a PCI master on the PCI bus and a(PCI)target on the A.G.P.bus.Support of any other transaction between the two interfaces is optional and in general,should not be assumed to work. Address Re-Mapping Support for PCI Bus Master Accesses to Memory GART range address remapping support is provided by the chipset for graphics devices attached to the A.G.P.interface.Remapping support is,in general,not provided for devices attached to standard PCI bus(es).However,to provide a consistent addressing model across the system,this interface specification requires the following processor dependent,address remapping support for accesses presented at a PCI port:1)Chipsets that allow the processor to generate physical addresses in the GART range(i.e.,GART address remapping is supported for processor accesses as well as for A.G.P.accesses)must support an identical remapping service for the PCI port.2)Chipsets that do not support remapping for processor 4 By way of example,this allows for a video stream generator(e.g,capture,decode,etc.)on PCI to write to the graphics frame buffer on the A.G.P.interface. 5 The chipset implementation of GART remapping is related to the memory management implementation of the attached processor,including,for example,the behavior of MTRRs.However,this reference to processor implementation is made as an example for clarification purposes only.This specification does not place any requirements on processor behavior,nor stipulate any particular processor behavior. 28
Revision 2.0 28 management problems, while the MP bus in an MP corelogic architecture could well deal with selective snoops very easily. As a result, processor cache snooping on A.G.P. accesses is chipset dependent, and may not be counted upon in general. The exact coherency requirements for this interface specification are as follows: • Processor cache coherency must be guaranteed by the corelogic for all PCI transactions (transactions initiated with the FRAME# signal) regardless of which bus segment (A.G.P. or PCI) they originate on. This is consistent with normal PCI operation. Coherency means that at the moment a PCI/A.G.P. memory request is serviced (completed), the subject data are fully consistent with any valid contents of the processor caching mechanism (excluding processor write combining buffers) for any of the target locations in the access. • For A.G.P. transactions (transactions initiated with the PIPE# signal or on the de-multiplexed address bus, SBA port), there are no specific coherency requirements, and behavior is chipset dependent. It is entirely possible that some implementations will return stale read data, or allow write data to be overwritten, e.g., by a processor cache write back. Other implementations may provide full coherency support. For this reason, any device driver managing an A.G.P. device is required to insure that A.G.P. transactions targeted at cacheable memory are safe. In practice, this caution applies mostly to A.G.P. transactions outside the GART address range, since memory allocated inside the GART address range by the normal A.G.P. memory allocation procedure will be of a cache type (e.g., WC) consistent with the hardware’s native ability to provide coherent access. In general, an A.G.P. device driver may determine the level of chipset coherency support via the chipset ID available in the Microsoft* Windows* 95/Windows NT* registry. Note: Options for supporting A.G.P. coherent access to cacheable memory (i.e., snooping required) are under consideration. Specific requirements on this topic are subject to change/elaboration. Bus-to-Bus Traffic Capability Bus masters on either the A.G.P. or the PCI bus will routinely access system memory. However, it is possible to also address (PCI) targets on the other bus or port, which effectively requires a PCI-to-PCI bridge in the integrated chipset. Pushing WRITES through this bridge is fairly simple, whereas pushing READS through requires a complete bridge implementation, and it is not clear this would ever be utilized. Therefore, this interface specification only requires support for memory WRITES between a PCI master on the PCI bus and a (PCI) target on the A.G.P.4 bus. Support of any other transaction between the two interfaces is optional and in general, should not be assumed to work. Address Re-Mapping Support for PCI Bus Master Accesses to Memory GART range address remapping support is provided by the chipset for graphics devices attached to the A.G.P. interface. Remapping support is, in general, not provided for devices attached to standard PCI bus(es). However, to provide a consistent addressing model across the system, this interface specification requires the following processor dependent5 , address remapping support for accesses presented at a PCI port: 1) Chipsets that allow the processor to generate physical addresses in the GART range (i.e., GART address remapping is supported for processor accesses as well as for A.G.P. accesses) must support an identical remapping service for the PCI port. 2) Chipsets that do not support remapping for processor 4 By way of example, this allows for a video stream generator (e.g., capture, decode, etc.) on PCI to write to the graphics frame buffer on the A.G.P. interface. 5 The chipset implementation of GART remapping is related to the memory management implementation of the attached processor, including, for example, the behavior of MTRRs. However, this reference to processor implementation is made as an example for clarification purposes only. This specification does not place any requirements on processor behavior, nor stipulate any particular processor behavior
Revision 2.0 accesses(i.e.,the processor resolves its own GART range addresses to valid physical memory ranges)must NOT do any remapping of addresses presented at the PCI port. In the case where GART remapping is not provided at the PCI port,any address in the GART address range that is presented at the PCI port must be deemed an access error,and dealt with consistent with the error handling policies of that platform.In general,out of bounds or otherwise erroneous PCI requests do not receive any response and result in a master abort.In no case may an address falling in the GART range be propagated through the corelogic without remapping. Device drivers managing PCI bus master devices should always use the standard system call to de-reference addresses being passed to the hardware.This will guarantee that the correct address is used for the current platform and level of chipset support.Conversely,device drivers managing A.G.P.bus master devices should always de-reference virtual addresses via the new or alternate system call associated with the new A.G.P.VMM services described in the Windows DDK.This will guarantee that the linear GART address is always used on the A.G.P.port,independent of platform differences described here.PCI bus master device drivers should never use this new de-referencing call. Monochrome Device Adapter Support Existing PCI platforms typically support coexistence of a VGA device with a second monochrome device adapter (MDA),for debug and software development.A.G.P.corelogic implementations may elect to support this capability depending on their specific market needs.Possible implementations include static detection of MDA adapters present off the PCI bus,and rerouting of MDA accesses to the PCI bus under BIOS control;or snooping of A.G.P.directed accesses to dynamically detect disabling of MDA resources on the A.G.P.device and subsequent re-routing to the PCI bus.To ensure interoperation with possible corelogic implementations,additional requirements for A.G.P.graphics controllers with respect to MDA resources are specified in this document(see Section 6.2). Performance As already discussed in Section 2.3,a variety of performance parameters will likely have chipset and/or platform dependencies. These are examples of platform dependencies that fall outside the scope of this interface specification.In general, the scope of this interface specification is limited to the electrical and logical behavior of the actual A.G.P.interface signals,the mechanical definition of an A.G.P.add-in board,and the A.G.P.configuration registers which control the graphics address remapping function. 29
Revision 2.0 29 accesses (i.e., the processor resolves its own GART range addresses to valid physical memory ranges) must NOT do any remapping of addresses presented at the PCI port. In the case where GART remapping is not provided at the PCI port, any address in the GART address range that is presented at the PCI port must be deemed an access error, and dealt with consistent with the error handling policies of that platform. In general, out of bounds or otherwise erroneous PCI requests do not receive any response and result in a master abort. In no case may an address falling in the GART range be propagated through the corelogic without remapping. Device drivers managing PCI bus master devices should always use the standard system call to de-reference addresses being passed to the hardware. This will guarantee that the correct address is used for the current platform and level of chipset support. Conversely, device drivers managing A.G.P. bus master devices should always de-reference virtual addresses via the new or alternate system call associated with the new A.G.P. VMM services described in the Windows DDK. This will guarantee that the linear GART address is always used on the A.G.P. port, independent of platform differences described here. PCI bus master device drivers should never use this new de-referencing call. Monochrome Device Adapter Support Existing PCI platforms typically support coexistence of a VGA device with a second monochrome device adapter (MDA), for debug and software development. A.G.P. corelogic implementations may elect to support this capability depending on their specific market needs. Possible implementations include static detection of MDA adapters present off the PCI bus, and rerouting of MDA accesses to the PCI bus under BIOS control; or snooping of A.G.P. directed accesses to dynamically detect disabling of MDA resources on the A.G.P. device and subsequent re-routing to the PCI bus. To ensure interoperation with possible corelogic implementations, additional requirements for A.G.P. graphics controllers with respect to MDA resources are specified in this document (see Section 6.2). Performance As already discussed in Section 2.3, a variety of performance parameters will likely have chipset and/or platform dependencies. These are examples of platform dependencies that fall outside the scope of this interface specification. In general, the scope of this interface specification is limited to the electrical and logical behavior of the actual A.G.P. interface signals, the mechanical definition of an A.G.P. add-in board, and the A.G.P. configuration registers which control the graphics address remapping function