Field Programmable Gate Arrays

Industry background

Xiaoyao Liang, in Ascend AI Processor Architecture and Programming, 2020

2.1.4 FPGA

Field-programmable gate arrays (FPGAs) were originally developed as hardware prototyping systems in electronics [12]. While GPUs and TPUs play important roles in the AI field, FPGAs enable developers familiar with hardware description languages to implement AI algorithms quickly and achieve considerable acceleration, thanks to their highly flexible hardware programmability, the parallelism of their computing resources, and relatively mature toolchains. After years in a "supporting role," FPGAs are now widely used in artificial intelligence and have brought a new architecture choice to the industry.

The first FPGA, the XC2064, was launched in 1985 by Xilinx, which went on to develop a series of FPGA devices dedicated to flexibly programmable electronic hardware systems. Driven by the demand for neural network computing power, FPGAs were first applied to neural networks in 1994. As modern deep learning algorithms grew larger and more complex, FPGAs began to show their unique network acceleration capabilities. FPGAs consist of programmable logic hardware units that can be dynamically programmed to realize the required logic functions. This capability makes FPGAs applicable to many domains, and they are widely used. Properly optimized, FPGAs offer high performance and low power consumption. Their flexible nature gives them an advantage over hardware with fixed functionality.

A distinctive feature of FPGAs is reconfigurability: the hardware can be reprogrammed to change its functionality, allowing multiple hardware designs to be tested in pursuit of optimal performance. In this way, an optimal neural network design for a specific application scenario can be found. Reconfigurability is categorized as static or dynamic. Static reconfiguration reprograms the hardware before execution to adapt it to the system's functions; dynamic reconfiguration modifies the hardware during program execution, based on the requirements at hand.
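To make the static/dynamic distinction concrete, here is a toy Python model (purely illustrative, not a real FPGA toolflow) in which a block's function is a lookup table loaded from configuration bits: "static" reconfiguration loads the bits before the run, while "dynamic" reconfiguration swaps them mid-stream without discarding data already produced.

```python
# Toy model: a "fabric" block whose function is a 2-input LUT
# loaded from configuration bits.
class ReconfigurableBlock:
    def __init__(self, config_bits):
        self.load(config_bits)

    def load(self, config_bits):
        # config_bits[i] is the output for the 2-bit input pattern i,
        # i.e., the truth table of a 2-input LUT.
        self.lut = list(config_bits)

    def eval(self, a, b):
        return self.lut[(a << 1) | b]

# Static reconfiguration: program the block, then run.
block = ReconfigurableBlock([0, 0, 0, 1])      # configured as AND
and_results = [block.eval(a, b) for a in (0, 1) for b in (0, 1)]

# Dynamic reconfiguration: swap the configuration mid-stream,
# keeping the results already produced.
stream, outputs = [(0, 0), (1, 1), (0, 1), (1, 1)], []
for i, (a, b) in enumerate(stream):
    if i == 2:
        block.load([0, 1, 1, 1])               # reprogram to OR at run time
    outputs.append(block.eval(a, b))
print(outputs)  # first two pairs go through AND, last two through OR
```

The class and its methods are invented for the example; real configuration data is a vendor-specific bitstream, not a Python list.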

Reconfigurability gives FPGAs advantages in deep neural networks, but it incurs some costs. Reprogramming may be slow, which is often unacceptable for real-time programs. FPGAs are also expensive: for large-scale deployment, their cost exceeds that of dedicated ASICs. Finally, reconfiguring an FPGA typically requires programming in a hardware description language (HDL) such as Verilog or VHDL. These languages are more difficult and complex than high-level software programming languages and are not easily mastered by most programmers.

In June 2019, Xilinx launched its next-generation Versal series [13], an adaptive compute acceleration platform (ACAP) and a new class of heterogeneous computing device. With Versal as the most recent example, FPGAs have evolved from basic programmable logic gate arrays into dynamically configurable, domain-specific hardware. Versal is built on a 7 nm process and, for the first time, combines programmable and dynamically configurable hardware, integrating scalar engines, AI inference engines, and FPGA programmable logic for embedded computing. This gives it flexible, multifunction capability. In some applications, its computing performance and power efficiency exceed those of GPUs. Versal offers high computing performance and low latency; it focuses on AI inference and benefits from its powerful, dynamic adaptability in autonomous driving, data centers, and 5G network communication.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128234884000023

The Software-Defined Radio as a Platform for Cognitive Radio

Max Robert, Bruce A. Fette, in Cognitive Radio Technology (Second Edition), 2009

Field-Programmable Gate Arrays

Field-programmable gate arrays are programmable devices that are different in nature from GPPs and DSPs. An FPGA comprises a discrete set of units, variously referred to as logic elements (LEs), logic modules (LMs), slices, or some other name for a self-contained Boolean logic unit. Each of these logical devices has at least one logic unit; depending on the FPGA family, this could be one or more multipliers, one or more accumulators, or a combination of such units. Logical devices are also likely to contain some memory, usually a few bits. The developer has some freedom to configure each of these logical devices, where the level of reconfigurability is determined by the FPGA manufacturer. The logical devices are set in a logic fabric, a reconfigurable connection matrix that allows the developer to describe connections between different logical devices. The logic fabric usually also has access to additional memory that logical devices can share. Timing of the different parts of the FPGA can be controlled by establishing clock domains, which allows the implementation of multirate systems on a single FPGA.

To program an FPGA, the developer describes the connections between logical devices as well as the Boolean functionality of each of them. The final design the developer generates is a circuit rather than a program in the traditional sense, even though the FPGA is ostensibly a firmware-programmable device. Development for the FPGA is done using languages such as the VHSIC Hardware Description Language (VHDL), which can also be used to describe application-specific integrated circuits (ASICs), essentially nonprogrammable chips. Variants of C exist, such as SystemC, that allow the developer to use C-like constructs to develop FPGA code, but the resulting program still describes a logic circuit running on the FPGA.
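As an illustration of "connections between logical devices plus a Boolean function per device," the sketch below models a design as a tiny netlist evaluated in Python. The netlist format and node names are invented for this example and do not correspond to any real HDL or vendor format.

```python
# Minimal netlist sketch: each node is (Boolean function, source nodes).
def evaluate(netlist, inputs):
    """Evaluate a combinational netlist given primary input values."""
    values = dict(inputs)

    def value(node):
        if node not in values:
            fn, sources = netlist[node]
            values[node] = fn(*(value(s) for s in sources))
        return values[node]

    return {node: value(node) for node in netlist}

# A half adder: two "logical devices" (XOR, AND) wired to shared inputs.
design = {
    "sum":   (lambda a, b: a ^ b, ["a", "b"]),
    "carry": (lambda a, b: a & b, ["a", "b"]),
}
print(evaluate(design, {"a": 1, "b": 1}))  # {'sum': 0, 'carry': 1}
```

Synthesis tools do the reverse of this: they turn the described functions and connections into LUT configurations and fabric routing.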

The most appealing aspect of FPGAs is their computational power. For example, a typical signal-processing FPGA can have anywhere from 1000 to 44,000 slices, where each slice is composed of two lookup tables, two flip-flops, some math, some logic, and some memory. Implementing 802.11a, a communications standard beyond the abilities of any traditional DSP in 2005, would require approximately 3000 slices, or less than 50 percent of the FPGA's capacity. This shows the importance of a high degree of parallelism: many multiply-accumulators implement the waveform's complex signal-processing stages in parallel.

From a performance standpoint, the most significant drawback of an FPGA is that it consumes a significant amount of power, making it less practical for battery-powered handheld subscriber solutions. For example, an FPGA with about 9000 slices is rated at slightly over 2 W of power expenditure, whereas a low-power DSP for handheld use is rated at 65 to 160 mW, depending on clock speed and version.
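The figures quoted above can be checked with a little arithmetic; the script below uses only the numbers from the text (about 3000 slices for 802.11a, a 9000-slice device, and 2 W versus 65 to 160 mW).

```python
# Back-of-the-envelope check of the slice and power figures above.
slices_needed = 3000            # approximate slices for 802.11a (from the text)
fpga_slices = 9000              # the 9000-slice FPGA in the power comparison
utilization = slices_needed / fpga_slices
print(f"utilization: {utilization:.0%}")   # about a third of the device

fpga_power_w = 2.0                  # FPGA power rating from the text
dsp_power_w = (0.065, 0.160)        # low-power handheld DSP range from the text
ratio_low, ratio_high = (fpga_power_w / p for p in reversed(dsp_power_w))
print(f"FPGA draws roughly {ratio_low:.0f}x to {ratio_high:.0f}x the DSP's power")
```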


URL:

https://www.sciencedirect.com/science/article/pii/B9780123745354000035

Configurable Computing

Wayne Luk, ... Nabeel Shirazi, in The Electrical Engineering Handbook, 2005

3.3.1 Run-Time Reconfigurable Devices

FPGAs, by definition, are configurable; most of them are also reconfigurable, unless they are based on technologies such as antifuse that are one-time programmable. Several commercial devices support partial reconfiguration, including the Virtex (Xilinx, 2001) and 6200 (Churcher et al., 1995) devices from Xilinx, the CLAy chip from National Semiconductor (National Semiconductor, 1993), and the AT40K devices from Atmel (Atmel, 1997). Useful reviews of FPGA architectures are available (Buell et al., 1996; Hauck, 1998; Kean, 2000; Mangione-Smith, 1997; Trimberger, 1994; Villasenor and Hutchings, 1998). Although some devices such as Xilinx 6200 FPGAs are no longer supported commercially, the ideas in the relevant publications may still inspire future advances.

A simple FPGA model is shown in Figure 3.3. In this figure, processing elements, typically containing configurable logic and storage blocks, are represented by squares. The processing elements are connected to configurable switches, represented as circles, that control data flow by establishing the desired connectivity between the busses. Much of the area in an FPGA is usually taken up by the configurable switches and the busses; local and global busses can also be organized hierarchically. The figure demonstrates the regularity found in most FPGAs; practical FPGAs often contain additional resources, such as configurable memory blocks and special-purpose input/output blocks supporting boundary-scan testing (Trimberger, 1994).

FIGURE 3.3. A Simple Model of an FPGA. Squares represent configurable processing elements, and circles represent configurable switches to control routing.

Many experimental FPGA architectures support run-time reconfiguration. Tau et al. (1995) have developed an FPGA that stores multiple configurations in memory banks. In a single clock cycle, on the order of tens or hundreds of nanoseconds, the chip can replace one configuration with another without erasing partially processed data.

A similar FPGA that can perform a context switch in one cycle has been developed by Trimberger et al. (1997). The FPGA can store up to eight configurations in on-chip memory.

This FPGA is based on a Xilinx 4000E device and includes extensions for dealing with saving state from one context to another.

The Colt Group led by Athanas is investigating a run-time reconfiguration technique called Wormhole that lends itself to distributed processing (Bittner and Athanas, 1997). The unit of computing is a stream of data that creates custom logic as it moves through the reconfigurable hardware.

Schmit et al. (2000) have developed a reconfigurable FPGA targeted toward pipelined designs. Reconfiguration is performed at the level of individual pipeline stages, similar to that described in Figure 3.2. Others have shown that commercial partially reconfigurable FPGAs can also support efficient reconfiguration of pipelined designs (Luk et al., 1997).

There are also configurable devices based on coarse-grain programmable elements (Cronquist et al., 1998), multiple-bit arithmetic units (Marshall et al., 1999), and low-power techniques (Rabaey, 1997). Kean (2000) provides an overview of commercial devices available in the year 2000.


URL:

https://www.sciencedirect.com/science/article/pii/B9780121709600500293

3-D Circuit Architectures

Vasilis F. Pavlidis, ... Eby G. Friedman, in Three-Dimensional Integrated Circuit Design (Second Edition), 2017

20.4 3-D FPGAs

FPGAs are programmable ICs that implement abstract logic functions with a considerably smaller design turnaround time than other design styles, such as application-specific integrated circuits (ASICs) or full-custom ICs. Due to this flexibility, the FPGA share of the IC market has steadily increased. The tradeoff for the reduced time to market and versatility of FPGAs is lower speed and increased power consumption compared with ASICs. A traditional physical structure of an FPGA is depicted in Fig. 20.26, where the logic blocks (LBs) can implement any digital logic function and contain some sequential elements and arithmetic units [760]. The switch boxes (SBs) provide the interconnections among the LBs. The SBs include pass transistors, which connect (or disconnect) the incoming routing tracks with the outgoing routing tracks. Memory circuits control these pass transistors and program the LBs for a specific application. In FPGAs, the SBs constitute the primary component of the interconnect delay between the LBs and can consume a great amount of power.

Figure 20.26. Typical FPGA architecture, (A) 2-D FPGA, (B) 2-D switch box, and (C) 3-D switch box. A routing track can connect three outgoing tracks in a 2-D SB, while in a 3-D SB, a routing track can connect five outgoing routing tracks.

Extending FPGAs to the third dimension can improve performance while decreasing power consumption compared with conventional planar FPGAs. A generalization of FPGAs to the third dimension would include multiple planar FPGAs, wafer- or die-bonded to form a 3-D system. The crucial difference between a 2-D and a 3-D FPGA is that an SB in a 3-D system provides communication to five neighboring LBs rather than three, as in a 2-D FPGA (see Figs. 20.26B and 20.26C). Consequently, each incoming interconnect segment connects to five outgoing segments rather than three. The situation is somewhat different for the bottom and topmost tiers of a 3-D FPGA, but for simplicity this difference is neglected in the following discussion. Since the connectivity of a 3-D SB is greater, additional pass transistors are required in each SB, increasing the power consumption, the memory required to configure the SB, and, possibly, the interconnect delay. The decreased interconnect length and greater connectivity can, however, compensate for the added complexity and power of the 3-D SBs.
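A rough transistor count makes this tradeoff tangible. The model below is a simplification (uniform channels, one pass transistor per track-to-track connection, no sharing) with an assumed channel width; it is not the layout of any particular commercial SB.

```python
def switchbox_pass_transistors(tracks_per_channel, dims):
    """Count pass transistors in an idealized switch box.

    A 2-D SB has 4 sides and each incoming track fans out to 3
    outgoing directions; a 3-D SB has 6 sides and a fanout of 5,
    matching Figs. 20.26B and 20.26C.
    """
    sides = 2 * dims
    fanout = sides - 1
    return sides * tracks_per_channel * fanout

w = 32  # assumed channel width (hypothetical)
sb2 = switchbox_pass_transistors(w, dims=2)
sb3 = switchbox_pass_transistors(w, dims=3)
print(sb2, sb3, round(sb3 / sb2, 2))  # the 3-D SB needs 2.5x the transistors
```

Under this model the 3-D SB costs 2.5 times the pass transistors (and configuration memory) of its 2-D counterpart, which is the overhead the shorter wirelength must recover.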

To estimate the size of the array beyond which the third dimension is beneficial, the shorter average interconnect length offered by the third dimension and the increased complexity of the SBs should be simultaneously considered [761]. Incorporating the hardware resources (e.g., the number of transistors) required for each SB and the average interconnect length for a 2-D and 3-D FPGA, the minimum number of LBs for a 3-D FPGA to outperform a 2-D FPGA is determined from the solution of the following equation,

(20.27) F_s,2-D · (2/3) · N^(1/2) = F_s,3-D · N^(1/3),

where F_s,2-D and F_s,3-D are the channel widths of a 2-D and a 3-D FPGA, respectively, and N is the number of LBs. Solving (20.27) yields N = 244, a number that is well exceeded in modern FPGAs.
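A quick sanity check of (20.27): rearranging gives N^(1/6) = (3/2) · (F_s,3-D / F_s,2-D), hence N = ((3/2) · r)^6 for a channel-width ratio r. The ratio used below is a hypothetical value, chosen only to illustrate that a ratio of about 5/3 reproduces the N = 244 crossover quoted in the text.

```python
def crossover_lbs(width_ratio):
    # N = ((3/2) * F_s,3-D / F_s,2-D) ** 6, from rearranging (20.27)
    return (1.5 * width_ratio) ** 6

r = 5 / 3                  # assumed F_s,3-D / F_s,2-D (hypothetical)
n = crossover_lbs(r)
print(round(n))            # about 244 LBs
```

Because N grows as the sixth power of the ratio, the crossover point is very sensitive to how much wider the 3-D channels must be.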

Since the pass transistors employed in both 2-D and 3-D SBs contribute significantly to the interconnect delay, degrading the performance of an FPGA, interconnects that span more than one LB can be utilized. These interconnect segments are named after the number of LBs they traverse, as shown in Fig. 20.27. Wires that span two, four, or even six LBs are quite common in contemporary FPGAs. Interconnects that span one quarter to one half of an IC edge are also possible [762].

Figure 20.27. Interconnects that span more than one logic block. L i denotes the length of these interconnects and i is the number of LBs traversed by these wires.

The opportunities that the third dimension offers to SRAM-based 2-D FPGAs have also been investigated [763]. Analytic models that estimate the channel width of 2-D FPGAs have been extended to 3-D FPGAs. Hence, the channel width W for an FPGA with N LBs, consisting exclusively of unit-length interconnect segments and implemented in n physical tiers, can be described by

(20.28) W = χ_fpga · [ Σ_{l=1}^{2√(N/n) + (n-1)·d_v} l · f_{3-D}(l) ] / [ (2√N + (n-1)·√(N/n)) · e_t ],

where f_{3-D}(l) is a stochastic interconnect length distribution similar to those discussed in Chapter 7, Interconnect Prediction Models, and d_v is the length of a vertical (intertier) connection in LB pitches. χ_fpga converts a point-to-point distance into an interconnect length, and e_t is the utilization parameter of the wiring tracks. These two factors can be determined from statistical data characterizing the placement and routing of benchmark circuits on FPGAs. Note that both factors depend on the architecture of the FPGA and on the automated layout algorithm used to route the FPGA.
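The sketch below implements one plausible reading of (20.28) in Python. The geometric length distribution and all parameter values (χ_fpga, e_t, d_v) are placeholders; as the text notes, real values come from placed-and-routed benchmark statistics, so the output here only illustrates the shape of the computation, not a realistic channel width.

```python
import math

def channel_width(N, n, d_v, f3d, chi_fpga, e_t):
    """Channel-width estimate in the style of (20.28) (toy parameters)."""
    # Longest interconnect: 2*sqrt(N/n) within a tier plus (n-1)
    # vertical hops of length d_v.
    l_max = int(2 * math.sqrt(N / n) + (n - 1) * d_v)
    wiring_demand = sum(l * f3d(l) for l in range(1, l_max + 1))
    # Channel positions available across the tiers of the array.
    channels = 2 * math.sqrt(N) + (n - 1) * math.sqrt(N / n)
    return chi_fpga * wiring_demand / (channels * e_t)

# Placeholder geometric distribution: half of all nets have length 1.
f3d = lambda l: 0.5 ** l
w = channel_width(N=20_000, n=2, d_v=1, f3d=f3d, chi_fpga=2.0, e_t=0.5)
print(f"estimated channel width: {w:.3f} (toy numbers)")
```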

Several characteristics of FPGAs have been estimated from benchmark circuits and randomized netlists placed and routed with the SEGment Allocator (SEGA) [764] and the versatile place and route (VPR) [765] tools, together with analytic expressions such as (20.28). In these benchmarks, each FPGA is assumed to contain 20,000 four-input LBs and is manufactured in a 0.25 μm CMOS technology. The area, channel density, and average wirelength (the latter measured in LB pitches, the distance between two adjacent LBs) are listed in Table 20.7 for different numbers of physical tiers.

Table 20.7. Area, Wirelength, and Channel Density Improvement in 3-D FPGAs

Number of Tiers Area (cm2) Channel Density Avg. Wirelength (LB Pitch)
1 (2-D) 7.84 41 8
2 (3-D) 3.1 24 6
3 (3-D) 1.77 20 5
4 (3-D) 1.21 18 5
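The improvement factors implied by Table 20.7 (using the single-tier 2-D row as the baseline) can be tabulated directly:

```python
# (area in cm^2, channel density, avg wirelength in LB pitches),
# rows copied from Table 20.7; tier count 1 is the 2-D baseline.
tiers = {1: (7.84, 41, 8), 2: (3.1, 24, 6), 3: (1.77, 20, 5), 4: (1.21, 18, 5)}
base = tiers[1]
for n, row in sorted(tiers.items()):
    factors = [b / v for b, v in zip(base, row)]
    print(f"{n} tier(s): area x{factors[0]:.2f}, "
          f"density x{factors[1]:.2f}, wirelength x{factors[2]:.2f}")
```

The area shrinks by roughly 6.5x at four tiers, while the average wirelength improves by only 1.6x, consistent with the diminishing returns discussed below.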

The improvement in the delay of an interconnect with a length equal to the die edge is depicted in Fig. 20.28 for different numbers of tiers. Wires that span multiple LBs use unit-length segments (i.e., no SBs are interspersed along these wires), whereas the die-edge-long wires use interconnect segments with a length equal to a quarter of the die edge. A significant decrease in delay is projected; however, these gains diminish beyond four tiers, as indicated by the saturated portion of the delay curves in Fig. 20.28. The components of the power dissipated in a 3-D FPGA, assuming a 2.5 V power supply, are shown in Fig. 20.29. The power consumed by the LBs remains constant, since the structure of the LBs does not vary with the third dimension. Due to the shorter interconnect length, however, the power dissipated by the interconnects is lower. This improvement is smaller than the improvement in the interconnect delay, as indicated by the slopes of the curves in Figs. 20.28A and 20.28B. This behavior is attributed to the extra pass transistors in a 3-D SB, which increase the power consumption and so compromise the benefit of the shorter interconnect length. Due to the reduced interconnect length, the power dissipated by the clock distribution network is also lower.

Figure 20.28. Interconnect delay for several number of physical tiers, (A) average length wires, and (B) die edge length interconnects.

Figure 20.29. Power dissipated by 2-D and 3-D FPGAs.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124105010000204

Traditional microarchitectures

Shigeyuki Takano, in Thinking Machines, 2021

2.5.4 Applying FPGAs to computing systems

The FPGA was invented for logic circuit verification (testing): it is designed so that any digital logic circuit can be configured by changing the configuration data that encodes the circuit. An SRAM-based FPGA can be reused many times, since it is reconfigured simply by loading new configuration data into the SRAM.

At the end of the 1990s, FPGA vendors claimed that the time from design to market (time-to-market) had been shortened [39]. Current trends, namely low-volume production and the huge nonrecurring engineering (NRE) costs of modern ASICs, favor FPGAs as a way to suppress costs. Reconfigurability also makes it possible to patch a design by updating the configuration data: small bugs in the user logic circuit can be fixed by reconfiguration even after release to market.

A current FPGA has many memory blocks and numerous DSP blocks, each consisting of a multiplier and an adder; an FPGA can therefore be applied to a system-on-chip (SoC). The many memory blocks placed on the chip reduce the number of external memory accesses and thus the latency of data access; in addition, multiple memory blocks can perform multiple data accesses in parallel, so an FPGA can be used for DLP applications. Through the integration of DSP blocks, user logic designed for digital signal processing can run at a relatively high clock frequency. Current DSP blocks can perform integer and fixed-point arithmetic as well as floating-point arithmetic, so a wide variety of applications can use an FPGA [62].

An FPGA has a general-purpose structure that can host any user logic circuit. Compared with an ASIC implementation, it therefore achieves a smaller equivalent gate count, a lower clock frequency, and higher power consumption, as shown in Table 2.1. For small benchmark circuits, the equivalent gate count, clock frequency, and power consumption of an FPGA are comparable to those of an ASIC three or four process generations older. The integration of memory blocks and DSP blocks narrows this gap. Using an FPGA despite this generational handicap is justified by the high NRE and fabrication costs of an ASIC implementation, and by the risk that logic bugs found after ASIC fabrication are impossible to fix. Thus, today's FPGAs are used to reduce costs and shorten time-to-market.

Table 2.1. Implementation gap (FPGA/ASIC) [233].

Comparison Point Logic Only Logic and Mem Logic and DSP Logic, DSP, Mem
Area 32 32 24 17
Critical-path Delay 3.4 3.5 3.5 3.0
Dynamic Power Consumption 14 14 12 7.1
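Table 2.1 can be read as "how much the hard blocks narrow the gap": the snippet below computes, from the table's own numbers, the relative shrink from the logic-only column to the logic, DSP, and memory column.

```python
# FPGA/ASIC gap ratios copied from Table 2.1:
# (logic only, logic + DSP + memory).
gap = {
    "area":          (32, 17),
    "critical-path delay": (3.4, 3.0),
    "dynamic power": (14, 7.1),
}
for metric, (logic_only, with_hard_blocks) in gap.items():
    shrink = 1 - with_hard_blocks / logic_only
    print(f"{metric}: gap narrows by {shrink:.0%} with hard blocks")
```

Hard blocks roughly halve the area and power gaps but barely move the delay gap, which matches the text's observation that the FPGA still lags ASICs by several process generations.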

Recently, FPGAs have been used for high-performance computing. Intel acquired Altera (an FPGA vendor) [54] and proposed integrating an FPGA with an enterprise microprocessor [68].


URL:

https://www.sciencedirect.com/science/article/pii/B9780128182796000128

Hardware Accelerator Systems for Artificial Intelligence and Machine Learning

Joo-Young Kim, in Advances in Computers, 2021

2.3 FPGA based acceleration systems

Recently, the FPGA has drawn a lot of attention as a compelling ML acceleration platform due to its adaptability to frequent algorithm changes and its relatively good power efficiency compared with conventional CPU/GPU-based systems. Fig. 3 shows three different architectures of FPGA-based acceleration systems. The first is the most typical architecture, which interconnects the host CPU and the FPGA accelerator through the PCIe bus. In this setting, the host CPU offloads bulk data to the FPGA accelerator, which performs coarse-grained tasks much like a GPU. The second is a network-attached architecture, also used in Microsoft's Catapult project [4]. Because the FPGA accelerator resides between the NIC and a network switch, all network traffic passes through it, and it can apply in-line processing when needed. At the same time, because the FPGA accelerator is also connected to the host CPU through the PCIe bus as in the first case, it can serve as a local compute accelerator as well. In both architectures, the FPGA accelerator usually has its own local DRAM on the board, separate from the main memory of the host CPU. The third architecture integrates the FPGA accelerator much more closely with the traditional CPU system. In Intel's Xeon-FPGA hybrid chipset named Purley [5], recently developed for the datacenter market, the FPGA is connected to an Intel Xeon processor through the Ultra Path Interconnect (UPI) and two PCIe Gen 3 channels in the same package, while the Xeon processor and the FPGA logic share the main memory coherently.

Fig. 3. FPGA based acceleration systems. (A) Local accelerator (B) Network attached (C) On-die integration.
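The first (local-accelerator) pattern can be sketched as a host-side offload loop. Everything below is a stub: the class and method names are invented, and a real system would move data over PCIe via a vendor driver and DMA rather than Python calls.

```python
# Sketch of the local-accelerator pattern: the host batches work and
# offloads coarse-grained tasks to the device (stubbed in Python).
class FpgaAccelerator:
    """Stand-in for an FPGA card with its own on-board DRAM."""
    def __init__(self):
        self.local_dram = {}

    def upload(self, key, data):
        # Host -> card transfer (over PCIe in a real system).
        self.local_dram[key] = data

    def run_kernel(self, key):
        # Coarse-grained task executed on the card; here, squaring.
        return [x * x for x in self.local_dram[key]]

accel = FpgaAccelerator()
batch = list(range(8))
accel.upload("batch0", batch)          # offload a bulk of data
result = accel.run_kernel("batch0")    # device computes, host collects
print(result)
```

The network-attached and on-package variants differ mainly in where the data arrives from (the NIC path or coherent shared memory) rather than in this basic offload structure.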


URL:

https://www.sciencedirect.com/science/article/pii/S0065245820300899

Sensor validation and hardware–software codesign

Manish J. Gajjar, in Mobile Sensors and Context-Aware Computing, 2017

FPGA Platform

An FPGA system can include the following main components, as shown in Fig. 7.3:

Figure 7.3. Sensor hub FPGA prototyping block diagram.

A host system, which has sensor software and applications along with test content and device drivers to drive the FPGA system connected to its PCIe slot.

A PCIe bridge that connects the DUT in the FPGA to the host PC.

The sensor hub RTL code as the DUT, along with its MCU (microcontroller unit) if it is an MCU-based sensor hub.

Additional RTL code for units, such as a security engine or an audio subsystem, with which the sensor hub interacts and which need to be part of the validation.

Any RAM and ROM memories in the DUT; these need to be ported to FPGA RAMs.

Any other interfaces or routers for communication with the sensor hub.

Physical sensors that connect to the FPGA board through HAPSTRACK or FMC connectors.

An FPGA system can be based on presilicon simulation models, where the clocks, resets, and power control in the FPGA are similar to those on the presilicon test benches but are generated using the FPGA board's PLLs or derived from the PCIe clock.

Since the sensor hub interacts with the security engine for boot and power management flow, it is important to have a prototyping system that can also emulate the security engine so that the firmware can execute handshake protocols and boot or power management flows.

If the actual security engine cannot be prototyped into the FPGA system due to FPGA resource or timing limitations, then a software emulator can be developed to mimic sensor hub-security engine handshake protocols. Such an FPGA architecture is shown in Fig. 7.4. The software emulator that mimics all the security engine flows and behavior is loaded on the host PC. The security engine emulator will update the sensor hub-security engine handshake registers according to handshake flows (like boot up flow or power management flow) and thus enables a communication channel between the security engine and the sensor hub.

Figure 7.4. Sensor hub-security engine software communication model in FPGA.

The sensor hub and the register read/write unit can also initiate read/write transfers to the sensor hub-security engine handshake registers. Such a software emulator–based handshake mechanism enables validation of complete firmware/software–hardware interaction flows fairly early in the hardware–software development cycle and finds important interaction bugs.

Fig. 7.5 shows a typical FPGA model build starting from the design RTL database to the final validation report out.

Figure 7.5. FPGA model build process.

The process of image generation involves generating a presilicon model, synthesis, partitioning, and bit-image generation. During presilicon model generation, the RTL is integrated with the other FPGA infrastructure components, the embedded memory blocks in the design are replaced with FPGA-equivalent memories (BRAMs), constraints are defined, and clock sources are defined and generated. During synthesis, FPGA vendor synthesis tools (such as Synplify Pro and Synplify Premier) generate the netlist; this is followed by physical synthesis and place and route (PAR), using tools such as ISE or Vivado. Finally, the bit file is generated.

Prototype users need to understand the difference between an FPGA model and real silicon. The limitations of the FPGA image need to be understood and documented for users (such as system validation or software validation teams).

The FPGA image can be qualified with presilicon simulation tests ported to FPGA. This ensures that the users and the validation phase are not impacted by a dead-on-arrival or low-quality FPGA image. For example, 30-40% of the presilicon simulation tests can be used as gate tests for quality. Beyond these basic FPGA gate tests, customers and validation teams can also run their own acceptance tests to ensure feature readiness in the released FPGA bit image. Such acceptance testing is important before launching automation and executing a large number of tests on the FPGA bit image.

If any test cases fail during the validation phase, various FPGA debug paths are explored; some of these are shown in Fig. 7.5. Xilinx ChipScope or Synopsys Identify can be used to compile and build an FPGA image with a predetermined list of debug signals, which can then be observed during the test failure. Protocol analyzers for I2C, UART, PCIe, and SPI can also be used to debug the failing test cases.

Once the root cause of the failure is determined, a fix is identified and a new FPGA image is released for verification. The failure is also evaluated for any lessons that can be used to improve the design and validation methodology and infrastructure.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128016602000070

Dark Silicon and Future On-chip Systems

Zeinab Seifoori, ... Hossein Asadi, in Advances in Computers, 2018

1 Introduction

Field-Programmable Gate Arrays (FPGAs) first emerged four decades ago. Since their inception, these devices have experienced rapid growth in industry and have become a viable alternative to Application-Specific Integrated Circuits (ASICs). Nowadays, FPGAs are an indispensable part of digital systems, with widespread use across domains ranging from embedded processors to parallel and high-performance computing to safety-critical applications. This ubiquity is motivated by intrinsic and distinctive characteristics such as the flexibility to implement a variety of digital designs and adapt to workload variations, a high degree of parallelism, low nonrecurring engineering cost, inexpensive design updates, and short time to market.

The evolution of FPGAs has proceeded at such a pace that today's large FPGA platforms, e.g., the Stratix 10 GX 5500, comprise more than 30 billion transistors. Such aggressive transistor density, however, requires massive power delivery and, more importantly, cooling equipment if all transistors on the chip are to be employed in parallel. In recent years, this aggressive integration trend in the semiconductor industry has hit a power wall due to the failure of Dennard scaling.

Dennard scaling states that transistor feature size and supply voltage scale commensurately by the same factor in each process generation, so that power density remains constant. This allowed designers to increase performance by raising the clock frequency in every generation. Since the last decade, however, the reduction in supply voltage has not been sufficient to offset the significant increase in static power, which became the new power bottleneck. The increasing static power, and hence the breakdown of Dennard scaling, makes it impossible to cram ever more transistors into tiny areas and still fully utilize the entire transistor pool on the chip within a given, safe thermal design power. One may fabricate dense chips but cannot afford the power to run them fully, and is therefore obliged to restrict the active fraction of the silicon die, a phenomenon known as dark silicon.

In FPGAs, dark silicon is even more pronounced: the flexibility of FPGA devices is not only the reason for their lower performance and area efficiency relative to ASICs, but also the cause of higher static power dissipation, which stems from the abundance of transistors and SRAM configuration bits in the logic and routing fabrics [1]. In particular, FPGA-based designs consume 5-87× more static power than their ASIC-based counterparts [1]. This high static power dissipation obstructs their use in low-power and embedded mobile systems.

Previously, FPGA practitioners focused primarily on enhancing the performance and area efficiency of FPGAs and spent transistors freely to achieve higher performance. Given the importance of today's power-constrained computation and the growing dark fraction of silicon, they now instead spend (or sacrifice) transistors to save energy, manage the operating temperature, or extend battery life. FPGAs have become promising alternatives to processors, since they tend to be more energy efficient and can outperform processors by an order of magnitude or more, especially on highly parallel applications [2,3]. Further power reduction in FPGAs would thus create additional opportunities for their adoption.

This chapter examines the evolution of logic and routing fabrics in FPGAs in the era of dark silicon. The rest of the chapter is organized as follows. Section 2 presents an overview of SRAM-based FPGA architecture, including logic and routing resources. Section 3 introduces the concepts of the power wall and dark silicon. Sections 4 and 5 summarize studies that propose new FPGA logic and routing architectures, respectively, targeting mitigation of the dark silicon problem. Finally, we conclude the chapter and address important questions and future trends.


URL:

https://www.sciencedirect.com/science/article/pii/S0065245818300421

Next Steps and Beyond

Thomas Sterling, ... Maciej Brodowicz, in High Performance Computing, 2018

21.3.3 Field Programmable Gate Arrays

An FPGA is, as the name implies, a component comprising a large number of logic gates and other functional parts connected by a network, the connectivity of which can be determined by "programming" the device. That is, there is a protocol by which the end user can determine the logic circuitry of the component. While less dense and somewhat slower than application-specific integrated circuits (ASICs), FPGAs enable custom designs to be produced to optimize them for special-purpose functionality. This permits the rapid development of prototype designs and gives a means to distribute a small number of parts to end users. One area of use that may prove promising is application-specific FPGA logic circuits optimized for specific algorithms. Such structures as systolic arrays can be implemented readily with FPGAs to accelerate important applications by one to two orders of magnitude with respect to conventional microprocessors. Other uses may include logic designs to support future system software to reduce overheads.

The major challenge is to provide efficient functionality that best suits application algorithms and the means of rapidly programming FPGAs. Much work has been done in both domains, but use still demands expertise. Another problem is the integration of FPGAs with otherwise conventional systems. This is in part addressed through industry-standard interfaces to which custom boards are designed with FPGA components. But this still has its limitations. Now hybrid subsystems with both processors and FPGAs integrated together are being made available, again improving their mutual connectivity.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124201583000216

Introduction

Gary Stringham, in Hardware/Firmware Interface Design, 2010

FPGAs

FPGAs (field-programmable gate arrays) offer the flexibility of having their functionality changed through reprogramming, but they typically use more power, deliver lower performance, and cost more per part.

FPGAs can be programmed with a customized mix of content. This custom mix makes them similar to ASICs in that typically one firmware team is paired with the one hardware team. However, because FPGAs can be modified in a matter of hours, it is possible for many versions of the FPGA programming to exist. This requires close collaboration between the hardware and firmware teams to ensure that the version of firmware paired with the version of FPGA code will work together.

FPGAs are also used when changes to the design are needed after the product has been deployed to customers. A new programming file can be distributed and downloaded to the product.

FPGAs can also be programmed with a standard mix of content. This is similar to ASSPs in that companies can sell the same package to many customers, which means there could be many firmware teams writing device drivers for this standard FPGA-based content. Companies use this method to sell small quantities of their designs without incurring the NRE (nonrecurring engineering) expenses associated with sending the design to a foundry for fabrication, as is the case with non-FPGA-based ASSPs.

References in this book to the time and expense required to respin a chip do not apply to FPGAs. Firmware cannot tell the difference between an ASIC/ASSP and an FPGA, as the register/interrupt interface is the same.


URL:

https://www.sciencedirect.com/science/article/pii/B9781856176057000034