Abstract
Supercomputing and High Performance Computing (HPC) are terms commonly used to describe systems composed of computers that operate in parallel by means of commercially available high-speed interconnects, such that the whole can behave as a single unified system. This paper provides a brief overview of HPC and its evolution over the years, starting with the Control Data Corporation (CDC) 6600 and ending with the most recent supercomputers, such as the IBM Roadrunner.
1. Introduction: Supercomputers
Today, supercomputers are mostly understood as computer clusters. Computer clusters are loosely coupled computing systems generally comprising multiple processors linked together by interconnects. These interconnects are usually implemented as fast local area networks.
Despite this common current definition, the concept of a supercomputer has changed considerably over the years. The term itself is rather fluid, and today’s supercomputer tends to become tomorrow’s ordinary computer. CDC’s early machines were simply very fast scalar processors, on the order of ten times the speed of the fastest machines offered by other companies. In the 1970s, the term supercomputer mostly referred to specialized high performance vector processors. In the early and mid-1980s, machines with a modest number of vector processors working in parallel became the standard. Typical numbers of processors were in the range of four to sixteen. In the later 1980s and 1990s, attention turned from vector processors to massively parallel processing systems with thousands of “ordinary” CPUs, some being off-the-shelf units and others being custom designs. Today, supercomputers are commonly understood as computer clusters built from “off-the-shelf” server-class microprocessors, such as the PowerPC, Opteron, or Xeon, which are combined using custom interconnects.
One advantage of building a cluster from common components is that these systems can be mass-produced and therefore offered at a lower price. Although a high degree of technical skill is necessary to create such a system, benefits such as flexibility, relatively low power consumption, and low cost outweigh this drawback.
The overall progress of high performance computing over time can be tracked through the increase in performance of the world’s fastest supercomputers. Performance is typically measured in FLOPS (floating point operations per second), and a list of the world’s 500 fastest supercomputers is available online [8]. This list is updated twice a year, most recently in November 2008.
Supercomputing should not be confused with Grid Computing (also known as Distributed Computing), which is a special type of parallel computing that uses the resources of many computers in a network to work on a single problem at the same time. It relies on large-scale cluster computing in which the participating machines are geographically dispersed. One of the main issues of grid computing is security. Whereas a supercomputer remains a private system, a grid is public: a computer from anywhere in the world can become part of the grid, which raises important security issues if not taken into consideration. A study of grid computing is beyond the scope of this paper.
2. Supercomputing benefits and applications
Supercomputing was defined above as the use of supercomputers or computer clusters. One may ask why such a cluster should be built instead of using a single large processor to provide the intended functionality. The reasons are that a computer cluster enables scalability, increases flexibility, reduces cost by allowing the cluster to be built from off-the-shelf standard components, and decreases power consumption. Increasing the computational power of a single processor has usually been treated as equivalent to increasing its operating clock frequency; however, recent research has demonstrated that the same computational power can be achieved by using multiple processors operating at lower clock frequencies. Lower clock frequencies allow the use of lower supply voltages, and therefore a reduction in both dynamic and static power consumption. Additionally, using several processors allows parallelism to be exploited beyond what instruction level parallelism (ILP) offers within a single processor. Regarding flexibility, it is important to highlight that the parallel components in these systems are loosely coupled, which allows components (computers) to be added to or removed from the cluster easily.
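The power argument above can be illustrated with the textbook approximation for CMOS switching power, P = C · V² · f. The following is a minimal sketch under stated assumptions: the capacitance, voltages, and frequencies are hypothetical values chosen only to show the trend, static (leakage) power is ignored, and the workload is assumed to parallelize perfectly across both processors.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Textbook CMOS switching-power approximation: P = C * V^2 * f (watts)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical values, not measurements of any real processor.
C = 1e-9  # effective switched capacitance per processor, in farads

# One processor at 3 GHz and 1.2 V.
single = dynamic_power(C, voltage=1.2, frequency=3.0e9)

# Two processors at 1.5 GHz each; the lower frequency is assumed to
# permit a lower supply voltage of 0.9 V.
dual = 2 * dynamic_power(C, voltage=0.9, frequency=1.5e9)

# Both configurations provide the same aggregate cycles per second
# (1 x 3 GHz == 2 x 1.5 GHz), assuming ideal parallel speedup.
ratio = dual / single
print(f"single: {single:.2f} W, dual: {dual:.2f} W, ratio: {ratio:.4f}")
```

Because the aggregate frequency is the same in both configurations, the ratio reduces to the square of the voltage ratio, (0.9 / 1.2)² = 0.5625, so all of the saving in this sketch comes from the lower supply voltage.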
Supercomputers are used for highly calculation-intensive tasks such as problems involving quantum mechanical physics, weather forecasting, climate research, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), physical simulations (such as simulation of airplanes in wind tunnels, simulation of the detonation of nuclear weapons, and research into nuclear fusion), cryptanalysis, and the like. Major universities, military agencies and scientific research laboratories are heavy users. A particular class of problems, known as Grand Challenge problems, consists of problems whose full solution requires semi-infinite computing resources.
The advancement of supercomputing over the years has been marked by intense competition between manufacturers and between nations. By enabling sophisticated computations and simulations, these machines have been invaluable in solving several important scientific problems. The importance of the problems that supercomputers can solve has driven continuous research to improve their capabilities.
3. The History of Supercomputers
The CDC 6600 is believed to have been the first computer to be designated as a supercomputer, offering the fastest clock of its day (a cycle time of 100 nanoseconds). It was one of the first computers to use Freon refrigerant cooling and was also the first commercial computer to use a CRT console. The machine was operated for nearly a year at the 30th Street location in Boulder until the Mesa Laboratory was ready in December 1966. The CDC 6600 was a large-scale, solid-state, general-purpose computing system. It had a distributed architecture (a central scientific processor supported by ten very fast peripheral machines) and was a reduced instruction set computer (RISC) many years before the term was invented. Input to the computer was by punch cards or seven-channel digital magnetic tape. Output was available from two line printers, a card punch, a photographic plotter, and standard magnetic tape. An interactive display console allowed users to view graphical results as data were being processed. The CDC 6600 had 65,000 60-bit words of memory. It was equipped with a large disk storage device and six high-speed drums, which provided storage intermediate in speed and accessibility between the central core storage and magnetic tapes. The 6600 supported the FORTRAN 66 compiler and a program library. The CDC 6600 was decommissioned in 1977.
Modern supercomputing began in the 1970s with the introduction of vector processors. Many of the newer players developed their own such processors at a lower price to enter the market. In the early to mid 1980s, performance advancement was obtained through improvements in vector processor technology and the introduction of symmetric multiprocessors (SMPs). Dongarra [3] defined SMP as “[A] computer system that has two or more processors connected in the same cabinet, managed by one operating system, sharing the same memory, and having equal access to input/output devices.”
In the early and mid-1980s, machines with a modest number of vector processors working in parallel became the standard. Typical numbers of processors were in the range of four to sixteen. During the first half of the 1980s, much attention was paid to the marketability of supercomputers so that manufacturers were able to sell enough systems to stay in business. This included development of “standard programming environments, operating systems, and key applications” [3].
Supercomputers built in the early 1980s used the shared memory (SM) model, which limited the scalability of such systems. In the later 1980s and early 1990s, attention turned from SM to distributed memory (DM), implemented by massively parallel processing systems with thousands of “ordinary” CPUs, some being off-the-shelf units and others being custom designs. This shift in focus was accompanied by an increase in the performance of standard off-the-shelf processors due to the transition to RISC architectures and the introduction of CMOS technology. This led to the concept of massively parallel processors (MPP) [3]. Vaughan-Nichols [7] defines an MPP as a system using many CPUs, each with its own memory, running in parallel and linked by high-speed interconnections to execute the various parts of a program. His use of “many” to describe the number of processors is rather relative. IBM Roadrunner, currently the world’s fastest supercomputer, is listed as a hybrid design with 12,960 IBM PowerXCell 8i CPUs and 6,480 AMD Opteron dual-core processors in specially designed server blades connected by Infiniband [8], but a computer with a few orders of magnitude fewer processors could also be described as having “many” processors. Dongarra suggests considering “many” to mean “larger than the current largest number of processors in a shared-memory machine.”
In order to provide a more reliable basis for statistics on high-performance computers, the Top500 list was introduced in 1993. From the start, this ranking has been based on the LINPACK benchmark. LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Dongarra and other researchers between the late 1970s and early 1980s for use with supercomputers. The choice of benchmark was controversial, since LINPACK is considered to over-estimate the capability of microprocessor-based machines [4]. As a library, LINPACK has largely been superseded by LAPACK, which runs more efficiently on modern architectures.
Until the introduction of the Earth Simulator supercomputer in 2002, MPP machines dominated the supercomputer industry [6]. The percentage of machines on the Top500 list utilizing vector processors had been steadily decreasing over the years, and had appeared to stabilize around 10% in 2003 [4]. When introduced, Earth Simulator was the world’s fastest supercomputer and featured vector processors in an MPP architecture [4]. According to Dongarra, this “demonstrated that many scientific applications could benefit greatly from other computer architectures.” Feitelson highlights that researchers in Japan, where the Earth Simulator is located, have preferred the vector processing approach, while American companies have preferred to use commodity microprocessors.
Currently, the focus in supercomputing is on clustered systems; Strohmaier and Meuer [6] note that the number of clustered systems on the Top500 list has grown considerably. Dongarra defines “cluster” as a “commonly found computing environment consisting of many PCs or workstations connected together by a local-area network.” He also comments that a long-running trend indicates that it is increasingly rewarding to aggregate the computational power of relatively small machines. As the typical desktop computer has become more powerful over the years, it can now be considered a significant computing resource [3]. Coupling groups of commodity computers together with high-speed interconnects has proved to be a successful venture, lowering costs and putting supercomputers within reach of new users [7]. As an example, Vaughan-Nichols [7] cites Virginia Tech’s new Terascale Cluster, built with 1,100 off-the-shelf Apple computers, Cisco Systems’ 4500 Gigabit switches, and 24 96-port Infiniband switches. The Terascale Cluster was built in 4 months and cost a relatively small $5.2 million. In summary, clusters of PCs and workstations have in recent years become the prevalent architecture for many HPC application areas in all ranges of performance [3].
4. LINPACK and LAPACK
LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Pete Stewart, and was intended for use on supercomputers in the 1970s and early 1980s. It has been largely superseded by LAPACK, which runs more efficiently on modern architectures.
LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, or square tridiagonal. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. The performance measured by the LINPACK benchmark is reported in millions of floating point operations per second (MFLOP/s, sometimes loosely called FLOPS).
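To make the idea of a LINPACK-style measurement concrete, the following is a minimal sketch in Python rather than Fortran. It solves a dense linear system by Gaussian elimination with partial pivoting and converts the elapsed time into MFLOP/s using the nominal operation count 2n³/3 + 2n² commonly quoted for this benchmark. The function names and the test matrix are hypothetical, and the code is row-oriented for readability; it does not reproduce LINPACK’s own column-oriented routines.

```python
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    `a` is a list of rows and `b` a list; both are copied, not modified.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            a[i][k] = 0.0
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_style_mflops(n, seconds):
    """Nominal count 2/3 n^3 + 2 n^2 divided by elapsed time, in MFLOP/s."""
    return (2.0 * n ** 3 / 3.0 + 2.0 * n ** 2) / seconds / 1e6


# Hypothetical test problem: diagonally dominant matrix, exact solution x = 1.
n = 60
a = [[(float(n) if i == j else 1.0 / (1 + abs(i - j))) for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in a]

start = time.perf_counter()
x = lu_solve(a, b)
elapsed = time.perf_counter() - start

print(f"max error: {max(abs(v - 1.0) for v in x):.2e}")
print(f"rate: {linpack_style_mflops(n, elapsed):.2f} MFLOP/s")
```

The rate printed by interpreted Python will be far below what a compiled, tuned library achieves on the same machine; the sketch only shows how a solve time is turned into a FLOP rate.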
LAPACK can be seen as the successor to the original LINPACK, which was designed to run on the then-modern vector computers with shared memory. LAPACK, in contrast, depends upon the Basic Linear Algebra Subprograms (BLAS) in order to effectively exploit the caches on modern cache-based architectures, and thus can run orders of magnitude faster than LINPACK on such machines, given a well-tuned BLAS implementation. LAPACK has also been extended to run on distributed-memory systems in later packages such as ScaLAPACK and PLAPACK.
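The cache-oriented approach described above rests on blocking: operating on small tiles of a matrix so that each tile can be reused while it is still in cache. The following is a minimal sketch of a blocked matrix multiply, the kind of operation a BLAS implementation provides. It is an illustration only: the function names and block size are hypothetical, real BLAS libraries are written in compiled and hand-tuned code, and pure Python will not show the speedup; only the loop structure is the point.

```python
def matmul_naive(a, b):
    """Reference triple-loop product of two square matrices (lists of rows)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same product, computed one block x block tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # On real hardware the three tiles touched here are meant
                # to fit in cache together and be reused before eviction.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


# Small integer-valued matrices, so both orders of summation are exact.
a = [[float(i + j) for j in range(5)] for i in range(5)]
b = [[float(i * j + 1) for j in range(5)] for i in range(5)]
same = matmul_blocked(a, b, block=2) == matmul_naive(a, b)
print("blocked result matches naive result:", same)
```

The block size of 2 is chosen only so that the 5 × 5 example exercises partial tiles at the matrix edge; a tuned implementation would pick it from the cache sizes of the target machine.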
5. Top supercomputers
IBM Roadrunner currently occupies the first position in the 32nd edition of the prestigious Top500 ranking [8]. It was built by IBM for the U.S. Department of Energy’s (DOE) National Nuclear Security Administration and is located at Los Alamos National Laboratory. It is a hybrid design with 12,960 IBM PowerXCell 8i CPUs and 6,480 AMD Opteron dual-core processors in specially designed server blades connected by Infiniband. The Roadrunner uses Red Hat Enterprise Linux along with Fedora as its operating systems and is managed with xCAT distributed computing software. Top500 [8] reports a sustained LINPACK performance (Rmax) of 1,105,000 GFLOPS and a theoretical peak performance (Rpeak) of 1,456,704 GFLOPS. Roadrunner uses the Open MPI Message Passing Interface implementation. Roadrunner occupies approximately 6,000 square feet (560 m²) and became operational in 2008. The DOE plans to use the computer to simulate how nuclear materials age in order to predict whether the USA’s aging arsenal of nuclear weapons is safe and reliable. Other uses for the Roadrunner include the sciences and the financial, automotive, and aerospace industries.
In second position is the Jaguar XT supercomputer by Cray, built for the US Department of Energy’s Oak Ridge National Laboratory in Tennessee. Top500 currently reports 150,152 cores, a sustained LINPACK performance (Rmax) of 1,059,000 GFLOPS, and a theoretical peak performance (Rpeak) of 1,381,400 GFLOPS. The DOE states that Jaguar is intended for unclassified research.
The Top500 ranking also shows that, of the 500 fastest supercomputers in the world, 290 are located in the US (58%), followed by 49 in the United Kingdom (9.20%), 26 in France (5.20%), 25 in Germany (5.00%), 17 in Japan (3.40%) and 15 in China (3.00%). Among Latin American countries, Brazil has 2 supercomputers and Mexico has 1. With respect to operating systems, 389 supercomputers use Linux, which accounts for 77.80% of the Top500 ranking. However, the real percentage of supercomputers using Linux is higher, because specific Linux distributions such as SUSE or Red Hat are counted separately.
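The two pairs of figures quoted above for Roadrunner and Jaguar can be compared through the fraction of peak performance that each system sustains on the LINPACK benchmark. The following short sketch computes that fraction from the GFLOPS values given in the text [8]; the function and variable names are hypothetical.

```python
def efficiency(sustained_gflops, peak_gflops):
    """Fraction of peak performance achieved on the benchmark."""
    return sustained_gflops / peak_gflops


# (LINPACK GFLOPS, peak GFLOPS) as quoted in the text from the Top500 list [8].
systems = {
    "Roadrunner": (1105000, 1456704),
    "Jaguar": (1059000, 1381400),
}

for name, (sustained, peak) in systems.items():
    pflops = sustained / 1e6  # 1 PFLOPS = 1,000,000 GFLOPS
    share = 100 * efficiency(sustained, peak)
    print(f"{name}: {pflops:.3f} PFLOPS on LINPACK, {share:.1f}% of peak")
```

Both systems sustain roughly three quarters of their peak performance on this benchmark, and both exceed 1 PFLOPS.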
6. Conclusions
We have presented a brief account of the evolution of supercomputing architectures over the last 40-50 years. It is not surprising that a small cluster built from the technology available today will surpass the LINPACK FLOPS measurements of ten years ago. Indeed, during the last year the 1 petaflop barrier was broken, first with the construction of the IBM Roadrunner and later with the expansion of the Cray Jaguar XT.
Because supercomputers depend on communication between cluster nodes and on cluster size, the next generation of supercomputers (some of which are presently in the early stages of development) is expected to deliver more computational power as communication methods improve and cluster sizes increase.
Moreover, supercomputing is now more about parallelizing across systems than about improving a single system, because limitations in ILP, memory, and power have stalled the development of high-performance single processors. This explains the growing shift of focus towards parallel computing.
7. References
[1] G. Bell, Bay Area Research Center, Microsoft Research. “A Brief History of Supercomputing: ‘the Cray’s’, Clusters and Beowulf’s, Centers. What Next?” April 19, 2003.
[2] J. Copeland. “A brief history of Computing”. June 2000.
[3] J. Dongarra. Trends in high performance computing: a historical overview and examination of future developments. Circuits and Devices Magazine, IEEE , vol.22, no.1, pp. 22-27, Jan.-Feb. 2006
[4] D. G. Feitelson. The supercomputer industry in light of the Top500 data. Comput. in Science & Engineering 7(1), pp. 42-47, Jan/Feb 2005
[5] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, 1995, ISBN 1-55860-329-8.
[6] E. Strohmaier, H. Meuer. Supercomputing: What Have We Learned from the TOP500 Project?. Proceeding Algorithms 2002, http://ftg.lbl.gov/ToP500/SPComput-learned.pdf
[7] S. Vaughan-Nichols. New trends revive supercomputing industry. Computer, vol.37, no.2, pp. 10-13, Feb 2004
[8] www.top500.org – 32nd Edition of TOP500 List of World’s Fastest Supercomputers Released, Big Turnover Among the Top 10 Systems. November 2008.