Overview of Supercomputer Architectures, Hardware and Programming Methods
The basic concept behind HPC (High Performance Computing) is parallelism. HPC hardware (and software) falls into three categories: symmetric multiprocessing (SMP), vector processors, and clusters.
SMP is a multiprocessor computer architecture in which two or more identical processors are connected to a single, uniformly shared memory. Most common commercial systems use SMP for parallel processing.
Almost all modern operating systems are SMP capable, i.e. they can simultaneously execute code on multiple processors to achieve parallelism. The recent O(1) (read "order 1", or constant-time) scheduling capability of the Linux kernel has put it in the limelight.
The
scalability matrix of various systems is -
Currently IBM and Sun (Sun Fire X series) offer Linux SMP systems with 16 64-bit AMD Opteron or Intel Xeon/Itanium CPUs. However, Linux currently has some scalability issues on SMP systems, mainly attributed to its kernel locking mechanism.
HP boasts of its HP Superdome servers, which have up to 256 64-bit Intel Itanium or HP PA-RISC processors and 2 TB of RAM. They run only HP-UX, which is licensed on a per-CPU basis.
The main discouraging factor against SMP is the high cost of ownership and maintenance, because if even one CPU, or just its cache, fails, the entire system has to be replaced.
In contrast, the hot-swappable nodes in a Linux compute cluster make it more popular in the HPC market. Also, the very thin Linux layer (if any) running on these cluster nodes makes nearly all the CPU power available to the application, whereas on an SMP system the OS itself eats up significant resources.
Not very surprisingly, a compressed HP-UX kernel image is larger than 180 MB, and sometimes even 300 MB, which takes up a huge amount of memory once uncompressed. A Linux kernel image is usually about 1.5 MB and can be made as small as 200 KB.
Product information about SMP systems can be obtained from -
Vector processors, which were introduced in the 1970s, are still a subject of research and still far from meeting enterprise requirements.
Clusters are groups of loosely bound
independent servers (called nodes) that work together closely so that in many
aspects they can be viewed as though they are a single large server.
Before
cluster-based computing, the typical supercomputer was a vector processor that
could typically cost over a million dollars due to the specialized hardware and
software.
Fundamentally,
both vector and scalar processors execute instructions based on a clock; what
sets them apart is the ability of vector processors to parallelize computations
involving vectors (such as matrix multiplication), which are commonly used in
High Performance Computing. To illustrate this, suppose you have two double
arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].
Any floating-point operation like addition or multiplication is achieved in a few discrete steps:
· Exponents are adjusted
· Significands are added
· The result is checked for rounding errors and so on
A vector
processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE
arithmetic hardware) in a floating-point addition as shown in Figure 1.

A vector processor does these six steps in parallel, with each step working on a different pair of array elements at the same time. As you can see, for a six-step floating-point addition on long arrays, the speedup factor will be very close to six times. A big benefit is that parallelization is happening behind the scenes, and you need not explicitly code this in your programs.
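The "very close to six times" figure can be checked with a little arithmetic. The sketch below assumes an idealized pipeline in which every stage takes one clock cycle and there are no stalls; the function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

/* Cycles to add n element pairs when each addition passes through
 * all k stages before the next one starts (no overlap). */
long serial_cycles(long n, long k)
{
    assert(n > 0 && k > 0);
    return n * k;
}

/* Cycles with an idealized k-stage pipeline: k cycles until the first
 * result appears, then one result per cycle for the other n - 1. */
long pipelined_cycles(long n, long k)
{
    assert(n > 0 && k > 0);
    return k + (n - 1);
}

/* Idealized speedup of the pipelined version over the serial one.
 * It approaches k as n grows. */
double pipeline_speedup(long n, long k)
{
    return (double)serial_cycles(n, k) / (double)pipelined_cycles(n, k);
}
```

For arrays of 1000 elements and six stages, this gives 6000 cycles against 1005, a speedup of about 5.97. For a single addition the speedup is exactly 1, which is why the benefit only shows up on long vectors.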
Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors and relies on a software layer to break the given task into independent parallel sub-tasks and compute them on different nodes.
Some
features of clusters are as follows:
· Clusters are built using commodity hardware and cost a fraction of the price of vector processors. In many cases, the price is lower by more than an order of magnitude.
· Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.
· With clusters, you can add more nodes to the cluster based on need.
· Open source software components and Linux lead to lower software costs.
· Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).
Clusters are usually deployed to improve speed and/or reliability over that provided by a single computer. All the nodes inside a cluster are almost always connected by a fast, low-latency network, which restricts the cluster nodes to the same LAN (local area network).
With the introduction of technologies like InfiniBand, PVFS (Parallel Virtual File System), and Gigabit Ethernet, clustering has emerged as the most promising form of supercomputing. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.
Clusters show exceptional potential and are recognized as the possible future leader in the arena of high-performance computing.
The first cluster-based supercomputer entered the TOP500 supercomputer list in 1997, and Figure 1 shows the current share of clusters in the TOP500 list [1].

These clusters
have become very popular over the last few years because of the dramatic
performance improvements in PC processors, to the point where they are now comparable
to high-end UNIX workstations for scientific computations.
Clusters
can further be categorized into -
This site is primarily devoted to providing information on the HPC (High Performance Computing) variety of cluster supercomputing. From now on, we will restrict this report to HPC clusters, and the term “cluster” will refer specifically to an HPC cluster.
As shown in the diagram below, a cluster typically comprises –

Since the tasks handled by compute nodes are independent, these nodes can be added to or removed from a cluster even while it’s up and working.
Based on the way these clusters break down a massive task, they can be divided into two broad categories: Master-Slave clusters and SSI (single system image) clusters.
Master-Slave clusters run specially written applications which by design take advantage of the underlying parallel architecture. Applications run on the master, which distributes the parallel, independent pieces of work to the slaves. Such applications make use of parallel compilers, which break the application code into independent parallel instructions capable of running on an MPI (Message Passing Interface) layer. Programs have to be specifically written to take full advantage of this particular architecture.
SSI clusters create a single (operating) system image from the collection of independent nodes. Here the SSI architecture and a modified kernel, rather than the application, take care of implementing parallelism. Applications such as database and web servers can run unmodified, but can take full advantage of the cluster only if they are multi-threaded, so that they can be scheduled on different CPUs simultaneously. This approach has its roots in NUMA-based MPP or SMP systems.

Both these approaches have advantages and disadvantages and are used by entirely different audiences. We shall look at this in detail now.
We have
specific sections of this site covering the major distributions of Master Slave
Clusters and SSI Clusters including a review of the distributions and
information on their installations.
Clustering Hardware Requirements
By
definition, a cluster must consist of at least two nodes, a master and a slave.
It is not necessary that these machines have the same levels of performance.
The only requirement is that they both share the same architecture. For
instance, the cluster should only consist of all Intel machines or all Apple
machines but not a mixture of the two.
Strictly speaking, the only hardware requirement when building a cluster is two computers and some type of networking hardware to connect them. To maximize the benefits of a cluster, however, the right hardware must be used. If at all possible, use identical systems for your nodes. Life will be much simpler: you'll need to develop and test only one configuration. Identical nodes also help performance, because one node that takes longer to do its work can slow the entire cluster down, as the rest of the nodes must stop what they are doing and wait for the slow node to catch up. This is not always the case, but it is a consideration that must be made. Having identical hardware specs also simplifies the setup process a great deal, as it allows each hard drive to be imaged from a master instead of configuring each node individually.
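The effect of one slow node can be made concrete with a small sketch. It assumes a workload in which all nodes must finish a step before any of them can start the next one; the function names and the timings in the note below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Wall-clock time for one synchronized step: every node has to wait
 * for the slowest node before the next step can begin. */
double step_time(const double *node_seconds, size_t n_nodes)
{
    assert(node_seconds != NULL && n_nodes > 0);
    double slowest = node_seconds[0];
    for (size_t i = 1; i < n_nodes; i++) {
        if (node_seconds[i] > slowest)
            slowest = node_seconds[i];
    }
    return slowest;
}

/* Fraction of the cluster's total node time that is spent idle,
 * waiting for the slowest node to catch up. */
double idle_fraction(const double *node_seconds, size_t n_nodes)
{
    double slowest = step_time(node_seconds, n_nodes);
    assert(slowest > 0.0);
    double busy = 0.0;
    for (size_t i = 0; i < n_nodes; i++)
        busy += node_seconds[i];
    return 1.0 - busy / (slowest * (double)n_nodes);
}
```

With three nodes that need 1 second each and a fourth that needs 2 seconds, the step takes 2 seconds and 37.5% of the cluster's time is spent waiting. Workloads that do not synchronize between steps are less sensitive to this.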
In constructing
a cluster, you can scrounge for existing computers, buy assembled computers, or
buy the parts and assemble your own. Scrounging is the cheapest way to go, but
this approach is often the most time consuming. Usually, using scrounged
systems means you'll end up with a wide variety of hardware, which creates both
hardware and software problems. With older scrounged systems, you are also more
likely to have even more hardware problems. If this is your only option, try to
standardize hardware as much as possible.
Buying new,
preassembled computers may be the simplest approach if money isn't the primary
concern. This is often the best approach for mission-critical applications or
when time is a critical factor. Buying new is also the safest way to go if you
are uncomfortable assembling computers.
Building
your own system is cheaper, provides higher performance and reliability, and
allows for customization. Assembling your own computers may seem daunting, but
it isn't that difficult. You'll need time, personnel, space, and a few tools.
It's a good idea to build a single system and test it for hardware and software
compatibility before you commit to a large bulk order. Even if you do buy
preassembled computers, you will still need to do some testing and maintenance.
If you are
constructing a dedicated cluster, you will not need
full systems. The more you can leave out of each computer, the more computers
you will be able to afford, and the less you will need to maintain on
individual computers. For example, with dedicated clusters you can do without
monitors, keyboards, and mice for each individual compute node. With a minimal
configuration, wiring is usually significantly easier, particularly if you use
rack-mounted equipment. (However, heat dissipation can be a serious problem
with rack-mounted systems.)
The First Step: Hardware Considerations
To build a
cluster, one must have access to computers on which to install the software.
Therefore, it makes sense to cover this early in the process. For a dedicated
cluster, you determine your needs and there may be a lot you won't need—audio
cards and speakers, video capture cards, etc. Beyond these obvious expendables,
there are other additional parts you might want to consider omitting such as
disk drives, keyboards, mice, and displays.
CPU & Motherboards
The choice of CPU should be made from two families: Intel x86 [IA32] compatible (such as the Pentium 4) or Compaq Alpha systems [formerly DEC]. CPUs from other vendors (such as the IBM PowerPC and the Sun SPARC) are supported by Linux and other cluster-capable operating systems, but there is only limited support and few accompanying software distributions available for these architectures. In general, clusters are mainly built with Intel or Alpha architectures. Intel-based systems are considered commodity systems because they are ubiquitous and available from multiple sources (Intel, AMD, Cyrix). Compaq Alpha, on the other hand, is a clear performance winner, but it is a single-source part and can therefore be hard to obtain at a good price.
Within
the Intel based systems, the Pentium 3 and 4 [23] have the best floating point
performance and they support SMP motherboards.
Selecting a
processor is a balancing act. Your choice will be constrained by cost, performance,
and compatibility. Remember, the rationale behind a commodity off-the-shelf
(COTS) cluster is buying machines that have the most favorable price to
performance ratio, not pricey individual machines. Typically you'll get the
best ratio by purchasing a CPU that is a generation behind the current cutting
edge. This means comparing the numbers. When comparing CPUs, you should look at
the increase in performance versus the increase in the total
cost of a node.
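The comparison described above can be written down as a small calculation. This is only a sketch: the function names are hypothetical, "performance" can be any benchmark figure you trust (for example MFLOPS), and the cost must be the total cost of a node, not just the CPU.

```c
#include <assert.h>

/* Performance delivered per unit of money for one complete node. */
double perf_per_cost(double perf, double node_cost)
{
    assert(node_cost > 0.0);
    return perf / node_cost;
}

/* Returns 1 if the faster CPU improves the node's price-performance
 * ratio, 0 if the extra performance costs more than it is worth. */
int upgrade_pays_off(double base_perf, double base_node_cost,
                     double new_perf, double new_node_cost)
{
    return perf_per_cost(new_perf, new_node_cost) >
           perf_per_cost(base_perf, base_node_cost);
}
```

As a made-up example, a CPU that lifts performance from 1000 to 1200 units while raising the node cost from 800 to 1100 lowers the ratio from 1.25 to about 1.09, so the older CPU is the better buy. If the node cost rose only to 900, the upgrade would pay off.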
Since Linux
works with most major chip families, stay mainstream and you shouldn't have any
software compatibility problems. Nonetheless, it is a good idea to test a
system before committing to a bulk purchase. Since a primary rationale for
building your own cluster is the economic advantage, you'll probably want to
stay away from the less common chips. While clusters built with UltraSPARC systems may be wonderful performers, few people
would describe these as commodity systems. So unless you just happen to have a
number of these systems that you aren't otherwise using, you'll probably want
to avoid them.
When
comparing motherboards, look to see what is integrated into the board. There
are some significant differences. Serial, parallel, and USB ports along with
EIDE disk adapters are fairly standard. You may also find motherboards with
integrated FireWire ports, a network interface, or even a video interface. You
may be able to save money with built-in network or display interfaces (provided they actually meet your needs). If you are really certain that some fully integrated motherboard meets your needs, eliminating the need for daughter cards may allow you to go with a small case.
So a motherboard with built-in components, e.g. AGP graphics and a LAN card (preferably with both 100 Mbps and 1 Gbps NICs), is a feasible option for the following reasons:
Example:
Intel Entry level Server Board S875WP1-E
The standard Pentium 4 processor (without extended addressing) can only address 4 GB of virtual memory. Hence, if you are choosing a low-priced CPU, an Intel Pentium 4 with EM64T or an AMD Athlon 64 would be better; they can address up to 2 TB of virtual memory. If you are choosing a server-class CPU, Intel Xeon or AMD Opteron would be a better choice.
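The 4 GB limit follows directly from the width of a 32-bit address: 2^32 distinct byte addresses. A minimal sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct byte addresses reachable with the given number
 * of address bits (2 raised to that power). */
uint64_t addressable_bytes(unsigned address_bits)
{
    assert(address_bits < 64);
    return (uint64_t)1 << address_bits;
}

/* Convert a byte count to whole binary gigabytes (2^30 bytes each). */
uint64_t bytes_to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```

With 32 address bits this yields 4,294,967,296 bytes, i.e. 4 GB. By the same arithmetic, a 2 TB address space corresponds to 41 address bits.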
Memory per node is another deciding factor that can affect the throughput of the cluster. Among the hardware resources on a cluster node, memory is comparatively cheap and easily upgraded. Increasing any hardware resource on the cluster will generally improve its performance. Subject to your budget, the more cache and RAM in your system, the better. Typically, the faster the processor, the more RAM you will need. A very crude rule of thumb is one byte of RAM for every floating-point operation per second. So a processor capable of 100 MFLOPS would need around 100 MB of RAM.
Each node should have at least 2 GB of memory to facilitate the process checkpointing techniques that various cluster solutions use.
Almost all
the clustering solutions allow each node to have either decentralized private
storage or a common shared storage.
Decentralized storage – Each node can have any SATA or SCSI disk with a rotational speed above 5,400 RPM.
PVFS – The Parallel Virtual File System would be a good choice; it creates a single file system across the different private disks.
Shared storage – An iSCSI setup would be ideal for shared storage. In this case the (centralized) shared storage would be connected to all the nodes over the network. Each node would preferably have a Gigabit Ethernet NIC for communication with the iSCSI drives. If using iSCSI, GFS (Global File System) from Red Hat can be installed on the shared storage, so that all the nodes can access the disks concurrently. GFS comes by default with the RHEL installation.
With both PVFS and GFS, all the nodes would see the same file-system namespace. GFS is known to be more secure, but an iSCSI setup costs much more.
If you are
buying hard disks, there are three issues: interface type (EIDE vs. SCSI), disk
latency (a function of rotational speed), and disk capacity. From a
price-performance perspective, EIDE is probably a better choice than SCSI since
virtually all motherboards include a built-in EIDE interface. And unless you
are willing to pay a premium, you won't have much choice with respect to disk
latency. Almost all current drives rotate at 7,200 RPM. While a few 10,000 RPM
drives are available, their performance, unlike their price, is typically not
all that much higher. With respect to disk capacity, you'll need enough space
for the operating system, local paging, and the data sets you will be
manipulating. Unless you have extremely large data sets, a 10 GB disk should be adequate for most uses when recycling older computers; however, a 40 GB IDE disk is recommended, mostly because of other non-cluster requirements.
In many clusters, the compute nodes usually require no monitor, keyboard, or mouse. The advantages include low cost, less equipment to maintain, and a smaller equipment footprint. However, a monitor, keyboard, and mouse are required at the head node. It is also recommended that an extra set of these peripherals be kept for troubleshooting any node.
Adapters, power supplies, and cases
Each node should include a video adapter. The network adapter is a key component. Gigabit Ethernet or an InfiniBand network would be preferred, and each compute node should have at least two NICs (network interface cards), both configured for private networks.
The master node should have three NICs: two for the private network and one for a public connection, which would facilitate remote administration.
Linux kernels newer than 2.6.16 support both InfiniBand and Gigabit Ethernet NICs. RHEL 4 Update 2 also added support for these.
You must
buy an adapter that is compatible with the cluster network. If you are planning
to boot a diskless system over the network, you'll need an adapter that
supports it. This translates into an adapter with an appropriate network BOOT
ROM, i.e., one with Preboot Execution Environment (PXE)
support. Many adapters come with a built-in (but empty) BOOT ROM socket so that
the ROM can be added. You can purchase BOOT ROMs for these cards or burn your
own. However, it may be cheaper to buy a new card with an installed BOOT ROM
than to add the BOOT ROMs.
A cluster
can be easily constructed from cheap commodity computers. The faster they are, the better. The two most important requirements of a
cluster are -
Supercomputing Programming Overview
If you will be writing your own applications for your cluster, this information should get you started. After the operating system and other basic system software are selected and installed, you need to select and install core software development tools, including libraries that support parallel processing. If you are using the OSCAR or Rocks clustering toolkits, then these libraries and tools are already in your cluster.
While there are hundreds of programming languages available, FORTRAN and C/C++ are still reasonable choices for writing code for high performance clusters. FORTRAN has changed considerably over the
years, so the term can mean different things to different people. While there
are more recent versions of FORTRAN, your choice will likely be between FORTRAN
77 and FORTRAN 90. For a variety of reasons, FORTRAN 77 is likely to get the
nod over FORTRAN 90 despite the greater functionality of FORTRAN 90. First, the
GNU implementation of FORTRAN 77 is likely to already be on your machine. If it
isn't, it is freely available and easily obtainable. If you really want FORTRAN
90, don't forget to budget for it. But you should also realize that you may
face compatibility issues. When selecting parallel
programming libraries to use with your compiler, your choices will be more
limited with FORTRAN 90.
C and C++ are the obvious alternatives to FORTRAN. For new
applications that don't depend on compatibility with legacy FORTRAN
applications, C is probably the best choice. In general, you have greater
compatibility with libraries. And at this point in time, you are likely to find
more programmers trained in C than FORTRAN. So when you need help, you are more
likely to find a helpful C than FORTRAN programmer.
The most common and widely used parallel programming libraries are –
· Parallel Virtual Machine (PVM)
· Message Passing Interface (MPI) library
Development of PVM started in 1989 and continued into the 1990s as a joint
effort among Oak Ridge National Laboratory, the
This PVM
implementation provides both libraries and tools based on a message-passing
model.
MPI is a
newer standard that seems to be generally preferred over PVM by many users.
MPI is an
API for parallel programming based on a message-passing model for parallel
computing. MPI processes execute in parallel. Each process has a separate
address space. Sending processes specify data to be sent and a destination
process. The receiving process specifies an area in memory for the message, the
identity of the source, etc.
Primarily,
MPI can be thought of as a standard that specifies a library. Users can write
code in C, C++, or FORTRAN using a standard compiler and then link to the MPI
library. The library implements a predefined set of function calls to send and
receive messages among collaborating processes on the different machines in the
cluster. You write your code using these functions and link the completed code
to the library.
The MPI
specification was developed by the MPI Forum, a collaborative effort with
support from both academia and industry. It is suitable for both small clusters
and "big-iron" implementations. It was designed with functionality, portability,
and efficiency in mind. By providing a well-designed set of function calls, the
library provides a wide range of functionality that can be implemented in an
efficient manner. As a clearly defined standard, the library can be implemented
on a variety of architectures, allowing code to move easily among machines.
MPI has gone through a couple of revisions since it was introduced in the early '90s. At the time of writing, MPI-2 is the latest version available. While there are several different implementations of MPI, two are widely used: LAM/MPI and MPICH. Both go beyond simply providing a library: each includes programming and runtime environments that provide mechanisms to run programs across the cluster. Both are robust, well supported, and freely available. Excellent documentation is provided with both.
LAM/MPI
The Local
Area Multicomputer/Message Passing Interface
(LAM/MPI)
was originally developed by the
Documentation
can be downloaded from the LAM site,
There are
also tutorials, a FAQ, and archived mailing lists.
MPICH
Message
Passing Interface Chameleon (MPICH) was developed by
William Gropp and Ewing Lusk and is freely available from
Argonne National Laboratory
http://www-unix.mcs.anl.gov/mpi/mpich/
Like LAM,
it is both a library and an execution environment. It runs on a wide variety of
UNIX platforms and is even available for Windows NT.
Documentation
can be downloaded from the web site. There are separate manuals for each of the
communication models.
Examples Using MPI
While MPI
does not offer some of the specialized features available in PVM, it is based
on agreed-upon standards, is increasingly used for code development, and has
adequate features for most parallel applications. Hence, the coding examples
presented here are written in C using MPI. The codes have been tested on a Beowulf cluster using the GNU C compiler (gcc) with the MPICH implementation of MPI.
Program 1 is a "Hello World!" program that illustrates the basic MPI calls necessary to start up and end an MPI program.
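#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int me, nprocs, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    /* Initialize the MPI environment */
    MPI_Init(&argc, &argv);
    /* Number of processes in MPI_COMM_WORLD and this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    /* Hostname of the node this process is running on */
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Hello World! I'm process %d of %d on %s\n", me, nprocs,
           processor_name);

    /* Terminate the MPI environment */
    MPI_Finalize();
    return 0;
}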
In order to successfully compile the code, the MPI header file (mpi.h) must be included at the top of the code. Just inside main(), MPI_Init() must be called and handed the command line arguments so that the environment is set up correctly for the program to run in parallel.
The next three MPI routines in Program 1 return information about the parallel environment for use later in the code. In this example, we merely print out the information, but in most parallel codes this information is used to do automatic problem decomposition and to set up communications between processes. MPI_Comm_size() provides the number of processes in the communicator group MPI_COMM_WORLD, which is stored in nprocs. MPI_COMM_WORLD is a special communicator which denotes all of the processes available at initialization. MPI_Comm_rank() provides the rank or process number (ranging from 0 to nprocs-1) of the calling process, which is stored in me. MPI_Get_processor_name() provides the hostname of the node (not the individual processor) being used, stored in processor_name, as well as the length of this hostname, stored in namelen.
Next, the
code prints "Hello World!" and the values of the variables obtained
in the three previous MPI calls. Finally,
MPI_Finalize() is called to terminate the parallel
environment.
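The explanation above mentions that the rank and the number of processes are typically used for automatic problem decomposition. One common way to do this, sketched below in plain C without any MPI calls, is to give each process a contiguous block of the work items. The structure and function names are hypothetical; in a real program, rank and nprocs would be the values returned by MPI_Comm_rank() and MPI_Comm_size().

```c
#include <assert.h>

/* Half-open range [start, end) of work items owned by one process. */
struct block_range {
    long start;
    long end;
};

/* Split n_items as evenly as possible across nprocs processes and
 * return the block that belongs to the process with the given rank. */
struct block_range block_for_rank(long n_items, int rank, int nprocs)
{
    assert(n_items >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    long base = n_items / nprocs;   /* items every rank receives */
    long extra = n_items % nprocs;  /* first 'extra' ranks get one more */
    struct block_range r;
    r.start = rank * base + (rank < extra ? rank : extra);
    r.end = r.start + base + (rank < extra ? 1 : 0);
    return r;
}
```

With 10 items and 4 processes, ranks 0 to 3 get the ranges [0, 3), [3, 6), [6, 8) and [8, 10). Each process then loops only over its own range, so no communication is needed to decide who does what.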
MPI programs can be compiled in many ways, but most MPI implementations provide an easy-to-use script which will set desired compiler flags, point the compiler at the right directory for MPI header files, and include the necessary libraries for the linker. The MPICH implementation provides a script called mpicc, which will use the desired compiler, in this case gcc, and will pass the other command line arguments to it. Output 1 shows the result of running the compiled program on six processes with mpirun.
# mpirun -np 6 hello
Hello World! I'm process 4 of 6 on beowulf005
Hello World! I'm process 1 of 6 on beowulf002
Hello World! I'm process 5 of 6 on beowulf006
Hello World! I'm process 2 of 6 on beowulf003
Hello World! I'm process 3 of 6 on beowulf004
Hello World! I'm process 0 of 6 on beowulf001