Overview

Overview of Supercomputer Architectures, Hardware and Programming Methods


The basic concept behind HPC (High Performance Computing) is parallelism. HPC hardware (and software) falls into three categories:

 

  1. SMP (Symmetric Multi-Processors)
  2. Vector processors
  3. Clustering

 

 

SMP is a multiprocessor computer architecture in which two or more identical processors are connected to a single, uniformly shared main memory. Most common commercial systems use SMP for parallel processing.

 

Almost all modern operating systems are SMP capable, i.e. they can simultaneously execute code on multiple processors to achieve parallelism. The recent O(1) (read "order one", i.e. constant-time) scheduler in the Linux kernel has put Linux in the limelight.

 

The approximate SMP scalability of various operating systems is:

 

  1. Windows 2000/NT – 2 CPUs (although 4 CPUs are claimed)
  2. Windows Server 2003 R2 – 4 to 8 CPUs (performance does not improve beyond 4 CPUs)
  3. Linux – 32 CPUs (commercial servers support only 16 CPUs)
  4. AIX – 64 CPUs
  5. Solaris – 128 CPUs
  6. HP-UX – 256 CPUs (Superdome servers, which can have as much as 2 TB of RAM)

 

 

Currently IBM and Sun (Sun Fire X series) offer Linux SMP systems with 16 64-bit AMD Opteron or Intel Xeon/Itanium CPUs. However, Linux currently has some scalability issues on SMP systems, mainly attributed to its kernel locking mechanism.

 

HP boasts of its Superdome servers, which have up to 256 64-bit Intel Itanium or HP PA-RISC processors and 2 TB of RAM. They run only HP-UX, which in turn is licensed on a per-CPU basis.

 

The main discouraging factor with SMP is the high cost of ownership and maintenance: if even one CPU, or just its cache, fails, the entire system has to be replaced.

 

By contrast, the hot-swappable nodes of a Linux compute cluster make it more popular in the HPC market. Also, the very thin Linux layer (if any) running on the cluster nodes leaves nearly all the CPU power available to the application, whereas on an SMP system the OS itself consumes significant resources.

 

Not surprisingly, a compressed HP-UX kernel image is larger than 180 MB, and sometimes as large as 300 MB, and it occupies a large amount of memory once uncompressed. A Linux kernel image is usually about 1.5 MB and can be made as small as 200 KB.

 

Product information about SMP systems can be obtained from:

 

  1. http://www.sun.com/software/linux/
  2. http://www.ibm.com/servers/eserver/linux/home.html?c=serversintro&n=Linux2001&t=ad
  3. http://www.alinuxservers.com/
  4. http://www.ibm.com/linux/

 

 

Vector processors, which date back to the 1970s, are still a subject of research and are still far from meeting enterprise requirements.

 

 

Clusters are groups of loosely coupled, independent servers (called nodes) that work together closely so that in many respects they can be viewed as a single large server.

 

Before cluster-based computing, the typical supercomputer was a vector processor that could cost over a million dollars because of its specialized hardware and software.

 

Fundamentally, both vector and scalar processors execute instructions based on a clock; what sets them apart is the ability of vector processors to parallelize computations involving vectors (such as matrix multiplication), which are commonly used in High Performance Computing. To illustrate this, suppose you have two double arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].

 

Any floating-point operation like addition or multiplication is achieved in a few discrete steps:

 

·       Exponents are adjusted

·       Significands are added

·       The result is normalized and rounded, and so on

 

 

A vector processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE arithmetic hardware) in a floating-point addition as shown in Figure 1.

[Figure 1: the six steps of a floating-point addition]

A vector processor overlaps these six steps: while one pair of elements is in step two, the next pair is already in step one, so once the pipeline is full a result emerges on every clock cycle. For a six-step floating-point addition on long vectors, the speedup factor will therefore be very close to six. A big benefit is that the parallelization happens behind the scenes, and you need not explicitly code it in your programs.

 

Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors and relies on the software layer to break the given task into independent parallel sub-tasks and compute them on different nodes.

 

Some features of clusters are as follows:

 

·       Clusters are built using commodity hardware and cost a fraction of the price of vector processors. In many cases, the price is lower by more than an order of magnitude.

·       Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.

·       With clusters, you can add more nodes to the cluster based on need.

·       Open source software components and Linux lead to lower software costs.

·       Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).

 

 

Clusters are usually deployed to improve speed and/or reliability over that provided by a single computer. All the nodes inside a cluster are almost always connected by a fast, low-latency network, which restricts the cluster nodes to the same LAN (local area network).

 

With the introduction of technologies such as InfiniBand, PVFS (Parallel Virtual File System), and Gigabit Ethernet, clustering has emerged as the most promising form of supercomputing. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.

 

Clusters show exceptional potential and are recognized as a likely future leader in the arena of high-performance computing.

 

The first cluster-based supercomputer entered the TOP500 supercomputer list in 1997, and Figure 1 shows the current share of clusters in the TOP500 list [1].

 

 

 

These clusters have become very popular over the last few years because of the dramatic performance improvements in PC processors, to the point where they are now comparable to high-end UNIX workstations for scientific computations.

 

 

Clusters can be further categorized into:

 

  1. High Availability (HA) cluster
  2. Load Balancing cluster
  3. High Performance Computing (HPC) cluster

 

This site is primarily devoted to the HPC (High Performance Computing) variety of cluster supercomputing. From here on, we restrict the discussion to HPC clusters, and the term “cluster” refers specifically to an HPC cluster.

 

 

As shown in the diagram below, a cluster typically comprises:

  1. head node – which controls the clustering and computing operations
  2. compute nodes – which take independent compute tasks from the head node and return the results

[Diagram: cluster layout with head node and compute nodes]

Since the tasks handled by the compute nodes are independent, these nodes can be added to or removed from a cluster even while it’s up and running.

 

Based on the way these clusters break down a massive task, they can be divided into two broad categories:

 

  1. Master-Slave clusters.
  2. SSI – Single System Image clusters.

 

Master-Slave clusters run specially written applications that by design take advantage of the underlying parallel architecture. The application runs on the master, which distributes the parallel, independent work to the slaves. Such applications make use of parallel compilers, which break the application code into independent parallel instructions capable of running on an MPI (Message Passing Interface) layer. Programs have to be written specifically to take full advantage of this architecture.

 

SSI clusters create a single (operating) system image from the collection of independent nodes.

 

Here the SSI architecture and the modified kernel, rather than the application, take care of implementing parallelism. Applications such as databases and web servers can run unmodified, but they can take full advantage of the cluster only if they are multi-threaded, so that their threads can be scheduled on different CPUs simultaneously. This approach has its roots in NUMA-based MPP or SMP systems.

 

 

Both approaches have advantages and disadvantages, and they are used by entirely different audiences. We shall look at this in detail now.

 

We have specific sections of this site covering the major distributions of Master Slave Clusters and SSI Clusters including a review of the distributions and information on their installations.

 

 

Clustering Hardware Requirements

By definition, a cluster must consist of at least two nodes, a master and a slave. It is not necessary that these machines have the same levels of performance. The only requirement is that they both share the same architecture. For instance, the cluster should only consist of all Intel machines or all Apple machines but not a mixture of the two.

 

Strictly speaking, the only hardware requirement when building a cluster is two computers and some type of networking hardware to connect them. To maximize the benefits of a cluster, however, the right hardware must be used. If at all possible, use identical systems for your nodes. Life will be much simpler: you'll need to develop and test only one configuration. Identical nodes also help performance, because one node that takes longer to do its work can slow the entire cluster down, as the rest of the nodes must stop what they are doing and wait for the slow node to catch up. This is not always the case, but it is a consideration that must be made. Having identical hardware specs also simplifies the setup process a great deal, as it allows each hard drive to be imaged from a master instead of configuring each node individually.

 

In constructing a cluster, you can scrounge for existing computers, buy assembled computers, or buy the parts and assemble your own. Scrounging is the cheapest way to go, but this approach is often the most time consuming. Usually, using scrounged systems means you'll end up with a wide variety of hardware, which creates both hardware and software problems. With older scrounged systems, you are also more likely to have even more hardware problems. If this is your only option, try to standardize hardware as much as possible.

 

Buying new, preassembled computers may be the simplest approach if money isn't the primary concern. This is often the best approach for mission-critical applications or when time is a critical factor. Buying new is also the safest way to go if you are uncomfortable assembling computers.

 

Building your own system is cheaper, provides higher performance and reliability, and allows for customization. Assembling your own computers may seem daunting, but it isn't that difficult. You'll need time, personnel, space, and a few tools. It's a good idea to build a single system and test it for hardware and software compatibility before you commit to a large bulk order. Even if you do buy preassembled computers, you will still need to do some testing and maintenance.

 

If you are constructing a dedicated cluster, you will not need full systems. The more you can leave out of each computer, the more computers you will be able to afford, and the less you will need to maintain on individual computers. For example, with dedicated clusters you can do without monitors, keyboards, and mice for each individual compute node. With a minimal configuration, wiring is usually significantly easier, particularly if you use rack-mounted equipment. (However, heat dissipation can be a serious problem with rack-mounted systems.)

 

The First Step: Hardware Considerations

To build a cluster, one must have access to computers on which to install the software. Therefore, it makes sense to cover this early in the process. For a dedicated cluster, you determine your needs and there may be a lot you won't need—audio cards and speakers, video capture cards, etc. Beyond these obvious expendables, there are other additional parts you might want to consider omitting such as disk drives, keyboards, mice, and displays.

 

CPU & Motherboards

A choice of CPU should be made from two families: Intel x86 [IA32] compatible (such as the Pentium 4) or Compaq Alpha systems [formerly DEC]. CPUs from other vendors (such as the IBM PowerPC and the Sun SPARC) are supported by Linux and other cluster-capable operating systems; however, there is only limited support and few accompanying software distributions for these architectures. In general, clusters are mainly built with Intel or Alpha architectures. Intel-based systems are considered commodity systems because they are ubiquitous and available from multiple sources (Intel, AMD, Cyrix). Compaq Alpha, on the other hand, is a clear performance winner, but it is a single-source part and can therefore be hard to obtain at a good price.

 

Within the Intel based systems, the Pentium 3 and 4 [23] have the best floating point performance and they support SMP motherboards.

 

Selecting a processor is a balancing act. Your choice will be constrained by cost, performance, and compatibility. Remember, the rationale behind a commodity off-the-shelf (COTS) cluster is buying machines that have the most favorable price to performance ratio, not pricey individual machines. Typically you'll get the best ratio by purchasing a CPU that is a generation behind the current cutting edge. This means comparing the numbers. When comparing CPUs, you should look at the increase in performance versus the increase in the total cost of a node.

 

Since Linux works with most major chip families, stay mainstream and you shouldn't have any software compatibility problems. Nonetheless, it is a good idea to test a system before committing to a bulk purchase. Since a primary rationale for building your own cluster is the economic advantage, you'll probably want to stay away from the less common chips. While clusters built with UltraSPARC systems may be wonderful performers, few people would describe these as commodity systems. So unless you just happen to have a number of these systems that you aren't otherwise using, you'll probably want to avoid them.

 

When comparing motherboards, look to see what is integrated into the board. There are some significant differences. Serial, parallel, and USB ports along with EIDE disk adapters are fairly standard. You may also find motherboards with integrated FireWire ports, a network interface, or even a video interface. You may be able to save money with built-in network or display interfaces (provided they actually meet your needs). If you are really certain that a fully integrated motherboard meets your needs, eliminating the need for daughter cards may allow you to go with a smaller case.

 

So a motherboard with built-in components such as AGP video and a LAN interface (preferably with both 100 Mbps and 1 Gbps NICs) is a feasible option, for the reasons given above.

Example: Intel Entry level Server Board S875WP1-E

 

The standard Pentium 4 processor (without extended addressing) can only address 4 GB of virtual memory.

Hence, for a low-priced CPU, an Intel Pentium 4 with EM64T or an AMD Athlon 64 would be a better choice; they can address up to 2 TB of virtual memory.

For a server-class CPU, Intel Xeon or AMD Opteron would be a better choice.

 

Memory and disks

Memory per node is another deciding factor that can affect the throughput of the cluster. Among the hardware resources on a cluster node, memory is comparatively cheap and easily upgraded. Increasing any hardware resource on the cluster will generally improve its performance. Subject to your budget, the more cache and RAM in your system, the better. Typically, the faster the processor, the more RAM you will need. A very crude rule of thumb is one byte of RAM for every floating-point operation per second. So a processor capable of 100 MFLOPS would need around 100 MB of RAM.

 

Each node should have at least 2 GB of memory to facilitate the process checkpointing techniques that various cluster solutions use.

 

Almost all the clustering solutions allow each node to have either decentralized private storage or a common shared storage.

 

Decentralized storage – Each node can have any SATA or SCSI disk with a rotational speed above 5,400 RPM.

 

PVFS – The Parallel Virtual File System would be a good choice; it creates a single file system across the different private disks.

 

Shared storage – An iSCSI setup would be ideal for shared storage. In this case the (centralized) shared storage is connected to all the nodes over the network. Each node should preferably have a Gigabit Ethernet NIC for communication with the iSCSI drives. If using iSCSI, GFS (Global File System) from Red Hat can be installed on the shared storage so that all the nodes can access the disks concurrently. GFS comes by default with a RHEL installation.

 

With both PVFS and GFS, all the nodes see the same file-system namespace. GFS is known to be more secure, but an iSCSI setup costs much more.

 

If you are buying hard disks, there are three issues: interface type (EIDE vs. SCSI), disk latency (a function of rotational speed), and disk capacity. From a price-performance perspective, EIDE is probably a better choice than SCSI since virtually all motherboards include a built-in EIDE interface. And unless you are willing to pay a premium, you won't have much choice with respect to disk latency. Almost all current drives rotate at 7,200 RPM. While a few 10,000 RPM drives are available, their performance, unlike their price, is typically not all that much higher. With respect to disk capacity, you'll need enough space for the operating system, local paging, and the data sets you will be manipulating. Unless you have extremely large data sets, a 10 GB disk should be adequate for most uses when recycling older computers; however, a 40 GB IDE disk is recommended, mostly because of other non-cluster requirements.

Monitors, keyboards, and mice

In many clusters, the compute nodes need no monitor, keyboard, or mouse.

 

The advantages include lower cost, less equipment to maintain, and a smaller equipment footprint. However, a monitor, keyboard, and mouse are required at the head node. It is also recommended that an extra set of these peripherals be kept for troubleshooting any node.

 

Adapters, power supplies, and cases

Each node should include a video adapter. The network adapter is a key component.

 

A Gigabit Ethernet or InfiniBand network is preferred, and each compute node should have at least two NICs (network interface cards), both configured for private networks.

 

The master node should have three NICs: two for the private network and one for the public connection, which facilitates remote administration.

 

Linux kernels newer than 2.6.16 support both InfiniBand and Gigabit Ethernet NICs. RHEL 4 Update 2 also added support for them.

 

You must buy an adapter that is compatible with the cluster network. If you are planning to boot a diskless system over the network, you'll need an adapter that supports it. This translates into an adapter with an appropriate network BOOT ROM, i.e., one with pre-execution environment (PXE) support. Many adapters come with a built-in (but empty) BOOT ROM socket so that the ROM can be added. You can purchase BOOT ROMs for these cards or burn your own. However, it may be cheaper to buy a new card with an installed BOOT ROM than to add the BOOT ROMs.

 

A cluster can easily be constructed from cheap commodity computers. The faster they are, the better. The two most important requirements of a cluster are:

  1. A high-speed, low-latency network.
  2. A large amount of RAM (more than 2 GB per node).

 

 

Supercomputing Programming Overview

If you will be writing your own applications for your cluster, this information should get you started. After the operating system and other basic system software are selected and installed, you need to select and install core software development tools, including libraries that support parallel processing. If you are using the OSCAR or Rocks clustering toolkits, then these libraries and tools are already in your cluster.

 

Programming Languages

While there are hundreds of programming languages available, FORTRAN and C/C++ are still reasonable choices for writing code for high performance clusters. FORTRAN has changed considerably over the years, so the term can mean different things to different people. While there are more recent versions of FORTRAN, your choice will likely be between FORTRAN 77 and FORTRAN 90. For a variety of reasons, FORTRAN 77 is likely to get the nod over FORTRAN 90 despite the greater functionality of FORTRAN 90. First, the GNU implementation of FORTRAN 77 is likely to already be on your machine. If it isn't, it is freely available and easily obtainable. If you really want FORTRAN 90, don't forget to budget for it. But you should also realize that you may face compatibility issues. When selecting parallel programming libraries to use with your compiler, your choices will be more limited with FORTRAN 90.

 

C and C++ are the obvious alternatives to FORTRAN. For new applications that don't depend on compatibility with legacy FORTRAN applications, C is probably the best choice. In general, you have greater compatibility with libraries. And at this point in time, you are likely to find more programmers trained in C than FORTRAN. So when you need help, you are more likely to find a helpful C than FORTRAN programmer.

 

Selecting a Library

The most common and widely used parallel programming libraries are:

 

Parallel Virtual Machine (PVM)

Message Passing Interface (MPI) library

 

Development of PVM started in 1989 and continued into the 1990s as a joint effort among Oak Ridge National Laboratory, the University of Tennessee, Emory University, and Carnegie Mellon University. An implementation of PVM is available from

 

http://www.netlib.org/pvm3/

 

This PVM implementation provides both libraries and tools based on a message-passing model.

 

MPI is a newer standard that seems to be generally preferred over PVM by many users.

MPI is an API for parallel programming based on a message-passing model for parallel computing. MPI processes execute in parallel. Each process has a separate address space. Sending processes specify data to be sent and a destination process. The receiving process specifies an area in memory for the message, the identity of the source, etc.

 

Primarily, MPI can be thought of as a standard that specifies a library. Users can write code in C, C++, or FORTRAN using a standard compiler and then link to the MPI library. The library implements a predefined set of function calls to send and receive messages among collaborating processes on the different machines in the cluster. You write your code using these functions and link the completed code to the library.

 

The MPI specification was developed by the MPI Forum, a collaborative effort with support from both academia and industry. It is suitable for both small clusters and "big-iron" implementations. It was designed with functionality, portability, and efficiency in mind. By providing a well-designed set of function calls, the library provides a wide range of functionality that can be implemented in an efficient manner. As a clearly defined standard, the library can be implemented on a variety of architectures, allowing code to move easily among machines.

MPI has gone through a couple of revisions since it was introduced in the early '90s. MPI-2 is the latest version available. While there are several different implementations of MPI, two are widely used: LAM/MPI and MPICH. Both go beyond simply providing a library; they include programming and runtime environments that provide mechanisms to run programs across the cluster. Both are widely used, robust, well supported, and freely available. Excellent documentation is provided with both.

 

LAM/MPI

The Local Area Multicomputer/Message Passing Interface (LAM/MPI) was originally developed by the Ohio Supercomputing Center. It is now maintained by the Open Systems Laboratory at Indiana University. As previously noted, LAM/MPI (or LAM for short) is both an MPI library and an execution environment.  LAM was designed to include an extensible component framework known as System Service Interface (SSI), one of its major strengths. It works well in a wide variety of environments and supports several methods of inter-process communications using TCP/IP. LAM will run on most UNIX machines (but not Windows). New releases are tested with both Red Hat and Mandrake Linux.

Documentation can be downloaded from the LAM site,

 

http://www.lam-mpi.org/.

 

There are also tutorials, a FAQ, and archived mailing lists.

 

MPICH

Message Passing Interface Chameleon (MPICH) was developed by William Gropp and Ewing Lusk and is freely available from Argonne National Laboratory:

 

http://www-unix.mcs.anl.gov/mpi/mpich/

 

Like LAM, it is both a library and an execution environment. It runs on a wide variety of UNIX platforms and is even available for Windows NT.

Documentation can be downloaded from the web site. There are separate manuals for each of the communication models.

 

Examples Using MPI

While MPI does not offer some of the specialized features available in PVM, it is based on agreed-upon standards, is increasingly used for code development, and has adequate features for most parallel applications. Hence, the coding examples presented here are written in C using MPI. The codes have been tested on a Beowulf cluster using the GNU C compiler (gcc) with the MPICH implementation of MPI.

 

Program 1 is a "Hello World!" program that illustrates the basic MPI calls necessary to start up and end an MPI program.

 

 

Program 1: hello.c

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
        int me, nprocs, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        /* Start the MPI environment; this must precede other MPI calls. */
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);      /* rank of this process */
        MPI_Get_processor_name(processor_name, &namelen);  /* node hostname */

        printf("Hello World!  I'm process %d of %d on %s\n", me, nprocs,
               processor_name);

        /* Shut down the MPI environment. */
        MPI_Finalize();
        return 0;
}

 

To compile the code successfully, the MPI header file (mpi.h) must be included at the top of the code. Just inside main(), MPI_Init() must be called and handed the command line arguments so that the environment is set up correctly for the program to run in parallel.

 

The next three MPI routines in Program 1 return information about the parallel environment for use later in the code. In this example, we merely print out the information, but in most parallel codes this information is used to do automatic problem decomposition and to set up communications between processes.

 

MPI_Comm_size() provides the number of processes, which is subsequently stored in nprocs, in the communicator group MPI_COMM_WORLD. MPI_COMM_WORLD is a special communicator which denotes all of the processes available at initialization. MPI_Comm_rank() provides the rank or process number (ranging from 0 to nprocs-1) of the calling process. The rank is subsequently stored in me. MPI_Get_processor_name() provides the hostname of the node (not the individual processor) being used, stored in processor_name, as well as the length of this hostname, stored in namelen.

 

Next, the code prints "Hello World!" and the values of the variables obtained in the three previous MPI calls. Finally, MPI_Finalize() is called to terminate the parallel environment.

 

MPI programs can be compiled in many ways, but most MPI implementations provide an easy-to-use script which will set desired compiler flags, point the compiler at the right directory for MPI header files, and include the necessary libraries for the linker. The MPICH implementation provides a script called mpicc which will use the desired compiler, in this case gcc, and will pass the other command line arguments to it. Output 1 shows how to compile Program 1 with mpicc and run it on six processes with mpirun.

Output 1: compiling and running hello.c

# mpicc -O -o hello hello.c
# mpirun -np 6 hello
Hello World!  I'm process 4 of 6 on beowulf005
Hello World!  I'm process 1 of 6 on beowulf002
Hello World!  I'm process 5 of 6 on beowulf006
Hello World!  I'm process 2 of 6 on beowulf003
Hello World!  I'm process 3 of 6 on beowulf004
Hello World!  I'm process 0 of 6 on beowulf001