Master-Slave Clusters

Master-Slave Clusters

 

 

Typically a Master – Slave clustering software stack looks something like this -

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Custom written clustering applications make use of a Message Parsing layer to interact and take advantage of various compute nodes.

 

The message parsing layer abstracts underlying network type and gives you an API using which you can write software code which dispatches jobs on remote nodes. Most commonly used message parsing layers are -

 

  1. MPI – (Message Parsing Interface), is a library specification for message-passing, proposed as a standard by a broadly based committee of vendors, implementors, and users. MPI as 2 open source implementations -

1.       MPICH - http://www-unix.mcs.anl.gov/mpi/mpich/

2.       Open-MPI – http://open-mpi.org/  This is a preferred implementation because of its speed of execution.

 

  1. PVM – Parallel Virtual Machine. This is another but complete different implementation of message parsing layer. MPI has found better support and acceptance than PVM. But, PVM is much more fault tolerant.

           

            More on there differences - www.mcs.anl.gov/~gropp/bib/papers/2002/mpiandpvm.pdf

 

 

A High level MPI program looks very similar to a program using BSD-socket API, with the primary difference being that MPI abstracts any lower level TCP/IP details and network topology and used node id's instead. Here is an example -

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

But, MPI fails to recover from any fault.

 

This brought about the need of PVM, which was designed and better suited to recover application and its data through faults using advanced check-pointing techniques.

 

Job scheduling and launching refers to the way you launch and monitor your applications on clusters. It tightly works with clustering management software which keeps tracks on online nodes using a software technique called heartbeat. It provides your methods for automatically starting jobs on nodes when they recover from a fault.

 

Clustering Management Software generally provide a CLI or web interface to monitor performance and health of individual nodes.

 

Typically the distributions come with a database server which keeps track of cluster memberships and a daemon which monitors the nodes. Whenever a node is added/removed/recovered , the new task (a description a parallel computation as a part of the application) is pushed to the node. And its response is logged in a cluster wide log. This log can later be viewed and analyzed using a web interface.

 

An example of Ganglia Muster Manager from Rocks distribution.

Most Master-Slave distributions support different HPC drivers  for interconnect and storage. The most commonly shipped drivers are -

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  1. Interconnect

1.       Gigabyte ethernet. Comes handly if you are working with MPI.

2.       Infiniband

 

  1. Storage

1.       GFS – creates single file-system image on shared storage for concurrent access from all the nodes

2.       PVFS – creates a single file-system image on distributed storage. This is less fault tolerant compared to GFS, as whenever a single node (with some local storage) goes down the overall availability of the file system suffers.

 

 

Master – Slave Distributions

The main project which started it all was the Beowulf Project. Its an open source  effort to bring Linux clustering to mainstream. There are many distributions which are all based on Beowulf, but provide different tools - compilers, libraries, file-systems etc.

 

 

These distributions are -

 

  1. Beowulf (http://www.beowulf.org/) - It all started from here. Its a distribution based on RHEL-3 which runs Linux kernel 2.4 . They Open-MPI as a message parsing layer, and GFS, PVFS as file-systems. They have most of the network drivers (infiniband and gigabyte ethernet) back-ported to 2.4 kernel as well.

 

  1. Rocks (http://www.rocksclusters.org/) - The Rocks Clustering Toolkit, from the folks at NPACI, is a collection of Open Source tools to help build, manage, and monitor, clusters. They bundle a wide variety of compilers, message interface and drivers. This distribution is very well maintained and based on RHEL 4 running 2.6 kernel with all the latest drivers.

 

Compiler and libraries for Open-MPI, MPICH, PVM and LAM-MPI (old cousin of Open-MPI) are bundled. PVFS and GFS drivers are also provided. And comes with a web interface to monitor memory, disk and cpu stats of all the nodes.

 

One great plus point with Rocks is that it comes with Sun Grid Engine a powerful platform for creating and batching parallel applications. I would strongly recommend using this.        

           

  1. RHEL-AS/ES  - Redhat Advanced Server. Its a enterprise quality solution which is lot like Beowulf, but supports only GFS and very specific drivers and tools. Its more concentrated on providing few but quality choices. They support Open-MPI and GFS only.

 

AS like any other enterprise solution from RHEL is commercial but open source, the souce rpms for the same can be found at -

ftp://ftp.redhat.com/pub/redhat/linux/enterprise/2.1AS/en/os

 

  1. Mandriva HPC – Its Mandrakes attempt to do exactly what Redhat is doing.No difference from RHEL – AS.

 

  1. Parallel-Knoppix (http://idea.uab.es/mcreel/ParallelKnoppix/) - It's a Live CD distribution and can comes pre-loaded and configured with all clustering goodies. Its for those who want to build a cluster in 30 mins and test your hardware. The Open-MPI and MPICH layers/compilers are installed. PVM is also included. It supports only PVFS drivers.

 

 

All these installations are Anaconda (Redhat distribution Installer) based. Hence installing them is no different from installing any Fedora RHEL versions. Same goes for configuring master node for any day-to-day operations.

 

Most distributions support central installation where, installation for all the nodes is kick-started from the head-node. The following page shows one example setup.

 

The following is an example distribution setup.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Notes :-

 

  1. All compute nodes with run variant of Linux as provided by the distribution. But the nodes are installed with a very thin layer of Linux sufficient to take parallel tasks and compute them. The can be disk-less depending upon if you are using GFS or PVFS

 

  1. Head/Master node contains all the necessary packages for compilation and kick-starting of the applications

 

  1. Frontend servers (which may be same as Head Node) contain the cluster configuration database and management software. For a large cluster it is recommended to have a seperate frontend server.

 

  1. Please observe that Head node is connected to two different network. This is to keep the cluster's internal network free from any noise. Also it makes good sense for security. Also it is advised to configure cluster with two internal networks to make it more fault tolerant.

 

 

Two very significant documents covering entire cluster setup (including hardware) are -

 

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/index.html

http://www.rocksclusters.org/rocksapalooza/2006/lab-building-cluster.pdf