Array "Hello World": A Slightly More Advanced "Hello World" Program
Before diving into the next version of the "Hello World" program, we must first introduce a few Charm++ concepts: chare arrays and readonly variables.
Introduction to Chare Arrays and Readonly Variables
It would be cumbersome to have to create many chare objects one by one (especially if a program needs to create several instances of the same chare class). Because of this, the concept of a chare array was introduced to Charm++. As the name implies, a chare array is simply an array of chare objects. The big difference between an array of chare objects and any other array is that not all of the chare objects in the array have to be located on the same processing element. By default, the elements of a chare array are assigned to the processing elements in a round-robin order (the programmer can specify which processing element each chare object in the chare array is created on, however, for this example, the default placement will be used). That is, each processing element is given one chare array element until all processing elements have one element. Then, if there are more chare objects in the chare array, each processing element gets a second chare object. This process continues until all of the chare objects in the chare array have been assigned to processing elements. While this program uses a 1D chare array, please note that it is also possible to create multi-dimensional chare arrays.
There are two special variables that can be used by elements in a chare array: thisProxy and thisIndex. Both of these variables are similar to the this keyword in C++. The thisProxy variable points to the array that this chare object is in. The thisIndex variable holds the index in the chare array that this chare object occupies (i.e. the first element is at index 0, the second at index 1, and so on).
Readonly variables are special global variables that can be accessed from any chare object in the global object space. As should be clear by now, a Charm++ program begins with the constructor of the main chare object (Main::Main() in this example). When the main chare's constructor executes, it should set the values for all of the readonly variables in the program. Once the main chare's constructor has completed, the values contained in the readonly variables are distributed to all of the other processing elements available to the application. At this point, the values of the readonly variables should only be read since all of the processing elements have a private copy of the values (i.e. changing the value of a readonly on one processing element will not effect the value of the same readonly on another processing element).
For the purpose of this example program, readonly variables serve two purposes. The first is to place a reference to the Main chare object into the global object space so that all of the Hello chare objects can invoke entry methods on it. While all of the chare objects in the global object space can invoke entry methods on one another, they first need a reference (called a Proxy) to the target chare object. The second purpose is to contain global constants. In this case, the number of Hello chare objects, which can be specified on the command-line by the user, is contained in the numElements readonly variable.
How the Array "Hello World" Program Works
|Figure: Control Flow of the Array "Hello World" Program
There are two chare classes in this version of the "Hello World" program (aka. the Array "Hello World" program). The source code for the chare classes and their entry methods is located below.
The first chare class is the Main chare class. This class will be the main chare of the application. It serves three purposes. First, it will create the other chare objects in the application.. Second, it will initiate the computation. Third, it will call CkExit() when the calculation is finished to cause the program to exit. There will only be a single Main chare object (which is created by the Charm++ Runtime System).
The second chare class is the Hello chare class. The Main chare object will create an array of Hello chare object of length N. Each of the Hello objects will print "Hello World" one by one starting with the Hello chare object at index 0, then the Hello chare object at index 1, and so on. Once the last Hello chare object in the array has printed "Hello World," it will then send a message to the Main chare object indicating that the computation has completed (at which point, the Main chare object will call CkExit()).
There are four entry methods in this application (ignoring the migration constructors for each chare class; two constructors, Main::Main(...) and Hello::Hello(...), and two standard member functions, Hello::sayHi(...) and Main::done(...)). The constructor of the Main chare class is the entry point of the application. The constructor for the Hello chare class does nothing. Normally, this is where the local state of the chare object (member variables, etc.) would be initialized, however, the Hello chare objects do not have any local state. The Hello chare class' sayHi() entry method will print out the "Hello World" message for the current Hello chare object and then send a message to the next Hello chare object (or, in the case of the last Hello chare object in the array, call the done() entry method on the Main chare). The done() entry method for the Main chare class simply calls the CkExit() function which terminates the Charm++ program.
This example is just meant to demonstrate how a Charm++ program operates (how forward progression is made, how chare objects communicate by invoking entry methods, and so on). This program, while it could be executed on multiple processing elements with the chare objects spread out across the processing elements, does not have any inherent parallelism in it. The next "Hello World" program (the Broadcast "Hello World" Program) will actually exhibit parallelism.
Array "Hello World" Code
The source code for this example can be found here (ArrayHelloWorld.tar.gz).
The "Hello" Chare Class
The "Main" Chare Class
For the most part, the Makefile is the same as the Basic "Hello World" program. The only difference is there are now two chare classes that need to be compiled (the addtional rules and dependencies related to the Hello chare class).
CHARMDIR = [put Charm++ install directory here]
The output of the Array "Hello World" program is below. Because there are multiple chare objects in this program, the chare objects can be spread out over the various processing elements in use. For these runs, three processing elements (x86 workstations running linux) were used. The specific command used was ./charmrun +p3 ./hello 10. The charmrun program is used to launch the application on all the hosts. The +p3 indicates that three processing elements should be used (the first three hosts in the nodelist file). The ./hello indicates the Charm++ program that should be executed. Everything following the ./hello is passed to the Charm++ program as command-line arguments.
$ ./charmrun +p3 ./hello 10
$ ./charmrun +p3 ./hello 10
|Output for Run 1
|Output for Run 2
|Figure: Output of the Array "Hello World" Program
The output above may seem a bit confusing at first. The output from Run 1 is probably what one would expect, that is, until the last two lines are read. With just a quick glance, it seems that the helloArray.sayHi() entry method was called before the helloArray.sayHi() entry method. However, with closer inspection, one will notice that the "told by" portion of the lines indicates that they were indeed executed in the correct order. So, why are the lines out of order? Also notice that those chare objects were located on different processing elements. Even though helloArray.sayHi() was executed before helloArray.sayHi(), it's output still needs to be packed into a message and passed back to a single location where the output is collected and presented to the user as a single stream of output. The output of helloArray.sayHi(), while it was sent later, was received earlier. The output of Run 2 is an even more extreme version of what can happen. Even though the program is executing as expected, there is a race condition in the process of the output being serialized since the processors do not synchronize with one another for the sake of printing output.
This interleaving of tasks is a common issue in parallel programming in general and something that makes writing a parallel program more difficult than writing a serial program. By their very nature, parallel programs have multiple processing elements acting independently of one another. When two or more processing elements have to perform a set of tasks in a particular order (i.e. task A must complete on processing element X before task B can begin on processing element Y) the processing elements must synchronize. With little synchronization, the processing elements are, for the most part, acting independently from one another (i.e. the processing elements are free to do local computation without having to communicate with other processing elements). Adding more synchronization will help keep the processing elements performing tasks in a specified order, however, this also tends to introduce more overhead and will likely decrease concurrency which hurts overall performance. In this example program, the tasks that can change order are the printing of output statements, however, in many parallel applications, the various subtasks in the computation itself are usually free to execute in any order per the dependencies in the calculation.