Bulk synchronous parallel


The bulk synchronous parallel abstract computer is a bridging model for designing parallel algorithms. It serves a purpose similar to the parallel random access machine model. BSP differs from PRAM by not taking communication and synchronization for granted. An important part of analyzing a BSP algorithm rests on quantifying the synchronization and communication needed.

History

The BSP model was developed by Leslie Valiant of Harvard University during the 1980s. The definitive article was published in 1990.
Between 1990 and 1992, Leslie Valiant and Bill McColl of Oxford University worked on ideas for a distributed memory BSP programming model, in Princeton and at Harvard. Between 1992 and 1997, McColl led a large research team at Oxford that developed various BSP programming libraries, languages and tools, and also numerous massively parallel BSP algorithms. With interest and momentum growing, McColl then led a group from Oxford, Harvard, Florida, Princeton, Bell Labs, Columbia and Utrecht that developed and published the BSPlib Standard for BSP programming in 1996.
Valiant developed an extension to the BSP model in the 2000s, leading to the publication of the Multi-BSP model in 2011.
In 2017, McColl developed a major new extension of the BSP model that provides fault tolerance and tail tolerance for large-scale parallel computations in AI, Analytics and HPC.

The model

A BSP computer consists of
  1. components capable of processing and/or local memory transactions,
  2. a network that routes messages between pairs of such components, and
  3. a hardware facility that allows for the synchronisation of all or a subset of components.
This is commonly interpreted as a set of processors which may follow different threads of computation, with each processor equipped with fast local memory and interconnected by a communication network.
A BSP algorithm relies heavily on the third feature; a computation proceeds in a series of global supersteps, which consists of three components:
The computation and communication actions do not have to be ordered in time. Communication typically takes the form of the one-sided put and get Direct Remote Memory Access calls, rather than paired two-sided send and receive message passing calls.
The barrier synchronization concludes the superstep: it ensures that all one-sided communications are properly concluded. Systems based on two-sided communication include this synchronisation cost implicitly for every message sent. The method for barrier synchronisation relies on the hardware facility of the BSP computer. In Valiant's original paper, this facility periodically checks if the end of the current superstep is reached globally. The period of this check is denoted by.
The figure below shows this in a diagrammatic form. The processes are not regarded as having a particular linear order, and may be mapped to processors in any way.
The BSP model is also well-suited to enable automatic memory management for distributed-memory computing through overdecomposition of the problem and oversubscription of the processors. The computation is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.

Communication

In many parallel programming systems, communications are considered at the level of individual actions: sending and receiving a message, memory to memory transfer, etc. This is difficult to work with since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.
The BSP model considers communication actions en masse. This has the effect that an upper bound on the time taken to communicate a set of data can be given. BSP considers all communication actions of a superstep as one unit, and assumes all individual messages sent as part of this unit have a fixed size.
The maximum number of incoming or outgoing messages for a superstep is denoted by. The ability of a communication network to deliver data is captured by a parameter, defined such that it takes time for a processor to deliver messages of size 1.
A message of length obviously takes longer to send than a message of size 1. However, the BSP model does not make a distinction between a message length of or messages of length 1. In either case the cost is said to be.
The parameter is dependent on the following factors:
In practice, is determined empirically for each parallel computer. Note that is not the normalised single-word delivery time, but the single-word delivery time under continuous traffic conditions.

Barriers

The one-sided communication of the BSP model requires barrier synchronization.
Barriers are potentially costly, but avoid the possibility of deadlock or livelock,
since barriers cannot create circular data dependencies. Tools to detect them and deal with them are unnecessary. Barriers also permit novel forms of fault tolerance.
The cost of barrier synchronization is influenced by a couple of issues:
  1. The cost imposed by the variation in the completion time of the participating concurrent computations. Take the example where all but one of the processes have completed their work for this superstep, and are waiting for the last process, which still has a lot of work to complete. The best that an implementation can do is ensure that each process works on roughly the same problem size.
  2. The cost of reaching a globally consistent state in all of the processors. This depends on the communication network, but also on whether there is special-purpose hardware available for synchronizing, and on the way in which interrupts are handled by processors.
The cost of a barrier synchronization is denoted by. Note that if the synchronisation mechanism of the BSP computer is as suggested by Valiant.
In practice, a value of is determined empirically.
On large computers barriers are expensive, and this is increasingly so on large scales. There is a large body of literature on removing synchronization points from existing algorithms, both in the context of BSP computing and beyond. For example, many algorithms allow for the local detection of the global end of a superstep simply by comparing local information to the number of messages already received. This drives the cost of a global synchronisation, compared to the minimally required latency of communication, to zero. Yet also this minimal latency is expected to increase further for future supercomputer architectures and network interconnects; the BSP model, along with other models for parallel computation, require adaptation to cope with this trend. Multi-BSP is one BSP-based solution.

The cost of a BSP algorithm

The cost of a superstep is determined as the sum of three terms; the cost of the longest running local computation, the cost of global communication between the processors, and the cost of the barrier synchronisation at the end of the superstep. The cost of one superstep
for processors:
where is the cost for the local computation in process, and is the number of messages sent or received by process. Note that homogeneous processors are assumed here. It is more common for the expression to be written as where and are maxima. The cost of the algorithm then, is the sum of the costs of each superstep.
where is the number of supersteps.
,, and are usually modelled as functions, that vary with problem size. These three characteristics of a BSP algorithm are usually described in terms of asymptotic notation, e.g..

Extensions and uses

Interest in BSP has soared in recent years, with Google adopting it as a major technology for graph analytics at massive scale via technologies like Pregel and MapReduce. Also, with the next generation of Hadoop decoupling the MapReduce model from the rest of the Hadoop infrastructure, there are now active open source projects to add explicit BSP programming, as well as other high performance parallel programming models, on top of Hadoop. Examples are Apache Hama and Apache Giraph.
BSP has been extended by many authors to address concerns about BSP's unsuitability for modelling specific architectures or computational paradigms. One example of this is the decomposable BSP model. The model has also been used in the creation of a number of new programming languages and interfaces, such as Bulk Synchronous Parallel ML, BSPLib, Apache Hama, and Pregel.
Notable implementations of the BSPLib standard are the Paderborn University BSP library and the Oxford BSP Toolset by Jonathan Hill. Modern implementations include BSPonMPI, and MulticoreBSP. MulticoreBSP for C is especially notable for its capability of starting nested BSP runs, thus allowing for explicit Multi-BSP programming.