Earth Simulator


The Earth Simulator, developed by the Japanese government's initiative "Earth Simulator Project", was a highly parallel vector supercomputer system for running global climate models to evaluate the effects of global warming and problems in solid earth geophysics. The system was developed for Japan Aerospace Exploration Agency, Japan Atomic Energy Research Institute, and Japan Marine Science and Technology Center in 1997. Construction started in October 1999, and the site officially opened on 11 March 2002. The project cost 60 billion yen.
Built by NEC, ES was based on their SX-6 architecture. It consisted of 640 nodes with eight vector processors and 16 gigabytes of computer memory at each node, for a total of 5120 processors and 10 terabytes of memory. Two nodes were installed per 1 metre × 1.4 metre × 2 metre cabinet. Each cabinet consumed 20 kW of power. The system had 700 terabytes of disk storage and 1.6 petabytes of mass storage in tape drives. It was able to run holistic simulations of global climate in both the atmosphere and the oceans down to a resolution of 10 km. Its performance on the LINPACK benchmark was 35.86 TFLOPS, which was almost five times faster than the previous fastest supercomputer, ASCI White. As of 2020, 3 Nvidia RTX 2080 Ti graphics cards can deliver comparable performance, at 14 TFLOPS per card.
ES was the fastest supercomputer in the world from 2002 to 2004. Its capacity was surpassed by IBM's Blue Gene/L prototype on 29 September 2004.
ES was replaced by the Earth Simulator 2 in March 2009. ES2 is an NEC SX-9/E system, and has a quarter as many nodes each of 12.8 times the performance, for a peak performance of 131 TFLOPS. With a delivered LINPACK performance of 122.4 TFLOPS, ES2 was the most efficient supercomputer in the world at that point. In November 2010, NEC announced that ES2 topped the Global FFT, one of the measures of the HPC Challenge Awards, with the performance number of 11.876 TFLOPS.
ES2 was replaced by the Earth Simulator 3 in March 2015. ES3 is a NEC SX-ACE system with 5120 nodes, and a performance of 1.3 PFLOPS.
ES3, from 2017 to 2018, ran alongside Gyoukou, a supercomputer with immersion cooling that can achieve up to 19 PFLOPS.

System overview

Hardware

The Earth Simulator was developed as a national project by three governmental agencies: the National Space Development Agency of Japan, the Japan Atomic Energy Research Institute, and the Japan Marine Science and Technology Center. The ES is housed in the Earth Simulator Building. The Earth Simulator 2 uses 160 nodes of NEC's SX-9E. The upgrade of the Earth Simulator has been completed in March 2015. The Earth Simulator 3 system uses 5120 nodes of NEC's SX-ACE.

System configuration

The ES is a highly parallel vector supercomputer system of the distributed-memory type, and consisted of 160 processor nodes connected by Fat-Tree Network. Each Processor nodes is a system with a shared memory, consisting of 8 vector-type arithmetic processors, a 128-GB main memory system. The peak performance of each Arithmetic processors is 102.4Gflops. The ES as a whole thus consists of 1280 arithmetic processors with 20 TB of main memory and the theoretical performance of 131Tflops.

Construction of CPU

Each CPU consists of a 4-way super-scalar unit, a vector unit, and main memory access control unit on a single LSI chip. The CPU operates at a clock frequency of 3.2 GHz. Each VU has 72 vector registers, each of which has 256 vector elements, along with 8 sets of six different types of vector pipelines: addition /shifting, multiplication, division, logical operations, masking, and load/store. The same type of vector pipelines works together by a single vector instruction and pipelines of different types can operate concurrently.

Processor Node (PN)

The processor node is composed of 8 CPU and 10 memory modules.

Interconnection Network (IN)

The RCU is directly connected to the crossbar switches and controls inter-node data communications at 64 GB/s bidirectional transfer rate for both sending and receiving data. Thus the total bandwidth of inter-node network is about 10 TB/s.

Processor Node (PN) Cabinet

The processor node is composed two nodes of one cabinet, and consists of power supply part 8 memory modules and PCI box with 8 CPU modules.

Software

Below is the description of software technologies used in the operating system, Job Scheduling and the programming environment of ES2.

Operating system

The operating system running on ES, "Earth Simulator Operating System", is a custom version of NEC's SUPER-UX used for the NEC SX supercomputers that make up ES.

Mass storage file system

If a large parallel job running on 640 PNs reads from/writes to one disk installed in a PN, each PN accesses to the disk in sequence and performance degrades terribly. Although local I/O in which each PN reads from or writes to its own disk solves the problem, it is a very hard work to manage such a large number of partial files. Then ES adopts Staging and Global File System that offers a high-speed I/O performance.

Job scheduling

ES is basically a batch-job system. Network Queuing System II is introduced to manage the batch job.
Queue configuration of the Earth Simulator.
ES has two-type queues. S batch queue is designed for single-node batch jobs and L batch queue is for multi-node batch queue.
There are two-type queues. One is L batch queue and the other is S batch queue. S batch queue is aimed at being used for a pre-run or a post-run for large-scale batch jobs, and L batch queue is for a production run. Users choose the appropriate queue for their job.
  1. The nodes allocated to a batch job are used exclusively for that batch job.
  2. The batch job is scheduled based on elapsed time instead of CPU time.
Strategy enables to estimate the job termination time and to make it easy to allocate nodes for the next batch jobs in advance. Strategy contributes to an efficient job execution. The job can use the nodes exclusively and the processes in each node can be executed simultaneously. As a result, the large-scale parallel program is able to be executed efficiently.
PNs of L-system are prohibited from access to the user disk to ensure enough disk I/O performance. herefore the files used by the batch job are copied from the user disk to the work disk before the job execution. This process is called "stage-in." It is important to hide this staging time for the job scheduling.
Main steps of the job scheduling are summarized as follows;
  1. Node Allocation
  2. Stage-in
  3. Job Escalation
  4. Job Execution
  5. Stage-out
When a new batch job is submitted, the scheduler searches available nodes. After the nodes and the estimated start time are allocated to the batch job, stage-in process starts. The job waits until the estimated start time after stage-in process is finished. If the scheduler find the earlier start time than the estimated start time, it allocates the new start time to the batch job. This process is called "Job Escalation". When the estimated start time has arrived, the scheduler executes the batch job. The scheduler terminates the batch job and starts stage-out process after the job execution is finished or the declared elapsed time is over.
To execute the batch job, the user logs into the login-server and submits the batch script to ES. And the user waits until the job execution is done. During that time, the user can see the state of the batch job using the conventional web browser or user commands. The node scheduling, the file staging and other processing are automatically processed by the system according to the batch script.

Programming environment

Programming model in ES
The ES hardware has a 3-level hierarchy of parallelism: vector processing in an AP, parallel processing with shared memory in a PN, and parallel processing among PNs via IN. To bring out high performance of ES fully, you must develop parallel programs that make the most use of such parallelism. the 3-level hierarchy of parallelism of ES can be used in two manners, which are called hybrid and flat parallelization, respectively. In the hybrid parallelization, the inter-node parallelism is expressed by HPF or MPI, and the intra-node by microtasking or OpenMP, and you must, therefore, consider the hierarchical parallelism in writing your programs. In the flat parallelization, the both inter- and intra-node parallelism can be expressed by HPF or MPI, and it is not necessary for you to consider such complicated parallelism. Generally speaking, the hybrid parallelization is superior to the flat in performance and vice versa in ease of programming. Note that the MPI libraries and the HPF runtimes are optimized to perform as well as possible both in the hybrid and flat parallelization.
Languages
Compilers for Fortran 90, C and C++ are available. All of them have an advanced capability of automatic vectorization and microtasking. Microtasking is a sort of multitasking provided for the Cray's supercomputer at the same time and is also used for intra-node parallelization on ES. Microtasking can be controlled by inserting directives into source programs or using the compiler's automatic parallelization.
Parallelization
Message Passing Interface
MPI is a message passing library based on the MPI-1 and MPI-2 standards and provides high-speed communication capability that fully exploits the features of IXS and shared memory. It can be used for both intra- and inter-node parallelization. An MPI process is assigned to an AP in the flat parallelization, or to a PN that contains microtasks or OpenMP threads in the hybrid parallelization. MPI libraries are designed and optimizedcarefully to achieve highest performance of communication on the ES architecture in both of the parallelization manner.
High Performance Fortrans
Principal users of ES are considered to be natural scientists who are not necessarily familiar with the parallel programming or rather dislike it. Accordingly, a higher-level parallel language is in great demand.
HPF/SX provides easy and efficient parallel programming on ES to supply the demand. It supports the specifications of HPF2.0, its approved extensions, HPF/JA, and some unique extensions for ES
Tools
-Integrated development environment
Integrated development environment is integration of various tools to develop the program that operates by SUPER-UX. Because PSUITE assumes that various tools can be used by GUI, and has the coordinated function between tools, it comes to be able to develop the program more efficiently than the method of developing the past the program and easily.
-Debug Support
In SUPER-UX, the following are prepared as strong debug support functions to support the program development.

Facilities

Features of the Earth Simulator building

Protection from natural disasters

The Earth Simulator Center has several special features that help to protect the computer from natural disasters or occurrences. A wire nest hangs over the building which helps to protect from lightning. The nest itself uses high-voltage shielded cables to release lightning current into the ground. A special light propagation system utilizes halogen lamps, installed outside of the shielded machine room walls, to prevent any magnetic interference from reaching the computers. The building is constructed on a seismic isolation system, composed of rubber supports, that protect the building during earthquakes.

Lightning protection system

Three basic features:
Lighting: Light propagation system inside a tube
Light source: halogen lamps of 1 kW
Illumination: 300 lx at the floor in average
The light sources installed out of the shielded machine room walls.

Seismic isolation system

11 isolators

Performance

LINPACK

The new Earth Simulator system, which began operation in March 2009, achieved sustained performance of 122.4 TFLOPS and computing efficiency of 93.38% on the LINPACK Benchmark.
The LINPACK Benchmark is a measure of a computer's performance and is used as a standard benchmark to rank computer systems in the TOP500 project.
LINPACK is a program for performing numerical linear algebra on computers.
Computing efficiency is the ratio of sustained performance to a peak computing performance. Here, it is the ratio of 122.4TFLOPS to 131.072TFLOPS.

Computational performance of WRF on Earth Simulator

WRF is a mesoscale meteorological simulation code which has been developed under the collaboration among US institutions, including NCAR and NCEP. JAMSTEC has optimized WRFV2 on the Earth Simulator renewed in 2009 with the measurement of computational performance. As a result, it was successfully demonstrated that WRFV2 can run on the ES2 with outstanding and sustained performance.
The numerical meteorological simulation was conducted by using WRF on the Earth Simulator for the earth's hemisphere with the Nature Run model condition. The model spatial resolution is 4486 by 4486 horizontally with the grid spacing of 5 km and 101 levels vertically. Mostly adiabatic conditions were applied with the time integration step of 6 seconds.
A very high performance on the Earth Simulator was achieved for high-resolution WRF. While the number of CPU cores used is only 1% as compared to the world fastest class system Jaguar at Oak Ridge National Laboratory, the sustained performance obtained on the Earth Simulator is almost 50% of that measured on the Jaguar system. The peak performance ratio on the Earth Simulator is also record-high 22.2%.