Repurposing Supercomputers - What happens on "The Other Side?"
By Andree Jacobson, CIO, New Mexico Consortium & Project Manager, PRObE
Government, industry, and research facilities keep building larger and faster supercomputers, a natural effect of trying to keep up with the ever-growing demand for compute cycles to perform the critical scientific calculations required to ensure the safety of our nation or the profitability of a company. The technology competition is essentially a modern version of the space race that occurred during the Cold War, as the country with the fastest computer will perform the most advanced science. As these massive computer systems are built and put into production, the “Top 500” list reveals the current state of the race at a supercomputing conference every six months. For the last three years, China’s “Tianhe-2” computer system, with a peak of 54.9 PetaFLOPS (quadrillion floating point operations per second), has been in the lead, followed by the U.S. Department of Energy’s Oak Ridge National Laboratory system called “Titan” at roughly half the performance of its Chinese counterpart. Running these systems requires several megawatts of power and costs millions of dollars a year.
In industry, massive corporations like Google, Amazon, and Facebook build their own data centers around the world to supply enough compute power to meet the needs of their hundreds of millions of users. Each of these types of data centers can host tens of thousands of individual computers, often referred to as nodes. A node is at least as powerful as your average office or home computer. Many have co-processors (like GPUs or Xeon Phis) to speed up calculations; some have local disks, and most have fast networking capabilities. The end result is a massive amount of hardware that has one thing in common: at some point, inevitably, each and every node needs to be discarded.
Live or Let Die?
Andree Jacobson, CIO for the New Mexico Consortium (NMC), focuses on the fate of these decommissioned supercomputers. He is the project manager for PRObE (the Parallel Reconfigurable Observational Environment), an NSF-funded compute facility hosted by the NMC in Los Alamos, NM. The NMC is a non-profit organization whose purpose is to improve the research environment in New Mexico by facilitating collaborations between Los Alamos National Laboratory (LANL) and the three research universities in the state. PRObE is a pilot project designed to determine the feasibility of using repurposed supercomputer hardware for research. Gary Grider, Division Leader for High Performance Computing at LANL, came up with the idea for PRObE in 2006 after concluding that many of the laboratory’s computer systems were being decommissioned and subsequently destroyed despite still having quite a bit of useful life left in them.
Many facilities deal with their decommissioned systems by putting them on trucks and driving them to a secure facility, where the components are fed into an industrial metal shredder, chopped into tiny pieces, and melted down to recover precious metals. But does something that might have cost $30M just three or four years earlier really possess only scrap value today? Neither Grider nor Jacobson thought so, and they co-wrote the NSF proposal together with collaborators from Carnegie Mellon University and the University of Utah. In October 2010 the NMC was awarded $10M from the NSF to build PRObE.
From a pure profitability standpoint, the answer to the scrap value question is probably yes. Based on historical trends, upgrading a system that is four years into production usually yields about double the performance in a tenth of the floor footprint and at one half to two thirds of the power consumption. As we will see, the operational expenses (OPEX) of running an outdated computer system quickly exceed the combined capital expense (CAPEX) and reduced OPEX of a new, more efficient system.
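The trade-off described above can be made concrete with a rough break-even calculation. The sketch below is illustrative only and rests on several assumptions that are not in the article: electricity is treated as the only OPEX, the old system's CAPEX is treated as already spent, the replacement is amortized over four years, and the old system is a hypothetical 1 MW machine billed at roughly $1M per MW-year. The function name `break_even_capex` is made up for this example.

```python
def break_even_capex(old_annual_opex, perf_gain, power_fraction, lifetime_years):
    """Largest CAPEX at which a replacement matches the old system's
    yearly cost per unit of performance.

    Solves for capex in:
        (capex / lifetime_years + power_fraction * old_annual_opex) / perf_gain
            == old_annual_opex
    """
    new_annual_opex = power_fraction * old_annual_opex
    return (perf_gain * old_annual_opex - new_annual_opex) * lifetime_years


# Hypothetical old system: 1 MW at roughly $1M per MW-year (electricity only).
OLD_ANNUAL_OPEX = 1_000_000

# The article's range: double the performance at 1/2 to 2/3 of the power.
for fraction in (1 / 2, 2 / 3):
    capex = break_even_capex(
        OLD_ANNUAL_OPEX, perf_gain=2.0, power_fraction=fraction, lifetime_years=4
    )
    print(f"power fraction {fraction:.2f}: break-even CAPEX ${capex:,.0f}")
```

Under these assumptions the replacement breaks even when its price is about five to six times the old system's annual electricity bill. A real comparison would also include maintenance, staffing, and floor space, which this sketch leaves out.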
Many universities that begin deploying cluster-style research computing resort to using discarded desktop computers. However, these cobbled-together systems are simply not adequate for researchers who require very large computer systems to perform their work. To a researcher at a university, then, a decommissioned supercomputer might be worth significantly more than its scrap value, because these older systems can provide more plentiful and more powerful computational capabilities than would otherwise be available.
A Different Approach
PRObE is one answer to the problem of getting these decommissioned systems into the hands of people who can use them, but setting up and maintaining large clusters containing more than 1,000 nodes requires overcoming several obstacles:
1) Sheer volume: Decommissioning, moving, inspecting, troubleshooting, and bringing thousands of old computers back online takes significant time and effort. Also, unlike when a system is slated for destruction, care must be taken throughout the decommissioning process so that parts are not damaged.
2) Space: A computer system with 1,000 or more nodes and appropriate interconnect networks will likely require about 40 to 50 full racks of computer equipment. PRObE has capacity for 1 MW of compute power, about 280 tons of cooling, and 3,000 sq ft of server room space to house these large machines. This is sufficient for two large clusters and a few smaller ones.
3) Electricity cost: 1 MW costs around $1M per year in New Mexico. This is a required OPEX and, in PRObE’s case, is covered by NSF funding. This is not a typical setup, but since there is no procurement cost for the computers, the electricity is covered instead. This allows PRObE to provide compute services to the community at no cost to the individual users.
4) Lack of spare parts: Vendors do not necessarily keep old spare parts around once a product has reached end-of-life, and sometimes the vendor of an old system has vanished altogether. In such cases, the only outlet is the gray market, such as eBay and other vendors specializing in reused computer equipment. In PRObE’s case, LANL’s systems are usually larger than what PRObE can house, so a sufficient number of spares (typically about 20 percent) can accompany each system. Machines can also be cannibalized to keep the system running once the spares run out.
5) Staff to operate: PRObE is successful primarily because of the workforce we use to build and maintain the clusters. In particular, our staff is creative enough to assemble and maintain the hardware even with limited funds. Instead of hiring consultants or full-time staff members to perform this work, PRObE relies on local high-school and early college talent, which is also a wonderful way to train young people. Over the past 6 years we have employed close to 40 high school students who spend a couple of hours with us each week. During summers and winter break, these students often work full time. To PRObE this is an affordable solution, and the students get hands-on experience building large computer systems.
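As a quick sanity check on the facility figures in items 2 and 3 above, the short sketch below converts them into more familiar units. It assumes a constant 1 MW load for every hour of a non-leap year, which a real facility would not sustain, and uses the standard conversion of one ton of refrigeration to about 3.517 kW. The function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours, ignoring leap years
KW_PER_TON_REFRIGERATION = 3.517   # 1 ton = 12,000 BTU/h, about 3.517 kW


def implied_rate_per_kwh(annual_cost_usd, load_kw):
    """Electricity rate implied by a yearly bill at a constant load (assumption)."""
    return annual_cost_usd / (load_kw * HOURS_PER_YEAR)


def cooling_capacity_kw(tons):
    """Heat removal capacity in kW for a given number of refrigeration tons."""
    return tons * KW_PER_TON_REFRIGERATION


# Figures quoted in the list: about $1M per year for 1 MW, and 280 tons of cooling.
rate = implied_rate_per_kwh(1_000_000, 1_000)
cooling = cooling_capacity_kw(280)

print(f"implied electricity rate: ${rate:.3f} per kWh")
print(f"cooling capacity: {cooling:.0f} kW")
```

Under those assumptions the quoted bill works out to a little over 11 cents per kWh, and 280 tons of cooling corresponds to just under 1 MW of heat removal, which is consistent with the stated 1 MW of compute power.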
The Future
PRObE is fortunate that the NSF sees the value in what we do, the training that we provide, and the scientific value these older systems can contribute to the academic and scientific communities. Without NSF support, PRObE would not be possible. While the operation of PRObE requires both skill and creativity, the work is rewarding and the scientific benefits are real, as exemplified by the many citations PRObE regularly receives in the scientific literature.