Research Computing at Harvard
by MICHAEL KIELSTRA
Going by his hair and stubble, Scott Yockel could be the lead singer in an indie rock band. His bookshelf, on which titles like Modern Quantum Chemistry jostle for space with model planes, a mounted autographed baseball, and photos of his family, would place him instead as the resident good-ol’-boy of a world-class chemistry department. This is slightly more accurate: his Ph.D. is indeed from the University of North Texas, and his early research work was in chemistry. These days, though, he lives in Massachusetts (a state which he describes, gratefully, as “having all four seasons”) and leads the Harvard Faculty of Arts and Sciences Research Computing team.
FASRC, as it is commonly known, was founded in 2007 under the aegis of James Cuff. Cuff very quickly set about embedding research computing within Harvard’s culture; today, Yockel regularly reaps the benefits of what he describes as a very stable and reliable budget. Cuff’s masterpiece, though, was, without question, Odyssey, a supercomputer which grew to span over 82 000 computer cores. For comparison, your laptop probably has four. Housed mostly in a data center built as a partnership between Harvard, BU, Northeastern, MIT, and UMass, Odyssey at its height ran 22 million compute jobs every year, ranging from statistical analyses that finish in half an hour to enormous calculations for artificial intelligence or bioinformatics that might take weeks.
This year, Odyssey has been replaced with Cannon, a new system with a very similar architecture. Yockel is very excited about the new liquid-cooling system, which allows for the use of much more powerful processors without overheating. Odyssey’s temperature was managed by cold air, but the newest computing technology produces enough heat that Yockel likens cooling it with air to trying to manage “forty hairdryers in a refrigerator”. By using liquid coolant in copper tubing, FASRC can move the heat away from the processors more effectively, allowing them to use the newest and fastest equipment. The Intel Xeon cores that make up much of Cannon have computing power similar to what you’d find in the very highest tier of gaming computer, each able to perform almost four billion calculations per second at peak capacity. Cannon has 100 000 of them, plus two and a half million “CUDA” cores, designed to run more slowly but make up for it in volume.
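For a rough sense of scale, the two figures above can simply be multiplied together. The Python sketch below does only that: it treats “almost four billion” as an upper bound of four billion, ignores the CUDA cores entirely, and uses a made-up function name, so the result is a back-of-the-envelope ceiling rather than a measured benchmark.

```python
# Figures quoted in the article; the per-core rate is rounded up from
# "almost four billion", so the result is an upper bound, not a measurement.
CPU_CORES = 100_000
PEAK_CALCS_PER_CORE = 4_000_000_000  # calculations per second, per core


def aggregate_peak(cores, calcs_per_core):
    """Naive ceiling on CPU throughput: every core at peak, all at once."""
    return cores * calcs_per_core


total = aggregate_peak(CPU_CORES, PEAK_CALCS_PER_CORE)
print(f"At most {total:,} calculations per second (CUDA cores excluded)")
```

Real workloads rarely keep every core at peak simultaneously, so the practical figure would be lower.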
The Cannon cluster is still in the final stages of testing, and the current state of things fails in some ways to impress the wider research computing world. The unimaginatively named TOP500, a list of the 500 most powerful computers worldwide, put Odyssey exactly nowhere. Yockel doesn’t care. “We’re not trying to be that,” he says. The liquid cooling might well put Cannon on the list, but that’s not the point. In contrast to the Texas Advanced Computing Center’s Stampede2, number 17 on the TOP500 at the time of writing and famous for a focus on using enormous power to solve enormous problems, FASRC strives to solve problems of all shapes and sizes, from the smallest to the largest. By expanding slowly year by year rather than applying for huge grants all at once, and by focusing on services as much as on raw power, FASRC makes gains in accessibility to all researchers at the cost of losing out on a reputation for inhabiting the very top tiers of high-performance computing.
This sort of resource would mostly be wasted were it not for FASRC’s second role as an advising body. Every Wednesday at noon, the research facilitation team gathers in a conference room in their headquarters at 38 Oxford Street to hold office hours. No appointment or affiliation with any given Harvard department is necessary, although the front door does have a lock that requires a Harvard ID. Some people take this opportunity to ask very complex questions about the minutiae of their projects; others are new to the research computing game and need to be walked through running their first program. All are welcome.
The leader of this team, Raminder Singh, describes a typical day at his job as “very user-problem-solving oriented”. Beyond office hours, academics can reach him via any number of channels. As I arrived to interview him, he apologized and said that he had double-booked a consultation with “Doug”. (He did not tell me who exactly Doug was, and I did not ask.) Doug’s problems, in the end, took up only a few minutes, but the event handily underscored how busy Raminder was.
If you can’t make it to 38 Oxford Street, or to the shorter office hours held weekly at the T. H. Chan School of Public Health, you can always email. Raminder’s team receives over six thousand support requests a year. Most come from humans, some come from automatic systems, but all need to be read, answered quickly if possible, and otherwise directed to the most knowledgeable team member. This routing is the job of the “RTCop”, or “research technology cop”, a rotating role that fell to Raminder on the day I interviewed him. Another day, he might be running a training seminar or writing new documentation: when the FASRC team is handed a problem that they imagine might be fairly common, they put their solution on the website to make it easy to find.
(Raminder’s wife, he says, has been encouraging him to buy a police officer’s hat from a costume store and bring it in for the RTCop to wear. He has so far failed or refused to do this, despite finding the title “pretty cool”.)
Whichever channel you use to reach them, you will find that Yockel and Raminder share a philosophy less centered around teaching the intricacies of high-performance computing in great detail – access to Odyssey and now Cannon is not conditional on completing any training, and the recommended preparation can be done online in half an hour – and more focused on helping scholars understand the bigger picture. Raminder used the word “workflow” a number of times, and I have more than once come in to office hours with a technical problem only to have a textbook recommended to me instead of a software patch.
Part of this focus does arise as a response to the risks of letting multiple researchers use the system at once. Many resources are shared in a complex way. For example, memory is connected to individual processors, so if someone wants to use only four processors but requires the memory that would normally be associated with having eight, they actually tie up eight processors. Users, who often have more experience working on their own laptops or on high-powered computers that they wholly own and do not share, sometimes run into difficulties due to the often unintuitive ways in which multi-user clusters operate. In this sort of scenario, a broader teaching strategy, more conceptual and less immediately practical, can pay huge dividends later on.
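To make the four-versus-eight example concrete, here is a minimal Python sketch. It assumes, purely for illustration, that every processor comes with a fixed 4 GB share of memory; that figure and the function name effective_cores are hypothetical, and nothing here is meant to describe FASRC’s actual configuration or scheduler.

```python
import math


def effective_cores(requested_cores, requested_mem_gb, mem_per_core_gb=4):
    """Processors a job ties up when memory is attached to processors.

    mem_per_core_gb is a hypothetical value chosen for illustration.
    """
    # Processors needed just to cover the memory request, rounded up.
    cores_needed_for_memory = math.ceil(requested_mem_gb / mem_per_core_gb)
    # The job occupies whichever is larger: the processors it asked for,
    # or the processors whose memory it is using.
    return max(requested_cores, cores_needed_for_memory)


# Four processors, but the memory that normally comes with eight.
print(effective_cores(4, 32))
# Four processors with only their own share of memory.
print(effective_cores(4, 16))
```

Under these assumptions the first job occupies eight processors and the second only four, even though both asked for the same number.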
Framing their philosophy as wholly motivated by the need to safeguard a complex and expensive investment, however, would be a great disservice to the FASRC team. Again and again, the people I interviewed stressed the importance of trying to understand what it was that users really wanted to do. Raminder described how his team had learned to ask further questions before giving a diagnosis or advice. Obviously, Odyssey was and Cannon is crucial to the work of FASRC, but Yockel would even argue that the technology itself is not the most important thing. “Most of the time,” he says, “researchers ask questions that fit in the technology box that they have. And if we can show them that the box can be much bigger, they can ask much grander questions. They can strive to solve much more challenging problems. And that’s the value that we provide.” Anyone can buy a supercomputer, but good teaching, FASRC believes, is priceless.
Michael Kielstra ’22 ([email protected]) may not be a computer, but thinks he’s pretty super in other ways.