Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities
BRIAN R. BARTOLDSON*, Lawrence Livermore National Laboratory, USA
BHAVYA KAILKHURA*, Lawrence Livermore National Laboratory, USA
DAVIS BLALOCK*, MosaicML, USA
*All authors contributed equally to this research. Bhavya led the study conceptualization and taxonomy design. Brian led the written survey of the literature. Davis conducted all experiments and led the creation of a guide to achieving speedups in practice.
Although deep learning has made great progress in recent years, the exploding economic and environmental costs of training neural
networks are becoming unsustainable. To address this problem, there has been a great deal of research on algorithmically-efficient deep
learning, which seeks to reduce training costs not at the hardware or implementation level, but through changes in the semantics
of the training program. In this paper, we present a structured and comprehensive overview of the research in this field. First, we
formalize the algorithmic speedup problem, then we use fundamental building blocks of algorithmically efficient training to develop
a taxonomy. Our taxonomy highlights commonalities of seemingly disparate methods and reveals current research gaps. Next, we
present evaluation best practices to enable comprehensive, fair, and reliable comparisons of speedup techniques. To further aid research
and applications, we discuss common bottlenecks in the training pipeline (illustrated via experiments) and offer taxonomic mitigation
strategies for them. Finally, we highlight some unsolved research challenges and present promising future directions.
CCS Concepts: • Computing methodologies → Neural networks; Machine learning algorithms; Parallel algorithms; Distributed algorithms; • Theory of computation → Machine learning theory.
Additional Key Words and Phrases: deep learning, training speedup, computational efficiency, carbon emission
1 INTRODUCTION
“Science is a way of thinking much more than it is a body of knowledge.”
— Carl Sagan
In the last few years, deep learning (DL) has made significant progress on a wide range of applications, such as
protein structure prediction (AlphaFold [Jumper et al. 2021]), text-to-image synthesis (DALL-E [Ramesh et al. 2021]),
text generation (GPT-3 [Brown et al. 2020a]), etc. The key strategy behind achieving these performance gains is scaling
up DL models to extremely large sizes and training them on massive amounts of data. For most applications, the number
of trainable parameters is doubling at least every 18 to 24 months—language models are leading with a 4- to 8-month
doubling time [Sevilla and Villalobos 2021]. Notable examples of massive AI models include: Swin Transformer-V2 [Liu
et al. 2022a] with 3 billion parameters for vision applications, PaLM [Chowdhery et al. 2022] with 540 billion parameters
for language modeling, and Persia [Lian et al. 2021] with 100 trillion parameters for content recommendations.
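To make these doubling times concrete, the small back-of-the-envelope sketch below (in Python) shows how quickly such growth compounds; the numbers simply instantiate the doubling times quoted above and are illustrative, not a fit to the data of [Sevilla and Villalobos 2021].

    def growth_factor(months_elapsed: float, doubling_time_months: float) -> float:
        # Multiplicative growth after `months_elapsed`, given a fixed doubling time.
        return 2.0 ** (months_elapsed / doubling_time_months)

    # Projected growth in trainable-parameter counts over four years (48 months):
    print(growth_factor(48, 24))  # ~4x for a 24-month doubling time
    print(growth_factor(48, 18))  # ~6.3x for an 18-month doubling time
    print(growth_factor(48, 4))   # 4096x for the 4-month doubling time of language models

Even under the slowest of these trends, model sizes (and hence training compute) grow by multiples within a few years, which is why the cost extrapolations discussed next become alarming so quickly.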
Although scaling up DL models is enabling unprecedented advances, training large models has become extremely
expensive. For example, GPT-3 training was estimated to cost $1.65 million with Google v3 TPUs [Lohn and Musser
2022], and inefficient/naive development of a transformer model would emit carbon dioxide (CO2) equivalent to the lifetime carbon footprint of five cars [Strubell et al. 2019]. Concerningly, DL has still not reached the performance level
required by many of its applications: e.g., human-level performance is required for deploying fully autonomous vehicles
in the real world but hasn’t yet been reached. Growing model and data sizes to reach such required performances will
make current training strategies unsustainable on financial, environmental, and other fronts. Indeed, extrapolating
current trends, the training cost of the largest AI model in 2026 would be more than the total U.S. GDP [Lohn and
Musser 2022]. Moreover, the heavy compute reliance of DL raises concerns around the marginalization of users with limited financial resources, such as academics, students, and researchers (particularly those from emerging economies) [Ahmed and Wahed 2020]. We discuss these critical issues in more detail in Appendix A.
Given the unsustainable growth of its computational burden, progress with DL demands more compute-efficient
training methods. A natural direction is to eliminate algorithmic inefficiencies in the learning process to reduce the time,
cost, energy, and carbon footprint of DL training. Such Algorithmically-Efficient Deep Learning methods could change
the training process in a variety of ways that include: altering the data or the order in which samples are presented to the
model; tweaking the structure of the model; and changing the optimization algorithm. These algorithmic improvements
are critical to moving towards estimated lower bounds on the required computational burden of effective DL training,
which are greatly exceeded by the burden induced by current practices [Thompson et al. 2020]. Further, these algorithmic
gains compound with software and hardware acceleration techniques [Hernandez and Brown 2020]. Thus, we believe
algorithmically-efficient DL presents an enormous opportunity to increase the benefits of DL and reduce its costs.
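To make the scope of such methods concrete, the sketch below shows a bare-bones PyTorch training loop with comments marking the three places where algorithmically-efficient methods typically intervene. This is a generic illustration rather than a method from the surveyed literature, and the commented-out modify_model and modify_optimization hooks are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def train(model, dataset, num_steps, batch_size=256, lr=0.1):
        # (1) Data: speedup methods may subsample, reorder, or augment what the model sees.
        loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
        # model = modify_model(model)  # (2) Model: e.g., sparsify or restructure the architecture (hypothetical hook).
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        # optimizer = modify_optimization(optimizer)  # (3) Optimization: e.g., change the update rule or schedule (hypothetical hook).

        step = 0
        while step < num_steps:
            for inputs, targets in loader:
                loss = F.cross_entropy(model(inputs), targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                step += 1
                if step >= num_steps:
                    break
        return model

Changing any of these three components alters the semantics of the training program, which is what distinguishes algorithmic efficiency from purely hardware- or implementation-level acceleration.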
While this view is supported by the recent surge in algorithmic efficiency papers, these papers also suggest that
research and application of algorithmic efficiency methods are hindered by fragmentation. Disparate metrics are used to quantify efficiency, which produces inconsistent rankings of speedup methods. Evaluations are performed in narrow or poorly characterized environments, which results in incorrect or overly broad conclusions. Algorithmic efficiency
methods are discussed in the absence of a taxonomy that reflects their breadth and relationships, which makes it hard
to understand how to traverse the speedup landscape to combine different methods and develop new ones.
Accordingly, our central contributions are an organization of the algorithmic-efficiency literature (via a taxonomy
and survey inspired by [Von Rueden et al. 2019]) and a technical characterization of the practical issues affecting
the reporting and achievement of speedups (via guides for evaluation and practice). Throughout, our discussion
emphasizes the critical intersection of these two thrusts: e.g., whether an algorithmic efficiency method leads to an actual speedup depends on the interaction between the method (understandable via our taxonomy) and the compute
platform (understandable via our practitioner’s guide). Our contributions are summarized as follows:
• Formalizing Speedup: We review DNN efficiency metrics, then formalize the algorithmic speedup problem.
• Taxonomy and Survey: We classify over 200 papers via 5 speedup actions (the 5Rs) that apply to 3 training-
pipeline components (see Tables 1 and 3). The taxonomy facilitates selection of methods for practitioners,
digestion of the literature for readers, and identification of opportunities for researchers.
• Best Evaluation Practices: We identify evaluation pitfalls common in the literature and accordingly present
best evaluation practices to enable comprehensive, fair, and reliable comparisons of various speedup techniques.
• Practitioner’s Guide: We discuss compute-platform bottlenecks that affect speedup-method effectiveness. We
suggest appropriate methods and mitigations based on the location of the bottlenecks in the training pipeline.
With these contributions, we hope to improve the research and application of algorithmic efficiency, a critical
piece of the compute-efficient deep learning needed to overcome the economic, environmental, and inclusion-related challenges discussed above.