Watt Matters in AI: Software-based views on energy efficiency
In anticipation of the "Watt Matters in AI" conference, IO+ describes the current situation, the societal needs, and the scientific progress
Published on April 13, 2025

Bart, co-founder of Media52 and Professor of Journalism, oversees IO+, events, and Laio. A journalist at heart, he keeps writing as many stories as possible.
Researchers are addressing AI's energy efficiency challenges on two technological fronts: hardware innovations (low-power, brain-inspired computing devices and accelerators) and software innovations (algorithms and techniques that reduce the computational cost of AI models). In addition, considerable attention is being paid to behavioral change (e.g., ethical frameworks, corporate responsibility, user involvement, policy development, and public awareness). Many of these advances have been published in leading journals and conferences and are available as open-access articles.
This article focuses on opportunities on the software side, providing an overview of recent open-access research (2024-2025) on strategies to improve the energy efficiency of AI software.

Watt Matters in AI
Watt Matters in AI is an initiative of Mission 10-X, in collaboration with the University of Groningen, University of Twente, Eindhoven University of Technology, Radboud University, and the Convention Bureau Brainport Eindhoven. IO+ is responsible for marketing, communication, and conference organization.
More information is available on the conference website.
Model compression: pruning and quantization
On the algorithmic side, model compression has been a crucial strategy for reducing the energy and compute footprint of AI models. Two widely used techniques are pruning (removing redundant weights/connections) and quantization (reducing the numerical precision of model parameters). A comprehensive survey of neural network pruning published in mid-2024 (open access in Cognitive Computation) highlighted that large CNNs often contain a vast number of parameters that can be removed with minimal effect on accuracy. Pruning these unnecessary weights yields “lighter and energy-efficient neural networks”. The survey summarized recent breakthroughs in pruning methods – from unstructured weight pruning to structured channel pruning and neural architecture search for sparsity – and discussed how these methods reduce inference costs and even carbon footprint. It also noted the need for better metrics to guide pruning (beyond simple weight magnitude) and for techniques to handle modern architectures like Transformers. In short, pruning has evolved into an effective tool for obtaining “green AI models” by eliminating computational waste.
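To make the idea concrete, here is a minimal sketch of magnitude-based unstructured pruning in PyTorch. It illustrates the general technique rather than any specific method from the survey; the toy two-layer model and the 80% pruning ratio are arbitrary choices, and the zeroed weights only translate into energy savings when paired with sparse-aware kernels or hardware.

```python
# Minimal sketch of magnitude-based unstructured pruning (illustrative only;
# the surveyed methods are more sophisticated, e.g. structured channel pruning).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 80% of weights with the smallest magnitude in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # make the pruning permanent

weights = [p for p in model.parameters() if p.dim() > 1]
sparsity = sum((p == 0).sum().item() for p in weights) / sum(p.numel() for p in weights)
print(f"weight sparsity: {sparsity:.1%}")
```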
Quantization is another area that has seen significant progress. By using lower-bit representations for neural network weights and activations (e.g., 8-bit, 4-bit, or even binary), one can drastically cut memory usage and energy per operation. The challenge is to maintain accuracy at such reduced precision. In late 2024, researchers introduced 4-bit quantization methods for large language models (LLMs) that achieve near-original accuracy. One approach, QRazor (Lee et al. 2024), uses a two-stage “significant data razoring” scheme: it first constrains weights and activations to 8–16-bit scales for stability, then compresses them to 4 bits by keeping only the most significant bits. This method preserved accuracy on transformer models “better or comparable to state-of-the-art 4-bit methods” while enabling hardware optimizations. Notably, the authors developed a custom integer arithmetic unit that operates directly on the 4-bit compressed data, achieving roughly 58% lower power and area than a standard 8-bit unit. Such innovations demonstrate that ultra-low precision (≤4 bits) is becoming practical for deep networks, which could translate into significant energy savings in both data centers and edge devices.
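To illustrate where the memory savings come from, here is a toy symmetric 4-bit quantizer in Python/NumPy. It is a generic sketch, not the QRazor scheme described above (which adds a second “razoring” stage and custom hardware support); the scaling rule and tensor size are arbitrary.

```python
# Toy symmetric 4-bit weight quantization: each value is rounded to one of 16
# levels in [-8, 7] and scaled back at inference time.
import numpy as np

def quantize_4bit(w: np.ndarray):
    # Map the largest magnitude to the edge of the signed 4-bit range.
    scale = np.abs(w).max() / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # packed into 4 bits on real hardware
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)   # stand-in for a weight matrix
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
print("mean absolute quantization error:", float(np.abs(w - w_hat).mean()))
```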
Beyond post-training compression, some works address efficiency during training as well. For instance, a recent NeurIPS paper proposed SDP4Bit, a strategy that quantizes both gradients and weights to 4 bits for distributed training while retaining model quality. Techniques like quantization-aware training and distillation also continue to evolve, often allowing the training of smaller models that perform on par with large ones at a fraction of the computational cost.
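As an example of the distillation idea mentioned above, the standard formulation trains a small student to match a large teacher's softened output distribution alongside the usual cross-entropy loss. The sketch below is a textbook version in PyTorch, not taken from any of the cited papers; the temperature and weighting values are illustrative.

```python
# Knowledge distillation loss: soft targets from the teacher + hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between temperature-softened distributions (scaled by T^2,
    # as is conventional, so gradients keep a comparable magnitude).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Tiny smoke test with random logits and labels.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```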
Overall, model compression over the last year has enabled the development of smaller, sparser, and lower-precision models that run faster and consume less power. Pruned and quantized models not only use less energy at inference time, but they can also reduce memory bottlenecks, enabling deployment of advanced AI on energy-constrained hardware. These software optimizations complement hardware advances – for example, a quantized model running on a neuromorphic or compute-in-memory (CIM) accelerator compounds the efficiency gains.
Approximate computing techniques
Approximate computing is a broad paradigm that accepts slight reductions in result accuracy in exchange for disproportionate gains in efficiency. This concept is highly relevant to AI, where exact arithmetic is often not necessary for good model performance. In an approximate computing approach, one might use lower-precision operations, skip or early-terminate calculations with low impact, or employ simplified algorithms. According to a 2025 survey in ACM Computing Surveys, approximate computing has emerged as a “promising solution” for energy-efficient AI, allowing designers to tune the quality of results to improve energy usage and performance. Significant research has explored approximation at various system layers – from circuits (e.g., approximate adders and multipliers) up to algorithmic techniques (e.g., dropping layers or using proxy models).
In practice, approximate computing for AI often translates into energy-aware throttling of computations. For example, a system might dynamically select a lighter or heavier model based on the available energy or required accuracy, as seen in an adaptive framework for edge AI. Another example is early-exit networks that allow inference to stop once sufficient confidence is achieved, thereby skipping later layers when possible. These strategies ensure that the computation (and energy use) scales with the task’s complexity in real time.
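A minimal sketch of the early-exit idea is shown below (PyTorch). Each block has its own small classifier head, and inference stops at the first head whose confidence clears a threshold; the layer sizes, the 0.9 threshold, and the max-softmax confidence rule are illustrative assumptions, and the sketch processes one input at a time.

```python
# Early-exit network: stop at the first internal classifier that is confident
# enough, skipping the remaining (and most expensive) layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, dim=128, num_classes=10, threshold=0.9, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
        )
        self.exits = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(depth)])
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                      # x: (1, dim) -- one input at a time
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = F.softmax(exit_head(x), dim=-1)
            if probs.max() >= self.threshold:  # confident enough: exit early
                return probs
        return probs                           # otherwise use the final exit

net = EarlyExitNet()
print(net(torch.randn(1, 128)).shape)
```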
Empirical results have shown that careful approximation can yield significant energy savings with minimal accuracy loss. In one study, a joint sensor-memory-compute approximation method for an image recognition system cut energy consumption by roughly 1.6×–5× with under a 1% drop in accuracy. Even on deep object detection models, synergistic approximations across the pipeline yielded up to a 5.2× energy reduction with a similarly negligible impact on quality. The key is to apply approximations in a controlled way – for instance, using reduced precision where the network is less sensitive, or skipping computations that contribute little to the final result. As tools such as the survey above and new frameworks categorize the “knobs” for approximation (at the data, algorithm, or circuit level), developers can more easily incorporate these techniques to build gracefully degrading AI systems that consume far less power.
One emerging area is the integration of approximate computing with federated and edge learning, where devices have limited energy budgets. A 2024 survey in Energies explored energy-efficient design for federated learning, emphasizing approximation and optimization to prolong battery life on distributed clients. Techniques include aggressive quantization of communication and local model updates, as well as lossy compression of the information exchanged – all forms of approximation that tolerate a small amount of error in exchange for significant energy gains.
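To picture the communication-side savings, consider a client that quantizes its model update to 8 bits before sending it to the server. The NumPy sketch below is a hypothetical illustration (simple symmetric quantization plus plain averaging), not one of the specific designs evaluated in the cited survey.

```python
# Lossy compression of federated-learning updates: clients send 8-bit deltas
# instead of full-precision floats, cutting communication volume roughly 4x.
import numpy as np

def compress_update(delta: np.ndarray):
    scale = np.abs(delta).max() / 127.0 + 1e-12
    q = np.round(delta / scale).astype(np.int8)
    return q, scale

def decompress_update(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Server side: decompress and average the clients' (simulated) updates.
client_deltas = [np.random.randn(1000).astype(np.float32) * 0.01 for _ in range(5)]
received = [decompress_update(*compress_update(d)) for d in client_deltas]
global_update = np.mean(received, axis=0)

sent = compress_update(client_deltas[0])[0].nbytes
print(f"bytes per client update: {client_deltas[0].nbytes} -> {sent}")
```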
In summary, approximate computing provides a framework to trade accuracy for efficiency in a principled manner. Over the last year, it has been increasingly applied to AI workloads, often in conjunction with model compression and specialized hardware, to push energy consumption down to new lows. As AI systems proliferate in power-constrained environments (like IoT sensors, mobile devices, and EVs), these approximation strategies will be vital for sustainable AI deployment.
Energy-efficient neural network architectures
While compression and approximation adapt existing models, another approach is to design new neural network architectures that are inherently more energy-efficient. This can mean creating models that achieve more with fewer parameters or operations, or architectures that better exploit modern hardware parallelism to save energy per inference. A prominent trend is hardware-aware neural architecture search (NAS) – automatically searching for network designs that optimize both accuracy and efficiency metrics. Recent research has started to include energy consumption directly in the NAS objective. For example, La et al. (arXiv 2025) present a NAS method focused on minimizing measured energy usage rather than proxy metrics such as FLOPs. In their study on tabular data models, the NAS-found architecture reduced energy consumption by up to 92% compared to architectures found by conventional (accuracy-only) NAS approaches. This huge gain underscores how different an energy-optimal network can be from a standard one, and the importance of explicitly optimizing for energy. Increasingly, NAS frameworks for vision and language tasks are following suit, incorporating energy and latency predictors during the search to yield specialized, efficient models.
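In spirit, energy-aware NAS puts measured energy directly into the search objective. The sketch below shows a hypothetical random-search loop: `train_and_evaluate` and `measure_energy_joules` are illustrative stand-ins (stubbed here with dummy values) for a real training pipeline and a power-measurement tool, and the search space and scoring weight are arbitrary; this is not the method of La et al.

```python
# Energy-aware random architecture search (toy version): candidates are scored
# on accuracy minus a penalty proportional to measured energy per inference.
import random

SEARCH_SPACE = {"depth": [2, 3, 4], "width": [64, 128, 256], "activation": ["relu", "gelu"]}

def sample_architecture():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_and_evaluate(arch):
    # Hypothetical stand-in: train the candidate and return validation accuracy.
    return random.uniform(0.7, 0.95)

def measure_energy_joules(arch):
    # Hypothetical stand-in: measure energy per inference (e.g., with an
    # on-chip energy counter or external power meter). Here: a dummy proxy.
    return arch["depth"] * arch["width"] * 1e-4

def score(accuracy, energy_j, energy_weight=2.0):
    return accuracy - energy_weight * energy_j   # joint accuracy/energy objective

best_arch, best_score = None, float("-inf")
for _ in range(20):                              # small random-search budget
    arch = sample_architecture()
    s = score(train_and_evaluate(arch), measure_energy_joules(arch))
    if s > best_score:
        best_arch, best_score = arch, s
print("best architecture found:", best_arch)
```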
Researchers are also revisiting neural network fundamentals to improve efficiency. Spiking Neural Networks (SNNs) are one example – they encode information in sparse spike events over time, potentially requiring far fewer operations than dense feed-forward networks (especially on neuromorphic hardware). New architectures and training methods for SNNs have emerged that make them more viable for practical tasks, combining the strengths of deep learning with the event-driven efficiency of spiking computation. A 2024 framework called SNN4Agents introduced optimization techniques for embodied spiking neural networks, demonstrating energy-efficient control in robotics. Likewise, advances in binary neural networks (where weights and activations are just 1 bit) continue to push accuracy closer to that of full-precision networks, which could enable digital inference with orders of magnitude less energy.
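The event-driven principle behind SNNs can be illustrated with a toy leaky integrate-and-fire neuron: it only emits a spike (and thus triggers downstream work) when its accumulated input crosses a threshold. The leak factor, threshold, and random input below are arbitrary; real SNN training and neuromorphic deployment are considerably more involved.

```python
# Toy leaky integrate-and-fire (LIF) neuron: computation is driven by sparse
# spike events rather than dense activations.
import numpy as np

def lif_neuron(input_current, threshold=1.0, leak=0.9):
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i          # leaky integration of incoming current
        if v >= threshold:        # fire a spike and reset the membrane potential
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
spike_train = lif_neuron(rng.uniform(0.0, 0.5, size=20))
print("spike train:", spike_train, "| spikes emitted:", sum(spike_train))
```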
Another route to efficient architectures is model sparsity and modularity. Techniques like Mixture-of-Experts (MoE) create models in which only a small fraction of the network (an “expert”) is active for each input, saving computation on average. Methods that enforce structured sparsity (so that only a fraction of neurons fire) can leverage efficient sparse matrix operations in hardware. These architecture-level strategies often intersect with training techniques – e.g., lottery-ticket-style pruning finds sub-networks that can be trained in isolation, effectively yielding a new, smaller architecture that is as accurate as the original.
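As a concrete picture of MoE routing, the PyTorch sketch below sends each input to a single top-scoring expert, so only a fraction of the model's parameters is exercised per example. The expert count, sizes, and top-1 routing are illustrative; production MoE layers add load balancing and efficient batched dispatch.

```python
# Minimal Mixture-of-Experts routing: a gating network picks one expert per
# input, so compute per example is a fraction of the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):                                  # x: (batch, dim)
        scores = F.softmax(self.gate(x), dim=-1)
        top_scores, top_idx = scores.max(dim=-1)           # top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():                                 # run only the selected expert
                out[mask] = expert(x[mask]) * top_scores[mask].unsqueeze(-1)
        return out

moe = TinyMoE()
print(moe(torch.randn(8, 64)).shape)
```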
Finally, researchers are developing task-specific architectures that are streamlined for particular domains. For instance, lightweight CNNs for mobile vision (e.g., MobileNet variants) and small transformers for edge NLP have seen continuous improvements. In 2024, new variants of efficient vision transformers and graph neural networks drastically cut down model size and operations, often by incorporating ideas such as early exits, feature reuse, or tailoring network depth to input complexity (dynamic depth). These innovations ensure the model does no more work than necessary for each input.
In essence, the architecture of a neural network greatly influences its energy profile. Through automated search and clever design principles, the past year’s research has delivered neural architectures optimized for efficiency from the ground up. When combined with efficient hardware, these architectures help bend the curve of AI’s energy demand, enabling high-performance AI with a much smaller energy footprint.
Watt Matters in AI
Watt Matters in AI is a conference that aims to explore the potential of AI with significantly improved energy efficiency. In the run-up to the conference, IO+ publishes a series of articles that describe the current situation and potential solutions. Tickets to the conference can be found at wattmattersinai.eu.
Disclaimer: Artificial intelligence was used in finding and analyzing relevant studies for this article.