Arm Mali-G76 GPU microarchitecture deep dive
In the pursuit of ever greater graphics performance, Arm made some significant changes with the third entry in the high-performance tier of its Bifrost architecture, the Mali-G76. A number of these important tweaks already made their way to the mid-tier Mali-G52, but the G76 aims to push performance up by another 50 percent in just a single iteration.
To see how Arm is pushing its chips’ graphics performance, let’s take a closer look inside the Mali-G76.
More execution lanes, more performance
As we touched on in the announcement, the key to the performance improvement lies in doubling the number of execution lanes inside each Mali-G76 core. In the Mali-G7X architecture, each core contains three execution engines, and the MP suffix in the product name gives the core count: an MP2 has two cores and six execution engines in total, and an MP4 has four cores and 12 execution engines. In the Mali-G52, IP partners have the option of either two or three execution engines per core for more flexible low-to-mid-range performance.
These execution engines contain the execution lanes, which handle scalar threads for math. The lanes all run in parallel, so a core with more lanes can do more math at any one time. However, increasing the number of lanes also raises the demands on memory bandwidth and texturing, along with power and silicon area requirements.
The Mali-G76 increases the number of lanes in each execution engine to eight, up from four with the Mali-G72. In a single Mali-G76 core there are now 24 execution lanes, up from 12 in the G72. This doubles the compute capability of a single core at the cost of a reasonably small 28 percent increase in area. G76 cores will be slightly larger than previous G72 and G71 cores, but they are more powerful, so we can certainly expect the graphics core count to fall in upcoming smartphone SoCs compared with the current generation.
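The per-core arithmetic is easy to check with a few lines of Python. This is only an illustrative sketch: the names are made up for the example, and "lanes per unit area" is a crude proxy derived from the lane counts and the 28 percent area figure quoted above, not a measured performance-density number from Arm.

```python
# Illustrative arithmetic only; names are hypothetical, not Arm terminology.
ENGINES_PER_CORE = 3  # execution engines per core on both the G72 and G76


def lanes_per_core(lanes_per_engine: int, engines: int = ENGINES_PER_CORE) -> int:
    """Return the number of execution lanes in one shader core."""
    return engines * lanes_per_engine


g72_lanes = lanes_per_core(4)  # 3 engines x 4 lanes each
g76_lanes = lanes_per_core(8)  # 3 engines x 8 lanes each

compute_ratio = g76_lanes / g72_lanes  # per-core lane ratio, G76 vs G72
area_ratio = 1.28                      # G76 core is 28 percent larger (from the article)
lanes_per_area_ratio = compute_ratio / area_ratio

print(g72_lanes, g76_lanes)            # 12 24
print(round(lanes_per_area_ratio, 2))  # 1.56
```

In other words, on raw lane count alone each unit of core area carries roughly half as many lanes again as before. Real-world gains depend on clocks, memory bandwidth, and workload.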
The maximum number of cores when using a Mali-G76 also now caps out at 20. That’s a decrease from the maximum of 32 cores with the G72, though we never really saw smartphone designs venture further than the high teens anyway. Despite the lower core count, the maximum number of execution lanes in the largest configurations increases. A 20-core Mali-G76 offers 480 execution lanes versus just 384 lanes in a 32-core Mali-G72 setup. Peak theoretical throughput in the biggest configuration can therefore increase by up to 25 percent.
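The figures for the largest configurations follow directly from cores multiplied by lanes per core. The short Python sketch below reproduces them. The dictionary layout and function name are invented for illustration, and the result compares lane counts only, so it says nothing about clock speeds or sustained performance.

```python
# Illustrative arithmetic only; the data layout and names are hypothetical.
MAX_CONFIGS = {
    "Mali-G72": {"cores": 32, "lanes_per_core": 12},
    "Mali-G76": {"cores": 20, "lanes_per_core": 24},
}


def total_lanes(config: dict) -> int:
    """Return the total number of execution lanes across all cores."""
    return config["cores"] * config["lanes_per_core"]


totals = {name: total_lanes(cfg) for name, cfg in MAX_CONFIGS.items()}
uplift = totals["Mali-G76"] / totals["Mali-G72"] - 1

print(totals)           # {'Mali-G72': 384, 'Mali-G76': 480}
print(f"{uplift:.0%}")  # 25%
```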
The second major benefit of increasing the number of lanes in each execution engine is a relative decrease in power consumption — each core is more power efficient for the same workload than a previous generation core. This is because the power draw of the other GPU components remains mostly constant when scaling up the number of execution lanes.
Arm’s graphic above demonstrates that although the relative energy cost of the arithmetic datapath and register files remains the same, there are major efficiency savings made in the data path control, cache, and quad control parts of the GPU. This allows the G76 to boast a 30 percent improvement in energy efficiency compared to the G72 on the same process node.
These execution lanes also now support INT8 dot product math via a new instruction. Each lane can perform four multiply-accumulate operations per cycle, which greatly improves throughput. We’ve already seen this implementation in the mid-range Mali-G52. Arm says this can improve the efficiency of machine learning applications using INT8 dot product by around 270 percent compared to the previous generation.
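For readers unfamiliar with the operation, here is a minimal Python model of a four-way INT8 dot product with accumulation. It is a generic sketch of the math, not Arm’s instruction: the function name, the signed input range, and the unbounded Python accumulator are all assumptions, since the article does not give the instruction’s exact semantics.

```python
# Generic model of a 4-way INT8 dot product with accumulate.
# Hypothetical helper: signedness and accumulator width are assumptions.
def int8_dot4_accumulate(a, b, acc=0):
    """Return acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-element vectors")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")
    # Four multiply-accumulate operations folded into one call.
    return acc + sum(x * y for x, y in zip(a, b))


print(int8_dot4_accumulate([1, -2, 3, 4], [5, 6, -7, 8], acc=10))  # 14

# Peak multiply-accumulates per cycle for one G76 core, using the
# article's figures: 24 lanes x 4 MACs per lane.
peak_macs_per_core = 24 * 4
print(peak_macs_per_core)  # 96
```

Packing four 8-bit multiplies into a single operation is what lets quantized neural network layers, which are dominated by dot products, run with far fewer instructions.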
Balancing the design
Along with the increase in compute power per core, the Mali-G76 boasts a number of other improvements to ensure the change in design doesn’t produce any unwanted bottlenecks.
There’s a new dual texture mapper, which, as the name suggests, handles texture sampling, resizing, and placement onto 3D models. It’s capable of two texels per cycle, doubling the texturing throughput over the G72. The quad manager has been optimized to keep the eight-lane execution engines and the dual texture mapper well fed with data.
Arm’s latest graphics part features a number of other smaller optimizations: out-of-order polygon list writeback to prevent stalls during cache misses; varying pre-loads to improve efficiency; depth pre-loads for better multi-render performance; and TLS address interleaving, which speeds up cache fetching by better organizing the memory space.
This results in not only a number of performance optimizations, but also more linear performance scaling as the core count increases. Arm now expects essentially linear boosts to performance with core counts up into the high teens and only a minimal loss when capping out at 20. Previously there had been some more noticeable curtailing in the performance gains when scaling up closer to the maximum core count.
What to expect from Mali-G76 GPUs
As we’ve come to expect from Arm’s generational graphics improvements, both performance and energy efficiency are set for a notable uplift. Actual implementations in smartphones could see graphics performance improve by as much as 50 percent.
The Mali-G76 presents a bit of a naming problem when gauging performance though. Mali-G76 designs with lower core counts will provide comparable or better performance than existing G71 and G72 GPUs with high core counts. The G71 and G72 saw high-performance smartphones offer core counts in the high teens, but Arm expects this to fall to the low teens with the G76, even though performance will climb. For example, a Mali-G76 MP14 will offer better performance than a Mali-G72 MP18.
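A quick back-of-the-envelope check shows why the MP number alone is misleading across generations. The Python sketch below compares execution lane counts only, using the per-core figures quoted earlier. The helper name is made up for this example, and the comparison ignores clock speed, memory bandwidth, and the other architectural changes.

```python
import math

# Per-core lane counts quoted in the article.
G72_LANES_PER_CORE = 12
G76_LANES_PER_CORE = 24


def g76_cores_to_match(g72_cores: int) -> int:
    """Smallest G76 core count with at least as many lanes as a G72 design.

    Hypothetical helper: compares lane counts only, not real performance.
    """
    return math.ceil(g72_cores * G72_LANES_PER_CORE / G76_LANES_PER_CORE)


print(g76_cores_to_match(18))  # 9

mp14_g76_lanes = 14 * G76_LANES_PER_CORE
mp18_g72_lanes = 18 * G72_LANES_PER_CORE
print(mp14_g76_lanes, mp18_g72_lanes)  # 336 216
```

On lane count alone, nine G76 cores already match a G72 MP18, so a G76 MP14 has considerable headroom over it.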
Each Mali-G76 core can be up to twice as powerful as in the G72.
Just like with the new Cortex-A76, the Mali-G76 is a flexible component designed to scale all the way from mid-tier performance mobile devices up to higher performance laptops, as well as potential AR and VR products.
The Mali-G76 is available for Arm’s partners to license now, meaning we could see devices using it on the market by the end of the year.