AMD RDNA 4: Out-of-Order Memory Accesses and the Elimination of False Dependencies

2025-03-23

AMD's RDNA 4 architecture introduces significant memory subsystem enhancements, most notably removing the false dependencies between wavefronts that were present in RDNA 3 and earlier architectures. Previously, one wavefront could stall behind another wavefront's memory accesses even though the two had no real data dependency, which cost performance. RDNA 4 resolves this with new out-of-order queues that allow requests from different shaders to be serviced out of order. This article details testing that verifies the improvement and compares how AMD, Intel, and Nvidia GPU architectures handle cross-wave memory dependencies. The technique is not entirely novel, but RDNA 4's changes significantly improve performance, particularly in emerging workloads like ray tracing.
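
As a rough illustration of the false dependency described above, the following toy model compares in-order and out-of-order return of memory results for two wavefronts. It is only a sketch: the function name `completion_times`, the one-request-per-cycle issue rate, and the latencies of 500 and 50 cycles are hypothetical and are not taken from the article or from AMD documentation.

```python
def completion_times(requests, in_order):
    """Toy model of a memory return queue.

    requests: list of (wave, latency) tuples in issue order; one request
              is issued per cycle (an assumption of this sketch).
    Returns {wave: cycle at which that wave's last result comes back}.
    """
    done = {}
    prev_return = 0
    for issue_cycle, (wave, latency) in enumerate(requests):
        ready = issue_cycle + latency
        if in_order:
            # In-order return: a result cannot come back before the
            # result of any older request, even one from another wave.
            ready = max(ready, prev_return)
            prev_return = ready
        done[wave] = max(done.get(wave, 0), ready)
    return done


# Hypothetical latencies in cycles: wave A misses in cache, wave B hits.
requests = [("A", 500), ("B", 50)]

in_order = completion_times(requests, in_order=True)
out_of_order = completion_times(requests, in_order=False)

print(in_order)      # {'A': 500, 'B': 500}
print(out_of_order)  # {'A': 500, 'B': 51}
```

In the in-order case wave B waits for wave A's slow access despite not depending on it; with out-of-order return, wave B resumes as soon as its own data arrives.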

Intel Xe3 Architecture Deep Dive: Significant Improvements Target High-End Market

2025-03-19

Details of Intel's Xe3 GPU architecture have emerged from software development visible across several open-source repositories. Xe3 supports a potential maximum of 256 Xe Cores, significantly more than its predecessor, for up to 32,768 FP32 lanes. Improvements include 10 concurrent threads per XVE, flexible register allocation, more scoreboard tokens, and a new gather-send instruction. Xe3 also introduces Sub-Triangle Opacity Culling (STOC), which subdivides triangles to reduce wasted shader work and improve ray tracing performance. These advancements bring Intel's architecture closer to AMD's and Nvidia's in performance and efficiency, signaling Intel's ambition in the high-end GPU market.
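
The lane count above can be reproduced with simple arithmetic. This is a sketch under stated assumptions: the summary gives only the 256 Xe Cores, the 32,768 FP32 lanes, and the 10 threads per XVE; the figures of 8 XVEs per Xe Core and 16 FP32 lanes per XVE are assumed here (carried over from Xe2) because they make the totals agree.

```python
XE_CORES = 256             # maximum from the summary
THREADS_PER_XVE = 10       # from the summary
XVES_PER_XE_CORE = 8       # assumption, carried over from Xe2
FP32_LANES_PER_XVE = 16    # assumption, carried over from Xe2

# Total FP32 lanes across the largest possible configuration.
fp32_lanes = XE_CORES * XVES_PER_XE_CORE * FP32_LANES_PER_XVE

# Total hardware threads that can be resident at once.
resident_threads = XE_CORES * XVES_PER_XE_CORE * THREADS_PER_XVE

print(fp32_lanes)        # 32768
print(resident_threads)  # 20480
```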

Hardware GPU Architecture

Deep Dive into Intel Battlemage's Ray Tracing Performance

2025-03-16

This article examines the ray tracing performance of Intel's Arc B580 GPU, which is built on the Battlemage architecture. Using Cyberpunk 2077's path tracing mode and the 3DMark Port Royal benchmark, it documents improvements in Battlemage's Ray Tracing Accelerator (RTA), including three ray traversal pipelines, a doubled triangle intersection test rate, and a 16KB BVH cache. High occupancy in Cyberpunk 2077's path tracing did not translate into high execution unit utilization, but the improved cache and architecture excelled in Port Royal. The article concludes that Battlemage shows significant ray tracing advancements, while the memory subsystem remains a performance bottleneck.

Hardware

AMD's Strix Halo SoC: A Handheld Threadripper?

2025-03-14

At CES 2025, Mahesh Subramony, AMD Senior Fellow, unveiled the Strix Halo SoC, an integrated processor pairing a Zen 5 CPU with a powerful iGPU. Unlike desktop Zen 5, Strix Halo uses a new die-to-die interconnect that reduces latency and improves power efficiency. A 32MB MALL cache primarily amplifies GPU bandwidth; it is not directly accessible to the CPU, but its design allows future software updates to expand its functionality. Intended for high-performance mobile workstations, Strix Halo features a full 512-bit FPU and strong multi-threaded performance.

Hardware

Zen 5: AMD's Graceful Handling of AVX-512 at High Frequencies

2025-03-01

This article examines how AMD's Zen 5 architecture performs when running AVX-512 instructions at high frequencies. Unlike Intel's Skylake-X, which suffered from fixed frequency offsets and lengthy transition periods, Zen 5 uses improved on-die sensors and adaptive clocking to deliver full AVX-512 performance at its 5.7GHz peak frequency. Tests show that Zen 5 does not drop frequency significantly when it encounters AVX-512 workloads; instead, it makes fine-grained IPC (instructions per cycle) adjustments as needed. This dynamic mechanism avoids frequent frequency transitions and keeps performance smooth as load rises and falls. Brief IPC drops might occur under extreme conditions, but overall Zen 5's AVX-512 support is impressive and handles these workloads far more gracefully than previous Intel architectures.

Hardware

Intel's Battlemage: A Deep Dive into the Arc B580 and its Challenges

2025-02-11

Intel's new Battlemage GPU architecture arrives with the Arc B580, a mid-range card aiming to disrupt the market with 12GB of VRAM at $250. This article delves into Battlemage's improvements over Alchemist, including wider Xe vector engines, enhanced cache mechanisms, and optimized memory access. Despite lower specs on paper, the B580 surprisingly outperforms its predecessor, the A770, in real-world tests. However, driver issues and reliance on Resizable BAR remain hurdles for Intel to overcome.

Hardware

Alibaba's Xuantie C910: Ambitious RISC-V Core, Short on Fundamentals

2025-02-04

Alibaba's T-HEAD division has released the Xuantie C910, a high-performance RISC-V core aiming to reduce reliance on foreign chips and provide cost-effective solutions for IoT and edge computing. This deep dive analyzes the C910's architecture, including its out-of-order execution engine, branch predictor, and cache system, and characterizes its performance through testing. The C910 handles vector extensions and unaligned accesses well, but its out-of-order engine is imbalanced: scheduler and register file capacity are too small relative to its ROB size. A weak cache subsystem limits performance further. For all its ambition, the C910 needs a better balance between the core and its memory subsystem.

SiFive P550 Microarchitecture Deep Dive: RISC-V's Ambitious Step

2025-01-27

This article examines SiFive's P550 microarchitecture, a RISC-V processor core targeting high-performance applications. The P550 is a three-wide out-of-order design with a 13-stage pipeline, aiming for 30% higher performance in less than half the area of a comparable Arm Cortex A75. The analysis compares the P550 to the Cortex A75 across branch prediction, instruction fetch and decode, out-of-order execution, and the memory subsystem. The P550 shows weaknesses in areas like unaligned memory access and needs further refinement, but it represents a significant step forward for RISC-V and demonstrates SiFive's progress towards high-performance general-purpose CPUs.

Zen 5's Op Cache Disabled: A Deep Dive into its Clustered Decoders

2025-01-24

This article examines the instruction fetch and decode mechanism of AMD's Zen 5 processor. Zen 5 uses two decoder clusters, each serving one of the core's two SMT threads. Normally, Zen 5 relies on a 6K-entry op cache to deliver instructions, and the decoders only activate on op cache misses. To evaluate the decoders on their own, the author disables the op cache and forces them to handle all instructions. Tests reveal significant performance drops in single-threaded mode with the op cache disabled. In multi-threaded mode, however, the two decoder clusters largely compensate for the loss, and some multi-threaded workloads even show gains. The author concludes that Zen 5's clustered decoders are not the primary instruction source but a secondary mechanism that helps in high-IPC and multi-threaded scenarios, complementing the op cache to balance performance and power consumption.

Hardware CPU Architecture

Intel's Skymont: A Deep Dive into the E-Core Architecture

2025-01-18

Intel's latest mobile chip, Lunar Lake, features Skymont, a new E-core architecture replacing Meteor Lake's Crestmont. Skymont significantly improves both multi-threaded performance and low-power background task handling. This article provides an in-depth analysis of Skymont's architecture, covering branch prediction, instruction fetch and decode, out-of-order execution engine, integer execution, floating-point and vector execution, load/store, and cache and memory access. While Skymont excels in some benchmarks, its advantages over Meteor Lake's Crestmont cores and AMD's Zen 5c cores aren't always clear-cut. This highlights the crucial role of cache architecture in CPU performance and the challenges of designing a single core architecture to handle both low-power and high-performance multi-threaded workloads.

Hardware E-core

AMD Radeon Instinct MI300A: A Deep Dive into its Massive APU Architecture

2025-01-18

The AMD Radeon Instinct MI300A is a colossal APU integrating 24 Zen 4 cores and 228 CDNA3 compute units. This article delves into its massive Infinity Fabric interconnect, highlighting its high-bandwidth, low-latency characteristics and efficient CPU-GPU data sharing. While its high-bandwidth memory subsystem excels for the GPU, it impacts CPU latency, resulting in single-threaded integer performance comparable to the Ryzen 9 3950X from years ago. Despite this, MI300A has achieved significant success in supercomputing, notably powering LLNL's El Capitan system and topping the TOP500 list.

Hardware

Fujitsu's Monaka CPU: An ARMv9 Datacenter Beast with SVE2 and 3D Stacking

2024-12-14

Fujitsu's Monaka is a new datacenter CPU slated for a 2027 release. This ARMv9-based processor supports SVE2 and uses 3D stacking, resembling AMD's EPYC architecture with a central IO die and disaggregated SRAM and compute units. Each Monaka CPU will pack up to 144 cores across four 36-core chiplets, all built on a 2nm process. The IO die provides 12 channels of DDR5 (potentially exceeding 600GB/s of bandwidth) and PCIe 6.0 with CXL 3.0 support, and the chip is designed to be air-cooled. Unlike its predecessor, A64FX, Monaka omits HBM support and targets the general datacenter market.
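
The core count and the bandwidth estimate above follow from simple arithmetic. This is a sketch under stated assumptions: the summary does not name a DDR5 speed grade, so the 64-bit (8-byte) data path per channel and the DDR5-5600 and DDR5-6400 transfer rates used here are illustrative, and the helper `peak_bandwidth_gb_s` is hypothetical.

```python
CHIPLETS = 4              # from the summary
CORES_PER_CHIPLET = 36    # from the summary
DDR5_CHANNELS = 12        # from the summary
BYTES_PER_TRANSFER = 8    # assumption: 64-bit data path per channel

total_cores = CHIPLETS * CORES_PER_CHIPLET


def peak_bandwidth_gb_s(mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth in GB/s for all channels combined."""
    return DDR5_CHANNELS * BYTES_PER_TRANSFER * mega_transfers_per_s / 1000


print(total_cores)                # 144
print(peak_bandwidth_gb_s(5600))  # 537.6
print(peak_bandwidth_gb_s(6400))  # 614.4
```

Under these assumptions, exceeding 600GB/s requires roughly DDR5-6400 or faster memory.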

Hardware 3D Stacking