Condor's Cuzco: A High-Performance RISC-V Core with a Twist

2025-08-30

Condor Computing, an Andes Technology subsidiary, unveiled its high-performance RISC-V core, Cuzco, at Hot Chips 2025. Cuzco has an 8-wide out-of-order execution engine, a modern branch predictor, and a novel time-based scheduling scheme, putting it in the same league as SiFive's P870 and Ventana's Veyron V1. Its backend relies mostly on static scheduling to save power and reduce complexity, and it needs no ISA changes or compiler adjustments to perform well. Cuzco is highly configurable to meet diverse customer needs and supports multi-core clusters.
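
One way to picture time-based scheduling is a scoreboard that records the cycle at which each register's value will be ready, so every instruction can be given a fixed issue cycle up front. The sketch below is a toy model under stated assumptions, not Condor's design: the latency table, the one-instruction-per-cycle dispatch, and the `schedule` function are all hypothetical, and execution-port conflicts and cache misses are ignored.

```python
# Toy model of time-based scheduling. All names and latencies are
# illustrative assumptions, not details of Cuzco.
LATENCY = {"add": 1, "mul": 3, "load": 4}  # assumed fixed latencies in cycles


def schedule(instrs):
    """Assign each instruction a fixed issue cycle at dispatch time.

    instrs: list of (op, dest, sources) tuples in program order.
    Returns a list of (op, dest, issue_cycle) tuples.
    """
    ready_at = {}  # register name -> cycle at which its value is available
    plan = []
    # Simplification: one instruction is dispatched per cycle.
    for dispatch_cycle, (op, dest, srcs) in enumerate(instrs):
        # Issue no earlier than dispatch, and no earlier than the
        # predicted ready time of every source operand.
        issue = max([dispatch_cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = issue + LATENCY[op]
        plan.append((op, dest, issue))
    return plan


program = [
    ("load", "x1", ["x10"]),
    ("mul", "x2", ["x1", "x1"]),
    ("add", "x3", ["x2", "x11"]),
    ("add", "x4", ["x10", "x11"]),  # independent of the chain above
]

for op, dest, issue in schedule(program):
    print(f"{op:>4} -> {dest} issues at cycle {issue}")
```

In this model the independent `add` to `x4` is planned for cycle 3, ahead of the dependent `mul` (cycle 4) and `add` (cycle 7), without any per-cycle wakeup logic: the order falls out of the precomputed ready times.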

Hardware

Google's Datacenter-Scale Liquid Cooling: A Revolution for AI

2025-08-26

The rise of AI has created a significant heat challenge for datacenters. At Hot Chips 2025, Google showcased its massive liquid cooling system designed for its TPUs. This system uses CDUs (Coolant Distribution Units) for rack-level cooling, significantly reducing power consumption compared to air cooling and ensuring system stability through redundancy. Google also employs a bare-die design, similar to PC enthusiast 'de-lidding', to improve the heat transfer efficiency of its TPUv4. This solution not only tackles the immense cooling demands of AI but also points towards a new direction for future datacenter cooling solutions.

Tech

Intel's Lion Cove: A Deep Dive into Gaming Performance

2025-07-07

Intel's latest high-performance CPU architecture, Lion Cove, excels in SPEC CPU2017 benchmarks and even rivals AMD's Zen 5. However, gaming workloads differ significantly from productivity tasks. This article provides a deep dive into Lion Cove's gaming performance, analyzing detailed data on cache hierarchy, instruction execution latency, branch prediction, and more. It reveals Lion Cove's strengths and weaknesses in gaming scenarios and compares it to Zen 4. Results show a strong frontend but a backend bottlenecked by memory latency, leaving room for improvement in gaming performance.

Hardware

Nvidia's Blackwell: A Colossus of Compute, but at What Cost?

2025-06-29

Nvidia's latest Blackwell architecture, exemplified by the RTX PRO 6000, boasts a gargantuan GB202 die (750mm², 92.2 billion transistors) and a staggering 188 SM units, delivering unmatched compute performance. A deep dive into its microarchitecture reveals details on instruction caching, execution units, and memory subsystems, comparing it to AMD's RDNA4. While Blackwell exhibits some imperfections, like L2 cache performance and per-unit efficiency, its sheer scale dwarfs the competition, making it the largest consumer GPU available. This ambition, however, comes at a cost, including power consumption (600W) and L2 latency. The article concludes with a perspective on the future GPU landscape.

Hardware

Deep Dive into AMD's Instinct MI350: GCN-Based AI Accelerator

2025-06-20

In an interview, Alan Smith, AMD's Chief Instinct Architect, delved into the details of the new MI350 series AI accelerators. While MI350 retains the GFX9 architecture, it gains significant performance from increased LDS capacity (160KB) and bandwidth, along with new microscaling formats supporting FP8, FP6, and FP4 data types. Notably, MI350's FP6 and FP4 have the same throughput, reflecting AMD's confidence in FP6's potential for both training and inference. MI350 also omits TF32 hardware acceleration in favor of optimized BF16, offering TF32 support through software emulation. Built from N3P compute dies and N6 I/O dies, MI350 uses fewer compute units than its predecessor, aiming for high performance at lower power consumption.

Hardware

AMD's CDNA 4 Architecture: Balancing Matrix and Vector Operations

2025-06-17

AMD's latest compute-oriented GPU architecture, CDNA 4, is a modest upgrade over CDNA 3. The focus is on boosting matrix multiplication performance with the lower-precision data types crucial for machine learning, while maintaining AMD's lead in vector operations. Using a multi-chiplet design similar to CDNA 3's along with higher clock speeds, CDNA 4 increases Local Data Share (LDS) capacity and bandwidth and introduces read-with-transpose LDS instructions to optimize matrix multiplication. While it lags behind Nvidia's Blackwell architecture in low-precision matrix operations, CDNA 4 retains a significant advantage in vector operations and higher-precision data types due to its higher core count and clock speeds.

Hardware

AMD Trinity's Compromised Interconnect: A Decade of iGPU Integration

2025-06-17

This article delves into the memory interconnect architecture of AMD's Trinity APU (released in 2012). Unlike the later Infinity Fabric, Trinity uses two distinct links, "Onion" and "Garlic," to connect the CPU and iGPU. "Onion" guarantees cache coherency but is bandwidth-limited, while "Garlic" offers high bandwidth but lacks coherency. This design reflects a compromise built on an architecture inherited from the Athlon 64, resulting in performance penalties when the CPU and GPU access each other's memory. While it performs adequately for graphics workloads like gaming, Trinity's architecture lacks the elegance and efficiency of the iGPU integration in Intel's Sandy Bridge and Ivy Bridge. The author uses tests and data analysis to detail the functionality, advantages, and disadvantages of both links, demonstrating Trinity's memory bandwidth usage with various games and image processing programs.

Hardware Interconnect

IBM Telum II: A Revolutionary Mainframe Processor and its Virtual Cache Strategy

2025-05-19

IBM's latest mainframe processor, Telum II, boasts eight 5.5GHz cores and a massive 360MB on-chip cache, along with a DPU and AI accelerator. Its most intriguing feature is its innovative virtual L3 and L4 cache strategy. By cleverly using saturation metrics and cache replacement policies, Telum II virtually combines multiple L2 caches into a huge L3 and a cross-chip L4, dramatically boosting single-threaded performance while maintaining incredibly low latency even with up to 32 processors working together. This strategy could potentially inform future client CPU designs, but challenges remain in overcoming cross-chip interconnect bandwidth limitations.

Hardware Virtual Cache

Zhaoxin's Century Avenue: A Deep Dive into China's x86 CPU Ambitions

2025-04-30

Zhaoxin's latest CPU, the KX-7000, featuring the new "Century Avenue" architecture, aims to bridge the performance gap with early 2010s Intel CPUs. While showing progress with a wider 4-wide core and higher clock speeds, the KX-7000 lags in cache bandwidth, branch prediction, and memory subsystem performance. Single-threaded performance roughly matches AMD's Bulldozer, outperforming it in floating-point benchmarks but falling short in multi-threaded tasks against both Bulldozer and Intel Skylake. The article suggests the KX-7000 isn't designed to directly compete with AMD and Intel, but rather to meet China's demand for domestic CPUs, highlighting the technical and resource challenges in the pursuit of performance.

Hardware Zhaoxin x86 CPU

RDNA 4's Dynamic VGPR Allocation: A Ray Tracing Bottleneck Breaker

2025-04-05

AMD's RDNA 4 architecture introduces a novel dynamic VGPR (vector general-purpose register) allocation mode to address the trade-off between register count and occupancy in ray tracing. Traditional GPUs face limitations in ray tracing where fixed register allocation per thread restricts thread parallelism in stages with high register demands. RDNA 4's dynamic allocation allows threads to adjust register counts at runtime, increasing occupancy without enlarging the register file, thus reducing latency and boosting ray tracing performance. While this mode can lead to deadlocks, AMD mitigates this with a deadlock avoidance mode. This isn't a universal solution, limited to wave32 compute shaders, but significantly advances AMD's ray tracing capabilities.
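
The register-count versus occupancy trade-off can be made concrete with a little arithmetic. The sketch below uses made-up numbers: the register file size, the wave cap, and the per-wave register counts are illustrative assumptions, not figures from the article or from AMD documentation.

```python
# Illustrative numbers only; none of these come from the article.
REGFILE_VGPRS = 1536  # assumed register budget per SIMD
MAX_WAVES = 16        # assumed hardware cap on resident waves per SIMD


def occupancy(vgprs_per_wave):
    """Number of waves that fit if each wave holds vgprs_per_wave registers."""
    return min(MAX_WAVES, REGFILE_VGPRS // vgprs_per_wave)


# Static allocation: each wave reserves its worst-case register count
# for its whole lifetime, even in stages that need far fewer.
static_waves = occupancy(192)

# Dynamic allocation: waves hold a smaller count most of the time and
# request more only while running the register-hungry stage.
dynamic_waves = occupancy(96)

print(f"static: {static_waves} waves, dynamic (typical): {dynamic_waves} waves")
```

With these assumed numbers, halving the registers a wave holds outside its register-hungry stage doubles the waves available to hide memory latency. It also hints at why deadlock is possible: if every resident wave asks to grow at once, the register file cannot satisfy them all.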

AMD RDNA 4: Out-of-Order Memory Accesses and the Elimination of False Dependencies

2025-03-23

AMD's RDNA 4 architecture introduces significant memory subsystem enhancements, notably addressing false dependencies between wavefronts present in RDNA 3 and earlier architectures. Previously, one wavefront could be stalled by another's memory accesses, impacting performance. RDNA 4 resolves this by implementing new out-of-order queues, allowing requests from different shaders to be serviced out of order. This article details testing that verifies this improvement and compares AMD, Intel, and Nvidia GPU architectures in handling cross-wave memory dependencies. While not entirely novel, RDNA 4's improvements significantly enhance performance, particularly in emerging workloads like ray tracing.

Intel Xe3 Architecture Deep Dive: Significant Improvements Target High-End Market

2025-03-19

Details of Intel's Xe3 GPU architecture have emerged, with software development visible across several open-source repositories. Xe3 boasts a potential maximum of 256 Xe Cores, significantly more than its predecessor, supporting up to 32,768 FP32 lanes. Improvements include 10 concurrent threads per XVE, flexible register allocation, increased scoreboard tokens, and a new gather-send instruction. Additionally, Xe3 introduces Sub-Triangle Opacity Culling (STOC), which subdivides triangles to reduce wasted shader work, enhancing ray tracing performance. These advancements bring Intel's architecture closer to AMD and Nvidia's in terms of performance and efficiency, signaling Intel's ambition in the high-end GPU market.
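
The two headline maximums are consistent with each other, which a quick check makes visible. In the sketch below, the Xe Core count, lane count, and threads per XVE come from the summary; the split of 8 XVEs per Xe Core with 16 lanes each is an assumption carried over from earlier Xe generations and is not stated here.

```python
XE_CORES = 256          # maximum Xe Cores, from the summary
FP32_LANES = 32_768     # maximum FP32 lanes, from the summary
THREADS_PER_XVE = 10    # concurrent threads per XVE, from the summary

lanes_per_core = FP32_LANES // XE_CORES
print(lanes_per_core)   # 128

# Assumption (not stated in the summary): 8 XVEs per Xe Core, 16 lanes each.
XVES_PER_CORE = 8
LANES_PER_XVE = 16
assert XVES_PER_CORE * LANES_PER_XVE == lanes_per_core

# Under the same assumption, the maximum number of resident threads:
max_threads = XE_CORES * XVES_PER_CORE * THREADS_PER_XVE
print(max_threads)      # 20480
```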

Hardware GPU Architecture

Deep Dive into Intel Battlemage's Ray Tracing Performance

2025-03-16

This article delves into the ray tracing performance of Intel's Arc B580 GPU under the Battlemage architecture. Analyzing Cyberpunk 2077's path tracing and 3DMark Port Royal benchmark, it reveals improvements in Battlemage's Ray Tracing Accelerator (RTA), including a tripled ray traversal pipeline, doubled triangle intersection test rate, and a 16KB BVH cache. While high occupancy in Cyberpunk 2077's path tracing didn't translate to high execution unit utilization, the improved cache and architecture excelled in Port Royal. The article concludes that Battlemage shows significant ray tracing advancements, but the memory subsystem remains a performance bottleneck.

Hardware

AMD's Strix Halo SoC: A Handheld Threadripper?

2025-03-14

At CES 2025, Mahesh Subramony, AMD Senior Fellow, unveiled the Strix Halo SoC, a groundbreaking integrated processor boasting a Zen 5 CPU and a powerful iGPU. Unlike desktop Zen 5, Strix Halo prioritizes power efficiency with innovative die-to-die interconnect technology, reducing latency and boosting efficiency. A 32MB MALL cache primarily amplifies GPU bandwidth; while inaccessible to the CPU directly, its design allows for future software updates to expand functionality. Intended as a high-performance mobile workstation, Strix Halo features a full 512-bit FPU and impressive multi-threaded performance.

Hardware

Zen 5: AMD's Graceful Handling of AVX-512 at High Frequencies

2025-03-01

This article delves into the performance of AMD's Zen 5 architecture running AVX-512 instructions at high frequencies. Unlike Intel's Skylake-X, which suffered from fixed frequency offsets and lengthy transition periods, Zen 5 leverages improved on-die sensors and adaptive clocking to achieve full AVX-512 performance at its 5.7GHz peak frequency. Tests reveal that Zen 5 doesn't experience significant frequency drops when encountering AVX-512 workloads; instead, it employs fine-grained IPC (instructions per cycle) adjustments as needed to maintain high performance. This dynamic adjustment mechanism effectively avoids frequent frequency transitions, ensuring smooth performance transitions between high and low loads. While brief IPC drops might occur under extreme conditions, overall, Zen 5's AVX-512 support is impressive, significantly outperforming previous Intel architectures.

Hardware

Intel's Battlemage: A Deep Dive into the Arc B580 and its Challenges

2025-02-11

Intel's new Battlemage GPU architecture arrives with the Arc B580, a mid-range card aiming to disrupt the market with 12GB of VRAM at $250. This article delves into Battlemage's improvements over Alchemist, including wider Xe vector engines, enhanced cache mechanisms, and optimized memory access. Despite lower specs on paper, the B580 surprisingly outperforms its predecessor, the A770, in real-world tests. However, driver issues and reliance on Resizable BAR remain hurdles for Intel to overcome.

Hardware

Alibaba's Xuantie C910: Ambitious RISC-V Core, Short on Fundamentals

2025-02-04

Alibaba's T-HEAD division has released the Xuantie C910, a high-performance RISC-V core aiming to reduce reliance on foreign chips and provide cost-effective solutions for IoT and edge computing. This deep dive analyzes C910's architecture, including its out-of-order execution engine, branch predictor, and cache system, revealing performance characteristics through testing. While excelling in vector extensions and unaligned access handling, C910 suffers from an imbalanced out-of-order engine with insufficient scheduler and register file capacity relative to its ROB size. Its weak cache subsystem further limits performance. Despite ambition, C910 needs improvement in balancing core architecture and memory subsystem.

SiFive P550 Microarchitecture Deep Dive: RISC-V's Ambitious Step

2025-01-27

This article delves into SiFive's P550 microarchitecture, a RISC-V processor core targeting high-performance applications. The P550 employs a three-wide out-of-order execution architecture with a 13-stage pipeline, aiming for 30% higher performance in less than half the area of a comparable Arm Cortex A75. The analysis compares P550 to the Cortex A75, examining branch prediction, instruction fetch and decode, out-of-order execution, and the memory subsystem. While the P550 shows weaknesses in areas like unaligned memory access, it represents a significant step forward for RISC-V. Though needing further refinement, the P550 demonstrates SiFive's progress towards high-performance general-purpose CPUs.

Zen 5's Op Cache Disabled: A Deep Dive into its Clustered Decoders

2025-01-24

This article delves into the instruction fetch and decode mechanism of AMD's Zen 5 processor. Zen 5 uses a unique dual-decoder cluster architecture, with each cluster serving one of the core's two SMT threads. Normally, Zen 5 relies on a 6K-entry op cache to deliver instructions, with the decoders only activating on op cache misses. The author disables the op cache, forcing the decoders to handle all instructions, to evaluate their performance. Tests reveal significant performance drops in single-threaded mode with the op cache disabled; in multi-threaded mode, however, the two decoder clusters effectively compensate for the loss and even yield gains in some workloads. The author concludes that Zen 5's decoder clusters aren't the primary instruction source but a secondary one that boosts performance in high-IPC and multi-threaded scenarios, complementing the op cache to balance performance and power consumption.

Hardware CPU Architecture

Intel's Skymont: A Deep Dive into the E-Core Architecture

2025-01-18

Intel's latest mobile chip, Lunar Lake, features Skymont, a new E-core architecture replacing Meteor Lake's Crestmont. Skymont significantly improves both multi-threaded performance and low-power background task handling. This article provides an in-depth analysis of Skymont's architecture, covering branch prediction, instruction fetch and decode, out-of-order execution engine, integer execution, floating-point and vector execution, load/store, and cache and memory access. While Skymont excels in some benchmarks, its advantages over Meteor Lake's Crestmont cores and AMD's Zen 5c cores aren't always clear-cut. This highlights the crucial role of cache architecture in CPU performance and the challenges of designing a single core architecture to handle both low-power and high-performance multi-threaded workloads.

Hardware E-core

AMD Radeon Instinct MI300A: A Deep Dive into its Massive APU Architecture

2025-01-18

The AMD Radeon Instinct MI300A is a colossal APU integrating 24 Zen 4 cores and 228 CDNA3 compute units. This article delves into its massive Infinity Fabric interconnect, highlighting its high-bandwidth, low-latency characteristics and efficient CPU-GPU data sharing. While its high-bandwidth memory subsystem excels for the GPU, it impacts CPU latency, resulting in single-threaded integer performance comparable to the Ryzen 9 3950X from years ago. Despite this, MI300A has achieved significant success in supercomputing, notably powering LLNL's El Capitan system and topping the TOP500 list.

Hardware

Fujitsu's Monaka CPU: An ARMv9 Datacenter Beast with SVE2 and 3D Stacking

2024-12-14

Fujitsu's Monaka is a new datacenter CPU slated for a 2027 release. This ARMv9-based processor supports SVE2 and uses 3D stacking; like AMD's EPYC, it has a central IO die, with SRAM and compute disaggregated into separate dies. Each Monaka CPU will pack up to 144 cores across four 36-core chiplets, all built on a 2nm process. The IO die provides 12 channels of DDR5 (potentially exceeding 600GB/s of bandwidth) and PCIe 6.0 with CXL 3.0 support, and the chip is designed to be air-cooled. Unlike its predecessor, A64FX, Monaka omits HBM support and targets the general datacenter market.
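
The 600GB/s figure can be sanity-checked with the usual peak-bandwidth formula. The sketch below assumes a 64-bit (8-byte) data bus per channel and tries two example DDR5 transfer rates; the summary does not say which memory speed Monaka will use, so those rates are hypothetical.

```python
def ddr_peak_bandwidth_gbs(channels, megatransfers_per_sec, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_bytes=8 assumes a 64-bit data bus per channel.
    """
    return channels * megatransfers_per_sec * bus_bytes / 1000


CHANNELS = 12  # from the summary

# Example transfer rates in MT/s; which speed Monaka uses is an assumption.
for rate in (5600, 6400):
    print(f"DDR5-{rate}: {ddr_peak_bandwidth_gbs(CHANNELS, rate):.1f} GB/s")

# Transfer rate at which 12 channels reach exactly 600 GB/s.
min_rate = 600 * 1000 / (CHANNELS * 8)
print(f"600 GB/s needs {min_rate:.0f} MT/s per channel")
```

Under these assumptions, twelve channels pass 600GB/s only above 6250 MT/s, so the claim implies memory faster than DDR5-5600.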

Hardware 3D Stacking