Exploring the Power of Parallelized CPU Architectures

Exploring the CELL BE, a Supercomputer from Another Planet



Cover Illustration by hi__dan

On a recent trip to Japan, I managed to score a rare PS3 reference tool (DECR-1000J) that was being sold below market rates. These units are apparently very rare and, unlike the retail models, are not region-locked, so I took the chance to buy one to serve as a learning platform. Unfortunately, I lost the unit during a home invasion while I was out of the country for the weekend, alongside several valuables and my work laptop. I'm not exactly sure what they were planning to do with a rackmountable PS3, but alas, I was not destined to play around with it for long.

I've always seen the architecture of the PS3, the CELL Broadband Engine, as some sort of mythical creature. After looking online for tutorials on how to install Linux on it, I got to work figuring out what I could possibly use it for, or what I could write about it that hasn't already been written endlessly before. That's when I got the idea to port a math library to the CELL SPU, utilizing its SIMD and ILP features to speed up the computation.

Funnily enough, a startup called Flow Computing recently appeared with graphs (and names) suspiciously similar to how the CELL BE architecture works. The startup claims that their new "CPU 2.0" architecture can accelerate CPU workloads by up to 100x, which is (sorta) believable given that they're not some scrappy shady startup but a spinout of the VTT Technical Research Centre of Finland.

This article (probably closer to an academic paper at this point) would not be possible without:

  • Francesco Mazzoli's article on speeding up atan2f by 50x using AVX-512 instructions, which served as the initial inspiration for this article

  • Mike Acton from B3D@CellPerformance for the sequential implementation of atan2 in the CELL SPU (archive link)

  • IBM's CELL Broadband Engine Resource Center (archive link), specifically the following docs

    • Cell Broadband Engine Programmer's Guide

    • Cell Broadband Engine Programming Tutorial

    • Cell Broadband Engine Programming Handbook

    • SIMD Math Library Specification for Cell Broadband Engine Architecture

    • C/C++ Language Extensions for Cell Broadband Engine Architecture

  • The Best Programming Practice for Cell/BE (self archive link) by Akira Tsukamoto (Sony CELL BE team)

  • This doujinshi from Japan, which contains the most detailed account of programming the CELL SPE cores; it's in Japanese and fully absurd

  • That one guy from the University of Indonesia (UI) Computer Science Faculty (Fasilkom) who said I'm not a programmer because I can't do leetcode algorithms


The Cell processor, developed by Sony, Toshiba, and IBM as part of the STI alliance, was an ambitious and highly unconventional processor design that promised extraordinary computational power. Introduced in the PlayStation 3 in 2006, the Cell architecture aimed to leverage the parallel processing capabilities of multiple specialized co-processors.

To understand why the CELL architecture is so weird, you need to understand that during the early 2000s, the traditional approach of enhancing performance by increasing clock frequencies was reaching its limits. Intel's Pentium architecture could not evolve further, and its promised successor was nowhere to be found. Similarly, IBM's PowerPC G5 processors failed to deliver the promised 3 GHz clock speed. People were calling it the end of Moore's law.

While a lot of ideas were thrown around on how to fundamentally rework computing architectures to scale performance better, many believed that moving from a monolithic processor design to a split-workload design, where multiple smaller machines collaborate to distribute the compute task, could work. This is similar to how we run cloud workloads today with microservices.

The CELL's general-purpose "leader" core, known as the PowerPC Processing Element (PPE), was responsible for directing the overall operation of the CELL circuitry. It acted as the central processing unit, managing the workload and delegating tasks to the "assistant" cores.

The CELL architecture has been classified as a Network-on-Chip (NoC) rather than a traditional System-on-Chip (SoC) due to its unconventional data bus design, known as the Element Interconnect Bus (EIB). The EIB is a novel interconnect topology devised by IBM to address the performance bottlenecks and congestion issues faced by demanding CPU components in previous architectures.

Unlike traditional single bus topologies, the token ring topology employed by the EIB aims to tackle large amounts of concurrent traffic. Data is transferred in the form of 128-bit packets, and each ring can accommodate up to three concurrent transfers, provided that the packets do not overlap.

In addition to the EIB for networking different parts of the CELL architecture, the system incorporates several high-performance interfaces responsible for data movement and communication. These interfaces include:

  • Broadband Engine Interface Unit (BEI): Responsible for facilitating communication between the CELL and external devices or systems.

  • Memory Interface Controller (MIC): Manages access to the main memory, with preferential priority on the EIB.

  • Flex I/O buses: Provide additional I/O capabilities for the CELL architecture.

PowerPC Processing Unit (PPU)

The PPE, or Power Processing Element, is the general-purpose "leader" core of the CELL architecture. Unlike previous iterations where IBM adapted existing processors to meet new requirements, the PPE is a new CPU design specifically crafted for the CELL architecture. However, it is based on the PowerPC instruction set architecture (ISA) version 2.02, which was the last PowerPC specification before being rebranded as the Power ISA. The PPE shares a lineage with the PowerPC G5, as both are descendants of the POWER4 architecture, which was primarily used in workstations and supercomputers.

IBM's decision to leverage PowerPC technology for the PPE was driven by several factors. First, PowerPC was a mature platform with approximately 10 years of testing and refinement in the Macintosh user base, meeting Sony's requirements for the CELL architecture. Additionally, the PowerPC architecture could be adapted to different environments if needed. Crucially, the use of a well-known and established architecture provided compatibility with existing compilers and codebases, offering a significant advantage for a new console.

The PPE implements the PowerPC ISA version 2.02, including optional opcodes for floating-point square root operations. It has also been extended with a SIMD (Single Instruction, Multiple Data) instruction set called the Vector/SIMD Multimedia Extension (VMX). Notably, some elements from the original PowerPC specification are missing in the PPE, such as little-endian mode (the CELL operates only in big-endian mode) and a handful of opcodes.

It's important to highlight that while the PPE leverages existing PowerPC technology, IBM constructed the PPE from the ground up, following the PowerPC 2.02 specification, rather than simply adapting an existing processor. This approach allowed IBM to optimize the PPE's design for the CELL's unique multi-core architecture and performance requirements, and also build interaction capabilities with the SPU.

Synergistic Processing Unit (SPU)

The Synergistic Processor Unit (SPU) is programmed using its own instruction set architecture. While both the SPU and the PowerPC Processing Unit (PPU) follow the Reduced Instruction Set Computer (RISC) design philosophy, the SPU's ISA is proprietary and primarily composed of Single Instruction Multiple Data (SIMD) instructions. The SPU features 128 general-purpose registers, each 128 bits wide, which hold vectors of 32-bit or 16-bit fixed-point or floating-point values. To conserve memory, SPU instructions are markedly compact, just 32 bits in length: the initial segment contains the opcode, while the remaining portion can reference up to three operands to be computed in parallel.

This architecture bears resemblance to the preceding Vector Floating Point Unit introduced in the PlayStation 2, albeit with substantial enhancements. For instance, developers are no longer required to learn a proprietary assembly language specific to the SPU, as IBM and Sony provided toolkits facilitating programming of the SPUs using C++, C, or assembly.

The SPEs are somewhat general-purpose coprocessors, not restricted to a single application, allowing them to assist the PPE in a wide range of tasks, provided that developers can program them properly. However, the SPEs are intended as the "assistant" cores of the CELL architecture, working in collaboration with the "leader" PPE to provide accelerated vector processing capabilities. By leveraging the SPEs' specialized design and local memory, the CELL architecture aims to offload computationally intensive tasks from the PPE, potentially achieving significant performance gains in applications that can take advantage of parallel processing and vector operations.

The core of the SPE is the Synergistic Processor Unit (SPU), equivalent to the PPU in the PPE. But unlike the PPU, the SPU is isolated from the rest of the CELL architecture and does not share memory with the PPU or other SPUs. Instead, the SPU contains a Local Store (LS) used as a working space, and its contents can be moved back and forth using the Memory Flow Controller (MFC). In terms of functionality, the SPU is more limited than the PPU: it does not include memory management functions (address translation and memory protection) or advanced features like dynamic branch prediction. However, the SPU excels at vector processing operations. To program the SPU, developers use the PPU to invoke routines provided by the PlayStation 3's operating system. These routines upload an executable specifically written for the SPU to the target SPU and signal it to start execution. After that, the PPU maintains a reference to the SPU's thread for synchronization purposes.

The Memory Flow Controller (MFC) is the component that interconnects the SPU with the rest of the CELL architecture, acting as an interface similar to the PowerPC Processor Storage Subsystem (PPSS) in the PPE. The primary function of the MFC is to move data between the SPU's local store and the CELL's main memory, and to keep the SPU synchronized with its neighboring components. To perform its duties, the MFC embeds a Direct Memory Access (DMA) controller to handle communication between the Element Interconnect Bus (EIB) and the SPU's local store. Additionally, the MFC houses another component called the Synergistic Bus Interface (SBI) that sits between the EIB and the DMA controller. The SBI is a complex piece of circuitry that interprets commands and data received from outside and signals the internal units of the SPE. As the front door to the CELL architecture, the SBI operates in two modes: bus master (where the SPE is adapted to request data from outside) or bus slave (where the SPE is set to receive orders from outside). It's worth noting that, considering the limit of EIB packets (up to 128 bits long), the MFC's DMA block can only move up to 16 KB of data per DMA command. If a transfer exceeds this limit, the EIB will throw a "Bus Error" exception during execution.
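To make the 16 KB limit concrete, here is a portable-C sketch (split_transfer is a hypothetical helper, not an SPU intrinsic) of how a transfer larger than the limit would be broken into per-command chunks; on real hardware, each chunk would become one mfc_get/mfc_put command or one DMA-list entry:

```c
#include <stdint.h>

#define MFC_MAX_DMA_SIZE (16 * 1024) /* MFC limit: 16 KB per DMA command */

/* Hypothetical helper: split a transfer of 'size' bytes into chunk sizes
 * that each respect the 16 KB per-command limit. Returns the number of
 * DMA commands that would be needed. */
static int split_transfer(uint32_t size, uint32_t *chunk_sizes, int max_chunks) {
    int n = 0;
    while (size > 0 && n < max_chunks) {
        uint32_t chunk = size > MFC_MAX_DMA_SIZE ? MFC_MAX_DMA_SIZE : size;
        chunk_sizes[n++] = chunk;
        size -= chunk;
    }
    return n;
}
```

For example, a 40 KB transfer splits into three commands of 16 KB, 16 KB, and 8 KB.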

Programming for the CELL Architecture

As the CELL architecture is radically different from regular PC architectures, it has been known to be very hard to program for (by design). Despite this, IBM proposed some ideas on how programmers can build applications on the CELL architecture.

PPE-centric Approaches

These approaches place the primary computational responsibilities on the PPE, treating the SPEs as co-processors or accelerators to offload specific tasks or workloads. There are three main patterns within this category:

  1. Multistage Pipeline Model:

    • The PPE acts as the driver, sending work to the first SPE in the pipeline.

    • Each SPE performs its assigned computations and passes the results to the next SPE in a pipeline fashion. The final SPE in the chain sends the processed data back to the PPE.

    • This model is not recommended for primary tasks due to the high inter-SPE communication overhead and complexity in managing the pipeline.

  2. Parallel Stages Model:

    • The PPE decomposes the main task into independent sub-tasks.

    • Each sub-task is assigned to a different SPE for parallel execution. SPEs return their processed results to the PPE upon completion.

    • The PPE combines the results from all SPEs to produce the final output.

    • This approach can be effective for heavily parallelized workloads with minimal data dependencies.

  3. Services Model:

    • Each SPE is assigned a specific service or job (e.g., audio decoding, video encoding, physics calculations). Service assignments to SPEs can be dynamically adjusted based on the program's changing requirements.

    • The PPE acts as a job dispatcher, sending input data to the appropriate SPE based on the required service. While awaiting results, the PPE can perform other tasks or manage system resources.

    • This model is suitable for applications with distinct, well-defined tasks that can be offloaded to dedicated SPEs.

SPE-centric Approaches

In contrast to the PPE-centric models, SPE-centric approaches place a greater emphasis on the SPEs as the primary computational engines, with the PPE playing a supporting role in resource management and coordination.

  • Using their internal Direct Memory Access (DMA) units, SPEs directly fetch and execute tasks stored in main memory.

  • The PPE is responsible for initially setting up these tasks in memory and allocating them to the appropriate SPEs. Once the SPEs begin executing their assigned tasks, the PPE's involvement is minimized, allowing the SPEs to operate with a high degree of autonomy.

  • This approach can potentially unlock higher levels of parallelism and efficiency, as the SPEs are not constrained by frequent communication with the PPE.

  • However, it also introduces challenges in terms of data partitioning, synchronization, resource management, fault tolerance, and portability to other architectures.

Hybrid Approaches

In practice, many applications may benefit from a hybrid approach that combines elements of both PPE-centric and SPE-centric models. For example:

  • The PPE could handle high-level task management and coordinate the overall workflow, while computationally intensive or parallelizable workloads are offloaded to the SPEs using an SPE-centric model.

  • The PPE could also perform pre-processing or post-processing tasks on the data before or after the SPE computations.

  • Communication and data transfer between the PPE and SPEs could be minimized by strategically partitioning the workload and leveraging DMA transfers.

The choice of programming style depends on various factors, including the nature of the application, performance requirements, data dependencies, and the development team's familiarity with the Cell BE architecture and programming models.

While the heterogeneous multi-core architecture of the Cell BE offered significant computational power, programming it effectively presented several challenges that developers had to overcome:

  1. Code Partitioning and Load Balancing:

    • Determining the optimal partitioning of code and data between the PPE and SPEs was crucial for maximizing performance.

    • Load balancing the workload across the SPEs to avoid bottlenecks and ensure efficient utilization of resources was a complex task.

    • Developers had to carefully analyze data dependencies, communication patterns, and computational requirements to make informed partitioning decisions.

  2. Memory Management and Data Transfers:

    • Each SPE had a limited local store (256 KB) for instructions and data, necessitating careful memory management. Data transfers between main memory and SPE local stores had to be orchestrated efficiently using DMA transfers to minimize stalls and bottlenecks.

    • Techniques like double-buffering and software caching were often employed to overlap computation with data transfers.

  3. SIMD Vectorization:

    • The SPEs featured a SIMD instruction set, enabling parallel operations on vectors of data. To fully leverage the SIMD capabilities, developers had to vectorize their code, which involved restructuring algorithms and data layouts to expose parallelism.

    • Compiler auto-vectorization support was limited, often requiring manual vectorization efforts by developers.

  4. Branch Prediction and Control Flow:

    • The SPEs lacked dynamic branch prediction hardware, making control-intensive code with unpredictable branches less efficient.

    • Developers had to employ techniques like software pipelining, predication, and branch hint instructions to mitigate the impact of branch mispredictions.

  5. Synchronization and Communication:

    • Coordinating the execution and communication between the PPE and multiple SPEs required careful synchronization mechanisms to avoid race conditions and ensure data coherency.

    • Efficient inter-core communication protocols and messaging schemes had to be implemented, often leveraging mailboxes, signal notifiers, and DMA transfers.

  6. Debugging and Profiling:

    • Debugging and profiling parallel applications on the heterogeneous Cell BE architecture posed challenges due to the distributed nature of execution and limited visibility into the SPEs.

    • Specialized debugging and profiling tools were developed by IBM, Sony, and third-party vendors to aid developers in identifying performance bottlenecks and optimizing their applications.
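To illustrate the predication technique from point 4, here is a portable-C sketch (function names are made up for illustration) of the bitwise-select pattern that the SPU's selb (select bits) instruction provides in hardware — a comparison produces an all-ones or all-zeros mask, so no unpredictable branch is executed:

```c
#include <stdint.h>

/* selb-style select: for each bit, take b where mask is 1, a where mask is 0 */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

/* Branch-free max: the comparison result is stretched into a full-width
 * mask (0xFFFFFFFF if a < b, else 0), then fed to the select */
static uint32_t max_branchfree(uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)-(int32_t)(a < b);
    return select_bits(a, b, mask);
}
```

On the SPE, the same pattern appears as spu_cmpgt producing a vector mask that feeds spu_sel, with four lanes decided at once.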

This gives us a lot of challenges to work with, but also some very interesting opportunities to leverage coding styles and C trickery that many might be unfamiliar with when working with typical x86 binaries.

Understanding the Mathematics

The atan2 function, which computes the angle between the positive x-axis and a vector (x, y) in the Cartesian plane, is a fundamental operation in various scientific and engineering applications, such as computer graphics, robotics, and signal processing.

Mathematically, the arctangent function is expressed as:

$$\arctan(x) = \theta \quad \text{where} \quad \tan(\theta) = x$$

The range of the arctangent function is typically [-π/2, π/2] radians, or [-90°, 90°] degrees. However, in many applications, it is desirable to have the full range of [-π, π] radians, or [-180°, 180°] degrees. This is achieved by considering the signs of both the input x and y values, leading to the atan2(y, x) function, which is a variation of the arctangent function.

The atan2(y, x) function computes the angle (in radians) between the positive x-axis and the vector (x, y) in the Cartesian plane. It is defined as:

$$\operatorname{atan2}(y, x) = \arctan(y/x) \quad \text{for } x > 0$$

The first step in computing atan2 is to reduce the input arguments (x, y) to a specific range using well-known trigonometric identities. This process, known as argument reduction, simplifies the subsequent computation by limiting the input domain to a narrow interval. For atan2, the following identities are commonly used:

  • atan2(-y, x) = -atan2(y, x)

  • atan2(y, -x) = π - atan2(y, x)

  • atan2(y, x) = π/2 - atan2(x, y) for |y| > |x|

Once the input arguments have been reduced, the core computation involves evaluating the atan(y/x) function on the reduced range. A common approach is to use a minimax polynomial approximation, which provides an accurate approximation of the function over a narrow interval.

Minimax polynomial approximation is a technique used in numerical analysis to find a polynomial P(x) of a given degree n that closely approximates a function f(x) over a specified interval [a, b]. The primary objective of this method is to minimize the maximum absolute deviation between the polynomial approximation and the function across the interval. This is quantified as:

$$\max_{x \in [a, b]} |f(x) - P(x)|$$

where the left-hand side represents the peak absolute error. The polynomial P(x) is chosen to minimize this maximum error, hence the term "minimax". The polynomial approximation can be expressed as:

$$P(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)^2 + \ldots + c_n (x - x_0)^n$$

Here, x_0 typically represents a central point within the interval [a, b], often the midpoint, which can be strategically chosen to reduce computational complexity or improve the symmetry of the error distribution. The constants c_0, c_1, ..., c_n are the coefficients of the polynomial, determined so as to achieve the minimax objective.

The Remez exchange algorithm is a popular method used to compute these coefficients. It iteratively adjusts the coefficients and the set of points at which the maximum error occurs, seeking to equalize the error at these points and minimize its peak value. This algorithm is particularly well-suited for finding minimax solutions due to its efficiency in handling the non-linear nature of the problem.

The evaluation of the minimax polynomial approximation is a critical component of the overall atan2 implementation. Two common techniques for efficient polynomial evaluation are Horner's scheme and Estrin's scheme.

Horner's scheme is a numerically stable algorithm for polynomial evaluation that minimizes the number of multiplications. Given a polynomial of degree 𝑛:

$$P(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0$$

Horner’s scheme reformulates this polynomial as:

$$P(x) = (((\ldots(a_n x + a_{n-1}) x + a_{n-2}) \ldots ) x + a_1) x + a_0$$

This nested multiplication approach reduces the computational complexity from O(n^2) operations (if computed naively) to O(n) multiplications and O(n) additions, thus enhancing efficiency, especially for large n. This scheme is particularly effective in sequential processing environments because it updates the result one step at a time and requires maintaining only a single accumulator during computation.
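As a concrete sketch (the horner function is illustrative, not taken from the original library), Horner's scheme maps directly to a chain of multiply-adds on a single accumulator:

```c
/* Horner's scheme: n multiplies and n adds with a single accumulator.
 * Coefficients are ordered from highest degree to lowest:
 * a[0] holds a_n, a[n] holds a_0. */
static float horner(const float *a, int n, float x) {
    float acc = a[0];
    for (int i = 1; i <= n; ++i)
        acc = acc * x + a[i]; /* one multiply-add per coefficient */
    return acc;
}
```

Note that each iteration depends on the previous one through acc, which is exactly why Horner's scheme cannot exploit instruction-level parallelism.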

Estrin's scheme, on the other hand, evaluates the polynomial in a factored form using a series of fused multiply-add (FMA) instructions. While Estrin's scheme typically requires more instructions than Horner's scheme, it can exploit instruction-level parallelism more effectively, potentially improving performance on superscalar architectures like the CELL SPE. Like Horner's, it reduces the complexity of polynomial evaluation, but does so in a way that allows for concurrent execution of operations. For the same polynomial P(x), Estrin's scheme groups terms in pairs to enable parallel computation:

$$P(x) = (a_0 + a_1 x) + (a_2 + a_3 x)\,x^2 + \left((a_4 + a_5 x) + (a_6 + a_7 x)\,x^2\right)x^4 + \ldots$$

However, this representation works best for polynomials whose number of coefficients is a power of two, allowing balanced splitting. If it is not, dummy terms with coefficients of zero may be added to fit this scheme. Estrin's method is particularly beneficial on architectures that support instruction-level parallelism and FMA instructions, allowing multiple operations to be performed simultaneously.
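A degree-3 sketch of this grouping in C (estrin3 is an illustrative name, not part of the original library) makes the parallelism visible: the two pair terms have no data dependency on each other, so a superscalar core, or the SPE's dual-issue pipelines, can compute them concurrently.

```c
/* Estrin's scheme for a_3 x^3 + a_2 x^2 + a_1 x + a_0:
 * the pairs (a0 + a1*x) and (a2 + a3*x) are independent and can issue
 * in parallel; a final multiply-add with x^2 combines them. */
static float estrin3(float a0, float a1, float a2, float a3, float x) {
    float x2 = x * x;
    float lo = a0 + a1 * x; /* independent of... */
    float hi = a2 + a3 * x; /* ...this computation */
    return lo + hi * x2;    /* combine: lo + hi * x^2 */
}
```

Compare with Horner's form of the same polynomial, which would need three sequential multiply-adds with no overlap.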

Design Considerations of the Code

By combining careful algorithm design alongside many optimization techniques, significant performance gains can be achieved for atan2 computation on the CELL SPE processor.

Using Local Store Memory

The CELL BE had a heterogeneous multi-core architecture consisting of one PowerPC-based Power Processing Element (PPE) and eight specialized co-processors called Synergistic Processing Elements (SPEs). Each SPE had its own Synergistic Processing Unit (SPU) and a small local memory (256 KB) referred to as the Local Store (LS).

The SPU's LS is a high-speed scratchpad memory, and it is designed to be the primary working memory for the SPU. Data had to be explicitly transferred between the main system memory and the LS using Direct Memory Access (DMA) operations, as the SPU could not directly access the main memory.

The LS in the Cell BE SPEs is fundamentally different from a normal hardware cache as it is a software-managed memory, meaning that the programmer has explicit control over data transfers between the LS and the main memory using DMA operations. In contrast, a hardware cache is transparent to the programmer and automatically managed by the processor's memory management unit (MMU).

The LS also doesn't sync with the main memory or other LS. Data consistency must be explicitly managed by the programmer through DMA transfers and synchronization primitives. On the other hand, hardware caches maintain coherency with the main memory and other caches in the system, transparently to the programmer.

// Global Floating-point constants (32 bit)

static const vector unsigned int _cp_f_pio4 = { 0x3f490fdb, 0x3f490fdb, 0x3f490fdb, 0x3f490fdb };
static const vector unsigned int _cp_f_t3p8 = { 0x3fe5ec5d, 0x3fe5ec5d, 0x3fe5ec5d, 0x3fe5ec5d };
static const vector unsigned int _cp_f_npio2 = { 0xbfc90fdb, 0xbfc90fdb, 0xbfc90fdb, 0xbfc90fdb };
static const vector unsigned int _cp_f_pio2 = { 0x3fc90fdb, 0x3fc90fdb, 0x3fc90fdb, 0x3fc90fdb };
static const vector unsigned int _cp_f_pt66 = { 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab, 0x3f2aaaab };
static const vector unsigned int _cp_f_pi = { 0x40490fdb, 0x40490fdb, 0x40490fdb, 0x40490fdb };
static const vector unsigned int _cp_f_npi = { 0xc0490fdb, 0xc0490fdb, 0xc0490fdb, 0xc0490fdb };
static const vector unsigned int _cp_f_morebits = { 0x38800000, 0x38800000, 0x38800000, 0x38800000 };
static const vector unsigned int _cp_f_hmorebits = { 0x34000000, 0x34000000, 0x34000000, 0x34000000 };

// Helper functions to load constant values into quadwords
static inline qword cp_flpio4(void) { return si_lqa((intptr_t)&_cp_f_pio4); }
static inline qword cp_flt3p8(void) { return si_lqa((intptr_t)&_cp_f_t3p8); }
static inline qword cp_flnpio2(void) { return si_lqa((intptr_t)&_cp_f_npio2); }
static inline qword cp_flpio2(void) { return si_lqa((intptr_t)&_cp_f_pio2); }
static inline qword cp_flpt66(void) { return si_lqa((intptr_t)&_cp_f_pt66); }
static inline qword cp_flpi(void) { return si_lqa((intptr_t)&_cp_f_pi); }
static inline qword cp_flnpi(void) { return si_lqa((intptr_t)&_cp_f_npi); }
static inline qword cp_filzero(void) { return si_ilhu((int16_t)0x0000); }
static inline qword cp_filnzero(void) { return si_ilhu((int16_t)0x8000); }
static inline qword cp_filone(void) { return si_ilhu((int16_t)0x3f80); }
static inline qword cp_filtwo(void) { return si_ilhu((int16_t)0x4000); }
static inline qword cp_filinf(void) { return si_ilhu((int16_t)0x7f80); }
static inline qword cp_filninf(void) { return si_ilhu((int16_t)0xff80); }
static inline qword cp_filnan(void) { return si_ilhu((int16_t)0x7fc0); }

// Polynomial coefficients for cp_fatan approximation
static const vector unsigned int _cp_f_atan_p7 = { 0x3c08876a, 0x3c08876a, 0x3c08876a, 0x3c08876a };
static const vector unsigned int _cp_f_atan_p6 = { 0xbd954629, 0xbd954629, 0xbd954629, 0xbd954629 };
static const vector unsigned int _cp_f_atan_p5 = { 0x3f8a07c1, 0x3f8a07c1, 0x3f8a07c1, 0x3f8a07c1 };
static const vector unsigned int _cp_f_atan_p4 = { 0xbf49eee6, 0xbf49eee6, 0xbf49eee6, 0xbf49eee6 };
static const vector unsigned int _cp_f_atan_p3 = { 0x3ee4f8b5, 0x3ee4f8b5, 0x3ee4f8b5, 0x3ee4f8b5 };
static const vector unsigned int _cp_f_atan_p2 = { 0xbf62365c, 0xbf62365c, 0xbf62365c, 0xbf62365c };
static const vector unsigned int _cp_f_atan_p1 = { 0x3f490965, 0x3f490965, 0x3f490965, 0x3f490965 };
static const vector unsigned int _cp_f_atan_p0 = { 0xbf2697e0, 0xbf2697e0, 0xbf2697e0, 0xbf2697e0 };

// Higher-degree polynomial coefficients for cp_fatan approximation
static const vector unsigned int _cp_f_atan_q7 = { 0x3c0897d0, 0x3c0897d0, 0x3c0897d0, 0x3c0897d0 };
static const vector unsigned int _cp_f_atan_q6 = { 0xbd890e31, 0xbd890e31, 0xbd890e31, 0xbd890e31 };
static const vector unsigned int _cp_f_atan_q5 = { 0x3f6c4616, 0x3f6c4616, 0x3f6c4616, 0x3f6c4616 };
static const vector unsigned int _cp_f_atan_q4 = { 0xbf2bbfc2, 0xbf2bbfc2, 0xbf2bbfc2, 0xbf2bbfc2 };
static const vector unsigned int _cp_f_atan_q3 = { 0x3eb6679b, 0x3eb6679b, 0x3eb6679b, 0x3eb6679b };
static const vector unsigned int _cp_f_atan_q2 = { 0xbf56c0c9, 0xbf56c0c9, 0xbf56c0c9, 0xbf56c0c9 };
static const vector unsigned int _cp_f_atan_q1 = { 0x3f7df927, 0x3f7df927, 0x3f7df927, 0x3f7df927 };
static const vector unsigned int _cp_f_atan_q0 = { 0xbf800000, 0xbf800000, 0xbf800000, 0xbf800000 };
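These tables store the constants as raw IEEE-754 bit patterns rather than decimal float literals, which keeps the values bit-exact regardless of how a compiler rounds decimal input. A portable sketch (bits_to_float is an illustrative helper, not part of the SPU library) shows how such a pattern maps back to a float:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit IEEE-754 bit pattern as a float; memcpy makes
 * the type pun well-defined in C. */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For instance, 0x3f490fdb decodes to 0.78539816... ≈ π/4, the value _cp_f_pio4 holds in each of its four lanes, and 0x40490fdb decodes to π as held by _cp_f_pi.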

In programming for the CELL SPE, it is crucial to ensure memory alignment, as the Cell BE's DMA operations and SIMD instructions often require 16-byte alignment for optimal efficiency. Explicit data transfer management between the main memory and the LS must be handled by the programmer using DMA operations, which involves initiating transfers, waiting for completion, and synchronizing data to ensure consistency. Implementing double buffering can overlap computation with data transfer, hiding latency and improving performance. Prefetching data into the LS ahead of time reduces latency and ensures that the SPU has continuous access to the required data. Synchronization primitives such as barriers and signals are essential to maintain data consistency between the LS and the main memory, as well as between different SPEs.

Going Further to Direct Memory Access

Double buffering for DMA (Direct Memory Access) is a technique that allows for the overlapping of data transfers and computation, thus improving the efficiency of a system, particularly in the CELL BE architecture. The CELL BE comprises a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs). Each SPE has its own Synergistic Processing Unit (SPU) and Local Store (LS). The LS is a high-speed scratchpad memory used by the SPU, but data must be explicitly transferred between the LS and the main memory using DMA operations.

In double buffering, two buffers are used alternately to transfer and process data. While one buffer is being processed by the SPU, the other buffer is being filled with new data from the main memory. This technique ensures that the SPU can continue processing without waiting for data transfers to complete, thereby hiding the latency of memory operations and increasing overall throughput.

  1. Buffer Initialization: Two buffers, buffer0 and buffer1, are allocated in the LS. Pointers current_buffer and next_buffer are used to manage these buffers.

  2. Initial DMA Transfer: The dma_transfer function is called to initiate the first DMA transfer, filling current_buffer with data from the main memory.

  3. Processing and Overlapping Transfer: While the SPU processes the data in current_buffer, the next DMA transfer fills next_buffer with the subsequent data chunk. This overlap minimizes idle time.

  4. Swapping Buffers: After processing the current_buffer, the results are written back to the main memory. The buffers and DMA tags are then swapped, and the process repeats.

#define BUFFER_SIZE 128 // Example buffer size, in (y, x) input pairs
#define TAG 1           // Base DMA tag

// Fetch BUFFER_SIZE (y, x) pairs -- 2 * BUFFER_SIZE qwords -- from main memory
void dma_transfer(qword *buffer, uint32_t ea, int tag) {
    mfc_get(buffer, ea, 2 * BUFFER_SIZE * sizeof(qword), tag, 0, 0);
}

// Block until every DMA transfer issued with 'tag' has completed
void dma_wait(int tag) {
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

// Double buffering for DMA transfers
void process_with_double_buffering(uint32_t ea_data, uint32_t ea_result, int num_elements) {
    // Each buffer holds BUFFER_SIZE interleaved (y, x) pairs
    qword buffer0[2 * BUFFER_SIZE] __attribute__((aligned(16)));
    qword buffer1[2 * BUFFER_SIZE] __attribute__((aligned(16)));
    qword *current_buffer = buffer0;
    qword *next_buffer = buffer1;
    int current_tag = TAG;
    int next_tag = TAG + 1;

    // Initial DMA transfer
    dma_transfer(current_buffer, ea_data, current_tag);

    for (int i = 0; i < num_elements; i += BUFFER_SIZE) {
        // Start the next DMA transfer so it overlaps with processing
        if (i + BUFFER_SIZE < num_elements) {
            dma_transfer(next_buffer, ea_data + (i + BUFFER_SIZE) * 2 * sizeof(qword), next_tag);
        }

        // Wait for the current buffer's data to arrive, then process it in place
        dma_wait(current_tag);
        for (int j = 0; j < BUFFER_SIZE && (i + j) < num_elements; ++j) {
            qword y = current_buffer[j * 2];
            qword x = current_buffer[j * 2 + 1];
            current_buffer[j] = _cp_fatan2(y, x); // One result qword per input pair
        }

        // Write results back to main memory and wait for completion
        mfc_put(current_buffer, ea_result + i * sizeof(qword), BUFFER_SIZE * sizeof(qword), current_tag, 0, 0);
        dma_wait(current_tag);

        // Swap buffers and tags
        qword *temp_buffer = current_buffer;
        current_buffer = next_buffer;
        next_buffer = temp_buffer;
        int temp_tag = current_tag;
        current_tag = next_tag;
        next_tag = temp_tag;
    }
}

The buffers are aligned using __attribute__((aligned(16))) to ensure proper alignment for SIMD (Single Instruction, Multiple Data) operations. Proper alignment is crucial for SIMD operations because it allows data to be loaded and stored in chunks, minimizing the number of memory accesses and maximizing the throughput of data processing. Misaligned data can cause additional overhead and reduce the efficiency of SIMD instructions.

Then, the dma_transfer function starts the first DMA transfer, loading data from the main memory into current_buffer. The function uses mfc_get to initiate the transfer asynchronously; completion is awaited separately with mfc_write_tag_mask and mfc_read_tag_status_all just before the data is processed, so the transfer itself can overlap with other work.

Inside the loop, while the SPU processes the data in current_buffer, the next DMA transfer is initiated to load data into next_buffer. This is achieved by checking whether more elements remain to be processed and, if so, calling dma_transfer for next_buffer.

After processing current_buffer, the results are written back to the main memory using mfc_put. The buffers and tags are then swapped, allowing the next iteration of the loop to process the newly loaded data while the previous results are being transferred back.

Loop Unrolling through Parallelism

Loop unrolling is an optimization technique used to enhance the performance of loops by decreasing the loop control overhead and increasing the instruction-level parallelism. This technique involves replicating the loop body multiple times within a single iteration, thereby reducing the number of iterations and allowing more operations to be executed in parallel.

Consider a loop that processes one element per iteration:

for (int i = 0; i < n; ++i) {
    process(data[i]);
}

If we unroll this loop by a factor of 4, the loop body is replicated four times, and the loop control instructions are adjusted accordingly:

for (int i = 0; i < n; i += 4) {
    process(data[i]);
    process(data[i + 1]);
    process(data[i + 2]);
    process(data[i + 3]);
}
By doing this, the number of iterations is reduced by a factor of 4, which cuts the loop control overhead (incrementing the counter and checking the loop condition) and exposes more opportunities for parallel execution. Note that this form assumes n is a multiple of 4; otherwise a short remainder loop is needed for the leftover elements.

Loop unrolling enhances the efficiency of the double buffering technique by minimizing the overhead of loop control operations and increasing the degree of parallelism. This is particularly beneficial in the CELL BE architecture, where the SPU can take advantage of SIMD instructions to process multiple data points concurrently.

Here’s how loop unrolling can be applied in the code :

#define BUFFER_SIZE 128
#define TAG 1
#define UNROLL_FACTOR 4

void doublebuff(uint32_t ea_data, uint32_t ea_result, int num_elements) {
    qword buffer0[BUFFER_SIZE] __attribute__((aligned(16)));
    qword buffer1[BUFFER_SIZE] __attribute__((aligned(16)));
    qword *current_buffer = buffer0;
    qword *next_buffer = buffer1;
    int current_tag = TAG;
    int next_tag = TAG + 1;

    // Initial DMA transfer
    dma_transfer(current_buffer, ea_data, current_tag);

    for (int i = 0; i < num_elements; i += BUFFER_SIZE / 2) {
        if (i + BUFFER_SIZE / 2 < num_elements) {
            dma_transfer(next_buffer, ea_data + (i + BUFFER_SIZE / 2) * 2 * sizeof(qword), next_tag);
        }

        // Wait for the current buffer's data to arrive
        mfc_write_tag_mask(1 << current_tag);
        mfc_read_tag_status_all();

        // Process current buffer with loop unrolling
        for (int j = 0; j < BUFFER_SIZE / 2; j += UNROLL_FACTOR) {
            for (int k = 0; k < UNROLL_FACTOR && (i + j + k) < num_elements; ++k) {
                qword y = current_buffer[(j + k) * 2];
                qword x = current_buffer[(j + k) * 2 + 1];
                current_buffer[j + k] = _cp_fatan2(y, x);
            }
        }

        mfc_put(current_buffer, ea_result + i * sizeof(qword), (BUFFER_SIZE / 2) * sizeof(qword), current_tag, 0, 0);
        mfc_write_tag_mask(1 << current_tag);
        mfc_read_tag_status_all();

        qword *temp_buffer = current_buffer;
        current_buffer = next_buffer;
        next_buffer = temp_buffer;
        int temp_tag = current_tag;
        current_tag = next_tag;
        next_tag = temp_tag;
    }
}
The buffers are initialized in the LS, aligned to 16-byte boundaries for efficient SIMD operations. The first DMA transfer fills current_buffer with data from the main memory. Inside the loop, the body is unrolled by a factor of 4, which reduces the number of iterations and lets multiple elements be processed per iteration, cutting loop overhead and improving parallelism. After processing current_buffer, the results are written back to the main memory, and the buffers are swapped for the next iteration.

Polynomial Approximation Through Fused Multiply-Add (FMA) Operations

Now we can start getting into the mathematics of it all. It starts by computing the arctangent of a single-precision floating-point value, or a vector of four such values, using a polynomial approximation. We can implement this with Horner's method, evaluating the numerator and denominator polynomials as chains of Fused Multiply-Add (FMA) instructions, and then dividing the numerator by the denominator to obtain the final result.

const qword xp2 = si_fm(range_x, range_x);
const qword znum0 = f_atan_p0;
const qword znum1 = si_fma(znum0, xp2, f_atan_p1); // FMA: (znum0 * xp2) + f_atan_p1
const qword znum2 = si_fma(znum1, xp2, f_atan_p2); // FMA: (znum1 * xp2) + f_atan_p2
const qword znum3 = si_fma(znum2, xp2, f_atan_p3); // FMA: (znum2 * xp2) + f_atan_p3
const qword znum = si_fma(znum3, xp2, f_atan_p4); // FMA: (znum3 * xp2) + f_atan_p4
const qword zden0 = si_fa(xp2, f_atan_q0);
const qword zden1 = si_fma(zden0, xp2, f_atan_q1); // FMA: (zden0 * xp2) + f_atan_q1
const qword zden2 = si_fma(zden1, xp2, f_atan_q2); // FMA: (zden1 * xp2) + f_atan_q2
const qword zden3 = si_fma(zden2, xp2, f_atan_q3); // FMA: (zden2 * xp2) + f_atan_q3
const qword zden = si_fma(zden3, xp2, f_atan_q4); // FMA: (zden3 * xp2) + f_atan_q4

A polynomial of degree n can be written as

P(x) = a₀ + a₁x + a₂x² + ... + aₙxⁿ
     = a₀ + x(a₁ + x(a₂ + ... + x(aₙ₋₁ + xaₙ)))

This representation allows the polynomial to be evaluated using nested multiplication and addition operations, with each level of nesting corresponding to a higher power of x. In the _cp_fatan implementation, the numerator polynomial is evaluated as:

$$znum = f_atan_p4 + xp2 * (f_atan_p3 + xp2 * (f_atan_p2 + xp2 * (f_atan_p1 + xp2 * f_atan_p0)))$$

Similarly, the denominator polynomial is evaluated as:

$$zden = f_atan_q4 + xp2 * (f_atan_q3 + xp2 * (f_atan_q2 + xp2 * (f_atan_q1 + xp2 * (f_atan_q0 + xp2))))$$

Horner's scheme is particularly well-suited for architectures that support Fused Multiply-Add (FMA) instructions, as each level of nesting can be computed with a single FMA operation. This is precisely what the implementation does, using the si_fma intrinsic to fold each multiply-and-add pair into one instruction. Evaluating the numerator and denominator side by side also gives the SPU two independent dependency chains to interleave.

The main advantage of Horner's method is that it minimizes the number of operations required to evaluate a polynomial: a polynomial of degree n needs only n multiplications and n additions, which is optimal in terms of operation count.

However, the SPU has no single-instruction floating-point divide, so the final division of znum by zden must be synthesized from a reciprocal estimate. The si_frest intrinsic supplies an initial estimate zden_r0 ≈ 1 / zden that is accurate to only about 12 bits, which is then refined with a Newton-Raphson step built from FMA operations. The first step computes the residual 1 - zden_r0 * zden using the si_fnms (Fused Negative Multiply-Subtract) intrinsic:

const qword zden_r1 = si_fnms(zden_r0, zden, f_one);

This computes 1.0 - (zden_r0 * zden), the error of the reciprocal estimate: if zden_r0 were exactly 1 / zden, the result would be zero.

Next, the estimate is improved by folding this residual back in with another FMA operation:

const qword zden_r = si_fma(zden_r1, zden_r0, zden_r0);

This operation computes zden_r1 * zden_r0 + zden_r0, which is algebraically zden_r0 * (2 - zden_r0 * zden), the classic Newton-Raphson update for a reciprocal. The result, zden_r, is a much more accurate approximation of 1 / zden.

With the refined reciprocal zden_r, the final division reduces to a single multiply:

const qword zdiv = si_fm(znum, zden_r);

This step computes zdiv = znum * zden_r ≈ znum / zden, the value of the rational approximation.

The point of the refinement is accuracy: si_frest alone would limit the whole computation to about 12 good bits. One Newton-Raphson iteration roughly doubles the number of correct bits, bringing the quotient close to full single precision at the cost of two extra FMA-class instructions.

Mathematically, the refinement can be written as follows. Let the initial estimate carry a relative error ε, i.e.

$$zden_r0 = \frac{1 - \varepsilon}{zden}$$

Then:

$$zden_r1 = 1 - zden_r0 \cdot zden = \varepsilon$$

$$zden_r = zden_r0 + zden_r1 \cdot zden_r0 = \frac{(1 - \varepsilon)(1 + \varepsilon)}{zden} = \frac{1 - \varepsilon^2}{zden}$$

The relative error drops from ε to ε², so the iteration converges quadratically: with ε ≈ 2⁻¹², a single step leaves an error on the order of 2⁻²⁴, right at the edge of single precision.

Refining Values Through Range Reduction

Range reduction is crucial for improving the accuracy and efficiency of the arctangent approximation. It involves mapping the input value x into a smaller range, where the arctangent function can be more accurately approximated using a polynomial.

The CELL BE, like most modern processors, represents single-precision (32-bit) and double-precision (64-bit) floating-point numbers using a fixed number of bits for the significand (mantissa) and exponent. This limited precision can lead to rounding errors when representing certain values or performing arithmetic operations. For example, when computing the arctangent function using a polynomial approximation, the input values may need to be divided or multiplied by certain constants. If these constants cannot be represented exactly in the floating-point format, rounding errors can accumulate throughout the computation, potentially leading to inaccurate results.

const qword range0_mask = si_fcgt(pos_x, f_t3p8);
const qword range1_gt_mask = si_fcgt(f_pt66, pos_x);
const qword range1_eq_mask = si_fceq(f_pt66, pos_x);
const qword range1_mask = si_or(range1_gt_mask, range1_eq_mask);
const qword range2_mask = si_nor(range0_mask, range1_mask);

The range0_mask identifies cases where pos_x (the absolute value of the input x) is greater than tan(3π/8), which is a constant value f_t3p8. The si_fcgt intrinsic performs a floating-point greater-than comparison.

The range1_mask identifies cases where pos_x is less than or equal to 0.66, which is a constant value f_pt66. It is computed by combining two masks using si_or: range1_gt_mask (for pos_x < 0.66) and range1_eq_mask (for pos_x == 0.66).

The range2_mask identifies the remaining cases that are not covered by range0_mask or range1_mask. It is computed by performing a bitwise NOR operation (si_nor) on range0_mask and range1_mask.

These range masks are used to select different computations and approximations for the arctangent function, based on the range in which the input value falls.

For the range0 case, where pos_x is greater than tan(3π/8), the range reduction involves computing -1.0 / pos_x:

const qword range0_x0 = si_frest(pos_x);
const qword range0_x1 = si_fi(pos_x, range0_x0);
const qword range0_x2 = si_fnms(range0_x1, pos_x, f_one);
const qword range0_x3 = si_fma(range0_x2, range0_x1, range0_x1);
const qword range0_x = si_xor(range0_x3, f_msb);
const qword range0_y = f_pio2;

This computation is performed using a series of FMA (Fused Multiply-Add) instructions for improved accuracy and performance. The final result, range0_x, is then negated using an XOR operation with f_msb (which flips the sign bit). The corresponding range0_y value is set to π/2.

For the range1 case, where pos_x is less than or equal to 0.66, the range reduction is straightforward:

const qword range1_x = pos_x;
const qword range1_y = f_zero;

The range1_x value is simply set to pos_x, and the corresponding range1_y value is set to 0.0.

For the range2 case, which covers the remaining values of pos_x, the range reduction involves computing (pos_x - 1.0) / (pos_x + 1.0):

const qword range2_y = f_pio4;
const qword range2_x0num = si_fs(pos_x, f_one);
const qword range2_x0den = si_fa(pos_x, f_one);
const qword range2_x0 = si_frest(range2_x0den);
const qword range2_x1 = si_fnms(range2_x0, range2_x0den, f_one);
const qword range2_x2 = si_fma(range2_x1, range2_x0, range2_x0);
const qword range2_x = si_fm(range2_x0num, range2_x2);

This computation is also performed using FMA instructions for efficiency. The corresponding range2_y value is set to π/4.

After computing the range-specific values (range0_x, range1_x, range2_x, range0_y, range1_y, range2_y), the appropriate values are selected based on the range masks:

const qword range_x0 = si_selb(range2_x, range0_x, range0_mask);
const qword range_x = si_selb(range_x0, range1_x, range1_mask);
const qword range_y0 = si_selb(range2_y, range0_y, range0_mask);
const qword range_y = si_selb(range_y0, range1_y, range1_mask);

The si_selb intrinsic is used to perform a conditional selection operation, selecting the first value if the corresponding mask bit is 0, and the second value if the mask bit is 1. The range-specific values range_x and range_y are then used in the subsequent polynomial approximation step of the _cp_fatan function.

The purpose of this range reduction is to map the input value x into a small interval around zero (here roughly [-0.42, 0.66]) where the arctangent is accurately approximated by a low-degree rational polynomial. By decomposing the input domain into sub-ranges and applying a different reduction identity to each, the implementation achieves higher accuracy and efficiency than a single polynomial over the whole domain could.

Special Case Handling for Weird Values

We need to handle special cases to ensure correct results for corner cases and exceptional inputs such as zero, infinity, and NaN (Not a Number). This makes sure the approximation adheres to the mathematical specification and produces accurate results even for inputs that would otherwise lead to undefined or incorrect behavior.

Positive and Negative Infinite Inputs

When one of the inputs (x or y) is infinite, the mathematical behavior of the atan2(y, x) function needs to be explicitly defined. For example, when y is positive infinity and x is negative infinity, the result should be 3π/4 radians. These cases cannot be handled by the general polynomial approximation used for finite input values.

Masks x_eqinf_mask and x_eqninf_mask are created using si_fceq to identify if x is equal to positive or negative infinity, respectively.

If x is positive infinity, the result should be π/2. This is achieved by selecting between result_y0 (the result from the previous step) and f_pio2 (the constant value for π/2) based on the x_eqinf_mask:

const qword result_y1 = si_selb(result_y0, f_pio2, x_eqinf_mask);

If x is negative infinity, the result should be -π/2. This is achieved by selecting between result_y1 (the result from the previous step) and f_npio2 (the constant value for -π/2) based on the x_eqninf_mask:

const qword result = si_selb(result_y1, f_npio2, x_eqninf_mask);

If Input is Zero

When one or both inputs are zero, the mathematical behavior of atan2(y, x) becomes undefined or depends on the specific signs of x and y. For instance, when y is 0 and x is negative, the result should be π radians, while when y is 0 and x is positive, the result should be 0 radians. These cases require special handling to ensure correct results.

const qword x_eqz_mask = si_fceq(f_zero, x);
const qword result_y0 = si_selb(pos_yaddz, x, x_eqz_mask);

const qword x_eqinf_mask = si_fceq(f_inf, x);
const qword x_eqninf_mask = si_fceq(f_ninf, x);
const qword result_y1 = si_selb(result_y0, f_pio2, x_eqinf_mask);
const qword result = si_selb(result_y1, f_npio2, x_eqninf_mask);

The first step is to identify if the input x is equal to 0.0 using the si_fceq intrinsic, which performs a floating-point equal-to comparison. The result is stored in the x_eqz_mask.

If x is equal to 0.0, the result should be 0.0. This is achieved by selecting between pos_yaddz (the computed arctangent value from the previous steps) and x (which is 0.0) based on the x_eqz_mask. The si_selb intrinsic is used for this conditional selection operation:

const qword result_y0 = si_selb(pos_yaddz, x, x_eqz_mask);

Not a Number (NaN)

If either x or y is NaN, the result of atan2(y, x) should also be NaN, according to the IEEE 754 floating-point standard. NaN values represent invalid or undefined results, and they need to be propagated correctly through mathematical operations. Rounding behavior also matters here: the PPE supports the usual IEEE rounding modes (round-to-nearest, toward positive infinity, toward negative infinity, and toward zero), while the SPU's single-precision pipeline only rounds toward zero. The effective rounding can change the final result, especially for inputs near the boundaries of the representable range; for example, when computing atan2(y, x) with x or y close to zero, rounding determines whether the result lands on zero or a nearby non-zero value.

The SPU's single-precision pipeline, unlike the PPE, does not fully implement IEEE 754 semantics for exceptional values: infinities and NaNs are not treated specially by the arithmetic itself, which is why the implementation detects them explicitly with bit-pattern comparisons. The behavior of operations involving these values is therefore not always intuitive or consistent. For example, when computing atan2(y, x) with one input infinite and the other finite, the result depends on the specific signs of the inputs and requires the explicit handling above to stay accurate and consistent.

Thoughts and Conclusions

One key advantage of a SPE-centric approach to vectorizing mathematical functions is that it can potentially unlock higher levels of parallelism and efficiency, as the SPEs are not constrained by the need to continuously communicate with the PPE. This can be particularly beneficial for workloads that can be effectively parallelized and distributed across the multiple SPEs.

SPUs also feature a SIMD instruction set with 128-bit registers, each holding, for example, four 32-bit or eight 16-bit values, whether fixed-point or floating-point. This SIMD architecture is well suited to performing the same operation on multiple data elements in parallel; single-precision (32-bit) floating-point throughput is the SPU's strong suit, with double-precision (64-bit) operations supported at considerably lower throughput.

However, adopting an SPE-centric programming model also introduces several challenges. Developers must carefully partition the data and tasks processed by each SPE, ensuring there are no conflicts or race conditions when multiple SPEs access shared data structures concurrently. Proper synchronization mechanisms, such as locks or barriers, must be employed to maintain data coherency.

And while the PPE is responsible for initially setting up tasks and allocating resources, the SPEs may still require some level of dynamic resource management during execution. Mechanisms for requesting and releasing resources, such as memory or hardware accelerators, may need to be implemented.

In a distributed computing environment like the SPE-centric model, fault tolerance and error handling also become more critical. Mechanisms must be in place to detect and recover from failures or errors in individual SPEs, and to gracefully handle scenarios where one or more SPEs become unavailable.

And finally, codebases implementing SPE-centric algorithms may be more challenging to port to other platforms that do not share the same heterogeneous architecture as the Cell BE. This can potentially limit the reusability and portability of the codebase across different hardware targets. This is partly why PS3 emulation has been such a tough nut to crack, with Sony themselves dropping backwards compatibility support for the PS3 in the PS4.

Despite its initial hype and potential, the Cell processor ultimately failed to gain widespread adoption and became an architectural dead-end.

One of the fundamental reasons for the Cell's failure can be attributed to the breakdown of Dennard scaling. Dennard scaling, a principle that governed transistor behavior for several decades, predicted that as transistors became smaller, their power density would remain constant, allowing for higher clock speeds without increasing power consumption or heat generation.

The Cell processor was designed during an era when engineers still expected to reach clock speeds of 5GHz or higher in consumer electronics chips. As the Dennard scaling limitations became apparent, the Cell's design, which relied heavily on high clock speeds, became increasingly untenable. The processor ended up pigeon-holed into roles it was not originally designed for, and Sony had to hastily integrate a separate GPU from NVIDIA (the RSX) to compensate for the Cell's underperforming graphics capabilities, which led to production issues of its own.

Beyond the hardware limitations, the tooling and documentation for the Cell were widely criticized, further compounding the challenges developers faced in effectively utilizing the processor's capabilities. This was likely deliberate on Sony's part, to discourage developers from taking their games multiplatform, and it is part of why ports of many PS3 titles to other platforms only began appearing recently.

While the Cell processor represented an ambitious and innovative effort to push the boundaries of parallel computing, its performance in general-purpose computing tasks was often lackluster due to the programming challenges and the inherent limitations of its architecture, even though it excelled in certain embarrassingly parallel workloads such as matrix math and password cracking. Despite the hype around its potential for specialized workloads, the Cell's unconventional design proved to be a misstep, and it ultimately faded into obscurity.

Speaking of processor designs, next article might be about CUDA kernels, but idk tho lol.