This part covers optimizing the implementation from the previous post. Let's start by understanding some of the big mistakes in that implementation.
Memory access
Regarding caching, the main idea is to make the best use of the different memory levels available.

If you look at the memory model in OpenCL (check chapter 3.3 in the OpenCL 1.x spec), there are different memory levels. The fastest is private memory, which is available per work-item. Local memory is shared among a work-group (a group of work-items) and is larger (16KB minimum). Then there is global memory, the big main GPU memory shared between all cores. There is also a constant (read-only) memory, which is shared as well. The more data you can bring into the faster, lower levels, the better. So, the first thing we'll do is move the data we use into local memory, and to do that, we need to understand the execution model of OpenCL.

We mentioned earlier that local memory is shared inside a work-group, which is basically a group of work-items. Because relying on global memory alone would slow things down drastically (hundreds of cores accessing one memory), GPU cores are grouped into work-groups that share some local memory (16KB or larger, depending on the hardware) to speed things up. Each work-item also has its own private memory. So, for the N-body problem at hand, it makes sense to cache the current work-item's body in its private memory and to cache a block of bodies in local memory for the whole work-group to use, rather than reading them from global memory over and over.
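Since every work-item runs the same kernel, the host typically rounds the global work size up to a multiple of the work-group size; in OpenCL 1.x, the global size passed to clEnqueueNDRangeKernel must be evenly divisible by the local size. A minimal sketch of that rounding (the helper name is mine, not from the post's repository):

```c
#include <stddef.h>

/* Round the global work size up to the next multiple of the work-group
 * size. OpenCL 1.x requires the global size to be evenly divisible by
 * the local size; the kernel's bounds check (if (i >= count) return;)
 * discards the padding work-items. */
static size_t round_up(size_t count, size_t group_size) {
    return ((count + group_size - 1) / group_size) * group_size;
}
```

With 33,000 bodies and a work-group size of 256, for example, this launches 33,024 work-items, and the last 24 simply return.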
We start by caching the current body.
__kernel void accelerate(__global body *bodies, const unsigned int count, float G, float e) {
    // get the global run id
    unsigned int i = get_global_id(0);
    // check if it's over the particles' count
    if (i >= count) return;
    // body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    // calculate acceleration
    unsigned int j = 0;
    float dx, dy, r, f;
    for (; j < count; ++j) {
        if (i == j) continue;
        dx = currentBody.x - bodies[j].x;
        dy = currentBody.y - bodies[j].y;
        r = max(sqrt(dx*dx + dy*dy), e);
        f = G * bodies[j].m / (r*r);
        currentBody.ax -= f * (dx/r);
        currentBody.ay -= f * (dy/r);
    }
    bodies[i].ax = currentBody.ax;
    bodies[i].ay = currentBody.ay;
}
My tests showed that running the simulation with 32,768 particles using this kernel takes the same time as the old one simulating 16,384 particles. Big difference, and it comes down entirely to memory access patterns.
Step two is local memory. This one is tricky; there are two parts to it: the work-group size and the local memory itself. As mentioned earlier, local memory has a guaranteed minimum of 16KB. The work-group size, however, has no useful guaranteed minimum (the minimum is 1). So you'll have to adapt the kernel to your hardware. An easy way to do this is to pass preprocessor options when building your CL kernels, and that's indeed what we'll do.
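For example, the work-group size can be baked into the kernel as a macro at build time through the options string of clBuildProgram. This is only a sketch: GROUP_SIZE is a macro name I'm making up for illustration, not one from the post's repository.

```c
#include <stdio.h>

/* Format a -D define into the options string passed to clBuildProgram,
 * e.g. clBuildProgram(program, 1, &device, options, NULL, NULL).
 * The kernel can then use GROUP_SIZE as a compile-time constant. */
static void make_build_options(char *buf, size_t len, unsigned group_size) {
    snprintf(buf, len, "-D GROUP_SIZE=%u", group_size);
}
```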
Local memory is shared within a work-group, so we can divide the big loop that iterates over all bodies into blocks. For each block, every work-item caches a single body, so together the work-group fills the cache; the whole group then computes against that cache (instead of repeatedly reading the same bodies from global memory), and we move on to the next block. To achieve this, we'll need to synchronize the work-items. Also, since we only need the position and mass of each body, we'll cache float3 values instead of the full body structure.
__kernel void accelerate(
    __global body *bodies, const unsigned int count,
    // local memory for the cache (its size is set on the host via clSetKernelArg)
    __local float3 *cache,
    float G, float e) {
    // get the global run id
    unsigned int i = get_global_id(0);
    unsigned int group_count = get_num_groups(0);
    unsigned int group_size = get_local_size(0);
    unsigned int local_id = get_local_id(0);
    // check if it's over the particles' count
    // (count must be a multiple of the work-group size: if some work-items
    // return here while others reach the barriers below, behavior is undefined)
    if (i >= count) return;
    // body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    // variables used to calculate acceleration
    float dx, dy, r, f;
    // loop on blocks
    for (unsigned int g = 0; g < group_count; ++g) {
        // each work-item caches a single body; together, the group fills the cache
        unsigned int idx = g * group_size + local_id;
        cache[local_id].x = bodies[idx].x;
        cache[local_id].y = bodies[idx].y;
        cache[local_id].z = bodies[idx].m;
        // synchronize all work-items in the work-group
        barrier(CLK_LOCAL_MEM_FENCE);
        // calculate acceleration between the current body and the cache
        // (the i == j case adds nothing: dx = dy = 0 and r is clamped to e)
        for (unsigned int j = 0; j < group_size; ++j) {
            dx = currentBody.x - cache[j].x;
            dy = currentBody.y - cache[j].y;
            r = max(sqrt(dx*dx + dy*dy), e);
            f = G * cache[j].z / (r*r);
            currentBody.ax -= f * (dx/r);
            currentBody.ay -= f * (dy/r);
        }
        // synchronize again before the next block overwrites the cache
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    bodies[i].ax = currentBody.ax;
    bodies[i].ay = currentBody.ay;
}
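One detail that lives on the host side: the __local cache has no size in the kernel source, so it is sized through clSetKernelArg with a NULL value pointer, e.g. clSetKernelArg(kernel, 2, cache_bytes, NULL) for this kernel (argument index 2 is the cache). A float3 occupies 16 bytes (the same as a float4) per the OpenCL spec, so the cache size is just the work-group size times 16; a sketch:

```c
#include <stddef.h>

/* Bytes of __local memory needed for the body cache: one float3 per
 * work-item in the group. A float3 takes 16 bytes (same as float4)
 * per the OpenCL spec. */
static size_t local_cache_bytes(size_t group_size) {
    return group_size * 16;
}
```

With the guaranteed 16KB minimum of local memory, a work-group of up to 1024 fits exactly.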
Memory access is now optimized. On to the next step.
Instructions Optimization
The core of the algorithm can't be optimized much further. However, we can still optimize around it, for example by replacing loop variables with compile-time constants. We can do that using one of two methods:
- Using macros to define them at build time.
- Loop unrolling, so that the loop condition is checked less frequently.
We'll go with manual unrolling. First, we move the calculation into a function, and call that function several times inside the loop. Since the block size (the work-group size) is in practice a power of two of at least 8, eight calls per iteration works.
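To illustrate the shape of the transformation in plain C (summing an array instead of accumulating accelerations; this is not code from the repository): unrolling by 8 checks the loop condition once per eight elements, and it assumes the element count is a multiple of 8, just like the kernel assumes the work-group size is.

```c
/* Manually unrolled by 8: one loop-condition check per eight elements.
 * Assumes n is a multiple of 8, mirroring the kernel's assumption about
 * the work-group size. */
static float sum_unrolled(const float *a, unsigned n) {
    float s = 0.0f;
    for (unsigned j = 0; j < n;) {
        s += a[j++]; s += a[j++]; s += a[j++]; s += a[j++];
        s += a[j++]; s += a[j++]; s += a[j++]; s += a[j++];
    }
    return s;
}
```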
One last step is to merge the acceleration and integration kernels into one, to make use of the already-cached body, and we're done. This is the final version of the kernel:
struct _body {
    float x, y;
    float vx, vy;
    float ax, ay;
    float m;
};
typedef struct _body body;

void computeAcceleration(body *currentBody, float3 cachedBody, float G, float e) {
    float dx, dy, r, f;
    dx = currentBody->x - cachedBody.x;
    dy = currentBody->y - cachedBody.y;
    r = max(sqrt(dx*dx + dy*dy), e);
    f = G * cachedBody.z / (r*r); // cachedBody.z holds the mass
    currentBody->ax -= f * (dx/r);
    currentBody->ay -= f * (dy/r);
}

__kernel void accelerate(
    // bodies, their count, and the local cache
    __global body *bodies, const unsigned int count, __local float3 *cache,
    // constants
    float G, float e, float dt, float decay
) {
    // get the global run id
    unsigned int i = get_global_id(0);
    unsigned int group_count = get_num_groups(0);
    unsigned int group_size = get_local_size(0);
    unsigned int local_id = get_local_id(0);
    // check if it's over the particles' count
    // (count must be a multiple of the work-group size so that every
    // work-item in a group reaches the barriers below)
    if (i >= count) return;
    // body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    // loop on blocks
    for (unsigned int g = 0; g < group_count; ++g) {
        // each work-item caches a single body
        unsigned int idx = g * group_size + local_id;
        cache[local_id].x = bodies[idx].x;
        cache[local_id].y = bodies[idx].y;
        cache[local_id].z = bodies[idx].m;
        // synchronize all work-items
        barrier(CLK_LOCAL_MEM_FENCE);
        // calculate acceleration between the current body and the cache,
        // unrolled by 8 (group_size must be a multiple of 8)
        for (unsigned int j = 0; j < group_size;) {
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
        }
        // synchronize again before the next block overwrites the cache
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // integrate: semi-implicit Euler, then apply velocity decay
    currentBody.vx += currentBody.ax * dt;
    currentBody.vy += currentBody.ay * dt;
    currentBody.x += currentBody.vx * dt;
    currentBody.y += currentBody.vy * dt;
    currentBody.vx *= decay;
    currentBody.vy *= decay;
    bodies[i].x = currentBody.x;
    bodies[i].y = currentBody.y;
    bodies[i].vx = currentBody.vx;
    bodies[i].vy = currentBody.vy;
}
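For reference, the tail of the kernel is a semi-implicit Euler step with a velocity decay factor: velocity is updated from acceleration first, then position from the new velocity, then velocity is damped. The same step in plain C (one axis shown):

```c
/* One semi-implicit Euler step with velocity decay, matching the end of
 * the combined kernel (shown for one axis). */
static void integrate(float *x, float *vx, float ax, float dt, float decay) {
    *vx += ax * dt;  /* v += a*dt                      */
    *x  += *vx * dt; /* x += v*dt, using the updated v */
    *vx *= decay;    /* damping                        */
}
```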
The updated code is on a branch called “optimized” in the repository shared in the previous post. Check it out.

As you can see, the difference is phenomenal. I also experimented with different work-group sizes. Here are the results.

Hardware: ATI Radeon HD 6490M (256MB memory, 160 cores, peaks ~260 GFLOPS).
Distributed systems differ, and we'll discuss that in a future post, hopefully.
Keep up the good work. You're doing great. (Y)
PS: This code is still not ideal. The instructions can be optimized further, and the data structures should be altered. More on that in a future post.