Testing PreSonus Studio One

Hello..

Compared to GarageBand, it’s much more flexible. The ability to share instruments between tracks, track grouping, an unlimited number of filters per track/group, and a number of worthy plugins make it a very good candidate. After trying it for a month or so, I think it will only work for me if I get the Presence XT add-on. A similar one comes with Studio One Artist (thankfully, I got it for free), but its quality is simply bad.

At some point, I’ll try Presence XT and see how things go. Nonetheless, here is something I made using Studio One. Enjoy!

Trying to relax/recover

Hello people 🙂


So, between playing games and being busy doing nothing, tech posts will be on hold for some time (I think that’s already very noticeable :D). I’ll be posting tracks here for the time being. I don’t know when I’ll get back to technical stuff, but I definitely know it’s not soon. Also, I had a terrible bike accident, so I’m in doing-nothing-playing-and-eating-all-day-sleeping-a-lot mode (it’s an actual mode :D).


Here are two things I posted lately on SoundCloud. I hope you enjoy them.


Keep up the good work 😉

Sketching the Third Soundtrack.. and something else

Hi there!

Between work and laziness, I managed to complete this one. It’s still a work in progress. It’s meant to be the inspiration for the whole soundtrack. I’ll take my time on this one.


I hope you enjoy it.


PS: Technical posts will come back soon, ISA. I’m trying to set up some hardware for it, and as soon as the hardware is ready I’ll get back to the geeky stuff, ISA. Wish me luck. 🙂

First Soundtrack?

Sorry for the long absence. I disappeared.
I just finished the first soundtrack. I made it from scratch. Some of it came from ideas I had back in 2012; some of it from ideas I had just today. 😀
I hope you enjoy it: The First Soundtrack. My favorite is the last one, titled “Walking Away”, which I came up with this afternoon.
See you soon.
PS: About the distributed series.. I have no idea if I’ll finish it or not. I’ve been really busy since then. So..

Setup a Cluster using ROCKS!

I thought it might come in handy for some people. Let’s get to it.

What is Rocks?

Rocks is a Linux distribution based on CentOS, intended for High-Performance Computing. It has the tools that help you manage the cluster, and it’s really easy to set up for a… I’m not going to say a “Mere Mortal”, but it shouldn’t be hard for an average computer science student.

Installation

Check Your System Requirements

You’ll need a minimum of two computers (the main node, and a compute node). Here are the system requirements for running Rocks:

Frontend Node
  • Disk Capacity: 30 GB
  • Memory Capacity: 1 GB
  • Ethernet: 2 physical ports (e.g., “eth0” and “eth1”)
  • BIOS Boot Order: CD, Hard Disk
Compute Node
  • Disk Capacity: 30 GB
  • Memory Capacity: 1 GB
  • Ethernet: 1 physical port (e.g., “eth0”)
  • BIOS Boot Order: CD, PXE (Network Boot), Hard Disk

In other words, one main node with two ethernet ports, and any number of compute nodes. All of them should have loads of free space, and you will need some cables and switches to connect things together, depending on the number of nodes. There is no minimum requirement for processor cores. Also, you can use the same installation process for virtual clusters.

Connect The Nodes Together

Connect all compute nodes to the main node, and connect the main node to the internet. Of course, you can configure the network connections to suit your application. Also, make sure that DHCP is disabled in all switches that connect the private network; Rocks relies on DHCP to boot, connect, and install the OS on compute nodes.

Download ROCKS! and Burn the DVD

I used the jumbo DVD because it has everything. Rocks components are separated into multiple CDs, and the big DVD is better because inserting CDs into the machine one by one can be annoying. Here is the link to the download page.

Install Rocks on the main node

Before doing so, make sure you set the right boot order as described in the system requirements. Insert the DVD, and boot it up.

Rocks boot screen.
As soon as the boot screen pops up, type “frontend” and hit return. This tells Rocks that this node is the frontend node, and the installation process will start accordingly. Wait a bit for the installer to come to life.

This is the welcome screen. Choose CD/DVD-based roll.

Here is the rolls list. Choose the following rolls: Base, Ganglia, HPC, Kernel, and OS. Those are the rolls I needed to set up my cluster.

Check the rolls summary, and make sure that the right rolls are selected. Then, hit “Next”.

Here, Rocks requests some basic information about the cluster, such as host name and location.

Public network configuration. This should be configured depending on the router that connects to the internet. Double-check the physical ports and make sure that you got the connections right. I chose “eth0” to be the public connection.

Private network configuration. This should be configured according to the switch that connects the private network together. If it’s a virtual HPC cluster, no changes are needed. Also, make sure that the interfaces are in the right order. I chose “eth1” to be the private network.

Write your desired Gateway and DNS server. I chose the router.

Don’t forget the password. You’ll need it. Note: root is the system admin account; it can do anything and everything. You really don’t want to forget this password.

Choose the proper timezone.

If the machine has the minimum disk requirement (30GB), manual partitioning won’t be needed.

Wait a bit for things to finish. Have a snack, maybe.

Voila! The main node is up. Time to set up the rest.

Install Rocks on the rest of the nodes

If you have network booting (PXE) enabled on all compute nodes, things will be much easier and more seamless:

  1. Open a terminal window on the frontend node.
  2. Run the insert-ethers command.
    Connecting the cluster to the main node using `insert-ethers`.
  3. Choose the device you want to connect. For now, it will be “Compute”.
  4. Turn on all compute nodes (make sure the boot order is set as specified in the requirements), and watch Rocks handle the rest.
  5. If network booting doesn’t work, you’ll have to boot the installation DVD on each compute node. In that case, just boot it and let Rocks handle the rest; no need to type anything on the boot screen.

A list of all detected nodes. Nodes are detected using DHCP.

insert-ethers will show nodes as they’re discovered.

Ganglia

Ganglia provides a monitoring solution for your cluster.

Simply open a browser on the main node and go to http://localhost/ganglia/. It’s also the way to check whether the setup on each compute node has completed.

Then what?

I don’t know. You’ve got yourself a cluster. Do whatever you want with it. Write some MPI app that folds proteins. Distribute a database of randomly generated records for fun. Break the world record for the number of pi digits you can calculate. Whatever.
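By the way, if you want a quick sanity check that jobs really run across the nodes, here’s a minimal MPI “hello” in C. It’s just a sketch: it assumes the MPI toolchain that ships with the HPC roll (mpicc to compile, mpirun to launch) and a machine file listing your compute nodes.

// sketch: compile with mpicc, launch with mpirun across the cluster nodes
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes
    MPI_Get_processor_name(name, &name_len); // the node this process landed on

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

If every node shows up in the output, congratulations: you really do have a cluster.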

I hope this helps. Leave me a question if you want. Hopefully, I’ll finish the N-Body series.. I’m not going to say soon, but I hope I do.

HPC Setup

I’ve been trying to set up an HPC cluster out of all the PCs I have at home. When I finish that, I’ll be able to continue the N-Body post series. What I want is for people to understand how HPC apps work and to be able to build apps that harness the power they have at hand. Along the way, I’ll post a few things that might not be relevant to the topic. Maybe I’ll post a tutorial on how to build a cheap HPC system. Also, before that, I’ll post another OpenCL implementation that fixes some major implementation faults and performance issues.

Stay tuned.

A Quickie: Chromium on top of WPF.

Want Chromium in your WPF app? Chromium is great, and let’s admit it: the standard web browser control that comes with WPF is crippled (it can’t run plugins unless you add some code to your HTML files).

Awesomium.NET & CefSharp

Based on Awesomium and the Chromium Embedded Framework respectively, which are based on Chromium, which is based on WebKit, which is developed by Apple, Google, and many others, and was originally created by KDE.

Both are equally great. Download their examples and see for yourself.

I’m out.

PS: The only drawback of these running on top of WPF is that they use a different rendering engine than WPF’s. So, internally, the render buffer is copied onto the target WPF control every few milliseconds (user-defined; the default is 16 in both, I think). As a result, when you run a page with a plugin like Flash, mouse hovers seem to get lost on every refresh, which is not great.

Distributed N-Body Simulation – Parallel First (Part 2)

This part is about optimizing the implementation from the previous post. Let’s start by understanding some of the huge mistakes in that implementation.

Memory access

Regarding caching, the main idea is to make the best use of the different available memory levels.

Figure 1. Taken from OpenCL 1.x Specs, figure 3.3

If you look at the memory model in OpenCL (check chapter 3.3 in the OpenCL 1.x spec), there are different memory levels. The fastest is private memory, which is available per work-item. Local memory is shared among a work-group (a group of work-items), and it’s larger (16KB minimum). Then there is global memory, the big main GPU memory shared between all cores. There is also a constant (read-only) memory that’s shared as well. The more data you can bring into the lower, faster levels, the better. So, the first thing we’ll do is move the data we use into local memory, and to do that, we need to understand the execution model of OpenCL.

Figure 2 - from OpenCL 1.x Specs, figure 3.2

We mentioned earlier that local memory is shared inside a work-group, which is basically a group of work-items. Because relying on global memory would slow things down drastically (hundreds of cores accessing one memory), work-items are grouped into work-groups that share some fast local memory (16KB or larger, depending on the hardware) to speed things up. Also, each work-item has its own private memory. So, for the N-Body problem we have, it makes sense to cache the current work-item’s body in its private memory, cache a block of bodies in local memory (for the whole work-group to use, rather than reading them from global memory over and over), and that’s it.

We start by caching the current body.

//the body struct shared with the host (same layout as in the final kernel below)
struct _body {
    float x, y;
    float vx, vy;
    float ax, ay;
    float m;
};
typedef struct _body body;

__kernel void accelerate(__global body *bodies, const unsigned int count, float G, float e) {
    //get the global run id
    unsigned int i = get_global_id(0);
    
    //check if it's over the particles' count
    if (i >= count) return;
    
    //body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    
    //calculate acceleration
    unsigned int j = 0;
    float dx, dy, r, f;
    for (; j < count; ++j) {
        if (i == j) continue;
        dx = currentBody.x - bodies[j].x;
        dy = currentBody.y - bodies[j].y;
        r = max(sqrt(dx*dx + dy*dy), e);
        f = G * bodies[j].m / (r*r);
        
        currentBody.ax -= f * (dx/r);
        currentBody.ay -= f * (dy/r);
    }
    
    bodies[i].ax = currentBody.ax;
    bodies[i].ay = currentBody.ay;
}

My tests showed that running the simulation with 32,768 particles using this kernel takes the same time as the old one simulating 16,384 particles. That’s a big difference, and it’s only because of bad memory access patterns.

Step two is local memory. This one is tricky. There are two parts to it: work-group size and local memory size. As mentioned earlier, local memory has a minimum of 16KB. However, work-group size doesn’t have a guaranteed value (the minimum is 1). So, you’ll have to adapt the kernel to your hardware. An easy way to do this is by using preprocessor options while building your CL kernels, and that’s what we’ll do.
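For example, here’s a rough sketch of what that could look like on the host side (in C). The names here (build_with_group_size, GROUP_SIZE) are made up for illustration; the point is the “-D” option passed to clBuildProgram, which lets the kernel treat the work-group size as a compile-time constant.

#include <stdio.h>
#include <CL/cl.h>

// sketch: assumes "program" was created from the kernel source and "device" was already selected
cl_int build_with_group_size(cl_program program, cl_device_id device, size_t group_size) {
    char options[64];
    // the kernel can then use GROUP_SIZE as a constant, e.g. as a loop bound or a local array size
    snprintf(options, sizeof(options), "-D GROUP_SIZE=%zu", group_size);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}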

Local memory is shared within a work-group. What we can do is divide the big loop that iterates over all bodies into blocks; in each block, every work-item caches a single body, so the whole group completes the caching together (instead of repeated reads from global memory for the same bodies over and over). Then we use the cached data and move on to the next block. To achieve this, we’ll need to synchronize the work-items. Also, since we only need the position and mass of the bodies, we’ll use float3 instead of the full body structure.

__kernel void accelerate(
    __global body *bodies, const unsigned int count,
    // We add local memory for the cache (the size is specified using clSetKernelArg)
    __local float3 *cache,
    float G, float e) {
    // get the global run id
    unsigned int i = get_global_id(0);
    unsigned int group_count = get_num_groups(0);
    unsigned int group_size = get_local_size(0);
    unsigned int local_id = get_local_id(0);
    
    // check if it's over the particles' count
    if (i >= count) return;
    
    // body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    
    // declare variables that will be used to calculate acceleration
    float dx, dy, r, f;
    
    // Loop on blocks
    for(unsigned int g = 0; g < group_count; ++g) {
        // cache a single body
        unsigned idx = g * group_size + local_id;
        cache[local_id].x = bodies[idx].x;
        cache[local_id].y = bodies[idx].y;
        cache[local_id].z = bodies[idx].m;
        
        // synchronize all work-items in the work-group
        barrier(CLK_LOCAL_MEM_FENCE);
        
        // calculate acceleration between the current body and the cache
        for(unsigned int j = 0; j < group_size; ++j) {
            dx = currentBody.x - cache[j].x;
            dy = currentBody.y - cache[j].y;
            r = max(sqrt(dx*dx + dy*dy), e);
            f = G * cache[j].z / (r*r);
            
            currentBody.ax -= f * (dx/r);
            currentBody.ay -= f * (dy/r);
        }
        
        // synchronize all work-items in the work-group
        barrier(CLK_LOCAL_MEM_FENCE);
        
    }
    
    bodies[i].ax = currentBody.ax;
    bodies[i].ay = currentBody.ay;
}
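As a side note, this is roughly how the __local cache above could be sized from the host (a sketch in C; the kernel handle, buffer name, and argument order are assumptions that just mirror the signature above). Passing NULL as the value with a non-zero size tells clSetKernelArg to reserve that many bytes of local memory for the argument.

#include <CL/cl.h>

// sketch: argument indices follow the accelerate kernel signature above
cl_int set_accelerate_args(cl_kernel kernel, cl_mem bodies_buf, cl_uint count,
                           size_t group_size, cl_float G, cl_float e) {
    cl_int err = 0;
    err |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &bodies_buf);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &count);
    // one float3 cache slot per work-item in the work-group
    err |= clSetKernelArg(kernel, 2, sizeof(cl_float3) * group_size, NULL);
    err |= clSetKernelArg(kernel, 3, sizeof(cl_float), &G);
    err |= clSetKernelArg(kernel, 4, sizeof(cl_float), &e);
    return err;
}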

Memory access is now optimized. On to the next step.

Instructions Optimization

The body of the algorithm can’t be optimized any further. However, we can optimize other things, like replacing variables in loops with constants. We can do that using one of two methods:

  • Using macros to define them at build time.
  • Loop unrolling, so that the check for the loop condition is less frequent.

We’ll go with manual unrolling. First, we take the calculation part and put it into a function, and we’ll call that function several times inside the loop. Since the work-group size is always a power of two (8 or larger in practice), 8 calls per iteration should be sufficient.

One last step is to combine the acceleration and the integration kernel in one, to make use of the cached body, and we’re done. This is the final version of the kernel:

struct _body {
    float x, y;
    float vx, vy;
    float ax, ay;
    float m;
};
typedef struct _body body;

void computeAcceleration(body* currentBody, float3 cachedBody, float G, float e) {
    float dx, dy, r, f;
    dx = currentBody->x - cachedBody.x;
    dy = currentBody->y - cachedBody.y;
    r = max(sqrt(dx*dx + dy*dy), e);
    f = G * cachedBody.z / (r*r);
    
    currentBody->ax -= f * (dx/r);
    currentBody->ay -= f * (dy/r);
}

__kernel void accelerate(
    //Bodies, their count, and the local cache
    __global body *bodies, const unsigned int count, __local float3 *cache,
    //Constants
    float G, float e, float dt, float decay
) {
    // get the global run id
    unsigned int i = get_global_id(0);
    unsigned int group_count = get_num_groups(0);
    unsigned int group_size = get_local_size(0);
    unsigned int local_id = get_local_id(0);
    
    // check if it's over the particles' count
    if (i >= count) return;
    
    // body cache
    body currentBody;
    currentBody.x = bodies[i].x;
    currentBody.y = bodies[i].y;
    currentBody.vx = bodies[i].vx;
    currentBody.vy = bodies[i].vy;
    currentBody.ax = 0;
    currentBody.ay = 0;
    
    // Loop on blocks
    for(unsigned int g = 0; g < group_count; ++g) {
        // cache a single body
        unsigned idx = g * group_size + local_id;
        cache[local_id].x = bodies[idx].x;
        cache[local_id].y = bodies[idx].y;
        cache[local_id].z = bodies[idx].m;
        
        // synchronize all work-items
        barrier(CLK_LOCAL_MEM_FENCE);
        
        // calculate acceleration between the current body and the cache
        for(unsigned int j = 0; j < group_size;) {
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
            computeAcceleration(&currentBody, cache[j++], G, e);
        }
        
        // synchronize all work-items
        barrier(CLK_LOCAL_MEM_FENCE);
        
    }
    
    currentBody.vx += currentBody.ax * dt;
    currentBody.vy += currentBody.ay * dt;
    currentBody.x += currentBody.vx * dt;
    currentBody.y += currentBody.vy * dt;
    currentBody.vx *= decay;
    currentBody.vy *= decay;
    
    bodies[i].x = currentBody.x;
    bodies[i].y = currentBody.y;
    bodies[i].vx = currentBody.vx;
    bodies[i].vy = currentBody.vy;
}
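For completeness, this is roughly how a kernel like this could be launched for a given work-group size (a sketch in C; “queue”, “kernel”, and “count” are assumed to already exist, and count is assumed to be a multiple of the work-group size, as the kernels above expect):

#include <CL/cl.h>

// sketch: assumes count is a multiple of group_size (global must be divisible by local)
cl_int run_step(cl_command_queue queue, cl_kernel kernel, size_t count, size_t group_size) {
    size_t global = count;      // one work-item per body
    size_t local  = group_size; // the work-group size being tested
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
}

Changing “local” is all it takes to try the different work-group sizes compared below.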

There is a branch called “optimized” in the repository shared in the previous post. It has the updated code. Check it out.

Comparison of the optimized kernel and the regular one.

As you can see, the difference is phenomenal. I also experimented with different workgroup sizes. Here are the results.

Comparison of running the simulation using different work-group sizes.

Hardware: ATI Radeon HD 6490M (256MB memory, 160 cores, peaks ~260 GFLOPS).

Distributed systems differ

And we will discuss that in a future post, hopefully, ISA.
Keep up the good work. You’re doing good. (Y)

PS: This code is still not as good as it can be. The instructions should be optimized further, and the data structures should be altered. In a future post.