Tuesday, October 6, 2015

Revisiting Quicksort for Big Data

Quicksort is one of the most heavily used subroutines in applications. While browsing its implementations online, one thing troubled me. Take the following example from Rosetta Code: it is a neat implementation. But what happens when all the numbers are the same? It doesn't do anything wrong, it just keeps your CPU busy doing useless work :P.

#include <stdio.h>

void quick_sort (int *a, int n) {
    int i, j, p, t;
    if (n < 2)
        return;
    p = a[n / 2];                 /* pivot: the middle element */
    for (i = 0, j = n - 1;; i++, j--) {
        while (a[i] < p)          /* scan right for a key >= pivot */
            i++;
        while (p < a[j])          /* scan left for a key <= pivot */
            j--;
        if (i >= j)               /* pointers crossed: partition done */
            break;
        t = a[i];                 /* swap the out-of-place pair */
        a[i] = a[j];
        a[j] = t;
    }
    quick_sort(a, i);             /* sort the left partition */
    quick_sort(a + i, n - i);     /* sort the right partition */
}
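To see the behavior concretely, here is a minimal usage sketch (the main below is an assumed addition, not part of the Rosetta Code sample): even though the input is effectively already sorted, quick_sort still scans, swaps equal keys, and recurses all the way down to single-element partitions.

int main(void) {
    int i, a[16];
    for (i = 0; i < 16; i++)
        a[i] = 42;               /* every key is identical */
    quick_sort(a, 16);           /* still does ~n*log(n) comparisons and swaps */
    for (i = 0; i < 16; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}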

It is not hard to fix the problem: count how many keys equal the pivot while partitioning, and simply return once you observe that all the numbers in the partition are the same.
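A standard way to push this idea further is a three-way ("fat pivot") partition: keys equal to the pivot are gathered in the middle and excluded from the recursion entirely, so an all-equal input finishes after a single pass. The quick_sort3 below is a sketch of that approach, not code from the original post.

void quick_sort3(int *a, int n) {
    int lt, gt, i, p, t;
    if (n < 2)
        return;
    p = a[n / 2];
    lt = 0;                      /* a[0..lt-1]  < p  */
    gt = n;                      /* a[gt..n-1]  > p  */
    i  = 0;                      /* a[lt..i-1] == p  */
    while (i < gt) {
        if (a[i] < p) {          /* move small keys to the front */
            t = a[i]; a[i] = a[lt]; a[lt] = t;
            lt++; i++;
        } else if (a[i] > p) {   /* move large keys to the back */
            gt--;
            t = a[i]; a[i] = a[gt]; a[gt] = t;
        } else {                 /* equal to pivot: leave it in the middle */
            i++;
        }
    }
    quick_sort3(a, lt);          /* keys strictly less than the pivot    */
    quick_sort3(a + gt, n - gt); /* keys strictly greater than the pivot */
}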

When the data set is huge, the partitions become small after only a few calls of quick_sort, and there is a high probability that all the values in one small partition are the same. Some situations are even worse, like sorting all the people in the USA by age. If the input is an array of records in a database system, doing useless memory accesses and copies across n*log(n) operations is really costly!

Human ages are values from 1 to 100 or so. Blindly running the quick_sort algorithm on such keys just eats CPU and memory for no good reason. In many cases the keys to be sorted have only a few unique values, like age. In these cases, instead of an n*log(n) quicksort, a hybrid hash/count sort can do a much better job with one or a few scans: first extract the unique keys from the input, sort or hash them, then do a final scan that counts keys or rearranges the input. Depending on the situation, this can be optimized down to a single scan.
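For a small, known key range such as ages, this boils down to a counting sort. The sketch below is an assumed illustration (MAX_AGE and the plain int keys are hypothetical choices): one scan builds a histogram of the keys, and a second pass rewrites the array in order. Sorting full records would instead turn the counts into prefix sums and scatter the records into place in one extra pass.

#include <string.h>

#define MAX_AGE 130              /* assumed upper bound on the key domain */

void count_sort_ages(int *a, int n) {
    int count[MAX_AGE + 1];
    int i, k, out;

    memset(count, 0, sizeof(count));
    for (i = 0; i < n; i++)      /* scan 1: histogram the keys */
        count[a[i]]++;

    out = 0;                     /* scan 2: rewrite the array in key order */
    for (k = 0; k <= MAX_AGE; k++)
        for (i = 0; i < count[k]; i++)
            a[out++] = k;
}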



