Wednesday, June 24, 2015

Bash Inefficient Matrix Multiplication

Bash can only handle integer arithmetic. To process floats, you need to shell out to 'bc' or some other external utility. The core multiplication changes to:

    result=`echo  "$result + $m * $n" | bc`;

Boy, does that slow things down! For each multiplication you're launching a subshell, running an external program, and reading back the result. The upshot is about 800 multiplications a second. Note that this is still running on integer matrices; there remain unresolved problems in reading floats.


One obvious problem is shelling out for every multiplication. More efficient is to accumulate the expression to calculate, and invoke the external calculator just once per element. Now we only shell out N^2 times instead of N^3, so increasing matrix size leads to better relative performance, up to a limit. The 32x32 matrix gives 18x the performance of the simpler version, which implies the 32x reduction in shelling out is partly offset by the cost of building the expression. By the time we reach 100x100, overhead costs are overwhelming the gains: a 100x reduction in shelling out yields only a 16x improvement in performance.


Tuesday, June 23, 2015

Bash Matrix Multiplication

tl;dr Bash is not the language for math-intensive operations.





REPS=$1;
FILE_1=$2;
FILE_2=$3
OUTFILENAME=$4;

readonly COLS=`head -1 $FILE_1 | wc -w`;
readonly ROWS=`wc -l < $FILE_1`;
# echo "rows is $ROWS; cols is $COLS"
if [[ $ROWS != $COLS ]]; then
    echo "Expecting square matrices, " \
         "but rows = $ROWS, cols = $COLS";
    exit 1;
fi

# --------------------------------------------------
# SUBROUTINES
#
function outputMatrix()
{
    local matrixname=$1;
    local matrix;
    local elem;
    echo "matrix is '$matrixname'.";
    eval matrix=\( \${${matrixname}[@]} \);
    local i=0;
    for elem in "${matrix[@]}"; do
        echo -n "$elem ";
        if (( ++i == $COLS )); then
            echo '';
            i=0;
        fi
    done
}

function multiply()
{
    declare -a product;
    local M=$1 N=$2;
    local i j k idx1 idx2 idx3;
    for ((i=0; i < $ROWS; i++ )); do
        for ((j=0; j<$COLS; j++)); do
            local result=0;
            for ((k=0; k<$ROWS; k++)); do
                idx1=$((i*$ROWS+k));
                idx2=$((k*$COLS+j));
                result=$((result+${M}[idx1]*${N}[idx2]));
            done
            idx3=$((i*$COLS+j));
            product[$idx3]=$result;
        done
    done
#    outputMatrix product;
}
# --------------------------------------------------
#

matrix1=($(cat $FILE_1));
matrix2=($(cat $FILE_2));


start=`date  +%s.%N`;
for ((repeat = 0; repeat < $REPS; repeat++)); do
    multiply matrix1 matrix2 
done
end=`date  +%s.%N`;
echo "$end - $start" | bc
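
For completeness, here's how I generate test input; genmatrix and the matmult.sh filename are stand-ins of my own, not names from the script above.

```shell
# Hypothetical helper (not part of the script above): emit an NxN matrix
# of random single-digit integers, one whitespace-separated row per line,
# the format the script slurps with cat.
genmatrix() {
    local n=$1 i j row
    for ((i = 0; i < n; i++)); do
        row=""
        for ((j = 0; j < n; j++)); do
            row+="$((RANDOM % 10)) "
        done
        echo "$row"
    done
}

genmatrix 5 > M_5x5.1
genmatrix 5 > M_5x5.2
# Then, assuming the script is saved as matmult.sh:
#   bash ./matmult.sh 100 M_5x5.1 M_5x5.2 out.txt
```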

My core multiplication routine is similar to the one at RosettaCode. Results are drastically inferior to Perl or JavaScript: a factor of 200x to 1000x slower.


Notice that the 100x100 matrix has only a single data point, at 55,000 multiplications per second. That took 547 seconds; I wasn't going to wait around 2 hours for more repetitions.

Part of the problem comes from flattening a two-dimensional matrix into a one-dimensional array. Converting row and column indices into a single array index overwhelms the actual matrix multiplication, resulting in 3N^3 + N^2 multiplications. These all happen in "user space", while Perl & JS can carry out array indexing in "language space", which is potentially more efficient.

Actually, simplifying the index calculations only improved runtime by about 10%, not a whole lot. Possibly the cost of variable access swamps the difference between addition and multiplication.


    local idx3=0;
    for ((i=0; i < $ROWS; i++ )); do
        for ((j=0; j < $COLS; j++)); do
            local result=0;
            local idx1=$((i*$ROWS));
            local idx2=$j;

            for ((k=0; k < $ROWS; k++)); do
                result=$((result+${M}[idx1]*${N}[idx2]));
                ((idx1 += 1))
                ((idx2 += COLS))
            done
            product[$idx3]=$result;
            ((idx3++))
        done
    done



Wednesday, June 17, 2015

Better Performance Through Transposition

Both Perl and Javascript implement only one-dimensional arrays. To represent a matrix you need to nest arrays within arrays, typically one array for each row, within the overall matrix array.

Each element of the result matrix is the product of one row from the first matrix, and one column from the second. Going from one element to the next in the same row is much quicker than going to another row. The first matrix only changes rows once for each row, N changes, but the second matrix has N^3 changes of row, one for each multiplication.

If we were to transpose the second matrix, turning its rows into columns and columns into rows, the multiplication could be carried out by scanning along a row instead of jumping from row to row. That reduces the cost to N^2 row changes. How much will that save us?
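
In the flattened-array representation the Bash posts above use, transposition is just an index swap. A sketch with made-up names and a small 2x3 example:

```shell
# Transpose a flattened ROWSxCOLS matrix stored in a Bash array,
# so that column scans become row scans (illustrative names/values).
ROWS=2; COLS=3
m=(1 2 3 4 5 6)          # 2x3: rows (1 2 3) and (4 5 6)

declare -a t
for ((i = 0; i < ROWS; i++)); do
    for ((j = 0; j < COLS; j++)); do
        # element (i,j) of m becomes element (j,i) of t
        t[j * ROWS + i]=${m[i * COLS + j]}
    done
done
echo "${t[@]}"           # 3x2: rows (1 4), (2 5), (3 6)
```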

I transposed the second array while reading it in:

for my $size ( keys %vars ) {
   my $filestub = q{M_} . $size . q{x} . $size . q{.};
   my ( $f1, $f2 ) = ( $filestub . '1', $filestub . '2' );
   $vars{$size}[0] = readMatrix( $f1 );
   $vars{$size}[1] = transpose( readMatrix( $f2 ));
}

and used pairwise() from List::MoreUtils to multiply corresponding elements of the rows, and sum0() from List::Util to add up the products.


for my $i ( 0 .. $rows - 1 ) {
    for my $j ( 0 .. $cols - 1 ) {
        $prod[$i][$j]
            = sum0 pairwise { $a * $b } @{ $M->[$i] }, @{ $N->[$j] };
    }
}

The results were much the same for both integers and floats.

-➤   perl ./int_benchmark.pl
Processing for 5 seconds each size will take 25 seconds.
Wed Jun 17 15:25:14 2015
              Rate M_100x100   M_32x32   M_10x10     M_5x5     M_2x2
M_100x100   5.54/s        --      -97%     -100%     -100%     -100%
M_32x32      174/s     3046%        --      -97%     -100%     -100%
M_10x10     5114/s    92135%     2832%        --      -85%      -98%
M_5x5      34909/s   629461%    19913%      583%        --      -87%
M_2x2     274124/s  4943582%   157052%     5260%      685%        --
Wed Jun 17 15:25:46 2015

-➤   perl -E '
say "100 => ", 100**3 * 5.54   / 10**6;
say " 32 => ",  32**3 * 174    / 10**6;
say " 10 => ",  10**3 * 5114   / 10**6;
say "  5 => ",   5**3 * 34909  / 10**6;
say "  2 => ",   2**3 * 274124 / 10**6;
'
100 => 5.54
 32 => 5.701632
 10 => 5.114
  5 => 4.363625
  2 => 2.192992

Previous float results were 2.1, 4.1, 4.9, 5.2 and 5.4 MFLOPS for 2x2 ... 100x100 sized matrices. So 10,000 row changes or 1,000,000, it doesn't make much difference.

I modified the Javascript program to operate on a transposed second matrix; since Javascript doesn't provide a pairwise() routine, I wrote a dot-product routine that takes two rows. But both integer and float matrices produce much the same results as before, +/- 10%. So transposing is irrelevant when there's so much other overhead in a scripting language.

Monday, June 15, 2015

Linpack FLOPS Benchmark

The numbers I've been reporting for Javascript and Perl performance need a context. Are they running on a Commodore 64, or a super-computer? Nope, it's my laptop, and here's what Linpack has to report about it: 4 core, 8 thread (Intel core i7) at 3.389 GHz. Peak performance is 46 GFLOPS.

Javascript and Perl are both single-threaded, so you only get 1/8 of peak performance, 5 3/4 GFLOPS. That's still 1000 times the performance we actually achieve in these more convenient languages. You have to sacrifice some things to gain the convenience. On the optimistic side, a moderately fast modern computer running Perl or Javascript is almost as fast as a Sparc-10 running LINPACK.

Intel(R) LINPACK data

Current date/time: Fri Jun 12 01:19:06 2015

CPU frequency:    3.389 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 8

Parameters are set to:

Number of tests                             : 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    

Maximum memory requested that can be used = 16200901024, at the size = 45000

============= Timing linear equation system solver =================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm)
1000   1000   4      0.019      35.5628  1.029343e-12 3.510325e-02
1000   1000   4      0.018      36.2016  1.029343e-12 3.510325e-02
1000   1000   4      0.018      36.1857  1.029343e-12 3.510325e-02
1000   1000   4      0.018      36.2075  1.029343e-12 3.510325e-02
2000   2000   4      0.136      39.2932  4.298950e-12 3.739560e-02
2000   2000   4      0.135      39.4407  4.298950e-12 3.739560e-02
5000   5008   4      1.923      43.3503  2.581643e-11 3.599893e-02
5000   5008   4      1.925      43.3232  2.581643e-11 3.599893e-02
10000  10000  4      14.781     45.1176  9.603002e-11 3.386116e-02
10000  10000  4      14.825     44.9833  9.603002e-11 3.386116e-02
15000  15000  4      49.465     45.4959  2.042799e-10 3.217442e-02
15000  15000  4      49.541     45.4258  2.042799e-10 3.217442e-02
18000  18008  4      85.142     45.6727  2.894987e-10 3.170367e-02
18000  18008  4      85.978     45.2284  2.894987e-10 3.170367e-02
20000  20016  4      115.975    45.9938  4.097986e-10 3.627616e-02
20000  20016  4      116.138    45.9292  4.097986e-10 3.627616e-02
22000  22008  4      156.203    45.4512  4.548092e-10 3.331299e-02
22000  22008  4      156.091    45.4840  4.548092e-10 3.331299e-02
25000  25000  4      227.453    45.8025  6.089565e-10 3.462917e-02
25000  25000  4      227.769    45.7390  6.089565e-10 3.462917e-02
26000  26000  4      254.249    46.0913  6.669421e-10 3.506981e-02
26000  26000  4      254.840    45.9845  6.669421e-10 3.506981e-02
27000  27000  4      284.795    46.0804  6.672171e-10 3.253690e-02
30000  30000  1      414.983    43.3796  8.421348e-10 3.319704e-02
35000  35000  1      732.317    39.0347  1.085509e-09 3.151068e-02
40000  40000  1      1479.396   28.8428  1.466774e-09 3.262155e-02
45000  45000  1      2764.282   21.9782  1.711494e-09 3.011194e-02

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
1000   1000   4       36.0394  36.2075 
2000   2000   4       39.3669  39.4407 
5000   5008   4       43.3367  43.3503 
10000  10000  4       45.0504  45.1176 
15000  15000  4       45.4608  45.4959 
18000  18008  4       45.4506  45.6727 
20000  20016  4       45.9615  45.9938 
22000  22008  4       45.4676  45.4840 
25000  25000  4       45.7708  45.8025 
26000  26000  4       46.0379  46.0913 
27000  27000  4       46.0804  46.0804 
30000  30000  1       43.3796  43.3796 
35000  35000  1       39.0347  39.0347 
40000  40000  1       28.8428  28.8428 
45000  45000  1       21.9782  21.9782 

End of tests



Perl Floating-Point Multiplication Benchmark

I was worried whether I was making basic errors in testing the Perl version, so I decided to use the Benchmark module to get the numbers. I copied the matmult.pl file and added use Benchmark ':all'; to the header of the file. The main() routine changed to:

    my $time = $ARGV[0] || 5;
    my %vars = (   2 => [],
                   5 => [],
                  10 => [],
                  32 => [],
                 100 => [],
    );
    for my $size ( keys %vars ) {
       my $filestub = q{F_} . $size . q{x} . $size . q{.};
       $vars{$size}[0] = readMatrix( $filestub . '1' );
       $vars{$size}[1] = readMatrix( $filestub . '2' );
    }
    
    say "Processing for $time seconds each size ",
        "will take @{[5 * $time]} seconds."; 
    say scalar localtime;
    cmpthese( -$time, {
       'F_2x2'     => sub { matmult( $vars{2}[0],
                                        $vars{2}[1]); },
       'F_5x5'     => sub { matmult( $vars{5}[0],
                                        $vars{5}[1]); },
       'F_10x10'   => sub { matmult( $vars{10}[0],  
                                        $vars{10}[1]); },
       'F_32x32'   => sub { matmult( $vars{32}[0],  
                                        $vars{32}[1]); },
       'F_100x100' => sub { matmult( $vars{100}[0], 
                                        $vars{100}[1]); },
       });
    say scalar localtime;


The user can specify how long each variant should be run, with a default of 5 seconds for each size if no arg is provided.

-➤   perl ./benchmark.pl 2
Processing for 2 seconds each size will take 10 seconds.
Mon Jun 15 17:11:08 2015
              Rate F_100x100   F_32x32   F_10x10     F_5x5     F_2x2
F_100x100   5.39/s        --      -97%     -100%     -100%     -100%
F_32x32      163/s     2931%        --      -97%     -100%     -100%
F_10x10     4909/s    90934%     2904%        --      -85%      -98%
F_5x5      32731/s   606904%    19929%      567%        --      -88%
F_2x2     264660/s  4908131%   161856%     5292%      709%        --
Mon Jun 15 17:11:22 2015

Adjusting for the number of multiplications involved in each size of matrix --- 8 for the 2x2 up to 1,000,000 for the 100x100 --- produces results very similar to what we've already seen (millions of multiplications per second):

perl -E'say  "100 => 5.39";
        say  "32  => ", 32000 * 163 / 10**6;
        say  "10  => ", 1000 * 4909 / 10**6;
        say  "5   => ", 125 * 32731 / 10**6;
        say  "2   => ", 8 * 264660  / 10**6;
       '

100 => 5.39
32  => 5.216
10  => 4.909
5   => 4.091375
2   => 2.11728

Rerunning the benchmarks with integer matrices and use integer enabled generated only fractionally better numbers, nothing significant; though I wonder whether the pragma also affected Benchmark's own calculations.