Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
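In newer releases, GPU operations may return control to MATLAB before the computation has finished, so a plain tic/toc can under-report the GPU time. A minimal sketch of the same measurement with explicit synchronization, assuming a release in which `wait` is supported on the `gpuDevice` object (the names `dev`, `tCpu` and `tGpu` are only illustrative):

```matlab
A   = rand(1024);
gA  = gpuArray(A);
dev = gpuDevice;            % handle to the currently selected GPU

% Warm up: run both multiplications a few times before timing
for k = 1:3
    C  = A * A;
    gC = gA * gA;
end
wait(dev);                  % make sure the warm-up work has finished

% Time the CPU multiplication
tic; C = A * A; tCpu = toc;

% Time the GPU multiplication, waiting for the GPU before stopping the clock
tic; gC = gA * gA; wait(dev); tGpu = toc;

fprintf('CPU: %.6f s, GPU: %.6f s\n', tCpu, tGpu);
```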
MATLAB uses highly optimized libraries for matrix multiplication, which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
    0.0324
>> gputimeit(@()gA*gA)
ans =
    0.0022
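Taking the ratios of the timings above, the GPU is roughly 8.7x faster in the R2011a run and roughly 15x faster in the R2014a run. A small sketch that computes the ratio directly and also checks that the two products agree, assuming a release that includes timeit and gputimeit (the names `tCpu`, `tGpu` and `relErr` are only illustrative):

```matlab
A  = rand(1024);
gA = gpuArray(A);

% timeit and gputimeit call the function several times and
% return a representative time in seconds
tCpu = timeit(@() A * A);
tGpu = gputimeit(@() gA * gA);
fprintf('CPU: %.4f s, GPU: %.4f s, speedup: %.1fx\n', tCpu, tGpu, tCpu / tGpu);

% Compare the results: gather copies the GPU result back to host memory
C      = A * A;
gC     = gather(gA * gA);
relErr = norm(gC - C, 'fro') / norm(C, 'fro');
fprintf('relative difference: %.2e\n', relErr);
```

The relative difference is typically tiny but not necessarily exactly zero, since the CPU and GPU libraries can accumulate the sums in a different order.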