Why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in MATLAB scripts?


The code below performs the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')', while the second part computes a*b*a. The results are then verified to be the same.

%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])

clearvars -except c1

rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])

disp(['error = ',num2str(max(max(abs(c1-c2))))])

%end

However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs, with similar behaviour).

>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11

Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:

>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11

I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a MATLAB function rather than inside a script.

There seems to be an issue with multiplication of two sparse matrices on the GPU. Multiplying a sparse matrix by a full matrix is more than 1000 times faster than multiplying sparse by sparse. A simple example:

str={'sparse*sparse','sparse*full'};
for ii=1:2
    rng(1);
    a=sprand(3000,3000,0.1);
    b=sprand(3000,3000,0.1);
    if ii==2
        b=full(b);
    end
    a=gpuArray(a);
    b=gpuArray(b);
    tic
    c=a*b;
    disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end

In your context, it is the last multiplication that causes the slowdown. To demonstrate, I replace the second a with a duplicate, c, and multiply by it twice: once as a sparse matrix and once as a full matrix.

str={'a*b*a','a*b*full(a)'};
for ii=1:2
    %rng('default');
    rng(1)
    a=sprand(3000,3000,0.1);
    b=rand(3000,3000);
    rng(1)
    c=sprand(3000,3000,0.1);
    if ii==2
        c=full(c);
    end
    a=gpuArray(a);
    b=gpuArray(b);
    c=gpuArray(c);
    tic;
    c1{ii}=a*b*c;
    disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])

I may be wrong, but my conclusion is that a*b*a involves multiplication of two sparse matrices (a and a again), which is not handled well, while the transpose() approach splits the process into a two-stage multiplication, neither stage of which involves two sparse matrices.
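One way to probe this conclusion is to look at each stage separately. The sketch below is not part of the original answer. It relies only on the identity (a'*(a*b)')' = ((a*b)')'*(a')' = (a*b)*a and on MATLAB evaluating a*b*a from left to right, so both expressions share the first product a*b and differ only in how the second product is formed. The variable names ab, r1, r2, t1 and t2 are made up for illustration, and the code assumes the Parallel Computing Toolbox and a supported GPU. Run it as a script, since the question reports that the one-line transposed form is timed differently inside a function.

```matlab
% Sketch: inspect operand sparsity and time the second product in both forms.
rng(1);
a = gpuArray(sprand(3000, 3000, 0.1));   % sparse gpuArray
b = gpuArray(rand(3000, 3000));          % full gpuArray

ab = a*b;                                % first product, common to both forms
fprintf('a is sparse: %d, a*b is sparse: %d\n', issparse(a), issparse(ab));

% Second product as it occurs in a*b*a, i.e. (a*b)*a.
wait(gpuDevice); tic;
r1 = ab*a;
wait(gpuDevice); t1 = toc;

% Second product as it occurs in the transposed form, i.e. (a'*(a*b)')'.
wait(gpuDevice); tic;
r2 = transpose(transpose(a)*transpose(ab));
wait(gpuDevice); t2 = toc;

fprintf('(a*b)*a: %gs\n', t1);
fprintf('(a''*(a*b)'')'': %gs\n', t2);

% Compare the results on the CPU; full() guards against a sparse result.
d = gather(r1) - gather(r2);
fprintf('max abs difference = %g\n', full(max(abs(d(:)))));
```

If a*b is reported as not sparse, the second product in a*b*a is a full matrix times a sparse one rather than sparse times sparse, which would be worth keeping in mind when reading the timings.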


I got in touch with MathWorks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations MATLAB applies automatically to functions (but not scripts) not working as expected.

Rylan's response:

Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.

This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.

When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:

  n = 2000;
  a=gpuArray(sprand(n,n,0.01)); 
  b=gpuArray(rand(n));

  tic;res1=a*b*a;wait(gpuDevice);toc                                         % Elapsed time is 0.884099 seconds.
  tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc        % Elapsed time is 0.068855 seconds.

However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.

To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:

  tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc      % Elapsed time is 0.066446 seconds.

In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.

Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.

Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.

Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:

  b=gpuArray(rand(n));
  tic; res1=a*b*a; wait(gpuDevice); toc  

if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:

  b=gpuArray(rand(n));
  wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc  

Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.

I will go ahead and close this case for now, but please let me know if you have any further questions about this.

%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01)); 
b = gpuArray(rand(n)); 

gputimeit(@()transpose_mult_fun(a, b))
gputimeit(@()transpose_mult_fun_2(a, b))

function out = transpose_mult_fun(in1, in2)

i1 = transpose(in1);
i2 = transpose(in1*in2);

out = transpose(i1*i2);

end

function out = transpose_mult_fun_2(in1, in2)

out = transpose(transpose(in1)*transpose(in1*in2));

end


function test_v2

clear

%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);

tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));

disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])

clearvars -except c1

%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);

tic;
c2 = gather(a * b * a);

disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])

%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);

tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));

disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])

end


Comments
  • Just to be 100% sure, run both tests 100 times and compute the average run time. Also, what are the first and last lines? The function?
  • @PatrickRoberts the matrices are identical for both operations as I am using the same seed for the random number generator (You can see that the final answer is also identical).
  • @AnderBiguri I had tried that and the same behaviour persists. But I posted the simpler version above to keep the code simple and readable.
  • Good answer! Note that for proper timing you'd either loop a thousand times and average over it, or use timeit() to do this in one go. Doesn't really matter for differences of a 1000 times, but e.g. the graph here shows you that individual time tests can vary quite a bit.
  • Thanks! And good guess! However, I believe sparse-sparse routines are not optimized for GPU, since they cannot be efficiently threaded. This may explain your observation. Based on my phone calls with Mathworks, I think the difference in performance is related to only one of the row-major and col-major sparse-dense being threaded (but not both). @Ander Biguri touched on this also in his answer.
  • Thanks for your response. Based on my conversations with mathworks, they are using CUDA libraries now. Just posting this here so others can know.
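
Several comments above suggest repeating the measurement and averaging. A minimal sketch of that is shown here. It is not taken from the thread: the run count nRuns and the variable names t and c are arbitrary choices for illustration, and the code assumes the Parallel Computing Toolbox and a supported GPU. It calls wait(gpuDevice) before tic and before toc so that pending GPU work does not leak into, or out of, the timed region.

```matlab
% Sketch: average the time of a*b*a over several runs on the GPU.
rng(1);
a = gpuArray(sprand(3000, 3000, 0.1));   % sparse gpuArray
b = gpuArray(rand(3000, 3000));          % full gpuArray

nRuns = 10;                              % arbitrary number of repetitions
t = zeros(nRuns, 1);
for k = 1:nRuns
    wait(gpuDevice);                     % finish pending GPU work first
    tic;
    c = a*b*a;
    wait(gpuDevice);                     % make sure the product has finished
    t(k) = toc;
end

fprintf('a*b*a over %d runs: mean %gs, min %gs, max %gs\n', ...
    nRuns, mean(t), min(t), max(t));
```

The first iteration may include one-time setup cost, so comparing the mean with the minimum gives a rough idea of how much the individual runs vary.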