Quote:
    Quote:
        Does any of your code generate 128-bit SSE/SSE2 instructions?
    Yes, why?
Because I've done some more digging and have a possible reason why you see different CPU load changes than I do, including why you said you didn't see much benefit from removing the extra calculations.
If the instructions are double-precision, that matters: K8, which covers Athlon64 and Athlon64 X2, only has a 64-bit wide SSE pathway. To handle a 128-bit DP instruction, the K8 core has to split it in half and process the two halves separately, so 128-bit DP instructions coming into the pipeline may go through a split-process-reassemble chain.
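To make the terminology concrete, here's a minimal sketch of what a 128-bit double-precision operation looks like at the SSE2 intrinsic level. This is illustration only, not your code (IPP may emit equivalent instructions internally). The packed ADDPD form is the one K8 has to crack into two 64-bit halves; the scalar ADDSD form already fits the 64-bit pathway.

Code:
/* Sketch: packed (128-bit) vs. scalar (64-bit) double-precision SSE2.
 * On K8 the packed form is split into two 64-bit halves internally;
 * on Core2 it executes as a single full-width operation.
 */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(1.5, 2.5);   /* high = 1.5, low = 2.5 */
    __m128d b = _mm_set_pd(3.0, 4.0);   /* high = 3.0, low = 4.0 */

    /* 128-bit DP add (ADDPD): both doubles added at once. */
    __m128d packed_sum = _mm_add_pd(a, b);

    /* 64-bit DP add (ADDSD): only the low double is added,
     * the high half is carried over from a. */
    __m128d scalar_sum = _mm_add_sd(a, b);

    double p[2], s[2];
    _mm_storeu_pd(p, packed_sum);  /* p[0] = 6.5, p[1] = 4.5 */
    _mm_storeu_pd(s, scalar_sum);  /* s[0] = 6.5, s[1] = 1.5 */
    printf("packed: %f %f\n", p[0], p[1]);
    printf("scalar: %f %f\n", s[0], s[1]);
    return 0;
}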
So, in addition to being 128-bit, are the numbers also double-precision? That ties back to the post on the Intel forum, btw, and may include whatever IPP calls you use.
These low-level architectural differences can't really be compensated for by reducing the number of cores you have running or by slowing the processor down, because your Core2 system handles SSE(x) very differently internally; you'd have to be able to throttle a Core2 system down at the XMM / SSE register level to make it comparable.
Bottom line here: I'm not asking you to rewrite things to only generate 64-bit chunks, just to re-examine removing the unneeded steps in the various filters, because for every unneeded 128-bit operation, K8 may be doing roughly four times the work (split-process-process-combine vs. a single process). See the sketch below for the kind of thing I mean.
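For what I mean by an "unneeded step", here's a purely hypothetical sketch. I haven't seen your filter code, and the gain/offset filter below is made up for illustration: the loop-invariant vector setup in the first version costs a couple of extra 128-bit DP ops per iteration on Core2, but on K8 each of those extras is also getting split and recombined.

Code:
/* Hypothetical example only: hoisting loop-invariant 128-bit DP work
 * out of a filter loop. Odd-length tails are ignored for brevity.
 */
#include <emmintrin.h>
#include <stddef.h>

void filter_before(double *buf, size_t n, double gain, double offset)
{
    for (size_t i = 0; i + 2 <= n; i += 2) {
        /* Unneeded steps: gain/offset vectors rebuilt every iteration. */
        __m128d g = _mm_set1_pd(gain);
        __m128d o = _mm_set1_pd(offset);
        __m128d x = _mm_loadu_pd(&buf[i]);
        _mm_storeu_pd(&buf[i], _mm_add_pd(_mm_mul_pd(x, g), o));
    }
}

void filter_after(double *buf, size_t n, double gain, double offset)
{
    /* Same math, invariant setup done once outside the loop. */
    __m128d g = _mm_set1_pd(gain);
    __m128d o = _mm_set1_pd(offset);
    for (size_t i = 0; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(&buf[i]);
        _mm_storeu_pd(&buf[i], _mm_add_pd(_mm_mul_pd(x, g), o));
    }
}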