Tuesday, June 29, 2010

Faster Faster Faster

Our machines, even when hitting an apparent performance peak, only run at one small fraction of their true potential speed. I feel that today's computers and their software are OK for routine spectra. I couldn't ask for more. Other spectra are quite large, however, and I must wait a few seconds during processing. Without going into the third dimension, consider these novel experiments to measure long range heteronuclear Js. Each row contains at least 4096 points. Quite likely we are going to see larger rows in the next few years. The time required to compute the FFT is in the order of the seconds. It would be great if we could half this time. The solution is public since 2008 and freely available. It must also be well-know, because I have initially found it on Wikipedia.
What they say, in practice, is that if you use the GPU (the graphic chip) instead of the CPU (the main brain of the computer)...
In this work we present a novel implementation of FFT on GeForce 8800GTX that achieves 144 Gflop/s that is nearly 3x faster than best rate achieved in the current vendor’s numerical libraries.

Another paper says:
We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2--4x over CUFFT and 8--40x improvement over MKL for large sizes.

The source code (to be compiled), is available on another site. What upsets me is the ReadMe file:
Currently there are a few known performance issues (bug) that this sample has discovered in rumtime and code generation that are being actively fixed. Hence, for sizes >= 1024, performance is much below the expected peak for any particular size. However, we have internally verified that once these bugs are fixed, performance should be on par with expected peak. Note that these are bugs in OpenCL runtime/compiler and not in this sample.

Maybe they have already found the solution to this problem.
There's another issue, however: how long does it take to move the matrix from the main memory to the GPU and back? I presume that loading the columns will take much more time than loading the rows. In this case it is better to transpose the matrix between the two FFTs (just like when we use the CPU). Eventually, the bottleneck will the the transposition, not the FT.