Registered Member
Is GPU computation support foreseen any time in the future?
Registered Member
That's never been discussed. I have no clue about GPU programming, but I'm pretty sure that Gael is an expert on that...
One thing to consider before starting: in my understanding, GPUs are massively multi-core (on the order of 50 cores), so leveraging a GPU means first of all parallelizing. I can see two approaches for this, each with different use cases.

1) You have one large matrix-matrix operation to perform on the GPU. The first step towards this kind of GPU support is to parallelize Eigen itself, i.e. start adding OpenMP support without thinking about the GPU. Eigen is a great candidate for that because, thanks to expression templates, it has only very few loops and they are centralized. But we've looked at OpenMP 2 in the past and it's non-trivial to leverage: many things can be done wrong, and bad parallelization can easily be worse than no parallelization. One example is that OpenMP 2 doesn't help much in making parallelization-inside-Eigen play well with parallelization-inside-the-app-using-Eigen. Another remark: all that would only fully leverage the GPU for large enough matrices, i.e. the size should be at least the number of GPU cores.

2) You have many small, mutually independent matrix/vector operations to perform on the GPU. You can then do the parallelization in your app, and Eigen doesn't have to worry about it. That may be the easiest kind of GPU support if parallelization inside Eigen turns out to be a headache. Note that Eigen is stateless, so if you have different matrices you can operate on them concurrently without any conflict (see the sketch below).

Cheers, Benoit
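A minimal sketch of approach 2, assuming the application parallelizes with OpenMP itself; the loop and matrix sizes are illustrative, not an Eigen API (note the aligned allocator, which fixed-size vectorizable Eigen types need inside std::vector):

```cpp
#include <vector>
#include <Eigen/Dense>

// Approach 2: the application distributes many small, independent
// products over its own threads; Eigen runs serially inside each
// iteration. Eigen is stateless, so distinct matrices can be used
// from different threads without conflicts.
int main()
{
    const int n = 1000;
    typedef std::vector<Eigen::Matrix4f, Eigen::aligned_allocator<Eigen::Matrix4f> > MatrixVec;
    MatrixVec as(n), bs(n), cs(n);
    for (int i = 0; i < n; ++i) { as[i].setRandom(); bs[i].setRandom(); }

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        cs[i] = as[i] * bs[i];   // each thread touches its own matrices only

    return 0;
}
```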
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
Registered Member
I would suggest Intel's Threading Building Blocks library to anyone interested in parallelizing C++ code, as an alternative to OpenMP. We've been using it here at work for a while and it's quite nice to work with.
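A sketch of the same pattern with TBB, for comparison (the function and typedef names are illustrative; tbb::parallel_for with a lambda needs a C++11-capable compiler):

```cpp
#include <vector>
#include <Eigen/Dense>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

typedef std::vector<Eigen::Matrix4f, Eigen::aligned_allocator<Eigen::Matrix4f> > MatrixVec;

// Many independent small products, distributed over cores by TBB's
// work-stealing scheduler; Eigen itself stays serial inside each chunk.
void multiplyAll(const MatrixVec& as, const MatrixVec& bs, MatrixVec& cs)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, as.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                cs[i] = as[i] * bs[i];
        });
}
```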
Registered Member
Well, I added this because I think Eigen is perfect for multithreading on the GPU, since modern GPUs have more than 100 cores. In our company we use a lot of matrix-matrix operations for OCR computations, and a colleague of mine rewrote the code to use CUDA from NVIDIA. But now that OpenCL is planned, Eigen could use the OpenCL framework for BLAS operations. What is great is that OpenCL is a standard and many GPU vendors will support it, so no matter which graphics card one uses, Eigen could make use of it.
Rikardo
Moderator
Interesting, someone just told me the same thing yesterday, so it sounds like opinions converge here... However, what is the license of TBB? About the support of GPU and/or multi-CPU: since this is a very developer-oriented topic, I will give my thoughts on the mailing list. In short, this is a very interesting topic which could even become a small research project. But don't expect to see it appear soon... Gael.
Registered Member
Indeed, it's a good idea to keep all the devel talk on the mailing list.
Moderator
Hm... actually, let's put them here:
First, let's address the case of a single huge computation, say for matrices bigger than 1k x 1k (at least). The easiest option would be to simply wrap an existing GPU BLAS library such as NVidia's. There is also a lot of interesting stuff there: http://www.nvidia.com/object/cuda_home.html#state=home

However, relying on a BLAS API is not very interesting for us, because we lose the power of expression templates, and I'm pretty sure that on the GPU avoiding temporaries is even more rewarding than on the CPU. So we really want something more sophisticated. To generate GPU code from our expressions we have the choice between multiple technologies:

- the old-school way using OpenGL and fragment shaders => no way! (very long list of arguments here...)
- CUDA: NVidia only => no way. A more technical argument against CUDA: with CUDA we would have to generate CUDA code from our expressions, compile it using external tools (a mix of nvcc and gcc), generate binary code and link it with the user's executable... a very tough challenge!
- OpenCL: this is the best candidate. I know there is no real implementation available yet, but that's not an issue for us since we are not going to really investigate GPU support soon anyway. From the technical point of view, unlike CUDA, OpenCL code can be compiled on the fly by the runtime library. Therefore, in operator=() we could easily generate OpenCL code from the expression, ask the OpenCL library/driver to compile it, and run it. To avoid recompiling the code every time, we could store the generated GPU binary code in a global map indexed by the typename of the expression (typeid(ExprType).name()).

Then we need a smart mechanism for temporary matrices such that they are only allocated in GPU memory and never copied to main memory. A related idea: it would be cool to be able to create matrices which live only on the GPU (so one more option bit for Matrix). The goal is to reduce unnecessary transfers between the CPU and the GPU. I'm probably missing several details, but I'm confident this could be done. Of course, it is a lot of work.

Another major advantage of OpenCL is that it is not limited to GPUs, and I think that with minor additional effort we could use such an "OpenCL backend" to support multi-CPU, SPE, etc. as well. So perhaps it will be worth investigating this option further when the question of multi-CPU support comes back to the top of the stack.

Another remark: from the API point of view, I think it is fundamental to provide a mechanism allowing the user to control, for each expression, whether Eigen is allowed to use the GPU backend, a multi-CPU version, etc. I see two options (sketched below):

- a global enable/disable mechanism;
- another option is to force the user to call a special eval() function every time he wants a multi-CPU, GPU, or whatever evaluation.

About the case of many small computations (most likely fixed-size): the problem is very, very different and I don't think there is much to do on Eigen's side. What would be interesting, however, is to be able to use Eigen's API and internal mechanisms inside OpenCL, but since OpenCL is "C + extensions" and not "C++ + extensions", there is no way to do that.

That's it for now... Gael.
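To make the kernel-cache idea concrete, here is a hypothetical sketch; GpuKernel, generateOpenCLSource() and compileWithOpenCL() are invented names for illustration, not an existing Eigen or OpenCL API:

```cpp
#include <map>
#include <string>
#include <typeinfo>
#include <utility>

struct GpuKernel { /* would wrap a compiled cl_program/cl_kernel pair */ };

// Hypothetical helpers (declarations only): turn an expression tree into
// OpenCL source text and hand it to the runtime compiler.
template <typename ExprType> std::string generateOpenCLSource(const ExprType&);
GpuKernel compileWithOpenCL(const std::string& source);

// Would be called from operator=(): each distinct expression *type* is
// compiled at most once per run, then looked up in a global cache keyed
// by its mangled typename.
template <typename ExprType>
const GpuKernel& kernelFor(const ExprType& expr)
{
    static std::map<std::string, GpuKernel> cache;
    const std::string key = typeid(ExprType).name();
    std::map<std::string, GpuKernel>::iterator it = cache.find(key);
    if (it == cache.end())
        it = cache.insert(std::make_pair(key,
                 compileWithOpenCL(generateOpenCLSource(expr)))).first;
    return it->second;
}
```

And the two API options might look like this (again, purely invented names):

```cpp
// Option 1: a global switch.
Eigen::setBackend(Eigen::GpuBackend);  // hypothetical
c = a * b;                             // subsequently evaluated on the GPU

// Option 2: an explicit, per-expression request.
c = (a * b).evalOnGpu();               // hypothetical special eval() variant
```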
Registered Member
Just out of curiosity, don't you think that in the long run optimizing compilers like the Portland Group's PGI will do the job for you?
Did I understand correctly that with OpenCL you link your program against a library which handles the compilation at runtime, but you still get directly "jumpable" code afterwards?
'And all those exclamation marks, you notice? Five? A sure sign of someone who wears his underpants on his head.' ~Terry Pratchett
'It's funny. All you have to do is say something nobody understands and they'll do practically anything you want them to.' ~J.D. Salinger
Moderator
I don't understand what you mean by "jumpable". Basically, with OpenCL you ask the library to compile your code and then ask it to run it. For instance, if the target is a GPU, the library has to communicate with the driver to allocate some resources, upload the code, configure the device, etc.
Registered Member
With "jumpable" I meant jumping in the sense of Assembler. I'm still not sure I understood what happens at runtime and what at compile time. Is it some kind of runtime Interpreter or just-in-time compiler?
Moderator
OK, this is what I guessed. So with OpenCL, nothing happens at compile time: you compile your C/C++ project without any difference, and you just have to link to the OpenCL library. Then, at runtime, you can load/generate a piece of source code written in the OpenCL language (C + some extensions). This gives you an opaque binary object that you can save on disk for further use, or ask OpenCL to run. Theoretically, OpenCL can compile for different backends (GPUs, multiple CPUs, SPUs, etc.). So the code you get might be "jumpable" if and only if it has been generated for the host CPU, but even in that case I don't think you can directly use the generated code without the OpenCL library, because it is the role of the latter to generate the different threads, manage the resources, etc. I don't think the generated code contains this logic, and anyway OpenCL is just a specification, so such details are up to the implementor. I don't know how to explain the principle of OpenCL better, so if you are still puzzled, check the OpenCL specs! To my understanding it is clearly not an interpreter, and it is quite different from a JIT too.
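A minimal sketch of that runtime flow with the standard OpenCL host API (error handling omitted; retrieving the opaque binary for on-disk caching is only outlined):

```cpp
#include <CL/cl.h>

// Source text goes in at runtime, the OpenCL runtime compiles it for the
// chosen device, and the resulting program can be queried for its binary.
cl_program buildFromSource(cl_context ctx, cl_device_id dev, const char* src)
{
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);   // on-the-fly compilation

    size_t binSize = 0;                                // size of the opaque binary
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
    // ... allocate a buffer, query CL_PROGRAM_BINARIES, save it to disk ...
    return prog;
}
```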
Registered Member
Ok, thanks. I hope DirectX 11 will also support it, so we don't end up with badly supported platforms.
Registered Member
Now that several OpenCL implementations have been around for some time and almost every current GPU is supported, has there been any progress in GPU support for Eigen? Or does anybody know of a good wrapper to offload some calculations into OpenCL kernels?
Registered Member
There's no support yet in Eigen for OpenCL. However, I agree that OpenCL support would be a welcome additional feature. I see two ways OpenCL support could be added:

- offer algorithms to be executed on OpenCL devices as backends for solvers (with automatic memory transfers), e.g. for CG, BiCGStab, ... for large sparse matrices etc.;
- as many GPU computations are memory-bound, there should be a mechanism that allows the user to fully control the transfer of data between the host and OpenCL devices and the execution of kernels. This could be done, as Gael suggested, by providing matrices that live on the OpenCL device only.

I think Qt does provide wrappers for OpenCL. You might take a look at ViennaCL, which allows some linear algebra operations to be run on OpenCL devices: http://viennacl.sourceforge.net/
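For instance, a small hedged sketch of offloading one operation with ViennaCL; the copy()/inner_prod() calls follow the ViennaCL 1.x documentation, but check the current docs before relying on them:

```cpp
#include <vector>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/inner_prod.hpp>

// The viennacl::vector objects live in OpenCL device memory; transfers
// between host and device are explicit copy() calls, as requested above.
float gpuDot(const std::vector<float>& x, const std::vector<float>& y)
{
    viennacl::vector<float> gx(x.size()), gy(y.size());
    viennacl::copy(x.begin(), x.end(), gx.begin());    // host -> device
    viennacl::copy(y.begin(), y.end(), gy.begin());
    return viennacl::linalg::inner_prod(gx, gy);       // runs on the device
}
```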
Registered Member
There are several other points to consider here.

First, Eigen is not only used in number-crunching / scientific applications, but also in desktop applications. Many of those will already be multi-threaded, so there should be a good way of controlling Eigen's parallelism, even for distribution-compiled packages (see the sketch below).

Second, in my humble opinion, CPU vectorisation will become more important again with future generations of CPUs. Sandy Bridge and Bulldozer will already feature 256-bit registers, and I don't think we'll have to wait long for 512 bits. With hexa- and octo-core processors, and even more so with future many-core processors, there is a lot of reward for vectorisation. The problem will be to respect the topology of those machines. To really exploit the possibilities, there would probably have to be some kind of scheduling layer smart enough for ccNUMA, GPUs, etc. I know this is utopian, just my 2 cents. Perhaps future OpenMP generations will be able to schedule tasks on GPUs, or OpenCL will be able to generate optimised SIMD instructions, or LLVM will be a perfect abstraction layer for all these architectures. Who knows...
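As an aside, Eigen 3 does expose a knob of roughly this kind for its OpenMP-parallelized kernels; a minimal illustration (see the Eigen multi-threading docs for the exact behaviour):

```cpp
#include <Eigen/Core>

int main()
{
    // An already-threaded application can tell Eigen to stay serial,
    // so the two levels of parallelism don't fight over the cores.
    Eigen::setNbThreads(1);

    // ... application-level threads, each using Eigen serially ...
    return 0;
}
```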