Some tips for integrating small bits of CUDA code into larger codebases

Author: icyveins7
In some recent work, I integrated some CUDA code I had developed into a larger existing codebase written for the CPU. Essentially, my module(s) would accelerate and replace some existing functionality, but were only a small cog in the machine.
This is likely a familiar situation for many others, so hopefully the lessons I document here will be useful to those who chance upon this post.
The obvious bits
During kernel development, I pulled out the necessary functionality and experimented on it in a small standalone project, like any sane person. As with most of my personal projects, this let me use CMake, but the larger codebase was built on (mostly handmade) raw Makefiles.
Inspecting the `compile_commands.json` produced by CMake will reveal two common compiler options used for all CUDA code:

- `-forward-unknown-to-host-compiler`, which does what it says on the tin. You can also use the lazier form of this, `-forward-unknown-opts`, though I don't usually bother until linker errors start showing up.
- `-x cu` for every source file. This isn't technically required for `.cu` files, since `nvcc` will recognise them as CUDA files, but it becomes important when you place your eventual code into existing `.cpp` files. This ensures they are compiled as CUDA.
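For concreteness, here's a hedged sketch of how these flags might appear in a hand-written Makefile rule (variable and file names are illustrative, not taken from the actual codebase):

# illustrative Makefile fragment
NVCC      := nvcc
NVCCFLAGS := -forward-unknown-to-host-compiler -x cu
# compile an existing .cpp that now contains CUDA code with nvcc
mymodule.o: mymodule.cpp
	$(NVCC) $(NVCCFLAGS) -c $< -o $@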
Relocatable device code requirements
The codebase is also partitioned into many distinct libraries, compiled individually and then linked together at the end to create one final executable. Of course, my standalone project with its multiple small individual executables and unit tests didn't need any extra compiler flags, but integrating the code as-is made `nvcc` cry at me.
This may have been partly due to the structure of the codebase itself, but the solution was fairly simple:

- Add `-dc` to `nvcc`'s compiler flags, turning on relocatable device code with the expectation of linkage later.
We will be returning to the docs for `nvcc` often; get used to it, because that's how I (and you) will figure out what flags you need to add.
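To make the flags above concrete, here's a hedged sketch of one way the compile and link steps might be wired up with relocatable device code (file names are illustrative; the actual Makefile rules will differ):

# compile each CUDA-containing translation unit with relocatable device code
nvcc -dc -x cu -forward-unknown-to-host-compiler mymodule.cpp -o mymodule.o
# simplest option: let nvcc drive the final link, so it performs the device link step itself
nvcc mymodule.o main.o -o app
# alternatively, device-link explicitly and hand the result to the host linker
# nvcc -dlink mymodule.o -o mymodule_dlink.o
# g++ main.o mymodule.o mymodule_dlink.o -lcudadevrt -lcudart -o app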
Macro pollution
This compiler error was strictly due to pre-existing code in the codebase, so you may not experience anything similar. In fact, it is not strictly a CUDA-related issue, but could occur with any code.
The observed error was something like this, upon inclusion of anything that used `cub`, either directly or indirectly (e.g. via `thrust` headers):
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin/../include\cub/warp/specializations/warp_scan_shfl.cuh(408): error: identifier "input" is undefined
__declspec(__device__) __forceinline InclusiveScanStep( input, ScanOpT scan_op, int first_lane, int offset)
^
Note that this never occurred in my standalone project, so it was odd that these headers would suddenly cause problems here.
As mentioned above, this isn't specifically a CUDA issue, but I'll show how I investigated it in the context of using `nvcc`.
Read the preprocessor output
We can dump all intermediate output from `nvcc` via `-keep`. This lets you see, among many other things, the preprocessor output in a file usually suffixed with `.ii`.
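For example, something like the following (a hedged sketch; the source name is illustrative) will leave the intermediates in the current directory, or wherever `--keep-dir` points:

# dump the preprocessed .ii files (and other intermediates) for inspection
nvcc -dc -keep -x cu codebaseclass.cpp -o codebaseclass.o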
Opening it in my case showed that the input argument's template argument, `_T`, had somehow been stripped away.
// note: not the exact same function as the above one, but obvious that the first template argument is gone
template <typename , typename ScanOpT>
__declspec(__device__) __forceinline
InclusiveScanStep( input, ScanOpT scan_op, int first_lane, int offset, ::cuda::std::true_type )
{
return InclusiveScanStep(input, scan_op, first_lane, offset);
}
This smells of macro replacement, but where is it coming from? Searching the original preprocessor output file for `#define _T` didn't seem to yield anything.
I had to enable the `gcc`-specific option `-dD` in order to keep the `#define` directives in the output. Then I was able to track down an include file several levels up which happened to do exactly `#define _T`. Welp, there go all the `cub` headers, I guess.
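Since `-dD` is a host-compiler option rather than an `nvcc` one, it has to be forwarded. A hedged sketch, assuming a gcc host compiler and a Unix-y shell:

# forward -dD so that #define directives survive into the kept .ii output
# (either via -Xcompiler as below, or via the forwarding flags from earlier)
nvcc -dc -keep -Xcompiler -dD -x cu codebaseclass.cpp -o codebaseclass.o
# then search the kept preprocessor output for the culprit
grep -n "#define _T" *.ii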
A simple `#undef _T` in my translation unit and I was back on track.
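In other words, something along these lines at the top of the affected translation unit:

// undo the legacy codebase's macro before any cub/thrust headers are pulled in
#undef _T
#include <thrust/device_vector.h>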
Hiding CUDA from the rest of the codebase
Congrats on making it this far down the post. Below lies the meat of the problems I had. Almost all of my issues stemmed from having to do this, simply because refactoring everything else would have been too gargantuan a task.
I've already mentioned above that compiling `.cpp` files as CUDA requires the `nvcc` option `-x cu`; in fact, this is what CMake does overtly, even for `.cu` files, once you use

set_source_files_properties(sourcefile.cu PROPERTIES LANGUAGE CUDA)

as seen in many CUDA examples.
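For reference, a hedged sketch of the CMake side (target and file names are illustrative), where the same mechanism is used to compile an existing `.cpp` as CUDA:

# illustrative CMakeLists.txt fragment; assumes CUDA is enabled as a language
add_library(mymodule mymodule.cpp kernels.cu)
# compile the existing .cpp (which now contains CUDA code) as CUDA,
# which makes CMake pass -x cu to nvcc for it, as described above
set_source_files_properties(mymodule.cpp PROPERTIES LANGUAGE CUDA)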
Why you should specify source file compilation explicitly
Aside from the fact that your code probably won't compile, let's go into exactly why that actually happens, and why using your host compiler doesn't work:
- Device code will not be parsable. This should be obvious, but it is important to note that this will also happen if you have a header-only template calling a kernel, having some `__device__` function, or any other CUDA-specific syntax (see the sketch after this list). This will become important later.
- CUDA headers and libraries will not get found automatically. This is arguably fixable when the project is small, by explicitly adding CUDA header and library paths to the Makefiles (or equivalent), but it quickly gets unwieldy once the project starts getting larger. There is also no guarantee that all the CUDA headers will just work; some may include some `__device__` code, for example.
- Hyper-aggressive inlining of code. This is, of course, preferable in most instances of device code, but it can occasionally cause issues in CPU-side code. Fairly rare, but this was observed for one file in my case.
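As a purely hypothetical illustration of the first point, a header-only helper like the one below is fine under `nvcc`, but a plain host compiler has no idea what to do with the `<<<...>>>` launch syntax the moment an ordinary `.cpp` includes it:

// gpu_scale.h -- illustrative header-only template that contains CUDA syntax
#pragma once
#include <cuda_runtime.h>

template <typename T>
__global__ void scaleKernel(T* data, T factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

template <typename T>
void scaleOnDevice(T* d_data, T factor, int n)
{
    // this launch only parses when the including file is compiled as CUDA
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, factor, n);
}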
The above probably isn't a definitive list, but hopefully it's convincing enough. These points are all underscored by a general software idea, though: separation of concerns. We shouldn't use `nvcc` for things unless strictly necessary.
But including CUDA headers implies parsing them
Here's the problem. You've already developed some CUDA code: some kernels, and some encapsulated classes to help call them and/or manage some device-side state and workspaces. Perhaps you also want some of the external code to hold on to some memory through a `thrust::host_vector` or `thrust::device_vector`.
Normally, it wouldn't matter much (unless you are extremely pedantic) where you included headers like `thrust/device_vector.h`. However, with our `nvcc`-specific constraints, this becomes a bit more nuanced.
You can no longer make these includes in the headers of your external code, or else all other translation units which transitively include the header will need to be compiled by `nvcc`.
This is probably not a good idea, even if you think you are okay with the maintenance.
A transitive inclusion example
I can give an MWE (minimal working example) of my scenario. Consider the following implementation for a class in the larger codebase:
// codebaseclass.h
#include <thrust/device_vector.h>
class CodebaseClass
{
// ...
private:
thrust::device_vector<int> m_vec;
};
// codebaseclass.cpp
#include "codebaseclass.h"
// all the implementation details for CodebaseClass in here
// ...
Now we set our build system to use `nvcc` for `codebaseclass.cpp`, and we should be good to go, right?
No, because you forgot that somewhere else in the codebase, something else uses this header:
// othercodebase.h
#include "codebaseclass.h"
// ...
// othercodebase.cpp
#include "othercodebase.h"
// ...
Now, when the (host) compiler processes the translation unit for `othercodebase.cpp`, it pulls in `codebaseclass.h` as well, and doesn't know how to deal with the `thrust` vector.
pImpl-ing our way through
The pImpl pattern isn't new, nor is it technically meant for purposes like this; it is meant to hide implementation details from a client interface. Here, we're using it to hide those details from our compilers.
The idea is to simply forward declare our CUDA classes (including any that we have defined ourselves that use CUDA things), and store only pointers to these objects, rather than the objects themselves.
// codebaseclass.h
#include <memory>

// forward declarations only -- no thrust headers here
// note: these must match thrust's actual template parameter lists (including
// the allocator), or they will clash with the real header later; check your
// thrust version
namespace thrust {
template <typename T> class device_allocator;
template <typename T, typename Alloc> class device_vector;
}

class CodebaseClass
{
public:
    CodebaseClass();
    ~CodebaseClass();
private:
    std::unique_ptr<thrust::device_vector<int, thrust::device_allocator<int>>> m_vec;
};

// codebaseclass.cpp
#include "codebaseclass.h"
#include <thrust/device_vector.h>
// ... use the device vector as per normal, just that it's via a pointer
Now, transitively including `codebaseclass.h` into `othercodebase.h` does not cause problems, because to the compiler a pointer is simply an address. It assumes that wherever you actually need to use the pointed-to object, the full definition will be available.
Yes, technically pImpl is a lot more than this, and usually involves a class-internal forward declaration which gets inherited from. We don't really need all that here, and pImpl-ing has a nicer ring to it than 'forward-declaring'.
Why this works, and when it doesn't
This goes back to compiler fundamentals, and essentially boils down to what information the compiler needs at the points where a complete type is required.
When the entire class object is a direct member of another class, like I wrote originally, the compiler must know how to construct and destroy it, because its inclusion as a member variable implies that it must (for example) be default constructed along with the enclosing class. Without the class's header/implementation, `thrust/device_vector.h` in this case, it will be unable to do this.
Similarly, you wouldn't be able to call any methods directly without first having its definition, e.g. via an inline function that calls `m_vec->reserve()`. In my example above, we just store a pointer to it, which is basically saying "store an address first, I'll tell you what it does later, don't worry about it".
Now there are a few additional things to note when we implement this pattern. Notice how I did not define the destructor inline within the header. This isn't an arbitrary decision; the code would not compile had we done so, i.e. had we written the innocuous-looking `~CodebaseClass(){}` in the header, even if we were going to do exactly that in the source file anyway!
Why does this happen? Because the compiler will emit the full code to destroy all member objects at the point where the destructor is defined. So if you define it inline in the header, it must know how to destroy `thrust::device_vector` there, and since we only forward declared it, it will not know how to.
The two options in this case are to:
- Define the destructor out-of-line, i.e. in the associated `.cpp`, by which point the compiler has full knowledge of the `thrust::device_vector` type. This is particularly important for polymorphic classes since the compiler does not auto-generate a virtual destructor, so you must always at least declare it in the header, but not define it there.
- Simply not declare it at all. The compiler will then emit it wherever it is needed; as long as that only happens in the `.cpp` (where the full definition of `thrust::device_vector` is visible), this works out the same as the above.
Note that the second option is not equivalent to writing `~CodebaseClass() = default;` in the header, as the compiler will then attempt to emit it in the header again, and the same problem will arise. You can, however, declare it in the header and then write `CodebaseClass::~CodebaseClass() = default;` in the source `.cpp` file.
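For concreteness, a minimal sketch of the out-of-line approach, assuming the header from the pImpl example above:

// codebaseclass.cpp
#include "codebaseclass.h"
#include <thrust/device_vector.h>

CodebaseClass::CodebaseClass() = default;

// defined here, where thrust::device_vector is a complete type, so the
// compiler can emit the code that destroys m_vec
CodebaseClass::~CodebaseClass() = default;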
If you really want the full pImpl experience...
Then you can use constructor inheritance to get you most of the way there. In these scenarios you would actually forward declare an `Impl` class or struct for whatever it is you want to hide.
A lot of the time, you don't really need to care about the constructor for this hidden implementation, and you just want to use the hidden class's constructors directly. Then something like this in the `.cpp` would work:
struct Impl : public RealClass
{
    using RealClass::RealClass; // this inherits all of RealClass's constructors
    // ...
};
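For completeness, a hedged sketch of the matching header (here `Impl` is forward declared at namespace scope so that it lines up with the snippet above; in a real codebase you would probably nest it inside the class instead, and `RealClass` stands in for whatever CUDA-facing class is being hidden):

// codebaseclass.h -- no CUDA headers needed here at all
#include <memory>

struct Impl; // defined in the .cpp, where it inherits from the real CUDA class

class CodebaseClass
{
public:
    CodebaseClass();
    ~CodebaseClass(); // declared here, defined out-of-line as discussed above

private:
    std::unique_ptr<Impl> m_impl;
};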
Again, this isn't a strictly CUDA thing, but it's useful to remember it in this context.