Go is great at infrastructure.
It gives us fast builds, simple deployment, lightweight concurrency, and the ability to ship a single binary.
But Go has always been awkward around one increasingly important area: GPUs.
A lot of modern AI, analytics, vector processing, and high-throughput data work now runs on NVIDIA GPUs through CUDA. The problem is that most CUDA access from application code still lives in the Python world.
This post is about why calling CUDA from Go matters, why cgo is often painful, and what a pure-Go runtime-loaded CUDA Driver API approach can look like.
The problem
Many production backend systems are written in Go.
But most GPU tooling is centered around Python libraries like:
- PyTorch
- TensorFlow
- JAX
- CuPy
That often creates an architecture like this:
Go service
-> HTTP/gRPC
-> Python GPU worker
-> CUDA
-> Python GPU worker
-> Go service
This works, but it adds:
- another service
- another runtime
- serialization overhead
- extra deployment complexity
- extra observability/debugging surface
Sometimes the Python service exists only because the Go service cannot easily talk to CUDA directly.
Why not just use cgo?
The usual way to call native libraries from Go is cgo.
For CUDA, that might look like this:
// #cgo LDFLAGS: -lcuda
// #include <cuda.h>
import "C"
That works, but it changes the Go developer experience.
Now you need:
- a C compiler
- CUDA headers
- CUDA libraries available at build time
- more fragile CI builds
- harder cross-compilation
- platform-specific linking behavior
One of Go’s best properties is this:
CGO_ENABLED=0 go build ./...
A clean binary.
No C toolchain.
No build-time CUDA dependency.
So the interesting question is:
Can Go talk to CUDA without cgo?
Yes — by loading the CUDA Driver API dynamically at runtime.
Runtime-loading CUDA
CUDA exposes a Driver API through the NVIDIA driver library.
On Linux:
libcuda.so.1
On Windows:
nvcuda.dll
Instead of linking to CUDA at build time, a Go program can open the driver library at runtime and bind the symbols it needs.
Conceptually:
Go binary
-> open libcuda.so.1
-> find cuInit
-> find cuDriverGetVersion
-> call CUDA Driver API
The important part is that the binary can still be built like this:
CGO_ENABLED=0 go build ./...
CUDA only needs to exist on the machine where the program actually runs.
Minimal example: initialize CUDA from Go
A small example could look like this:
package main
import (
"fmt"
"github.com/eitamring/gocudrv/cuda"
)
func main() {
driver, err := cuda.Open()
if err != nil {
panic(err)
}
defer driver.Close()
if err := driver.Init(0); err != nil {
panic(err)
}
version, err := driver.DriverVersion()
if err != nil {
panic(err)
}
fmt.Println("CUDA driver version:", version)
}
Example output:
loading CUDA driver: libcuda.so.1
cuInit(0)
DriverVersion() = 12040
CUDA driver version: 12040
This is not ML.
This is the foundation: initialize CUDA, call Driver API functions, then build up to memory management, PTX loading, and kernel launches.
Loading PTX
CUDA kernels can be compiled into PTX.
A simplified PTX kernel might look like this:
.version 7.8
.target sm_75
.address_size 64
.visible .entry vecAdd(
.param .u64 a,
.param .u64 b,
.param .u64 c,
.param .u32 n
)
{
ret;
}
Go can load that PTX module at runtime:
module, err := ctx.LoadPTX(ptxBytes)
if err != nil {
panic(err)
}
kernel, err := module.Function("vecAdd")
if err != nil {
panic(err)
}
fmt.Println("PTX loaded successfully")
Example output:
PTX loaded successfully
Launching a CUDA kernel from Go
Once the PTX is loaded, Go can launch the kernel directly:
err = kernel.Launch(
cuda.GridDim{X: 1024, Y: 1, Z: 1},
cuda.BlockDim{X: 256, Y: 1, Z: 1},
0,
stream,
args,
)
if err != nil {
panic(err)
}
Runtime logs might look like this:
using device 0: NVIDIA GeForce RTX 4090
loading PTX module...
PTX loaded successfully
kernel launch configuration: grid=1024 block=256
launch successful
execution completed
elapsed: 0.186 ms
That is the core idea:
Go -> CUDA Driver API -> GPU
No Python sidecar.
No cgo.
No build-time CUDA toolkit.
Example: vector addition
A simple starter workload is vector addition.
Input:
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
Expected output:
c = [11, 22, 33, 44]
The Go-side flow looks like this:
aDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer aDev.Free()
bDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer bDev.Free()
cDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer cDev.Free()
if err := ctx.MemcpyHtoD(aDev, aHost); err != nil {
panic(err)
}
if err := ctx.MemcpyHtoD(bDev, bHost); err != nil {
panic(err)
}
err = kernel.Launch(
cuda.GridDim{X: blocks},
cuda.BlockDim{X: threads},
0,
stream,
[]cuda.KernelArg{
cuda.DevicePtrArg(aDev),
cuda.DevicePtrArg(bDev),
cuda.DevicePtrArg(cDev),
cuda.Uint32Arg(uint32(n)),
},
)
if err != nil {
panic(err)
}
if err := ctx.MemcpyDtoH(cHost, cDev); err != nil {
panic(err)
}
This is intentionally low-level.
It gives Go access to the same CUDA primitives used elsewhere:
- allocate device memory
- copy host memory to device memory
- load a module
- launch a kernel
- copy device memory back to host memory
- synchronize
What this is useful for
This is not trying to replace PyTorch.
PyTorch is still the right tool for training models and high-level ML research.
The better use cases for CUDA from Go are infrastructure workloads:
- custom inference kernels
- vector search acceleration
- embeddings pipelines
- columnar data processing
- compression/decompression experiments
- image/video processing
- batch scoring
- database or analytics engine experiments
For example:
HTTP request
-> parse vectors in Go
-> copy batch to GPU memory
-> launch similarity kernel
-> copy result back
-> return response
No Python worker required.
The tradeoff
Runtime-loading CUDA is not magic.
You still need:
- an NVIDIA GPU
- a compatible NVIDIA driver
- compiled PTX
- careful memory management
- enough batch size to justify GPU overhead
Small workloads can be slower than CPU code because you pay for:
- memory transfers
- kernel launch overhead
- synchronization
- device setup
A useful rule of thumb:
Tiny batch:
CPU wins
Large batch:
GPU may win
Repeated large batch:
GPU likely wins
The goal is not “GPU everything.”
The goal is to make GPU access available to Go when it actually makes sense.
Why CGO_ENABLED=0 is the interesting part
The technical detail I care about most is not just “Go can call CUDA.”
It is this:
CGO_ENABLED=0 go build ./...
That means you can build without:
- CUDA headers
- the CUDA toolkit
- a C compiler
- cgo
- build-time GPU dependencies
Then, at runtime:
try loading libcuda.so.1
if available:
use GPU path
else:
fall back to CPU path
That fits infrastructure software much better.
The binary does not need to be built on a GPU machine. It only needs the NVIDIA driver on the machine where GPU execution actually happens.
Final thought
AI made GPUs mainstream, but GPUs are not only for AI.
They are throughput machines.
They are useful when you have large batches of repetitive work: matrix math, vector math, scans, filters, hashing, encoding, simulation, image processing, or data conversion.
Go already owns a lot of backend infrastructure.
So it makes sense for Go to have better access to accelerated computing.
Not through a Python sidecar.
Not through a fragile cgo setup.
But through a simple Go API that can load the CUDA Driver API when available.
That is the experiment behind https://github.com/eitamring/gocudrv
