Calling CUDA from Go without cgo

Go is great at infrastructure.

It gives us fast builds, simple deployment, lightweight concurrency, and the ability to ship a single binary.

But Go has always been awkward around one increasingly important area: GPUs.

A lot of modern AI, analytics, vector processing, and high-throughput data work now runs on NVIDIA GPUs through CUDA. The problem is that most application-level access to CUDA still goes through Python libraries.

This post is about why calling CUDA from Go matters, why cgo is often painful, and what a pure-Go approach that loads the CUDA Driver API at runtime can look like.

The problem

Many production backend systems are written in Go.

But most GPU tooling is centered around Python libraries like:

  • PyTorch
  • TensorFlow
  • JAX
  • CuPy

That often creates an architecture like this:

Go service
  -> HTTP/gRPC
  -> Python GPU worker
  -> CUDA
  -> Python GPU worker
  -> Go service

This works, but it adds:

  • another service
  • another runtime
  • serialization overhead
  • extra deployment complexity
  • extra observability/debugging surface

Sometimes the Python service exists only because the Go service cannot easily talk to CUDA directly.

Why not just use cgo?

The usual way to call native libraries from Go is cgo.

For CUDA, that might look like this:

// #cgo LDFLAGS: -lcuda
// #include <cuda.h>
import "C"

That works, but it changes the Go developer experience.

Now you need:

  • a C compiler
  • CUDA headers
  • CUDA libraries available at build time
  • more fragile CI builds
  • harder cross-compilation
  • platform-specific linking behavior

One of Go’s best properties is this:

CGO_ENABLED=0 go build ./...

A clean binary.

No C toolchain.

No build-time CUDA dependency.

So the interesting question is:

Can Go talk to CUDA without cgo?

Yes, by loading the CUDA Driver API dynamically at runtime.

Runtime-loading CUDA

CUDA exposes a low-level Driver API through a shared library that ships with the NVIDIA driver itself, not with the CUDA toolkit.

On Linux:

libcuda.so.1

On Windows:

nvcuda.dll

Instead of linking to CUDA at build time, a Go program can open the driver library at runtime and bind the symbols it needs.

Conceptually:

Go binary
  -> open libcuda.so.1
  -> find cuInit
  -> find cuDriverGetVersion
  -> call CUDA Driver API

The important part is that the binary can still be built like this:

CGO_ENABLED=0 go build ./...

CUDA only needs to exist on the machine where the program actually runs.
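
The first step of that flow, deciding which library file to open, can be sketched in plain Go. This is a minimal sketch: driverLibraryName is a hypothetical helper, and the only facts it encodes are the two library names listed above. Actually opening the library and binding symbols needs a loader that works without cgo, which is left out here.

```go
package main

import (
	"fmt"
	"runtime"
)

// driverLibraryName returns the CUDA driver library file name for an
// operating system, using the two names given in this post.
// It is a hypothetical helper, not part of any library.
func driverLibraryName(goos string) (string, error) {
	switch goos {
	case "linux":
		return "libcuda.so.1", nil
	case "windows":
		return "nvcuda.dll", nil
	default:
		return "", fmt.Errorf("no known CUDA driver library for GOOS=%q", goos)
	}
}

func main() {
	name, err := driverLibraryName(runtime.GOOS)
	if err != nil {
		// On unsupported platforms the program can still run a CPU path.
		fmt.Println("GPU path unavailable:", err)
		return
	}
	fmt.Println("would try to load:", name)
}
```

Returning an error instead of panicking keeps the decision at runtime, so the same binary can run on machines with and without a driver.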

Minimal example: initialize CUDA from Go

A small example could look like this:

package main

import (
    "fmt"

    "github.com/eitamring/gocudrv/cuda"
)

func main() {
    driver, err := cuda.Open()
    if err != nil {
        panic(err)
    }
    defer driver.Close()

    if err := driver.Init(0); err != nil {
        panic(err)
    }

    version, err := driver.DriverVersion()
    if err != nil {
        panic(err)
    }

    fmt.Println("CUDA driver version:", version)
}

Example output:

loading CUDA driver: libcuda.so.1
cuInit(0)
DriverVersion() = 12040
CUDA driver version: 12040
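
The integer in that output is easier to read once decoded. The CUDA Driver API reports versions as 1000 × major + 10 × minor, so 12040 means the driver supports up to CUDA 12.4. A small sketch of the decoding, where decodeDriverVersion is a hypothetical helper and not part of the library used above:

```go
package main

import "fmt"

// decodeDriverVersion splits a CUDA version integer into major and minor
// parts, assuming the 1000*major + 10*minor encoding.
func decodeDriverVersion(v int) (major, minor int) {
	return v / 1000, (v % 1000) / 10
}

func main() {
	major, minor := decodeDriverVersion(12040)
	fmt.Printf("driver supports up to CUDA %d.%d\n", major, minor)
	// prints "driver supports up to CUDA 12.4"
}
```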

This is not ML.

This is the foundation: initialize CUDA, call Driver API functions, then build up to memory management, PTX loading, and kernel launches.

Loading PTX

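CUDA kernels can be compiled into PTX, NVIDIA's intermediate assembly format, which the driver compiles for the installed GPU when the module is loaded.

A simplified PTX kernel, with the signature in place and the body reduced to a bare ret, might look like this: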

.version 7.8
.target sm_75
.address_size 64

.visible .entry vecAdd(
    .param .u64 a,
    .param .u64 b,
    .param .u64 c,
    .param .u32 n
)
{
    ret;
}

Go can load that PTX module at runtime:

module, err := ctx.LoadPTX(ptxBytes)
if err != nil {
    panic(err)
}

kernel, err := module.Function("vecAdd")
if err != nil {
    panic(err)
}

fmt.Println("PTX loaded successfully")

Example output:

PTX loaded successfully

Launching a CUDA kernel from Go

Once the PTX is loaded, Go can launch the kernel directly:

err = kernel.Launch(
    cuda.GridDim{X: 1024, Y: 1, Z: 1},
    cuda.BlockDim{X: 256, Y: 1, Z: 1},
    0,
    stream,
    args,
)
if err != nil {
    panic(err)
}

Runtime logs might look like this:

using device 0: NVIDIA GeForce RTX 4090
loading PTX module...
PTX loaded successfully
kernel launch configuration: grid=1024 block=256
launch successful
execution completed
elapsed: 0.186 ms

That is the core idea:

Go -> CUDA Driver API -> GPU

No Python sidecar.

No cgo.

No build-time CUDA toolkit.

Example: vector addition

A simple starter workload is vector addition.

Input:

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

Expected output:

c = [11, 22, 33, 44]
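
A CPU reference is useful for checking what the kernel returns, and it can double as a fallback path. A minimal sketch, assuming float32 elements (the post does not state the element type) and a hypothetical helper name vecAddCPU:

```go
package main

import (
	"errors"
	"fmt"
)

// vecAddCPU computes c[i] = a[i] + b[i] on the CPU.
// The float32 element type is an assumption made for this sketch.
func vecAddCPU(a, b []float32) ([]float32, error) {
	if len(a) != len(b) {
		return nil, errors.New("vecAddCPU: input lengths differ")
	}
	c := make([]float32, len(a))
	for i := range a {
		c[i] = a[i] + b[i]
	}
	return c, nil
}

func main() {
	c, err := vecAddCPU([]float32{1, 2, 3, 4}, []float32{10, 20, 30, 40})
	if err != nil {
		panic(err)
	}
	fmt.Println(c) // prints "[11 22 33 44]"
}
```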

The Go-side flow looks like this:

aDev, err := ctx.MemAlloc(size)
if err != nil {
    panic(err)
}
defer aDev.Free()

bDev, err := ctx.MemAlloc(size)
if err != nil {
    panic(err)
}
defer bDev.Free()

cDev, err := ctx.MemAlloc(size)
if err != nil {
    panic(err)
}
defer cDev.Free()

if err := ctx.MemcpyHtoD(aDev, aHost); err != nil {
    panic(err)
}

if err := ctx.MemcpyHtoD(bDev, bHost); err != nil {
    panic(err)
}

err = kernel.Launch(
    cuda.GridDim{X: blocks, Y: 1, Z: 1},
    cuda.BlockDim{X: threads, Y: 1, Z: 1},
    0,
    stream,
    []cuda.KernelArg{
        cuda.DevicePtrArg(aDev),
        cuda.DevicePtrArg(bDev),
        cuda.DevicePtrArg(cDev),
        cuda.Uint32Arg(uint32(n)),
    },
)
if err != nil {
    panic(err)
}

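// cDev is only valid to read once the kernel has finished. If this copy is
// not ordered after the work queued on stream, synchronize the stream first.
if err := ctx.MemcpyDtoH(cHost, cDev); err != nil {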
    panic(err)
}
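
The flow above uses size, blocks, and threads without showing where they come from. One common way to derive them is sketched below, assuming float32 elements and one thread per element. Both helper names are hypothetical.

```go
package main

import "fmt"

// bytesPerFloat32 is the element size assumed by this sketch.
const bytesPerFloat32 = 4

// gridSize returns the smallest block count such that
// blocks*threadsPerBlock >= n (ceiling division).
func gridSize(n, threadsPerBlock int) int {
	if n <= 0 || threadsPerBlock <= 0 {
		return 0
	}
	return (n + threadsPerBlock - 1) / threadsPerBlock
}

// bufferSize returns the byte size of a device buffer holding n float32 values.
func bufferSize(n int) int {
	return n * bytesPerFloat32
}

func main() {
	n := 1 << 20 // 1,048,576 elements
	threads := 256
	fmt.Println("blocks:", gridSize(n, threads))    // prints "blocks: 4096"
	fmt.Println("bytes per buffer:", bufferSize(n)) // prints "bytes per buffer: 4194304"
}
```

Because the block count is rounded up, the last block may contain threads whose index is past the end of the data. That is why n is passed to the kernel: each thread has to check its index against n before touching memory.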

This is intentionally low-level.

It gives Go access to the same CUDA primitives used elsewhere:

  • allocate device memory
  • copy host memory to device memory
  • load a module
  • launch a kernel
  • synchronize
  • copy device memory back to host memory

What this is useful for

This is not trying to replace PyTorch.

PyTorch is still the right tool for training models and high-level ML research.

The better use cases for CUDA from Go are infrastructure workloads:

  • custom inference kernels
  • vector search acceleration
  • embeddings pipelines
  • columnar data processing
  • compression/decompression experiments
  • image/video processing
  • batch scoring
  • database or analytics engine experiments

For example:

HTTP request
  -> parse vectors in Go
  -> copy batch to GPU memory
  -> launch similarity kernel
  -> copy result back
  -> return response

No Python worker required.

The tradeoff

Runtime-loading CUDA is not magic.

You still need:

  • an NVIDIA GPU
  • a compatible NVIDIA driver
  • compiled PTX
  • careful memory management
  • enough batch size to justify GPU overhead

Small workloads can be slower than CPU code because you pay for:

  • memory transfers
  • kernel launch overhead
  • synchronization
  • device setup

A useful rule of thumb:

Tiny batch:
CPU wins

Large batch:
GPU may win

Repeated large batch:
GPU likely wins
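
That rule of thumb can be turned into a rough cost model. The sketch below is only an illustration: costModel and its fields are hypothetical, the numbers in main are made up rather than measured, and one-time device setup is left out.

```go
package main

import "fmt"

// costModel holds per-machine figures. Every field is a hypothetical
// parameter that has to be measured on the target machine.
type costModel struct {
	fixedOverheadSec    float64 // launch and synchronization cost per batch
	transferBytesPerSec float64 // host <-> device copy bandwidth
	gpuItemsPerSec      float64 // kernel throughput once data is on the device
	cpuItemsPerSec      float64 // throughput of the CPU implementation
}

// gpuSeconds estimates fixed overhead + transfer time + kernel time.
func (m costModel) gpuSeconds(items, bytesPerItem int) float64 {
	transfer := float64(items*bytesPerItem) / m.transferBytesPerSec
	compute := float64(items) / m.gpuItemsPerSec
	return m.fixedOverheadSec + transfer + compute
}

// cpuSeconds estimates the time for the same work on the CPU.
func (m costModel) cpuSeconds(items int) float64 {
	return float64(items) / m.cpuItemsPerSec
}

// gpuWins reports whether the estimated GPU time is lower.
func (m costModel) gpuWins(items, bytesPerItem int) bool {
	return m.gpuSeconds(items, bytesPerItem) < m.cpuSeconds(items)
}

func main() {
	// Illustrative numbers only, not measurements.
	m := costModel{
		fixedOverheadSec:    50e-6,
		transferBytesPerSec: 10e9,
		gpuItemsPerSec:      1e10,
		cpuItemsPerSec:      1e8,
	}
	for _, items := range []int{4, 1_000_000} {
		fmt.Printf("items=%d gpuWins=%v\n", items, m.gpuWins(items, 12))
	}
}
```

With these made-up figures the tiny batch loses to the CPU because the fixed overhead dominates, and the large batch wins. Real thresholds depend entirely on measured values.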

The goal is not “GPU everything.”

The goal is to make GPU access available to Go when it actually makes sense.

Why CGO_ENABLED=0 is the interesting part

The technical detail I care about most is not just “Go can call CUDA.”

It is this:

CGO_ENABLED=0 go build ./...

That means you can build without:

  • CUDA headers
  • the CUDA toolkit
  • a C compiler
  • cgo
  • build-time GPU dependencies

Then, at runtime:

try loading libcuda.so.1
if available:
  use GPU path
else:
  fall back to CPU path
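
In Go, that fallback can be sketched as a backend chosen once at startup. The names are hypothetical: Backend, cpuBackend, and selectBackend are illustrative, and openGPUStub stands in for a real driver loader by always failing.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical interface shared by the GPU and CPU paths.
type Backend interface {
	Name() string
	Add(a, b []float32) []float32
}

// cpuBackend is the always-available fallback.
type cpuBackend struct{}

func (cpuBackend) Name() string { return "cpu" }

// Add assumes len(a) == len(b).
func (cpuBackend) Add(a, b []float32) []float32 {
	c := make([]float32, len(a))
	for i := range a {
		c[i] = a[i] + b[i]
	}
	return c
}

// selectBackend tries the GPU opener first and falls back to the CPU
// when the opener is missing or returns an error.
func selectBackend(openGPU func() (Backend, error)) Backend {
	if openGPU != nil {
		if b, err := openGPU(); err == nil {
			return b
		}
	}
	return cpuBackend{}
}

// openGPUStub simulates a machine without an NVIDIA driver.
func openGPUStub() (Backend, error) {
	return nil, errors.New("libcuda.so.1 not found")
}

func main() {
	b := selectBackend(openGPUStub)
	fmt.Println("backend:", b.Name(), b.Add([]float32{1, 2}, []float32{10, 20}))
	// prints "backend: cpu [11 22]"
}
```

The rest of the program only sees Backend, so a missing driver changes performance, not behavior.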

That fits infrastructure software much better.

The binary does not need to be built on a GPU machine. It only needs the NVIDIA driver on the machine where GPU execution actually happens.

Final thought

AI made GPUs mainstream, but GPUs are not only for AI.

They are throughput machines.

They are useful when you have large batches of repetitive work: matrix math, vector math, scans, filters, hashing, encoding, simulation, image processing, or data conversion.

Go already owns a lot of backend infrastructure.

So it makes sense for Go to have better access to accelerated computing.

Not through a Python sidecar.

Not through a fragile cgo setup.

But through a simple Go API that can load the CUDA Driver API when available.

That is the experiment behind https://github.com/eitamring/gocudrv
