Quick Start

This section shows a minimal end-to-end setup:

  1. Install cuNLS

  2. Write a tiny app

  3. Create a CMakeLists.txt for it

  4. Build and run it against the installed library

Step 1: Install cuNLS

Follow Installation and note the install prefix you used (example: /tmp/cunls_install); the build step needs it.

Step 2: Create a minimal source file

Create main.cpp. The program solves the simplest possible nonlinear least-squares problem: a single scalar variable \(x\) pulled toward a target value \(o = 2\) by a prior factor with residual \(r = x - o\). The cost is \(\tfrac{1}{2}\|x - o\|^2\), so the optimal solution is \(x^* = 2\).
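
As a quick sanity check on the numbers the program should report, the cost at the initial guess and a single undamped Gauss-Newton step can be worked out by hand (this is plain algebra on the problem above, not output from cuNLS):

\[
r(x) = x - o, \qquad J = \frac{\partial r}{\partial x} = 1, \qquad \tfrac{1}{2}\|r(0)\|^2 = \tfrac{1}{2}(0 - 2)^2 = 2
\]

\[
J^T J\,\Delta x = -J^T r \;\Longrightarrow\; \Delta x = -(0 - 2) = 2, \qquad x + \Delta x = 2 = x^*
\]

Because the residual is linear in \(x\), one undamped step lands exactly on the optimum; a damped solver takes slightly shorter steps and may need a few more iterations.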

Includes and CUDA stream. All cuNLS public headers are available through the umbrella header cunls/cunls.h. A CUDA stream is required by every cuNLS API call — it controls asynchronous GPU execution.

#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include "cunls/cunls.h"

int main() {
  // cuNLS operations are asynchronous; a CUDA stream serializes GPU work.
  cudaStream_t stream = nullptr;
  cudaStreamCreate(&stream);

Prepare host data and upload to the GPU. We define one scalar state \(x = 0\) (the initial guess) and one observation \(o = 2\) (the target). dvector (see Common API) is a thin RAII wrapper around cudaMalloc / cudaMemcpy that uploads host vectors to device memory on construction.

// Initial guess: x = 0.  Target observation: o = 2.
std::vector<float> h_state = {0.0f};
std::vector<float> h_obs   = {2.0f};

// Upload both vectors to GPU memory.
cunls::dvector<float> d_state(h_state);
cunls::dvector<float> d_obs(h_obs);

Create the state batch. A VectorStateBatch<1> (see State API) wraps the device memory as a batch of 1-dimensional Euclidean state blocks. The template argument 1 means each block has one float. The second constructor argument is the number of state blocks (here just one).

// Wrap the device state memory in a VectorStateBatch with one block of
// dimension 1.  The solver will update this memory in-place.
cunls::VectorStateBatch<1> state_batch(d_state.data(), /*num_blocks=*/1);

Create the factor batch. A PriorVectorFactorBatch<1> (see Factor API) computes the residual \(r = x - o\) and Jacobian \(J = I\) for each factor. The constructor takes a device pointer to the observation vectors and the number of factors.

// Build a prior factor that penalizes deviation from the observation.
// Residual: r = x - o,  Jacobian: J = I.
cunls::PriorVectorFactorBatch<1> prior(
    reinterpret_cast<const cunls::Vector<1>*>(d_obs.data()),
    /*num_factors=*/1);

Wire state pointers and assemble the problem. The state-pointer vector tells the solver which state block each factor reads. For a prior factor with one state input, there is exactly one pointer per factor. Problem (see Minimizer API) collects all state and factor batches into a single factor graph.

// Each factor needs a list of device pointers to its input state blocks.
// The prior factor reads one state block, so we provide one pointer.
std::vector<float*> state_ptrs = {state_batch.StateBlockDevicePtr(0)};

// Assemble the factor graph.
cunls::Problem problem;
problem.AddStateBatch(&state_batch);
problem.AddFactorBatch(&prior, state_ptrs);

Run the solver. LevenbergMarquardtMinimizer (see Minimizer API) solves the damped normal equations \((J^T J + \lambda D)\,\Delta x = -J^T r\) at each iteration, adapting \(\lambda\) based on step quality. Minimize updates the state memory in-place and returns a MinimizerSummary with solve statistics.

// Solve with default LM settings.  The state (d_state) is updated
// in-place on the GPU.
cunls::LevenbergMarquardtMinimizer minimizer;
auto summary = minimizer.Minimize(stream, problem);

Inspect results. Print the solve statistics from the summary, then destroy the stream.

  std::cout << "Iterations: "   << summary.num_iterations << "\n";
  std::cout << "Initial cost: " << summary.initial_cost   << "\n";
  std::cout << "Final cost: "   << summary.final_cost     << "\n";

  cudaStreamDestroy(stream);
  return 0;
}

Step 3: Create CMakeLists.txt

cmake_minimum_required(VERSION 3.24)
project(cunls_quick_start LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

if(NOT DEFINED CUNLS_INSTALL_DIR)
  message(FATAL_ERROR "Set CUNLS_INSTALL_DIR to cuNLS install prefix.")
endif()

find_package(CUDAToolkit REQUIRED)
find_library(CUNLS_LIBRARY cunls PATHS "${CUNLS_INSTALL_DIR}/lib" REQUIRED NO_DEFAULT_PATH)

add_executable(minimal main.cpp)
target_include_directories(minimal PRIVATE "${CUNLS_INSTALL_DIR}/include")
target_link_libraries(minimal PRIVATE "${CUNLS_LIBRARY}" CUDA::cudart)
set_target_properties(minimal PROPERTIES
  BUILD_RPATH "${CUNLS_INSTALL_DIR}/lib"
  INSTALL_RPATH "${CUNLS_INSTALL_DIR}/lib"
)

Step 4: Build and run

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCUNLS_INSTALL_DIR=/tmp/cunls_install
cmake --build build -j
./build/minimal

Given the cost defined in Step 2, you should see an initial cost of 2 (that is, \(\tfrac{1}{2}(0 - 2)^2\)) and a final cost close to zero.