State API#
The state module provides batched storage and manifold updates for optimization variables. State batches implement the Plus (retraction) operation so the solver can update states in tangent space while keeping them on the manifold.
- C++: cunls/state
- Python: pycunls
Manifolds#
What is a manifold?
Many variables in nonlinear least squares do not live in \(\mathbb{R}^n\) but on curved spaces: 2D/3D rotations (SO(2), SO(3)), rigid or similarity transforms (SE(2), SE(3), Sim(2), Sim(3)), projective linear groups (SL(4)), or other constrained sets. Such a space is a manifold: at each point \(x\) there is a tangent space (a linear space of “directions”) whose dimension is the intrinsic dimension of the manifold. The ambient space is the larger Euclidean space in which the manifold is embedded (e.g. 3×3 matrices for SO(3), so ambient dimension 9).
Why use manifolds?
Constraint satisfaction: Updates are applied in the tangent space and then mapped back onto the manifold, so the state never leaves the constraint set (e.g. rotation matrices stay orthogonal).
Correct dimension: The solver only works with as many unknowns as the tangent dimension (e.g. 3 for SO(3) instead of 9), which improves numerics and efficiency.
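The two points above can be checked numerically on SO(2). The following is a minimal host-side sketch in plain Python, not library code: the helper names (`so2`, `matmul2`, `orthogonality_error`) are hypothetical and only illustrate why an additive update in ambient space breaks orthogonality while a tangent-space update does not.

```python
import math

def so2(theta):
    """Row-major 2x2 rotation matrix as a flat list of 4 floats."""
    c, s = math.cos(theta), math.sin(theta)
    return [c, -s, s, c]

def matmul2(a, b):
    """Product of two row-major 2x2 matrices stored as flat lists."""
    return [a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]]

def orthogonality_error(r):
    """Largest absolute entry of R^T R - I."""
    rtr = matmul2([r[0], r[2], r[1], r[3]], r)
    return max(abs(rtr[0] - 1.0), abs(rtr[1]), abs(rtr[2]), abs(rtr[3] - 1.0))

x = so2(0.3)

# Naive update: treat the 4 ambient entries as free unknowns and add a step.
naive = [xi + di for xi, di in zip(x, [0.05, -0.02, 0.01, 0.03])]

# Manifold update: one tangent unknown (an angle), applied as x * Exp(delta).
retracted = matmul2(x, so2(0.05))

naive_error = orthogonality_error(naive)          # clearly non-zero
retracted_error = orthogonality_error(retracted)  # zero up to round-off
```

The naive update needs four unknowns and leaves the constraint set; the retraction needs one unknown and stays on it.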
Plus (retraction)
The Plus operation (written \(\oplus\) in this document; the literature often uses \(\boxplus\)) takes a point \(x\) on the manifold and a tangent vector \(\Delta\) and returns a new point on the manifold:
\[(x, \Delta) \mapsto x \oplus \Delta, \qquad x \in \mathcal{M},\quad \Delta \in \mathbb{R}^{n},\quad x \oplus \Delta \in \mathcal{M},\]
where \(\mathcal{M}\) denotes the manifold and \(n\) its tangent dimension.
The solver computes an update \(\Delta\) in tangent space (e.g. from Gauss-Newton or Levenberg-Marquardt) and then sets \(x_{\mathrm{new}} = x \oplus \Delta\). For Euclidean space, \(x \oplus \Delta = x + \Delta\). For the Lie groups (SO, SE, Sim, SL), Plus is implemented as right-multiplication by the exponential of the Lie algebra element: \(x \oplus \Delta = x \cdot \mathrm{Exp}(\Delta)\).
How the minimizer uses state batches
The minimizer holds a current state \(x\) in ambient storage. It solves for
a tangent update \(\Delta\) (using Jacobians that are w.r.t. tangent space).
Then it calls StateBatch::Plus() (or StateBatchOps::Plus
over multiple batches) to write \(x \oplus \Delta\) back into the state
buffer. So the state batch is the object that knows how to apply \(\oplus\)
for its manifold.
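The loop described above can be illustrated with a toy Gauss-Newton iteration on SO(2). This is a sketch in plain Python under stated assumptions: the problem (rotate a point `p` onto a target `q`), the helper names, and the closed-form 1-D solve are all hypothetical and stand in for the real minimizer, which runs on the GPU.

```python
import math

def so2_exp(delta):
    """Exp for SO(2): row-major 2x2 rotation by angle delta."""
    c, s = math.cos(delta), math.sin(delta)
    return [c, -s, s, c]

def matmul2(a, b):
    return [a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]]

def apply(r, p):
    """Rotate the 2D point p by the row-major matrix r."""
    return (r[0] * p[0] + r[1] * p[1], r[2] * p[0] + r[3] * p[1])

def plus(x, delta):
    """x (+) delta = x * Exp(delta); the state stays a rotation matrix."""
    return matmul2(x, so2_exp(delta))

p = (1.0, 0.0)
q = (math.cos(1.0), math.sin(1.0))  # p rotated by 1 radian
x = so2_exp(0.0)                    # ambient state: 4 floats

for _ in range(10):
    rp = apply(x, p)
    residual = (rp[0] - q[0], rp[1] - q[1])
    # Jacobian w.r.t. the tangent unknown: d/d(delta) [x Exp(delta) p] at 0.
    jac = apply(x, (-p[1], p[0]))
    jtj = jac[0] ** 2 + jac[1] ** 2
    jtr = jac[0] * residual[0] + jac[1] * residual[1]
    delta = -jtr / jtj              # 1-D normal equations
    x = plus(x, delta)              # write x (+) delta back into the state

recovered_angle = math.atan2(x[2], x[0])
```

The solver only ever sees one unknown per block (the tangent dimension), while the stored state keeps its four ambient floats.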
StateBatch Interface#
-
size_t TangentSize() const#
- Returns:
[out] Tangent-space dimension per state block.
-
size_t AmbientSize() const#
- Returns:
[out] Ambient/storage dimension per state block.
-
size_t NumStateBlocks() const#
- Returns:
[out] Number of state blocks in this batch.
-
void Plus(const float *x, const float *delta, float *x_plus_delta, cudaStream_t stream)#
Computes \(x_{\mathrm{out}} = x \oplus \delta\) for each block in the batch.
- Parameters:
x – [in] Device pointer to the current state values (ambient).
delta – [in] Device pointer to tangent-space updates.
x_plus_delta – [out] Device pointer to updated state values (ambient).
stream – [in] CUDA stream for asynchronous execution.
- Returns:
[out] No return value.
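As a host-side reference for what a batched Plus computes, the sketch below walks flat buffers block by block. It assumes that `x` holds `AmbientSize()` floats per block and `delta` holds `TangentSize()` floats per block, both contiguous; the function names are hypothetical and the real implementation runs as a CUDA kernel on the given stream.

```python
def batched_plus(x, delta, ambient_size, tangent_size, num_blocks, block_plus):
    """Apply block_plus to every block of flat, contiguous buffers."""
    x_plus_delta = [0.0] * (num_blocks * ambient_size)
    for b in range(num_blocks):
        xb = x[b * ambient_size:(b + 1) * ambient_size]
        db = delta[b * tangent_size:(b + 1) * tangent_size]
        x_plus_delta[b * ambient_size:(b + 1) * ambient_size] = block_plus(xb, db)
    return x_plus_delta

def vector_plus(xb, db):
    """Euclidean Plus: plain addition (ambient and tangent sizes coincide)."""
    return [xi + di for xi, di in zip(xb, db)]

# Two 3-vectors stored back to back.
x = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
delta = [0.5, 0.0, -0.5, 0.25, 0.25, 0.25]
out = batched_plus(x, delta, 3, 3, 2, vector_plus)
```

For non-Euclidean batches only `block_plus` and the two sizes change; the block indexing stays the same.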
-
float *StateBlockDevicePtr(size_t state_block_idx)#
- Parameters:
state_block_idx – [in] Zero-based index of state block.
- Returns:
[out] Mutable device pointer for the selected block, or nullptr when out-of-range.
-
const float *StateBlockDevicePtr(size_t state_block_idx) const#
- Parameters:
state_block_idx – [in] Zero-based index of state block.
- Returns:
[out] Const device pointer for the selected block, or nullptr when out-of-range.
-
const int *ConstStateIds() const#
- Returns:
[out] Device pointer to constant-state indices, or nullptr when none are set.
-
size_t NumConstStateBlocks() const#
- Returns:
[out] Number of constant (non-optimized) state blocks.
State batch types (tables)#
Each state batch type corresponds to a manifold. The table columns are: Plus formula, Ambient dimension, Tangent dimension, Ambient space description, Tangent space description, and Memory layout of one state block in device memory.
SizedStateBatch<AmbientDim, TangentDim>#
Generic base with compile-time ambient and tangent dimensions. Storage layout:
contiguous blocks, each of AmbientDim floats. Derived classes implement
Plus() for their manifold.
VectorStateBatch<Dim>#
Header: cunls/state/vector_state_batch.h
Euclidean vector state (e.g. landmarks, biases). Tangent and ambient spaces coincide.
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x + \delta\) | \(\mathrm{Dim}\) | \(\mathrm{Dim}\) | \(\mathbb{R}^{\mathrm{Dim}}\) | \(\mathbb{R}^{\mathrm{Dim}}\) | \(\mathrm{Dim}\) floats per block, contiguous |
Constructors: Same as SizedStateBatch with both dimensions equal to
Dim. See Constructors below.
SO2StateBatch#
Header: cunls/state/so2_state_batch.h
2D rotations (heading angle). Tangent = 1 (angle in radians).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\delta)\) | 4 | 1 | 2×2 rotation matrix | angle (radians) | row-major 2×2: \([\cos\theta,\, -\sin\theta,\, \sin\theta,\, \cos\theta]\) |
SO3StateBatch#
Header: cunls/state/so3_state_batch.h
3D rotations. Tangent = 3 (axis-angle / rotation vector).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\mathrm{skew}(\delta))\) | 9 | 3 | 3×3 rotation matrix | 3D rotation vector | row-major 3×3 (9 floats) |
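The SO(3) Plus formula can be sketched on the host with Rodrigues' formula. This is an illustrative plain-Python reference, assuming the standard closed form for \(\mathrm{Exp}(\mathrm{skew}(\delta))\); the function names are hypothetical and the library's kernel may differ in its small-angle handling.

```python
import math

def skew(w):
    """3x3 skew-symmetric matrix of a rotation vector."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]

def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Rodrigues: I + sin(t)/t * K + (1 - cos(t))/t^2 * K^2, with t = |w|."""
    theta = math.sqrt(sum(c * c for c in w))
    k = skew(w)
    k2 = matmul3(k, k)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j]
             for j in range(3)] for i in range(3)]

def so3_plus(x_flat, delta):
    """x (+) delta for one row-major 3x3 block (9 floats in, 9 floats out)."""
    x = [x_flat[0:3], x_flat[3:6], x_flat[6:9]]
    r = matmul3(x, so3_exp(delta))
    return [r[i][j] for i in range(3) for j in range(3)]

identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
quarter_turn_z = so3_plus(identity, [0.0, 0.0, math.pi / 2])
```

Starting from the identity, a rotation vector of \(\pi/2\) about the z-axis gives the usual quarter-turn matrix, stored as nine row-major floats.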
SE2StateBatch#
Header: cunls/state/se2_state_batch.h
2D rigid transform (rotation + translation). Tangent = 3 (\(v_x,\, v_y\), angle).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\delta)\) | 9 | 3 | 3×3 homogeneous matrix | \([v_x,\, v_y,\, \theta]\) | row-major 3×3: \([\cos\theta,\, -\sin\theta,\, t_x,\, \sin\theta,\, \cos\theta,\, t_y,\, 0,\, 0,\, 1]\) |
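A host-side sketch of the SE(2) Plus on one block follows. It assumes the textbook SE(2) exponential for a tangent vector ordered \([v_x, v_y, \theta]\) as in the table; whether the library uses exactly this closed form is an assumption, and the function names are hypothetical.

```python
import math

def se2_exp(delta):
    """Exp of [v_x, v_y, theta] as a row-major 3x3 homogeneous matrix."""
    vx, vy, theta = delta
    c, s = math.cos(theta), math.sin(theta)
    if abs(theta) < 1e-8:
        a, b = 1.0, 0.0  # limits of sin(t)/t and (1 - cos(t))/t
    else:
        a, b = s / theta, (1.0 - c) / theta
    tx = a * vx - b * vy
    ty = b * vx + a * vy
    return [c, -s, tx, s, c, ty, 0.0, 0.0, 1.0]

def matmul3_flat(a, b):
    """Product of two row-major 3x3 matrices stored as 9 floats each."""
    return [sum(a[3 * i + k] * b[3 * k + j] for k in range(3))
            for i in range(3) for j in range(3)]

def se2_plus(x, delta):
    """x (+) delta = x * Exp(delta) for one block."""
    return matmul3_flat(x, se2_exp(delta))

# Pose rotated by 90 degrees at position (2, 3).
pose = [0.0, -1.0, 2.0,
        1.0, 0.0, 3.0,
        0.0, 0.0, 1.0]
# Unit step along the body x-axis, no rotation.
moved = se2_plus(pose, [1.0, 0.0, 0.0])
```

Because Plus right-multiplies, the step is taken in the body frame: the pose faces along world +y, so the translation changes from (2, 3) to (2, 4).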
SE3StateBatch#
Header: cunls/state/se3_state_batch.h
3D rigid transform (rotation + translation). Tangent = 6 (twist: rotation vector + translation).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\mathrm{skew}(\delta))\) | 16 | 6 | 4×4 homogeneous matrix | 6D twist \([\omega; \rho]\) | row-major 4×4: \([R \mid t;\; 0\; 0\; 0\; 1]\) (16 floats) |
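The row-major 4×4 layout places the translation at flat offsets 3, 7 and 11 of each 16-float block. A small sketch of packing and unpacking one block on the host (the helper names are hypothetical, not part of the library):

```python
def se3_pack(rotation, translation):
    """Build one row-major [R | t; 0 0 0 1] block (16 floats)."""
    block = []
    for row, t in zip(rotation, translation):
        block.extend(list(row) + [t])
    block.extend([0.0, 0.0, 0.0, 1.0])
    return block

def se3_unpack(block):
    """Split a 16-float block back into rotation rows and translation."""
    rotation = [block[0:3], block[4:7], block[8:11]]
    translation = [block[3], block[7], block[11]]
    return rotation, translation

rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
block = se3_pack(rotation, [0.5, -1.0, 2.0])
unpacked_rotation, unpacked_translation = se3_unpack(block)
```

A buffer for `num_blocks` poses is simply `num_blocks` such blocks concatenated.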
Similarity2StateBatch#
Header: cunls/state/similarity2_state_batch.h
2D similarity (rotation + translation + scale). Tangent = 4 (\(u_x,\, u_y,\, \theta,\, \lambda=\log s\)).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\delta)\) | 9 | 4 | 3×3 sim. matrix | \([u_x,\, u_y,\, \theta,\, \lambda]\) | row-major 3×3: \([\cos\theta,\, -\sin\theta,\, t_x,\, \sin\theta,\, \cos\theta,\, t_y,\, 0,\, 0,\, 1/s]\) |
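With \(1/s\) in the bottom-right entry, the scale only shows up after dehomogenizing. The sketch below assumes the usual homogeneous-coordinate action on a point \([p; 1]\), which yields \(s\,(R p + t)\); the helper names are hypothetical.

```python
import math

def sim2_block(theta, tx, ty, s):
    """One row-major 3x3 similarity block in the layout from the table."""
    c, sn = math.cos(theta), math.sin(theta)
    return [c, -sn, tx, sn, c, ty, 0.0, 0.0, 1.0 / s]

def sim2_apply(block, p):
    """Apply the block to a 2D point via homogeneous coordinates."""
    hx = block[0] * p[0] + block[1] * p[1] + block[2]
    hy = block[3] * p[0] + block[4] * p[1] + block[5]
    hw = block[6] * p[0] + block[7] * p[1] + block[8]
    return (hx / hw, hy / hw)

# No rotation, translation (1, 0), scale 2: (1, 1) -> 2 * ((1, 1) + (1, 0)).
mapped = sim2_apply(sim2_block(0.0, 1.0, 0.0, 2.0), (1.0, 1.0))
```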
Similarity3StateBatch#
Header: cunls/state/similarity3_state_batch.h
3D similarity (rotation + translation + scale). Tangent = 7 (\(\omega,\, u,\, \lambda=\log s\)).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\delta)\) | 16 | 7 | 4×4 sim. matrix | \([\omega; u; \lambda]\) | row-major 4×4: \([R \mid t;\; 0\; 0\; 0\; 1/s]\) (16 floats) |
SL4StateBatch#
Header: cunls/state/sl4_state_batch.h
Projective special linear group SL(4). The tangent space is the 15-dimensional Lie algebra \(\mathfrak{sl}(4)\) (\(\mathfrak{so}(4) \oplus \mathrm{sym\_off}(4) \oplus \mathrm{diag}_0(4)\)).
| Plus | Ambient | Tangent | Ambient space | Tangent space | Memory layout |
|---|---|---|---|---|---|
| \(x \cdot \mathrm{Exp}(\delta)\) | 16 | 15 | 4×4 matrix with unit determinant | 15D \(\mathfrak{sl}(4)\) Lie algebra | row-major 4×4 (16 floats) |
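The reason Plus keeps an SL(4) state at unit determinant is that \(\det \mathrm{Exp}(A) = e^{\mathrm{tr}(A)}\), which equals 1 for any traceless \(A\). The sketch below checks this numerically in plain Python with a truncated Taylor series. How the 15 tangent coordinates map to a traceless matrix (the choice of basis) is library-specific and not shown here; the matrix `a` is an arbitrary hypothetical example.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(a, terms=30):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, a)]  # A^k / k!
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            d = -d
        d *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return d

# Arbitrary traceless 4x4 matrix: diagonal entries sum to zero.
a = [[0.10, 0.20, 0.00, -0.10],
     [0.00, -0.30, 0.10, 0.00],
     [0.20, 0.00, 0.05, 0.10],
     [-0.10, 0.10, 0.00, 0.15]]
det_value = det(expm(a))
```

Since the determinant is multiplicative, right-multiplying a unit-determinant state by such an exponential leaves the determinant at 1.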
Constructors#
SizedStateBatch<AmbientDim, TangentDim> (constructors)#
-
SizedStateBatch(const float *device_ptr, size_t num_blocks)#
- Parameters:
device_ptr – [in] Device pointer to contiguous state storage (num_blocks × AmbientDim floats).
num_blocks – [in] Number of state blocks.
- Returns:
[out] Constructor has no return value.
-
SizedStateBatch(const float *device_ptr, size_t num_blocks, const int *device_constant_state_ids, size_t num_const_state_blocks)#
- Parameters:
device_ptr – [in] Device pointer to contiguous state storage.
num_blocks – [in] Number of state blocks.
device_constant_state_ids – [in] Device pointer to indices of constant blocks.
num_const_state_blocks – [in] Number of constant block indices.
- Returns:
[out] Constructor has no return value.
VectorStateBatch<Dim> (constructors)#
Uses the same constructor signatures as SizedStateBatch with ambient and
tangent dimension Dim.
StateBatch constructors#
Each StateBatch-derived class has constructors equivalent to:
-
ClassName(cuBLASHandle &cublas_handle, const float *device_ptr, size_t num_blocks)#
-
ClassName(cuBLASHandle &cublas_handle, const float *device_ptr, size_t num_blocks, const int *device_constant_state_ids, size_t num_const_state_blocks)#
- Parameters:
cublas_handle – [in] External cuBLAS handle wrapper.
device_ptr – [in] Device pointer to contiguous state storage.
num_blocks – [in] Number of state blocks.
device_constant_state_ids – [in] Device pointer to constant block indices.
num_const_state_blocks – [in] Number of constant block indices.
- Returns:
[out] Constructor has no return value.
StateBatchOps#
Orchestrates Plus() across multiple state batches: gathers tangent
updates from a single reduced vector, scatters to per-batch deltas, and calls
each batch’s Plus().
-
StateBatchOps()#
- Returns:
[out] Constructor has no return value.
-
StateBatchOps(cudaStream_t stream, const std::vector<StateBatch*> &state_batches)#
- Parameters:
stream – [in] CUDA stream used to initialize mappings.
state_batches – [in] Ordered list of state batches.
- Returns:
[out] Constructor has no return value.
-
void Preprocess(cudaStream_t stream, const std::vector<StateBatch*> &state_batches)#
- Parameters:
stream – [in] CUDA stream for mapping/buffer initialization.
state_batches – [in] State batches used to build reduced/full mappings.
- Returns:
[out] No return value.
-
void Plus(cudaStream_t stream, const std::vector<const float*> &x_ptrs, const DeviceVector<float> &delta, std::vector<float*> &x_plus_delta_ptrs)#
- Parameters:
stream – [in] CUDA stream for scatter/update operations.
x_ptrs – [in] Current per-batch state pointers.
delta – [in] Reduced tangent update vector.
x_plus_delta_ptrs – [out] Per-batch pointers for updated states.
- Returns:
[out] No return value.
-
size_t NumReducedStates() const#
- Returns:
[out] Number of scalar optimization variables after removing constant states.
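The relationship between the reduced vector and the per-batch deltas can be sketched on the host. This is an illustration under explicit assumptions, not the library's implementation: it assumes the reduced vector is ordered by batch and then by block, that constant blocks receive a zero delta so their values are unchanged, and that constant indices are distinct. The dictionary-based batch description and function names are hypothetical.

```python
def num_reduced_states(batches):
    """Scalar unknowns left after removing constant blocks."""
    return sum(b["tangent_size"] * (b["num_blocks"] - len(b["const_ids"]))
               for b in batches)

def scatter_reduced_delta(batches, reduced_delta):
    """Expand one reduced vector into a full tangent delta per batch."""
    per_batch = []
    cursor = 0
    for b in batches:
        const = set(b["const_ids"])
        t = b["tangent_size"]
        delta = []
        for block in range(b["num_blocks"]):
            if block in const:
                delta.extend([0.0] * t)  # assumed: constant blocks get zero
            else:
                delta.extend(reduced_delta[cursor:cursor + t])
                cursor += t
        per_batch.append(delta)
    return per_batch

batches = [
    {"tangent_size": 6, "num_blocks": 3, "const_ids": [0]},  # e.g. SE(3) poses
    {"tangent_size": 3, "num_blocks": 2, "const_ids": []},   # e.g. 3D points
]
reduced = [float(i) for i in range(num_reduced_states(batches))]
deltas = scatter_reduced_delta(batches, reduced)
```

Each entry of `deltas` then has `num_blocks × tangent_size` values, the shape a batch's Plus() expects.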
Python API (pycunls)#
All Python state batches inherit from the abstract StateBatch base class.
Every constructor argument documented as DevicePointer accepts either a
cupy.ndarray (the device pointer is extracted automatically via
.data.ptr) or a raw int GPU device address.
Common StateBatch interface#
Every state batch — built-in or user-defined — exposes the following methods and properties.
Methods
- state_block_device_ptr(index: int) -> int: returns the GPU device pointer (as an int) for state block index. The returned value is the address of the first float in the block’s ambient storage. Use these pointers to build the state_pointers list passed to Problem.add_factor_batch. index is zero-based; passing a value >= num_state_blocks returns 0 (null pointer).
Read-only properties
- num_state_blocks (int): total number of state blocks in the batch, including any constant blocks.
- tangent_size (int): tangent-space dimension per state block. This is the number of unknowns the solver allocates per block (e.g. 6 for SE(3), 3 for SO(3)).
- ambient_size (int): ambient/storage dimension per state block. The GPU buffer stores num_state_blocks * ambient_size contiguous floats (e.g. ambient_size is 16 for SE(3), a row-major 4×4 matrix).
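Given the contiguous layout, the address of a block follows from simple pointer arithmetic. The sketch below mirrors the documented behaviour on the host, assuming 4-byte floats; `block_address` is a hypothetical helper, and treating negative indices like out-of-range ones is an assumption.

```python
FLOAT_BYTES = 4  # assumed: 32-bit floats, matching the C++ `float` buffers

def block_address(base_address, index, num_state_blocks, ambient_size):
    """Address of the first float of block `index`, or 0 if out of range."""
    if not 0 <= index < num_state_blocks:
        return 0
    return base_address + index * ambient_size * FLOAT_BYTES

base = 0x1000  # made-up device address for illustration
third_pose = block_address(base, 2, num_state_blocks=4, ambient_size=16)
past_the_end = block_address(base, 4, num_state_blocks=4, ambient_size=16)
```

For an SE(3) batch, consecutive blocks are therefore 64 bytes apart.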
pycunls.VectorStateBatch1 / VectorStateBatch2 / VectorStateBatch3 / VectorStateBatch6#
Euclidean vector states where tangent and ambient dimensions coincide. The suffix indicates the dimension (1, 2, 3, or 6). Plus is simple addition: \(x \oplus \delta = x + \delta\).
Constructors
# All optimizable:
sb = pycunls.VectorStateBatch3(data, num_blocks)
# With constant (frozen) blocks:
sb = pycunls.VectorStateBatch3(data, num_blocks, const_state_ids, num_const)
- data (DevicePointer): contiguous GPU buffer of num_blocks × Dim floats. The state batch does not copy the data; it stores the pointer and reads/writes the buffer directly. The caller must keep the underlying allocation alive for the lifetime of the state batch.
- num_blocks (int): number of state blocks in the batch.
- const_state_ids (DevicePointer, optional): GPU int32 array containing the zero-based indices of blocks that should be held constant during optimization. Constant blocks are excluded from the solver’s tangent vector; their ambient values are never modified.
- num_const (int, optional): number of entries in const_state_ids.
pycunls.SE3StateBatch#
3-D rigid-body transform state batch. Ambient = 16 (row-major 4×4 homogeneous matrix), Tangent = 6 (twist \([\omega; \rho]\)). Plus is right-multiplication by the exponential map: \(T \oplus \delta = T \cdot \mathrm{Exp}(\delta)\).
Constructors
cublas = pycunls.CublasHandle()
# All optimizable:
sb = pycunls.SE3StateBatch(cublas, data, num_blocks)
# With constant blocks:
sb = pycunls.SE3StateBatch(cublas, data, num_blocks, const_ids, num_const)
- cublas (CublasHandle): shared cuBLAS handle used internally for matrix operations in the exponential map.
- data (DevicePointer): contiguous GPU buffer of num_blocks × 16 floats (row-major 4×4 matrices).
- num_blocks (int): number of state blocks (poses).
- const_ids (DevicePointer, optional): GPU int32 array of constant-block indices (e.g. a gauge anchor).
- num_const (int, optional): number of constant blocks.
pycunls.SO3StateBatch#
3-D rotation state batch. Ambient = 9 (row-major 3×3 rotation matrix), Tangent = 3 (rotation vector / axis-angle). Plus: \(R \oplus \delta = R \cdot \mathrm{Exp}(\mathrm{skew}(\delta))\).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.SO3StateBatch(cublas, data, num_blocks)
sb = pycunls.SO3StateBatch(cublas, data, num_blocks, const_ids, num_const)
- data: num_blocks × 9 floats (row-major 3×3).
pycunls.SO2StateBatch#
2-D rotation state batch. Ambient = 4 (row-major 2×2 rotation matrix), Tangent = 1 (angle in radians). Plus: \(R \oplus \delta = R \cdot \mathrm{Exp}(\delta)\).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.SO2StateBatch(cublas, data, num_blocks)
sb = pycunls.SO2StateBatch(cublas, data, num_blocks, const_ids, num_const)
- data: num_blocks × 4 floats (\([\cos\theta,\,-\sin\theta,\,\sin\theta,\,\cos\theta]\)).
pycunls.SE2StateBatch#
2-D rigid-body transform state batch. Ambient = 9 (row-major 3×3 homogeneous matrix), Tangent = 3 (\([v_x, v_y, \theta]\)).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.SE2StateBatch(cublas, data, num_blocks)
sb = pycunls.SE2StateBatch(cublas, data, num_blocks, const_ids, num_const)
- data: num_blocks × 9 floats (row-major 3×3).
pycunls.Similarity2StateBatch#
2-D similarity transform state batch. Ambient = 9, Tangent = 4 (\([u_x, u_y, \theta, \lambda]\) where \(\lambda = \log s\)).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.Similarity2StateBatch(cublas, data, num_blocks)
sb = pycunls.Similarity2StateBatch(cublas, data, num_blocks, const_ids, num_const)
pycunls.Similarity3StateBatch#
3-D similarity transform state batch. Ambient = 16, Tangent = 7 (\([\omega; u; \lambda]\) where \(\lambda = \log s\)).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.Similarity3StateBatch(cublas, data, num_blocks)
sb = pycunls.Similarity3StateBatch(cublas, data, num_blocks, const_ids, num_const)
pycunls.SL4StateBatch#
SL(4) state batch. Ambient = 16 (row-major 4×4 matrix with unit determinant), Tangent = 15 (\(\mathfrak{sl}(4)\) Lie algebra). Plus: \(T \oplus \delta = T \cdot \mathrm{Exp}(\delta)\).
Constructors — same pattern as SE3StateBatch:
sb = pycunls.SL4StateBatch(cublas, data, num_blocks)
sb = pycunls.SL4StateBatch(cublas, data, num_blocks, const_ids, num_const)
- data: num_blocks × 16 floats (row-major 4×4).
pycunls.CustomStateBatch#
Base class for user-defined state batches. Subclass this to implement a manifold retraction that is not available as a built-in (e.g. positive scalars, quaternions, constrained subspaces).
Constructor
class MyState(pycunls.CustomStateBatch):
    def __init__(self, data, num_blocks):
        super().__init__(
            data,
            ambient_size=...,
            tangent_size=...,
            num_blocks=num_blocks,
        )
- data (DevicePointer): contiguous GPU buffer of num_blocks × ambient_size floats.
- ambient_size (int): number of floats per state block in GPU memory.
- tangent_size (int): number of tangent-space unknowns per block.
- num_blocks (int): number of state blocks.
- const_state_ids (DevicePointer, optional): GPU int32 array of constant-block indices.
- num_const_state_blocks (int, default 0): number of constant blocks.
Methods to override
- plus(x_ptr, delta_ptr, x_plus_delta_ptr, stream_handle) -> None: implements the manifold retraction \(x_{\mathrm{out}} = x \oplus \delta\) for all blocks in the batch. All four arguments are raw int handles:
  - x_ptr: device pointer to the current ambient state (num_blocks × ambient_size floats, read-only).
  - delta_ptr: device pointer to the tangent-space updates (num_blocks × tangent_size floats, read-only).
  - x_plus_delta_ptr: device pointer to the output buffer (num_blocks × ambient_size floats, write).
  - stream_handle: cudaStream_t cast to int. All GPU work must be launched on this stream so the minimizer can serialize operations correctly.

The default implementation raises NotImplementedError.
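Before writing the GPU version of a custom retraction, it can help to pin down the math with a host-side reference. The sketch below does this for positive scalars (ambient size 1, tangent size 1) using \(x \oplus \delta = x \cdot e^{\delta}\). This is one possible retraction chosen for illustration; the function name is hypothetical, and an actual `plus` override would perform the same arithmetic on the device buffers behind the raw pointers.

```python
import math

def positive_scalar_plus(x, delta):
    """Reference retraction for positive scalars: x * exp(delta) per block."""
    return [xi * math.exp(di) for xi, di in zip(x, delta)]

x = [1.0, 2.0, 0.5]
delta = [0.0, -3.0, 10.0]
out = positive_scalar_plus(x, delta)
```

Any real-valued update keeps each value strictly positive, and a zero update returns the input unchanged, which is the behaviour a retraction should have at \(\delta = 0\).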
pycunls.warp.WarpStateBatch#
Convenience base for custom state batches implemented with NVIDIA Warp kernels. Inherits from CustomStateBatch and provides helper methods for
zero-copy pointer wrapping so you never need to manually construct
wp.array objects from raw device addresses. Requires warp-lang.
Constructor
from pycunls.warp import WarpStateBatch

class MyWarpState(WarpStateBatch):
    def __init__(self, data, num_blocks):
        super().__init__(
            data,
            ambient_size=...,
            tangent_size=...,
            num_blocks=num_blocks,
            device="cuda:0",
        )
- device (str, default "cuda:0"): Warp device string used when creating wp.array wrappers via wrap_array.
Helper methods (inherited — do not override)
- wrap_array(ptr: int, dtype, shape) -> wp.array: zero-copy wrap of an existing GPU allocation as a Warp array. ptr is the device address, dtype a Warp data type (e.g. wp.float32), and shape an int or tuple giving the array dimensions. The returned wp.array shares the memory; no allocation or copy occurs.
- make_warp_stream(stream_handle: int) -> wp.Stream: wraps a raw cudaStream_t (passed as int) as a wp.Stream. Use the returned stream in wp.launch(..., stream=stream) to ensure the Warp kernel executes on the minimizer’s CUDA stream.
Methods to override
- plus(x_ptr, delta_ptr, x_plus_delta_ptr, stream_handle) -> None: same contract as CustomStateBatch.plus. Typical implementations wrap the pointers with self.wrap_array, build a wp.Stream with self.make_warp_stream, and launch a @wp.kernel.
See Custom Warp State for a complete example.