Linear Solver API#
cunls/linear_solver hosts linear-system abstractions (block-Jacobi PCG,
cuDSS integration, dense pivoted LDLT, dense Cholesky, and dense QR
solvers) behind a common CSR-based interface.
SparseLinearSolverType#
Enum in cunls/linear_solver/sparse_linear_solver.h:
BlockSparsePCG— iterative block-Jacobi preconditioned conjugate gradient solver. Default backend for Gauss-Newton and Levenberg-Marquardt: outperformscuDSSon most SBA / PGO workloads (seeprofile/PCG_RESULTS.md).cuDSS— NVIDIA cuDSS sparse direct solver. Pick when each Solve sees a tiny system and PCG’s per-iter kernel-launch overhead dominates.DenseLDLT— converts CSR to dense and solves with a custom CUDA pivoted LDLT kernel.DenseCholesky— converts CSR to dense and solves with cuSOLVER Cholesky (requires SPD).DenseQR— converts CSR to dense and solves with cuSOLVER QR.
SparseLinearSolverConfig#
Struct in sparse_linear_solver.h. Only the member corresponding to the
chosen SparseLinearSolverType is used:
block_sparse_pcg_options- [in] BlockSparsePCG-specific knobs; ignored when a different backend is selected.cudss_solver_options- [in] cuDSS-specific options; ignored when a different backend is selected.Dense backends take no extra configuration.
CSRSparseLinearSolver#
Abstract base (cunls/linear_solver/csr_sparse_linear_solver.h).
- bool Initialize(
- cudaStream_t stream,
- const Problem &problem,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Performs setup work (at minimum symbolic analysis; some modes also perform an initial factorization) for the given sparsity pattern. Must be called before
Solve()and re-called whenever the matrix structure changes.Solvers may inspect
problemto specialize their setup (e.g.BlockSparsePCGSolverreads each state batch’sTangentSizeto build the block-Jacobi layout). Solvers that do not need it (cuDSS, dense backends) simply ignore it; for callers that operate on a raw matrix without a cuNLS problem context, pass a default-constructedProblem.Both
rhsandresultmust be pre-allocated to the same number of elements as the number of rows inspd_matrix; the solver does not resize them.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
problem – [in] Originating problem; used by problem-aware backends to derive per-solver structure.
spd_matrix – [in] Symmetric matrix in CSR format (backend-specific definiteness requirements may apply).
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
- bool Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Performs factorization and solve phases to compute the solution
xofA x = b. Bothrhsandresultmust be pre-allocated to the same number of elements as the number of rows inspd_matrix; the solver does not resize them. CallInitialize()first for the current sparsity pattern.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
spd_matrix – [in] Symmetric matrix in CSR format (backend-specific definiteness requirements may apply).
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
-
void DisableSafetyChecks()#
Disables runtime safety checks in the solver. By default (safety checks enabled), dense solvers validate every factorization and solve step: Cholesky checks cuSOLVER
devInfoafterpotrfandpotrs; QR inspects the diagonal ofRfor rank deficiency; LDLT performs in-kernel pivot and diagonal checks. Failures causeSolve()to returnfalsewith a diagnostic viaLogError(). Calling this method skips all of the above (no device-to-host memcpy, no stream synchronization, no in-kernel validation), which can noticeably reduce per-iteration latency for small systems but may produce silently incorrect results for singular or ill-conditioned matrices. Normally called by the minimizer whenMinimizerOptions::disable_safety_checksistrue.
-
bool SafetyChecksEnabled() const#
- Returns:
truewhen post-factorization safety checks are enabled (the default).
BlockSparsePCGOptions#
Struct in cunls/linear_solver/block_sparse_pcg_solver.h. Convergence
and preconditioner-layout knobs for BlockSparsePCGSolver.
block_size- [in] Uniform diagonal tile size; used only whenblock_layoutis empty. Must divide the matrix dimension.block_size = 1reduces the preconditioner to scalar Jacobi. Default:6.block_layout- [in] Optionalstd::vector<std::pair<int, int>>describing a heterogeneous block-diagonal layout: each(count_i, size_i)element contributescount_i * size_irows withsize_i-square diagonal tiles. When empty (default),Initializederives the layout automatically from theProblem’s state batches (segment per non-empty batch withsize = TangentSize(),count = NumStateBlocks() - NumConstStateBlocks()).max_iterations- [in] PCG iteration cap. Default:200.relative_tolerance- [in] Stop when||r_k|| <= relative_tolerance * ||b||. Default:1e-3.absolute_tolerance- [in] Stop also when||r_k|| <= absolute_tolerance. Default:1e-30.pivot_floor- [in] Floor on|D_{kk}|during the per-tile LDLT pivoting; keeps the preconditioner invertible on near-singular diagonal tiles. Default:1e-12.check_period- [in] Number of PCG iterations between host-side convergence polls. Higher values reduce host syncs but may do up tocheck_period - 1extra iterations after convergence. Default:4.
BlockSparsePCGSolver#
Concrete implementation in block_sparse_pcg_solver.h. Solves
H x = b for symmetric positive-(semi-)definite H with the
standard preconditioned conjugate gradient recurrence (Saad,
Iterative Methods for Sparse Linear Systems, §9.2) and a
block-Jacobi preconditioner formed from the dense diagonal tiles of
H. Each tile is factored independently with an in-shared-memory
LDLT. SpMV is delegated to cuSPARSE.
All scalar quantities (alpha, beta, <p, q>, <r, r>,
<r, z>) live on the device for the whole inner loop. Only the
residual norm is copied to the host, and only once every
check_period iterations.
- BlockSparsePCGSolver(
- BlockSparsePCGOptions options = BlockSparsePCGOptions()
- Parameters:
options – [in] Convergence / preconditioner-layout knobs. See
BlockSparsePCGOptions.- Returns:
Constructor has no return value.
- bool BlockSparsePCGSolver::Initialize(
- cudaStream_t stream,
- const Problem &problem,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Builds the cuSPARSE SpMV plan, allocates the PCG scratch vectors, and (when
options.block_layoutis empty) derives the block-Jacobi layout fromproblem.GetStateBatches().- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
problem – [in] Source of the block-Jacobi layout when
options.block_layoutis empty. Pass a default-constructedProblemwhen calling on a raw matrix.spd_matrix – [in] Coefficient matrix
Hin CSR format.rhs – [in] Right-hand-side vector
b.result – [out] Solution vector
x.
- Returns:
trueon success;falseon dimension mismatch or invalid layout.
- bool BlockSparsePCGSolver::Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Rebuilds the block-Jacobi factor from the current values of
Hand runs up tooptions.max_iterationsPCG iterations with the zero initial guessx_0 = 0.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
spd_matrix – [in] Coefficient matrix
H(same structure as inInitialize).rhs – [in] Right-hand-side vector
b.result – [out] Solution vector
x.
- Returns:
trueon success.
-
int BlockSparsePCGSolver::LastIterations() const#
- Returns:
Number of PCG iterations consumed by the most recent
Solvecall. Useful for convergence diagnostics.
cuDSSLinearSolverMode#
Enum in cunls/linear_solver/cudss_sparse_linear_solver.h:
SlowInitFastSolve—Initializeperforms symbolic analysis only;Solveperforms a full factorization followed by the solve phase. Best when the matrix values change frequently relative to its structure.FastInitSlowSolve—Initializeperforms symbolic analysis and an initial factorization;Solveperforms refactorization followed by the solve phase. Best when the sparsity pattern is stable and fast initialization is more important than per-solve speed.
cuDSSLinearSolverOptions#
Struct fields:
mode- [in] Trade-off between setup and repeated solve speed (seecuDSSLinearSolverMode).nthreads- [in] Host thread count for cuDSS host-side stages.threading_lib_path- [in] Optional threading runtime path for multi-threaded host processing. Empty string disables multi-threading.
cuDSSLinearSolver#
Concrete implementation in cudss_sparse_linear_solver.h.
- cuDSSLinearSolver(
- cuDSSLinearSolverOptions options = cuDSSLinearSolverOptions()
- Parameters:
options – [in] cuDSS mode/threading configuration.
- Returns:
Constructor has no return value.
- bool cuDSSLinearSolver::Initialize(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
For
SlowInitFastSolve: runs the cuDSS symbolic analysis phase only. ForFastInitSlowSolve: runs symbolic analysis followed by an initial factorization so that subsequentcuDSSLinearSolver::Solve()calls can use cheaper refactorization.Both
rhsandresultmust be pre-allocated to the number of rows inspd_matrix; the solver does not resize them. The internal cuDSS handle is lazily initialized on the first call using the provided stream.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations. Also used to lazily initialize the internal cuDSS handle on first call.
spd_matrix – [in] SPD matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
DenseLDLTSolver#
Concrete implementation in dense_linear_solver.h.
- bool DenseLDLTSolver::Initialize(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Validates dimensions and allocates reusable dense buffers used by the dense pivoted LDLT pipeline.
- Parameters:
stream – [in] CUDA stream (unused; buffers are allocated synchronously).
spd_matrix – [in] Symmetric matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
- bool DenseLDLTSolver::Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Converts CSR to dense row-major storage, computes a CUDA pivoted LDLT factorization
P^T A P = L D L^T, then solves in the permuted system. Handles symmetric matrices including indefinite ones (not limited to SPD). Kernel success/failure (singular pivot, zero diagonal) is propagated back to the host via a single async device-to-pinned-host copy after both kernels complete.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
spd_matrix – [in] Symmetric matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseon dimension mismatch, singular pivot during factorization, or zero diagonal during the solve phase.
DenseCholeskySolver#
Concrete implementation in dense_cholesky_solver.h. Uses the cuSOLVER
cusolverDnSpotrf / cusolverDnSpotrs routines (Cholesky factorization
A = L L^T). Requires the input matrix to be symmetric positive-definite.
- bool DenseCholeskySolver::Initialize(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Validates dimensions and pre-allocates internal dense buffers and the cuSOLVER workspace for the given matrix size.
- Parameters:
stream – [in] CUDA stream used to query the cuSOLVER workspace size.
spd_matrix – [in] SPD matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
- bool DenseCholeskySolver::Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Converts CSR to dense, performs Cholesky factorization via
cusolverDnSpotrf, then solves viacusolverDnSpotrs. When safety checks are enabled, inspects cuSOLVERdevInfoafter bothpotrfandpotrs: returnsfalseif the matrix is not positive-definite (devInfo > 0frompotrf) or ifpotrsreports an invalid parameter (devInfo < 0).- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
spd_matrix – [in] SPD matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseon dimension mismatch, non-positive-definite matrix, or invalid parameter frompotrs.
DenseQRSolver#
Concrete implementation in dense_qr_solver.h. Uses the cuSOLVER
cusolverDnSgeqrf / cusolverDnSormqr routines for QR factorization
(A = Q R) followed by a cuBLAS cublasStrsm upper-triangular solve.
Works for any non-singular square matrix (not limited to SPD).
- bool DenseQRSolver::Initialize(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Validates dimensions and pre-allocates internal dense buffers, Householder tau vector, and the cuSOLVER workspace for the given matrix size.
- Parameters:
stream – [in] CUDA stream used to query the cuSOLVER workspace size.
spd_matrix – [in] Matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
- bool DenseQRSolver::Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
Converts CSR to dense, performs QR factorization via
cusolverDnSgeqrf, appliesQ^Tto the RHS viacusolverDnSormqr, then solves the upper-triangular systemR x = Q^T bviacublasStrsm.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations.
spd_matrix – [in] Matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseon dimension mismatch or singular matrix (cuSOLVERdevInfo != 0).
- bool cuDSSLinearSolver::Solve(
- cudaStream_t stream,
- const CSRSparseMatrix &spd_matrix,
- const dvector<float> &rhs,
- dvector<float> &result
For
SlowInitFastSolve: performs full factorization then solve. ForFastInitSlowSolve: performs refactorization (reusing the symbolic analysis fromInitialize) then solve.Both
rhsandresultmust be pre-allocated to the number of rows inspd_matrix; the solver does not resize them and returnsfalseon any dimension mismatch.- Parameters:
stream – [in] CUDA stream for asynchronous GPU operations. Also used to lazily initialize the internal cuDSS handle if needed.
spd_matrix – [in] SPD matrix in CSR format.
rhs – [in] Right-hand-side vector
b(size must equal matrix rows).result – [out] Solution vector
x(size must equal matrix rows).
- Returns:
trueon success;falseif a dimension mismatch is detected.
Factory#
Function in sparse_linear_solver.h:
- SparseLinearSolverPtr CreateCSRSparseLinearSolver(
- SparseLinearSolverType type,
- const SparseLinearSolverConfig &config
- Parameters:
type – [in] Backend type to instantiate.
config – [in] Backend-specific configuration blob.
- Returns:
[out] Heap-allocated solver instance.