Binary package “libcutlass-dev” in ubuntu oracular

CUDA Templates for Linear Algebra Subroutines

 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-matrix multiplication (GEMM) and related computations
 at all levels and scales within CUDA. It incorporates strategies for
 hierarchical decomposition and data movement similar to those used to implement
 cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable,
 modular software components abstracted by C++ template classes. Primitives for
 different levels of a conceptual parallelization hierarchy can be specialized
 and tuned via custom tiling sizes, data types, and other algorithmic policy.
 The resulting flexibility simplifies their use as building blocks within custom
 kernels and applications.
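 .
 The idea of decomposing a GEMM into tiles with tunable sizes can be sketched
 in plain host-side C++. This is only an illustration under stated
 assumptions: tiled_gemm and its TileM/TileN/TileK parameters are hypothetical
 names, not part of the CUTLASS API, and the loops run on the CPU rather than
 mapping tiles onto thread blocks and warps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (NOT a CUTLASS API): computes C = A * B for row-major
// matrices, one tile at a time. TileM, TileN and TileK stand in for the
// tunable tiling sizes; on a GPU each (m0, n0) tile would typically be
// assigned to a thread block, which this CPU sketch does not attempt.
template <std::size_t TileM, std::size_t TileN, std::size_t TileK>
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);

    // Outer loops walk over tiles of the output and of the K dimension.
    for (std::size_t m0 = 0; m0 < M; m0 += TileM) {
        for (std::size_t n0 = 0; n0 < N; n0 += TileN) {
            for (std::size_t k0 = 0; k0 < K; k0 += TileK) {
                // Clamp tile bounds so partial edge tiles are handled.
                const std::size_t m1 = std::min(m0 + TileM, M);
                const std::size_t n1 = std::min(n0 + TileN, N);
                const std::size_t k1 = std::min(k0 + TileK, K);

                // Inner loops accumulate one tile's partial products.
                for (std::size_t m = m0; m < m1; ++m)
                    for (std::size_t n = n0; n < n1; ++n)
                        for (std::size_t k = k0; k < k1; ++k)
                            C[m * N + n] += A[m * K + k] * B[k * N + n];
            }
        }
    }
    return C;
}
```

 Changing the template arguments changes only how the work is partitioned,
 not the result, which is the property that makes tile sizes a tuning knob.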
 .
 To serve a wide variety of applications, CUTLASS offers extensive support
 for mixed-precision computations, with specialized data-movement and
 multiply-accumulate abstractions for half-precision floating point (FP16),
 BFloat16 (BF16), Tensor Float 32 (TF32), single-precision floating point
 (FP32), FP32 emulation via Tensor Core instructions, double-precision
 floating point (FP64), integer data types (4-bit and 8-bit), and binary
 data types (1-bit). CUTLASS demonstrates warp-synchronous matrix multiply
 operations targeting the programmable, high-throughput Tensor Cores
 implemented by NVIDIA's Volta, Turing, Ampere, and Hopper architectures.
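 .
 Mixed-precision multiply-accumulate means the inputs use a narrow type while
 the running sum uses a wider one. A minimal host-side sketch, assuming a
 hypothetical helper dot_mixed that is not part of CUTLASS and does not use
 Tensor Cores:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (NOT a CUTLASS API): dot product whose inputs have the
// narrow type In and whose accumulator has the wider type Acc.
template <typename In, typename Acc>
Acc dot_mixed(const std::vector<In>& a, const std::vector<In>& b) {
    assert(a.size() == b.size());
    Acc acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Widen each operand before multiplying so that neither the product
        // nor the running sum is limited by the range of the input type.
        acc += static_cast<Acc>(a[i]) * static_cast<Acc>(b[i]);
    }
    return acc;
}
```

 With 8-bit integer inputs and a 32-bit accumulator, for example, the sum can
 grow well past the range of the input type without overflowing.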
 .
 This is a header-only library.