nvidia-cutlass 3.1.0+ds-2 source package in Ubuntu

Changelog

nvidia-cutlass (3.1.0+ds-2) unstable; urgency=medium

  * Upload to unstable.

 -- Mo Zhou <email address hidden>  Wed, 28 Feb 2024 12:07:52 -0500

Upload details

Uploaded by:
Debian NVIDIA Maintainers
Uploaded to:
Sid
Original maintainer:
Debian NVIDIA Maintainers
Architectures:
all
Section:
misc
Urgency:
Medium

Publishing

Series Pocket Component Section
Oracular release multiverse misc
Noble release multiverse misc

Builds

Noble: [FULLYBUILT] amd64

Downloads

File Size SHA-256 Checksum
nvidia-cutlass_3.1.0+ds-2.dsc 2.0 KiB 4b0c7a34b417bc3ca6bf84f3297b4b8033a1a5fd7f7e8cf78af7f628972b3a40
nvidia-cutlass_3.1.0+ds.orig.tar.xz 11.6 MiB 21771ad5a3ee51083e5ce38f079e9efbf9d5cb34d297756d3cbfb0673b93da59
nvidia-cutlass_3.1.0+ds-2.debian.tar.xz 3.1 KiB e5222b546da1f34191ce9bdf2d12490e9279d4263443616cbd72feb3aceb9da1

No changes file available.

Binary packages built by this source

libcutlass-dev: CUDA Templates for Linear Algebra Subroutines

 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-matrix multiplication (GEMM) and related computations
 at all levels and scales within CUDA. It incorporates strategies for
 hierarchical decomposition and data movement similar to those used to implement
 cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable,
 modular software components abstracted by C++ template classes. Primitives for
 different levels of a conceptual parallelization hierarchy can be specialized
 and tuned via custom tiling sizes, data types, and other algorithmic policies.
 The resulting flexibility simplifies their use as building blocks within custom
 kernels and applications.
 .
 To support a wide variety of applications, CUTLASS provides extensive support
 for mixed-precision computations, providing specialized data-movement and
 multiply-accumulate abstractions for half-precision floating point (FP16),
 BFloat16 (BF16), Tensor Float 32 (TF32), single-precision floating point
 (FP32), FP32 emulation via tensor core instructions, double-precision
 floating point (FP64) types, integer data types (4b and 8b), and binary
 data types (1b). CUTLASS demonstrates warp-synchronous matrix multiply
 operations targeting the programmable, high-throughput Tensor Cores
 implemented by NVIDIA's Volta, Turing, Ampere, and Hopper architectures.
 .
 This is a header-only library.
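
As a sketch of how the headers shipped in libcutlass-dev are typically used, the snippet below instantiates CUTLASS's device-level GEMM template for single-precision, column-major operands and launches C = alpha*A*B + beta*C. It follows the pattern of the upstream basic_gemm example; the wrapper function name and the error mapping are illustrative, and compiling it requires nvcc, a CUDA-capable GPU, and the CUTLASS include path.

```cpp
#include <cutlass/gemm/device/gemm.h>

// Instantiate a device-wide GEMM for FP32 inputs/outputs in column-major
// layout. Tiling sizes and other algorithmic policies take their defaults
// here, but can be overridden via further template arguments.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // element/layout of A
    float, cutlass::layout::ColumnMajor,   // element/layout of B
    float, cutlass::layout::ColumnMajor>;  // element/layout of C

// Illustrative wrapper (not part of CUTLASS): computes
// C = alpha * A * B + beta * C on device pointers.
cudaError_t run_sgemm(int M, int N, int K,
                      float alpha,
                      float const *A, int lda,
                      float const *B, int ldb,
                      float beta,
                      float *C, int ldc) {
  Gemm gemm_op;

  // Arguments: problem size, tensor refs for A, B, C (source), C (dest),
  // and the epilogue scalars {alpha, beta}.
  cutlass::Status status = gemm_op({{M, N, K},
                                    {A, lda},
                                    {B, ldb},
                                    {C, ldc},
                                    {C, ldc},
                                    {alpha, beta}});

  return status == cutlass::Status::kSuccess ? cudaSuccess
                                             : cudaErrorUnknown;
}
```

Because the library is header-only, no linking step is needed beyond the CUDA runtime; a typical build line would resemble `nvcc -I/usr/include example.cu` with the Debian package installed.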