The Ginkgo team is proud to announce the new Ginkgo minor release 1.10.0.
This release brings new features such as:
- Support for bfloat16 precision. The type `gko::bfloat16` can now be selected in most instances as the value type of a matrix, solver, preconditioner, etc. If the selected backend supports bfloat16 as a native type, the native type is used within the kernels; otherwise the kernels may incur a conversion overhead. The new behavior is enabled by default, but it can be turned off during CMake configuration.
- Mixed precision support in our distributed matrix, provided the underlying matrix formats support mixed precision.
- New pipelined CG solver. This specialization of the CG solver is suited to reducing the communication overhead in large-scale distributed computations.
- New Chebyshev iteration solver.
- An OpenMP implementation of the merge-path based SpMV algorithm.
And more!
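To give a feel for what a bfloat16 conversion involves when a backend has no native type, here is a minimal, self-contained sketch. It is not the `gko::bfloat16` implementation; the helper names `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical and chosen only for illustration. The sketch assumes IEEE 754 single precision for `float` and uses round-to-nearest-even when dropping the lower 16 bits.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Hypothetical helper (not part of Ginkgo): keep the upper 16 bits of a
// float, rounding to nearest with ties to even.
inline std::uint16_t float_to_bf16_bits(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    if (std::isnan(value)) {
        // Force a mantissa bit that survives truncation so NaN stays NaN.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Rounding bias: 0x7FFF plus the lowest kept bit implements ties-to-even.
    const std::uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + bias) >> 16);
}

// Hypothetical helper (not part of Ginkgo): widening back to float is exact,
// the 16 stored bits become the upper half of the 32-bit pattern.
inline float bf16_bits_to_float(std::uint16_t stored)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(stored) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Because bfloat16 keeps the 8 exponent bits of `float` but only 7 stored mantissa bits, a round trip preserves the range of the value while limiting the relative accuracy to roughly 2^-8.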
If you face an issue, please first check our known issues page and the open issues list. If you do not
find a solution, feel free to open a new issue or ask a question using GitHub discussions.
Supported systems and requirements:
- For all platforms, CMake 3.16+
- C++17 compliant compiler
- Linux and macOS
  - GCC: 7.0+
  - clang: 5.0+
  - Intel compiler: 2019+
  - Apple Clang: 15.0 is tested. Earlier versions might also work.
  - NVHPC: 22.7+
  - Cray Compiler: 14.0.1+
  - CUDA module: CMake 3.18+, and CUDA 11.0+ or NVHPC 22.7+, Compute Capability 5.3+
  - HIP module: CMake 3.21+, and ROCm 4.5+
  - DPC++ module: Intel oneAPI 2023.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`.
  - MPI: standard version 3.1+, ideally GPU Aware, for best performance
- Windows
  - MinGW: GCC 7.0+
  - Microsoft Visual Studio: VS 2019+
  - CUDA module: CUDA 11.0+, Microsoft Visual Studio
  - OpenMP module: MinGW.
Behavior changes
- A CMake format style has been added to make the formatting of CMake files uniform #1755
- The file config for the preconditioners Ic and Ilu now only takes `value_type`, not `l_solver_type` or `u_solver_type` parameters #1811, #1828
- The distributed matrix now uses collective neighborhood communication if possible #1589
Deprecations
- The `experimental::EnableDistributedLinOp` mixin has been removed; `EnableLinOp` can be used instead #1751.
Summary of previous deprecations
- The `Executor::run` overload without a name as the first parameter has been deprecated #1667
- The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and their `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface.
- The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`.
- The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`.
- The Permutation class' `permute_mask` functionality.
- Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc).
- `gko::lend()` is not necessary anymore.
- The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`.
- The class `AmgxPgm` is deprecated in favor of `Pgm`.
- Default constructors for the CSR `load_balance` and `automatical` strategies
- The PolymorphicObject's move-semantic `copy_from` variant
- The templated `SolverBase` class.
- The class `MachineTopology` is deprecated in favor of `machine_topology`.
- Logger constructors and create functions with the `executor` parameter.
- The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc.
- Logger events for solvers and criteria without the additional `implicit_tau_sq` parameter.
- The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`.
- `array::get_num_elems()` has been renamed to `get_size()`
- `matrix_data::ensure_row_major_order()` has been renamed to `sort_row_major()`
- `device_matrix_data::get_num_elems()` has been renamed to `get_num_stored_elements()`
- The CMake parameter `GINKGO_COMPILER_FLAGS` has been superseded by `CMAKE_CXX_FLAGS`, and `GINKGO_CUDA_COMPILER_FLAGS` has been superseded by `CMAKE_CUDA_FLAGS`
- The `std::initializer_list` overloads of matrix `create` methods and constructors are deprecated in favor of explicit `array` parameters
Added features
- Add a pipelined CG solver #1824, #1838, #1859
- Add Coo Transpose/Conj-Transpose #1816
- Add Chebyshev iteration solver #1289
- Add a two-level Schwarz preconditioner #1431
- Add simplified configuration for stopping criteria #1613
- Add an example to show the distributed multigrid usage #1769
- Add half precision support for MPI #1759
- Add yaml-cpp reader to parse config files in YAML format #1677
- Add local and distributed L1-Jacobi #1310, #1806
- Add reusable permutation and transpose operations #1338
- Add collective communication interface and dense/neighborhood implementation of the interface #1780
- Add local-to-global index mapping #1707
- Add Minres solver #975
- Add `array::copy_to_host` utility function #1835
- Add bfloat16 support and corresponding MPI functions #1825, #1827
- Add mixed precision support for distributed matrix when the underlying matrix also supports mixed precision #1819.
- Add distributed RowGatherer which is used by the distributed matrix to handle the communication #1589
- Add complex type support for Dense transpose and Fbcsr on AMD GPUs #1839
- Add OMP implementation for Merge-Path CSR #1810
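
For readers unfamiliar with the Chebyshev iteration listed above, the following is a minimal sketch of the textbook three-term recurrence for a symmetric positive definite matrix whose eigenvalues lie in a known interval. It does not use the Ginkgo API and makes no claim about how Ginkgo's solver is implemented; the names `multiply`, `chebyshev_solve`, `lambda_min`, and `lambda_max` are assumptions made for this illustration, and a small dense matrix stands in for a real sparse operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using DenseMatrix = std::vector<Vector>;

// y = A * x for a small dense matrix (illustration only).
Vector multiply(const DenseMatrix& a, const Vector& x)
{
    Vector y(a.size(), 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < x.size(); ++j) {
            y[i] += a[i][j] * x[j];
        }
    }
    return y;
}

// Hypothetical helper: textbook Chebyshev iteration starting from x = 0,
// assuming all eigenvalues of A lie in [lambda_min, lambda_max].
Vector chebyshev_solve(const DenseMatrix& a, const Vector& b,
                       double lambda_min, double lambda_max,
                       int num_iterations)
{
    assert(0.0 < lambda_min && lambda_min < lambda_max);
    const std::size_t n = b.size();
    const double theta = 0.5 * (lambda_max + lambda_min);  // interval center
    const double delta = 0.5 * (lambda_max - lambda_min);  // interval radius
    const double sigma = theta / delta;
    double rho = 1.0 / sigma;

    Vector x(n, 0.0);
    Vector r = b;  // residual b - A * x for the zero initial guess
    Vector d(n);
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = r[i] / theta;
    }
    for (int k = 0; k < num_iterations; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += d[i];
        }
        const Vector ad = multiply(a, d);
        for (std::size_t i = 0; i < n; ++i) {
            r[i] -= ad[i];
        }
        const double rho_next = 1.0 / (2.0 * sigma - rho);
        for (std::size_t i = 0; i < n; ++i) {
            d[i] = rho_next * rho * d[i] + 2.0 * rho_next / delta * r[i];
        }
        rho = rho_next;
    }
    return x;
}
```

The loop contains no inner products, only matrix-vector products and vector updates. That absence of global reductions is the usual reason this method is considered attractive for distributed runs, at the cost of needing eigenvalue bounds up front.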
Improvements
- Improve performance of factorization validation in benchmarks #1766
- Allow specifying a ValueType instead of a full SolverType in the preconditioners Ic #1811 and Ilu #1828. Note: this changes the behavior of the config usage; please take a look at the behavior changes section.
- Avoid refilling the constant scalar in the workspace in each apply #1846
Fixes
- Fix a oneMKL GEMM issue on zero-sized matrix #1756
- Fix error with ILU/IC generation and default algorithm on OpenMP #1783, #1855
- Avoid NaN values being propagated through multiplications with zero scalars in linear combination apply and simple BLAS operations #1573
- Fix IR `move` operation #1812
- Fix CUDA 12.2 null rowptr issue when setting the cusparse CSR matrix #1843
- Fix COO unsupported exception on an empty matrix with 16bit precision #1843
- Fix METIS detection when GKLib is linked into the METIS library #1847
- Fix bfloat16 issue on CUDA before CUDA 12.2 and oneAPI before oneAPI 2024.2 #1848
- Work around compiler bug related to warp ballot on H100 GPUs with CUDA 12.2 - 12.4 #1849
- Fix a race condition in LU factorization #1850
- Fix the 16bit precision NaN check in triangular solve #1860
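
The NaN propagation fix in the list above (#1573) concerns a general pitfall: in IEEE arithmetic `0 * NaN` is NaN, so computing `alpha * x + beta * y` with `beta == 0` still carries over any NaN or infinity already stored in `y`. The sketch below shows one common way to avoid this by treating a zero scalar as "ignore this operand". It is an illustration under that assumption, not Ginkgo's kernel; `axpby_zero_safe` and `all_finite` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Ginkgo's kernel): y = alpha * x + beta * y,
// where a zero scalar skips the operand instead of multiplying by zero.
void axpby_zero_safe(double alpha, const std::vector<double>& x, double beta,
                     std::vector<double>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        // 0 * NaN and 0 * inf are NaN, so avoid the product entirely.
        const double scaled_x = (alpha == 0.0) ? 0.0 : alpha * x[i];
        const double scaled_y = (beta == 0.0) ? 0.0 : beta * y[i];
        y[i] = scaled_x + scaled_y;
    }
}

// Hypothetical helper: true if no entry is NaN or infinite.
bool all_finite(const std::vector<double>& values)
{
    for (const double v : values) {
        if (!std::isfinite(v)) {
            return false;
        }
    }
    return true;
}
```

With this convention an output vector that was never initialized, or that holds NaN from an earlier step, is safely overwritten whenever `beta` is zero.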