Content

How To
Time Measurement
Configuration
Tasks
Results

How To

Create a GitHub account (if you do not have one);
Make sure cloning and pushing over SSH work (see Connecting to GitHub with SSH);
Fork this repo (click the Fork button at the top of the page; detailed instructions here);
Clone your fork to your local machine, replacing username with your GitHub username:

git clone git@github.com:username/cuda-ai-2026.git
cd cuda-ai-2026

Go to your group folder, e.g.:

cd default

Go to the folder of the task you are working on, e.g.:

cd 1_gelu_omp

Create a new folder named after your surname and first name (use the same name for all tasks), e.g.:

mkdir petrov_ivan

Copy your task source and header files (including the main program) into this folder (on Windows, use copy instead of cp), e.g.:

cd petrov_ivan
cp /home/usr/lab/*.cpp .
cp /home/usr/lab/*.h .

Push your sources to your GitHub repo, e.g.:

cd ..
git add .
git commit -m "1_gelu_omp task"
git push

Go to your repo in the browser, click the Contribute button at the top of the page, then Open pull request. Provide a meaningful title and description, then click Create pull request (see details here).
Go to the Pull Requests page of the course repo, find your pull request, and check that it has no merge conflicts. If there are conflicts, resolve them following the instructions provided by GitHub.

Time Measurement

The following scheme is used to measure task execution time:

int main() {
    // ...

    // Warming-up
    Task(input, size);

    // Performance Measuring
    std::vector<double> time_list;
    for (int i = 0; i < 4; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        Task(input, size);
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> duration = end - start;
        time_list.push_back(duration.count());
    }
    double time = *std::min_element(time_list.begin(), time_list.end());

    // ...
}
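The scheme above can be wrapped into a small reusable helper. The following is only a sketch: the name MeasureMinTime and its repeats parameter are hypothetical and are not part of the course code. It assumes repeats is at least 1.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper (not part of the course code): runs `task` once to warm
// up, then `repeats` timed runs, and returns the minimum time in seconds.
// Assumes repeats >= 1.
template <typename Func>
double MeasureMinTime(Func&& task, int repeats = 4) {
    task();  // Warming-up run, not timed

    std::vector<double> time_list;
    time_list.reserve(repeats);
    for (int i = 0; i < repeats; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
        task();
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> duration = end - start;
        time_list.push_back(duration.count());
    }
    // The minimum is the run least disturbed by other activity on the machine
    return *std::min_element(time_list.begin(), time_list.end());
}
```

A call could look like MeasureMinTime([&] { Task(input, size); }), where Task, input and size are the names used in the scheme above.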

Configuration

CPU: Intel Core i5 12600K (4 cores, 4 threads)
RAM: 16 GB
GPU: NVIDIA RTX 4060 (8 GB)
OS: Ubuntu 22.04.3 LTS
Host Compiler: GCC 11.4.0 (C++17)
CUDA: 12.9

Tasks

Task #1: OpenMP GELU Implementation

The Gaussian Error Linear Unit (GELU) is an activation function frequently used in Deep Neural Networks (DNNs) and can be thought of as a smoother ReLU.

To approximate the GELU function, use the following formula:

GELU(x) = $0.5x(1 + \tanh(\sqrt{2 / \pi}(x + 0.044715x^3)))$
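For checking results, the formula can be evaluated with a plain serial loop. This is a minimal sketch, not the course's reference implementation; the name GeluReference is an assumption made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical serial reference for the tanh-based GELU approximation.
// Intended only for validating the output of an optimized version.
std::vector<float> GeluReference(const std::vector<float>& input) {
    const float kSqrt2OverPi = 0.7978845608f;  // sqrt(2 / pi)
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        const float x = input[i];
        const float inner = kSqrt2OverPi * (x + 0.044715f * x * x * x);
        output[i] = 0.5f * x * (1.0f + std::tanh(inner));
    }
    return output;
}
```

For example, GeluReference({0.0f, 1.0f}) gives 0 for the first element and roughly 0.8412 for the second.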

Implement the function with the following interface in C++:

std::vector<float> GeluOMP(const std::vector<float>& input);

The result vector must have the same size as the input. Use OpenMP to make your function parallel and fast.

Two files are expected to be uploaded:

gelu_omp.h

#ifndef __GELU_OMP_H
#define __GELU_OMP_H

#include <vector>

std::vector<float> GeluOMP(const std::vector<float>& input);

#endif // __GELU_OMP_H

gelu_omp.cpp

#include "gelu_omp.h"

std::vector<float> GeluOMP(const std::vector<float>& input) {
    // Place your implementation here
}

Performance Hints:

use a cheaper formula to compute GELU, e.g. replace tanh() with exp();
loop unrolling;
loop vectorization;
vector allocation and computations in different threads (Windows only).
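The first hint relies on the identity $0.5(1 + \tanh(z)) = 1 / (1 + e^{-2z})$, which turns the formula into $x / (1 + e^{-2z})$ with $z = \sqrt{2/\pi}(x + 0.044715x^3)$. The sketch below combines it with a basic OpenMP loop. The name GeluExpSketch is hypothetical, and whether this variant is actually faster depends on the compiler and flags, so it should be measured rather than assumed.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: exp-based GELU with a simple OpenMP parallel loop.
// Uses 0.5 * (1 + tanh(z)) == 1 / (1 + exp(-2z)).
// Without -fopenmp the pragma is ignored and the loop runs serially.
std::vector<float> GeluExpSketch(const std::vector<float>& input) {
    const float kSqrt2OverPi = 0.7978845608f;  // sqrt(2 / pi)
    const long long n = static_cast<long long>(input.size());
    std::vector<float> output(input.size());
#pragma omp parallel for
    for (long long i = 0; i < n; ++i) {  // signed index for OpenMP loops
        const float x = input[i];
        const float z = kSqrt2OverPi * (x + 0.044715f * x * x * x);
        output[i] = x / (1.0f + std::exp(-2.0f * z));
    }
    return output;
}
```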

Task #2: CUDA GELU Implementation

Implement the function with the following interface in CUDA C++ using the formula described above:

std::vector<float> GeluCUDA(const std::vector<float>& input);

The result vector must have the same size as the input. Use CUDA to run your function on an NVIDIA GPU, and try to make it fast.

Two files are expected to be uploaded:

gelu_cuda.h

#ifndef __GELU_CUDA_H
#define __GELU_CUDA_H

#include <vector>

std::vector<float> GeluCUDA(const std::vector<float>& input);

#endif // __GELU_CUDA_H

gelu_cuda.cu

#include "gelu_cuda.h"

std::vector<float> GeluCUDA(const std::vector<float>& input) {
    // Place your implementation here
}

Performance Hints:

overlap host memory allocation and CUDA computations;
allocate and free device memory once;
use better formula to compute GELU, e.g. replace tanh() with exp().

Task #3: Naive Matrix Multiplication using CUDA

General matrix multiplication (GEMM) is a basic and widely used linear algebra operation in high performance computing (HPC), statistics, deep learning and other domains. There are many GEMM algorithms with different computational complexity, from $O(n^3)$ for the naive and block approaches to $O(n^{2.371552})$ for the method described by Williams et al. in 2024 [1]. Despite the variety of algorithms with lower asymptotic complexity, block matrix multiplication remains the most widely used implementation in practice because it fits modern hardware better.

To ease into matrix multiplication, let us start with the naive approach. To compute the product C = A * B of matrices A and B, where all matrices have size $n \times n$ (only square matrices are considered for simplicity), use the following formula for each element of C:

$c_{ij}=\sum_{k=1}^na_{ik}b_{kj}$

In this task, implement the naive approach to matrix multiplication in CUDA and try to make it fast (pay attention to global memory accesses in your code).

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs and returns the result matrix, also stored by rows.

For simplicity, assume the matrix size is always a power of 2.
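A CPU version of the formula is handy for validating GPU output on small sizes. The following is a sketch under the row-major layout described above; the name NaiveGemmReference is hypothetical and it is not the course's reference implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: c[i][j] = sum over k of a[i][k] * b[k][j],
// with all matrices stored by rows in linear arrays of size n * n.
std::vector<float> NaiveGemmReference(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      int n) {
    const std::size_t size = static_cast<std::size_t>(n);
    std::vector<float> c(size * size, 0.0f);
    for (std::size_t i = 0; i < size; ++i) {
        // The i-k-j loop order walks b and c along rows (contiguous memory)
        for (std::size_t k = 0; k < size; ++k) {
            const float a_ik = a[i * size + k];
            for (std::size_t j = 0; j < size; ++j) {
                c[i * size + j] += a_ik * b[k * size + j];
            }
        }
    }
    return c;
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8} with n = 2, the result is {19, 22, 43, 50}.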

Two files are expected to be uploaded:

naive_gemm_cuda.h:

#ifndef __NAIVE_GEMM_CUDA_H
#define __NAIVE_GEMM_CUDA_H

#include <vector>

std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n);

#endif // __NAIVE_GEMM_CUDA_H

naive_gemm_cuda.cu:

#include "naive_gemm_cuda.h"

std::vector<float> NaiveGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    // Place your implementation here
}

Performance Hints:

warp-friendly memory accesses;
multiple elements per warp processing;
loop unrolling and memory load vectorization;
block size selection;
overlap host memory allocation and CUDA computations.

Task #4: Block Matrix Multiplication using CUDA

In real applications, the block-based approach to matrix multiplication can run several times faster than the naive version because it is cache-friendly. To prove this in practice, implement such a version in C++ using OpenMP.

The block algorithm can be divided into three stages:

Split the matrices into blocks (the block size usually affects performance significantly, so choose it carefully);
Multiply two blocks to get a partial result;
Repeat step 2 for all row/column blocks, accumulating the values into a single result block.

Mathematically, block matrix multiplication is described by the following formula, where $C_{IJ}$, $A_{IK}$ and $B_{KJ}$ are sub-matrices of size $block\_size \times block\_size$:

$C_{IJ}=\sum_{K=1}^{block\_count}A_{IK}B_{KJ}$

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs and returns the result matrix, also stored by rows.

In CUDA C++, the block-based approach looks similar, but to get better performance one should use CUDA shared memory to store each block during the computation. With this in mind, the algorithm is as follows:

A single CUDA block computes a single block of the result matrix C, and a single CUDA thread computes a single element of C;
For each A block in a row and B block in a column:
1. Load the A block into shared memory;
2. Load the B block into shared memory;
3. Synchronize all threads in the block;
4. Compute BlockA * BlockB and accumulate into the C block in shared memory;
5. Synchronize all threads in the block;
Write block C from shared memory to global memory.
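The block structure can be illustrated on the CPU before moving to shared memory. This is a sketch only: the name BlockGemmReference and the block_size parameter are hypothetical (the task interface has no such parameter), and it assumes block_size divides n. The three outer loops pick blocks of C, A and B; the inner loops multiply one pair of blocks and accumulate into the C block.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU sketch of block matrix multiplication (row-major storage).
// Assumes block_size > 0 and n % block_size == 0.
std::vector<float> BlockGemmReference(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      int n, int block_size) {
    const std::size_t size = static_cast<std::size_t>(n);
    const std::size_t bs = static_cast<std::size_t>(block_size);
    std::vector<float> c(size * size, 0.0f);
    for (std::size_t bi = 0; bi < size; bi += bs) {          // block row of C
        for (std::size_t bj = 0; bj < size; bj += bs) {      // block column of C
            for (std::size_t bk = 0; bk < size; bk += bs) {  // A_IK * B_KJ
                for (std::size_t i = bi; i < bi + bs; ++i) {
                    for (std::size_t k = bk; k < bk + bs; ++k) {
                        const float a_ik = a[i * size + k];
                        for (std::size_t j = bj; j < bj + bs; ++j) {
                            c[i * size + j] += a_ik * b[k * size + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

In the CUDA version, the two outer loops correspond to the grid of CUDA blocks, the bk loop stays a loop inside the kernel, and the block contents are staged in shared memory as listed above.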

For simplicity, assume the matrix size is always a power of 2.

Two files are expected to be uploaded:

block_gemm_cuda.h:

#ifndef __BLOCK_GEMM_CUDA_H
#define __BLOCK_GEMM_CUDA_H

#include <vector>

std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n);

#endif // __BLOCK_GEMM_CUDA_H

block_gemm_cuda.cu:

#include "block_gemm_cuda.h"

std::vector<float> BlockGemmCUDA(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n) {
    // Place your implementation here
}

Performance Hints:

shared memory usage to store matrix block;
warp-friendly memory accesses;
multiple elements per warp processing;
loop unrolling and memory load vectorization;
block size selection;
overlap host memory allocation and CUDA computations.

Task #5: Matrix Multiplication using cuBLAS

The most performant way to multiply two matrices on particular hardware is usually to use a vendor-provided library. For CUDA, this is cuBLAS. Use the cuBLAS API to implement general matrix multiplication as efficiently as you can.

Each matrix must be stored in a linear array by rows, so that a.size()==n*n. The function takes two matrices and their size as inputs and returns the result matrix, also stored by rows.

For simplicity, assume the matrix size is always a power of 2.

Note that the cuBLAS API expects matrices to be stored by columns, so an additional transpose may be required.
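One way to reason about the layout mismatch: a row-major buffer read as column-major is the transpose of the matrix, and $(AB)^T = B^T A^T$. So a column-major multiply called with the operands swapped produces a buffer that, read by rows, is $AB$, with no explicit transpose. The sketch below demonstrates this on the CPU only. ColMajorGemm is a hypothetical stand-in for a column-major GEMM routine and is not a cuBLAS call; whether the same trick is the best choice with cuBLAS is left to the reader to verify.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical column-major GEMM: z = x * y, where element (i, j) of each
// n x n matrix lives at index i + j * n.
std::vector<float> ColMajorGemm(const std::vector<float>& x,
                                const std::vector<float>& y,
                                int n) {
    const std::size_t size = static_cast<std::size_t>(n);
    std::vector<float> z(size * size, 0.0f);
    for (std::size_t j = 0; j < size; ++j) {
        for (std::size_t k = 0; k < size; ++k) {
            const float y_kj = y[k + j * size];
            for (std::size_t i = 0; i < size; ++i) {
                z[i + j * size] += x[i + k * size] * y_kj;
            }
        }
    }
    return z;
}

// Row-major product a * b obtained from the column-major routine by swapping
// the operands: the routine sees B^T and A^T and writes (AB)^T by columns,
// which is AB by rows.
std::vector<float> RowMajorGemmViaColMajor(const std::vector<float>& a,
                                           const std::vector<float>& b,
                                           int n) {
    return ColMajorGemm(b, a, n);
}
```

For row-major a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, RowMajorGemmViaColMajor(a, b, 2) returns {19, 22, 43, 50}, the row-major product.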

Two files are expected to be uploaded:

gemm_cublas.h:

#ifndef __GEMM_CUBLAS_H
#define __GEMM_CUBLAS_H

#include <vector>

std::vector<float> GemmCUBLAS(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int n);

#endif // __GEMM_CUBLAS_H

gemm_cublas.cu:

#include "gemm_cublas.h"

std::vector<float> GemmCUBLAS(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int n) {
    // Place your implementation here
}

Performance Hints:

overlap host memory allocation and CUDA computations;
avoid redundant device memory allocation.

Task #6: CUDA Softmax Implementation

The softmax function is a fundamental operation in machine learning, often used to convert a vector of raw scores into a probability distribution. For an input vector $x$ of length $N$, the softmax is defined element-wise as:

Softmax(x)_i = $e^{x_i}/\sum_{j=1}^N e^{x_j}$ for $i=1,..,N$

When the input is a matrix, softmax is applied independently to each row.

To make the computation numerically stable in floating-point arithmetic, the following equivalent formula is used in practice:

Softmax(x)_i = $e^{x_i-row\_max}/\sum_{j=1}^N e^{x_j-row\_max}$ for $i=1,..,N$

Here $row\_max$ is $\max(x_i)$ over $i=1,..,N$, computed independently for each row of the matrix.
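The stable formula can be written as a short CPU routine for validating GPU results. This is a sketch, not the course's reference: the name SoftmaxReference is hypothetical, and it takes explicit rows and cols parameters (an assumption made here for clarity) instead of the task's interface.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for row-wise, numerically stable softmax.
// `input` holds a rows x cols matrix in row-major order; cols must be > 0.
std::vector<float> SoftmaxReference(const std::vector<float>& input,
                                    std::size_t rows, std::size_t cols) {
    std::vector<float> output(input.size());
    for (std::size_t r = 0; r < rows; ++r) {
        const float* in = input.data() + r * cols;
        float* out = output.data() + r * cols;
        // Subtracting the row maximum keeps every exponent <= 0
        const float row_max = *std::max_element(in, in + cols);
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            out[c] = std::exp(in[c] - row_max);
            sum += out[c];
        }
        for (std::size_t c = 0; c < cols; ++c) {
            out[c] /= sum;
        }
    }
    return output;
}
```

With the unshifted formula, a row such as {1000, 1000} would overflow in float, since $e^{1000}$ is not representable. With the shift, both exponents are 0 and the result is {0.5, 0.5}.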

Implement the function with the following interface in C++ using CUDA:

std::vector<float> SoftmaxCUDA(const std::vector<float>& input, int row_size);

Note the following:

the parameter input holds the matrix elements in row‑major order (all elements of row 0, then row 1, etc.);
the number of rows is given by row_size;
the number of columns can be derived as col_size = input.size() / row_size (it is guaranteed that input.size() is divisible by row_size);
the function must compute softmax for each row independently and return a vector of the same size containing the row‑wise softmax results.

Use CUDA to parallelize the computation. The implementation should be efficient – consider using shared memory for per‑row reductions and exponentiations.

For simplicity, assume the matrix dimensions are always powers of 2.

Two files are expected to be uploaded:

softmax_cuda.h:

#ifndef SOFTMAX_CUDA_H
#define SOFTMAX_CUDA_H

#include <vector>

std::vector<float> SoftmaxCUDA(const std::vector<float>& input, int row_size);

#endif // SOFTMAX_CUDA_H

softmax_cuda.cu:

#include "softmax_cuda.h"

std::vector<float> SoftmaxCUDA(const std::vector<float>& input, int row_size) {
    // Place your implementation here
}

Performance Hints:

overlap host memory allocation and CUDA computations;
use registers and/or shared memory to cache input values.

Task #7: Layer Norm Implementation in PyCUDA

Layer Normalization (LayerNorm) is a widely used technique in deep learning that normalizes activations across the feature dimension for each sample independently. For an input vector x of length N (the features of one sample), LayerNorm is defined as:

$$x'_i=(x_i-\mu)/\sqrt{\sigma^2+\epsilon}$$ $$y_i=\gamma_ix'_i+\beta_i$$

where:

$\mu=\frac{1}{N}\sum_{j=1}^N x_j$ is the mean of the features;
$\sigma^2=\frac{1}{N}\sum_{j=1}^N (x_j-\mu)^2$ is the variance;
$\epsilon$ is a small constant for numerical stability (e.g. $10^{-5}$);
$\gamma$ and $\beta$ are learnable parameters (vectors of length N) that scale and shift the normalized output.

When the input is a matrix (batch of samples), LayerNorm is applied independently to each row.
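The definition above can be checked against a plain CPU routine. The sketch below is written in C++ to match the other tasks, even though this task's deliverable is a PyCUDA file. The name LayerNormReference is hypothetical, and row_size is taken to be the number of features per row.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for row-wise LayerNorm.
// `input` is a row-major matrix whose rows have `row_size` elements;
// `gamma` and `beta` have length `row_size`.
std::vector<float> LayerNormReference(const std::vector<float>& input,
                                      const std::vector<float>& gamma,
                                      const std::vector<float>& beta,
                                      std::size_t row_size,
                                      float eps = 1e-5f) {
    std::vector<float> output(input.size());
    const std::size_t rows = input.size() / row_size;
    for (std::size_t r = 0; r < rows; ++r) {
        const float* x = input.data() + r * row_size;
        float* y = output.data() + r * row_size;
        float mean = 0.0f;
        for (std::size_t j = 0; j < row_size; ++j) mean += x[j];
        mean /= static_cast<float>(row_size);
        float var = 0.0f;
        for (std::size_t j = 0; j < row_size; ++j) {
            var += (x[j] - mean) * (x[j] - mean);
        }
        var /= static_cast<float>(row_size);  // biased variance, as defined above
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (std::size_t j = 0; j < row_size; ++j) {
            y[j] = gamma[j] * (x[j] - mean) * inv_std + beta[j];
        }
    }
    return output;
}
```

For the row {1, 2, 3, 4} with $\gamma_i=1$ and $\beta_i=0$, the mean is 2.5 and the variance is 1.25, so the outputs are approximately {-1.3416, -0.4472, 0.4472, 1.3416}.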

To complete the task, implement the following function in PyCUDA. Only one file is expected to be uploaded:

layernorm_pycuda.py

import numpy as np

def layernorm_pycuda(input, gamma, beta, row_size, eps=1e-5):
    """
    Apply Layer Normalization to each row of the input matrix.

    Parameters
    ----------
    input : list or numpy.ndarray of float
        Flattened matrix in row‑major order. Its length must be divisible by row_size.
    gamma : list or numpy.ndarray of float
        Scale parameter, length = row_size.
    beta : list or numpy.ndarray of float
        Shift parameter, length = row_size.
    row_size : int
        Number of features per row (i.e., number of columns).
    eps : float, optional
        Small constant for numerical stability.

    Returns
    -------
    numpy.ndarray
        Flattened matrix of the same shape as input, containing the row‑wise
        normalized results.
    """
    # TODO: Implement using PyCUDA
    pass

For simplicity, assume row_size is a power of 2. The target data type is float32. You may write CUDA kernels using numba or C source strings.

Results

1_gelu_omp (134217728 elements)

Group	Name	Result	Rank
default	pisarevsky_vadim	0.0806	1
default	lobanova_elizaveta	0.0838	4
FAST	FAST	0.0879	-
default	chekmaryov_petr	0.0882	3
default	zvorykin_aleksandr	0.1554	9
default	chervyakov_ivan	0.1614	8
default	smirnov_denis	0.1632	2
default	zinoviev_vladimir	0.1664	5
default	vikhrev_ivan	0.1686	12
default	znamenskiy_mikhail	0.1723	7
default	pinegina_natalia	0.2212	11
default	lukicheva_polina	0.2277	10
default	korobeynikov_aleksey	0.3856	13
default	pigasin_dmitry	0.3863	6
REF	REF	0.4536	-
default	zemskov_roman	TEST FAILED	-
default	kryukov_dmitry	TEST FAILED	-
default	kireev_daniil	TEST FAILED	-

2_gelu_cuda (134217728 elements)

Group	Name	Result	Rank
FAST	FAST	0.1186	-
default	vikhrev_ivan	0.1559	11
default	zvorykin_aleksandr	0.1598	8
default	znamenskiy_mikhail	0.1648	6
default	pisarevsky_vadim	0.1653	2
default	lobanova_elizaveta	0.1671	3
default	chervyakov_ivan	0.1709	9
default	zinoviev_vladimir	0.1751	5
default	smirnov_denis	0.1770	1
REF	REF	0.1864	-
default	pinegina_natalia	0.2180	7
default	lukicheva_polina	0.2290	4
default	zemskov_roman	0.3112	10
default	chekmaryov_petr	TEST FAILED	-

3_naive_gemm_cuda (4096 elements)

Group	Name	Result	Rank
FAST	FAST	0.0710	-
default	smirnov_denis	0.0769	1
default	zemskov_roman	0.1291	5
default	lobanova_elizaveta	0.1599	4
default	znamenskiy_mikhail	0.1614	6
default	zinoviev_vladimir	0.1660	2
default	chekmaryov_petr	0.1661	3
REF	REF	0.5748	-
default	pinegina_natalia	TEST FAILED	-

4_block_gemm_cuda (4096 elements)

Group	Name	Result	Rank
FAST	FAST	0.0695	-
default	zinoviev_vladimir	0.1322	2
default	smirnov_denis	0.1336	1
REF	REF	0.2981	-

5_gemm_cublas (4096 elements)

Group	Name	Result	Rank
FAST	FAST	0.0388	-
default	smirnov_denis	0.0438	1
REF	REF	0.0467	-

6_softmax_cuda (8192x16384 elements)

Group	Name	Result	Rank
FAST	FAST	0.1318	-
default	smirnov_denis	0.1727	1
REF	REF	0.1814	-

7_layernorm_pycuda (8192x16384 elements)

Group	Name	Result	Rank
REF	REF	0.1930	-

Tasks Done

default

Group	Name	Passed	Score
default	chekmaryov_petr	2/7	117
default	chervyakov_ivan	2/7	104
default	kireev_daniil	0/7	0
default	korobeynikov_aleksey	1/7	41
default	kryukov_dmitry	0/7	0
default	lobanova_elizaveta	3/7	177
default	lukicheva_polina	2/7	97
default	pigasin_dmitry	1/7	47
default	pinegina_natalia	2/7	95
default	pisarevsky_vadim	2/7	124
default	smirnov_denis	6/7	370
default	vikhrev_ivan	2/7	100
default	zemskov_roman	2/7	104
default	zinoviev_vladimir	4/7	230
default	znamenskiy_mikhail	3/7	163
default	zvorykin_aleksandr	2/7	109

Passed: 0

Total Passed: 0

Maximum Score: 448 (64 per task)
