@@ -48,7 +48,7 @@ API](https://docs.nvidia.com/cuda/cuda-driver-api/index.html) and
Python package. In this example, you copy data from the host to the device. You need
[NumPy](https://numpy.org/doc/stable/contents.html) to store data on the host.

-```{code-cell} python
+```python
from cuda.bindings import driver, nvrtc
import numpy as np
```
@@ -58,7 +58,7 @@ example is provided.
In a future release, this may automatically raise exceptions using a Python
object model.

-```{code-cell} python
+```python
def _cudaGetErrorEnum(error):
    if isinstance(error, driver.CUresult):
        err, name = driver.cuGetErrorName(error)
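The hunk cuts the helper off here. For context, the page's full helper maps both `CUresult` and `nvrtc` status codes to names and unwraps the `(status, value)` tuples every binding call returns; a sketch of that pattern, reconstructed from the lines shown above rather than taken from this diff:

```python
def _cudaGetErrorEnum(error):
    # Map a status code from either API to a human-readable name
    if isinstance(error, driver.CUresult):
        err, name = driver.cuGetErrorName(error)
        return name if err == driver.CUresult.CUDA_SUCCESS else "<unknown>"
    elif isinstance(error, nvrtc.nvrtcResult):
        return nvrtc.nvrtcGetErrorString(error)[1]
    else:
        raise RuntimeError(f"Unknown error type: {error}")

def checkCudaErrors(result):
    # Every binding call returns a tuple whose first element is the status code
    if result[0].value:
        raise RuntimeError(f"CUDA error code={result[0].value}({_cudaGetErrorEnum(result[0])})")
    if len(result) == 1:
        return None
    if len(result) == 2:
        return result[1]
    return result[1:]
```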
@@ -86,7 +86,7 @@ Python that requires some understanding of CUDA C++. For more information, see
[An Even Easier Introduction to
CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/).

-```{code-cell} python
+```python
saxpy = """\
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n)
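The hunk truncates the kernel string after the signature. The omitted body is the standard one-element-per-thread SAXPY computation; a sketch of the complete string under that assumption:

```python
saxpy = """\
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n)
{
 size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
 if (tid < n) {
   out[tid] = a * x[tid] + y[tid];
 }
}
"""
```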
@@ -108,7 +108,7 @@ In the following code example, the Driver API is initialized so that the NVIDIA
driver and GPU are accessible. Next, the GPU is queried for its compute capability. Finally,
the program is compiled to target the local compute capability architecture with FMAD enabled.

-```{code-cell} python
+```python
# Initialize CUDA Driver API
checkCudaErrors(driver.cuInit(0))
@@ -138,7 +138,7 @@ context. CUDA contexts are analogous to host processes for the device. In the
following code example, a handle for compute device 0 is passed to
`cuCtxCreate` to designate that GPU for context creation.

-```{code-cell} python
+```python
# Create context
context = checkCudaErrors(driver.cuCtxCreate(0, cuDevice))
```
@@ -148,7 +148,7 @@ module. A module is analogous to dynamically loaded libraries for the device.
After the PTX is loaded into a module, extract a specific kernel with
`cuModuleGetFunction`. It is not uncommon for multiple kernels to reside in PTX.

-```{code-cell} python
+```python
# Load PTX as module data and retrieve function
ptx = np.char.array(ptx)
# Note: Incompatible --gpu-architecture would be detected here
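The hunk ends before the load itself. A sketch of the two calls the prose describes, assuming the kernel name `saxpy` from the source string above:

```python
# Load the PTX into a module, then look up the kernel by name
module = checkCudaErrors(driver.cuModuleLoadData(ptx.ctypes.data))
kernel = checkCudaErrors(driver.cuModuleGetFunction(module, b"saxpy"))
```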
@@ -161,7 +161,7 @@ application performance, you can input data on the device to eliminate data
transfers. For completeness, this example shows how you would transfer data to
and from the device.

-```{code-cell} python
+```python
NUM_THREADS = 512  # Threads per block
NUM_BLOCKS = 32768  # Blocks per grid
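The host-side setup that follows is cut off by the hunk. A sketch, using the names (`hX`, `hY`, `hOut`, `bufferSize`) that later hunks reference:

```python
a = np.array([2.0], dtype=np.float32)                    # SAXPY scale factor
n = np.array(NUM_THREADS * NUM_BLOCKS, dtype=np.uint32)  # total elements
bufferSize = n * a.itemsize                              # bytes per vector

# Host input/output vectors
hX = np.random.rand(n).astype(dtype=np.float32)
hY = np.random.rand(n).astype(dtype=np.float32)
hOut = np.zeros(n).astype(dtype=np.float32)
```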
@@ -184,7 +184,7 @@ Python doesn’t have a natural concept of pointers, yet `cuMemcpyHtoDAsync` expects
`void*`. Therefore, `XX.ctypes.data` retrieves the pointer value associated with
XX.

-```{code-cell} python
+```python
dXclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
dYclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
dOutclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
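Past the hunk boundary, the same block creates a stream and queues the host-to-device copies. A sketch, assuming the `stream` handle that later hunks synchronize and destroy:

```python
stream = checkCudaErrors(driver.cuStreamCreate(0))

# Queue asynchronous host-to-device transfers on the stream
checkCudaErrors(driver.cuMemcpyHtoDAsync(dXclass, hX.ctypes.data, bufferSize, stream))
checkCudaErrors(driver.cuMemcpyHtoDAsync(dYclass, hY.ctypes.data, bufferSize, stream))
```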
@@ -209,7 +209,7 @@ Like `cuMemcpyHtoDAsync`, `cuLaunchKernel` expects `void**` in the argument list. In
the earlier code example, it creates `void**` by grabbing the `void*` value of each
individual argument and placing them into its own contiguous memory.

-```{code-cell} python
+```python
# The following code example is not intuitive
# Subject to change in a future release
dX = np.array([int(dXclass)], dtype=np.uint64)
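The remaining packing lines fall outside the hunk; the last of them is visible as context in the next hunk. A sketch of the continuation:

```python
dY = np.array([int(dYclass)], dtype=np.uint64)
dOut = np.array([int(dOutclass)], dtype=np.uint64)

# Pack the void* of each argument into contiguous memory (the void**)
args = [a, dX, dY, dOut, n]
args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)
```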
@@ -222,7 +222,7 @@ args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)

Now the kernel can be launched:

-```{code-cell} python
+```python
checkCudaErrors(driver.cuLaunchKernel(
    kernel,
    NUM_BLOCKS,  # grid x dim
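The rest of the launch call is cut off. A sketch of the remaining arguments, following `cuLaunchKernel`'s parameter order (grid dims, block dims, dynamic shared memory, stream, kernel arguments, extra):

```python
    1,  # grid y dim
    1,  # grid z dim
    NUM_THREADS,  # block x dim
    1,  # block y dim
    1,  # block z dim
    0,  # dynamic shared memory
    stream,  # stream
    args.ctypes.data,  # kernel arguments
    0,  # extra (ignore)
))
```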
@@ -251,7 +251,7 @@ stream are serialized. After the call to transfer data back to the host is
executed, `cuStreamSynchronize` is used to halt CPU execution until all operations
in the designated stream are finished.

-```{code-cell} python
+```python
# Assert values are same after running kernel
hZ = a * hX + hY
if not np.allclose(hOut, hZ):
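The transfer-back and synchronize calls the paragraph describes are not shown in this hunk. A sketch of what they look like, using the names established earlier:

```python
# Queue the device-to-host copy, then block until the stream drains
checkCudaErrors(driver.cuMemcpyDtoHAsync(hOut.ctypes.data, dOutclass, bufferSize, stream))
checkCudaErrors(driver.cuStreamSynchronize(stream))
```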
@@ -261,7 +261,7 @@ if not np.allclose(hOut, hZ):
Perform verification of the data to ensure correctness and finish the code with
memory cleanup.

-```{code-cell} python
+```python
checkCudaErrors(driver.cuStreamDestroy(stream))
checkCudaErrors(driver.cuMemFree(dXclass))
checkCudaErrors(driver.cuMemFree(dYclass))
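The teardown continues past the hunk. A sketch of the remaining cleanup, assuming the `module` and `context` handles created earlier:

```python
checkCudaErrors(driver.cuMemFree(dOutclass))
checkCudaErrors(driver.cuModuleUnload(module))
checkCudaErrors(driver.cuCtxDestroy(context))
```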