Numba allows the development of GPU code in Python style. When a Python
script using Numba is executed, the code is compiled just-in-time (JIT)
using the LLVM framework. Using Python for GPU programming can
considerably simplify the development of parallel applications
compared to C and C-CUDA.
Python, however, carries a reputation for low performance,
especially in High-Performance Computing (HPC).
We wanted to find out whether this reputation is deserved and where
any performance differences come from. For this reason, we first
analyzed the performance of typical microbenchmarks used in HPC. By
analyzing the assembly code, we learned a lot about the differences
between the code produced by C-CUDA and Numba-CUDA. Some of these
insights helped us improve the performance of our application, and
also of Numba-CUDA itself. With a few tricks, our Numba codes achieve
very good performance, coming very close to, and sometimes even
exceeding, that of the C-CUDA versions.
We compared the performance of GPU applications written in C-CUDA and
Numba-CUDA. By analyzing the GPU assembly code, we learned about the
reasons for the differences. This helped us to optimize our codes
written in Numba-CUDA, as well as Numba itself.