Improving Python Threading Strategies For AI/ML Workloads
The Python Threading Dilemma
Python excels at AI and machine learning, yet CPython, the language's reference implementation and bytecode interpreter, lacks built-in support for parallel processing and multithreading. The notorious Global Interpreter Lock (GIL) "locks" the CPython interpreter into running on one thread at a time, regardless of context. Libraries such as NumPy, SciPy, and PyTorch work around this by providing multi-core processing through their C-based implementations.
Python deserves a different approach. Imagine the GIL as a thread and vanilla Python as a needle. That needle and thread can produce a garment, and a high-quality one at that, but the same garment could have been made far more cheaply without losing quality. So what if Intel could circumvent that "limiter" by parallelising Python programs with Numba or oneAPI libraries? What if a sewing machine replaced the needle and thread? What if dozens or hundreds of sewing machines produced many shirts very quickly?
The Intel Distribution of Python uses robust modules and tools to take full advantage of Intel architecture instruction sets.
Using oneAPI libraries to reduce Python overheads and accelerate math operations, the Intel distribution gives compute-intensive numerical and scientific Python applications built on NumPy, SciPy, and Numba close to native C++ performance. It helps developers bring efficient multithreading, vectorisation, and memory management to their applications while enabling rapid scaling across clusters. Let's look at Intel's approach to Python parallelism and composability and how it helps speed up AI/ML workloads.
NumPy/SciPy Nested Parallelism
The Python libraries NumPy and SciPy were designed for scientific computing and numerical computation.
One way to enable multithreading and parallelism in Python applications is to expose parallelism at every software level: for example, by parallelising the outermost loops, or by using functional or pipeline parallelism at the application level. This parallelism can be achieved with libraries such as Dask and Joblib, or with the multiprocessing module and its ThreadPool class, as sketched below.
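As a minimal illustration (not code from the original article), the outermost loop of a NumPy workload can be parallelised with the standard library's ThreadPool; the worker count, matrix sizes, and task body here are illustrative:

```python
# Sketch: outermost-loop parallelism over independent NumPy tasks.
# Each task internally calls multi-threaded oneMKL code under an
# MKL-enabled NumPy, so parallelism exists at two levels.
import numpy as np
from multiprocessing.pool import ThreadPool

def task(matrix):
    # An inner, library-parallel operation (threaded by oneMKL).
    return np.linalg.qr(matrix)[0]

if __name__ == "__main__":
    matrices = [np.random.random((512, 512)) for _ in range(8)]
    with ThreadPool(4) as pool:  # outermost, application-level parallelism
        results = pool.map(task, matrices)
    print(len(results), "factorisations computed")
```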
An optimised math library such as the Intel oneAPI Math Kernel Library (oneMKL) helps accelerate Python modules like NumPy and SciPy through data parallelism, which the heavy compute demands of large AI and machine learning datasets require. oneMKL is multi-threaded and supports several threading runtimes; the MKL_THREADING_LAYER environment variable selects the threading layer. Nested parallelism arises when one parallel section calls a function that contains another parallel section. NumPy and SciPy programs commonly suffer from synchronisation latencies and serial regions (parts that cannot run in parallel), and this parallelism-within-parallelism reduces or hides those costs.
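For instance, the layer can be chosen through the environment before NumPy (and with it oneMKL) is loaded; the values below follow oneMKL's documented options, and the workload is illustrative:

```python
# Select oneMKL's threading layer before importing NumPy.
# Documented values include "INTEL", "GNU", "TBB", and "SEQUENTIAL".
import os
os.environ["MKL_THREADING_LAYER"] = "TBB"

import numpy as np  # oneMKL now threads through the TBB layer

a = np.random.random((1024, 1024))
print(np.linalg.norm(a @ a.T))  # a parallel oneMKL-backed operation
```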
Numba
Although NumPy and SciPy offer extensive mathematical and data-focused accelerations, they are well-defined mathematical tools accelerated with C extensions. A developer who needs unorthodox math that runs as fast as those C extensions will find Numba useful. Numba is a Just-In-Time (JIT) compiler built on LLVM that narrows the performance gap between Python and statically typed languages like C and C++. It supports three threading runtimes: Workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB), exposed as three built-in threading layers. Only workqueue is installed by default; additional threading layers are added with conda commands (e.g., $ conda install tbb). The NUMBA_THREADING_LAYER setting selects the threading layer, and there are two ways to choose it: (1) picking a layer that is generally safe under diverse forms of parallel execution, or (2) explicitly naming the desired threading layer (e.g., tbb). For more on Numba threading layers, see the official documentation.
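A small sketch of both selection approaches, assuming the tbb package is installed; the kernel itself is illustrative:

```python
import numpy as np
from numba import njit, prange, config, threading_layer

# Option 2: explicitly name the layer ("tbb", "omp", or "workqueue").
# Option 1 would instead request a generally safe layer: "safe".
config.THREADING_LAYER = "tbb"

@njit(parallel=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):  # Numba parallelises this outer loop
        out[i] = x[i].sum()
    return out

row_sums(np.random.random((1000, 1000)))       # trigger JIT compilation
print("threading layer chosen:", threading_layer())
```

Note that threading_layer() only reports a result after a parallel function has actually executed, which is why the function is called once before printing.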
Threading Composability
The threading composability of an application or software component determines how efficiently it co-exists with other multi-threaded components. A "perfectly composable" component operates efficiently without degrading the efficiency of other components in the system. Achieving a fully composable threading system requires preventing over-subscription, which means ensuring that no parallel piece of code or component demands a specific number of threads (known as "mandatory" parallelism). The alternative is "optional" parallelism, in which a work scheduler decides which user-level threads each component is mapped to and coordinates tasks across components and parallel regions. Since the scheduler arranges the program's components and libraries around a single thread pool, its threading model must be at least as efficient as the high-performance libraries' built-in scheduling; otherwise, efficiency is lost.
Intel's Parallelism and Composability Strategy
Threading composability is easier to achieve with oneTBB as the work scheduler. oneTBB is an open-source, cross-platform C++ library for multi-core parallel processing that supports threading composability, optional parallelism, and nested parallelism. The oneTBB version available at the time of writing includes an experimental module that provides threading composability across multiple libraries, unlocking multi-threaded performance gains in Python. The acceleration comes from the scheduler's more efficient thread allocation. The oneTBB module replaces the standard ThreadPool (Pool) through monkey patching, dynamically replacing the object at runtime so the thread pool is swapped across all modules without any code changes. oneTBB also takes over oneMKL by activating its TBB threading layer, which makes the parallelism in NumPy and SciPy calls automatically composable.
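In practice, no source changes are needed; the script is simply launched under the experimental module from the command line. A hedged sketch with a hypothetical script name (app.py) and illustrative sizes:

```python
# app.py -- hypothetical file name; ordinary, unmodified user code.
import numpy as np
from multiprocessing.pool import ThreadPool

with ThreadPool(8) as pool:
    pool.map(np.linalg.eigvals,
             [np.random.random((256, 256)) for _ in range(8)])

# Run unmodified:                           python app.py
# Run under oneTBB's composable scheduler
# (monkey-patches ThreadPool, switches oneMKL to its TBB layer):
#                                           python -m tbb app.py
# Run under the SMP module instead:         python -m smp app.py
```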
Nested parallelism can improve performance, as the following composability example shows, run on a system with MKL-enabled NumPy, the TBB and symmetric multiprocessing (SMP) modules, and their IPython kernels installed. IPython's command shell enables interactive computing in multiple programming languages. The demo was run in a Jupyter Notebook to compare performance quantitatively.
If the kernel is changed from the Jupyter menu, the cell that constructs the ThreadPool must be run again to produce the runtime results below. With the default Python kernel, the same code runs for all three trials:
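The original notebook cells are not preserved in this post; the sketch below is a plausible reconstruction consistent with the description, i.e. one cell that builds the ThreadPool and one timed eigenvalue cell, with the pool size and matrix dimensions chosen for illustration:

```python
# Cell 1: construct the ThreadPool (re-run this after switching kernels).
import numpy as np
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)

# Cell 2: time a nested-parallel eigenvalue workload with IPython's magic.
%timeit pool.map(np.linalg.eig, [np.random.random((256, 256)) for _ in range(10)])
```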
This code computes matrix eigenvalues under the default Python kernel. Switching to the SMP kernel (python -m smp) improves the runtime by roughly an order of magnitude, and the TBB kernel (python -m tbb) boosts it even further.
In this composability example, oneTBB's dynamic task scheduler performs best, because it handles code in which the innermost parallel regions cannot fully utilise the system's CPU and where workloads may vary. SMP is still useful, but it works best when workloads are evenly divided and the outermost workers carry similar loads.
Conclusion
In conclusion, multithreading speeds up AI/ML workflows, and Python AI and machine learning applications can be optimised in several ways. Multithreading and multiprocessing will remain crucial to pushing AI/ML software development forward. See Intel's AI Tools and Framework optimisations and the unified, open, standards-based oneAPI programming model that underpins Intel's AI Software Portfolio.