#SYCLkernel
Explore tagged Tumblr posts
govindhtech · 7 months ago
Text
SYCL 2020’s Five New Features For Modern C++ Programmers
Tumblr media
SYCL
For accelerator-using C++ programmers, SYCL 2020 is interesting. People enjoyed contributing to the SYCL standard, a book, and the DPC++ open source effort to integrate SYCL into LLVM. The SYCL 2020 standard included some of the favorite new features. These are Intel engineers’ views, not Khronos’.
Khronos allows heterogeneous C++ programming with SYCL. After SYCL 2020 was finalized in late 2020, compiler support increased.
SYCL is argued in several places, including Considering a Heterogeneous Future for C++ and other materials on sycl.tech. How will can allow heterogeneous C++ programming with portability across vendors and architectures? SYCL answers that question.
SYCL 2020 offers interesting new capabilities to be firmly multivendor and multiarchitecture with to community involvement.
The Best Five
A fundamental purpose of SYCL 2020 is to harmonize with ISO C++, which offers two advantages. First, it makes SYCL natural for C++ programmers. Second, it lets SYCL test multivendor, multiarchitecture heterogeneous programming solutions that may influence other C++ libraries and ISO C++.
Changing the base language from C++11 to C++17 allows developers to use class template argument deduction (CTAD) and deduction guides, which necessitated several syntactic changes in SYCL 2020.
Backends allow SYCL to target more hardware by supporting languages/frameworks other than OpenCL.
USM is a pointer-based alternative to SYCL 1.2.1’s buffer/accessor concept.
A “built-in” library in SYCL 2020 accelerates reductions, a frequent programming style.
The group library abstracts cooperative work items, improving application speed and programmer efficiency by aligning with hardware capabilities (independent of vendor).
Atomic references aligned with C++20 std::atomic_ref expand heterogeneous device memory models.
These enhancements make the SYCL ecosystem open, multivendor, and multiarchitecture, allowing C++ writers to fully leverage heterogeneous computing today and in the future.
Backends
With backends, SYCL 2020 allows implementations in languages/frameworks other than OpenCL. Thus, the namespace has been reduced to sycl::, and the SYCL header file has been relocated from to .
These modifications affect SYCL deeply. Although implementations are still free to build atop OpenCL (and many do), generic backends have made SYCL a programming approach that can target more diverse APIs and hardware. SYCL can now “glue” C++ programs to vendor-specific libraries, enabling developers to target several platforms without changing their code.
SYCL 2020 has true openness, cross-architecture, and cross-vendor.
This flexibility allows the open-source DPC++ compiler effort to support NVIDIA, AMD, and Intel GPUs by implementing SYCL 2020 in LLVM (clang). SYCL 2020 has true openness, cross-architecture, and cross-vendor.
Unified shared memory
Some devices provide CPU-host memory unified views. This unified shared memory (USM) from SYCL 2020 allows a pointer-based access paradigm instead of the buffer/accessor model from SYCL 1.2.1.
Programming with USM provides two benefits. First, USM provides a single address space across host and device; pointers to USM allocations are consistent and may be provided to kernels as arguments. Porting pointer-based C++ and CUDA programs to SYCL is substantially simplified. Second, USM allows shared allocations to migrate seamlessly between devices, enhancing programmer efficiency and compatibility with C++ containers (e.g., std::vector) and algorithms.
Three USM allocations provide programmers as much or as little data movement control as they want. Device allocations allow programmers full control over application data migration. Host allocations are beneficial when data is seldom utilized and transporting it is not worth the expense or when data exceeds device capacity. Shared allocations are a good compromise that immediately migrate to use, improving performance and efficiency.
Reductions
Other C++ reduction solutions, such as P0075 and the Kokkos and RAJA libraries, influenced SYCL 2020.
The reducer class and reduction function simplify SYCL kernel variable expression using reduction semantics. It also lets implementations use compile-time reduction method specialization for good performance on various manufacturers’ devices.
The famous BabelStream benchmark, published by the University of Bristol, shows how SYCL 2020 reductions increase performance. BabelStream’s basic dot product kernel computes a floating-point total of all kernel work items. The 43-line SYCL 1.2.1 version employs a tree reduction in work-group local memory and asks the user to choose the optimal device work-group size. SYCL 2020 is shorter (20 lines) and more performance portable by leaving algorithm and work-group size to implementation.
Group Library
The work-group abstraction from SYCL 1.2.1 is expanded by a sub-group abstraction and a library of group-based algorithms in SYCL 2020.
Sub_group describes a kernel’s cooperative work pieces running “together,” providing a portable abstraction for various hardware providers. Sub-groups in the DPC++ compiler always map to a key hardware concept SIMD vectorization on Intel architectures, “warps” on NVIDIA architectures, and “wavefronts” on AMD architectures enabling low-level performance optimization for SYCL applications.
In another tight agreement with ISO C++, SYCL 2020 includes group-based algorithms based on C++17: all_of, any_of, none_of, reduce, exclusive_scan, and inclusive_scan. SYCL implementations may use work-group and/or sub-group parallelism to produce finely tailored, cooperative versions of these functions since each algorithm is supported at various scopes.
Atomic references
Atomics improved in C++20 with the ability to encapsulate types in atomic references (std::atomic_ref). This design (sycl::atomic_ref) is extended to enable address spaces and memory scopes in SYCL 2020, creating an atomic reference implementation ready for heterogeneous computing.
SYCL follows ISO C++, and memory scopes were necessary for portable programming without losing speed. Don’t disregard heterogeneous systems’ complicated memory topologies.
Memory models and atomics are complicated, hence SYCL does not need all devices to use the entire C++ memory model to support as many devices as feasible. SYCL offers a wide range of device capabilities, another example of being accessible to all vendors.
Beyond SYCL 2020: Vendor Extensions
SYCL 2020’s capability for multiple backends and hardware has spurred vendor extensions. These extensions allow innovation that provides practical solutions for devices that require it and guides future SYCL standards. Extensions are crucial to standardization, and the DPC++ compiler project’s extensions inspired various elements in this article.
Two new DPC++ compiler features are SYCL 2020 vendor extensions.
Group-local Memory at Kernel Scope
Local accessors in SYCL 1.2.1 allow for group-local memory, which must be specified outside of the kernel and sent as a kernel parameter. This might seem weird for programmers from OpenCL or CUDA, thus has created an extension to specify group-local memory in a kernel function. This improvement makes kernels more self-contained and informs compiler optimizations (where local memory is known at compile-time).
FPGA-Specific Extensions
The DPC++ compiler project supports Intel FPGAs. It seems that the modifications, or something similar, can work with any FPGA suppliers. FPGAs are a significant accelerator sector, and nous believe it pioneering work will shape future SYCL standards along with other vendor extension initiatives.
Have introduced FPGA choices to make buying FPGA hardware or emulation devices easier. The latter allows quick prototyping, which FPGA software writers must consider. FPGA LSU controls allow us to tune load/store operations and request a specific global memory access configuration. Also implemented data placement controls for external memory banks (e.g., DDR channel) to tune FPGA designs via FPGA memory channel. FPGA registers allow major tuning controls for FPGA high-performance pipelining.
Summary
Heterogeneity endures. Many new hardware alternatives focus on performance and performance-per-watt. This trend will need open, multivendor, multiarchitecture programming paradigms like SYCL.
The five new SYCL 2020 features assist provide portability and performance mobility. C++ programmers may maximize heterogeneous computing with SYCL 2020.
Read more on Govindhtech.com
0 notes
govindhtech · 7 months ago
Text
Intel’s oneAPI 2024 Kernel_Compiler Feature Improves LLVM
Tumblr media
Kernel_Compiler
The kernel_compiler, which was first released as an experimental feature in the fully SYCL2020 compliant Intel oneAPI DPC++/C++ compiler 2024.1 is one of the new features. Here’s another illustration of how Intel advances the development of LLVM and SYCL standards. With the help of this extension, OpenCL C strings can be compiled at runtime into kernels that can be used on a device.
For offloading target hardware-specific SYCL kernels, it is provided in addition to the more popular modes of Ahead-of-Time (AOT), SYCL runtime, and directed runtime compilation.
Generally speaking, the kernel_compiler extension ought to be saved for last!
Nonetheless, there might be some very intriguing justifications for leveraging this new extension to create SYCL Kernels from OpenCL C or SPIR-V code stubs.
Let’s take a brief overview of the many late- and early-compile choices that SYCL offers before getting into the specifics and explaining why there are typically though not always better techniques.
Three Different Types of Compilation 
The ability to offload computational work to kernels running on another compute device that may be installed on the machine, such as a GPU or an FPGA, is what SYCL offers your application. Are there thousands of numbers you need to figure out? Forward it to the GPU!
Power and performance are made possible by this, but it also raises more questions:
Which device are you planning to target? In the future, will that change?
Could it be more efficient if it were customized to parameters that only the running program would know, or do you know the complete domain parameter value for that kernel execution? SYCL offers a number of choices to answer those queries:
Ahead-of-Time (AoT) Compile: This process involves compiling your kernels to machine code concurrently with the compilation of your application.
SYCL Runtime Compilation: This method compiles the kernel while your application is executing and it is being used.
With directed runtime compilation, you can set up your application to generate a kernel whenever you’d want.
Let’s examine each one of these:
1. Ahead of Time (AoT) Compile
You can also precompile the kernels at the same time as you compile your application. All you have to do is specify which devices you would like the kernels to be compiled for. All you need to do is pass them to the compiler with the -fsycl-targets flag. Completed! Now that the kernels have been compiled, your application will use those binaries.
AoT compilation has the advantage of being easy to grasp and familiar to C++ programmers. Furthermore, it is the only choice for certain devices such as FPGAs and some GPUs.
An additional benefit is that your kernel can be loaded, given to the device, and executed without the runtime stopping to compile it or halt it.
Although they are not covered in this blog post, there are many more choices available to you for controlling AoT compilation. For additional information, see this section on compiler and runtime design or the -fsycl-targets article in Intel’s GitHub LLVM User Manual.
SPIR-V
2. SYCL Runtime Compilation (via SPIR-V) 
If no target devices are supplied or perhaps if an application with precompiled kernels is executed on a machine with target devices that differ from what was requested, this is SYCL default mode.
SYCL automatically compiles your kernel C++ code to SPIR-V (Standard Portable Intermediate form), an intermediate form. When the SPIR-V kernel is initially required, it is first saved within your program and then sent to the driver of the target device that is encountered. The SPIR-V kernel is then converted to machine code for that device by the device driver.
The default runtime compilation has the following two main benefits:
First of all, you don’t have to worry about the precise target device that your kernel will operate on beforehand. It will run as long as there is one.
Second, if a GPU driver has been updated to improve performance, your application will benefit from it when your kernel runs on that GPU using the new driver, saving you the trouble of recompiling it.
However, keep in mind that there can be a minor cost in contrast to AoT because your application will need to compile from SPIR-V to machine code when it first delivers the kernel to the device. However, this usually takes place outside of the key performance route, before parallel_for loops the kernel.
In actuality, this compilation time is minimal, and runtime compilation offers more flexibility than the alternative. SYCL may also cache compiled kernels in between app calls, which further eliminates any expenses. See kernel programming cache and environment variables for additional information on caching.
However, if you prefer the flexibility of runtime compilation but dislike the default SYCL behavior, continue reading!
3. Directed Runtime Compilation (via kernel_bundles) 
You may access and manage the kernels that are bundled with your application using the kernel_bundle class in SYCL, which is a programmatic interface.
Here, the kernel_bundle techniques are noteworthy.build(), compile(), and link(). Without having to wait until the kernel is required, these let you, the app author, decide precisely when and how a kernel might be constructed.
Additional details regarding kernel_bundles are provided in the SYCL 2020 specification and in a controlling compilation example.
Specialization Constants 
Assume for the moment that you are creating a kernel that manipulates an input image’s numerous pixels. Your kernel must use a replacement to replace the pixels that match a specific key color. You are aware that if the key color and replacement color were constants instead of parameter variables, the kernel might operate more quickly. However, there is no way to know what those color values might be when you are creating your program. Perhaps they rely on calculations or user input.
Specialization constants are relevant in this situation.
The name refers to the constants in your kernel that you will specialize in at runtime prior to the kernel being compiled at runtime. Your application can set the key and replacement colors using specialization constants, which the device driver subsequently compiles as constants into the kernel’s code. There are significant performance benefits for kernels that can take advantage of this.
The Last Resort – the kernel_compiler 
All of the choices that as a discussed thus far work well together. However, you can choose from a very wide range of settings, including directed compilation, caching, specialization constants, AoT compilation, and the usual SYCL compile-at-runtime behavior.
Using specialization constants to make your program performant or having it choose a specific kernel at runtime are simple processes. However, that might not be sufficient. Perhaps all your software needs to do is create a kernel from scratch.
Here is some source code to help illustrate this. Intel made an effort to compose it in a way that makes sense from top to bottom.
When is It Beneficial to Use kernel_compiler? 
Some SYCL users already have extensive kernel libraries in SPIR-V or OpenCL C. For those, the kernel_compiler is a very helpful extension that enables them to use those libraries rather than a last-resort tool.
Download the Compiler 
Download the most recent version of the Intel oneAPI DPC++/C++ Compiler, which incorporates the kernel_compiler experimental functionality, if you haven’t already. Purchase it separately for Windows or Linux, via well-known package managers only for Linux, or as a component of the Intel oneAPI Base Toolkit 2024.
Read more on Govindhtech.com
1 note · View note