#InteloneAPI
Intel’s oneAPI 2024 Kernel_Compiler Feature Improves LLVM
Kernel_Compiler
The kernel_compiler, first released as an experimental feature in the fully SYCL 2020 compliant Intel oneAPI DPC++/C++ Compiler 2024.1, is one of these new features. It is another illustration of how Intel advances the development of LLVM and the SYCL standard. With this extension, OpenCL C strings can be compiled at runtime into kernels that can be executed on a device.
It is provided in addition to the more common modes of offloading hardware-specific SYCL kernels: Ahead-of-Time (AoT), SYCL runtime, and directed runtime compilation.
Generally speaking, the kernel_compiler extension ought to be saved for last!
Nonetheless, there might be some very intriguing justifications for leveraging this new extension to create SYCL Kernels from OpenCL C or SPIR-V code stubs.
Before getting into the specifics, let’s take a brief look at the late- and early-compile options that SYCL offers, and why there are typically, though not always, better techniques.
Three Different Types of Compilation
SYCL gives your application the ability to offload computational work to kernels running on another compute device installed in the machine, such as a GPU or an FPGA. Have thousands of numbers to crunch? Send them to the GPU!
This enables power and performance, but it also raises questions:
Which device are you planning to target? In the future, will that change?
Do you know the complete domain of parameter values for that kernel at build time, or could it be more efficient if it were customized to parameters that only the running program knows? SYCL offers a number of choices to answer those questions:
Ahead-of-Time (AoT) Compile: This process involves compiling your kernels to machine code concurrently with the compilation of your application.
SYCL Runtime Compilation: This method compiles the kernel while your application is running, when the kernel is first needed.
Directed Runtime Compilation: This lets you set up your application to compile a kernel whenever you want.
Let’s examine each one of these:
1. Ahead of Time (AoT) Compile
You can precompile your kernels at the same time as you compile your application. Simply specify which devices the kernels should be compiled for by passing them to the compiler with the -fsycl-targets flag. Done! The kernels are compiled to machine code, and your application will use those binaries.
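To make this concrete, here is a minimal sketch, not taken from the original article, of a trivial vector-add program together with the kind of command line that AoT-compiles it. The target and device names in the comment are placeholders; consult the -fsycl-targets documentation referenced below for the values that match your hardware.

// Hypothetical AoT build commands (exact target/device names vary; see the
// -fsycl-targets documentation):
//   icpx -fsycl -fsycl-targets=spir64_gen -Xs "-device <gpu-arch>" vadd.cpp -o vadd
//   icpx -fsycl -fsycl-targets=spir64_x86_64 vadd.cpp -o vadd   (CPU AoT)
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
  sycl::queue q;
  {
    sycl::buffer<float> A{a}, B{b}, C{c};
    q.submit([&](sycl::handler &h) {
      sycl::accessor pa{A, h, sycl::read_only};
      sycl::accessor pb{B, h, sycl::read_only};
      sycl::accessor pc{C, h, sycl::write_only};
      // With -fsycl-targets, this kernel is compiled to device machine code
      // at application build time instead of at first use.
      h.parallel_for(sycl::range<1>{N},
                     [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
    });
  } // buffer destruction copies results back into the vectors
  return c[0] == 3.0f ? 0 : 1;
}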
AoT compilation has the advantage of being easy to grasp and familiar to C++ programmers. Furthermore, it is the only choice for certain devices such as FPGAs and some GPUs.
An additional benefit is that your kernel can be loaded, handed to the device, and executed without the runtime pausing to compile it first.
Although they are not covered in this blog post, there are many more choices available to you for controlling AoT compilation. For additional information, see this section on compiler and runtime design or the -fsycl-targets article in Intel’s GitHub LLVM User Manual.
2. SYCL Runtime Compilation (via SPIR-V)
This is SYCL’s default mode: it is used when no target devices are specified, or when an application with precompiled kernels runs on a machine whose devices differ from those that were requested.
SYCL automatically compiles your kernel C++ code to SPIR-V (Standard Portable Intermediate Representation), an intermediate representation. The SPIR-V kernel is stored inside your program and, when it is first needed, handed to the driver of whatever target device is encountered. The device driver then converts the SPIR-V kernel to machine code for that device.
The default runtime compilation has the following two main benefits:
First of all, you don’t have to worry about the precise target device that your kernel will operate on beforehand. It will run as long as there is one.
Second, if a GPU driver has been updated to improve performance, your application will benefit from it when your kernel runs on that GPU using the new driver, saving you the trouble of recompiling it.
Keep in mind that there can be a minor cost compared to AoT, because your application has to compile from SPIR-V to machine code when it first delivers the kernel to the device. However, this usually happens outside the performance-critical path, before the kernel is looped over in a parallel_for.
In actuality, this compilation time is minimal, and runtime compilation offers more flexibility than the alternative. SYCL may also cache compiled kernels in between app calls, which further eliminates any expenses. See kernel programming cache and environment variables for additional information on caching.
However, if you prefer the flexibility of runtime compilation but dislike the default SYCL behavior, continue reading!
3. Directed Runtime Compilation (via kernel_bundles)
You may access and manage the kernels that are bundled with your application using the kernel_bundle class in SYCL, which is a programmatic interface.
Here, the kernel_bundle methods build(), compile(), and link() are noteworthy. These let you, the application author, decide precisely when and how a kernel is compiled, without waiting until the kernel is needed.
Additional details regarding kernel_bundles are provided in the SYCL 2020 specification and in a controlling compilation example.
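As a rough illustration, assembled from the SYCL 2020 kernel_bundle API rather than taken from the original article, the sketch below compiles and links an application’s kernels up front and then submits one of them with the prebuilt bundle:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  sycl::context ctx = q.get_context();
  constexpr size_t N = 256;
  int *data = sycl::malloc_shared<int>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = 0;

  // Grab every kernel embedded in this application, in its "input" state.
  auto input_bundle = sycl::get_kernel_bundle<sycl::bundle_state::input>(ctx);

  // Decide when compilation happens: here, at startup rather than at first use.
  auto object_bundle = sycl::compile(input_bundle);
  auto exec_bundle   = sycl::link(object_bundle);
  // sycl::build(input_bundle) would do both steps in a single call.

  q.submit([&](sycl::handler &h) {
    h.use_kernel_bundle(exec_bundle);   // run with the already-built kernels
    h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      size_t idx = i[0];
      data[idx] = static_cast<int>(idx);
    });
  }).wait();

  sycl::free(data, q);
}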
Specialization Constants
Assume for the moment that you are writing a kernel that manipulates the many pixels of an input image. The kernel must replace every pixel that matches a specific key color with a replacement color. You know that the kernel might run faster if the key and replacement colors were constants rather than parameter variables, but there is no way to know those color values when you are writing your program. Perhaps they depend on calculations or user input.
Specialization constants are relevant in this situation.
The name refers to constants in your kernel that you specialize at runtime, just before the kernel itself is compiled at runtime. Your application can set the key and replacement colors as specialization constants, which the device driver then compiles into the kernel’s code as true constants. Kernels that can take advantage of this see significant performance benefits.
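Here is a minimal sketch of that pixel-replacement idea using the standard SYCL 2020 specialization-constant API; it is illustrative and not taken from the original post:

#include <sycl/sycl.hpp>
#include <cstdint>

constexpr sycl::specialization_id<uint32_t> key_color;
constexpr sycl::specialization_id<uint32_t> new_color;

int main() {
  sycl::queue q;
  constexpr size_t n = 1024;
  uint32_t *pixels = sycl::malloc_shared<uint32_t>(n, q);
  for (size_t i = 0; i < n; ++i) pixels[i] = (i % 2) ? 0x00FF00u : 0x123456u;

  // Known only at runtime (user input, earlier calculations, etc.).
  uint32_t key = 0x00FF00u, replacement = 0x0000FFu;

  q.submit([&](sycl::handler &h) {
    // These values are baked into the kernel as constants when it is
    // JIT-compiled from SPIR-V for the target device.
    h.set_specialization_constant<key_color>(key);
    h.set_specialization_constant<new_color>(replacement);
    h.parallel_for(sycl::range<1>{n},
                   [=](sycl::item<1> it, sycl::kernel_handler kh) {
      size_t i = it.get_linear_id();
      uint32_t k = kh.get_specialization_constant<key_color>();
      uint32_t r = kh.get_specialization_constant<new_color>();
      if (pixels[i] == k) pixels[i] = r;
    });
  }).wait();

  sycl::free(pixels, q);
}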
The Last Resort – the kernel_compiler
All of the options discussed so far work well together, and they give you a very wide range of configurations: directed compilation, caching, specialization constants, AoT compilation, and the usual SYCL compile-at-runtime behavior.
Using specialization constants to make your program performant, or having it choose a specific kernel at runtime, is straightforward. However, that might not be sufficient. Perhaps your software really does need to create a kernel from scratch at runtime.
Here is some source code to help illustrate this; Intel composed the original example so that it reads sensibly from top to bottom.
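The full listing from the original post is not reproduced here. The condensed sketch below shows the shape of the flow using the experimental sycl_ext_oneapi_kernel_compiler extension; because the feature is experimental, the namespaces and entry points used here (create_kernel_bundle_from_source, source_language::opencl, ext_oneapi_get_kernel) may differ between compiler versions, so treat this as an approximation rather than a definitive implementation.

#include <sycl/sycl.hpp>
#include <string>
namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;

  // OpenCL C source held as an ordinary string, known only at runtime.
  std::string source = R"CL(
    __kernel void add_one(__global int *data) {
      size_t i = get_global_id(0);
      data[i] = data[i] + 1;
    }
  )CL";

  // Compile the string into an executable kernel bundle at runtime.
  auto src_bundle = syclex::create_kernel_bundle_from_source(
      q.get_context(), syclex::source_language::opencl, source);
  auto exe_bundle = syclex::build(src_bundle);
  sycl::kernel add_one = exe_bundle.ext_oneapi_get_kernel("add_one");

  // Launch it like any other kernel, passing arguments by position.
  constexpr size_t N = 64;
  int *data = sycl::malloc_shared<int>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = static_cast<int>(i);

  q.submit([&](sycl::handler &h) {
    h.set_args(data);
    h.parallel_for(sycl::range<1>{N}, add_one);
  }).wait();

  sycl::free(data, q);
}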
When is It Beneficial to Use kernel_compiler?
Some SYCL users already have extensive kernel libraries in SPIR-V or OpenCL C. For them, the kernel_compiler is not a last-resort tool but a very helpful extension that lets them keep using those libraries.
Download the Compiler
If you haven’t already, download the most recent version of the Intel oneAPI DPC++/C++ Compiler, which includes the experimental kernel_compiler functionality. Get it standalone for Windows or Linux, via well-known package managers (Linux only), or as a component of the Intel oneAPI Base Toolkit 2024.
Read more on Govindhtech.com
The powerful cloud: Intel® DevCloud with the Iris Xe Max GPU
In this article we will see how to use the computational power of the latest generations of Intel hardware on the DevCloud, free of charge but for a limited time. Besides latest-generation processors, we can test Intel’s new, first dedicated GPU, the Iris Xe MAX.
Introduction to DevCloud:
DevCloud is a distributed computing cloud based on the PBS project developed at NASA in 1991, whose main function was to manage batch jobs, much like a task scheduler, along with node and resource management. This Intel cloud provides several latest-generation hardware architectures, such as CPUs, FPGAs, and GPUs.
This makes it possible to use the Iris Xe Max, Intel’s first dedicated GPU. Its distinctive performance comes from PCI Express 4.0 support and full integration with Intel Deep Link technology, whose main function is essentially to combine CPU and GPU resources to optimize the overall performance of the machine. Note that the Iris Xe Max works together with 11th-generation Intel Core processors.
Before we start, we must register on the Intel cloud. Registration is free but limited; to extend the limit, your development project must be submitted to increase the trial period. Click the link https://software.intel.com/content/www/us/en/develop/tools/devcloud.html , select the Intel® DevCloud for oneAPI option, and register by filling in the requested information.
More information on configuration and access from Linux via ssh is available at this URL: https://devcloud.intel.com/oneapi/documentation/connect-with-ssh-linux-macos/
Concepts
The Intel cloud works with distributed processing, so to understand how it operates we will first create a script named hello-world-example with the content below:
$ tee hello-world-example <<'EOF'
cd $PBS_O_WORKDIR
echo "* Hello world from compute server `hostname`!"
echo "* The current directory is ${PWD}."
echo "* Compute server's CPU model and number of logical CPUs:"
lscpu | grep 'Model name\|^CPU(s)'
echo "* Python available to us:"
which python
python --version
echo "* The job can create files, and they will be visible back in the Notebook." > newfile.txt
sleep 10
echo "*Bye"
EOF
Now, with the script created, we will submit the job using the qsub command. The -l parameter requests the desired hardware, where nodes = NUMBER OF NODES, gpu = GRAPHICS PROCESSOR, and ppn = the number of processors. The -d . parameter indicates the working path (current location), and last comes the name of the script. See the following example:
$ qsub -l nodes=1:gpu:ppn=2 -d . hello-world-example
911788.v-qsvr-1.aidevcloud
If everything worked correctly, after a few seconds we will see two output files stored on disk representing the result of the processing: hello-world-example.eXXXXXX and hello-world-example.oXXXXXX. The .eXXXXXX file contains the script’s errors (if any), while the .oXXXXXX file (here .o911788) contains the standard output of the submitted script. Below is an example of its content:
* Hello world from compute server s001-n140!
* The current directory is /home/u45169.
* Compute server's CPU model and number of logical CPUs:
CPU(s):       12
Model name:   Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
* Python available to us:
/opt/intel/inteloneapi/intelpython/latest/bin/python
Python 3.7.9 :: Intel Corporation
*Bye
Next is a summary of the previous syntax, with some items added to make better use of the files for distributed processing using the PBS (Portable Batch System) format. We will use the previous script as a starting point.
$ tee hello-world-example-2 <<'EOF'
#!/bin/bash
# Job name:
#PBS -N My-Job-InDevCloud
# Wall time of 1 hour:
#PBS -l walltime=1:00:00
# Error file name:
#PBS -e My-Job-with-Error.err
# Request 1 node and 2 processors:
#PBS -l nodes=1:ppn=2
# Email notification
#PBS -M [email protected]
cd $PBS_O_WORKDIR
echo "* Hello world from compute server `hostname`!"
echo "* The current directory is ${PWD}."
echo "* Compute server's CPU model and number of logical CPUs:"
lscpu | grep 'Model name\|^CPU(s)'
echo "* Python available to us:"
which python
python --version
echo "* The job can create files, and they will be visible back in the Notebook." > newfile.txt
sleep 10
echo "*Bye"
EOF
After these changes we can submit the job for execution again:
$ qsub -l nodes=1:gpu:ppn=2 -d . hello-world-example-2
We can also pass all the parameters directly on the command line:
$ qsub -l nodes=1:gpu:ppn=2 -l walltime=1:00:00 -M [email protected] -d . hello-world-example-2
With the command below we can check all the compute nodes available in the Intel cloud:
$ pbsnodes -a
s012-n001
    state = job-exclusive
    power_state = Running
    np = 2
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    ntype = cluster
    jobs = 0-1/911898.v-qsvr-1.aidevcloud
    status = rectime=1624947718,macaddr=d4:5d:64:08:e0:1b,cpuclock=Fixed,varattr=,jobs=911898.v-qsvr-1.aidevcloud(cput=114,energy_used=0,mem=382320kb,vmem=34364495240kb,walltime=626,Error_Path=/dev/pts/0,Output_Path=/dev/pts/0,session_id=2291524),state=free,netload=881915012074,gres=,loadave=2.00,ncpus=24,physmem=32558924kb,availmem=33789804kb,totmem=34656072kb,idletime=1003560,nusers=4,nsessions=4,sessions=525427 11938 32 1193846 2291524,uname=Linux s012-n001 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
To check how many nodes have the Iris Xe MAX GPU, just run the following command:
$ pbsnodes | sort | grep properties | grep iris_xe_max
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,dual_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
To check the characteristics of the nodes present in the system, just use the following command:
$ pbsnodes | sort | grep properties
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = core,cfl,i9-10920x,ram32gb,net1gbe,gpu,iris_xe_max,quad_gpu
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,cfl,e-2176g,ram64gb,net1gbe,gpu,gen9
    properties = xeon,skl,ram384gb,net1gbe,renderkit
    properties = xeon,skl,ram384gb,net1gbe,renderkit
    properties = xeon,skl,ram384gb,net1gbe,renderkit
    properties = xeon,skl,ram384gb,net1gbe,renderkit
The properties describe the various resources available on the compute nodes, such as: CPU type and name, accelerator model and name, available DRAM, interconnect type, number of available accelerator devices and their type, and intended or recommended usage.
Some of the properties for the device classes:
core
fpga
gpu
xeon
Device properties by name:
arria10
e-2176g
gen9
gold6128
i9-10920x
iris_xe_max
plat8153
Device quantity:
dual_gpu
quad_gpu
Intended use:
batch
fpga_compile
fpga_runtime
jupyter
renderkit
Hands-on with the Iris Xe Max GPU
Now connect to the DevCloud via ssh using the command below, with your account properly configured. If everything is working correctly we will see the screen below:
$ ssh devcloud
###############################################################################
#
# Welcome to the Intel DevCloud for oneAPI Projects!
#
# 1) See https://ift.tt/2LUWKHK for instructions and rules for
#    the OneAPI Instance.
#
# 2) See https://ift.tt/3dsMD83 for instructions and rules for
#    the FPGA Instance.
#
# Note: Your invitation email sent to you contains the authentication URL.
#
# If you have any questions regarding the cloud usage, post them at
# https://ift.tt/3dqCokx
#
# Intel DevCloud Team
#
###############################################################################
#
# Note: Cryptocurrency mining on the Intel DevCloud is forbidden.
# Mining will lead to immediate termination of your account.
#
###############################################################################
Last login: Mon Jun 28 22:51:06 2021 from 10.9.0.249
u99999@login-2:~$
Now we will create the file ola_Iris_XE_Max.sh with the following content:
$ tee ola_Iris_XE_Max.sh <<'EOF'
#!/bin/bash
wget https://ift.tt/35XzAHA
tar -zxvf cmake-gpu.tar.gz
mkdir -p cmake-gpu/build
cd cmake-gpu/build
cmake ..
make run
EOF
This script downloads the sample source code, which uses a for loop to count up to 15 on the GPU, unpacks the .tar.gz file, creates the build folder, compiles, and runs.
To test it, run the following command to submit the script for processing:
$ qsub -l nodes=1:iris_xe_max:ppn=2 -d . ola_Iris_XE_Max.sh
911915.v-qsvr-1.aidevcloud
After a few seconds, type ls and check the content of the output file with the cat command. We will see the following result:
$ cat ola_Iris_XE_Max.sh.o911915
########################################################################
# Date:      Mon 28 Jun 2021 11:45:08 PM PDT
# Job ID:    911915.v-qsvr-1.aidevcloud
# User:      u68892
# Resources: neednodes=1:iris_xe_max:ppn=2,nodes=1:iris_xe_max:ppn=2,walltime=06:00:00
########################################################################
cmake-gpu/CMakeLists.txt
cmake-gpu/License.txt
cmake-gpu/README.md
cmake-gpu/sample.json
cmake-gpu/src/
cmake-gpu/src/CMakeLists.txt
cmake-gpu/src/main.cpp
cmake-gpu/third-party-programs.txt
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is Clang 12.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2021.2/inteloneapi/compiler/2021.2.0/linux/bin/dpcpp
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2021.2/inteloneapi/compiler/2021.2.0/linux/bin/dpcpp -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/u47345/cmake-gpu/build
Scanning dependencies of target cmake-gpu
[ 50%] Building CXX object src/CMakeFiles/cmake-gpu.dir/main.cpp.o
[100%] Linking CXX executable ../cmake-gpu
[100%] Built target cmake-gpu
Scanning dependencies of target build
[100%] Built target build
Scanning dependencies of target run
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[100%] Built target run
########################################################################
# End of output for job 911915.v-qsvr-1.aidevcloud
# Date: Mon 28 Jun 2021 11:45:25 PM PDT
########################################################################
More information is available at the official link: Intel® DevCloud https://software.intel.com/content/www/us/en/develop/tools/devcloud.html, or contact me directly at [email protected]. “Humanity’s next great evolutionary leap will be the discovery that cooperating is better than competing… for collaboration attracts friends, while competition attracts enemies!”
The post The powerful cloud: Intel® DevCloud with the Iris Xe Max GPU appeared first on SempreUpdate.
source https://sempreupdate.com.br/a-poderosa-nuvem-intel-devcloud-com-gpu-iris-xe-max/
Intel Distribution For Python To Create A Genetic Algorithm
Python Genetic Algorithm
Genetic algorithms (GAs) simulate natural selection to solve constrained and unconstrained optimization problems. Traditional methods need considerable time and resources to address NP-hard optimization problems; GAs can often reach good solutions far more cheaply. GAs are modeled on chromosomal behavior and biological evolution.
This article provides a code example of how to use numba-dpex for Intel Distribution for Python to create a generic GA and offload a calculation to a GPU.
Genetic Algorithms (GA)
Activities inside GAs
Selection, crossover, and mutation are three crucial biology-inspired operations that can be combined to produce high-quality results with GAs. Before applying a GA to a particular problem, it is critical to specify the chromosomal representation and the GA operations.
Selection
This is the procedure for choosing partners and recombining them to produce children. Because good parents drive their offspring toward better and more suitable solutions, parent selection is critical to the convergence rate of a GA.
An illustration of the selection procedure whereby the following generation’s chromosomes are reduced by half.
The selection step usually requires additional algorithms that decide which chromosomes will become parents.
Crossover
This is the analogue of biological crossover: more than one parent is chosen, and the parents’ genetic material is used to produce one or more children.
A crossover operation in action.
The crossover procedure produces child genomes from the selected parent chromosomes. In a one-point crossover, a single child genome is produced, and the first and second parents each contribute half of their DNA.
Mutation
A small, random modification to a chromosome can yield a novel solution. Mutation is usually applied with low probability and is used to preserve and add diversity in the genetic population.
A mutation procedure involving a single chromosomal value change.
The mutation procedure may alter a chromosome.
Enhance Genetic Algorithms for Python Using Intel Distribution
With libraries like Intel oneAPI Data Analytics Library (oneDAL) and Intel oneAPI Math Kernel Library (oneMKL), developers may use Intel Distribution for Python to obtain near-native code performance. With improved NumPy, SciPy, and Numba, researchers and developers can expand compute-intensive Python applications from laptops to powerful servers.
Use the Data Parallel Extension for Numba (numba-dpex) range kernel to optimize the genetic algorithm with the Intel Distribution for Python. Each work item in a range kernel represents a logical thread of execution; it is the most basic form of data parallelism across a group of work items.
In the accompanying code sample, a vector-add operation is first carried out on the GPU, with vector c holding the result; every other function or method is implemented in the same way.
Code Execution
Refer to the code sample for instructions on how to develop the generic GA and optimize the method to operate on GPUs using numba-dpex for Intel Distribution for Python. It also describes how to use the various GA operations selection, crossover, and mutation and how to modify these techniques for use in solving other optimization issues.
Set the following values to initialize the population:
Population size: 5,000
Size of a chromosome: 10
Generations: 5.
There are ten random floats between 0 and 1 on each chromosome.
Implement the GA by developing an evaluation function: this serves as the baseline and point of comparison for numba-dpex. An individual’s fitness is calculated by applying a combination of algebraic operations to its chromosome.
Carry out the crossover operation: the inputs are two distinct chromosomes, the first and second parents. One new chromosome is returned as the function’s output.
Carry out the mutation operation: in this code example, each float in the chromosome has a one percent probability of being replaced by a random value.
Put into practice the selection process, which is the foundation for producing a new generation. After crossover and mutation procedures, a new population is generated inside this function.
Run the prepared functions on a CPU first to establish a baseline. After the first population is created, every generation includes the following steps:
Utilizing the eval_genomes_plain function, the current population is evaluated
Utilizing a next_generation function, create the next generation.
Clear the fitness values, since a new generation has been produced.
The computation time for these operations is measured and printed. To show that the CPU and GPU calculations match, the first chromosome is also displayed.
Run on a GPU: after a fresh population initialization (similar to step 2), create an evaluation function for the GPU. The only difference from the CPU implementation is that chromosomes are represented by a flattened data structure; a global index and numba-dpex kernels are used instead of looping over every chromosome.
When running on the GPU, the time for evaluation, generation creation, and the fitness wipe is measured, just as on the CPU. The fitness container and all of the chromosomes are delivered to the selected device; after that, a kernel with a specified range can be used.
Conclusion
The same procedure applies to other optimization problems: define the selection, crossover, mutation, and evaluation operations for your chromosomes, and the rest of the algorithm runs unchanged.
Run the code sample and compare how the method performs when executing sequentially on a CPU versus in parallel on a GPU. The results show that the GPU-based numba-dpex parallel implementation improves performance.
Read more on Govindhtech.com
kAI: A Mexican AI Startup Improving Everyday Activities
Mexican AI
kAI, a Mexican AI startup, simplifies and improves the convenience of managing daily tasks.
kAI Meaning
“Künstliche Intelligenz” (German for “Artificial Intelligence”) refers to AI technology, techniques, and systems. The word “kAI” may refer to AI-based solutions that use machine learning, data analysis, and other AI methods to improve or automate activities.
The AI startup kAI is based in Mexico’s technology hub and is building an AI-powered organizing app called kAI Tasks. With this app, users can easily arrange their day and focus their efforts on the things that really matter. Thanks to AI’s intuitive capabilities, creating an agenda with kAI takes less than a minute. kAI Tasks runs on watchOS smartwatches and on Android and Apple tablets and smartphones.
The Problem
In an environment where there are always new assignments and meetings, staying productive is crucial. Unfortunately, rather than increasing user productivity, existing to-do apps often decrease it: either important functionality is missing, the user experience is not straightforward enough, or the system does not support the users’ regular daily chores.
The Resolution
The mobile task management software from kAI makes it simple for end users to plan, schedule, and arrange their workdays. Compared to conventional to-do management apps and tools, this can be completed in a fraction of the time because of artificial intelligence.
Block planning appears on one screen daily when using kAI Tasks. (Image credit: Intel)
The following are a few of the benefits and features that make the tool so alluring:
Intelligent task management: kAI provides tailored recommendations and reminders to help you stay on track by learning from end users’ behaviors and preferences.
Easy event planning: arrange agendas and schedules with ease, freeing up time to concentrate on the important things.
Constant adaptation: The more you use the tool, the more it learns about your requirements and adjusts accordingly, personalizing your everyday experience.
kAI Tasks may be tailored to the requirements of the end user
To optimize everyday objectives, kAI Tasks may be used in conjunction with a smartphone or wristwatch. The end user may easily control his or her productivity and maintain organization with this configuration.
By the end of September 2024, kAI hopes to provide additional features including wearables and the creation of a bot for Telegram and WhatsApp, among other things. With the aid of these connections, the business will be able to expand its user base and make everyday job organization easier without requiring the usage of another software.
“Personal organization is the foundation of an excellent lifestyle. At kAI we are redefining time and task management. Our modern tooling boosts productivity, improves well-being, and reduces stress. With kAI you can easily accomplish your business and personal objectives while maintaining the ideal balance in your life, and we can all do even more in less time because our company is part of the Intel Liftoff Program,” says Kelvin Perea, CEO of kAI.
kAI Tasks, which is compatible with almost all smart devices, makes it simple to arrange daily chores. Task management becomes simpler and more straightforward with the aid of AI, as the software gradually learns the end user’s behavior.
Are you prepared to further innovate and grow your startup? Enroll in the Intel Liftoff program right now to become a part of a community that is committed to fostering your ideas and promoting your development.
Intel Liftoff
Intel Liftoff for Startups
Take Down Code Barriers, Release Performance, and Turn Your Startup Into a Scalable, AI Company that Defines the Industry.
Early-stage AI and machine learning businesses are eligible to apply for Intel Liftoff for startups. No matter where you are in your entrepreneurial career, this free virtual curriculum supports you in innovating and scaling.
Benefits of the Program for AI Startups
Startups can get the processing power they need to address their most pressing technological problems with Intel Liftoff. The initiative also acts as a launchpad for collaborations, allowing entrepreneurs to improve customer service and strengthen one another’s offerings.
Superior Technical Knowledge and Instruction
Availability of the program’s Slack channel
Free online seminars and courses
Engineering advice and assistance
Reduced prices for certification and training
Invitations to forums and activities with experts
Advanced Technology and Research Resources
Offers for Intel Developer Cloud free cloud credits
Cloud service provider credits
Availability of Intel developer tools, which provide several technological advantages
Use the Intel software library to access devices with next-generation artificial intelligence
Opportunities for Networking and Comarketing
Boost consumer awareness using Intel’s marketing channels.
Venture exhibitions at trade shows
Introductions at Intel around the ecosystem
Establish a connection with Intel Capital and the worldwide venture capital (VC) network
Developer Cloud Intel Tiber
Take down the obstacles to hardware access, quicken development times, and increase your AI and HPC processes’ return on investment (ROI).
Register to get instant access to the newest Intel software and hardware innovations, enabling you to write, test, and optimize code more quickly, cheaply, and effectively.
AI Pioneers Who Discovered Intel Liftoff for Startups as Their Launchpad
Their companies are breaking new ground in a variety of AI-related fields. Here’s how they sum up their time in the program and the benefits they’ve received in terms of improved performance.
Enabling businesses to develop and implement vision AI solutions more quickly and consistently
By processing crucial machine learning tasks with AI Tools, the Hasty end-to-end vision AI platform opens up new AI use cases and makes application development more approachable.
“Using Intel OneAPI to unlock computationally demanding vision AI tasks will be a stepwise shift for critical industries like disaster recovery, logistics, agriculture, and medical.”
Use particle-based simulation tools to assist engineers in creating amazing things
Using the Intel HPC Toolkit and the Intel Developer Cloud, Dive Solutions improves their cloud-native computational fluid dynamics simulation software for state-of-the-art hardware.
“Dive Solutions used parts of the Intel HPC Toolkit to optimize their solver performance on Intel Xeon processors in an economical manner. The workloads are currently being prepared to execute on both CPU and GPU architectures.”
Using a hyperconverged, real-time analytics platform to address the difficulties posed by big data
Using oneAPI, the Isima low-code framework optimizes for cost and performance in the cloud while enabling real-time use cases that drastically shorten time-to-value.
Read more on govindhtech.com
Utilizing llama.cpp, LLMs can be executed on Intel GPUs
The open-source project llama.cpp is a lightweight LLM framework that is steadily gaining popularity. Thanks to its performance and customizability, developers, scholars, and enthusiasts have formed a strong community around the project. Since its launch it has gathered over 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks on GitHub. Thanks to recent code merges, llama.cpp now supports more hardware, including the Intel GPUs found in server and consumer products. Intel GPU support now sits alongside support for other vendors’ GPUs and for x86 and ARM CPUs.
Georgi Gerganov designed the first implementation. The project is mostly instructional in nature and acts as the primary testing ground for new features being developed for ggml, the machine-learning tensor library. Intel is making AI more accessible to a wider range of customers by enabling inference on a greater number of devices with its latest releases. llama.cpp is written in C, which makes it fast, and it has a number of other appealing qualities:
16-bit float compatibility
Support for integer quantisation (4-bit, 5-bit, 8-bit, etc.)
No dependencies on third-party libraries
There are no runtime memory allocations.
Intel GPU SYCL Backend
ggml offers a number of backends to accommodate and tune for different hardware. Since oneAPI supports GPUs from multiple vendors, Intel built the SYCL backend using its direct programming language, SYCL, and its high-performance BLAS library, oneMKL. SYCL is a programming model designed to increase productivity on hardware accelerators: an embedded, domain-focused, single-source language built entirely on C++17.
All Intel GPUs can be used with the SYCL backend. Intel has confirmed with:
Flex Series and Data Centre GPU Max from Intel
Discrete GPU Intel Arc
Intel Arc GPU integrated with the Intel Core Ultra CPU
In Intel Core CPUs from Generations 11 through 13: iGPU
Millions of consumer devices can now conduct inference on Llama since llama.cpp now supports Intel GPUs. The SYCL backend performs noticeably better on Intel GPUs than the OpenCL (CLBlast) backend. Additionally, it supports an increasing number of devices, including CPUs and future processors with AI accelerators. For information on using the SYCL backend, please refer to the llama.cpp tutorial.
Utilise the SYCL Backend to Run LLM on an Intel GPU
llama.cpp contains a comprehensive manual for SYCL. It can run on any Intel GPU that supports SYCL and oneAPI. Server and cloud users can use Flex Series GPUs and the Intel Data Centre GPU Max; client users can try it on an Intel Arc GPU or on the iGPU in Intel Core CPUs. Intel has tested the iGPUs of 11th-generation Core processors and later; older iGPUs work but perform poorly.
The memory is the only restriction. Shared memory on the host is used by the iGPU. Its own memory is used by the dGPU. For llama2-7b-Q4 models, Intel advise utilising an iGPU with 80+ EUs (11th Gen Core and above) and shared memory that is greater than 4.5 GB (total host memory is 16 GB and higher, and half memory could be assigned to iGPU).
Put in place the Intel GPU driver
Windows (WSL2) and Linux are supported. Intel suggests Ubuntu 22.04 for Linux, and this version was used for testing and development.
Linux:
sudo usermod -aG render username
sudo usermod -aG video username
sudo apt install clinfo
sudo clinfo -l
Output (example):
Platform #0: Intel(R) OpenCL Graphics -- Device #0: Intel(R) Arc(TM) A770 Graphics
or
Platform #0: Intel(R) OpenCL HD Graphics -- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
Set the oneAPI Runtime to ON
First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:
Linux: source /opt/intel/oneapi/setvars.sh
Windows: "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
Run sycl-ls to confirm that there are one or more Level Zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
Build by one-click:
Linux: ./examples/sycl/build.sh
Windows: examples\sycl\win-build-sycl.bat
Note, the scripts above include the command to enable the oneAPI runtime.
Run an Example by One-Click
Download llama-2-7b.Q4_0.gguf and save it to the models folder:
Linux: ./examples/sycl/run-llama2.sh
Windows: examples\sycl\win-run-llama2.bat
Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, please change the device ID in the script. To list the device ID:
Linux: ./build/bin/ls-sycl-device or ./build/bin/main
Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
Synopsis
The SYCL backend included in llama.cpp makes all Intel GPUs available to LLM developers and users. Check whether your Intel laptop, gaming PC, or cloud virtual machine has an iGPU, an Intel Arc GPU, or an Intel Data Centre GPU Max or Flex Series GPU. If so, llama.cpp’s wonderful LLM features on Intel GPUs are yours to enjoy. Intel wants developers to experiment with and contribute to the backend, adding new features and optimising SYCL for Intel GPUs. The oneAPI programming model is a useful skill to learn for cross-platform development.
Read more on Govindhtech.com
OneAPI Math Kernel Library (oneMKL): Intel MKL’s Successor
The upgraded and enlarged Intel oneAPI Math Kernel Library supports numerical processing not only on CPUs but also on GPUs, FPGAs, and other accelerators that are now standard components of heterogeneous computing environments.
To help you decide whether upgrading from classic Intel MKL is the better option for you, this blog provides a brief overview of the math library.
Why just oneMKL?
The vast array of mathematical functions in oneMKL can be used for a wide range of tasks, from straightforward ones like linear algebra and equation solving to more intricate ones like data fitting and summary statistics.
It can serve as a common medium for several scientific computing functions, including fast Fourier transforms (FFT), random number generation (RNG), dense and sparse Basic Linear Algebra Subprograms (BLAS), the Linear Algebra Package (LAPACK), and vector math, all while adhering to uniform API conventions. Together with GPU offload and SYCL support, all of these are offered through C and Fortran interfaces.
Additionally, when used with Intel Distribution for Python, oneAPI Math Kernel Library speeds up Python computations (NumPy and SciPy).
Intel MKL Advanced with oneMKL
A refined variant of the standard Intel MKL is called oneMKL. What sets it apart from its predecessor is its improved support for SYCL and GPU offload. Allow me to quickly go over these two distinctions.
GPU Offload Support for oneMKL
GPU offloading for SYCL and OpenMP computations is supported by oneMKL. With its main functionalities configured natively for Intel GPU offload, it may thus take use of parallel-execution kernels of GPU architectures.
oneMKL adheres to the General Purpose GPU (GPGPU) offload concept that is included in the Intel Graphics Compute Runtime for OpenCL Driver and oneAPI Level Zero. The fundamental execution mechanism is as follows: the host CPU is coupled to one or more compute devices, each of which has several GPU Compute Engines (CE).
SYCL API for oneMKL
OneMKL’s SYCL API component is a part of oneAPI, an open, standards-based, multi-architecture, unified framework that spans industries. (Khronos Group’s SYCL integrates the SYCL specification with language extensions created through an open community approach.) Therefore, its advantages can be reaped on a variety of computing devices, including FPGAs, CPUs, GPUs, and other accelerators. The SYCL API’s functionality has been divided into a number of domains, each with a corresponding code sample available at the oneAPI GitHub repository and its own namespace.
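To give a flavor of what the SYCL API looks like in practice, here is a small, illustrative sketch of a single-precision GEMM call from the BLAS domain on USM memory. The header and namespace names follow the oneMKL product documentation as best recalled here and may differ slightly by version, so treat this as a sketch rather than a definitive reference.

#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>  // oneMKL SYCL API umbrella header (name may vary by version)
#include <cstdint>

int main() {
  const std::int64_t m = 64, n = 64, k = 64;
  sycl::queue q;  // CPU, GPU, or other accelerator, whatever the default selector picks

  float *A = sycl::malloc_shared<float>(m * k, q);
  float *B = sycl::malloc_shared<float>(k * n, q);
  float *C = sycl::malloc_shared<float>(m * n, q);
  for (std::int64_t i = 0; i < m * k; ++i) A[i] = 1.0f;
  for (std::int64_t i = 0; i < k * n; ++i) B[i] = 2.0f;
  for (std::int64_t i = 0; i < m * n; ++i) C[i] = 0.0f;

  // C = 1.0 * A x B + 0.0 * C, dispatched to whichever device backs the queue.
  auto nt = oneapi::mkl::transpose::nontrans;
  oneapi::mkl::blas::row_major::gemm(q, nt, nt, m, n, k,
                                     1.0f, A, k, B, n, 0.0f, C, n).wait();

  sycl::free(A, q); sycl::free(B, q); sycl::free(C, q);
}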
OneMKL Assistance for the Most Recent Hardware
On cutting-edge architectures and upcoming hardware generations, you can benefit from oneMKL functionality and optimizations. Some examples of how oneMKL enables you to fully utilize the capabilities of your hardware setup are as follows:
It supports the 4th generation Intel Xeon Scalable Processors’ float16 data type via Intel Advanced Vector Extensions 512 (Intel AVX-512) and optimised bfloat16 and int8 data types via Intel Advanced Matrix Extensions (Intel AMX).
It offers matrix multiply optimisations on the upcoming generation of CPUs and GPUs, including Single Precision General Matrix Multiplication (SGEMM), Double Precision General Matrix Multiplication (DGEMM), RNG functions, and much more.
For a number of features and optimisations on the Intel Data Centre GPU Max Series, it supports Intel Xe Matrix Extensions (Intel XMX).
For memory-bound dense and sparse linear algebra, vector math, FFT, spline computations, and various other scientific computations, it makes use of the hardware capabilities of Intel Xeon processors and Intel Data Centre GPUs.
Additional Terms and Context
The brief explanation of terminology provided below could also help you understand oneMKL and how it fits into the heterogeneous-compute ecosystem.
The C++ with SYCL interfaces for performance math library functions are defined in the oneAPI Specification for oneMKL. The oneMKL specification has the potential to change more quickly and often than its implementations.
The specification is implemented in an open-source manner by the oneAPI Math Kernel Library (oneMKL) Interfaces project. With this project, we hope to show that the SYCL interfaces described in the oneMKL specification may be implemented for any target hardware and math library.
The intention is to gradually expand the implementation, even though the one offered here might not be the complete implementation of the specification. We welcome community participation in this project, as well as assistance in expanding support to more math libraries and a variety of hardware targets.
With C++ and SYCL interfaces, as well as comparable capabilities with C and Fortran interfaces, oneMKL is the Intel product implementation of the specification. For Intel CPU and Intel GPU hardware, it is extremely optimized.
Next up, what?
Launch oneMKL now to begin speeding up your numerical calculations like never before! Leverage oneMKL’s powerful features to expedite math processing operations and improve application performance while reducing development time for both current and future Intel platforms.
Keep in mind that oneMKL is rapidly evolving even while you utilize the present features and optimizations! In an effort to keep up with the latest Intel technology, we continuously implement new optimizations and support for sophisticated math functions.
They also invite you to explore the AI, HPC, and Rendering capabilities available in Intel’s software portfolio that is driven by oneAPI.
Read more on govindhtech.com
Intel’s Next Gen AI Mastery Turbocharges Everything
Intel’s Next Gen AI
Intel launched 5th Gen Intel Xeon and Intel Core Ultra CPUs for data center, cloud, and edge next gen AI at its AI Everywhere event on December 14.
Intel uses a software-defined, open ecosystem to make next gen AI hardware technologies accessible and easy to utilize. That includes incorporating acceleration into next gen AI frameworks like PyTorch and TensorFlow and providing core libraries (through oneAPI) to make software portable and performant across hardware.
The comprehensive set of enhanced compilers, libraries, analysis and debug tools, and optimized frameworks in Intel Software Development Tools 2024.0 simplifies the creation and deployment of accelerated solutions on these new platforms, maximizing performance and productivity.
AI accelerator engines
5th Gen Intel Xeon processors can handle demanding next gen AI workloads without discrete accelerators, building on 4th Gen’s built-in accelerator engines and providing a more efficient way to increase performance than adding CPU cores or GPUs.
Intel Accelerator Engines:
Intel AMX enhances deep learning training and inference. It excels at NLP, recommendation systems, and image recognition.
Intel QuickAssist Technology (Intel QAT) offloads encryption, decryption, and compression to free up CPU cores so systems can serve more customers or use less power. Fourth-generation Intel Xeon Scalable processors with Intel QAT are the fastest CPUs that can compress and encrypt simultaneously.
Intel Data Streaming Accelerator (Intel DSA) improves streaming data transport and transformation for storage, networking, and data-intensive workloads. It speeds up data movement across the CPU, RAM, caches, and all attached memory, storage, and network devices by offloading the most common data-movement operations that generate overhead in data-center-scale installations.
Intel In-Memory Analytics Accelerator (Intel IAA) speeds up database and analytics workloads and may save power. This built-in accelerator boosts query throughput and reduces memory footprint for in-memory databases and big data analytics. In-memory and open-source data stores such as RocksDB and ClickHouse benefit from Intel IAA.
Through oneAPI performance libraries, or the popular next gen AI frameworks optimized by those libraries, Intel Software Development Tools are essential for maximizing accelerator-engine performance. Consider Intel Advanced Matrix Extensions.
Intel AMX activation
Intel AMX accelerates the matrix multiplication in next gen AI workloads with new x86 Instruction Set Architecture (ISA) additions. It has two parts:
Two-dimensional registers (tiles) that can store submatrices from larger matrices.
The Tile Matrix Multiply (TMUL) accelerator, which runs tile instructions.
Support for the int8 and bfloat16 data formats boosts AI and machine learning speed. The Intel oneAPI performance libraries enable Intel AMX and the int8/bfloat16 datatypes:
Intel oneAPI Deep Neural Network Library (oneDNN) is a flexible, scalable deep learning library that performs well on many hardware platforms.
Intel oneAPI Data Analytics Library (oneDAL) accelerates batch, online, and distributed big data analysis.
Deep learning and other high-performance computing applications use the Intel oneAPI Collective Communications Library (oneCCL) for collective communication primitives such as allreduce and broadcast.
Intel oneAPI Threading Building Blocks (oneTBB), a popular C++ library for parallel programming, provides a higher-level interface for parallel algorithms and data structures.
The Intel oneAPI Base Toolkit and Intel AI Tools enhance machine learning and data science pipelines. The oneAPI performance libraries heavily optimize TensorFlow and PyTorch, the major deep learning frameworks.
PC AI Accelerator
Intel Core Ultra processors will power AI PCs with work and content creation apps. Intel’s software-defined and open ecosystem supports ISVs in developing the AI PC category and gives customers, developers, and data scientists flexibility and choice for scaling next gen AI innovation.
ISVs, developers, and professional content creators can improve performance, power efficiency, and immersive experiences with Intel Core Ultra hybrid processors, Intel Software Development Tools, and optimized frameworks when building innovative gaming, content creation, AI, and media applications. The tools enable advanced CPU, GPU, and NPU functionalities.
Intel oneAPI compilers and libraries use AVX-VNNI and other architectural features to boost speed. Intel VTune Profiler profiles and tunes applications for microarchitecture exploration, optimal task balancing and GPU offload, and memory access analysis. Both are in the Intel oneAPI Base Toolkit. These solutions let developers use a single, portable codebase for CPU and GPU, lowering development costs and code maintenance.
For gaming: Intel Graphics Performance Analyzers and Intel VTune Profiler help game creators eliminate bottlenecks for high-performance experiences. Intel Embree and Intel Open Image Denoise in the Intel Rendering Toolkit improve game engine rendering.
For content creation: create hyper-realistic CPU and GPU renderings for content development and product design using powerful ray-tracing frameworks. Enable scalable, real-time GPU rendering with Intel Embree’s ray-traced hardware acceleration, and AI-based denoising in milliseconds with Intel Open Image Denoise from the Intel Rendering Toolkit.
Media: Intel Deep Link Hyper Encode, enabled by the Intel Video Processing Library (Intel VPL), speeds up video transcoding by 1.6x. With a dedicated API, AV1 encode/decode, and Intel Deep Link Hyper Encode, Intel VPL can use multiple graphics accelerators to encode 60% faster.
To deploy next gen AI at scale with the open source OpenVINO toolkit, use Intel CPU, GPU, and NPU accelerators to optimize inferencing and performance. Start with a TensorFlow- or PyTorch-trained model and integrate with OpenVINO compression for easy deployment across hardware platforms, all with little code modification.
Enable Intel Advanced Vector Extensions 512 (AVX-512) on CPU and Intel Xe Matrix Extensions (XMX) on GPU to accelerate deep learning frameworks using the oneDNN and oneDAL libraries in the Intel oneAPI Base Toolkit.
Optimize TensorFlow and PyTorch training and inference by orders of magnitude using Intel-optimized deep learning frameworks. Open source AI reference kits (34 available) accelerate model building and AI innovation across industries.
Read more on Govindhtech.com