#OpenXLA
govindhtech · 3 months ago
StableHLO & OpenXLA: Enhancing Hardware Portability for ML
JAX and OpenXLA: Methods and Theory
JAX, a Python numerical computing library with XLA compilation and automatic differentiation, uses OpenXLA to optimise computations for CPUs, GPUs, and TPUs.
Even though the Intel articles on JAX and OpenXLA do not define StableHLO, the context of OpenXLA's role suggests that it relates to the portability and stability of the hardware abstraction layer in the ecosystem, for example as used by the Intel Extension for OpenXLA with its PJRT plug-in.
StableHLO most likely fits into the scenario the sources describe as follows:
OpenXLA abstracts low-level hardware backends from high-level machine learning frameworks like JAX. This abstraction lets models operate on different hardware without code changes.
OpenXLA uses an intermediate representation (IR) to connect the backend (XLA compilers for specific hardware) and frontend (JAX).
For this abstraction to work properly and enable reliable deployment across devices, the IR must be stable: if the IR changes, it can break both backend compilers and frontend frameworks.
We believe StableHLO is OpenXLA's versioned and standardised HLO (high-level operations) IR. With this standardisation and versioning, models emitted for a given StableHLO version would work on any compatible hardware backend that supports that version.
Although the sources don't define StableHLO, OpenXLA's role as an abstraction layer built around an intermediate representation implies that StableHLO is essential to the JAX and OpenXLA ecosystem for keeping computations stable and portable across hardware targets. It gives hardware and software (JAX via OpenXLA) a solid contract.
To better understand StableHLO, you should read the OpenXLA project and component documentation.
Understanding how JAX and OpenXLA interact, especially the compilation and execution cycle, helps optimise performance on Intel and other hardware. The highlights are OpenXLA's role in backend-agnostic optimisation, JAX's staged compilation, and cross-device execution.
Important topics
Core Functionality and Transformation System of JAX
JAX adds JIT compilation, vectorisation, parallelisation, and automatic differentiation (jax.grad) to a NumPy-style API.
These transformations make JAX functions more efficient.
jax.jit converts JAX functions into XLA computations, improving efficiency. “The jax.jit transformation in JAX optimises numerical computations by compiling Python functions that operate on JAX arrays into efficient, hardware-accelerated code using XLA.”
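To make this concrete, here is a minimal sketch (not from the article; the predict function and the arrays below are invented for illustration) of a jax.jit-decorated function that is traced and compiled on its first call and reuses the compiled executable afterwards:

```python
import jax
import jax.numpy as jnp

# jax.jit compiles this whole function into a single XLA executable.
@jax.jit
def predict(w, b, x):
    return jnp.tanh(jnp.dot(x, w) + b)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (128, 64))
b = jnp.zeros(64)
x = jax.random.normal(key, (32, 128))

y = predict(w, b, x)   # first call: trace + XLA compilation
y = predict(w, b, x)   # later calls reuse the compiled executable
```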
OpenXLA as a Backend-Agnostic Compiler
OpenXLA bridges JAX to hardware backends, combining a common intermediate representation with an optimisation pipeline.
jax.jit lowers JAX code to OpenXLA's HLO IR.
OpenXLA optimises this HLO IR and generates backend machine code.
“OpenXLA serves as a unifying compiler infrastructure that produces optimised machine code for CPUs, GPUs, and TPUs from JAX's computation graph in HLO.”
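As a hedged sketch of this handoff (the layer function is invented for illustration, and printing the lowered computation with as_text() assumes a recent JAX version), both the traced jaxpr and the HLO/StableHLO module passed to OpenXLA can be inspected:

```python
import jax
import jax.numpy as jnp

def layer(w, x):
    # Matrix multiply followed by a ReLU; small enough to read in the IR.
    return jnp.maximum(jnp.dot(x, w), 0.0)

w = jnp.ones((8, 4))
x = jnp.ones((2, 8))

# The jaxpr is JAX's traced, frontend-level view of the computation.
print(jax.make_jaxpr(layer)(w, x))

# Lowering hands the computation to OpenXLA as an HLO/StableHLO module,
# which can be printed as text before any backend optimisation runs.
print(jax.jit(layer).lower(w, x).as_text())
```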
Compilation in stages in JAX
Functions decorated with jax.jit use staged compilation: compilation happens for a specific input shape and data type (the abstract signature).
JAX traces the Python function's execution with abstract values that describe the computation.
The traced computation is then lowered to OpenXLA's HLO IR.
OpenXLA optimises the HLO and generates target backend code.
Subsequent calls with the same abstract signature reuse the compiled code, which boosts performance. “When a JAX-jitted function is called for the first time with a specific shape and dtype of inputs, JAX traces the sequence of operations, and OpenXLA compiles this computation graph into optimised machine code for the target device.”
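A small timing sketch (illustrative only; absolute numbers depend entirely on your hardware and JAX version) shows the one-off compilation cost on the first call and the cached executable on later calls:

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.sum(x ** 2)

x = jnp.ones((1000, 1000))

t0 = time.perf_counter()
f(x).block_until_ready()   # first call: trace + compile + run
t1 = time.perf_counter()
f(x).block_until_ready()   # same abstract signature: reuses the cached executable
t2 = time.perf_counter()

print(f"first call (trace + compile): {t1 - t0:.4f} s")
print(f"second call (cached):         {t2 - t1:.4f} s")
```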
CPU and GPU Execution Flow
OpenXLA determines how JAX computations are executed on each device.
On CPUs, OpenXLA generates machine code that exploits SIMD and other architectural features.
OpenXLA manages data flow and kernel scheduling while the GPU performs the calculations.
On GPUs, OpenXLA generates kernels for the GPU's parallel processing units.
This includes initiating and coordinating GPU kernels and managing CPU-GPU memory transfers.
Data is managed between devices using device buffers (JAX's DeviceArray, now unified as jax.Array).
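A brief sketch of device placement (illustrative; it works the same whether jax.devices() reports a CPU, GPU, or TPU):

```python
import jax
import jax.numpy as jnp

# The devices OpenXLA has made available to JAX on this machine.
print(jax.devices())

x = jnp.arange(1_000_000, dtype=jnp.float32)
x_dev = jax.device_put(x, jax.devices()[0])   # explicit placement on a device

# The result lives in a device buffer (a jax.Array) and stays on the device
# until the host actually asks for the value.
y = jnp.dot(x_dev, x_dev)
print(float(y))   # forces a device-to-host transfer
```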
Understanding Abstract Signatures and Recompilation
The shape and data type of the input arguments determine a jax.jit-decorated function's abstract signature.
When a jitted function is called with inputs that have a different abstract signature, JAX recompiles. Use consistent input shapes and data types to avoid repeated compilation costs.
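A quick sketch of when recompilation is triggered (the scale function is invented for illustration):

```python
import jax
import jax.numpy as jnp

@jax.jit
def scale(x):
    return x * 2.0

scale(jnp.ones((32, 128)))                      # compiles for shape (32, 128), float32
scale(jnp.ones((32, 128)))                      # same abstract signature: no recompile
scale(jnp.ones((64, 128)))                      # new shape: traces and compiles again
scale(jnp.ones((32, 128), dtype=jnp.bfloat16))  # new dtype: another recompile
```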
Intel Hardware/Software Optimisation Integration
Since the resources are on the Intel developer website, they likely demonstrate how JAX and OpenXLA can optimise workloads for Intel CPUs and GPUs.
This area covers optimised kernels, vectorisation using Intel instruction-set extensions such as AVX-512, and integration with Intel-specific libraries and tools.
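One way to check which backend and devices OpenXLA exposes to JAX is sketched below; note that the exact platform name an Intel PJRT plug-in registers (e.g. "xpu") is an assumption here, not something the sources state:

```python
import jax

# The PJRT backend JAX selected by default: "cpu", "gpu", "tpu", or a
# plugin-provided platform (an Intel plug-in is assumed to register
# something like "xpu"; check the plug-in's own documentation).
print(jax.default_backend())

# Every device visible through the installed PJRT plug-ins.
for d in jax.devices():
    print(d.platform, d.device_kind)
```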
In summary: the jax.jit transformation in JAX employs XLA to turn Python functions that operate on JAX arrays into hardware-accelerated code, optimising numerical operations.
OpenXLA serves as a unified compiler infrastructure, converting JAX's compute graph (HLO) into optimised machine code for CPUs, GPUs, and TPUs.
When a JAX-jitted function is first called with a specific shape and dtype of inputs, JAX traces the chain of operations. OpenXLA then compiles this computation graph into device-optimised machine code.
OpenXLA targets GPUs to generate kernels for the GPU's parallel processing units. Launching and synchronising GPU kernels and managing CPU-GPU data flows are required.
wayneradinsky · 2 years ago
"OpenXLA is available now to accelerate and simplify machine learning."
XLA stands for "accelerated linear algebra", and it's the compiler that Google has been using with their custom hardware for AI, called tensor processing units (TPUs). It hasn't been a standalone open-source project until now. Now it is, and...
"co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA."
Quite a list of companies. What exactly does it do?
"It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware. Developers using OpenXLA will see significant improvements in training time, throughput, serving latency, and, ultimately, time-to-market and compute costs."
They go on to further describe their motivation for creating OpenXLA:
"As model parameter counts grow exponentially and compute for deep learning models doubles every six months, developers seek maximum performance and utilization of their infrastructure. Teams are leveraging a wider array of hardware from power-efficient ML ASICs in the datacenter to edge processors that can deliver more responsive AI experiences."
"Without a common compiler to bridge these diverse hardware devices to the multiple frameworks in use today (e.g. TensorFlow, PyTorch), significant effort is required to run ML efficiently; developers must manually optimize model operations for each hardware target. This means using bespoke software libraries or writing device-specific code, which requires domain expertise. The result is isolated, non-generalizable paths across frameworks and hardware that are costly to maintain, promote vendor lock-in, and slow progress for ML developers."
They go on to say that the OpenXLA Project's core pillars are surprise, fear, ruthless efficiency, and almost fanatical devotion to the Pope -- wait, no, that's the Spanish Inquisition. Nobody expects the Spanish Inquisition! The OpenXLA Project's core pillars are performance, scalability, portability, flexibility, and extensibility for users.
So the idea is that input in the form of PyTorch, JAX, or TensorFlow code is converted into StableHLO ("HLO" stands for "high-level operations"); the StableHLO output is then fed to OpenXLA's target-independent optimizer, then a hardware-specific optimizer, and then you run the result on your hardware.
The target-independent optimizer does such things as simplification of algebraic expressions, optimization of in-memory data layout, and scheduling optimizations, to minimize for example peak memory usage and peak communication needed. The hardware-specific optimizer generates code for specific hardware including NVIDIA GPUs, AMD GPUs, x86 and ARM CPU architectures, Google tensor processing units (TPUs), AWS Trainium, AWS Inferentia (hardware optimized for training and inference, respectively), Graphcore intelligence processing units (IPUs -- Graphcore's term), and Cerebras's Wafer-Scale Engine (ginormous AI wafers with everything on one wafer).
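For readers coming from JAX, a hedged sketch of that pipeline (assuming a recent JAX version, where both the lowered StableHLO and the XLA-optimized result can be printed) looks like this:

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.sum(jnp.tanh(x) * 2.0)

x = jnp.ones((256, 256))

lowered = f.lower(x)          # StableHLO as emitted by the frontend (JAX here)
print(lowered.as_text())

compiled = lowered.compile()  # XLA's target-independent and hardware-specific passes
print(compiled.as_text())     # optimized HLO for whatever device is attached
```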
#ai
strategictech · 3 years ago
Google Announces Open Source ML Compiler Project, OpenXLA
At its Next ‘22 event, Google Cloud announced the launch of an open-source machine learning compiler ecosystem, OpenXLA. OpenXLA is a community-led and open-source ecosystem of ML compiler and infrastructure projects co-developed by Google and other AI/ML developers including AMD, Arm, Meta, NVIDIA, AWS, Intel, and Apple.
govindhtech · 2 years ago
AMD’s Big Move with Nod.ai Deal on upcoming AI Revolution
Tumblr media
AMD Nod.ai acquisition
AMD has disclosed that it has signed a definitive agreement to acquire Nod.ai in order to expand the company's open AI software capabilities and further enhance its GPUs. As a result of the acquisition, AMD gains access to a team of skilled engineers.
This group has developed a software platform that is the leader in its field and speeds up the deployment of artificial intelligence solutions that are customized for AMD Instinct data center accelerators, Ryzen AI processors, EPYC CPUs, Versal SoCs, and Radeon GPUs. The transaction is perfectly in line with AMD’s artificial intelligence development strategy, which is based on an open software ecosystem that lowers entry barriers for customers by providing developer tools, libraries, and models. This approach was developed to help AMD become a leader in artificial intelligence.
According to Vamsi Boppana, senior vice president of AMD's Artificial Intelligence Group, “the acquisition of Nod.ai will significantly enhance our ability to provide AI customers with open software that allows them to easily deploy highly performant AI models tuned for AMD hardware.”
“Our ability to advance open-source compiler technology and enable portable, high-performance AI solutions across the AMD product portfolio has been significantly accelerated by the addition of the talented Nod.ai team,” AMD executives said. The technologies built by Nod.ai are already being used at significant scale in the cloud, at the edge, and across a broad range of endpoint devices.
In the words of Anush Elangovan, co-founder and Chief Executive Officer of Nod.ai, “We are a team of engineers focused on solving problems quickly and moving at pace in an industry of constant change to develop solutions for the next set of problems.”
“Throughout the course of our company's existence, we have secured our position as the chief maintainer of, and a key contributor to, several of the world's most prominent AI repositories, including code-generation technologies such as SHARK, Torch-MLIR, and OpenXLA/IREE. Thanks to our cooperation with AMD, we will be able to bring our specialized expertise to a larger range of customers around the globe and better serve the needs of the global community.”
Nod.ai serves the world's most successful hyperscalers, enterprises, and startups with optimized AI solutions. The compiler-based automation capabilities of Nod.ai's SHARK software reduce the need for manual optimization and cut the time required to deploy highly performant AI models across a broad portfolio of data center, edge, and client platforms powered by AMD CDNA, XDNA, RDNA, and “Zen” architectures, making it possible to run these models on a far wider variety of devices than before.
A Few Words Regarding AMD
AMD has been an industry leader in high-performance computing, graphics, and visualization technologies for over half a century, and it continues to be at the forefront of innovation in these fields. AMD technology is used every day by billions of people, leading Fortune 500 organizations, and cutting-edge scientific research institutes around the world to improve how they live, work, and play. AMD was founded in 1969 and is headquartered in Santa Clara, California. At AMD, the goal of every employee is to create cutting-edge products that are high-performing, versatile, and inventive, and that push the boundaries of what is currently possible.