PyTorch 2.7 Improves Performance on Intel GPUs for AI Workflows

PyTorch 2.7 improves functionality and performance across platforms, with notable gains on Intel GPU architectures. With 3,262 contributions from 457 contributors since PyTorch 2.6, this release aims to make AI research and development easier on more hardware and to streamline AI workflows.
Improved Intel GPU Performance
PyTorch 2.7 prioritises Intel GPU acceleration, building on improvements in earlier releases. A single GPU programming model provides a consistent user experience across Windows, Linux, and the Windows Subsystem for Linux.
PyTorch 2.7 supports Intel GPUs in both eager mode and graph mode (torch.compile) on Windows and Linux. Supported Intel GPU products include:
Intel Arc A-Series and B-Series Graphics
Intel Core Ultra Processors with Intel Arc Graphics
Intel Core Ultra Mobile Processors (Series 2) with Intel Arc Graphics
Intel Core Ultra Desktop Processors (Series 2) with Intel Arc Graphics
Intel Data Centre GPU Max Series
Installation is simplified with torch-xpu PIP wheels and a streamlined setup experience. High ATen operation coverage, implemented with SYCL and oneDNN, extends eager mode support for greater performance and functionality.
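As a quick orientation, here is a minimal sketch of getting started with an XPU-enabled build; the wheel index URL follows PyTorch's published install instructions, but verify the current command for your platform:

```python
# Hedged quick start; install an XPU-enabled wheel first, e.g.
#   pip install torch --index-url https://download.pytorch.org/whl/xpu
# (check pytorch.org for the current command for your platform).
import torch

# Verify the Intel GPU backend is visible before targeting it.
if torch.xpu.is_available():
    x = torch.randn(4, 4, device="xpu")
    print(x.device)  # xpu:0
else:
    print("No Intel GPU detected; falling back to CPU.")
```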
For Intel GPUs, PyTorch 2.7 greatly improves scaled dot-product attention (SDPA) inference performance with the bfloat16 and float16 data types, accelerating attention-based models. On Intel Arc B580 Graphics and on the Intel Core Ultra 7 258V with Intel Arc Graphics 140V, the new SDPA optimisation improves Stable Diffusion float16 inference by up to 3x over PyTorch 2.6 in eager mode.
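The accelerated path is the standard SDPA call; a minimal sketch (assuming an XPU-enabled build and an Intel GPU, with illustrative shapes):

```python
import torch
import torch.nn.functional as F

# float16 SDPA in eager mode on an Intel GPU: the path the release
# reports as up to 3x faster than PyTorch 2.6 for Stable Diffusion.
q = torch.randn(1, 8, 512, 64, device="xpu", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)
```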
torch.compile support is another important milestone: PyTorch 2.7 brings torch.compile to Windows 11 for Intel GPUs, making them the first accelerators to support torch.compile on Windows. This lets Windows users adopt graph mode compilation, previously available only on Linux, for better performance. torch.compile benefits inference and training alike: PyTorch Dynamo Benchmarking Suite results on Intel Arc B580 Graphics show clear speedups for compiled mode over eager mode on Windows.
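Usage is the same on Windows as on Linux; a minimal sketch (model and sizes are illustrative):

```python
import torch

# Graph mode on an Intel GPU: the same torch.compile call now works
# on Windows 11 as well as Linux. Falls back to CPU if no XPU is found.
device = "xpu" if torch.xpu.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to(device)

compiled = torch.compile(model)  # graph mode compilation
out = compiled(torch.randn(32, 512, device=device))
```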
Additional Intel GPU improvements include:
PyTorch 2 Export Post Training Quantisation (PT2E) performance optimisation, enabling full graph mode quantisation pipelines with better computational efficiency (see the sketch after this list).
AOTInductor and torch.export support on Linux to simplify deployment.
Expanded ATen operator coverage, improving eager mode performance and operator execution continuity.
Profiler support on both Linux and Windows, so developers can analyse model performance.
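For orientation, here is a hedged sketch of the PT2E flow. The x86 Inductor quantizer stands in as an example backend here; the Intel GPU pipeline plugs its own quantizer into the same prepare/convert steps.

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

# Toy model and example inputs; shapes are illustrative.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# Export the full graph, annotate it, calibrate, then convert.
exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)             # calibration pass
quantized = convert_pt2e(prepared)    # full graph mode quantised model
quantized = torch.compile(quantized)  # optional: lower via Inductor
```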
Further work on Intel GPU support is expected to improve PyTorch-native performance, especially for torch.compile-based GEMM computation, and to enhance LLM model performance through FlexAttention and lower-precision data types.
Improved accelerator support in torchao, torchtune, and torchtitan, along with distributed XCCL backend support for the Intel Data Centre GPU Max Series, will also be prioritised. Developers can track progress on GitHub and the PyTorch Dev Discussion forum.
Full CPU, GPU, RAM, OS, and driver specifications are published for the benchmark systems: the Intel Core Ultra 7 258V and the Intel Core Ultra 5 245KF with Intel Arc Graphics.
Key Features of PyTorch 2.7
Beyond Intel GPU acceleration, PyTorch 2.7 introduces several other notable features:
NVIDIA Blackwell GPU support
PyTorch 2.7 supports NVIDIA's Blackwell architecture and ships pre-built wheels for CUDA 12.8 on Linux x86 and arm64. Compatibility required updates to cuDNN, NCCL, and CUTLASS.
The distribution includes Triton 3.3, which supports torch.compile on the Blackwell architecture. Users can install the CUDA 12.8 build with pip, e.g. `pip install torch --index-url https://download.pytorch.org/whl/cu128`.
Torch Function Modes support in torch.compile
This beta feature lets users override torch operations to implement custom behaviour, such as rewriting operations for a specific backend. FlexAttention uses it to rewrite indexing operations.
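A minimal sketch of what a Torch Function Mode looks like (the class and its behaviour are illustrative; the new part in 2.7 is that torch.compile can trace through such modes):

```python
import torch
from torch.overrides import TorchFunctionMode

# Intercept torch.add calls and log them before dispatching normally.
class LogAdds(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.add:
            print("intercepted torch.add")
        return func(*args, **kwargs)

with LogAdds():
    torch.add(torch.ones(2), torch.ones(2))  # logged, then computed
```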
Mega Cache
Mega Cache, in beta, provides end-to-end portable caching for torch.compile. A model can be compiled and executed on one machine, and the resulting compiler artefacts can then be loaded on a different machine to pre-populate the torch.compile cache, speeding up subsequent compilations.
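A hedged sketch of the flow; the function names follow the release notes, but treat the exact signatures as illustrative since the feature is in beta:

```python
import torch

# Compile and run once on machine A so compiler artefacts exist.
model = torch.nn.Linear(8, 8)
compiled = torch.compile(model)
compiled(torch.randn(2, 8))

# Serialise every compiler artefact into portable bytes.
result = torch.compiler.save_cache_artifacts()
if result is not None:
    artifact_bytes, cache_info = result
    # Ship artifact_bytes to machine B, then pre-populate its cache with:
    # torch.compiler.load_cache_artifacts(artifact_bytes)
```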
PyTorch Native Context Parallel
The PyTorch Context Parallel API, a prototype feature, lets users create a Python context in which calls to torch.nn.functional.scaled_dot_product_attention() run in parallel across ranks. It currently supports the cuDNN, Efficient, and Flash attention backends, and TorchTitan uses it for Context Parallel LLM training.
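A hedged sketch of the prototype API; the argument names follow the experimental docs but may change, and the script must be launched with a distributed launcher:

```python
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

# Launch with: torchrun --nproc-per-node=2 this_script.py
torch.distributed.init_process_group("nccl")
rank = torch.distributed.get_rank()
torch.cuda.set_device(rank)
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))

# (batch, heads, seq, head_dim); the sequence dim (2) is sharded across ranks.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

with context_parallel(mesh, buffers=[q, k, v], buffer_seq_dims=[2, 2, 2]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```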
FlexAttention improvements
FlexAttention, introduced in PyTorch 2.5.0 to let researchers customise attention kernels without writing kernel code, is further improved in PyTorch 2.7. The improvements include:
LLM first token processing on x86 CPUs
This release extends PyTorch 2.6's x86 CPU support with attention variants for first token processing in LLM inference. It makes FlexAttention on x86 CPUs smoother to use with torch.compile, replacing ad-hoc scaled_dot_product_attention operations with a single API while keeping performance high.
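That single API is flex_attention; a minimal sketch with a causal mask expressed as a score_mod function instead of custom kernel code (shapes are illustrative):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Causal masking as a score_mod: keep scores where the query index is at
# or after the key index, mask everything else to -inf.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(1, 8, 128, 64)
k, v = torch.randn_like(q), torch.randn_like(q)

# torch.compile generates the fused attention kernel, on CPU or GPU.
out = torch.compile(flex_attention)(q, k, v, score_mod=causal)
```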
LLM throughput mode optimisation on x86 CPUs
A new C++ micro-GEMM template removes the large batch size limitations of PyTorch 2.6 and improves LLM inference throughput on x86 CPUs. Users get better speed, seamlessly, when combining the FlexAttention API with torch.compile for LLM throughput scenarios on x86.
Inference focus. This release also provides an inference-optimised decoding backend with PagedAttention and GQA support, plus trainable biases, performance tuning guides, and support for jagged (ragged) tensor layouts.
Foreach Map
Foreach Map is a prototype feature built on torch.compile. Like the existing torch._foreach_* operations, it lets users apply pointwise or user-defined functions such as torch.add across lists of tensors.
A key benefit is that it can lift user-defined Python functions and handle lists of tensors or scalars, with torch.compile generating a horizontally fused kernel for best performance.
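Since Foreach Map is a prototype and its import location may still shift, this sketch illustrates the underlying idea with the stable torch._foreach_* ops, which Foreach Map generalises to arbitrary user-defined functions:

```python
import torch

xs = [torch.randn(1024) for _ in range(8)]
ys = [torch.randn(1024) for _ in range(8)]

# Equivalent to [x + y for x, y in zip(xs, ys)], but issued as a single
# horizontally fused foreach op rather than eight separate kernels.
outs = torch._foreach_add(xs, ys)
```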
Inductor Prologue Fusion Support
Prologue fusion, another prototype feature, optimises matrix multiplication (matmul) by fusing pre-matmul operations (such as data type casts) into the matmul kernel itself, reducing global memory traffic and improving performance.
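A hedged sketch of the kind of pattern prologue fusion targets; whether fusion actually fires depends on the Inductor backend and its heuristics:

```python
import torch

# Under torch.compile with Inductor, the dtype casts feeding the matmul
# can be fused into the matmul kernel's loads (its "prologue") instead of
# being materialised as separate copies in global memory.
@torch.compile
def cast_then_matmul(a, b):
    return a.to(torch.float16) @ b.to(torch.float16)

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
out = cast_then_matmul(a, b)
```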
In Conclusion
The release notes highlight how upstream Intel GPU support has advanced since PyTorch 2.4 and how PyTorch 2.7's capabilities boost AI workloads across a range of Intel GPUs. In particular, torch.compile on Windows and the SDPA optimisation for Stable Diffusion inference on Intel Arc and Core Ultra systems deliver notable performance gains.
#PyTorch27#PyTorch26#PyTorch24#IntelArcBSeries#IntelCoreUltra#IntelArcB580Graphics#PyTorch#IntelCoreUltra7258V#LLMinference#News#Technews#Technology#Technologynews#Technologytrends#govindhtech