# ROCm Containers for Training and Inference
AMD ROCm 6.4: Scalable Inference and Smarter AI Workflows

AMD ROCm 6.4: Plug-and-Play Containers, Modular Deployment, and Revolutionary Inference for Scalable AI on AMD Instinct GPUs
Modern AI workloads are larger and more sophisticated, raising the bar for both deployment simplicity and performance. AMD ROCm 6.4 advances AI and HPC development on AMD Instinct GPUs.
With growing support for leading AI frameworks, optimised containers, and modular infrastructure tools, ROCm software helps customers manage their AI infrastructure, develop faster, and work smarter.
Whether you're managing massive GPU clusters, training multi-billion-parameter models, or scaling inference across multi-node clusters, AMD ROCm 6.4 delivers strong performance on AMD Instinct GPUs.
This post presents five major AMD ROCm 6.4 improvements that directly address the concerns of infrastructure teams, model developers, and AI researchers, making AI development fast, straightforward, and scalable.
ROCm Training and Inference Containers: Plug-and-Play AI on Instinct GPUs
Setting up and maintaining optimal training and inference environments is time-consuming, error-prone, and slows iteration cycles. AMD ROCm 6.4 provides a large set of pre-optimized, ready-to-run training and inference containers for AMD Instinct GPUs.
For low-latency LLM inference, vLLM (Inference Container) supports plug-and-play open models like Gemma 3 (day-0), Llama, Mistral, Cohere, and others.
FP8 support, DeepGEMM, and optimized multi-head attention give SGLang (Inference Container) exceptional throughput and efficiency for DeepSeek R1 and agentic workflows.
PyTorch (Training Container) simplifies LLM training on AMD Instinct MI300X GPUs with performance-tuned builds that enable advanced attention strategies. Optimized for Llama 3.1 (8B, 70B), Llama 2 (70B), and FLUX.1-dev.
Megatron-LM (Training Container): this ROCm-tuned fork of Megatron-LM can train large-scale language models such as Llama 3.1, Llama 2, and DeepSeek-V2-Lite.
These containers give AI researchers faster access to turnkey environments for experimentation and model evaluation. Model developers get pre-tuned support for the most advanced LLMs, including DeepSeek, Gemma 3, and Llama 3.1, without spending time on configuration. The containers also simplify maintenance and scale-out for infrastructure teams because they deploy uniformly across development, testing, and production environments.
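To make the plug-and-play claim concrete, here is a minimal sketch of vLLM's standard offline-inference API as it might be used inside the ROCm vLLM container; the model name and sampling settings are illustrative assumptions, not values taken from this post.

```python
# Minimal vLLM offline-inference sketch, assumed to run inside the ROCm vLLM
# container where a ROCm-enabled vLLM build is already installed.
from vllm import LLM, SamplingParams

# Hypothetical model choice; any supported open model (Llama, Gemma 3, Mistral, ...)
# could be substituted here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Summarize the benefits of containerized AI deployment in one sentence.",
    "What is low-latency inference?",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling):
    print(output.prompt)
    print(output.outputs[0].text)
```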
PyTorch for ROCm Improvements: Faster Attention and Training
As training large language models (LLMs) pushes compute and memory limits, inefficient attention implementations can slow iteration and increase infrastructure costs. AMD ROCm 6.4 improves Flex Attention, TopK, and Scaled Dot-Product Attention (SDPA) in PyTorch.
Flex Attention: Outperforms ROCm 6.3 in LLM workloads that need advanced attention algorithms, reducing memory overhead and training time.
TopK: TopK operations are now three times faster, improving inference response times without compromising output quality.
SDPA: support for longer contexts and smoother inference.
These improvements speed up training, reduce memory overhead, and improve hardware utilization. As a result, model developers can iterate on larger models faster, AI researchers can run more experiments, and Instinct GPU customers see shorter time-to-train and higher infrastructure ROI.
Upgrades are pre-installed in the ROCm PyTorch container.
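A minimal sketch of two of the primitives named above, torch.topk and torch.nn.functional.scaled_dot_product_attention, as they would be called inside the ROCm PyTorch container; all shapes, sizes, and dtypes below are illustrative assumptions rather than tuned values.

```python
# Sketch of the PyTorch operations mentioned above: torch.topk and
# scaled_dot_product_attention (SDPA). On ROCm builds the Instinct GPU is
# addressed through the usual "cuda" device alias.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# TopK: pick the k most likely next tokens from a logits vector.
logits = torch.randn(1, 32000, device=device)
values, indices = torch.topk(logits, k=50, dim=-1)

# SDPA: fused attention over (batch, heads, seq_len, head_dim) tensors.
q = torch.randn(2, 16, 1024, 64, device=device, dtype=dtype)
k = torch.randn(2, 16, 1024, 64, device=device, dtype=dtype)
v = torch.randn(2, 16, 1024, 64, device=device, dtype=dtype)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 1024, 64])
```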
Next-Gen Inference Performance on AMD Instinct GPUs with vLLM and SGLang
Low-latency, high-throughput inference for large language models is difficult, especially as new models emerge and deployment timelines shrink. ROCm 6.4 addresses this with AMD Instinct GPU-optimized builds of vLLM and SGLang. Thanks to strong support for popular models such as Grok, DeepSeek R1, Gemma 3, and Llama 3.1 (8B, 70B, and 405B), model developers can deploy real-world inference pipelines with minimal modification or rewriting. AI researchers get faster time-to-results on large-scale benchmarks. Infrastructure teams can ensure scaled performance, consistency, and reliability with stable, production-ready containers that receive weekly updates.
SGLang with DeepSeek R1 set an inference throughput record on the Instinct MI300X.
vLLM provided day-0 support for deploying Gemma 3 on Instinct GPUs.
Together, these tools create a full-stack inference environment, with development and stable containers updated on weekly and bi-weekly cadences; a simple client sketch follows below.
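Both vLLM and SGLang expose an OpenAI-compatible HTTP API when run as servers, so a deployed inference pipeline can be exercised with a few lines of client code. The sketch below assumes such a server is already running; the base URL, port, and model name are placeholders, not values from this post.

```python
# Sketch of querying a vLLM or SGLang server through its OpenAI-compatible
# endpoint. The base_url, port, and model name are illustrative assumptions;
# use whatever your deployment actually exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Give me one sentence on MI300X."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```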
Seamless Instinct GPU Cluster Management with the AMD GPU Operator
Scaling and managing GPU workloads across Kubernetes clusters can mean manual driver updates, operational disruptions, and limited visibility into GPU health, all of which reduce performance and reliability. With AMD ROCm 6.4, the AMD GPU Operator automates GPU scheduling, driver lifecycle management, and real-time telemetry to streamline cluster operations. AI and HPC administrators can confidently deploy AMD Instinct GPUs in air-gapped and secure environments with full observability, infrastructure teams can upgrade with minimal disruption, and Instinct customers benefit from higher uptime, lower operational risk, and more resilient AI infrastructure.
Some new features are:
Automatic cordon, drain, and reboot for rolling updates.
More support for Ubuntu 22.04/24.04 and Red Hat OpenShift 4.16–4.17 ensures compatibility with modern cloud and enterprise settings.
Device Metrics Exporter for real-time Prometheus health metrics (a scraping sketch follows this list).
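As a rough illustration of how the Device Metrics Exporter's Prometheus endpoint might be consumed outside of a full Prometheus stack, the snippet below scrapes and filters the text-format metrics; the host, port, and "gpu" filter are assumptions, not values documented in this post.

```python
# Sketch of reading GPU health metrics from the Device Metrics Exporter's
# Prometheus endpoint. The host and port are assumptions; check your
# exporter's configuration for the real values.
import requests

ENDPOINT = "http://localhost:5000/metrics"  # assumed address for illustration

resp = requests.get(ENDPOINT, timeout=5)
resp.raise_for_status()

# Prometheus text format: one "<name>{labels} <value>" sample per line.
for line in resp.text.splitlines():
    if line.startswith("#") or not line.strip():
        continue  # skip comments and blank lines
    if "gpu" in line.lower():  # crude filter for GPU-related samples
        print(line)
```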
New Instinct GPU Driver Modularizes the Software Stack
Tightly coupled driver stacks slow upgrade processes, reduce interoperability, and increase maintenance risk. AMD ROCm 6.4 introduces the modular Instinct GPU Driver, which separates the kernel driver from the ROCm user space.
Main benefits:
Infrastructure teams can now upgrade ROCm libraries and drivers independently.
Compatibility windows are extended to 12 months (from 6 months in earlier releases).
More flexibility when deploying with ISV software, on bare metal, or in containers.
This simplifies fleet-wide upgrades and reduces the risk of breaking changes, which benefits cloud providers, government agencies, and enterprises with strict SLAs.
Bonus: AITER for Accelerated Inference
AITER, a high-performance inference library with drop-in, pre-optimized kernels, removes tedious tuning in AMD ROCm 6.4.
Highlights:
Up to 17x faster decode.
Up to 14x speedups in multi-head attention.
Up to 2x LLM inference throughput.
New AMD ROCm 6.3 Release Expands AI and HPC Horizons

Opening up new paths in AI and HPC with AMD’s ROCm 6.3 release. With cutting-edge tools and optimizations for AI, ML, and HPC workloads on AMD Instinct GPU accelerators, ROCm 6.3 marks a major milestone for AMD’s open-source platform. By raising developer productivity, ROCm 6.3 is designed to serve a diverse spectrum of customers, from cutting-edge AI startups to HPC-driven enterprises.
This blog explores the release’s key features, which include a redesigned FlashAttention-2 for better AI training and inference, the introduction of multi-node Fast Fourier Transform (FFT) to transform HPC workflows, a smooth integration of SGLang for faster AI inferencing, and more. Discover these fascinating developments and more as ROCm 6.3 propels industry innovation.
Super-Fast Inferencing of Generative AI (GenAI) Models with SGLang in ROCm 6.3
Industries are being revolutionized by GenAI, yet deploying huge models often means overcoming latency, throughput, and resource-utilization issues. Enter SGLang, a new runtime supported in ROCm 6.3 and optimized for inference of state-of-the-art generative models like LLMs and VLMs on AMD Instinct GPUs.
Why It Is Important to You
6X Higher Throughput: According to research, you can outperform current systems on LLM inferencing by up to 6X, allowing your company to support AI applications on a large scale.
Usability: With Python integrated and pre-configured in the ROCm Docker containers, developers can quickly construct scalable cloud backends, multimodal processes, and interactive AI helpers with less setup time.
SGLang provides the performance and usability required to satisfy corporate objectives, whether you’re developing AI products that interact with customers or expanding AI workloads in the cloud.
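As a rough illustration of the low-setup workflow described above, the sketch below uses SGLang's Python frontend to drive an already-running SGLang server; the endpoint address, port, and prompt are assumptions for illustration, and the pattern follows SGLang's documented examples rather than anything specified in this post.

```python
# Hedged sketch of SGLang's Python frontend language. Assumes an SGLang server
# is already serving a chat model at the address below (port is illustrative).
import sglang as sgl


@sgl.function
def quick_answer(s, question):
    # Build a short chat turn and ask the model to generate an answer.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))


# Point the frontend at the running SGLang runtime endpoint.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = quick_answer.run(question="What is agentic AI in one sentence?")
print(state["answer"])
```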
Next-Level Transformer Optimization: Re-Engineered FlashAttention-2 on AMD Instinct
The foundation of contemporary AI is transformer models, although scalability has always been constrained by their large memory and processing requirements. AMD resolves these issues with FlashAttention-2 designed for ROCm 6.3, allowing for quicker, more effective training and inference.
Why Developers Will Love It
3X Speedups: In comparison to FlashAttention-1, achieve up to 3X speedups on backward passes and a highly efficient forward pass. This will speed up model training and inference, lowering the time-to-market for corporate AI applications.
Extended Sequence Lengths: AMD Instinct GPUs handle longer sequences with ease thanks to efficient memory use and low I/O overhead.
With ROCm’s PyTorch container and Composable Kernel (CK) as the backend, you can easily add FlashAttention-2 on AMD Instinct GPU accelerators into your current workflows and optimize your AI pipelines.
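For reference, this is roughly what calling FlashAttention-2 looks like from PyTorch via the flash-attn package, which the post describes as backed by Composable Kernel (CK) on ROCm; the tensor shapes, dtypes, and sequence length are illustrative assumptions.

```python
# Sketch of a direct FlashAttention-2 call through the flash-attn package,
# assumed to be available in the ROCm PyTorch container.
import torch
from flash_attn import flash_attn_func

device = "cuda"  # Instinct GPUs appear under the "cuda" device alias on ROCm

# flash_attn_func expects (batch, seqlen, nheads, headdim) half-precision tensors.
q = torch.randn(2, 4096, 16, 64, device=device, dtype=torch.bfloat16)
k = torch.randn(2, 4096, 16, 64, device=device, dtype=torch.bfloat16)
v = torch.randn(2, 4096, 16, 64, device=device, dtype=torch.bfloat16)

out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (2, 4096, 16, 64)
```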
AMD Fortran Compiler: Bridging Legacy Code to GPU Acceleration
With the release of the new AMD Fortran compiler in ROCm 6.3, businesses running legacy Fortran-based HPC applications on AMD Instinct accelerators can now fully tap the potential of modern GPU acceleration.
Principal Advantages
Direct GPU Offloading: Use OpenMP offloading to take advantage of AMD Instinct GPUs and speed up important scientific applications.
Backward Compatibility: Utilize AMD’s next-generation GPU capabilities while building upon pre-existing Fortran code.
Streamlined Integrations: Connect to ROCm Libraries and HIP Kernels with ease, removing the need for intricate code rewrites.
Businesses in sectors like weather modeling, pharmaceuticals, and aerospace may now leverage the potential of GPU acceleration without requiring the kind of substantial code overhauls that were previously necessary to future-proof their older HPC systems. This comprehensive tutorial will help you get started with the AMD Fortran Compiler on AMD Instinct GPUs.
New Multi-Node FFT in rocFFT: Game changer for HPC Workflows
Distributed computing systems that scale well are necessary for industries that depend on HPC workloads, such as oil and gas and climate modeling. High-performance distributed FFT calculations are made possible by ROCm 6.3, which adds multi-node FFT functionality to rocFFT.
Why It Matters for HPC
The integration of the built-in Message Passing Interface (MPI) streamlines multi-node scalability, lowering developer complexity and hastening the deployment of distributed applications.
Leadership-Class Scalability: Scale fluidly across large datasets, optimizing performance for crucial workloads like climate modeling and seismic imaging.
Larger datasets may now be processed more efficiently by organizations in sectors like scientific research and oil and gas, resulting in quicker and more accurate decision-making.
Enhanced Computer Vision Libraries: AV1, rocJPEG, and Beyond
AI developers need effective preprocessing and augmentation tools when dealing with contemporary media and datasets. With improvements to its computer vision libraries, rocDecode, rocJPEG, and rocAL, ROCm 6.3 enables businesses to take on a variety of tasks, from dataset augmentation to video analytics.
Why It Is Important to You
Support for the AV1 Codec: rocDecode and rocPyDecode provide affordable, royalty-free decoding for contemporary media processing.
GPU-Accelerated JPEG Decoding: Use the rocJPEG library’s built-in fallback methods to perform image preparation at scale with ease.
Better Audio Augmentation: Using the rocAL package, preprocessing has been enhanced for reliable model training in noisy situations.
From entertainment and media to autonomous systems, these capabilities let engineers build more sophisticated AI solutions for practical use cases.
It’s important to note that, in addition to these noteworthy improvements, Omnitrace and Omniperf, which were first released in ROCm 6.2, have been renamed the ROCm System Profiler and ROCm Compute Profiler. This rebranding brings improved usability, reliability, and smooth integration into the existing ROCm profiling environment.
Why ROCm 6.3?
With each release, AMD ROCm has advanced, and version 6.3 is no different. It offers state-of-the-art tools that streamline development and improve speed and scalability for AI and HPC workloads. By embracing the open-source philosophy and continuously evolving to meet developer demands, ROCm enables companies to innovate faster, scale smarter, and stay ahead in competitive markets.
Ready to take the leap? Explore ROCm 6.3’s full potential and discover how AMD Instinct accelerators can support the next significant innovation in your company.
Read more on Govindhtech.com
Radeon PRO V710 For Cloud Gaming, AI/ML On Microsoft Azure

AMD Radeon PRO V710
AMD’s Radeon PRO V710 is now available on Microsoft Azure. Unveiled today, the Radeon PRO V710 is the newest GPU in AMD’s family of visual cloud GPUs and brings new capabilities to the public cloud, now offered on Microsoft Azure in private preview.
Using the open-source AMD ROCm software, the AMD Radeon PRO V710’s 54 Compute Units, together with 28 GB of VRAM, 448 GB/s of memory bandwidth, and 54 MB of L3 AMD Infinity Cache, support light-to-medium ML inference workloads and small-model training.
Combined with support for PCI Express SR-IOV compliant hardware virtualization, instances built around the Radeon PRO V710 can provide strong isolation between virtual machines sharing the same physical GPU, as well as between host and guest environments. The efficient RDNA 3 design delivers excellent performance per watt and allows a single-slot, passively cooled form factor that complies with PCIe CEM specifications.
Modern PC games run smoothly with complex visual effects enabled, thanks to AMD Infinity Cache technology and ray tracing performance that improves on AMD RDNA 2. For streaming, hardware video encoders support AV1, HEVC (H.265), and AVC (H.264). Machine learning performance is boosted by AI accelerators for efficient matrix multiplication and by support for AMD’s open-source ROCm software.
Cloud gaming, AI/ML use cases, desktop-as-a-service, and workstation-as-a-service are all excellent fits for the Radeon PRO V710.
A series of instances built on the Radeon PRO V710 and tailored for various GPU-accelerated applications will be available via Microsoft Azure. Azure Kubernetes Service (AKS) will support V710-based Linux virtual machines in addition to Windows and Linux operating systems, making the deployment of container-based workflows easier.
NVads V710 V5-series
AMD Radeon Pro V710 GPUs and AMD EPYC 9374F (Genoa) CPUs, which have a base frequency of 3.8 GHz and an all-core max frequency of 4.3 GHz, power the NVads V710 v5 series virtual machines. Virtual machines (VMs) use AMD’s Simultaneous Multithreading technology to allocate specific vCPU threads to every VM. VMs running Linux and Windows are supported.
Five configurations are available in the series, ranging from a full V710 GPU with a 24-GiB frame buffer down to 1/6 of a GPU with a 4-GiB frame buffer. No additional GPU license is needed to use AMD GPU-based virtual machines. The NVads V710 v5 virtual machines also support NVMe for ephemeral local storage.
To deliver a smooth end-user experience and a cost-effective option for a broad spectrum of graphics-enabled virtual desktops, the NVads V710 v5-series supports right-sizing for demanding GPU-accelerated graphics applications and cloud-based virtual desktops. The VMs are also sized to provide excellent, engaging cloud gaming experiences, with complex graphics rendering and streaming optimized.
By using the compute IP blocks in the Radeon PRO V710 GPUs, the NVads V710 v5-series virtual machines also handle small to medium AI/ML inference workloads, including recommendation systems, semantic indexing, and SLMs.
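As a rough sketch of what getting started with such a small inference workload might look like on one of these VMs, the snippet below checks that the V710 is visible to a ROCm build of PyTorch and runs a tiny forward pass; the toy model and sizes are placeholders, not anything specified in this post.

```python
# Minimal sketch of verifying that a (fractional or full) V710 GPU is visible
# to PyTorch's ROCm build inside an NVads V710 v5 VM, then running a tiny
# inference pass. The "model" is a stand-in for a real small workload.
import torch

if torch.cuda.is_available():  # ROCm builds report the AMD GPU here
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No ROCm-visible GPU found; falling back to CPU.")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in network for a small inference workload.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to(device).eval()

with torch.no_grad():
    batch = torch.randn(32, 512, device=device)
    print(model(batch).shape)  # torch.Size([32, 10])
```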
Summary
The newly released Azure NVads V710 v5-series virtual machines, which launched into public preview on October 3, 2024, have the AMD Radeon PRO V710 GPU. The AMD Radeon Pro V710 GPU and AMD EPYC 9374F (Genoa) CPUs power these virtual machines (VMs), which are intended for cloud-based virtual desktops and graphics acceleration via GPU. The virtual machines (VMs) provide varying degrees of GPU use, from partial to full, to accommodate various workloads.
Since this series is designed for cloud-based applications like rendering and design that need high-performance graphics, the V710 is a compelling option for anybody wishing to take advantage of cloud infrastructure and AMD’s GPU capabilities.
Read more on Govindhtech.com