MLPerf Inference v4.1 For AMD Instinct MI300X Accelerators

Engineering Insights: Introducing AMD Instinct MI300X Accelerators’ MLPerf Results. The full-stack AMD inference platform demonstrated its strength with the remarkable results that AMD Instinct MI300X GPUs, powered by one of the most recent releases of the open-source ROCm stack, achieved in the MLPerf Inference v4.1 round.
LLaMA2-70B
The first submission concentrated on the well-known LLaMA2-70B model, which is renowned for its strong performance and adaptability. By outperforming the NVIDIA H100 in generative AI inference, it set a high standard for what AMD Instinct MI300X accelerators are capable of.
MLPerf Inference
Comprehending MLPerf and Its Relevance to the Industry
Efficient and economical performance is becoming more and more important for inference and training as large language models (LLMs) continue to grow in size and complexity. Robust parallel processing and an optimal software stack are necessary to achieve high-performance LLMs.
This is where MLPerf, the industry’s leading benchmarking suite, comes into play. MLPerf Inference is a set of open-source AI benchmarks created by the cross-industry consortium MLCommons, of which AMD is a founding member; it covers generative AI, LLMs, and other models, and provides rigorous, peer-reviewed criteria. Businesses can use these benchmarks to assess the effectiveness of AI hardware and software.
Excelling in MLPerf Inference v4.1 is a major accomplishment for AMD and demonstrates its dedication to openness and to providing standardized data that enables businesses to make informed choices.
An Extensive Analysis of the LLaMA2-70B Benchmark
AMD’s first MLPerf Inference submission used the LLaMA2-70B model. A major development in LLMs, LLaMA2-70B is essential for practical applications such as large-scale inference and natural language processing. The MLPerf benchmarking test covers a Q&A scenario using 24,576 samples from the OpenORCA dataset, each with up to 1,024 input and 1,024 output tokens. The benchmark evaluates inference performance under two scenarios:
Offline Scenario: Queries are processed in batches to maximize throughput in tokens per second.
Server Scenario: This scenario tests the hardware’s capacity to deliver quick, responsive performance for low-latency workloads by simulating real-time queries with stringent latency limits (time to first token, TTFT, under 2 s; time per output token, TPOT, at or below 200 ms).
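For clarity, here is a minimal illustrative sketch (in Python) of how TTFT and TPOT are computed from per-request timestamps; the function and variable names are our own and not part of the MLPerf LoadGen harness, which measures these metrics internally.

```python
# Illustrative computation of TTFT (time to first token) and TPOT
# (time per output token) from per-request timestamps.

def latency_metrics(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """Return (ttft_seconds, tpot_seconds) for one request.

    request_time -- wall-clock time the query was issued
    token_times  -- wall-clock times at which each output token arrived
    """
    ttft = token_times[0] - request_time
    # TPOT averages the gaps between consecutive output tokens.
    tpot = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
            if len(token_times) > 1 else 0.0)
    return ttft, tpot

# The MLPerf v4.1 server scenario requires TTFT < 2 s and TPOT <= 200 ms.
ttft, tpot = latency_metrics(0.0, [1.4, 1.55, 1.7, 1.85])
assert ttft < 2.0 and tpot <= 0.200
```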
Performance of AMD Instinct MI300X in MLPerf
With four important entries for the LLaMA2-70B model, the AMD Instinct MI300X demonstrated remarkable performance in its first MLPerf Inference round, using the Supermicro AS-8125GS-TNMR2 server. These findings are especially noteworthy because they provide an apples-to-apples comparison with rival AI accelerators, are reproducible, vetted by peer review, and grounded in industry-relevant use cases.
Combination Performance of CPU and GPU
Submission ID 4.1-0002: Two AMD EPYC 9374F (Genoa) CPUs paired with eight AMD Instinct MI300X accelerators in the Available category.
This setup demonstrated the strong synergy between 4th Gen EPYC CPUs (previously codenamed “Genoa”) and AMD Instinct MI300X GPU accelerators for AI workloads, delivering performance within 2-3% of an NVIDIA DGX H100 with 4th Gen Intel Xeon CPUs in both the server and offline scenarios at FP8 precision.
Previewing Next-Generation CPU Performance
Submission ID 4.1-0070: Two AMD EPYC “Turin” CPUs and eight AMD Instinct MI300X accelerators in the Preview category.
This entry showcased the performance gains from the upcoming 5th Gen AMD EPYC “Turin” CPUs when paired with AMD Instinct MI300X GPU accelerators. In the server scenario it outperformed the NVIDIA DGX H100 with Intel Xeon by a small margin, and it maintained a comparable level of performance in the offline scenario at FP8 precision.
Efficiency of a Single GPU
Submission ID 4.1-0001: One AMD Instinct MI300X accelerator with AMD EPYC 9374F 4th Gen (Genoa) CPUs in the Available category.
This submission highlighted the AMD Instinct MI300X’s large 192 GB memory, which allowed a single GPU to execute the whole LLaMA2-70B model at FP8 precision without the networking overhead that comes with splitting the model across many GPUs.
Thanks to its AMD CDNA 3 architecture, the AMD Instinct MI300X has 192 GB of HBM3 memory and a peak memory bandwidth of 5.3 TB/s. Because of this large capacity, the AMD Instinct MI300X can host and execute a whole 70-billion-parameter model, such as LLaMA2-70B, on a single GPU with ease.
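As a rough illustration (a back-of-the-envelope sketch, not AMD’s sizing methodology), the weight footprint alone shows why 192 GB is enough for a 70-billion-parameter model:

```python
# Back-of-the-envelope weight-memory estimate for LLaMA2-70B on one MI300X.
# This ignores activations, KV cache, and runtime overhead; it is only illustrative.

params = 70e9                  # model parameters
hbm_capacity_gb = 192          # MI300X HBM3 capacity

weights_fp16_gb = params * 2 / 1e9   # FP16: two bytes per weight, ~140 GB
weights_fp8_gb = params * 1 / 1e9    # FP8: one byte per weight, ~70 GB

print(f"FP16 weights: {weights_fp16_gb:.0f} GB of {hbm_capacity_gb} GB")
print(f"FP8 weights:  {weights_fp8_gb:.0f} GB of {hbm_capacity_gb} GB")
# Either way the full model fits on a single GPU; FP8 leaves ample headroom
# for the KV cache.
```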
The findings in Figure 2 show that the scaling efficiency with the ROCm software stack is almost linear from 1x AMD Instinct MI300X (TP1) to 8x AMD Instinct MI300X (8x TP1), indicating that AMD Instinct MI300X can handle the biggest MLPerf inference model to date.
Outstanding Dell Server Results Using AMD Instinct MI300X Accelerators
Submission ID 4.1-0022: Two Intel Xeon Platinum 8460Y+ processors and eight AMD Instinct MI300X accelerators in the Available category.
Along with the AMD submissions, Dell submitted its own results using its PowerEdge XE9680 server and LLaMA2-70B, validating the platform-level performance of AMD Instinct accelerators in an 8x AMD Instinct MI300X configuration. This submission demonstrates the two companies’ collaboration and highlights how strong the ecosystem is, making the platform a great option for deployments spanning both data centers and edge inference. Further information on these results is available here.
Engineering Insights Behind the Performance
AMD Instinct MI300X accelerators deliver highly competitive performance thanks to their high compute throughput, large memory capacity with high bandwidth, and the optimized ROCm software stack. The latter enables efficient processing of large AI models such as LLaMA2-70B. A few elements were pivotal:
Large GPU Memory Capacity
The AMD Instinct MI300X has the largest GPU memory currently on the market, which lets the whole LLaMA2-70B model fit into memory while still leaving room for the KV cache. By avoiding splitting the model among GPUs, this maximizes inference speed and avoids networking overhead.
Batch Sizes: The team set the max_num_seqs parameter to 2048 in the offline scenario to maximize throughput, and to 768 in the server scenario to meet the latency requirements; both values are much larger than vLLM’s default of 256 (a configuration sketch follows below).
KV Cache Management: vLLM’s paged-attention support enables effective KV cache management and prevents memory fragmentation, which helps make full use of the large memory on AMD Instinct MI300X accelerators.
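A minimal sketch of how these limits might be passed to vLLM is shown below; the model path, parallelism setting, and prompt are illustrative assumptions, not the exact configuration of the AMD submission.

```python
# Sketch: raising vLLM's concurrent-sequence limit above the 256 default.
# Requires a GPU with enough memory and access to the model weights.
from vllm import LLM, SamplingParams

# Offline scenario: maximize throughput with many in-flight sequences.
offline_engine = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative checkpoint
    tensor_parallel_size=1,                  # one MI300X holds the full model
    max_num_seqs=2048,                       # vs. the vLLM default of 256
)

# Server scenario: a lower limit keeps TTFT/TPOT within the MLPerf bounds.
server_engine_args = dict(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=1,
    max_num_seqs=768,
)

sampling = SamplingParams(max_tokens=1024)
outputs = offline_engine.generate(["What is MLPerf Inference?"], sampling)
print(outputs[0].outputs[0].text)
```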
FP8 Precision
AMD expanded support for the FP8 numerical format throughout the whole inference software stack, leveraging the FP8 capabilities of the AMD Instinct MI300X accelerator hardware. The team quantized the LLaMA2-70B model weights to FP8 using Quark while maintaining the 99.9% accuracy required by MLPerf. To further improve speed, AMD enhanced the hipBLASLt library, added FP8 support to vLLM, and implemented FP8 KV caching.
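vLLM exposes FP8 options that correspond roughly to these optimizations. The sketch below assumes a hypothetical Quark-quantized checkpoint path, and exact flag support varies by vLLM and ROCm version.

```python
# Sketch: serving an FP8-quantized model with an FP8 KV cache in vLLM.
# The checkpoint path is hypothetical; flag availability depends on the vLLM build.
from vllm import LLM

engine = LLM(
    model="/models/llama2-70b-fp8-quark",  # hypothetical Quark-quantized checkpoint
    quantization="fp8",                    # use FP8 weights/compute where supported
    kv_cache_dtype="fp8",                  # store the KV cache in FP8 as well
    tensor_parallel_size=1,
)
```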
Software Enhancements
Kernel Optimization: AMD profiled and optimized many kernels, including AMD Composable Kernel (CK) based prefill attention, FP8 decode paged attention, and fused kernels such as residual-add RMSNorm and SwiGLU with FP8 output scaling (a numerics-level sketch of the fused residual-add RMSNorm operation follows after this list).
vLLM Enhancements: The scheduler was improved to optimize both offline and server use cases, allowing quicker decode scheduling and better prefill batching.
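To show what the fused residual-add RMSNorm operation computes, here is a numerics-level PyTorch sketch; the actual Composable Kernels implementation fuses these steps into a single GPU kernel, and the tensor shapes below are illustrative.

```python
# Numerics-level sketch of the fused residual-add + RMSNorm operation.
# A real CK kernel performs these steps in one GPU pass; this only shows the math.
import torch

def residual_add_rmsnorm(x: torch.Tensor,
                         residual: torch.Tensor,
                         weight: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    h = x + residual                                            # residual add
    rms = torch.rsqrt(h.pow(2).mean(dim=-1, keepdim=True) + eps)
    return h * rms * weight                                     # RMSNorm with learned scale

hidden = torch.randn(4, 8192)   # (batch, hidden_size), illustrative shapes
skip = torch.randn(4, 8192)
gamma = torch.ones(8192)
print(residual_add_rmsnorm(hidden, skip, gamma).shape)  # torch.Size([4, 8192])
```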
CPU Enhancement
While GPUs handle the majority of AI workload processing, CPU performance still matters. CPUs with fewer cores and higher peak frequencies, such as the 32-core EPYC 9374F, offer the best performance, particularly in the server scenario. Testing with the forthcoming “Turin” generation of EPYC CPUs, submitted in the Preview category, showed performance improvements over the 4th Gen EPYC CPUs.
LLaMa 3.1 405B
Establishing a Standard for the Biggest Model
AMD Instinct MI300X GPU accelerators have shown their performance in MLPerf Inference with LLaMA2-70B, and the positive outcomes set a solid precedent for their efficacy with even bigger models, such as Llama 3.1. AMD is pleased to provide Day 0 support for Meta’s new LLaMa 3.1 405B-parameter model on AMD Instinct MI300X accelerators.
Owing to the industry-leading memory capacity of the AMD Instinct MI300X platform [MI300-25], only a server powered by eight AMD Instinct MI300X GPU accelerators can fit the whole 405-billion-parameter LLaMa 3.1 model on a single server using the FP16 datatype [MI300-7A]: at two bytes per parameter the weights alone take roughly 810 GB, comfortably within the platform’s 8 x 192 GB = 1.5 TB of HBM3. This lowers costs and reduces the number of servers required. AMD Instinct MI300X accelerators are an ideal way to power the biggest open models on the market today.
Read more on govindhtech.com
NVIDIA NeMo Retriever Microservices Improve LLM Accuracy

NVIDIA NIM inference microservices
Businesses can unleash the potential of their business data with production-ready NVIDIA NIM inference microservices for retrieval-augmented generation, integrated into the Cohesity, DataStax, NetApp, and Snowflake platforms. The new NVIDIA NeMo Retriever microservices boost LLM accuracy and throughput.
Applications of generative AI are worthless, or even harmful, without accuracy, and data is the foundation of accuracy.
NVIDIA today unveiled four new NVIDIA NeMo Retriever NIM inference microservices, designed to assist developers in quickly retrieving the best proprietary data to produce informed responses for their AI applications.
When coupled with the NVIDIA NIM inference microservices for the Llama 3.1 model collection, also announced today, NeMo Retriever NIM microservices allow enterprises to scale to agentic AI workflows, where AI applications operate accurately with minimal supervision or intervention, while delivering the highest-accuracy retrieval-augmented generation (RAG).
Nemo Retriever
With NeMo Retriever, businesses can easily connect custom models to a variety of corporate data sources and use RAG to provide AI applications with highly accurate results. Put simply, these production-ready microservices enable highly accurate information retrieval, which in turn makes it possible to build highly accurate AI applications.
NeMo Retriever, for instance, can increase model throughput and accuracy for developers building AI agents and chatbots for customer support, identifying security flaws, or deriving meaning from intricate supply chain data.
NIM inference microservices make high-performance, user-friendly, enterprise-grade inferencing possible. The NeMo Retriever NIM microservices let developers build on all of this while putting their data to even greater use.
Nvidia Nemo Retriever
These recently released NeMo Retriever microservices for embedding and reranking NIM are now widely accessible:
NV-EmbedQA-E5-v5, a popular community embedding model optimized for text question-answering retrieval;
Snowflake-Arctic-Embed-L, an optimized community model;
NV-RerankQA-Mistral4B-v3, a popular community base model optimized for text reranking for high-accuracy question answering;
NV-EmbedQA-Mistral7B-v2, a popular multilingual community base model fine-tuned for text embedding for accurate question answering.
They join the family of NIM microservices that are conveniently available through the NVIDIA API catalogue.
Embedding and Reranking Models
The two model types that make up the NeMo Retriever microservices, embedding and reranking, are available in both open and commercial versions that ensure reliability and transparency.
An embedding model converts a variety of data types, including text, images, charts, and video, into numerical vectors that can be stored in a vector database, while preserving their meaning and nuances. Embedding models are faster and computationally cheaper than conventional large language models (LLMs).
A reranking model ingests data and a query, then ranks the data according to its relevance to the query. Reranking models are slower and more computationally complex than embedding models, but they provide notable improvements in accuracy.
NeMo Retriever microservices offer advantages over other options. Developers using NeMo Retriever microservices can build a pipeline that delivers the most accurate and helpful results for their company by employing an embedding NIM to cast a wide net of data to be retrieved, followed by a reranking NIM to trim the results for relevance.
Using the state-of-the-art open and commercial models available with NeMo Retriever NIM microservices, developers can build the most accurate text Q&A retrieval pipelines. Compared with alternative solutions, NeMo Retriever microservices produced 30% fewer inaccurate responses for enterprise question answering.
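The embed-then-rerank pattern can be sketched schematically as follows; the embed and rerank_score functions are hypothetical stand-ins for calls to the corresponding NIM endpoints, not NVIDIA’s client API.

```python
# Schematic two-stage retrieval: cast a wide net with embeddings, then rerank.
# embed() and rerank_score() are hypothetical stand-ins for NIM endpoint calls.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for an embedding NIM (e.g. NV-EmbedQA-E5-v5)."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    vectors = rng.normal(size=(len(texts), 1024))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def rerank_score(query: str, passage: str) -> float:
    """Stand-in for a reranking NIM (e.g. NV-RerankQA-Mistral4B-v3)."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

def retrieve(query: str, corpus: list[str], wide_k: int = 20, final_k: int = 3) -> list[str]:
    doc_vecs = embed(corpus)            # in practice precomputed and stored in a vector database
    query_vec = embed([query])[0]
    sims = doc_vecs @ query_vec         # cosine similarity (vectors are unit-normalized)
    candidates = [corpus[i] for i in np.argsort(-sims)[:wide_k]]      # wide net
    candidates.sort(key=lambda p: rerank_score(query, p), reverse=True)
    return candidates[:final_k]         # trimmed for relevance

docs = ["NIM microservices run on accelerated infrastructure.",
        "Reranking models trade speed for accuracy.",
        "Embedding models map text to vectors."]
print(retrieve("How do reranking models improve accuracy?", docs))
```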
Principal Use Cases for NeMo Retriever Microservices
NeMo Retriever microservices drive numerous AI applications, ranging from data-driven analytics to RAG and AI agent solutions.
With the help of NeMo Retriever microservices, intelligent chatbots with precise, context-aware responses can be created. They can assist in the analysis of enormous data sets to find security flaws. They can help glean insights from intricate supply chain data. Among other things, they can improve AI-enabled retail shopping advisors that provide organic, tailored shopping experiences.
For many use cases, NVIDIA AI workflows offer a simple, supported beginning point for creating generative AI-powered products.
NeMo Retriever NIM microservices are being used by dozens of NVIDIA data platform partners to increase the accuracy and throughput of their AI models.
NIM microservices
With NeMo Retriever NIM microservices integrated into its Hyper-Converged and Astra DB platforms, DataStax can provide customers with faster time to market through accurate, generative AI-enhanced RAG capabilities.
With the integration of NVIDIA NeMo Retriever microservices with Cohesity Gaia, the AI platform from Cohesity will enable users to leverage their data to drive smart and revolutionary generative AI applications via RAG.
Utilising NVIDIA NeMo Retriever, Kinetica will create LLM agents that can converse naturally with intricate networks in order to react to disruptions or security breaches faster and translate information into prompt action.
In order to link NeMo Retriever microservices to exabytes of data on its intelligent data infrastructure, NetApp and NVIDIA are working together. Without sacrificing data security or privacy, any NetApp ONTAP customer will be able to “talk to their data” in a seamless manner to obtain proprietary business insights.
Services to assist businesses in integrating NeMo Retriever NIM microservices into their AI pipelines are being developed by NVIDIA’s global system integrator partners, which include Accenture, Deloitte, Infosys, LTTS, Tata Consultancy Services, Tech Mahindra, and Wipro, in addition to their service delivery partners, Data Monsters, EXLService (Ireland) Limited, Latentview, Quantiphi, Slalom, SoftServe, and Tredence.
Nvidia NIM Microservices
Utilize Alongside Other NIM Microservices
NeMo Retriever microservices can be used together with NVIDIA Riva NIM microservices, which power voice AI applications across industries, improving customer service and bringing digital humans to life.
New models that will soon be available as Riva NIM microservices include the record-breaking NVIDIA Parakeet family of automatic speech recognition models, FastPitch and HiFi-GAN for text-to-speech applications, and Megatron for multilingual neural machine translation.
The modular nature of NVIDIA NIM microservices allows developers to create AI applications in a variety of ways. To give developers even more freedom, the microservices can be connected with community models, NVIDIA models, or users’ bespoke models in the cloud, on-premises, or in hybrid settings.
Businesses may use NIM to implement AI apps in production by utilising the NVIDIA AI Enterprise software platform.
NVIDIA-Certified Systems from international server manufacturing partners like Cisco, Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, as well as cloud instances from Amazon Web Services, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure, can run NIM microservices on customers’ preferred accelerated infrastructure.
Members of the NVIDIA Developer Program will soon have free access to NIM for
Read more on govindhtech.com