#MetaLLaMA | Explore Tumblr posts and blogs

govindhtech · 8 months ago

Text

AMD Ryzen AI 300 Series Improves LM Studio And llama.cpp

Using AMD Ryzen AI 300 Series to Speed Up Llama.cpp Performance in Consumer LLM Applications.

What is Llama.cpp?

Meta’s LLaMA language model should not be confused with Llama.cpp. It is a tool, nonetheless, that was created to improve Meta’s LLaMA so that it can operate on local hardware. Because of their very high computational expenses, LLaMA and ChatGPT currently have trouble operating on local computers and hardware. Despite being among of the best-performing models available, they are somewhat demanding and inefficient to run locally since they need a significant amount of processing power and resources.

Here’s where llama.cpp is useful. It offers a lightweight, resource-efficient, and lightning-fast solution for LLaMA models using C++. It even eliminates the need for a GPU.

Features of Llama.cpp

Let’s examine Llama.cpp’s features in further detail and see why it’s such a fantastic complement to Meta’s LLaMA language paradigm.

Cross-Platform Compatibility

One of those features that is highly valued in any business, whether it gaming, artificial intelligence, or other software types, is cross-platform compatibility. It’s always beneficial to provide developers the flexibility to execute applications on the environments and systems of their choice, and llama.cpp takes this very seriously. It is compatible with Windows, Linux, and macOS and works perfectly on any of these operating systems.

Efficient CPU Utilization

The majority of models need a lot of GPU power, including ChatGPT and even LLaMA itself. Because of this, running them most of the time is quite costly and power-intensive. This idea is turned on its head by Llama.cpp, which is CPU-optimized and ensures that you receive respectable performance even in the absence of a GPU. Even while a GPU will provide superior results, it’s still amazing that running these LLMs locally doesn’t cost hundreds of dollars. Additionally encouraging for the future is the fact that it was able to tweak LLaMA to operate so effectively on CPUs.

Memory Efficiency

Llama.cpp excels at more than just CPU economy. Even on devices without strong resources, LLaMA models can function successfully by controlling the llama token limit and minimizing memory utilization. Successful inference depends on striking a balance between memory allocation and the llama token limit, which is something that llama.cpp excels at.

Getting Started with Llama.cpp

The popularity of creating beginner-friendly tools, frameworks, and models is at an all-time high, and llama.cpp is no exception. Installing it and getting started are rather simple processes.

You must first clone the llama.cpp repository in order to begin.

It’s time to create the project when you’ve finished cloning the repository.

Once your project is built, you may use your LLaMA model to do llama inference. The following code must be entered in order to utilize the llama.cpp library to do inference:

./main -m ./models/7B/ -p “Your prompt here” To change the output’s determinism, you may play about with the llama inference parameters, such llama temperature. The llama prompt format and prompt may be specified using the -p option, and llama.cpp will take care of the rest.

An overview of LM Studio and llama.cpp

Since GPT-2, language models have advanced significantly, and users may now rapidly and simply implement very complex LLMs using user-friendly programs like LM Studio. These technologies, together with AMD, enable AI to be accessible to all people without the need for technical or coding skills.

The llama.cpp project, a well-liked framework for rapidly and simply deploying language models, is the foundation of LM Studio. Despite having GPU acceleration available, it is independent and may be accelerated only using the CPU. Modern LLMs for x86-based CPUs are accelerated by LM Studio using AVX2 instructions.

Performance comparisons: throughput and latency

AMD Ryzen AI provides leading performance in llama.cpp-based programs such as LM Studio for x86 laptops and speeds up these cutting-edge tasks. Note that memory speeds have a significant impact on LLMs in general. When the compared the two laptops, the AMD laptop had 7500 MT/s of RAM while the Intel laptop had 8533 MT/s.Image Credit To AMD

Despite this, the AMD Ryzen AI 9 HX 375 CPU outperforms its rivals by up to 27% when considering tokens per second. The parameter that indicates how fast an LLM can produce tokens is called tokens per second, or tk/s. This generally translates to the amount of words that are shown on the screen per second.

Up to 50.7 tokens per second may be produced by the AMD Ryzen AI 9 HX 375 CPU in Meta Llama 3.2 1b Instruct (4-bit quantization).

The “time to first token” statistic, which calculates the latency between the time you submit a prompt and the time it takes for the model to begin producing tokens, is another way to benchmark complex language models. Here, it can see that the AMD “Zen 5” based Ryzen AI HX 375 CPU is up to 3.5 times quicker than a similar rival processor in bigger versions.Image Credit To AMD

Using Variable Graphics Memory (VGM) to speed up model throughput in Windows

Every one of the AMD Ryzen AI CPU’s three accelerators has a certain workload specialty and set of situations in which they perform best. On-demand AI activities are often handled by the iGPU, while AMD XDNA 2 architecture-based NPUs provide remarkable power efficiency for permanent AI while executing Copilot+ workloads and CPUs offer wide coverage and compatibility for tools and frameworks.

With the vendor-neutral Vulkan API, LM Studio’s llama.cpp port may speed up the framework. Here, acceleration often depends on a combination of Vulkan API driver improvements and hardware capabilities. Meta Llama 3.2 1b Instruct performance increased by 31% on average when GPU offload was enabled in LM Studio as opposed to CPU-only mode. The average uplift for larger models, such as Mistral Nemo 2407 12b Instruct, which are bandwidth-bound during the token generation phase, was 5.1%.

In comparison to CPU-only mode, it found that the competition’s processor saw significantly worse average performance in all but one of the evaluated models while utilizing the Vulkan-based version of llama.cpp in LM Studio and turning on GPU-offload. In order to maintain fairness in the comparison, it have excluded the GPU-offload performance of the Intel Core Ultra 7 258v from LM Studio’s Vulkan back-end, which is based on Llama.cpp.

Another characteristic of AMD Ryzen AI 300 Series CPUs is Variable Graphics Memory (VGM). Programs usually use the second block of memory located in the “shared” section of system RAM in addition to the 512 MB block of memory allocated specifically for an iGPU. The 512 “dedicated” allotment may be increased by the user using VGM to up to 75% of the system RAM that is available. When this contiguous memory is present, memory-sensitive programs perform noticeably better.

Using iGPU acceleration in conjunction with VGM, it saw an additional 22% average performance boost in Meta Llama 3.2 1b Instruct after turning on VGM (16GB), for a net total of 60% average quicker speeds when compared to the CPU. Performance improvements of up to 17% were seen even for bigger models, such as the Mistral Nemo 2407 12b Instruct, when compared to CPU-only mode.

Side by side comparison: Mistral 7b Instruct 0.3

It compared iGPU performance using the first-party Intel AI Playground application (which is based on IPEX-LLM and LangChain) in order to fairly compare the best consumer-friendly LLM experience available, even though the competition’s laptop did not provide a speedup using the Vulkan-based version of Llama.cpp in LM Studio.

It made use of the Microsoft Phi 3.1 Mini Instruct and Mistral 7b Instruct v0.3 models that came with Intel AI Playground. To observed that the AMD Ryzen AI 9 HX 375 is 8.7% quicker in Phi 3.1 and 13% faster in Mistral 7b Instruct 0.3 using a same quantization in LM Studio.Image Credit To AMD

AMD is committed to pushing the boundaries of AI and ensuring that it is available to everybody. Applications like LM Studio are crucial because this cannot occur if the most recent developments in AI are restricted to a very high level of technical or coding expertise. In addition to providing a rapid and easy method for localizing LLM deployment, these apps let users to experience cutting-edge models almost immediately upon startup (if the architecture is supported by the llama.cpp project).

AMD Ryzen AI accelerators provide amazing performance, and for AI use cases, activating capabilities like variable graphics memory may result in even higher performance. An amazing user experience for language models on an x86 laptop is the result of all of this.

Read more on Govindhtech.com

#AMDRyzen #AMDRyzenAI300 #ChatGPT #MetaLLaMA #Llama.cpp #languagemodels #MetaLlama #AMDXDNA #IntelCoreUltra7 #MistralNemo #LMStudio #News #Technews #Technology #Technologynews #Technologytrends #govindhtech

0 notes

govindhtech · 8 months ago

Text

AMD EPYC 9005: 5th Gen AMD EPYC CPU With Zen 5 Design

AMD EPYC 5th Gen Processors

The AMD EPYC 9005 family of processors, designed specifically to speed up workloads in data centers, the cloud, and artificial intelligence, are pushing the boundaries of corporate computing performance.

The world’s top server CPU for cloud, AI, and corporate applications, the 5th Gen AMD EPYC processors, originally codenamed “Turin,” are now available, according to AMD.

The AMD EPYC 9005 Series processors build on the record-breaking performance and energy efficiency of the previous generations with the “Zen 5” core architecture, compatible with the widely used SP5 platform and offering a wide range of core counts from 8 to 192. The top of the stack 192 core CPU can deliver up to 2.7X the performance compared to the competition.

AMD EPYC 9575F

The 64-core AMD EPYC 9575F, a new CPU in the AMD EPYC 9005 Series, is designed specifically for GPU-powered AI applications that need the highest host CPU performance. Boosting up to 5GHz5, it offers up to 28% quicker processing than the competition’s 3.8GHz CPU, which is necessary to keep GPUs loaded with data for demanding AI applications.

The World’s Best CPU for Enterprise, AI and Cloud Workloads

From supporting corporate AI-enablement programs to powering massive cloud-based infrastructures to hosting the most demanding business-critical apps, modern data centers handle a wide range of workloads. For the wide range of server workloads that power corporate IT today, the new 5th Gen AMD EPYC processors provide industry-leading performance and capabilities.

AI, HPC, and business computing get up to 37% and 17% more instructions per clock (IPC) with the new “Zen 5” core design. and cloud applications than “Zen 4.”6.

When comparing AMD EPYC 9965 processor-based servers to Intel Xeon 8592+ CPU-based servers, users may anticipate significant improvements in their real-world workloads and applications, including:

Results in commercial applications like video transcoding may be obtained up to 4 times quicker.

The time to insights for scientific and HPC applications that address the most difficult challenges in the world may be up to 3.9 times faster.

Performance per core in virtualized infrastructure may increase by up to 1.6X.

Whether they are running a CPU or a CPU + GPU system, 5th Gen AMD EPYC processors allow clients to drive quick time to insights and deployments for AI installations in addition to providing leading performance and efficiency in general purpose workloads.

In contrast to the competition:

In order to drive an efficient approach to generative AI, the 192 core EPYC 9965 CPU can perform up to 3.7X better on end-to-end AI workloads, such as TPCx-AI (derivative).

In enterprise-class generative AI models of small and medium sizes, such as Meta Llama 3.1-8B, the EPYC 9965 offers 1.9 times the throughput performance of its competitors.

Lastly, a 1,000 node AI cluster may push up to 700,000 additional inference tokens per second with the aid of the EPYC 9575F, a specially designed AI host node CPU, and its 5GHz maximum frequency increase. doing more tasks more quickly.

Customers can achieve 391,000 units of SPECrate2017_int_base general purpose computing performance by modernizing to a data center powered by these new processors. This allows them to use approximately 87% fewer servers and an estimated 71% less power while still receiving impressive performance for a variety of workloads. This offers CIOs the choice to boost performance for routine IT activities while achieving remarkable AI performance, or they may take advantage of the space and power reductions.

AMD EPYC CPUs: Pioneering the Next Innovation Wave

EPYC CPUs have been widely used to power the most demanding computing operations due to their demonstrated performance and extensive ecosystem support among partners and consumers. AMD EPYC CPUs enable clients rapidly and effectively create value in their data centers and IT environments with their industry-leading performance, features, and density.

Features of the 5th Gen AMD EPYC

With support from Cisco, Dell, Hewlett Packard Enterprise, Lenovo, Supermicro, and all major ODMs and cloud service providers, the whole array of 5th Gen AMD EPYC processors is now available, offering businesses looking to lead in compute and AI a straightforward upgrade path.

The AMD EPYC 9005 series CPUs include the following high-level features:

Options for the leadership core count range from 8 to 192 per CPU.

The main architectures for “Zen 5” and “Zen 5c”

Each CPU has 12 DDR5 memory channels.

Up to DDR5-6400 MT/s of support

Up to 5GHz5 leadership increase frequencies

The whole 512b data route for AVX-512

Every component in the series is undergoing FIPS certification, and trusted I/O for confidential computing