#LLMinference
Using GPU Utilization To Scale Inference Servers Efficiently
Reduce GPU usage with more intelligent autoscaling for your GKE inferencing tasks.
The amount of time the GPU is active, or its duty cycle, is represented by GPU utilization.
Running LLM inference workloads can be expensive, even though LLMs provide enormous benefit for a growing number of use cases. If you're using the most recent open models and infrastructure, autoscaling can help you optimize costs, ensuring that you meet customer demand while paying only for the AI accelerators you actually need.
Your LLM inference workloads may be easily deployed, managed, and scaled with Google Kubernetes Engine (GKE), a managed container orchestration service. Horizontal Pod Autoscaler (HPA) is a quick and easy solution to make sure your model servers scale with load when you set up your inference workloads on GKE. You can attain your intended inference server performance goals by adjusting the HPA parameters to match your provisioned hardware expenses to your incoming traffic demands.
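As a concrete reference point, HPA's documented scaling rule is a simple proportional formula: the desired replica count is the current count scaled by the ratio of the observed metric value to its target. A minimal sketch in Python:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core proportional rule HPA uses to pick a replica count:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Example: 4 replicas, observed average queue size of 60 requests, target of 25 per replica.
print(desired_replicas(4, 60, 25))  # -> 10, so HPA would scale from 4 to 10 replicas
```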
Because configuring autoscaling for LLM inference workloads can be difficult, Google evaluated several autoscaling metrics on GPUs using ai-on-gke/benchmarks in order to establish best practices. This configuration uses HPA and the Text Generation Inference (TGI) model server. Keep in mind that these experiments also apply to other inference servers, such as vLLM, that expose comparable metrics.
Selecting the appropriate metric
Here are a few sample experiments from the metrics comparison, displayed using Cloud Monitoring dashboards. For each experiment, Google ran TGI with Llama 2 7B on a g2-standard-16 machine with a single L4 GPU, using the HPA custom metrics Stackdriver adapter, and generated traffic with varying request sizes using the ai-on-gke locust-load-generation tool. The same traffic load was used for every experiment shown below, and the following thresholds were determined by experimentation.
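As a rough sketch of how that kind of load generation works (this is a generic Locust user, not the exact ai-on-gke tool; the endpoint path and payload fields follow TGI's documented REST API but should be treated as assumptions):

```python
from locust import HttpUser, task, between

class TGIUser(HttpUser):
    """Minimal load-generation sketch; the real ai-on-gke tool varies request sizes."""
    wait_time = between(0.5, 2.0)  # seconds of think time between requests per simulated user

    @task
    def generate(self):
        # TGI's REST API accepts a prompt plus generation parameters at /generate.
        self.client.post(
            "/generate",
            json={
                "inputs": "Explain Kubernetes autoscaling in one sentence.",
                "parameters": {"max_new_tokens": 64},
            },
        )
```

Run it with `locust -f loadtest.py --host http://<tgi-service>` and adjust user count to shape the traffic pattern.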
Keep in mind that the mean-time-per-token graph shows TGI's metric for the total time spent on prefill and decoding, divided by the number of output tokens produced for each request. This metric lets us examine how autoscaling on each candidate metric affects latency.
GPU utilization
CPU and memory utilization are the default autoscaling metrics, and they work well for CPU-based workloads. Because inference servers rely heavily on GPUs, however, these metrics alone are no longer a reliable measure of job resource consumption. The comparable metric for GPUs is GPU utilization, which represents the GPU duty cycle, that is, the amount of time the GPU is active.
What is GPU utilization?
The percentage of a graphics processing unit’s (GPU) processing power that is being used at any given moment is known as GPU usage. GPUs are specialized hardware parts that manage intricate mathematical computations for parallel computing and graphic rendering.
With a target value threshold of 85%, the graphs below demonstrate HPA autoscaling on GPU utilization.
Image credit to Google Cloud
The request mean-time-per-token graph and the GPU utilization graph are not clearly related. HPA keeps scaling up because GPU utilization is rising despite a decline in request mean-time-per-token. GPU utilization is not a useful indicator for LLM autoscaling. This measure is hard to relate to the traffic the inference server is currently dealing with. Since the GPU duty cycle statistic does not gauge flop utilization, it cannot tell us how much work the accelerator is doing or when it is running at maximum efficiency. In comparison to the other measures below, GPU utilization tends to overprovision, which makes it inefficient from a financial standpoint.
In summary, Google does not advise autoscaling inference workloads with GPU utilization.
Batch size
Given the limitations of the GPU utilization metric, Google also looked into TGI's LLM server metrics. The metrics evaluated here are already offered by the most widely used inference servers.
One of the metrics chosen was batch size (tgi_batch_current_size), which indicates the number of requests handled in each iteration of inference.
With a target value threshold of 35, the graphs below demonstrate HPA autoscaling on the current batch size.
Image credit to Google Cloud
The request mean-time-per-token graph and the current batch size graph are directly related: smaller batch sizes yield lower latencies. Batch size is an excellent metric for optimizing for low latency because it gives a clear picture of the volume of traffic the inference server is currently handling. One drawback of the current batch size metric is that, because batch size can fluctuate slightly with different incoming request sizes, it was difficult to trigger scale-up while trying to reach the maximum batch size and, thus, maximum throughput. To make sure HPA would trigger a scale-up, Google had to select a target somewhat below the maximum batch size.
If you want to target a particular tail latency, we advise using the current batch size metric.
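As an illustration of what such a configuration might look like, the sketch below uses the Kubernetes Python client to create an HPA that scales a TGI Deployment on the average current batch size, using the target of 35 from the experiment above. The Deployment name, namespace, and the exact metric name exposed by your custom-metrics adapter are assumptions:

```python
from kubernetes import client, config

def create_batch_size_hpa(namespace: str = "default") -> None:
    """Sketch: an HPA that scales a TGI Deployment on the average
    tgi_batch_current_size reported per pod, targeting 35."""
    config.load_kube_config()
    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": "tgi-batch-size-hpa"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "tgi-server"},
            "minReplicas": 1,
            "maxReplicas": 16,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # The exact metric name depends on how your custom-metrics adapter exposes it.
                    "metric": {"name": "tgi_batch_current_size"},
                    "target": {"type": "AverageValue", "averageValue": "35"},
                },
            }],
        },
    }
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(namespace=namespace, body=hpa)

if __name__ == "__main__":
    create_batch_size_hpa()
```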
Queue size
Queue size (tgi_queue_size) was the other TGI LLM server metric evaluated. The queue size is the number of requests waiting in the inference server queue before being added to the current batch.
HPA scaling on queue size with a target value threshold of 10 is displayed in the graphs below.
Image credit to Google Cloud
*Note that when the default five-minute stabilization time ended, the HPA initiated a downscale, which is when the pod count dropped. This window for the stabilization period and other basic HPA configuration settings can be easily adjusted to suit your traffic needs.
We see that the request mean-time-per-token graph and the queue size graph are directly related: larger queue sizes mean higher latencies. Google found that queue size, which gives a clear picture of the volume of traffic the inference server is waiting to process, is an excellent metric for autoscaling inference workloads; a growing queue indicates that the current batch is full. Because queue size is based solely on the number of requests waiting, not on the number being processed at any given time, autoscaling on queue size cannot achieve latencies as low as autoscaling on batch size.
If you want to control tail latency and maximize throughput, we suggest using queue size.
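Both signals come straight from TGI's Prometheus endpoint, so they are easy to inspect before wiring them into an autoscaler. A minimal sketch, assuming a TGI container reachable on localhost:8080:

```python
import requests

def read_tgi_metrics(host: str = "http://localhost:8080") -> dict:
    """Pull TGI's Prometheus metrics and pick out the two autoscaling signals
    discussed above. The host and port are assumptions for a local TGI container."""
    wanted = ("tgi_queue_size", "tgi_batch_current_size")
    values = {}
    for line in requests.get(f"{host}/metrics", timeout=5).text.splitlines():
        if line.startswith(wanted):
            name, _, value = line.partition(" ")
            values[name] = float(value)
    return values

print(read_tgi_metrics())  # e.g. {'tgi_queue_size': 12.0, 'tgi_batch_current_size': 28.0}
```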
Finding the thresholds for goal values
To further demonstrate the strength of the queue and batch size metrics, Google used the profile-generator in ai-on-gke/benchmarks to determine appropriate thresholds for these experiments. It selected the thresholds as follows:
It determined the queue size at the point when only latency was increasing and throughput was no longer expanding in order to depict an optimal throughput workload.
It decided to autoscale on a batch size at a latency threshold of about 80% of the ideal throughput to simulate a workload that is sensitive to latency.
Image credit to Google Cloud
For each experiment, Google ran TGI with Llama 2 7B using a single L4 GPU per replica on two g2-standard-96 machines, allowing autoscaling between 1 and 16 replicas with the HPA custom metrics Stackdriver adapter. The locust-load-generation tool from ai-on-gke was used to generate traffic with varying request sizes. After finding a load that stabilized at about ten replicas, the load was increased by 150% to simulate traffic spikes.
Queue size
HPA scaling on queue size with a target value threshold of 25 is displayed in the graphs below.
Image credit to Google Cloud
We observe that even with the 150% traffic spikes, this target threshold keeps the mean time per token below ~0.4s.
Batch size
HPA scaling on batch size with a target value threshold of 50 is displayed in the graphs below.
Image credit to Google Cloud
Note that the roughly 60% decrease in traffic is reflected in a roughly 60% decrease in the average batch size.
We observe that even with the 150% traffic spikes, this target threshold keeps the mean latency per token below ~0.3s.
In contrast to the queue size threshold chosen at maximum throughput, the batch size threshold chosen at about 80% of maximum throughput keeps the mean time per token below 80% of the queue-size result.
In pursuit of improved autoscaling
Autoscaling on GPU utilization may overprovision LLM workloads, increasing the cost of achieving your performance objectives.
By autoscaling on LLM server metrics, you can spend as little as possible on accelerators while still meeting your latency or throughput targets: batch size lets you target a particular tail latency, while queue size lets you maximize throughput.
Read more on Govindhtech.com
Intel Data Center GPU SqueezeLLM Inference With SYCLomatic
Turn on SqueezeLLM for Efficient LLM Inference on Intel Data Center GPU Max Series utilizing SYCLomatic for Converting CUDA to SYCL.
In brief
Researchers at the University of California, Berkeley, have devised a revolutionary quantization technique called SqueezeLLM, which enables accurate and efficient generative LLM inference. Cross-platform compatibility, however, requires unique kernel implementations and hence more implementation work.
Using the SYCLomatic tool from the Intel oneAPI Base Toolkit to take advantage of CUDA-to-SYCL migration, they were able to immediately achieve a 2.0x speedup on Intel Data Center GPUs with 4-bit quantization without the need for manual tweaking. Because of this, cross-platform compatibility may be provided with little extra technical effort needed to adapt the kernel implementations to various hardware back ends.
SqueezeLLM: Accurate and Efficient Low-Precision Quantization for LLM Inference
LLM inference is becoming a common workload because it enables so many applications, but it is resource intensive and requires powerful hardware. Furthermore, because generative LLM inference requires generating output tokens sequentially, it suffers from minimal data reuse, unlike previous machine learning workloads, which have mostly been compute-bound. Low-precision quantization is one way to cut latency and memory use, but it is difficult to quantize LLMs to very low precision (less than 4 bits, for example) without an unacceptable loss of accuracy.
SqueezeLLM is a tool that UC Berkeley researchers created to enable accurate and efficient low-precision quantization. It incorporates two important advances to overcome shortcomings of previous approaches. First, it employs sensitivity-weighted non-uniform quantization, which uses sensitivity to determine the optimal placement of quantization codebook values, thereby maintaining model accuracy.
This approach addresses the inefficient representation of the underlying parameter distribution caused by the limitations of uniform quantization. Second, SqueezeLLM provides dense-and-sparse quantization, which handles the extreme outliers in LLM parameters by preserving outlier values in a compact sparse format, allowing the remaining parameters to be quantized to low precision.
SqueezeLLM uses non-uniform quantization to best represent the LLM weights at reduced precision. When generating the non-uniform codebooks, the technique takes into consideration not only the magnitude of values but also the sensitivity of parameters to error, offering excellent accuracy for low-precision quantization.
Dense-and-sparse quantization, which SqueezeLLM also employs, allows a tiny fraction of outlier values to be stored at higher precision. This reduces the range that the remaining dense component must represent, enabling precise low-precision quantization of the dense matrix.
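To make the two ideas concrete, the toy sketch below splits the largest-magnitude weights into an exactly stored sparse part and maps the dense remainder onto a small non-uniform codebook. The codebook here is built from simple quantiles rather than SqueezeLLM's sensitivity-weighted k-means, so treat it only as an illustration of the decomposition, not the actual algorithm:

```python
import numpy as np

def dense_and_sparse_quantize(w, bits=4, outlier_pct=99.9):
    """Toy dense-and-sparse quantization: keep the largest-magnitude weights exactly
    in a sparse part and map the rest onto a small non-uniform codebook."""
    cutoff = np.percentile(np.abs(w), outlier_pct)
    mask = np.abs(w) > cutoff                      # ~0.1% largest-magnitude weights
    dense = np.where(mask, 0.0, w)                 # remainder, to be quantized

    # Non-uniform codebook: 2**bits representative values drawn from the weight
    # distribution itself (quantiles). SqueezeLLM instead derives the codebook with
    # sensitivity-weighted k-means, which this toy version does not implement.
    codebook = np.quantile(w[~mask], np.linspace(0.0, 1.0, 2 ** bits))
    codes = np.abs(dense[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, codebook, w * mask, mask         # outliers kept exactly (sparse)

def dequantize(codes, codebook, sparse, mask):
    return np.where(mask, sparse, codebook[codes])

w = np.random.randn(256, 256).astype(np.float32)
codes, codebook, sparse, mask = dense_and_sparse_quantize(w)
print("mean abs error:", np.abs(dequantize(codes, codebook, sparse, mask) - w).mean())
```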
The challenge: offering cross-platform support for low-precision LLM quantization
The methods in SqueezeLLM provide a considerable latency reduction compared to baseline FP16 inference, as well as efficient and accurate low-precision LLM quantization that minimizes memory usage during inference. The researchers' goal was to make these methods for improving LLM inference available across platforms, including systems such as Intel Data Center GPUs, by enabling cross-platform support.
SqueezeLLM, on the other hand, depends on handcrafted custom kernel implementations that use dense-and-sparse quantization to tackle the outlier problem with LLM inference and non-uniform quantization to offer correct representation with extremely few bits per parameter.
Even though these kernel implementations are rather simple, it is still not ideal to manually convert and optimize them for various target hardware architectures. They predicted a large overhead while converting their SqueezeLLM kernels to operate on Intel Data Center GPUs since they first created the kernels using CUDA and it took weeks to construct, profile, and optimize these kernels.
Therefore, in order to target Intel Data Center GPUs, they needed a way to rapidly and simply migrate their own CUDA kernels to SYCL. To prevent interfering with the remainder of the inference pipeline, this calls for the ability to convert the kernels with little human labor and the ability to more easily modify the Python-level code to use the custom kernels. They also wanted the ported kernels to be as efficient as possible so that Intel customers could benefit fully from SqueezeLLM‘s efficiency.
SYCLomatic
SYCLomatic offers a way to provide cross-platform compatibility without requiring extra technical work. The effective kernel techniques may be separated from the target deployment platform by using SYCLomatic’s CUDA-to-SYCL code conversion. This allows for inference on several target architectures with little extra engineering work.
Their performance investigation shows that the SYCLomatic-ported kernels achieve a 2.0x speedup on Intel Data Center GPUs running the Llama 7B model, and instantly improve efficiency without the need for human tweaking.
CUDA to SYCL
Solution: A SYCLomatic-Powered CUDA-to-SYCL Migration for Quantized LLMs on Multiple Platforms.
First Conversion
The SYCLomatic conversion was carried out in a development environment that included the Intel oneAPI Base Toolkit. Using the SYCLomatic conversion command dpct quant_cuda_kernel.cu, the kernel was migrated to SYCL. The conversion script changed the kernel implementations as needed and automatically produced accurate kernel definitions. The following examples demonstrate how SYCL-compatible code was added to the kernel implementation and invocations without requiring manual edits.
Change Python Bindings to Allow Custom Kernel Calling
The bindings were modified to use the PyTorch XPU C++ extension (DPCPPExtension) in order to call the kernel from Python code. This allowed the migrated kernels to be built and installed with a setup script in the deployment environment:
Original setup script: installing the CUDA kernel bindings
```python
setup(
    name="quant_cuda",
    ext_modules=[
        cpp_extension.CUDAExtension(
            "quant_cuda",
            ["quant_cuda.cpp", "quant_cuda_kernel.cu"]
        )
    ],
    cmdclass={"build_ext": cpp_extension.BuildExtension},
)
```
Modified setup script: installing the SYCL kernel bindings
```python
setup(
    name='quant_sycl',
    ext_modules=[
        DPCPPExtension(
            'quant_sycl',
            ['quant_cuda.cpp', 'quant_cuda_kernel.dp.cpp']
        )
    ],
    cmdclass={
        'build_ext': DpcppBuildExtension
    }
)
```
Once the kernel bindings were installed, the converted SYCL kernels could be called from PyTorch code, allowing end-to-end inference to be run with the converted kernels. This made it easy to adapt the existing SqueezeLLM Python code to the SYCL path, requiring only small changes to call the migrated kernel bindings.
Analysis of Converted Kernels’ Performance
The ported kernel implementations were tested and benchmarked by the SqueezeLLM team using Intel Data Center GPUs made accessible via the Intel Tiber Developer Cloud. As described earlier, SYCLomatic was used to convert the inference kernels, and after that, adjustments were made to enable calling the SYCL code from the SqueezeLLM Python code.
Benchmarking the 4-bit kernels on the Intel Data Center GPU Max Series allowed the team to evaluate the performance gains from low-precision quantization and to assess whether the conversion procedure yields inference kernels efficient enough to truly enable efficient inference on multiple platforms.
Table 1 shows the speedup and average latency for matrix-vector multiplications while using the Llama 7B model to generate 128 tokens. These findings show that substantial speedups may be achieved with the ported kernels without the need for human tweaking.
To evaluate the latency advantages of low-precision quantization achievable across different hardware back ends without changing the SYCL code, the 4-bit kernels were benchmarked on the Intel Data Center GPU. As Table 1 illustrates, running the Llama 7B model without any manual adjustment allows SqueezeLLM to achieve a 2.0x speedup on Intel Data Center GPUs compared to baseline FP16 inference.

| Kernel | Latency (seconds) |
| --- | --- |
| Baseline: fp16 matrix-vector multiplication | 2.584 |
| SqueezeLLM: 4-bit (0% sparsity) | 1.296 |
| Speedup | 2.0x |
When this speedup is compared with the 4-bit inference results on the NVIDIA A100 platform, which achieved a 1.7x speedup over baseline FP16 inference, the ported kernels show an even larger relative speedup than the handwritten CUDA kernels designed for NVIDIA GPUs. These findings demonstrate that comparable speedups on different architectures can be achieved via CUDA-to-SYCL migration using SYCLomatic, without extra engineering work or manual kernel tweaking after conversion.
In summary
For new applications, LLM inference is a fundamental task, and low-precision quantization is a crucial way to increase inference productivity. SqueezeLLM allows for low-precision quantization to provide accurate and efficient generative LLM inference. However, cross-platform deployment becomes more difficult due to the need for bespoke kernel implementations. The kernel implementation may be easily converted to other hardware architectures with the help of the SYCLomatic migration tool.
For instance, SYCLomatic-migrated 4-bit SqueezeLLM kernels show a 2.0x speedup on Intel Data Center GPUs without the need for human tweaking. Thus, SYCL conversion democratizes effective LLM implementation by enabling support for many hardware platforms with no additional technical complexity.
Read more on Govindhtech.com
ROCm 6.1.3 With AMD Radeon PRO GPUs For LLM Inference
ROCm 6.1.3 Software with AMD Radeon PRO GPUs for LLM inference.
AMD Pro Radeon
Large Language Models (LLMs) are no longer limited to major businesses operating cloud-based services with specialized IT teams. New open-source LLMs like Meta's Llama 2 and 3, including the recently released Llama 3.1, combined with the capability of AMD hardware, allow even small organizations to run their own customized AI tools locally on regular desktop workstations, eliminating the need to keep sensitive data online.
AMD Radeon PRO W7900
Workstation GPUs like the new AMD Radeon PRO W7900 Dual Slot offer industry-leading performance per dollar with Llama, making it affordable for small businesses to run custom chatbots, retrieve technical documentation, or create personalized sales pitches. The more specialized Code Llama models allow programmers to generate and optimize code for new digital products. These GPUs are equipped with dedicated AI accelerators and enough on-board memory to run even the larger language models.
Image Credit To AMD
And now that AI tools can be operated on several Radeon PRO GPUs thanks to ROCm 6.1.3, the most recent edition of AMD’s open software stack, SMEs and developers can support more users and bigger, more complicated LLMs than ever before.
LLMs’ new applications in enterprise AI
The prospective applications of artificial intelligence (AI) are much more diverse, even if the technology is commonly used in technical domains like data analysis and computer vision and generative AI tools are being embraced by the design and entertainment industries.
With the help of specialized LLMs, such as Meta’s open-source Code Llama, web designers, programmers, and app developers can create functional code in response to straightforward text prompts or debug already-existing code bases. Meanwhile, Llama, the parent model of Code Llama, has a plethora of potential applications for “Enterprise AI,” including product personalization, customer service, and information retrieval.
Although pre-made models are designed to cater to a broad spectrum of users, small and medium-sized enterprises (SMEs) can leverage retrieval-augmented generation (RAG) to integrate their own internal data, such as product documentation or customer records, into existing AI models. This allows for further refinement of the models and produces more accurate AI-generated output that requires less manual editing.
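The RAG pattern itself is simple: embed the internal documents, retrieve the ones most similar to the user's question, and prepend them to the prompt sent to the model. The toy sketch below uses a crude bag-of-words embedding purely for illustration; a real deployment would use a proper embedding model and a vector store:

```python
import numpy as np

# Toy document store: in practice these would be product docs or customer records.
docs = [
    "The W7900 workstation GPU has 48 GB of on-board memory.",
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our helpline is open Monday to Friday, 9am to 5pm.",
]

def embed(text: str, vocab: dict) -> np.ndarray:
    """Crude bag-of-words embedding; real systems use a trained embedding model."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.lower().split()}))}
doc_vecs = np.stack([embed(d, vocab) for d in docs])

def build_prompt(question: str, top_k: int = 1) -> str:
    scores = doc_vecs @ embed(question, vocab)          # cosine similarity (unit vectors)
    context = "\n".join(docs[i] for i in scores.argsort()[::-1][:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How much memory does the W7900 have?"))
# The assembled prompt would then be sent to the locally hosted LLM.
```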
How may LLMs be used by small businesses?
So what use might a customized large language model have for an SME? Let's examine a few examples. Using an LLM tailored to its own internal data:
Even after hours, a local retailer may utilize a chatbot to respond to consumer inquiries.
Helpline employees may be able to get client information more rapidly at a bigger shop.
AI features in a sales team’s CRM system might be used to create customized customer pitches.
Complex technological items might have documentation produced by an engineering company.
Contract drafts might be first created by a solicitor.
A physician might capture information from patient calls in their medical records and summarize the conversations.
Application forms might be filled up by a mortgage broker using information from customers’ papers.
For blogs and social media postings, a marketing firm may create specialized text.
Code for new digital items might be created and optimized by an app development company.
Online standards and syntactic documentation might be consulted by a web developer.
That’s simply a small sample of the enormous potential that exists in enterprise artificial intelligence.
Why not use the cloud for running LLMs?
While there are many cloud-based choices available from the IT sector to implement AI services, small companies have many reasons to host LLMs locally.
Data safety
Predibase research indicates that the main barrier preventing businesses from using LLMs in production is their apprehension about sharing sensitive data. Using AI models locally on a workstation eliminates the need to transfer private customer information, code, or product documentation to the cloud.
Reduced latency
In use situations where rapid response is critical, such as managing a chatbot or looking up product documentation to give real-time assistance to clients phoning a helpline, running LLMs locally as opposed to on a distant server minimizes latency.
More command over actions that are vital to the purpose
Technical personnel may immediately fix issues or release upgrades by executing LLMs locally, eliminating the need to wait on a service provider situated in a different time zone.
The capacity to sandbox test instruments
IT teams may test and develop new AI technologies before implementing them widely inside a company by using a single workstation as a sandbox.
Image Credit To AMD
AMD GPUs
How can small businesses use AMD GPUs to implement LLMs?
Hosting its own unique AI tools doesn’t have to be a complicated or costly enterprise for a SME since programs like LM Studio make it simple to run LLMs on desktop and laptop computers that are commonly used with Windows. Retrieval-augmented generation may be easily enabled to tailor the result, and LM Studio can use the specialized AI Accelerators in modern AMD graphics cards to increase speed since it is designed to operate on AMD GPUs via the HIP runtime API.
AMD Radeon Pro
While consumer GPUs such as the Radeon RX 7900 XTX have enough memory to run smaller models, such as the 7-billion-parameter Llama-2-7B, professional GPUs such as the 32GB Radeon PRO W7800 and 48GB Radeon PRO W7900 have more on-board memory, which allows them to run larger and more accurate models, such as the 30-billion-parameter Llama-2-30B-Q8.
Image Credit To AMD
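A rough rule of thumb explains these memory requirements: the weights alone need roughly (parameters × bits per weight ÷ 8) bytes, plus headroom for activations and the KV cache. The sketch below uses an assumed 20% overhead factor, so the numbers are estimates rather than exact requirements:

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for activations and KV cache.
    The overhead factor is a loose assumption and varies with context length."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params, bits in [("7B model @ fp16", 7, 16),
                           ("7B model @ 4-bit", 7, 4),
                           ("30B model @ 8-bit (Q8)", 30, 8)]:
    print(f"{name}: ~{model_memory_gb(params, bits):.0f} GB")
# ~17 GB, ~4 GB, ~36 GB respectively -- showing why 32-48 GB PRO cards fit larger models.
```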
For more demanding workloads, users may host their own optimized LLMs directly. Thanks to the latest release of ROCm 6.1.3, the open-source software stack of which HIP is a part, an IT department within an organization could set up a Linux-based system with four Radeon PRO W7900 cards to handle requests from multiple users at once.
In testing using Llama 2, the Radeon PRO W7900’s performance-per-dollar surpassed that of the NVIDIA RTX 6000 Ada Generation, the current competitor’s top-of-the-range card, by up to 38%. AMD hardware offers unmatched AI performance for SMEs at an unbelievable price.
A new generation of AI solutions for small businesses is powered by AMD GPUs
Now that the deployment and customization of LLMs are easier than ever, even small and medium-sized businesses (SMEs) may operate their own AI tools, customized for a variety of coding and business operations.
Professional desktop GPUs like the AMD Radeon PRO W7900 are well-suited to run open-source LLMs like Llama 2 and 3 locally, eliminating the need to send sensitive data to the cloud, because of their large on-board memory capacity and specialized AI hardware. And for a fraction of the price of competing solutions, companies can now host even bigger AI models and serve more users thanks to ROCm, which enables inferencing to be shared over many Radeon PRO GPUs.
Read more on govindhtech.com
AMD EPYC Processors Widely Supported By Red Hat OpenShift
EPYC processors
AMD fundamentally altered the rules when it returned to the server market in 2017 with the EPYC chip. Record-breaking performance, robust ecosystem support, and platforms tailored for contemporary workflows allowed EPYC to seize market share fast. AMD EPYC began the year with a meagre 2% of the market, but according to estimates, it now commands more than 30% of the market. All of the main OEMs, including Dell, HPE, Cisco, Lenovo, and Supermicro, offer EPYC CPUs on a variety of platforms.
Best EPYC Processor
Given AMD EPYC's extensive presence in the public cloud and enterprise server markets, along with its numerous performance and efficiency world records, it is evident that EPYC processors are more than capable of supporting Red Hat OpenShift, the container orchestration platform. EPYC forms the basis of contemporary enterprise architecture and state-of-the-art cloud functionality, making it an excellent option for enabling application modernization. Red Hat Summit was a compelling opportunity to make the case for EPYC and demonstrate why it should be considered for an OpenShift deployment.
Gaining market share while delivering top-notch results
Over four generations, EPYC's performance has raised the standard. The 4th Generation AMD EPYC is the fastest data centre CPU in the world: for general-purpose applications, the 128-core EPYC delivers 73% better performance and 1.53 times the performance per estimated system watt compared with the 64-core Intel Xeon Platinum 8592+ (SP5-175A).
In addition, EPYC provides the leadership inference performance needed to manage the increasing ubiquity of AI. For example, on the industry-standard end-to-end AI benchmark TPCx-AI SF30, an AMD EPYC 9654 powered server delivers almost 1.5 times the aggregate throughput of an Intel Xeon Platinum 8592+ server (SP5-051A).
A comprehensive array of data centres and cloud presence
You may be certain that the infrastructure you’re now employing is either AMD-ready or currently operates on AMD while you work to maximise the performance of your applications.
Among all the main providers, AMD-powered servers are among the best-selling Red Hat OpenShift-certified options for the OpenShift market. If you're intrigued, take a moment to look through the Red Hat partner catalogue to see just how many AMD-powered choices are compatible with OpenShift.
On the cloud front, OpenShift certified AMD-powered instances are available on AWS and Microsoft Azure. For instance, the EPYC-powered EC2 instances on AWS are T3a, C5a, C5ad, C6a, M5a, M5ad, M6a, M7a, R5a, and R6a.
Supplying the energy for future tasks
The benefit AMD’s rising prominence in the server market offers enterprises is the assurance that their EPYC infrastructure will perform optimally whether workloads are executed on-site or in the cloud. This is made even more clear by the fact that an increasing number of businesses are looking to jump to the cloud when performance counts, such during Black Friday sales in the retail industry.
Modern applications increasingly incorporate or produce AI elements for rich user benefits, in addition to native scalability flexibility. Another benefit of using AMD EPYC CPUs is their demonstrated ability to provide fast large language model inference responsiveness. LLM inference latency is a crucial factor in any AI deployment, and at Red Hat Summit AMD seized the chance to demonstrate exactly that.
To showcase the performance of 4th Gen AMD EPYC, AMD ran Llama 2-7B-Chat-HF at bf16 precision over Red Hat OpenShift on Red Hat Enterprise Linux CoreOS. AMD demonstrated the potential of EPYC on several distinct use cases, one of which was a customer-service chatbot. In this case the time to first token was 219 milliseconds, easily satisfying a human user who probably expects a response in under a second.
Token throughput was 8 tokens per second, above the roughly 6.5 tokens per second (about 5 English words per second) that most English readers can keep up with. The 127 millisecond latency per token shows that the model can produce words faster than even a fast reader typically reads.
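The arithmetic behind those figures is straightforward; the sketch below converts the measured per-token latency into throughput and compares it with a typical reading pace, assuming roughly 1.3 tokens per English word:

```python
latency_per_token_s = 0.127      # measured per-token latency from the demo
time_to_first_token_s = 0.219    # prefill latency before the first token appears

tokens_per_second = 1 / latency_per_token_s
words_per_second = tokens_per_second / 1.3   # ~1.3 tokens per English word (rough assumption)

print(f"{tokens_per_second:.1f} tokens/s ~= {words_per_second:.1f} words/s")
# ~7.9 tokens/s, or about 6.1 words/s, comfortably above a typical ~5 words/s reading pace.
```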
Meeting developers, partners, and customers at conferences like Red Hat Summit is always a pleasure, as is getting to hear directly from customers. AMD has worked hard to demonstrate that it provides infrastructure that is more than competitive for the development and deployment of contemporary applications. EPYC processors, EPYC-based commercial servers, and the Red Hat Enterprise Linux and OpenShift ecosystem surrounding them are reliable resources for OpenShift developers.
It was wonderful to interact with the community at the Summit, and it’s always positive to highlight AMD’s partnerships with industry titans like Red Hat. EPYC processors will return this autumn with an update, coinciding with Kubecon.
Red Hat OpenShift's extensive use of AMD EPYC-based servers is evidence of their potent blend of affordability, efficiency, and performance. As technology advances, we can expect a number of fascinating developments in this field:
Improved Efficiency and Performance
EPYC processors of the upcoming generation
AMD is renowned for its quick innovation cycle. It’s expected that upcoming EPYC processors would offer even more cores, faster clock rates, and cutting-edge capabilities like  AI acceleration. Better performance will result from these developments for demanding OpenShift workloads.
Better hardware-software integration
AMD, Red Hat, and hardware partners working together more closely will produce more refined optimizations that will maximize the potential of EPYC-based systems for OpenShift. This entails optimizing virtualization capabilities, I/O performance, and memory subsystems.
Increased Support for Workloads
Acceleration of AI and machine learning
EPYC-based servers equipped with dedicated AI accelerators will proliferate as AI and ML become more widespread. As a result, OpenShift environments will be better equipped to manage challenging AI workloads.
Data analytics and high-performance computing (HPC)
EPYC’s robust performance profile makes it appropriate for these types of applications. Platforms that are tailored for these workloads should be available soon, allowing for OpenShift simulations and sophisticated analytics.
Integration of Edge Computing and IoT
Reduced power consumption
EPYC processors of the future might concentrate on power efficiency, which would make them perfect for edge computing situations where power limitations are an issue. By doing this, OpenShift deployments can be made closer to data sources, which will lower latency and boost responsiveness.
IoT device management
EPYC-based servers have the potential to function as central hubs for the management and processing of data from Internet of Things devices. On these servers, OpenShift can offer a stable foundation for creating and implementing IoT applications.
Environments with Hybrid and Multiple Clouds
Uniform performance across clouds
Major cloud providers will probably offer EPYC-based servers, ensuring uniform performance for hybrid and multi-cloud OpenShift setups.
Cloud-native apps that are optimised
EPYC-based platforms are designed to run cloud-native applications effectively by utilising microservices and containerisation.
Read more on govindhtech.com
Using LLM Inference, Apple’s Generative AI Improvements
LLM inference
Within the rapidly developing field of artificial intelligence, Apple has unveiled LazyLLM, an invention that looks to redefine AI processing efficiency. This innovative method, a component of Apple's generative AI effort, reduces the work large language models (LLMs) must do during inference. With its emphasis on improving performance and optimizing resource utilization, LazyLLM is well positioned to have a big influence on many applications and sectors. This piece explores LazyLLM's attributes, workings, and possible uses, and how it may change AI technology.
An Overview of LazyLLM
A new method called LazyLLM aims to improve the effectiveness of LLM inference, which is essential to the operation of contemporary AI systems. The computational resources needed for AI models to function have increased along with their complexity and size. This problem is tackled by LazyLLM, which makes high-performance AI work smoothly on a range of platforms, from high-end servers to smartphones, by optimising the inference process.
Important LazyLLM Features
Resource Optimisation
LazyLLM lowers the processing burden, making complex AI models run smoothly on devices with different hardware specs.
Energy Efficiency
By reducing the amount of energy used for inference operations, LazyLLM helps to promote more environmentally friendly and sustainable AI methods.
Scalability
A wide range of applications can benefit from the technique’s adaptability to various hardware configurations.
Performance Improvement:
Without sacrificing accuracy, LazyLLM quickens the inference process to produce AI interactions that are more responsive and quick.
Apple Gen AI
How LazyLLM Operates
Using a number of cutting-edge technology, LazyLLM improves LLM inference performance. Here’s a closer look at the essential elements of this method:
Pick-and-Place Computation
Selective computation is one of the main techniques used by LazyLLM. Instead of processing the full model input at once, LazyLLM finds and concentrates on the most pertinent portions needed for a particular task. By minimizing pointless calculations, this focused approach accelerates the inference process.
Allocation of Dynamic Resources
Depending on the intricacy of the input and the particular needs of the task, LazyLLM dynamically distributes processing resources. In order to maximise overall efficiency, more resources are allocated to complex jobs and fewer to simpler ones.
In parallel operation
LazyLLM guarantees the simultaneous processing of several model parts by permitting parallel processing. This leads to faster inference and improved handling of larger models without correspondingly higher processing demand.
Compression of Model
LazyLLM reduces the size of an LLM without compromising accuracy by utilising sophisticated model compression techniques. As a result, the model is faster to store and access, and inference proceeds more quickly and efficiently.
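As a generic illustration of the selective-computation idea only (not Apple's implementation), the sketch below scores tokens with a simple importance proxy and forwards just the top fraction to the expensive downstream layers; the scoring rule and dimensions are invented for the example:

```python
import numpy as np

def prune_tokens(hidden: np.ndarray, keep_fraction: float = 0.5):
    """Toy selective computation: score each token by the magnitude of its hidden
    state and keep only the most 'important' ones for the expensive layers.
    The scoring rule here is invented for illustration, not LazyLLM's actual criterion."""
    scores = np.linalg.norm(hidden, axis=-1)                  # one score per token
    k = max(1, int(len(scores) * keep_fraction))
    keep = np.sort(np.argsort(scores)[-k:])                   # top-k token indices, in order
    return hidden[keep], keep

hidden_states = np.random.randn(1024, 4096)                   # 1024 tokens, hypothetical width
reduced, kept_idx = prune_tokens(hidden_states, keep_fraction=0.3)
print(f"processing {reduced.shape[0]} of {hidden_states.shape[0]} tokens downstream")
```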
Applications and Implications
LazyLLM’s launch has significant effects in a number of fields, improving enterprise and consumer applications. LazyLLM is anticipated to have a significant influence in the following important areas:
Improved User Exchanges
LazyLLM will greatly enhance the functionality and responsiveness of AI-powered features on Apple products. LazyLLM’s sophisticated natural language processing skills will help virtual assistants such as Siri by facilitating more organic and contextually aware dialogues.
Creation of Content
LazyLLM provides strong tools for content producers to expedite the creative process. Increased productivity and creativity can be achieved by writers, marketers, and designers by using LazyLLM to create original content, develop ideas, and automate tedious chores.
Customer Service
To respond to consumer enquiries more quickly and accurately, businesses can integrate LazyLLM into their customer care apps. With its capacity to comprehend and handle natural language inquiries, chatbots and virtual assistants will function more effectively, increasing customer satisfaction and speeding up response times.
Training and Education
LazyLLM can help educators tailor lessons to each student in a classroom. Understanding each learner’s unique learning preferences and patterns allows it to adjust feedback, create practice questions, and suggest resources all of which improve the learning process as a whole.
Medical care
By aiding in the analysis of medical data, offering recommendations for diagnosis, and facilitating telemedicine applications, LazyLLM has the potential to revolutionise the healthcare industry. Its capacity to comprehend and process complicated medical jargon can assist healthcare professionals in providing more precise and timely care.
Difficulties and Things to Think About
LazyLLM is a big improvement, but its effective application will depend on a number of factors and problems, including:
Harmoniousness
For LazyLLM to be widely adopted, it must be compatible with existing models and frameworks. Apple will need to offer strong tools and support so that developers can easily incorporate this method.
Information Security
Preserving data security and privacy is crucial, just like with any AI technology. To make sure that LazyLLM handles user data appropriately, Apple’s dedication to privacy will be crucial.
AI Ethics
Creating moral AI procedures is essential to avoiding prejudices and guaranteeing that each user is treated fairly. To make sure that LazyLLM runs fairly and openly, Apple will need to keep up its efforts in this area.
What is LLM Inference?
The process of generating text in response to a prompt or query using a large language model (LLM) is known as LLM inference. That’s basically how you persuade an LLM to perform a task!
The following describes LLM inference
Prompt and Tokenization: The LLM parses a text prompt that you supply into tokens, which are essentially building pieces that are words or word fragments.
Prediction and Reaction: The LLM predicts the most likely course of the prompt by applying its prior knowledge and the patterns it acquired during training. It then creates your response by generating text based on these guesses.
LLM inference involves a constant trade-off between speed and quality: high-quality answers require more computation, while faster answers can sacrifice detail. Researchers strive to boost throughput without losing quality.
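The two steps above map directly onto a few lines of code. A minimal sketch with the Hugging Face transformers library, using a small stand-in model rather than any Apple model:

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in checkpoint; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Efficient LLM inference matters because"
inputs = tokenizer(prompt, return_tensors="pt")            # step 1: prompt -> tokens

output_ids = model.generate(**inputs, max_new_tokens=30)   # step 2: predict the continuation
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```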
In summary, AI Efficiency Has Advanced
Apple's LazyLLM represents an important advancement in the field of artificial intelligence. By combining efficiency, scalability, and advanced capabilities, LazyLLM promises to improve user experiences, spur creativity, and advance sustainability. It has enormous potential to change the AI landscape and enhance interactions with technology, and we look forward to seeing it implemented across a range of applications.
Read more on Govindhtech.com