#googlekubernetengine
govindhtech · 11 months ago
Kubernetes Cluster Autoscaler Gets Smarter and Faster
The things you never have to think about can sometimes have the biggest impact on cloud infrastructure. Google Cloud has a long history of quietly innovating behind the scenes in Google Kubernetes Engine (GKE), optimising the unseen gears that keep your clusters running smoothly. These improvements rarely make headlines, but users still benefit from better performance, lower latency, and a simpler user experience.
Google Cloud is highlighting some of these “invisible” GKE developments, especially in the area of infrastructure autoscaling. Let’s see how the latest updates to the Cluster Autoscaler (CA) can greatly improve the performance of your workloads without requiring any new configuration.
What’s new in the Cluster Autoscaler?
The Cluster Autoscaler, the component that automatically sizes your node pools based on demand, has been the subject of intense development by the GKE team. Below is a summary of some significant enhancements:
Target replica count tracking
This feature helps with scaling when multiple Pods are added at once (e.g., during major resizes or new deployments). Additionally, a 30-second delay that hampered GPU autoscaling is gone. The community as a whole will gain from enhanced Kubernetes performance when this capability becomes open-source.
Quick homogeneous scale-up
By effectively bin-packing pods onto nodes, this optimisation expedites the scaling process if you have many identical pods.
Reduced CPU waste
When several scale-ups across many node pools are required, the Cluster Autoscaler now makes decisions more quickly. Furthermore, the Cluster Autoscaler is more intelligent about when to execute its control loop, preventing needless delays.
Memory optimisation
The Cluster Autoscaler has also undergone memory optimisations, which add to its overall efficiency even though they are not immediately evident to the user.
Benchmarking outcomes
To showcase the practical impact of these changes, Google Cloud ran a series of tests using two GKE versions (1.27 and 1.29) across several scenarios:
At the infrastructure level
Autopilot generic 5k scaled workload: Google Cloud assessed the time it took for each pod to become ready after deploying a workload with 5,000 replicas on Autopilot.
Busy batch cluster: Google Cloud replicated a high-traffic batch cluster by generating 100 node pools and regularly launching numerous 20-replica jobs, then measured the scheduling latency.
10-replica GPU test: The amount of time it took for each pod to be ready was determined using a 10-replica GPU deployment.
At the workload level:
Application end-user latency test: Google Cloud used a standard web application that, in the absence of load, responds to an API call with a defined latency and response time. Using Locust, an industry-standard load-testing tool, Google Cloud evaluated different GKE versions under a typical traffic pattern that causes GKE to scale with both HPA and NAP. The application was scaled on CPU with an HPA target of 50% CPU utilisation, and end-user latency was measured at P50 and P95.
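For reference, a Horizontal Pod Autoscaler with a 50% average CPU utilisation target, similar in spirit to the one used in this test, can be created with the official Kubernetes Python client roughly as follows. This is an illustrative sketch, not the benchmark’s actual configuration; the Deployment name, namespace, and replica bounds are placeholders.

```python
# Illustrative sketch: create an HPA that scales a Deployment on 50% average
# CPU utilisation using the official Kubernetes Python client.
# "web-app", the namespace, and the replica bounds are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a Pod

hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "web-app-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "web-app",  # placeholder Deployment name
        },
        "minReplicas": 1,
        "maxReplicas": 100,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 50},
            },
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa_manifest
)
```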
Results highlights 
| Scenario | Metric | GKE v1.27 (baseline) | GKE v1.29 |
| --- | --- | --- | --- |
| Autopilot generic 5k replica deployment | Time-to-ready | 7m 30s | 3m 30s (55% improvement) |
| Busy batch cluster | P99 scheduling latency | 9m 38s | 7m 31s (20% improvement) |
| 10-replica GPU | Time-to-ready | 2m 40s | 2m 09s (20% improvement) |
| Application end-user latency | End-user response latency, P50 and P95 (seconds) | P50: 0.43s, P95: 3.4s | P50: 0.4s, P95: 2.7s (P95: 20% improvement) |

Note: These results are illustrative and will vary based on your specific workload and configuration.
Gains such as cutting the deployment time of 5,000 Pods in half or improving application response latency at the 95th percentile by 20% usually require rigorous optimisation or overprovisioned infrastructure. A notable aspect of the new Cluster Autoscaler changes is that these gains are achieved without elaborate settings, unused resources, or overprovisioning.
Every new version of GKE adds features both visible and unseen, so be sure to keep up with the latest updates. And keep checking back for further details on how Google Cloud is working to adapt GKE to the needs of contemporary cloud-native applications!
This article also offers guidance for keeping updates to your Google Kubernetes Engine (GKE) cluster as smooth as possible, along with suggestions for developing an upgrade plan that meets your requirements and improves the availability and reliability of your environments. You can use this guidance to keep your clusters updated for stability and security with minimal disruption to your workloads.
Create several environments
Google Cloud recommends using multiple environments as part of your software update delivery process. Multiple environments let you test infrastructure and software changes separately from your production environment, reducing risk and unplanned downtime. At a minimum, you should have a pre-production or test environment in addition to production.
Enrol clusters in release channels
Kubernetes updates are released frequently to deliver new features, fix known bugs, and provide security patches. GKE release channels let you choose how much emphasis to place on feature availability versus stability for the version deployed in your cluster. When you enrol a new cluster in a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools.
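As a sketch of what enrolment can look like programmatically (the gcloud CLI or Terraform are equally valid paths), the google-cloud-container Python client exposes a release channel on the cluster definition. Treat the exact field and enum names below as assumptions to verify against the current client library; the project, location, and cluster names are placeholders.

```python
# Hedged sketch: create a GKE cluster enrolled in the "regular" release
# channel with the google-cloud-container client. Project, location, and
# cluster names are placeholders; verify field names against your
# client-library version before relying on this.
from google.cloud import container_v1

gke = container_v1.ClusterManagerClient()

cluster = container_v1.Cluster(
    name="staging-cluster",
    initial_node_count=3,
    release_channel=container_v1.ReleaseChannel(
        channel=container_v1.ReleaseChannel.Channel.REGULAR
    ),
)

operation = gke.create_cluster(
    parent="projects/my-project/locations/us-central1",
    cluster=cluster,
)
print(f"Started cluster create operation: {operation.name}")
```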
In summary
Google Cloud is dedicated to making Kubernetes not only powerful but also simple to use and manage. By optimising core processes such as the Cluster Autoscaler, Google Cloud helps GKE administrators focus on their applications and business objectives while ensuring that their clusters scale efficiently and reliably.
Read more on govindhtech.com
govindhtech · 1 year ago
How ML Productivity Goodput Optimizes AI Workflows
ML Productivity Goodput
This is one of the most exciting times to be a computer scientist. Large-scale generative models have entered research and human-technology interaction, including software design, education, and creativity. As ever-larger amounts of compute become available, the performance and capabilities of these foundation models continue to advance; this is typically measured by the number of floating-point operations needed to train a model.
Larger and more effective compute clusters enable this rapid increase in compute scale. However, the mean time between failures (MTBF) of the entire system decreases as the compute cluster grows (measured by the number of accelerators or nodes), causing a linear rise in the failure rate. Infrastructure costs also rise linearly, so the total cost of failure grows quadratically as compute cluster size increases.
The true efficiency of the overall machine learning system is crucial to sustainable large-scale training: left unchecked, inefficiency can make scaling beyond a certain point impractical, while a well-designed system opens up new possibilities at larger scale. To quantify this efficiency, this post introduces a new metric called ML Productivity Goodput, presents techniques to maximise it, and describes an API you can incorporate into your projects to measure and track Goodput.
What is Goodput
ML Productivity Goodput is made up of three goodput metrics: Scheduling Goodput, Runtime Goodput, and Programme Goodput.
Scheduling Goodput measures the fraction of time that all the resources needed to run the training job are available. In on-demand or preemptible consumption models, this factor is less than 100% due to possible stockouts, so reserving your resources is the recommended way to maximise your Scheduling Goodput score.
Runtime Goodput measures the time spent making forward training progress, as a percentage of the time during which all training resources are available. Maximising it requires careful engineering; a later section covers how to measure and optimise Runtime Goodput for your large-scale training jobs on Google Cloud.
Programme Goodput measures the percentage of hardware performance the training job actually extracts: the model training throughput as a percentage of the system’s peak throughput, also known as Model FLOP Utilisation or Effective Model FLOP Utilisation. Effective compute-communication overlap and a thoughtful distribution strategy are two important factors that determine Programme Goodput when scaling to the required number of accelerators.
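Taken together, these three factors compose the overall metric. A simplified reading, assuming the factors combine multiplicatively (which matches how each is defined as a fraction of time or throughput), is the following sketch:

```latex
\mathrm{ML\ Productivity\ Goodput} \;\approx\; G_{\mathrm{scheduling}} \times G_{\mathrm{runtime}} \times G_{\mathrm{programme}}
```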
Google’s AI Hypercomputer
AI Hypercomputer is a supercomputing architecture designed to increase machine learning (ML) productivity for AI training, tuning, and serving. It combines a carefully chosen set of capabilities built through systems-level codesign. The original post includes a diagram showing how the components of ML Productivity Goodput map onto AI Hypercomputer: specific capabilities aimed at optimising Programme and Runtime Goodput across the framework, runtime, and orchestration layers. The rest of this post focuses on the AI Hypercomputer components that can help you make the most of them.
Understanding Runtime Goodput
The number of useful training steps completed within a given time window is the fundamental component of Runtime Goodput. Based on an assumed checkpointing interval, the time to reschedule the slice, and the time to resume training, Runtime Goodput can be estimated as follows:
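The original post presents this estimate as a figure; a reconstruction under the stated assumptions (failures arrive on average every MTBF seconds, and roughly half a checkpointing interval of progress is lost per failure) is the following sketch, where the exact formulation in the original may differ:

```latex
G_{\mathrm{runtime}} \;\approx\; \frac{\mathrm{MTBF} - \left(\tfrac{1}{2}\,t_{\mathrm{ckpt}} + t_{\mathrm{reschedule}} + t_{\mathrm{resume}}\right)}{\mathrm{MTBF}}
```

Here t_ckpt is the checkpointing interval, t_reschedule the time to reschedule the slice, and t_resume the time to resume training.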
This analytical model also identifies the three factors to minimise in order to maximise Runtime Goodput:
Time from the last checkpoint to the failure (the work lost since the last save).
Time to resume training.
Time to reschedule the slice, which is covered under Scheduling Goodput.
Introducing the Goodput Measurement API
Measurement is the first step towards improvement. The Goodput Measurement API, available as a Python package, lets you instrument Scheduling Goodput and Runtime Goodput measurement into your code. It provides methods to report your training step progress to Cloud Logging and to read that progress back, so you can measure and track Runtime Goodput.
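As a rough illustration of the instrumentation pattern only: the class and method names below are placeholders rather than the real package’s identifiers, and the real API records progress to Cloud Logging instead of in memory.

```python
# Illustrative sketch of the pattern: record step progress during training,
# then derive Runtime Goodput from the recorded timeline.
import time

class StepRecorder:
    """Records job start and per-step completion timestamps in memory."""

    def __init__(self):
        self.job_start = None
        self.step_times = []  # list of (step_number, completion_timestamp)

    def record_job_start(self):
        self.job_start = time.time()

    def record_step(self, step: int):
        self.step_times.append((step, time.time()))


def runtime_goodput(rec: StepRecorder) -> float:
    """Fraction of wall-clock time spent making new step progress.

    Steps replayed after a restart (already-seen step numbers) are treated
    as recomputation, not progress, so their time does not count.
    """
    if rec.job_start is None or not rec.step_times:
        return 0.0
    seen, productive, prev = set(), 0.0, rec.job_start
    for step, t in rec.step_times:
        if step not in seen:
            productive += t - prev
            seen.add(step)
        prev = t
    total = rec.step_times[-1][1] - rec.job_start
    return productive / total if total > 0 else 0.0
```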
Optimising Scheduling Goodput
Scheduling Goodput depends on the availability of every resource needed to carry out the training. Dynamic Workload Scheduler (DWS) calendar mode reserves compute resources for the training job, which maximises Scheduling Goodput for short-term usage. In addition, “hot spares” are recommended to reduce the time needed to schedule resources when recovering from a disruption. Together, hot spares and reserved resources increase Scheduling Goodput.
Optimising Runtime Goodput
AI Hypercomputer provides the following recommended techniques to optimise Runtime Goodput:
Turn on auto-checkpointing.
Use container preloading, which Google Kubernetes Engine offers.
Use a persistent compilation cache.
Auto-checkpointing
Auto-checkpointing lets you trigger a checkpoint when a SIGTERM signal is received, indicating that the training job is about to be interrupted. During maintenance events or defragmentation-related preemption, auto-checkpointing helps minimise the work lost since the last checkpoint.
Both MaxText, a reference implementation for high-performance training and serving on Google Cloud, and Orbax provide example implementations of auto-checkpointing.
Auto-checkpointing is available for training on both Cloud TPUs and GPUs, and for both GKE-based and non-GKE training orchestrators.
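A minimal sketch of the idea, assuming a generic training loop: save_checkpoint and train_step are placeholders for your framework’s calls (for example Orbax checkpointing in a MaxText/JAX setup), not an actual GKE or MaxText API.

```python
# Minimal auto-checkpointing sketch: write a checkpoint when SIGTERM warns of
# an imminent interruption, in addition to regular periodic checkpoints.
import signal
import sys

def install_sigterm_checkpoint(save_checkpoint):
    def handler(signum, frame):
        print("SIGTERM received: writing emergency checkpoint before exit")
        save_checkpoint()
        sys.exit(0)
    signal.signal(signal.SIGTERM, handler)

def train(train_step, save_checkpoint, num_steps, checkpoint_every=1000):
    install_sigterm_checkpoint(save_checkpoint)
    for step in range(num_steps):
        train_step(step)
        if step > 0 and step % checkpoint_every == 0:
            save_checkpoint()  # regular periodic checkpoint
```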
Container preloading
After a failure or other disruption, resuming training quickly is critical to achieving the highest possible Goodput. Google Kubernetes Engine (GKE) therefore supports container and model preloading from a secondary boot disk, currently available in preview, which lets a workload with a large container image start up very quickly. This matters because, for large images, pulling the container image from a registry or object storage can dominate the time it takes to resume a job.
Preloading lets you specify a secondary boot disk containing the required container image at node-pool creation or auto-provisioning time. As soon as GKE replaces the failed node, the required container images are already available, allowing training to resume quickly.
In these tests, the image pull for a 16 GB container with container preloading was roughly 29x faster than the baseline (pulling the image from a container registry).
Persistent compilation cache
Just-in-time compilation and system-aware optimisations are central to XLA compiler-based computation stacks. In most efficient training loops, computation graphs are compiled once and then run repeatedly with different input data.
As long as the graph shapes don’t change, a compilation cache avoids recompilation. That cache can be lost after a failure or interruption, which slows down training resumption and hurts Runtime Goodput. A persistent compilation cache addresses this by letting users save the compilation cache to Cloud Storage so that it survives restart events.
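For JAX users, pointing the persistent compilation cache at a Cloud Storage location can look roughly like the following. The configuration option name is an assumption based on recent JAX releases, and the bucket path is a placeholder; check the documentation for your JAX version.

```python
# Hedged sketch: persist JAX's compilation cache in a Cloud Storage bucket so
# compiled programs survive job restarts. The option name below is assumed
# from recent JAX releases; verify it for your version.
import jax

jax.config.update("jax_compilation_cache_dir", "gs://my-training-bucket/jax-cache")

# Subsequent jax.jit compilations are written to (and reused from) the cache.
```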
In addition, recent improvements to GKE, the recommended orchestration layer for AI Hypercomputer, have increased job-scheduling throughput by 3x, which helps reduce the time to resume training.
Optimising Programme Goodput
Programme Goodput, or Model FLOP Utilisation, depends on how well the training programme uses the underlying compute as it makes progress. It is influenced by the distribution strategy, effective compute-communication overlap, optimised memory access, and pipeline design.
A key element of AI Hypercomputer is the XLA compiler, which helps you optimise Programme Goodput through built-in optimisations and simple, effective scaling APIs such as GSPMD, which let users express a variety of parallelism strategies easily and take advantage of scale. Three major features have recently been added to help PyTorch/XLA and JAX users get the most out of their programmes.
Custom kernels with XLA
Compiler-driven computation optimisation often needs an “escape hatch” that lets users exceed the default performance by writing more efficient implementations of complex computation blocks from basic primitives. Pallas is the library designed to support custom kernels for Cloud TPUs and GPUs; it works with both JAX and PyTorch/XLA. Pallas can be used to write custom kernels such as block-sparse kernels and Flash Attention. The Flash Attention kernel improves Programme Goodput (Model FLOP Utilisation) for longer sequences, with the benefit most pronounced at sequence lengths of 4K or above.
Host offload
Accelerator memory is a scarce resource in large-scale model training, so compute cycles are often traded for memory through techniques such as activation rematerialisation. Host offload, a technique recently added to the XLA compiler, instead offloads activations computed during the forward pass to host DRAM and reuses them for gradient computation during the backward pass. By reducing activation recomputation, host offload improves Programme Goodput.
AQT-Based Int8 Mixed Precision Training
Accurate Quantized Training (AQT) maps a subset of the matrix multiplications in the training step to int8, increasing training efficiency, and therefore Programme Goodput, without sacrificing convergence.
These methods were combined in a benchmark to increase Programme Goodput for a MaxText 128B dense LLM implementation.
Using all three strategies together increases Programme Goodput by a cumulative 46% in this benchmark. Improving Programme Goodput is usually an iterative process, and the actual gains for a given training task depend on the model architecture and training hyperparameters.
In summary
Large-scale training of generative models enables business value, but as ML training scales, productivity can suffer. This post defined ML Productivity Goodput, a metric for assessing the overall ML productivity of large-scale training jobs, introduced the Goodput Measurement API, and covered the AI Hypercomputer components that can help you maximise ML Productivity Goodput at scale. With AI Hypercomputer, Google Cloud looks forward to helping you maximise your ML productivity at scale.
Read more on Govindhtech.com
govindhtech · 1 year ago
Benefits of Gemma on GKE for Generative AI
Gemma On GKE: New features to support open generative AI models
Now is a fantastic moment for businesses innovating with AI. Google recently released Gemini, its biggest and most powerful AI model, followed by Gemma, a family of modern, lightweight open models derived from the same technology and research as the Gemini models. The Gemma 2B and 7B models deliver best-in-class performance for their size compared with other open models.
Gemma models are pre-trained and also come in fine-tuned variants to facilitate research and development. With the release of Gemma and expanded platform capabilities, Google Cloud takes the next step towards making AI more open and accessible to developers.
Let’s examine the improvements introduced in Google Kubernetes Engine (GKE) to help you deploy and serve Gemma on GKE Standard and Autopilot:
Integration with Vertex AI Model Garden, Hugging Face, and Kaggle: As a GKE client, you may begin using Gemma in Vertex AI Model Garden, Hugging Face, or Kaggle. This makes it simple to deploy models to the infrastructure of your choice from the repositories of your choice.
GKE notebook using Google Colab Enterprise: Developers may now deploy and serve Gemma using Google Colab Enterprise if they would rather work on their machine learning project in an IDE-style notebook environment.
A low-latency, dependable, and reasonably priced AI inference stack: Google Cloud previously revealed JetStream, a highly efficient, AI-optimised large language model (LLM) inference stack on GKE. In addition to JetStream, several AI-optimised inference stacks have been created that are both affordable and performant, supporting Gemma across ML frameworks (PyTorch, JAX) and powered by Cloud GPUs or Google’s custom-built Tensor Processing Units (TPUs).
A performance deep dive of Gemma on Google Cloud’s AI-optimised infrastructure, which is built for generative AI training and serving workloads, was also released recently.
Now, you can utilise Gemma to create portable, customisable AI apps and deploy them on GKE, regardless of whether you are a developer creating generative AI applications, an ML engineer streamlining generative AI container workloads, or an infrastructure engineer operationalizing these container workloads.
Integration with Vertex AI Model Garden, Hugging Face, and Kaggle
Their aim is to simplify the process of deploying AI models on GKE, regardless of the source.
Hugging Face
Google established a strategic partnership with Hugging Face, one of the AI community’s go-to destinations, earlier this year to give data scientists, ML engineers, and developers access to the newest models. With the introduction of the Gemma model card, Hugging Face makes it possible to deploy Gemma straight to Google Cloud: selecting the Google Cloud option takes you to Vertex Model Garden, where you can choose to deploy and serve Gemma on Vertex AI or GKE.
Vertex AI Model Garden
Gemma now joins over 130 models in the Vertex AI Model Garden, including open-source models, task-specific models from Google and other sources, and enterprise-ready foundation model APIs.
Kaggle
Developers can browse thousands of trained, deployment-ready machine learning models in one place with Kaggle. The Gemma model card on Kaggle offers a variety of model formats (PyTorch, FLAX, Transformers, and more), enabling an end-to-end workflow for downloading, installing, and managing Gemma on a GKE cluster. Kaggle users can also choose “Open in Vertex”, which takes them to Vertex Model Garden and lets them deploy Gemma on Vertex AI or GKE as described above. Gemma’s model page on Kaggle also lets you explore real-world examples posted by the community.
Google Colab Enterprise notebooks
Through Vertex AI Model Garden, developers, ML engineers, and ML practitioners can now use Google Colab Enterprise notebooks to deploy and serve Gemma on GKE. Pre-populated instructions in the notebook code cells let them deploy Gemma and run inference on GKE from a familiar, IDE-style notebook interface.
Serve Gemma models on infrastructure with AI optimizations
Performance per dollar and cost to serve are important considerations for inference at scale. With Google Cloud TPUs and GPUs and an AI-optimised infrastructure stack, GKE can handle a wide variety of AI workloads with high-performance, cost-effective inference.
“By seamlessly combining TPUs and GPUs, GKE enhances our ML pipelines, letting us take advantage of each device’s strengths for specific jobs while cutting latency and inference costs. For example, we deploy a large text encoder on TPUs to process text prompts efficiently in batches, then run our proprietary diffusion model on GPUs, using the word embeddings to produce beautiful visuals,” says Yoav HaCohen, Ph.D., Head of Lightricks’ Core Generative AI Research Team.
Gemma using TPUs on GKE
If you want to use Google Cloud TPU accelerators with your GKE infrastructure, several AI-optimised inference and serving frameworks that already support the most widely used LLMs now support Gemma on Cloud TPUs. Among them are:
JetStream
JetStream (MaxText) and JetStream (PyTorch/XLA) form a new inference engine built specifically for LLM inference, optimising performance for JAX and PyTorch LLMs on Google Cloud TPUs. JetStream delivers strong throughput and latency for LLM inference on Cloud TPUs, a major improvement in both performance and cost effectiveness. It combines sophisticated optimisation techniques, including continuous batching, int8 quantisation for weights and activations, and KV caching, to maximise throughput and memory utilisation. JetStream is Google’s recommended TPU inference stack.
Use this guide to get started with JetStream inference for Gemma on GKE and Google Cloud TPUs.
Gemma using GPUs on GKE
If you want to use Google Cloud GPU accelerators with your GKE infrastructure, several AI-optimised inference and serving frameworks that already support the most widely used LLMs now support Gemma on Cloud GPUs.
What is vLLM
vLLM is an extensively optimised open-source LLM serving system that improves serving throughput for PyTorch generative AI users.
Some of vLLM’s features include:
Optimised transformer execution with PagedAttention
Continuous batching to improve overall serving throughput
Tensor parallelism and distributed serving across multiple GPUs
To begin using vLLM for Gemma on GKE and Google Cloud GPUs, follow this tutorial.
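As a quick illustration of what serving Gemma with vLLM involves (separate from the GKE tutorial above), a minimal offline example looks roughly like the following. It assumes the google/gemma-7b weights are accessible on Hugging Face and a GPU with enough memory is attached.

```python
# Minimal vLLM sketch: load a Gemma checkpoint and generate completions.
# Assumes the google/gemma-7b weights are accessible (licence accepted on
# Hugging Face) and a suitable GPU; on GKE this would run inside the
# serving container.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-7b")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain Kubernetes autoscaling in one paragraph."], params)
for output in outputs:
    print(output.outputs[0].text)
```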
Text Generation Inference (TGI)
Text Generation Inference (TGI), an open-source LLM serving toolkit developed by Hugging Face, is highly optimised for high-performance text generation when deploying and serving LLMs. TGI offers tensor parallelism, continuous batching, and distributed serving across multiple GPUs to improve overall serving performance.
This tutorial shows how to use Hugging Face Text Generation Inference for Gemma on GKE and Google Cloud GPUs.
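Once a TGI server is running (for example behind a GKE Service), clients can call its REST API. A hedged sketch follows, with the Service hostname as a placeholder; the payload shape follows TGI’s /generate endpoint.

```python
# Hedged sketch: query a running TGI instance that is serving Gemma.
# "tgi-service" is a placeholder for your GKE Service hostname.
import requests

resp = requests.post(
    "http://tgi-service:8080/generate",
    json={
        "inputs": "Explain GKE release channels in one sentence.",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```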
TensorRT-LLM
Customers using Google Cloud GPU VMs with NVIDIA Tensor Core GPUs can use NVIDIA TensorRT-LLM, a comprehensive library for compiling and optimising LLMs for inference, to improve inference performance for the newest LLMs. TensorRT-LLM supports features such as continuous in-flight batching and paged attention.
This guide helps you set up GKE and Google Cloud GPU VMs powered by NVIDIA Tensor Core GPUs to serve with NVIDIA Triton and the TensorRT-LLM backend.
Google Cloud offers a range of options to meet your needs, whether you’re a developer using Gemma to build next-generation AI applications or choosing training and serving infrastructure for those models. GKE provides a flexible, cost-effective, and efficient self-managed platform for developing AI models that can also support the creation of subsequent models.
Read more on Govindhtech.com