#AIHypercomputers
AI Hypercomputer’s New Resource Hub & Speed Enhancements

Google AI Hypercomputer
Updates to the AI Hypercomputer software layer include a new resource hub, faster training and inference, and more.
AI has more promise than ever before, and infrastructure is essential to its advancement. Google Cloud’s supercomputing architecture, AI Hypercomputer, is built on open software, performance-optimized hardware, and adaptable consumption models. When combined, they provide outstanding performance and efficiency, scalability and resilience, and the freedom to select products at each tier according to your requirements.
A unified hub for AI Hypercomputer resources, enhanced resiliency at scale, and significant improvements to training and inference performance are all being announced today.
GitHub resources for AI Hypercomputer
The open software layer of AI Hypercomputer supports leading ML frameworks and orchestration options, and also offers reference implementations and workload optimizations to improve time-to-value for your particular use case. To make the advancements in its open software stack easily accessible to developers and practitioners, Google Cloud is launching the AI Hypercomputer GitHub organization. This is a central location where you can find reference implementations such as MaxText and MaxDiffusion, orchestration tools such as xpk (the Accelerated Processing Kit for workload management and cluster creation), and GPU performance recipes on Google Cloud. Google Cloud invites you to join in as it expands this list and adapts these resources to a quickly changing environment.
A3 Mega VMs are now supported by MaxText
MaxText is an open-source reference implementation for large language models (LLMs) that delivers high performance and scalability. Performance-optimized LLM training examples are now available for A3 Mega VMs, which are powered by NVIDIA H100 Tensor Core GPUs and provide 2X the GPU-to-GPU network bandwidth of A3 VMs. Google Cloud worked closely with NVIDIA to enhance JAX and XLA so that collective communication and computation on GPUs can overlap, and it has included example scripts and optimized model configurations for GPUs with the appropriate XLA flags enabled.
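As a rough illustration of how such XLA settings are applied in practice, the sketch below sets an XLA flag through the environment before initializing JAX and runs a tiny sharded step that mixes a collective with independent local work. The specific flag shown (the latency-hiding scheduler) is one commonly used XLA GPU option included here as an assumption for illustration; the MaxText A3 Mega configurations referenced above are the authoritative source for the exact flag set.

```python
import os

# Set XLA flags before JAX initializes so the GPU compiler picks them up.
# The flag below is illustrative; consult the MaxText GPU configs for the
# exact, supported set of flags used in the optimized training examples.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_enable_latency_hiding_scheduler=true"
)

import jax
import jax.numpy as jnp

devices = jax.devices()
x = jnp.ones((len(devices), 1024))


def local_step(shard):
    # Two independent pieces of work: a cross-device all-reduce and purely
    # local compute; with overlap enabled, the scheduler can interleave them.
    reduced = jax.lax.psum(shard, axis_name="i")          # collective communication
    local = jnp.tanh(shard) @ jnp.ones((1024, 1024))      # independent local compute
    return reduced + local


out = jax.pmap(local_step, axis_name="i")(x)
print(out.shape)
```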
As the number of VMs in the cluster increases, MaxText with A3 Mega VMs can provide training performance that scales almost linearly, as seen below using Llama2-70b pre-training.
Moreover, FP8 mixed-precision training on A3 Mega VMs can be used to further increase hardware utilization and acceleration. Google Cloud added FP8 support to MaxText through Accurate Quantized Training (AQT), the quantization library that also powers INT8 mixed-precision training on Cloud TPUs.
Google Cloud's results on dense models show that FP8 training with AQT can achieve up to 55% higher effective model FLOPs utilization (EMFU) than bf16.
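For context on how a figure like EMFU is typically derived, here is a small, hedged sketch: it divides the model FLOPs actually processed per second (estimated with the common ~6 × parameters × tokens approximation for one dense-transformer training step) by the cluster's peak FLOP/s. The peak value and all inputs below are placeholders for illustration, not figures from the post.

```python
def effective_model_flops_utilization(
    num_params: float,
    tokens_per_step: float,
    step_time_s: float,
    peak_flops_per_chip: float,
    num_chips: int,
) -> float:
    """Estimate EMFU: achieved model FLOP/s divided by peak hardware FLOP/s.

    Uses the common ~6 * params * tokens approximation for the FLOPs of one
    dense-transformer training step (forward + backward).
    """
    model_flops_per_step = 6.0 * num_params * tokens_per_step
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / (peak_flops_per_chip * num_chips)


# Hypothetical numbers purely for illustration (not benchmark results):
emfu = effective_model_flops_utilization(
    num_params=70e9,             # e.g. a 70B-parameter dense model
    tokens_per_step=4e6,         # global tokens per training step
    step_time_s=10.0,            # measured wall-clock step time
    peak_flops_per_chip=989e12,  # assumed peak FLOP/s per accelerator (placeholder)
    num_chips=256,
)
print(f"EMFU ~ {emfu:.1%}")
```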
Reference implementations and kernels for MoEs
Most mixture of experts (MoE) use cases benefit from the predictable resource usage of a fixed, small number of experts. For some applications, however, the ability to route to more experts and produce richer outputs matters more. To give you this flexibility, Google Cloud has added both “capped” and “no-cap” MoE implementations to MaxText so you can choose the one that best suits your model architecture: capped MoE models offer predictable performance, while no-cap models dynamically allocate resources for maximum efficiency.
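To make the capped vs. no-cap distinction concrete, here is a minimal, hedged JAX sketch of top-k expert routing with an optional capacity cap. It is not MaxText's implementation; names such as capacity_factor are illustrative.

```python
import jax
import jax.numpy as jnp


def route_tokens(router_logits, k=2, capacity_factor=None):
    """Top-k MoE routing. With capacity_factor set ("capped"), each expert
    accepts at most a fixed number of tokens and overflow is dropped; with
    capacity_factor=None ("no-cap"), every assignment is kept."""
    num_tokens, num_experts = router_logits.shape
    weights, experts = jax.lax.top_k(router_logits, k)      # (tokens, k)
    weights = jax.nn.softmax(weights, axis=-1)

    if capacity_factor is None:
        return experts, weights                              # no-cap routing

    capacity = int(capacity_factor * num_tokens * k / num_experts)
    # Rank of each assignment within its chosen expert's queue.
    one_hot = jax.nn.one_hot(experts, num_experts)           # (tokens, k, E)
    position = jnp.cumsum(one_hot.reshape(-1, num_experts), axis=0)
    position = position.reshape(num_tokens, k, num_experts)
    position = jnp.sum(position * one_hot, axis=-1)          # (tokens, k)
    keep = position <= capacity                              # drop overflow tokens
    return experts, weights * keep


# Illustrative usage with random logits for 8 tokens and 4 experts.
logits = jax.random.normal(jax.random.PRNGKey(0), (8, 4))
experts, weights = route_tokens(logits, k=2, capacity_factor=1.25)
```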
To accelerate MoE training further, Pallas kernels optimized for block-sparse matrix multiplication on Cloud TPUs have been made publicly available. Pallas is an extension to JAX that provides fine-grained control over code generated for XLA devices such as GPUs and TPUs; at the moment, block-sparse matrix multiplication is available only for TPUs. These kernels are compatible with both PyTorch and JAX and offer high-performance building blocks for training your MoE models.
With a fixed batch size per device, Google Cloud's testing of the no-cap MoE model (Mixtral-8x7b) shows nearly linear scaling. Increasing the number of experts in the base configuration along with the number of accelerators also produced almost linear scaling, which is indicative of performance on models with greater sparsity.
Monitoring large-scale training
Large fleets of accelerators that must work together on a single training job can make MLOps more difficult. You may find yourself asking, “Why is this one device segfaulting?” or “Did host transfer latencies spike for a reason?” Monitoring large-scale training jobs with the right metrics is essential for maximizing resource utilization and improving overall ML Goodput.
To simplify this important part of your MLOps charter, Google has provided a reference monitoring recipe. The recipe helps you create a Cloud Monitoring dashboard in your Google Cloud project that displays useful statistical metrics, such as average or maximum CPU utilization, so you can detect configuration anomalies and take corrective action.
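As a hedged sketch of querying the same kind of metric the dashboard surfaces, the snippet below pulls recent CPU utilization time series with the Cloud Monitoring Python client; the project ID and the one-hour window are placeholders, and the recipe above remains the reference approach.

```python
import time

from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

project_id = "your-project-id"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},  # last hour
    }
)

# Compute Engine CPU utilization for the VMs in the training cluster.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    instance = dict(series.resource.labels).get("instance_id", "unknown")
    peak = max(point.value.double_value for point in series.points)
    print(f"instance {instance}: peak CPU utilization {peak:.1%}")
```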
Cloud TPU v5p SparseCore is now GA
Recommender and embedding-based models need high-performance random memory access to make use of their embeddings. SparseCore, the TPU's hardware embedding accelerator, lets you build more powerful and efficient recommendation systems. With four dedicated SparseCores per chip, Cloud TPU v5p can run DLRM-V2 up to 2.5 times faster than its predecessor.
Enhancing the performance of LLM inference
Lastly, Google implemented KV cache quantization and ragged attention kernels in JetStream, an open-source, throughput- and memory-optimized engine for LLM inference. Combined, these improvements can increase inference performance on Cloud TPU v5e by up to 2X.
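The following is a minimal, hedged JAX sketch of the idea behind KV cache quantization (it is not JetStream's implementation): keys and values are stored in int8 with per-position scales and dequantized on the fly during attention, roughly halving cache memory versus bf16.

```python
import jax.numpy as jnp


def quantize_kv(kv: jnp.ndarray):
    """Quantize a KV tensor of shape (..., head_dim) to int8 plus a scale."""
    scale = jnp.max(jnp.abs(kv), axis=-1, keepdims=True) / 127.0
    scale = jnp.maximum(scale, 1e-8)  # avoid division by zero on all-zero rows
    q = jnp.clip(jnp.round(kv / scale), -127, 127).astype(jnp.int8)
    return q, scale


def dequantize_kv(q: jnp.ndarray, scale: jnp.ndarray) -> jnp.ndarray:
    return q.astype(jnp.bfloat16) * scale.astype(jnp.bfloat16)


# Storing the cache in int8 roughly halves its footprint vs. bf16, which
# allows larger serving batch sizes and hence higher throughput.
kv = jnp.ones((2, 16, 128), dtype=jnp.bfloat16)  # (batch, seq, head_dim)
q, scale = quantize_kv(kv)
restored = dequantize_kv(q, scale)
```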
Accelerating your AI journey
From pushing the boundaries of model training and inference to improving accessibility through a central resource repository, each part of AI Hypercomputer provides a foundation for the next generation of AI.
Read more on Govindhtech.com
Get ready for a revolution in AI! Google has unveiled its latest creation, the Gemini 1.5 Pro, a groundbreaking AI model boasting a significantly larger context window than its predecessor. This advancement unlocks a new level of understanding and responsiveness, paving the way for exciting possibilities in human-AI interaction.
Understanding the Context Window: The Key to Smarter AI
Imagine a conversation where you can reference details mentioned hours ago, or seamlessly switch between topics without losing the thread. That's the power of a large context window in AI. Essentially, the context window determines the amount of information an AI can consider at once. This information can be text, code, or even audio (as we'll see later). The larger the context window, the better the AI can understand complex relationships and nuances within the information it's processing.
Google Unveils Gemini 1.5 Pro
Gemini 1.5 Pro: A Quantum Leap in Contextual Understanding
The standard version of Gemini 1.5 Pro boasts a massive 128,000 token window. Compared to the 32,000 token window of its predecessor, Gemini 1.0, this represents a significant leap forward. For those unfamiliar with the term "token," it can be a word, part of a word, or even a syllable. But Google doesn't stop there. A limited version of Gemini 1.5 Pro is available with an astronomical one million token window. This allows the model to process information equivalent to roughly 700,000 words, or about ten full-length books! Imagine the possibilities! This "super brain" can analyze vast amounts of data, identify subtle connections, and generate insightful responses that would be beyond the reach of traditional AI models.
Beyond Context: New Features Empower Developers
The impressive context window is just the tip of the iceberg. Gemini 1.5 Pro comes packed with exciting new features designed to empower developers and unlock even greater potential:
Native Audio and Speech Support: Gemini 1.5 Pro can now understand and respond to spoken language. This opens doors for applications like voice search, real-time translation, and intelligent virtual assistants.
Simplified File Management: The new File API streamlines how developers handle files within the model. This improves efficiency and simplifies the development process.
Granular Control: System instructions and JSON mode offer developers more control over how Gemini 1.5 Pro functions. This allows them to tailor the model's behavior to specific tasks and applications.
Multimodal Capabilities: The model's ability to analyze not just text but also images and videos makes it a truly versatile tool. This paves the way for innovative applications in areas like visual search, content moderation, and even autonomous vehicles.
Global Accessibility: Gemini 1.5 Pro Reaches Over 180 Countries
The launch of Gemini 1.5 Pro in over 180 countries, including India, marks a significant step towards democratizing AI technology. This powerful model, with its unparalleled context window and suite of new features, is no longer limited to a select few. Developers and users worldwide can now explore the potential of AI and create innovative solutions that address local and global challenges.
Google's AI and Hardware Advancements: A Multi-faceted Approach
Google's commitment to AI advancement extends beyond the impressive capabilities of Gemini 1.5 Pro. Here are some additional highlights from their announcement:
Axion Chip Unveiled: Google has entered the ARM-based CPU market with the Axion chip. This chip promises significant improvements, boasting "up to 50% better performance and up to 60% better energy efficiency" compared to current x86-based options. This advancement could have a major impact on the efficiency and scalability of AI applications.
AI Hypercomputer Gets a Boost: Google's AI Hypercomputer architecture receives an upgrade with A3 Mega VMs powered by NVIDIA H100 Tensor Core GPUs. This translates to higher performance for large-scale training and research in the field of AI.
Cloud TPU v5p Now Generally Available: Cloud TPU v5p, Google's custom-designed Tensor Processing Units for AI workloads, are now generally available. This will provide developers and researchers with easier access to the powerful processing capabilities needed for cutting-edge AI projects.
FAQs
Q: What is a context window in AI?
A: A context window refers to the amount of information an AI model can consider at once. A larger context window allows the AI to understand complex relationships and nuances within the information it's processing.
Q: How much bigger is the context window in Gemini 1.5 Pro compared to its predecessor?
A: The standard version of Gemini 1.5 Pro boasts a 128,000 token window, which is four times larger than the 32,000 token window of Gemini 1.0.
Q: Can Gemini 1.5 Pro understand spoken language?
A: Yes, Gemini 1.5 Pro features native audio and speech support, allowing it to understand and respond to spoken language.
Q: Is Gemini 1.5 Pro available in my country?
A: The launch of Gemini 1.5 Pro in over 180 countries marks a significant step towards democratizing AI technology. It's likely available in your country, but you can confirm on Google's official website.
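As a hedged illustration of the developer features mentioned above (system instructions and JSON mode), the sketch below uses the google-generativeai Python SDK; the API key, model name, prompt, and configuration values are placeholders, and the SDK's exact surface may differ by version.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# System instructions steer the model's behavior; JSON mode constrains the
# output to structured JSON via the response MIME type.
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    system_instruction="You are a concise assistant that answers in JSON.",
    generation_config={"response_mime_type": "application/json"},
)

response = model.generate_content(
    "List three use cases enabled by a one-million-token context window."
)
print(response.text)
```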
Hex-LLM: High-Efficiency LLM Serving on Vertex AI with TPUs

Google Cloud aims to provide highly efficient and cost-optimized ML workflow recipes through Vertex AI Model Garden, which currently offers over 150 first-party, open, and third-party foundation models. Last year, Google debuted the popular open-source LLM serving stack vLLM on GPUs in Vertex AI Model Garden, and serving deployments have grown rapidly since. Google is now excited to present Hex-LLM, Vertex AI Model Garden's high-efficiency LLM serving solution with XLA on TPUs.
Hex-LLM, Vertex AI's proprietary LLM serving framework, was designed and optimized for Google Cloud TPU hardware, which is available as part of AI Hypercomputer. It combines state-of-the-art LLM serving technologies, such as paged attention and continuous batching, with internal XLA/TPU-specific optimizations, making it Google's latest low-cost, high-efficiency LLM serving solution on TPU for open-source models. Hex-LLM can now be deployed in Vertex AI Model Garden with one click, from a notebook, or in the playground. Google is eager to learn how your LLM serving workloads can benefit from Hex-LLM and Cloud TPUs.
Design and benchmarks
Inspired by several popular open-source projects, such as vLLM and FlashAttention, Hex-LLM combines the latest LLM serving technologies with custom optimizations tailored to XLA/TPU.
Hex-LLM's primary optimizations include:
Token-based continuous batching, which ensures the highest possible memory utilization for the KV cache.
A complete rewrite of the PagedAttention kernel with XLA/TPU-specific optimizations (a simplified paged KV-cache sketch follows this list).
Flexible and composable data-parallelism and tensor-parallelism strategies, with specialized weight-sharding optimizations, to run large models efficiently across multiple TPU chips.
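Here is a simplified, hedged Python sketch of the paged KV-cache idea behind PagedAttention (not Hex-LLM's kernel): the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks, so memory is allocated on demand as sequences grow under continuous batching.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size blocks allocated on demand per sequence."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full
            table.append(self.free_blocks.pop())    # grab a new physical block
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=1024, block_size=16)
block, offset = cache.append_token(seq_id=0)  # where to write this token's K/V
```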
Furthermore, Hex-LLM supports a wide range of popular dense and sparse LLM models, including:
Gemma 2B and 7B
Gemma 2 9B and 27B
Llama 2 7B, 13B, and 70B
Llama 3 8B and 70B
Mistral 7B and Mixtral 8x7B
As the LLM field continues to evolve, Google is committed to bringing the latest foundation models and increasingly advanced serving technology to Hex-LLM.
Hex-LLM delivers low latency, high throughput, and competitive performance. The metrics measured in Google's benchmark experiments are as follows:
TPS (tokens per second) measures the average number of input tokens the LLM server receives each second. It characterizes the traffic of an LLM server more precisely than QPS (queries per second), which is used to measure the traffic of a generic server.
Throughput measures how many tokens the server can generate in a given period of time at a given TPS. It is a key indicator of the system's capacity to handle many concurrent requests.
Latency measures the average time to generate a single output token at a given TPS, including both queueing and processing time spent on the server side for each request.
Keep in mind that high throughput and low latency are usually in tension. As TPS rises, both throughput and latency should increase; throughput eventually saturates at some TPS, while latency keeps growing as TPS rises further. Given a specific TPS, one can therefore record the server's throughput and latency, and the resulting throughput-latency plot across different TPS levels gives an accurate assessment of the LLM server's performance.
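As a hedged sketch of how these three numbers can be computed from raw benchmark records, the helper below takes per-request token counts and timings; the record fields and sample values are illustrative, not from the benchmark described here.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    input_tokens: int
    output_tokens: int
    send_time_s: float    # when the request was sent
    finish_time_s: float  # when the last output token arrived


def summarize(records: list[RequestRecord]) -> dict:
    wall_clock = max(r.finish_time_s for r in records) - min(
        r.send_time_s for r in records
    )
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    # Per-output-token latency averaged over requests (queueing + processing).
    per_token_latency = sum(
        (r.finish_time_s - r.send_time_s) / r.output_tokens for r in records
    ) / len(records)
    return {
        "tps_in": total_in / wall_clock,           # input tokens/s (traffic)
        "throughput_out": total_out / wall_clock,  # output tokens/s
        "latency_ms_per_token": 1000 * per_token_latency,
    }


records = [RequestRecord(512, 128, 0.0, 1.6), RequestRecord(256, 64, 0.5, 1.9)]
print(summarize(records))
```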
Hex-LLM is benchmarked on a sample of the ShareGPT dataset, a commonly used dataset containing prompts and outputs of varying lengths.
The following charts show the performance of the Llama 2 70B (int8 weight-quantized) and Gemma 7B models on eight TPU v5e chips:
Gemma 7B model: 6 ms per output token at the lowest TPS, and 6250 output tokens per second at the highest TPS.
Llama 2 70B int8 model: 26 ms per output token at the lowest TPS, and 1510 output tokens per second at the highest TPS.
Image credit to Google Cloud
Get started in Vertex AI Model Garden today
The Hex-LLM TPU serving container is integrated into Vertex AI Model Garden. Users can access this serving technology through the playground, one-click deployment, or Colab Enterprise notebook examples for a range of models.
The Vertex AI Model Garden playground is a pre-deployed Vertex AI Prediction endpoint integrated into the user interface. Enter the prompt text and, optionally, the request's arguments, click the SUBMIT button, and you will quickly receive the model's response. Have a go at it with Gemma!
The simplest way to deploy a custom Vertex Prediction endpoint with Hex-LLM is one-click deployment from the model card UI:
Go to the model card page and select “DEPLOY.”
For the model variation of interest, choose the TPU v5e machine type ct5lp-hightpu-*t for deployment, then select “DEPLOY” at the bottom to start the deployment process. You will receive two email notifications: one when the model is uploaded and another when the endpoint is ready.
For maximum flexibility, users can use the Colab Enterprise notebook examples to deploy a Vertex Prediction endpoint with Hex-LLM via the Vertex AI Python SDK:
Go to the model card page and select “OPEN NOTEBOOK.”
Select the Vertex Serving notebook. This opens the Colab Enterprise notebook.
Use the notebook to deploy with Hex-LLM and send prediction requests to the endpoint.
This path lets users tailor the deployment to their needs; for example, they can deploy with multiple replicas to handle a high volume of anticipated traffic. A minimal SDK-based sketch follows.
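Below is a hedged sketch of such an SDK-based deployment using the Vertex AI Python SDK; the Hex-LLM container URI, model identifier, environment variables, and request payload are placeholders standing in for the values provided in the notebook, not verified settings.

```python
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="your-project", location="us-west1")  # placeholders

# Upload a model backed by the Hex-LLM serving container (URI is a placeholder).
model = aiplatform.Model.upload(
    display_name="gemma-7b-hexllm",
    serving_container_image_uri="us-docker.pkg.dev/.../hex-llm-serve:latest",  # placeholder
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b",  # placeholder model identifier
    },
)

# Deploy on a TPU v5e machine type; replica counts control serving capacity.
endpoint = model.deploy(
    machine_type="ct5lp-hightpu-8t",
    min_replica_count=1,
    max_replica_count=2,
)

# Request payload shape is illustrative; the notebook defines the actual schema.
prediction = endpoint.predict(
    instances=[{"prompt": "What is AI Hypercomputer?", "max_tokens": 64}]
)
print(prediction.predictions)
```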
Read more on govindhtech.com