#maxtext
govindhtech · 8 months ago
Text
AI Hypercomputer’s New Resource Hub & Speed Enhancements
Google AI Hypercomputer
Updates to the AI Hypercomputer software stack include a new resource hub, faster training and inference, and more.
AI has more promise than ever before, and infrastructure is essential to its advancement. Google Cloud’s supercomputing architecture, AI Hypercomputer, is built on open software, performance-optimized hardware, and adaptable consumption models. When combined, they provide outstanding performance and efficiency, scalability and resilience, and the freedom to select products at each tier according to your requirements.
A unified hub for AI Hypercomputer resources, enhanced resiliency at scale, and significant improvements to training and inference performance are all being announced today.
GitHub resources for AI Hypercomputer
The open software layer of AI Hypercomputer supports leading ML frameworks and orchestration options, and also offers reference implementations and workload optimizations to improve time-to-value for your particular use case. Google Cloud is launching the AI Hypercomputer GitHub organization to make the advancements in its open software stack easily accessible to developers and practitioners. This is a central location where you can find reference implementations like MaxText and MaxDiffusion, orchestration tools like xpk (the Accelerated Processing Kit for workload management and cluster creation), and GPU performance recipes on Google Cloud. Google Cloud encourages you to join in as it expands this list and adapts these resources to a rapidly changing environment.
A3 Mega VMs are now supported by MaxText
MaxText is an open-source reference implementation for large language models (LLMs) that delivers excellent performance and scalability. Performance-optimized LLM training examples are now available for A3 Mega VMs, which are powered by NVIDIA H100 Tensor Core GPUs and provide a 2x increase in GPU-to-GPU network bandwidth over A3 VMs. Google Cloud worked closely with NVIDIA to optimize JAX and XLA so that collective communication and computation on GPUs can overlap, and has included example scripts and optimized model configurations for GPUs with the relevant XLA flags enabled.
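As a rough sketch of how such flags are applied, the snippet below sets XLA_FLAGS before JAX initializes. The specific flag names are assumptions that vary across XLA releases, so treat them as placeholders and consult the published MaxText A3 Mega configurations for the authoritative set.

```python
# Minimal sketch: enabling XLA GPU flags that help overlap collective
# communication with computation. Flag names are version-dependent and
# shown here as illustrative placeholders only.
import os

xla_flags = [
    "--xla_gpu_enable_latency_hiding_scheduler=true",   # schedule collectives behind compute
    "--xla_gpu_enable_pipelined_all_gather=true",        # assumed flag name
    "--xla_gpu_enable_pipelined_reduce_scatter=true",    # assumed flag name
]
os.environ["XLA_FLAGS"] = " ".join(xla_flags)

# XLA_FLAGS must be set before JAX initializes its backends.
import jax
print(jax.devices())
```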
As the number of VMs in the cluster grows, MaxText on A3 Mega VMs delivers training performance that scales almost linearly, as demonstrated with Llama 2 70B pre-training.
Moreover, FP8 mixed-precision training on A3 Mega VMs can be used to further increase hardware utilization and acceleration. Google Cloud added FP8 support to MaxText through Accurate Quantized Training (AQT), the quantization library that also powers INT8 mixed-precision training on Cloud TPUs.
Results on dense models show that FP8 training with AQT can deliver up to 55% higher effective model FLOP utilization (EMFU) than bf16.
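EMFU compares the FLOPs a training step actually achieves against the hardware's peak. Below is a minimal sketch using the common 6 · params · tokens approximation for dense transformers; every concrete number in it is an assumed placeholder, not a measured result.

```python
# Rough sketch of (effective) model FLOP utilization for a dense transformer,
# using the 6 * params * tokens approximation for training FLOPs per step.
# All concrete numbers below are illustrative assumptions, not measurements.

def model_flop_utilization(params: float, tokens_per_step: float,
                           step_time_s: float, peak_flops_per_s: float) -> float:
    achieved_flops_per_s = 6.0 * params * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

mfu = model_flop_utilization(
    params=7e9,                 # assumed 7B-parameter model
    tokens_per_step=2e6,        # assumed global tokens per step
    step_time_s=10.0,           # assumed measured step time
    peak_flops_per_s=1.6e16,    # assumed aggregate peak for the cluster
)
print(f"MFU ≈ {mfu:.1%}")       # ≈ 52.5% with these placeholder numbers
```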
Reference implementations and kernels for MoEs
Most mixture-of-experts (MoE) use cases benefit from the predictable resource usage of a small, fixed number of experts. For some applications, however, it is more important to be able to leverage more experts to produce richer responses. To give you this flexibility, Google Cloud has added both "capped" and "no-cap" MoE implementations to MaxText so you can select the one that best suits your model architecture: capped MoE models offer predictable performance, while no-cap models allocate resources dynamically for maximum efficiency.
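For intuition, here is a minimal top-k routing sketch (not the MaxText implementation) that contrasts the two modes: with a capacity set, overflow tokens are dropped for predictable cost; without one, every token reaches its chosen experts.

```python
# Minimal sketch of capped vs. no-cap top-k MoE routing; illustrative only.
import jax
import jax.numpy as jnp

def route(router_logits, k=2, capacity=None):
    """router_logits: [tokens, experts]. Returns (expert_ids, gate_weights, keep_mask)."""
    gates = jax.nn.softmax(router_logits, axis=-1)
    gate_weights, expert_ids = jax.lax.top_k(gates, k)            # [tokens, k]
    if capacity is None:                                          # "no-cap" routing
        keep_mask = jnp.ones_like(gate_weights, dtype=bool)
    else:                                                         # "capped" routing
        one_hot = jax.nn.one_hot(expert_ids, router_logits.shape[-1])  # [tokens, k, experts]
        # Position of each token within its expert's queue, in token order.
        position_in_expert = jnp.cumsum(one_hot, axis=0) * one_hot
        keep_mask = position_in_expert.sum(-1) <= capacity        # drop overflow tokens
    return expert_ids, gate_weights * keep_mask, keep_mask

logits = jax.random.normal(jax.random.PRNGKey(0), (16, 8))        # 16 tokens, 8 experts
ids, w, kept = route(logits, k=2, capacity=4)
print(ids.shape, w.shape, kept.mean())
```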
To speed up MoE training even more, Google Cloud has publicly released Pallas kernels optimized for block-sparse matrix multiplication on Cloud TPUs. Pallas is an extension to JAX that gives fine-grained control over code generated for XLA devices such as GPUs and TPUs; at the moment, the block-sparse matrix multiplication kernels are only available for TPUs. These kernels provide high-performance building blocks for training your MoE models and can be used with both JAX and PyTorch.
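The block-sparse matmul kernels themselves live in the open-source repositories; as a flavor of the Pallas programming model they are written in, here is a minimal element-wise kernel (illustrative only).

```python
# A minimal Pallas kernel, just to illustrate the programming model; the
# actual block-sparse matmul kernels referenced above are far more involved.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs point to blocks in fast on-chip memory; this tiny kernel operates
    # on the whole arrays at once, so no grid/BlockSpec is needed.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,   # lets the sketch run on CPU; drop on real TPUs/GPUs
    )(x, y)

x = jnp.arange(1024, dtype=jnp.float32)
print(add(x, x)[:4])
```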
With a fixed per-device batch size, testing with the no-cap MoE model (Mixtral 8x7B) shows nearly linear scaling as the number of accelerators grows. Increasing the number of experts in the base configuration along with the number of accelerators also yields almost linear scaling, which is indicative of performance on models with higher sparsity.
Monitoring large-scale training
Large fleets of accelerators that must cooperate on a single training job can make MLOps more difficult. You may find yourself asking, "Why did this one device segfault?" or "Did host-transfer latencies spike for a reason?" Monitoring large-scale training jobs with the right metrics is essential for maximizing resource utilization and improving overall ML Goodput.
To simplify this important part of your MLOps charter, Google has published a reference monitoring recipe. It helps you create a Cloud Monitoring dashboard in your Google Cloud project that displays useful statistical measures, such as average or maximum CPU utilization, so you can detect anomalies in the configuration and take corrective action.
Cloud TPU v5p SparseCore is now GA
Recommender and embedding-based models need high-performance random memory access to make use of their embeddings. SparseCore, the TPU's hardware embedding accelerator, lets you build more powerful and efficient recommendation systems. With four dedicated SparseCores per chip, Cloud TPU v5p runs DLRM-v2 up to 2.5x faster than its predecessor.
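To make the access pattern concrete, here is a minimal sketch (plain JAX, not SparseCore-specific) of the gather-and-pool embedding lookup that this hardware accelerates; the table size and batch shape are illustrative.

```python
# Sketch of the embedding-lookup pattern SparseCore accelerates: a large table
# gathered at random row indices, then pooled per example. Shapes are illustrative.
import jax
import jax.numpy as jnp

vocab, dim = 100_000, 128
table = jax.random.normal(jax.random.PRNGKey(0), (vocab, dim), dtype=jnp.bfloat16)

# A batch of 8 examples, each with 4 (padded) categorical feature IDs.
ids = jax.random.randint(jax.random.PRNGKey(1), (8, 4), 0, vocab)

rows = jnp.take(table, ids, axis=0)   # random-access gather: [8, 4, 128]
pooled = rows.sum(axis=1)             # pool per example: [8, 128]
print(pooled.shape)
```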
Enhancing the performance of LLM inference
Lastly, to improve LLM inference performance, Google implemented ragged attention kernels and KV cache quantization in JetStream, an open-source, throughput- and memory-optimized engine for LLM inference. Combined, these improvements can increase inference performance on Cloud TPU v5e by up to 2x.
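As a rough illustration of the KV-cache quantization idea (a minimal sketch, not JetStream's actual implementation), the snippet below stores keys and values as int8 with a per-head scale and dequantizes them when the cache is read.

```python
# Minimal sketch of int8 KV-cache quantization: store keys/values as int8 plus
# a per-head scale, and dequantize on the fly when attention reads the cache.
import jax.numpy as jnp

def quantize_kv(kv):                        # kv: [seq, heads, head_dim]
    scale = jnp.max(jnp.abs(kv), axis=(0, 2), keepdims=True) / 127.0
    scale = jnp.maximum(scale, 1e-8)        # avoid divide-by-zero on all-zero heads
    q = jnp.clip(jnp.round(kv / scale), -128, 127).astype(jnp.int8)
    return q, scale                         # cache is ~2-4x smaller than bf16/f32

def dequantize_kv(q, scale):
    return q.astype(jnp.bfloat16) * scale.astype(jnp.bfloat16)

kv = jnp.ones((1024, 8, 128), dtype=jnp.bfloat16)
q, s = quantize_kv(kv)
print(q.dtype, dequantize_kv(q, s).dtype)
```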
Boosting your AI adventure
Each part of the AI Hypercomputer serves as a foundation for the upcoming AI generation, from expanding the possibilities of model training and inference to improving accessibility through a central resource repository.
Read more on Govindhtech.com
hackernewsrobot · 1 year ago
Text
Maxtext: A simple, performant and scalable Jax LLM
https://github.com/google/maxtext
gcnerr · 8 years ago
Conversation
–– ( * &&. IMESSAGE TO 👽 !
MAX: listen
MAX: dont ignore me
MAX: i can see you looking at your phone
MAX: are you really that mad ??
treasureslowlyfade · 8 years ago
Conversation
Text || Max & Athanasia
Max: How are my girls?
govindhtech · 1 year ago
Text
MaxDiffusion: Efficient inference from diffusion models
AI Inference with Google Cloud GPUs and TPUs
In the rapidly evolving field of artificial intelligence, there is a growing need for high-performance, low-cost AI inference (serving). JetStream and MaxDiffusion are two new open-source software offerings that Google introduced this week.
Starting with Cloud TPUs, JetStream is a new inference engine for XLA devices. Purpose-built for large language models (LLMs), JetStream delivers up to three times more inferences per dollar than earlier Cloud TPU inference engines, a major advance in both performance and cost efficiency. JetStream supports JAX models via MaxText, Google's highly scalable, high-performance reference implementation for LLMs that users can fork to accelerate their development, and PyTorch models via PyTorch/XLA.
MaxDiffusion, the equivalent of MaxText for latent diffusion models, simplifies training and serving diffusion models optimized for high performance on XLA devices, starting with Cloud TPUs.
Furthermore, Google is pleased to provide the most recent MLPerf Inference v4.0 performance results, which highlight the strength and adaptability of Google Cloud’s A3 virtual machines (VMs) driven by NVIDIA H100 GPUs.
JetStream: Cost-effective, high-performance LLM inference
LLMs are at the vanguard of the AI revolution, powering a broad range of applications including natural language understanding, text generation, and language translation. Google developed JetStream, an inference engine that offers up to three times more inferences per dollar than earlier Cloud TPU inference engines, to lower LLM inference costs for its customers.
JetStream includes advanced performance optimizations such as sliding-window attention, continuous batching, and int8 quantization for weights, activations, and key-value (KV) caching. And JetStream supports your favourite framework, whether you're using PyTorch or JAX. For maximum cost-efficiency and speed, Google provides optimized MaxText and PyTorch/XLA versions of popular open models, such as Gemma and Llama, to further accelerate your LLM inference workflows.
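For intuition about one of these optimizations, here is a minimal sketch of a sliding-window causal mask; the window size and shapes are illustrative, and JetStream's production kernels are far more sophisticated.

```python
# Minimal sketch of a sliding-window causal attention mask: each query position
# attends only to the most recent `window` key positions.
import jax.numpy as jnp

def sliding_window_causal_mask(seq_len: int, window: int) -> jnp.ndarray:
    q_pos = jnp.arange(seq_len)[:, None]
    k_pos = jnp.arange(seq_len)[None, :]
    causal = k_pos <= q_pos                   # no attending to the future
    within_window = q_pos - k_pos < window    # limit lookback to the window
    return causal & within_window             # [seq_len, seq_len] boolean mask

mask = sliding_window_causal_mask(seq_len=8, window=4)
print(mask.astype(int))
```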
JetStream provides up to 4,783 tokens/second for open models, such as Gemma in MaxText and Llama 2 in PyTorch/XLA, on Cloud TPU v5e-8.
Because of JetStream's speed and efficiency, Google Cloud customers pay less for inference, making LLM inference more accessible and affordable.
Customers such as Osmos are using JetStream to reduce the time their LLM inference tasks take:
“At Osmos, we have built an AI-powered data transformation engine to help companies scale their business relationships by automating data processing. Intelligence must be applied to every row of data to map, validate, and transform the often messy, non-standard data received from customers and business partners into high-quality, usable data. Doing this requires high-performance, scalable, and affordable AI infrastructure for inference, training, and fine-tuning.
Google Cloud TPU v5e
This is why we chose Cloud TPU v5e with MaxText, JAX, and JetStream for our end-to-end AI workflows. With Google Cloud, we were able to quickly and easily use JetStream to deploy Google's latest Gemma open model for inference on Cloud TPU v5e, and MaxText to fine-tune the model on billions of tokens. Thanks to Google's AI-optimized hardware and software stack, we see results in a matter of hours rather than days.”
Google is driving the next wave of AI applications by giving researchers and developers an open-source, robust, and affordable framework for LLM inference. Whatever your experience level with LLMs or AI, JetStream can help you explore new possibilities in natural language processing and accelerate your journey.
Experience the future of LLM inference with JetStream. To learn more and get started on your next LLM project, visit Google's GitHub repository. Google is committed to the long-term development and maintenance of JetStream on GitHub and through Google Cloud Customer Care, and invites the community to contribute and collaborate to further advance the state of the art.
MaxDiffusion
Diffusion models are revolutionizing computer vision, much as LLMs transformed natural language processing. Google developed MaxDiffusion, a collection of open-source diffusion-model reference implementations, to lower the cost of deploying these models for its customers. Written in JAX, these implementations are highly efficient, scalable, and customizable; think of MaxDiffusion as the MaxText of computer vision.
MaxDiffusion offers high-performance implementations of diffusion-model building blocks, including high-throughput image data loading, convolutions, and cross attention, and is designed to be highly flexible and customizable. Whether you're a developer looking to bring state-of-the-art gen AI capabilities into your products or a researcher pushing the limits of image generation, MaxDiffusion provides the framework you need to succeed.
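As a flavor of one such building block, here is a minimal cross-attention sketch in plain JAX; shapes are illustrative and this is not MaxDiffusion's optimized implementation.

```python
# Minimal sketch of the cross-attention building block in diffusion U-Nets:
# image-latent queries attend over text-embedding keys/values.
import jax
import jax.numpy as jnp

def cross_attention(q, k, v):
    # q: [batch, q_len, dim] from image latents; k, v: [batch, kv_len, dim] from text.
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("bqd,bkd->bqk", q, k) * scale
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bqk,bkd->bqd", weights, v)

key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (2, 64, 320))     # 64 latent positions (illustrative)
kv = jax.random.normal(key, (2, 77, 320))    # 77 text tokens (CLIP-style length)
print(cross_attention(q, kv, kv).shape)
```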
Utilizing the full potential of Cloud TPUs' speed and scalability, the MaxDiffusion implementation of the new SDXL-Lightning model delivers 6 images/s on Cloud TPU v5e-4, and throughput scales linearly to 12 images/s on Cloud TPU v5e-8.
Like MaxText and JetStream, MaxDiffusion is also economical: producing 1,000 images on Cloud TPU v5e-4 or Cloud TPU v5e-8 costs only $0.10.
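The per-image cost falls straight out of throughput and the hourly accelerator rate; the back-of-the-envelope sketch below uses a placeholder price, not a quoted Google Cloud rate, so plug in your own on-demand or committed-use figure.

```python
# Back-of-the-envelope cost estimate from the throughput figures above.
# The hourly price is a placeholder assumption, not a quoted Google Cloud rate.
images = 1000
throughput_images_per_s = 6.0    # Cloud TPU v5e-4 figure quoted above
hourly_price_usd = 2.40          # assumed rate, illustrative only

seconds = images / throughput_images_per_s
cost = seconds / 3600 * hourly_price_usd
print(f"{seconds:.0f} s of serving ≈ ${cost:.2f} for {images} images")
```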
Google Cloud customers such as Codeway are improving cost-effectiveness for diffusion-model inference at scale:
“At Codeway, we develop popular apps and games used by over 115 million users in 160 countries worldwide. For instance, our AI-powered app “Wonder” turns words into digital artworks, while “Facedance” makes faces dance with a variety of entertaining animations. Bringing AI to millions of people requires a highly scalable and cost-effective inference infrastructure. Using Cloud TPU v5e, we were able to serve diffusion models 45% faster and handle 3.6 times more queries per hour than with competing inference systems. This results in considerable infrastructure cost savings at our scale and enables us to economically reach even more consumers with AI-powered products.”
MaxDiffusion offers a scalable, flexible, and high-performance foundation for image generation. Whether you're an experienced computer-vision practitioner or just getting started with image generation, MaxDiffusion can help you along the way.
To find out more about MaxDiffusion and get started on your next creative project, head over to Google's GitHub repository.
Strong MLPerf Inference v4.0 results for A3 VMs
Google announced the general availability of A3 virtual machines in August 2023. Powered by eight NVIDIA H100 Tensor Core GPUs in a single VM, A3s are built to train and serve demanding workloads such as LLMs. A3 Mega, also powered by NVIDIA H100 GPUs and doubling A3's GPU-to-GPU networking bandwidth, will be generally available next month.
Google submitted 20 results using A3 VMs for the MLPerf Inference v4.0 benchmark, spanning seven models, including the new Stable Diffusion XL and Llama 2 (70B) benchmarks:
RetinaNet (Server and Offline)
3D U-Net, 99% and 99.9% accuracy targets (Offline)
BERT, 99% and 99.9% accuracy targets (Server and Offline)
DLRM v2, 99.9% accuracy target (Server and Offline)
GPT-J, 99% and 99.9% accuracy targets (Server and Offline)
Stable Diffusion XL (Server and Offline)
Llama 2 (70B), 99% and 99.9% accuracy targets (Server and Offline)
All results fell within 0-5% of the peak performance shown in NVIDIA's submissions. These outcomes demonstrate how closely Google Cloud and NVIDIA have collaborated to deliver workload-optimized, end-to-end solutions for gen AI and LLMs.
Powering the future of AI with Google Cloud TPUs and NVIDIA GPUs
With software advances such as JetStream, MaxText, and MaxDiffusion, alongside hardware improvements in Google Cloud TPUs and NVIDIA GPUs, Google's AI inference innovations enable customers to develop and scale AI applications. With JetStream, developers can reach new levels of LLM inference performance and cost-efficiency, opening up new possibilities for natural-language applications. With MaxDiffusion, developers and researchers can tap the full potential of diffusion models to generate images faster. And Google's strong MLPerf Inference v4.0 results on A3 VMs powered by NVIDIA H100 Tensor Core GPUs demonstrate the capability and adaptability of Cloud GPUs.
Read more on Govindhtech.com
govindhtech · 1 year ago
Text
Gemma open models now available on Google Cloud
Google today unveiled Gemma, a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Google Cloud customers can start customizing and building with Gemma open models in Vertex AI, and deploying them on Google Kubernetes Engine (GKE), today. The launch of Gemma and these enhanced platform capabilities is Google's next step in making AI more open and accessible to developers on Google Cloud.
Gemma open models
The Gemma family of open models is composed of lightweight, state-of-the-art models built from the same research and technology as the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is inspired by Gemini, and its name comes from the Latin gemma, meaning "precious stone." Along with the model weights, Google is releasing tools to support developer innovation, foster collaboration, and guide responsible use of Gemma models.
Gemma is currently accessible via Google Cloud
Gemma models share technical and infrastructure components with Gemini, Google's most capable AI model. This allows Gemma models to achieve best-in-class performance for their sizes compared to other open models. Google is releasing weights in two sizes, Gemma 2B and Gemma 7B, with pre-trained and instruction-tuned variants of each to support research and development.
Gemma supports tools that Google Cloud developers love and use today, including JAX, PyTorch, Keras 3.0, Hugging Face Transformers, and Colab and Kaggle notebooks. Gemma open models can run on a laptop, a workstation, or Google Cloud, where developers can customize them in Vertex AI and run them on GKE. Google has also worked with NVIDIA to optimize Gemma for NVIDIA GPUs for industry-leading performance.
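As one example of that framework support, here is a hedged sketch of loading the base Gemma 2B checkpoint with Hugging Face Transformers; it assumes you have accepted the Gemma license on the Hugging Face Hub and are authenticated, and the exact model ID differs for the instruction-tuned variants.

```python
# Sketch: running the base Gemma 2B checkpoint locally with Hugging Face
# Transformers. Assumes license acceptance and `huggingface-cli login`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"                        # base (pre-trained) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The AI Hypercomputer architecture is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```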
Gemma is available worldwide starting today. Here's what you need to know:
Google is releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size ships with pre-trained and instruction-tuned variants.
A new Responsible Generative AI Toolkit provides guidance and essential tools for creating safer AI applications with Gemma.
Google is providing toolchains for inference and supervised fine-tuning (SFT) across all major frameworks, including JAX, PyTorch, and TensorFlow through native Keras 3.0.
Ready-to-use Colab and Kaggle notebooks, alongside integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM, make it easy to get started with Gemma.
Pre-trained and fine-tuned Gemma open models can be easily deployed on Vertex AI and Google Kubernetes Engine (GKE) and run on your laptop, workstation, or Google Cloud.
Industry-leading performance is ensured through optimization across multiple AI hardware platforms, such as NVIDIA GPUs and Google Cloud TPUs.
The terms of use permit responsible commercial usage and distribution for all organizations, regardless of size.
Unlocking Gemma’s potential in Vertex AI
Gemma joins more than 130 models in the Vertex AI Model Garden, including the Gemini 1.0 Pro, 1.0 Ultra, and 1.5 Pro models for which Google recently announced expanded access.
By using Gemma open models on Vertex AI, developers get an end-to-end ML platform that makes managing, tuning, and monitoring models easy and intuitive. With Vertex AI, builders can reduce operational overhead and focus on creating customized versions of Gemma that are tailored to their specific use cases.
For instance, developers can do the following with Vertex AI’s Gemma open models:
Create generative AI applications for simple tasks like Q&A, text generation, and summarization.
Utilize lightweight, customized models to facilitate research and development through experimentation and exploration.
Support low-latency, real-time generative AI use cases, such as streaming text
Developers can easily transform their customized models into scalable endpoints that support AI applications of any size with Vertex AI’s assistance.
Utilize Gemma open models on GKE to scale from prototype to production
GKE provides tools for building custom applications, from basic project prototyping to large-scale enterprise deployment. Developers can now deploy Gemma directly on GKE to build their gen AI applications, whether for testing model capabilities or constructing prototypes:
Use recognizable toolchains to deploy personalized, optimized models alongside apps in portable containers.
Adapt infrastructure configurations and model serving without having to provision or maintain nodes.
Quickly integrate AI infrastructure that can grow to accommodate even the most complex training and inference scenarios.
GKE offers autoscaling, reliable operations environments, and effective resource management. Furthermore, it facilitates the easy orchestration of Google Cloud AI accelerators, such as GPUs and TPUs, to improve these environments and speed up training and inference for generative AI model construction.
Cutting-edge performance at scale
Gemma open models share technical and infrastructure components with Gemini, Google's most powerful AI model currently available to the public. This allows Gemma 2B and 7B to achieve best-in-class performance for their sizes compared to other open models, and Gemma open models can run directly on a developer's laptop or desktop. Notably, Gemma surpasses significantly larger models on key benchmarks while meeting Google's rigorous standards for safe and responsible outputs. See the technical report for details on modeling techniques, dataset composition, and performance.
At Google, we think AI should benefit everyone. Google has a long history of developing innovations and releasing them to the public, including JAX, AlphaFold, AlphaCode, Transformers, TensorFlow, BERT, and T5. Google is thrilled to introduce a new generation of open models today to help researchers and developers build AI responsibly.
Begin using Gemma on Google Cloud right now
Working with Gemma models in Vertex AI and GKE on Google Cloud is now possible. Visit ai.google.dev/gemma to access quickstart guides and additional information about Gemma.
Read more on Govindhtech.com
gcnerr · 8 years ago
Conversation
–– ( * &&. IMESSAGE TO 👻 !
MAX: i dont really know how i feel right now
MAX: some woman just asked if i did bachelorette parties
MAX: and i was just in the queue at subway
MAX: i did have a footlong so, maybe she got some ideas