#LLM inference laptop
themorningnewsinformer · 24 hours ago
Text
Gigabyte Aorus Master 16 AI Laptop With Intel Ultra 9 Launched in India
Introduction: Gigabyte has officially unveiled its latest AI-powered gaming laptop, the Gigabyte Aorus Master 16, in India. Powered by the latest Intel Core Ultra 9 275HX CPU and the Nvidia GeForce RTX 5080 Laptop GPU based on the Blackwell architecture, this AI PC is designed to deliver top-tier gaming and AI computing performance. Gigabyte Aorus Master 16 Price and Availability in…
govindhtech · 1 month ago
Text
LM Studio Accelerates LLMs with CUDA 12.8 & GeForce RTX GPUs
Tumblr media
The latest desktop application update improves model controls, dev tools, and RTX GPU performance.
As AI use cases proliferate, developers and hobbyists want faster and more flexible ways to run large language models (LLMs), from document summarisation to custom software agents.
Running models locally on PCs with NVIDIA GeForce RTX GPUs enables high-performance inference, keeps data private, and gives users full control over how AI is deployed and integrated. Free applications like LM Studio let users explore and work with LLMs on their own hardware.
LM Studio is a popular local LLM inference application. Built on the fast llama.cpp runtime, it runs models entirely offline and can expose them as OpenAI-compatible API endpoints for custom workflows.
LM Studio 0.3.15 uses CUDA 12.8 to speed up model load and response times on RTX GPUs. The update also adds developer-focused features, including a revised system prompt editor and a “tool_choice” parameter for controlling tool use.
The latest LM Studio improvements boost both usability and speed, enabling higher throughput on RTX AI PCs. That means faster responses, snappier interactions, and better tools for building and integrating AI locally.
AI Acceleration Meets Common Apps
LM Studio is versatile enough for light experimentation as well as deep integration into custom workflows. Models can be used through the desktop chat interface or, in developer mode, called through an OpenAI-compatible API, which makes it easy to connect local LLMs to custom desktop agents or workflows in Visual Studio Code.
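As a concrete illustration, below is a minimal sketch of calling LM Studio's local server from Python with the standard OpenAI client. It assumes the server is running on LM Studio's default port (1234) and a model is already loaded in the app; the model identifier shown is a placeholder.

```python
# Minimal sketch: query LM Studio's local OpenAI-compatible server.
# Assumes the server is running on its default port (1234) and a model is
# already loaded in the app; no real API key is needed for a local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier shown in LM Studio
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise this note: CUDA graphs reduce CPU overhead."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```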
The popular markdown-based knowledge management tool Obsidian may be integrated with LM Studio. Local LLMs in LM Studio allow users to query their notes, produce content, and summarise research using community-developed plug-ins like Text Generator and Smart Connections. These plug-ins enable fast, private AI interactions without the cloud by connecting to LM Studio's local server.
Developer enhancements in 0.3.15 include an updated system prompt editor for longer or more sophisticated prompts and finer control over tool use through the “tool_choice” parameter.
The tool_choice parameter lets developers require a tool call, disable tool calls entirely, or let the model decide how to interact with external tools. That flexibility is useful for structured interactions, retrieval-augmented generation (RAG) workflows, and agent pipelines. Together, these upgrades improve LLM use cases for developers in both experimentation and production.
LM Studio supports Gemma, Llama 3, Mistral, and Orca open models and quantisation formats from 4-bit to full precision.
Common use cases include RAG, document-based Q&A, multi-turn chat with long context windows, and local agent pipelines. Running a local inference server on the NVIDIA RTX-accelerated llama.cpp runtime lets RTX AI PC users integrate local LLMs into their own applications with little effort.
LM Studio gives you full control, speed, and privacy on RTX, whether you're optimising a modest PC for efficiency or a big desktop for throughput.
Maximise RTX GPU Throughput
LM Studio's acceleration relies on the open-source runtime llama.cpp for consumer hardware inference. NVIDIA worked with LM Studio and llama.cpp to increase RTX GPU performance.
Important optimisations include:
CUDA graph enablement groups multiple GPU operations into a single CPU launch, reducing CPU overhead and boosting model throughput by up to 35%.
Flash attention CUDA kernels can boost throughput by up to 15% by improving how attention is handled in transformer models. This allows longer context windows without increasing memory or compute requirements.
Support for the newest RTX architectures: LM Studio’s CUDA 12.8 update covers all RTX AI PCs, from GeForce RTX 20 Series GPUs through NVIDIA Blackwell-class GPUs, so users can scale local AI workflows from laptops to high-end desktops.
LM Studio automatically changes to CUDA 12.8 with a compatible driver, improving model load times and performance.
These improvements speed up response times and smooth inference on all RTX AI PCs, from small laptops to large desktops and workstations.
Utilise LM Studio
LM Studio is free to download for Linux, macOS, and Windows. The recent 0.3.15 release and ongoing optimisations continue to improve local AI performance, customisation, and usability, making it faster, more versatile, and easier to use.
Developer mode exposes an OpenAI-compatible API, while the desktop chat interface lets users load and converse with models.
Start immediately by downloading and launching the latest LM Studio.
Click the magnifying glass icon on the left to open the Discover tab.
Select the Runtime tab on the left, find the CUDA 12 llama.cpp (Windows) runtime in the list of available runtimes, and click “Download and Install”.
After installation, choose CUDA 12 llama.cpp (Windows) from the Default Selections dropdown so that LM Studio uses this runtime.
To optimise CUDA execution in LM Studio, load a model and click the gear icon to the left of it to open Settings.
Drag the “GPU Offload” slider all the way to the right to offload every model layer to the GPU, then enable “Flash Attention” from the settings menu.
With these options enabled and configured, everything is in place for accelerated local inference on an NVIDIA GPU.
LM Studio also supports model presets, a range of quantisation formats, and developer options such as tool_choice for finer control over inference. For anyone who wants to contribute, the llama.cpp GitHub project is actively developed and continues to receive performance enhancements from the community and NVIDIA.
LM Studio 0.3.15 adds RTX 50-series GPU support and API tool use improvements
LM Studio 0.3.15 is now available as a stable release. It adds support for NVIDIA RTX 50-series GPUs (CUDA 12), UI changes including a revamped system prompt editor, the option to log each generated fragment to the API server logs, and improved tool use support in the API via the tool_choice parameter.
RTX 50-series GPU CUDA 12 compatibility
With its llama.cpp engines, LM Studio now supports RTX 50-series GPUs using CUDA 12.8 on Linux and Windows. This noticeably speeds up first-time model load on RTX 50-series GPUs. LM Studio will switch RTX 50-series systems to CUDA 12 automatically if a suitable NVIDIA driver is installed.
The minimum driver versions are:
Windows: 551.61 or newer
Linux: 550.54.14 or newer
If your driver meets these minimums and an RTX 50-series GPU is detected, LM Studio upgrades to CUDA 12 automatically; with an incompatible driver, it falls back to CUDA 11. Runtimes can also be managed manually from the runtime manager (Cmd+Shift+R on macOS, Ctrl+Shift+R on Windows).
New System Prompt Editor UI
System prompts are an effective way to shape model behaviour and can range from a few words to several pages. LM Studio 0.3.15 adds a larger visual editing space for long or complex prompts, while the compact prompt editor in the sidebar remains available.
Improved Tool Use API Support
The OpenAI-like REST API now supports tool_choice, which helps you configure model tool use. The tool_choice argument has three values:
“tool_choice”: “none” means the model will not call any tools.
“tool_choice”: “auto” lets the model decide whether to call tools.
“tool_choice”: “required” forces the model to output only tool calls (llama.cpp engines only).
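For illustration, here is a hedged sketch of a request that forces a tool call through the OpenAI-compatible API. The endpoint address, model name, and the get_weather tool are placeholders invented for the example.

```python
# Sketch: forcing a tool call through an OpenAI-compatible local API.
# The get_weather tool, endpoint, and model name are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Pune?"}],
    tools=tools,
    tool_choice="required",  # or "auto" / "none", as described above
)
print(response.choices[0].message.tool_calls)
```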
A bug in LM Studio's OpenAI-compatibility mode that prevented the chunk “finish_reason” from being set to “tool_calls” has also been fixed.
Preview Community Presets
Presets combine system prompts with model parameters.
Since LM Studio 0.3.15, you can download and share user-made presets online, and you can like and fork other users' presets.
Enable this feature under Settings > General > “Enable publishing and downloading presets”.
Once enabled, right-clicking a preset in the sidebar reveals a “Publish” button so you can share it with the community.
0 notes
amritatech56 · 6 months ago
Text
Red Hat’s Vision for an Open Source AI Future
The world of artificial intelligence (AI) is evolving at a lightning pace. As with any transformative technology, one question stands out: what's the best way to shape its future? At Red Hat, we believe the answer is clear: the future of AI is open source.
This isn’t just a philosophical stance; it’s a commitment to unlocking AI’s full potential by making it accessible, collaborative, and community-driven. Open source has consistently driven innovation in the technology world, from Linux and Kubernetes to OpenStack. These projects demonstrate how collaboration and transparency fuel discovery, experimentation, and democratized access to groundbreaking tools. AI, too, can benefit from this model.
Why Open Source Matters in AI
In a field where trust, security, and explainability are critical, AI must be open and inclusive. Red Hat is championing open source AI innovation to ensure its development remains a shared effort—accessible to everyone, not just organizations with deep pockets.
Through strategic investments, collaborations, and community-driven solutions, Red Hat is laying the groundwork for a future where AI workloads can run wherever they’re needed. Our recent agreement to acquire Neural Magic marks a significant step toward achieving this vision – Amrita Technologies.
Building the Future of AI on Three Pillars
1. Small, Specialized Models Drive AI Adoption
AI isn’t just about massive, resource-hungry models. The focus is shifting toward smaller, specialized models that deliver high performance with greater efficiency.
For example, IBM Granite 3.0, an open-source family of models licensed under Apache 2.0, demonstrates how smaller models (1–8 billion parameters) can run efficiently on a variety of hardware, from laptops to GPUs. Such accessibility fosters innovation and adoption, much like Linux did for enterprise computing.
Optimization techniques like sparsification and quantization further enhance these models by reducing size and computational demands while maintaining accuracy. These approaches make it possible to run AI workloads on diverse hardware, reducing costs and enabling faster inference. Neural Magic’s expertise in optimizing AI for GPU and CPU hardware will further strengthen our ability to bring this efficiency to AI.
2. Training Unlocks Business Advantage
While pre-trained models are powerful, they often lack understanding of a business’s specific processes or proprietary data. Customizing models to integrate unique business knowledge is essential to unlocking their true value.
To make this easier, Red Hat and IBM launched InstructLab, an open source project designed to simplify fine-tuning of large language models (LLMs). InstructLab lowers barriers to entry, allowing businesses to train models without requiring deep data science expertise. This initiative enables organizations to adapt AI for their unique needs while controlling costs and complexity.
3. Choice Unlocks Innovation
AI must work seamlessly across diverse environments, whether in corporate datacenters, the cloud, or at the edge. Flexible deployment options allow organizations to train models where their data resides and run them wherever makes sense for their use cases.
Just as Red Hat Enterprise Linux (RHEL) allowed software to run on any CPU without modification, our goal is to ensure AI models trained with RHEL AI can run on any GPU or infrastructure. By combining flexible hardware support, smaller models, and simplified training, Red Hat enables innovation across the AI lifecycle.
With Red Hat OpenShift AI, we bring together model customization, inference, monitoring, and lifecycle management. Neural Magic’s vision of efficient AI on hybrid platforms aligns perfectly with our mission to deliver consistent and scalable solutions – Amrita Technologies.
Welcoming Neural Magic to Red Hat
Neural Magic’s story is rooted in making AI more accessible. Co-founded by MIT researchers Nir Shavit and Alex Matveev, the company specializes in optimization techniques like pruning and quantization. Initially focused on enabling AI to run efficiently on CPUs, Neural Magic has since expanded its expertise to GPUs and generative AI, aligning with Red Hat’s goal of democratizing AI.
The cultural alignment between Neural Magic and Red Hat is striking. Just as Neural Magic strives to make AI more efficient and accessible, Red Hat's InstructLab team works to simplify model training for enterprise adoption. Together, we're poised to drive breakthroughs in AI innovation.
Open Source: Unlocking AI’s Potential
At Red Hat, we believe that openness unlocks the world's potential. By building AI on a foundation of open source standards, we can democratize access, accelerate innovation, and ensure AI benefits everyone. With Neural Magic joining Red Hat, we're excited to advance our mission of delivering open source AI solutions that empower businesses and communities to thrive in the AI era. Together, we're shaping a future where AI is open, inclusive, and transformative – Amrita Technologies.
jcmarchi · 1 year ago
Text
Google Introduces Gemma 2: Elevating AI Performance, Speed and Accessibility for Developers
New Post has been published on https://thedigitalinsider.com/google-introduces-gemma-2-elevating-ai-performance-speed-and-accessibility-for-developers/
Google has unveiled Gemma 2, the latest iteration of its open-source lightweight language models, available in 9 billion (9B) and 27 billion (27B) parameter sizes. This new version promises enhanced performance and faster inference compared to its predecessor, the Gemma model. Gemma 2, derived from Google’s Gemini models, is designed to be more accessible for researchers and developers, offering substantial improvements in speed and efficiency. Unlike the multimodal and multilingual Gemini models, Gemma 2 focuses solely on language processing. In this article, we’ll delve into the standout features and advancements of Gemma 2, comparing it with its predecessors and competitors in the field, highlighting its use cases and challenges.
Building Gemma 2
Like its predecessor, the Gemma 2 models are based on a decoder-only transformer architecture. The 27B variant is trained on 13 trillion tokens of mainly English data, while the 9B model uses 8 trillion tokens, and the 2.6B model is trained on 2 trillion tokens. These tokens come from a variety of sources, including web documents, code, and scientific articles. The model uses the same tokenizer as Gemma 1 and Gemini, ensuring consistency in data processing.
Gemma 2 is pre-trained using a method called knowledge distillation, where it learns from the output probabilities of a larger, pre-trained model. After initial training, the models are fine-tuned through a process called instruction tuning. This starts with supervised fine-tuning (SFT) on a mix of synthetic and human-generated English text-only prompt-response pairs. Following this, reinforcement learning with human feedback (RLHF) is applied to improve the overall performance.
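To make the distillation step concrete, here is a generic sketch of the kind of loss used when a student model learns from a teacher's output probabilities. This is a minimal PyTorch illustration of the general technique, not Google's actual training code.

```python
# Generic knowledge-distillation loss sketch (illustrative, not Google's code):
# the student is trained to match the teacher's temperature-softened
# distribution over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then measure how far the student is
    # from the teacher with a KL divergence over the vocabulary.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)  # standard scaling for softened targets

# Example shapes: (batch, sequence, vocab)
student = torch.randn(2, 8, 32000)
teacher = torch.randn(2, 8, 32000)
print(distillation_loss(student, teacher))
```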
Gemma 2: Enhanced Performance and Efficiency Across Diverse Hardware
Gemma 2 not only outperforms Gemma 1 in performance but also competes effectively with models twice its size. It’s designed to operate efficiently across various hardware setups, including laptops, desktops, IoT devices, and mobile platforms. Specifically optimized for single GPUs and TPUs, Gemma 2 enhances the efficiency of its predecessor, especially on resource-constrained devices. For example, the 27B model excels at running inference on a single NVIDIA H100 Tensor Core GPU or TPU host, making it a cost-effective option for developers who need high performance without investing heavily in hardware.
Additionally, Gemma 2 offers developers enhanced tuning capabilities across a wide range of platforms and tools. Whether using cloud-based solutions like Google Cloud or popular platforms like Axolotl, Gemma 2 provides extensive fine-tuning options. Integration with platforms such as Hugging Face, NVIDIA TensorRT-LLM, and Google’s JAX and Keras allows researchers and developers to achieve optimal performance and efficient deployment across diverse hardware configurations.
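As an illustration of the Hugging Face integration, the sketch below loads the 9B instruction-tuned Gemma 2 checkpoint with Transformers. It assumes the google/gemma-2-9b-it repository id, an accepted model licence on Hugging Face, and a GPU with enough memory.

```python
# Sketch: running the 9B instruction-tuned Gemma 2 checkpoint locally with
# Hugging Face Transformers. Assumes access to google/gemma-2-9b-it and a
# GPU with sufficient memory; adjust dtype/device for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain knowledge distillation in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```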
Gemma 2 vs. Llama 3 70B
When comparing Gemma 2 to Llama 3 70B, both models stand out in the open-source language model category. Google researchers claim that Gemma 2 27B delivers performance comparable to Llama 3 70B despite being much smaller in size. Additionally, Gemma 2 9B consistently outperforms Llama 3 8B in various benchmarks such as language understanding, coding, and solving math problems.
One notable advantage of Gemma 2 over Meta’s Llama 3 is its handling of Indic languages. Gemma 2 excels due to its tokenizer, which is specifically designed for these languages and includes a large vocabulary of 256k tokens to capture linguistic nuances. On the other hand, Llama 3, despite supporting many languages, struggles with tokenization for Indic scripts due to limited vocabulary and training data. This gives Gemma 2 an edge in tasks involving Indic languages, making it a better choice for developers and researchers working in these areas.
Use Cases
Based on the specific characteristics of the Gemma 2 model and its benchmark performance, we have identified some practical use cases for the model.
Multilingual Assistants: Gemma 2’s specialized tokenizer for various languages, especially Indic languages, makes it an effective tool for developing multilingual assistants tailored to these language users. Whether seeking information in Hindi, creating educational materials in Urdu, marketing content in Arabic, or research articles in Bengali, Gemma 2 empowers creators with effective language generation tools. A real-world example of this use case is Navarasa, a multilingual assistant built on Gemma that supports nine Indian languages. Users can effortlessly produce content that resonates with regional audiences while adhering to specific linguistic norms and nuances.
Educational Tools: With its capability to solve math problems and understand complex language queries, Gemma 2 can be used to create intelligent tutoring systems and educational apps that provide personalized learning experiences.
Coding and Code Assistance: Gemma 2’s proficiency in computer coding benchmarks indicates its potential as a powerful tool for code generation, bug detection, and automated code reviews. Its ability to perform well on resource-constrained devices allows developers to integrate it seamlessly into their development environments.
Retrieval Augmented Generation (RAG): Gemma 2’s strong performance on text-based inference benchmarks makes it well-suited for developing RAG systems across various domains. It supports healthcare applications by synthesizing clinical information, assists legal AI systems in providing legal advice, enables the development of intelligent chatbots for customer support, and facilitates the creation of personalized education tools.
Limitations and Challenges
While Gemma 2 showcases notable advancements, it also faces limitations and challenges primarily related to the quality and diversity of its training data. Despite its tokenizer supporting various languages, Gemma 2 lacks specific training for multilingual capabilities and requires fine-tuning to effectively handle other languages. The model performs well with clear, structured prompts but struggles with open-ended or complex tasks and subtle language nuances like sarcasm or figurative expressions. Its factual accuracy isn’t always reliable, potentially producing outdated or incorrect information, and it may lack common sense reasoning in certain contexts. While efforts have been made to address hallucinations, especially in sensitive areas like medical or CBRN scenarios, there’s still a risk of generating inaccurate information in less refined domains such as finance. Moreover, despite controls to prevent unethical content generation like hate speech or cybersecurity threats, there are ongoing risks of misuse in other domains. Lastly, Gemma 2 is solely text-based and does not support multimodal data processing.
The Bottom Line
Gemma 2 introduces notable advancements in open-source language models, enhancing performance and inference speed compared to its predecessor. It is well-suited for various hardware setups, making it accessible without significant hardware investments. However, challenges persist in handling nuanced language tasks and ensuring accuracy in complex scenarios. While beneficial for applications like legal advice and educational tools, developers should be mindful of its limitations in multilingual capabilities and potential issues with factual accuracy in sensitive contexts. Despite these considerations, Gemma 2 remains a valuable option for developers seeking reliable language processing solutions.
thebourisbox · 1 year ago
Text
NVIDIA Brings Generative AI to Millions, With Tensor Core GPUs, LLMs, Tools for RTX PCs and Workstations
  NVIDIA recently announced GeForce RTX™ SUPER desktop GPUs for supercharged generative AI performance, new AI laptops from every top manufacturer, and new NVIDIA RTX™-accelerated AI software and tools for both developers and consumers.
  Building on decades of PC leadership, with over 100 million of its RTX GPUs driving the AI PC era, NVIDIA is now offering these tools to enhance PC experiences with generative AI: NVIDIA TensorRT™ acceleration of the popular Stable Diffusion XL model for text-to-image workflows, NVIDIA RTX Remix with generative AI texture tools, NVIDIA ACE microservices and more games that use DLSS 3 technology with Frame Generation.
  AI Workbench, a unified, easy-to-use toolkit for AI developers, will be available in beta later this month. In addition, NVIDIA TensorRT-LLM (TRT-LLM), an open-source library that accelerates and optimizes inference performance of the latest large language models (LLMs), now supports more pre-optimized models for PCs. Accelerated by TRT-LLM, Chat with RTX, an NVIDIA tech demo also releasing this month, allows AI enthusiasts to interact with their notes, documents and other content.
  “Generative AI is the single most significant platform transition in computing history and will transform every industry, including gaming,” said Jensen Huang, founder and CEO of NVIDIA. “With over 100 million RTX AI PCs and workstations, NVIDIA is a massive installed base for developers and gamers to enjoy the magic of generative AI.”
  Running generative AI locally on a PC is critical for privacy, latency and cost-sensitive applications. It requires a large installed base of AI-ready systems, as well as the right developer tools to tune and optimize AI models for the PC platform. To meet these needs, NVIDIA is delivering innovations across its full technology stack, driving new experiences and building on the 500+ AI-enabled PC applications and games already accelerated by NVIDIA RTX technology.
  This is NVIDIA's first and very important step towards the vision of "LLM as Operating System" - a locally running, heavily optimized AI assistant that can deeply integrate with all your local files, but at the same time preserving privacy. NVIDIA is going local even before OpenAI!
Read the full article at: nvidianews.nvidia.com
govindhtech · 11 months ago
Text
Enhance Laptop Performance with Top 10 Uncensored LLMs
Tumblr media
The top 10 uncensored LLMs to run on a laptop. Run locally, uncensored LLMs protect privacy while promoting unbridled creativity, improved learning, and deeper insights.
1. Uncensored Llama 2
“Uncensored Llama 2” refers to modified versions of Llama 2, the large language model (LLM) developed by Meta AI, with the filters removed that normally stop LLMs from producing responses deemed offensive, biased, or dangerous.
George Sung and Jarrad Hope developed Llama 2 Uncensored from Meta's Llama 2 model using the methodology outlined by Eric Hartford. Because its responses are not filtered for moralizing or alignment, it can be applied to a wide range of situations.
It is quite adaptable for a range of purposes, including role-playing, and supports multiple quantization methods. It has seen 234.9K pulls on Ollama.
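For anyone who wants to try it, below is a minimal sketch using the Ollama Python client. It assumes the Ollama service is running locally and the llama2-uncensored tag has already been pulled; the prompt is only an example.

```python
# Sketch: chatting with the llama2-uncensored tag via the Ollama Python
# client. Assumes the Ollama service is running locally and the model has
# been pulled beforehand (e.g. `ollama pull llama2-uncensored`).
import ollama

response = ollama.chat(
    model="llama2-uncensored",
    messages=[{"role": "user", "content": "Write a short villain monologue."}],
)
print(response["message"]["content"])
```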
2. WizardLM Uncensored
WizardLM Uncensored is similar in spirit to Uncensored Llama 2 but is built on a different large language model, WizardLM: a model able to generate text, translate between languages, write various kinds of original content, and answer questions in an informed way.
WizardLM Uncensored is a 13B-parameter model based on Llama 2 that was trained with moralising and alignment responses excluded, which makes it versatile across a range of applications. With 23.1K pulls on Ollama, it offers several quantization settings that let users balance performance against memory use.
3. Uncensored Llama 3 8B Lexi
Lexi is an uncensored LLM built on Meta AI's Llama 3 8B architecture: a modified Llama 3 8B model trained without the usual content filtering.
Lexi Uncensored is based on Llama-3-8b-Instruct and is distributed under the terms of Meta's Llama 3 Community Licence Agreement. Because it is designed to comply with virtually any request, including unethical ones, it should only be used responsibly.
Since it has no built-in ethical restrictions, it can be used for general-purpose tasks but needs to be handled carefully. It has over 11,000 downloads on Hugging Face.
4. Llama 3 8B DarkIdol 2.1 Uncensored
Llama 3 8B DarkIdol 2.1 is another uncensored LLM based on Meta AI's Llama 3 8B architecture.
DarkIdol has been adapted for uses that include mobile phone applications. With over 12,000 downloads on Hugging Face, it focuses on role-playing scenarios and quick responses.
The model's performance is improved by merging several component models.
5. Uncensored Wizard Vicuna 7B
Wizard Vicuna 7B Uncensored was trained against LLaMA-7B using a subset of the original dataset. It can be used for both CPU and GPU inference because it offers many quantization options to suit different hardware, and Wizard Vicuna is also considered a strong option for hosting an LLM in the cloud.
Like the other entries here, this modified seven-billion-parameter WizardLM/Vicuna-style model removes the filters normally present in LLMs, allowing responses that may be deemed offensive, dangerous, or biased.
6. Dolphin Mistral
This is a modified version of the Mistral model from Mistral AI, known for its ability to produce imaginative text in many formats. The uncensored version removes the usual filters, so it will produce responses that other models would refuse, including potentially harmful or biased ones.
Dolphin Mistral is based on the Mistral v0.2 base model and fine-tuned on the Dolphin 2.9 dataset. Because it has no restrictions and provides a 32k context window, it suits sophisticated role-playing and long conversations.
7. SOLAR 10.7B Instruct v1.0 Uncensored
The SOLAR 10.7B Instruct model is designed for instruction-following tasks. It supports the GGUF format, which provides better tokenization and support for special tokens, and it is known for its range of quantization options and strong instruction following.
8. Uncensored Guanaco 7B
Guanaco 7B Uncensored is fine-tuned from Llama-2-7b on the Unfiltered Guanaco Dataset. It works well for both CPU and GPU inference, with alternative quantization methods for different hardware configurations.
9. Uncensored Frank 13B
Frank 13B Uncensored is a 13-billion-parameter model with the safety filters normally found in LLMs removed.
The model is styled after Frank Costello, the character from “The Departed,” and aims to provide a forum for open dialogue on a wide variety of subjects. It can be used for GPU or CPU inference and offers several quantization choices.
10. Uncensored Jordan 7B
Uncensored Jordan is a 7B parameter model that is intended for use in uncensored general-purpose applications. With many quantization settings supported, it is appropriate for users who require an unfiltered model for a variety of applications.
From general-purpose activities to specific role-playing scenarios, these models are appropriate for a variety of applications and offer a range of features. Before using these models on your laptop, make sure you meet the hardware requirements.
Read more on Govindhtech.com
govindhtech · 1 year ago
Text
Gemma 2 Is Now Accessible to Researchers and developers
Gemma 2 offers best-in-class performance, fast inference across a wide range of hardware, and simple integration with other AI tools.
AI has the capacity to solve some of the most important issues facing humanity, but only if everyone has access to the resources needed to build with it. That is why Google unveiled the Gemma family of lightweight, cutting-edge open models earlier this year, built from the same research and technology as the Gemini models. Google has continued to expand the Gemma family with CodeGemma, RecurrentGemma, and PaliGemma, each of which offers special capabilities for different AI tasks and is readily accessible through partners such as Hugging Face, NVIDIA, and Ollama.
Google is now formally making Gemma 2 available to researchers and developers worldwide. Available in 9 billion (9B) and 27 billion (27B) parameter sizes, Gemma 2 outperforms the first generation in both performance and inference efficiency and includes notable safety improvements. The 27B model delivers performance that, as recently as December, was only possible with proprietary models, making it competitive with models more than twice its size, and it can do so on a single NVIDIA H100 Tensor Core GPU or TPU host, greatly lowering deployment costs.
A new standard for open-model efficiency and performance: Google redesigned the architecture on which Gemma 2 is built, targeting both high performance and efficient inference. Here is what distinguishes it:
Outsized performance: Gemma 2 (27B) offers a competitive alternative to models over twice its size and is the best performing model in its size class. The 9B Gemma 2 model also delivers class-leading performance, outperforming Llama 3 8B and other open models in its size group. See the technical report for detailed performance breakdowns.
Efficiency and cost savings: The 27B Gemma 2 model is designed to run inference efficiently and accurately on a single Google Cloud TPU host, NVIDIA A100 80GB Tensor Core GPU, or NVIDIA H100 Tensor Core GPU, offering a cost-effective option that doesn't sacrifice performance. This makes AI deployments more affordable and widely accessible.
Lightning-fast inference on a variety of hardware: Gemma 2 is designed to operate incredibly quickly on a variety of hardware, including powerful gaming laptops, top-of-the-line desktop computers, and cloud-based configurations. Try Gemma 2 at maximum precision in Google AI Studio, or use Gemma.cpp on your CPU to unlock local performance with the quantized version. Alternatively, use Hugging Face Transformers to run Gemma 2 on an NVIDIA RTX or GeForce RTX at home.
Designed with developers and researchers in mind
In addition to being more capable, Gemma 2 is made to fit into your processes more smoothly:
Open and accessible: Like the original Gemma models, Gemma 2 is offered under Google's commercially friendly Gemma licence, allowing developers and researchers to share and commercialise their work.
Broad framework compatibility: Gemma 2 works with major AI frameworks, including Hugging Face Transformers, JAX, PyTorch, and TensorFlow via native Keras 3.0, as well as vLLM, Gemma.cpp, Llama.cpp, and Ollama, so you can use it with your preferred tools and workflows. Gemma is also optimised with NVIDIA TensorRT-LLM to run on NVIDIA-accelerated infrastructure or as an NVIDIA NIM inference microservice, with NVIDIA NeMo optimisation to follow. You can fine-tune today with Keras and Hugging Face, and Google is working to enable more parameter-efficient fine-tuning options.
Easy deployment: Starting next month, Google Cloud customers will be able to deploy and manage Gemma 2 on Vertex AI quickly and simply. Also explore the new Gemma Cookbook, a collection of practical examples and recipes for building applications and fine-tuning Gemma 2 models for specific tasks, including using Gemma with your preferred tooling for common jobs such as retrieval-augmented generation.
Responsible AI development: Google is committed to giving researchers and developers the tools they need to build and use AI responsibly, including the Responsible Generative AI Toolkit. The LLM Comparator was recently made publicly available, giving developers and researchers a thorough way to evaluate language models; you can already use its companion Python library to run comparative evaluations with your own model and data and visualise the results in the app. Google is also working to open source SynthID, its text watermarking technique for Gemma models.
Google trained Gemma 2 using its rigorous internal safety processes, including filtering pre-training data and performing thorough testing and evaluation against a broad set of metrics, in order to identify and mitigate potential biases and risks. Google publishes its results on a large set of public benchmarks related to safety and representational harms.
Projects built with Gemma
The initial Gemma launch led to more than 10 million downloads and countless inspiring projects. For example, Navarasa used Gemma to build a model grounded in India's linguistic diversity.
With Gemma 2, developers can launch even more ambitious projects and unlock the full potential and performance of their AI creations. Google will continue to explore new architectures and develop specialised Gemma variants to tackle a wider range of AI tasks and challenges, including the upcoming 2.6B-parameter Gemma 2 model, which is intended to further close the gap between lightweight accessibility and powerful performance. The technical report contains more detail about this upcoming release.
Getting started: You can now try Gemma 2 at full 27B performance, with no hardware setup required, in Google AI Studio. The Gemma 2 model weights can also be downloaded from Kaggle and Hugging Face Models, with Vertex AI Model Garden support coming soon.
To support research and development, Gemma 2 is also available free of charge through Kaggle or the free tier for Colab notebooks. New Google Cloud customers may be eligible for $300 in credits, and academic researchers can apply to the Gemma 2 Academic Research Programme to receive Google Cloud credits to accelerate their work. Applications are open until August 9.
Read more on Govindhtech.com
govindhtech · 1 year ago
Text
Is TensorRT Acceleration Coming For Stable Diffusion 3
NVIDIA TensorRT
Thanks to NVIDIA RTX and GeForce RTX technology, the AI PC era has arrived. With it comes a new vocabulary that can be hard to parse when choosing between the many desktop and laptop options, along with new ways of measuring performance for AI-accelerated tasks. This article is part of the AI Decoded series, which showcases new RTX PC hardware, software, tools, and accelerations while demystifying AI and making the technology more approachable.
While PC gamers know frames per second (FPS) and similar statistics well, new metrics are needed to evaluate AI performance.
TOPS: The Baseline Metric
Trillions of operations per second, or TOPS, is the initial baseline. The key word is trillions: the processing power required for generative AI tasks is truly enormous. Think of TOPS as a raw performance indicator, similar to an engine's horsepower rating.
Take Microsoft's recently unveiled Copilot+ PC series, for instance, which includes neural processing units (NPUs) capable of up to 40 TOPS. For many simple AI-assisted tasks, such as asking a local chatbot where yesterday's notes are, 40 TOPS is sufficient.
Many generative AI tasks, however, are more demanding. NVIDIA RTX and GeForce RTX GPUs deliver unprecedented performance across all generative tasks; the GeForce RTX 4090 GPU offers more than 1,300 TOPS. That level of processing power is what AI-assisted digital content creation, AI super resolution in PC gaming, generating images from text or video, querying local large language models (LLMs), and similar tasks require.
Tokens: The Measure of LLM Performance
TOPS is only the start of the story. LLM performance is measured in the number of tokens the model generates.
Tokens are the LLM's output. A token can be a word in a sentence, or an even smaller fragment such as punctuation or whitespace. Performance for AI-accelerated tasks is measured in tokens per second.
Batch size, the number of inputs processed together in a single inference pass, is another important factor. Because an LLM will sit at the core of many modern AI systems, the ability to handle multiple inputs (for example, from a single application or across several apps) is a key differentiator. Larger batch sizes improve throughput for concurrent inputs but require more memory, especially when combined with larger models.
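A rough way to see tokens per second for yourself is to stream a response from any local OpenAI-compatible endpoint and count output chunks against wall-clock time. The sketch below is illustrative only; it counts streamed chunks as a proxy for tokens, and the endpoint address and model name are placeholders.

```python
# Rough tokens-per-second probe against a local OpenAI-compatible server.
# Streamed chunks are counted as a proxy for tokens, so treat the result as
# an approximation; endpoint and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} chunks/sec over {elapsed:.1f}s")
```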
NVIDIA TensorRT-LLM
Because of their massive amounts of dedicated video random access memory (VRAM), Tensor Cores, and TensorRT-LLM software, RTX GPUs are incredibly well-suited for LLMs.
High-speed VRAM is available on GeForce RTX GPUs up to 24GB and on NVIDIA RTX GPUs up to 48GB, allowing for larger models and greater batch sizes. Additionally, RTX GPUs benefit from Tensor Cores, which are specialised AI accelerators that significantly accelerate the computationally demanding tasks necessary for generative AI and deep learning models. Using the NVIDIA TensorRT software development kit (SDK), which enables the highest-performance generative AI on the more than 100 million Windows PCs and workstations powered by RTX GPUs, an application can quickly reach that maximum performance.
Thanks to this combination of memory, dedicated AI accelerators, and optimised software, RTX GPUs achieve enormous throughput gains, particularly as batch sizes increase.
Faster Text-to-Image Than Ever
Performance can also be assessed by how quickly images are generated. One of the simplest ways to do this is with Stable Diffusion, a popular image-generation AI model that lets users quickly turn text descriptions into detailed visual representations.
With Stable Diffusion, users can easily create and refine images from text prompts until they get the desired result. These results are produced much faster when the model runs on an RTX GPU rather than on a CPU or NPU.
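For context, here is what the basic text-to-image flow looks like with Hugging Face Diffusers. This generic sketch does not itself use TensorRT (the extensions described next add that acceleration); it assumes the public SDXL base checkpoint id and an NVIDIA GPU with enough VRAM.

```python
# Generic text-to-image sketch with Hugging Face Diffusers (no TensorRT here;
# the TensorRT extensions described below accelerate this same workflow).
# Assumes the SDXL base checkpoint and an NVIDIA GPU with sufficient VRAM.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a laptop running a local chatbot",
    num_inference_steps=30,
).images[0]
image.save("sdxl_sample.png")
```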
With the TensorRT extension for the popular Automatic1111 interface, performance improves even further: using the SDXL Base checkpoint, RTX users can generate images from prompts up to 2x faster, greatly streamlining Stable Diffusion workflows.
TensorRT Acceleration
TensorRT acceleration was added to ComfyUI, another popular Stable Diffusion interface, last week. RTX users can now generate images from prompts 60% faster, and can even convert those images to video with Stable Video Diffusion 70% faster using TensorRT.
The new UL Procyon AI Image Generation benchmark also measures TensorRT acceleration, showing 50% faster speeds on a GeForce RTX 4080 SUPER GPU compared with the fastest non-TensorRT implementation.
Stable Diffusion 3, Stability AI's much-anticipated text-to-image model, will soon receive TensorRT acceleration as well, boosting performance by 50%. The new TensorRT-Model Optimizer enables further acceleration on top of that, resulting in a 70% speedup over the non-TensorRT implementation along with a 50% reduction in memory use.
Of course, the real test is the everyday workflow of refining an initial prompt. Iterating on prompts with an RTX GPU takes seconds rather than the minutes it takes on a MacBook Pro M3 Max. And when everything runs locally on an RTX-powered PC or workstation, users get both speed and security, with their data remaining private.
The Results Are Available and Can Be Shared
Recently, the open-source Jan.ai team of engineers and AI researchers integrated TensorRT-LLM into their local chatbot app and then put these optimisations to the test on their own systems.
TensorRT-LLM
The researchers used the open-source llama.cpp inference engine as the baseline for testing the TensorRT-LLM implementation across a range of GPUs and CPUs used by the community. They found that TensorRT-LLM was “30-70% faster than llama.cpp on the same hardware” and more efficient on consecutive processing runs. The team shared its methodology and invited others to measure generative AI performance for themselves.
Read more on Govindhtech.com
govindhtech · 1 year ago
Text
NVIDIA Project G-Assist, RTX-powered AI assistant showcase
AI assistant
Project G-Assist, an RTX-powered AI assistant technology showcase from NVIDIA, provides context-aware assistance for PC games and apps. The Project G-Assist tech demo debuted with ARK: Survival Ascended from Studio Wildcard. NVIDIA also unveiled the first PC-based NVIDIA NIM inference microservices for the NVIDIA ACE digital human platform.
These technologies are enabled by the NVIDIA RTX AI Toolkit, a new suite of tools and software development kits that help developers optimise and deploy large generative AI models on Windows PCs. They join NVIDIA's full-stack RTX AI innovations, which accelerate more than 500 PC games and applications as well as 200 laptop designs from OEMs.
In addition, ASUS and MSI have just unveiled RTX AI PC laptops with up to GeForce RTX 4070 GPUs and power-efficient systems-on-a-chip with Windows 11 AI PC capabilities. These Windows 11 AI PCs will receive a free update to Copilot+ PC experiences when it becomes available.
“In 2018, NVIDIA ushered in the era of AI PCs with the introduction of NVIDIA DLSS and RTX Tensor Core GPUs,” stated Jason Paul, NVIDIA’s vice president of consumer AI. “Now, NVIDIA is opening up the next generation of AI-powered experiences for over 100 million RTX AI PC users with Project G-Assist and NVIDIA ACE.”
Project G-Assist, a GeForce AI Assistant
AI assistants are set to transform gaming and in-app experiences, from offering game strategies and analysing multiplayer replays to assisting with complex creative workflows. Project G-Assist offers a glimpse of this future.
Even the most devoted players will find it difficult and time-consuming to grasp the complex mechanics and expansive universes found in PC games. With generative AI, Project G-Assist seeks to provide players with instant access to game expertise.
Project G-Assist takes voice or text input from the player, along with contextual information from the game screen, and runs that data through AI vision models. These models enhance the contextual awareness and app-specific understanding of a large language model (LLM) linked to a game knowledge database, which then generates a tailored response delivered as text or speech.
NVIDIA and Studio Wildcard collaborated to showcase the technology through ARK: Survival Ascended. If you have any queries concerning monsters, gear, lore, goals, challenging bosses, or anything else, Project G-Assist can help. Project G-Assist adapts its responses to the player’s game session since it is aware of the context.
Project G-Assist can also set up the player’s gaming system to run as efficiently and effectively as possible. It may apply a safe overclock, optimise graphics settings based on the user’s hardware, offer insights into performance indicators, and even automatically lower power usage while meeting performance targets.
Initial ACE PC NIM Releases
NVIDIA ACE technology, which powers digital humans, is coming to RTX AI PCs and workstations. With NVIDIA NIM inference microservices, developers can cut deployment times from weeks to minutes; ACE NIM microservices run high-quality inference locally on the device for natural language understanding, speech synthesis, facial animation, and more.
At COMPUTEX, the PC gaming debut of NVIDIA ACE NIM will be showcased in the Covert Protocol tech demo, developed in collaboration with Inworld AI, which now runs NVIDIA Riva automatic speech recognition and Audio2Face locally on the device.
Windows Copilot Runtime to Add GPU Acceleration for Local PC SLMs
Microsoft and NVIDIA are collaborating to help developers bring new generative AI capabilities to their Windows native and web apps. Through this collaboration, application developers will get easy application programming interface (API) access to GPU-accelerated small language models (SLMs) that run on-device through Windows Copilot Runtime, with support for retrieval-augmented generation (RAG).
For Windows developers, SLMs offer a plethora of opportunities, such as task automation, content production, and content summarising. By providing the AI models with access to domain-specific data that is underrepresented in base models, RAG capabilities enhance SLMs. By using RAG APIs, developers can customise SLM capabilities and behaviour to meet individual application requirements and leverage application-specific data sources.
NVIDIA RTX GPUs and other hardware makers’ AI accelerators will speed up these AI capabilities, giving end users quick, responsive AI experiences throughout the Windows ecosystem.
Later this year, the developer preview of the API will be made available.
Using the RTX AI Toolkit, Models Are 4x Faster and 3x Smaller
Hundreds of thousands of open-source models have been developed by the AI ecosystem and are available for use by app developers; however, the majority of these models are pretrained for public use and designed to run in a data centre.
NVIDIA is announcing RTX AI Toolkit, a collection of tools and SDKs for model customisation, optimisation, and deployment on RTX AI PCs, to assist developers in creating application-specific AI models that run on PCs. Later this month, RTX AI Toolkit will be made accessible to developers on a larger scale.
Developers can customise a pretrained model with open-source QLoRA tools. They can then use the NVIDIA TensorRT Model Optimizer to quantize models so they consume up to 3x less RAM, after which NVIDIA TensorRT Cloud tunes the model for peak performance across the RTX GPU lineup. The result is up to 4x faster performance compared with the pretrained model.
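To show what the open-source QLoRA step typically looks like, here is a minimal sketch using bitsandbytes 4-bit loading with PEFT LoRA adapters. This is the generic community pattern, not the RTX AI Toolkit's own tooling; the base model id and LoRA hyperparameters are placeholders.

```python
# Generic QLoRA setup sketch with bitsandbytes + PEFT (illustrative only;
# this shows the open-source pattern, not the RTX AI Toolkit itself).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the small LoRA adapters train
```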
The process of deploying ACE to PCs is made easier by the recently released early access version of the NVIDIA AI Inference Manager SDK. It orchestrates AI inference across PCs and the cloud with ease and preconfigures the PC with the required AI models, engines, and dependencies.
To improve AI performance on RTX PCs, software partners including Adobe, Blackmagic Design, and Topaz are incorporating RTX AI Toolkit components into their well-known creative applications.
According to Deepa Subramaniam, vice president of product marketing for Adobe Creative Cloud, “Adobe and NVIDIA continue to collaborate to deliver breakthrough customer experiences across all creative workflows, from video to imaging, design, 3D, and beyond.” “TensorRT 10.0 on RTX PCs unlocks new creative possibilities for content creation in industry-leading creative tools like Photoshop, delivering unparalleled performance and AI-powered capabilities for creators, designers, and developers.”
TensorRT-LLM, one of the RTX AI Toolkit’s components, is included into well-known generative AI development frameworks and apps, such as Automatic1111, ComfyUI, Jan.AI, LangChain, LlamaIndex, Oobabooga, and Sanctum.AI.
Using AI in Content Creation
Additionally, NVIDIA is adding RTX AI acceleration to programmes designed for modders, makers, and video fans.
NVIDIA debuted TensorRT-based RTX acceleration for Automatic1111, one of the most well-liked Stable Diffusion user interfaces, last year. Beginning this week, RTX will also speed up the widely-liked ComfyUI, offering performance gains of up to 60% over the version that is already in shipping, and a performance boost of 7x when compared to the MacBook Pro M3 Max.
With full ray tracing, NVIDIA DLSS 3.5, and physically correct materials, classic DirectX 8 and DirectX 9 games may be remastered using the NVIDIA RTX Remix modding platform. The RTX Remix Toolkit programme and a runtime renderer are included in RTX Remix, making it easier to modify game materials and objects.
When NVIDIA released RTX Remix Runtime as open source last year, it enabled modders to increase rendering power and game compatibility.
Over 100 RTX remasters are currently being developed on the RTX Remix Showcase Discord, thanks to the 20,000 modders who have utilised the RTX Remix Toolkit since its inception earlier this year to modify vintage games.
This month, NVIDIA will release the RTX Remix Toolkit as open source, enabling modders to improve the speed at which scenes are relit and assets are replaced, expand the file formats that RTX Remix’s asset ingestor can handle, and add additional models to the AI Texture Tools.
Furthermore, NVIDIA is enabling modders to livelink RTX Remix to digital content creation tools like Blender, modding tools like Hammer, and generative AI programmes like ComfyUI by providing access to the capabilities of RTX Remix Toolkit through a REST API. In order to enable modders to integrate the renderer of RTX Remix into games and applications other than the DirectX 8 and 9 classics, NVIDIA is now offering an SDK for RTX Remix Runtime.
More of the RTX Remix platform is becoming open source, enabling modders anywhere to create even more amazing RTX remasters.
All developers now have access to the SDK for NVIDIA RTX Video, the well-liked AI-powered super-resolution feature that is supported by the Mozilla Firefox, Microsoft Edge, and Google Chrome browsers. This allows developers to natively integrate AI for tasks like upscaling, sharpening, compression artefact reduction, and high-dynamic range (HDR) conversion.
Video editors will soon be able to up-sample lower-quality video files to 4K resolution and transform standard dynamic range source files into HDR thanks to RTX Video, which will be available for Wondershare Filmora and Blackmagic Design’s DaVinci Resolve video editing software. Furthermore, the free media player VLC media will shortly enhance its current super-resolution capabilities with RTX Video HDR.
Read more on govindhtech.com
govindhtech · 1 year ago
Text
Ryzen AI Chatbot Wizardry Helps to Raise Conversations
AMD Ryzen AI
Use Ryzen AI Processors to Create a Chatbot
AMD Ryzen AI CPUs and software bring the power of personal computing closer to you on an AI PC, unlocking a new level of efficiency for work, collaboration, and creativity. Because they demand a lot of computing power, generative AI applications such as AI chatbots typically run in the cloud. This blog covers the fundamentals of Ryzen AI technology and shows how to use it to build an AI chatbot that performs well running entirely on a Ryzen AI laptop.
Ryzen AI Software
Ryzen AI integrates a dedicated Neural Processing Unit (NPU) for AI acceleration on-chip alongside the CPU cores. With the AMD Ryzen AI software development kit (SDK), developers can run machine learning models trained in TensorFlow or PyTorch on PCs equipped with Ryzen AI, intelligently offloading workloads and tasks to free up CPU and GPU resources and deliver optimal performance at lower power consumption.
To optimize and implement AI inference on an NPU, the SDK comes with tools and runtime libraries. The kit comes with a variety of pre-quantized, ready-to-deploy models on the Hugging Face AMD model zoo, and installation is easy. To fully utilize AI acceleration on Ryzen AI PCs, developers may begin building their applications in a matter of minutes.
Developing an Ryzen AI Chatbot
Because AI chatbots need a lot of processing power, they are typically hosted in the cloud. ChatGPT can certainly be used from a PC, but the local application only displays the response after the prompt has been sent over the Internet to a server that runs the LLM; the PC itself does not process the prompts.
Here, by contrast, no cloud assistance is needed for an efficient local AI chatbot: an open-source, pre-trained OPT-1.3B model can be downloaded from Hugging Face and run directly on a Ryzen AI laptop.
You can create the chatbot in three steps:
Step 1: Download the pre-trained opt-1.3b model from Hugging Face.
Step 2: Quantize the downloaded model from FP32 to INT8.
Step 3: Deploy the model and launch the chatbot application.
Step 1: Download the pre-trained model from Hugging Face
In this step, download a pre-trained opt-1.3b model from Hugging Face.
The run.py script can be modified to download a pre-trained model from your own or your organisation's repository instead.
The opt-1.3b model is large, at roughly 4 GB.
Internet speed affects how long downloads take.
It took about 6 minutes in this instance.
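For reference, the equivalent download outside the Ryzen AI example kit looks like the sketch below, using the public facebook/opt-1.3b checkpoint with Hugging Face Transformers. The actual Ryzen AI flow wraps this in its own run.py script; this sketch only illustrates the underlying download.

```python
# Illustrative equivalent of step 1: pulling the public facebook/opt-1.3b
# checkpoint from Hugging Face. The Ryzen AI example kit wraps this step in
# its own run.py script; this only shows the underlying download.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the full pretrained weights
print(sum(p.numel() for p in model.parameters()) / 1e9, "billion parameters")
```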
Step 2: Quantize the model you downloaded (FP32 to Int8)
After the download completes, the model is quantized using the SDK's quantization flow.
Quantization happens in two steps. Before quantization proper, the FP32 model is “smooth quantized” to minimise accuracy loss: the activation statistics are used to identify outliers, and the weights are conditioned accordingly. With the outliers handled, very little error is introduced during quantization.
SmoothQuant was developed by a team including Dr. Song Han, a professor in MIT's EECS department. The core idea of the smoothing step is sketched below.
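The following is a minimal NumPy sketch of that core idea only (per-channel scales migrate activation outliers into the weights so both tensors quantize cleanly); it is not AMD's actual quantizer, and the shapes and alpha value are illustrative.

```python
# Minimal NumPy sketch of the SmoothQuant idea: per-channel scales move the
# activation outliers into the weights so both tensors quantize cleanly.
# This is the core math only, not AMD's quantization tooling.
import numpy as np

def smooth_scales(activations, weights, alpha=0.5):
    # activations: (tokens, in_features); weights: (in_features, out_features)
    act_max = np.abs(activations).max(axis=0)   # per-channel activation range
    w_max = np.abs(weights).max(axis=1)         # per-channel weight range
    return (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)

def apply_smoothing(activations, weights, alpha=0.5):
    s = smooth_scales(activations, weights, alpha)
    # X @ W == (X / s) @ (diag(s) @ W), so the product is unchanged,
    # but the scaled activations no longer carry extreme outliers.
    return activations / s, weights * s[:, None]

X = np.random.randn(64, 512) * np.where(np.arange(512) % 100 == 0, 10.0, 1.0)
W = np.random.randn(512, 256)
X_s, W_s = apply_smoothing(X, W)
assert np.allclose(X @ W, X_s @ W_s, atol=1e-6)  # the math is equivalent
```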
Step 3: Evaluate the model and deploy it with the chatbot app
Next, evaluate the quantized model and run it with the NPU as the target. An inline compiler compiles the model automatically during the first run.
Compilation is also a two-step process. The compiler first determines which layers can execute on the NPU and which must remain on the CPU, then produces two sets of subgraphs: one for the NPU and one for the CPU. Finally, it builds instruction sets targeting each subgraph's execution unit; these are carried out by two ONNX Execution Providers (EPs), one for the CPU and one for the NPU.
The model is compiled once and then cached, so subsequent deployments skip the compilation step.
Without a doubt, Ryzen AI processors present a tempting option for creating and managing a chatbot locally on your PC. Here’s a summary to get you going:
The Ryzen AI’s Power:
Dedicated AI engine: Ryzen AI processors include an AMD XDNA-powered on-die AI co-processor. This hardware is designed specifically to accelerate AI workloads, which makes it well suited to running a local chatbot.
Local processing: Unlike chatbots that run in the cloud, your chatbot can run entirely on your Ryzen AI processor. This keeps your data private and reduces latency (response time).
Constructing a Chatbot:
Although it takes programming expertise to create a chatbot from scratch, AMD provides a solution that makes use of pre-trained models:
LM Studio: This external program streamlines the procedure. It supports Ryzen AI processors and lets you download pre-trained Large Language Models (LLMs), the building blocks of your chatbot, such as Llama 2 or Mistral.
Pre-trained Models: Hugging Face and other platforms provide a range of pre-trained LLMs with various capabilities. You can select a model that fits the goal of your chatbot.
Extra Things to Think About:
Hardware Requirements: Make sure your Ryzen processor includes the AI Engine (AIE) and that compatible software and drivers are installed; not every Ryzen processor has this feature.
Computing Power: Running large LLMs requires a substantial amount of computing power. Expect slower response times depending on the complexity of the chosen LLM and your particular Ryzen processor.
Recall that this is just the beginning. As you delve deeper, you’ll see the fascinating possibilities of utilizing Ryzen AI CPUs to create personalized chatbots.
In conclusion
The AMD Ryzen AI full-stack tools enable users to quickly create experiences on an AI PC that were previously unattainable: AI applications for developers, creative content for creators, and tools for business owners to maximize efficiency and workflow.
Read more on govindhtech.com
govindhtech · 1 year ago
Text
Gemma open models now available on Google Cloud
Google today unveiled Gemma, a line of cutting-edge, lightweight open models developed using the same science and technology as the Gemini models. They are happy to announce that Google Cloud users can begin utilizing Gemma open models in Vertex AI for customization and building, as well as for deployment on Google Kubernetes Engine (GKE), right now. Google’s next step in making AI more transparent and available to developers on Google Cloud is the launch of Gemma and their enhanced platform capabilities.
Introducing the Gemma models
The Gemma family of open models is composed of lightweight, cutting-edge models that are constructed using the same technology and research as the Gemini models. Gemma, which was created by Google DeepMind and various other Google teams, was named after the Latin gemma, which means “precious stone.” Gemma was inspired by Gemini AI. Along with their model weights, Google is also making available resources to encourage developer creativity, promote teamwork, and direct the ethical use of Gemma models.
Gemma is currently accessible via Google Cloud
Gemma models share technical and infrastructure components with Google's highly capable Gemini models. Compared with other open models, this allows Gemma models to achieve best-in-class performance for their sizes. Google is releasing weights in two sizes, Gemma 2B and Gemma 7B, with pre-trained and instruction-tuned variants of each size to facilitate research and development.
Gemma supports frameworks like JAX, PyTorch, Keras 3.0, and Hugging Face Transformers, as well as Colab and Kaggle notebooks, tools that Google Cloud developers love and use today. Gemma open models can be used on Google Cloud, a workstation, or a laptop. Thanks to these new open models, developers can now customize Gemma in Vertex AI and run it on GKE. Google has also worked with NVIDIA to optimize Gemma for NVIDIA GPUs to maximize industry-leading performance.
Gemma is now accessible everywhere in the world. What you should know in particular is this:
The Gemma 2B and Gemma 7B model weights are the two sizes that they are releasing. There are trained and fine-tuned versions available for every size.
Using Gemma, a new Responsible Generative AI Toolkit offers instructions and necessary resources for developing safer AI applications.
Google is offering toolchains for inference and supervised fine-tuning (SFT) across all major frameworks, including PyTorch, TensorFlow, and JAX, through native Keras 3.0.
Gemma is simple to get started with thanks to pre-configured Colab and Kaggle notebooks and integration with well-known programs like Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM.
Pre-trained and fine-tuned Gemma open models can be easily deployed on Vertex AI and Google Kubernetes Engine (GKE) and run on your laptop, workstation, or Google Cloud.
Industry-leading performance is ensured through optimization across multiple AI hardware platforms, such as NVIDIA GPUs and Google Cloud TPUs.
The terms of use permit responsible commercial use and distribution by all organizations, regardless of size.
Unlocking Gemma’s potential in Vertex AI
Gemma has joined more than 130 models in the Vertex AI Model Garden, which now also includes the Gemini 1.0 Pro, 1.0 Ultra, and 1.5 Pro models for which Google recently announced expanded access.
Developers can benefit from an end-to-end ML platform that makes managing, tuning, and monitoring models easy and intuitive by utilizing Gemma open models on Vertex AI. By utilizing Vertex AI, builders can lower operational costs and concentrate on developing customized Gemma versions that are tailored to their specific use cases.
For instance, developers can do the following with Vertex AI’s Gemma open models:
Create generative AI applications for simple tasks like Q&A, text generation, and summarization.
Utilize lightweight, customized models to facilitate research and development through experimentation and exploration.
Support low-latency, real-time generative AI use cases, such as streaming text.
Developers can easily transform their customized models into scalable endpoints that support AI applications of any size with Vertex AI’s assistance.
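For quick local experimentation on a laptop or workstation before moving to Vertex AI endpoints, a minimal sketch using the Hugging Face Transformers integration might look like the example below; the google/gemma-2b-it checkpoint name and prompt are assumptions, and downloading Gemma weights requires accepting the license terms on Hugging Face.

```python
# Minimal local sketch: a simple summarization prompt with the instruction-tuned
# Gemma 2B model. The checkpoint name and prompt format are assumptions, and the
# Gemma license must be accepted on Hugging Face before the weights download.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize in one sentence: Gemma is a family of lightweight open models built from the same research and technology as Gemini."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```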
Utilize Gemma open models on GKE to scale from prototype to production
GKE offers resources for developing unique applications, from basic project prototyping to large-scale enterprise deployment. Developers can now deploy Gemma directly on GKE to build their generative AI applications, whether for testing model capabilities or constructing prototypes:
Use recognizable toolchains to deploy personalized, optimized models alongside apps in portable containers.
Adapt infrastructure configurations and model serving without having to provision or maintain nodes.
Quickly integrate AI infrastructure that can grow to accommodate even the most complex training and inference scenarios.
GKE offers autoscaling, reliable operations environments, and effective resource management. Furthermore, it facilitates the easy orchestration of Google Cloud AI accelerators, such as GPUs and TPUs, to improve these environments and speed up training and inference for generative AI model construction.
Cutting-edge performance at scale
The infrastructure and technical components of Gemma open models are shared with Gemini, their most powerful AI model that is currently accessible to the public. In comparison to other open models, this allows Gemma 2B and 7B to achieve best-in-class performance for their sizes. Additionally, Gemma open models can run directly on a developer's desktop or laptop computer. Notably, Gemma meets their strict requirements for responsible and safe outputs while outperforming noticeably larger models on important benchmarks. For specifics on modeling techniques, dataset composition, and performance, consult the technical report.
At Google, we think AI should benefit everyone. Google has a long history of developing innovations and releasing them to the public, including JAX, AlphaFold, AlphaCode, Transformers, TensorFlow, BERT, and T5. They are thrilled to present a fresh batch of Google open models today to help researchers and developers create ethical AI.
Begin using Gemma on Google Cloud right now
Working with Gemma models in Vertex AI and GKE on Google Cloud is now possible. Visit ai.google.dev/gemma to access quickstart guides and additional information about Gemma.
Read more on Govindhtech.com
govindhtech · 2 years ago
Text
TensorRT-LLM Brings AI Innovation to RTX Windows 11 PCs!
TensorRT-LLM Features
Among the new tools and resources unveiled at Microsoft Ignite are a TensorRT-LLM wrapper for the OpenAI Chat API and RTX-powered performance enhancements to DirectML for Llama 2 and other well-known LLMs.
Windows 11 PCs with artificial intelligence represent a turning point in computing history, transforming experiences for office workers, students, broadcasters, artists, gamers, and even casual PC users.
For owners of the more than 100 million Windows PCs and workstations powered by RTX GPUs, it presents previously unheard-of chances to boost productivity. Furthermore, NVIDIA RTX technology is making it increasingly simpler for programmers to design artificial intelligence (AI) apps that will revolutionize computer usage.
Developers will be able to provide new end-user experiences more quickly with the aid of new optimizations, models, and resources that Microsoft Ignite unveiled.
AI Innovation with TensorRT-LLM on RTX PCs
New large language models will be supported by an upcoming update to the open-source TensorRT-LLM software, which improves AI inference performance. The release will also make demanding AI workloads more accessible on desktops and laptops with RTX GPUs starting at 8GB of VRAM.
With a new wrapper, TensorRT-LLM for Windows will soon be able to communicate with OpenAI's popular Chat API. This lets hundreds of developer projects and applications run locally on a PC with RTX rather than in the cloud, so customers can keep confidential and proprietary data on their Windows 11 PCs.
Maintaining custom generative AI projects takes time and effort. Trying to collaborate and deploy across different settings and platforms can make the process extremely difficult and time-consuming.
With the help of AI Workbench, developers can easily construct, test, and modify pretrained generative AI models and LLMs on a PC or workstation. The toolkit is unified and user-friendly. It gives programmers a unified platform to manage their AI initiatives and fine-tune models for particular applications.
This makes it possible for developers to collaborate and deploy generative AI models seamlessly, which leads to the rapid creation of scalable, affordable models. Sign up for the early access list to be the first to learn about this expanding effort and to get updates in the future.
To benefit AI developers, NVIDIA and Microsoft will provide DirectML upgrades to speed up Llama 2, one of the most popular foundation AI models. Along with establishing a new benchmark for performance, developers now have additional choices for cross-vendor deployment.
Carry-On AI
TensorRT-LLM for Windows, a library for speeding up LLM inference, was introduced by NVIDIA last month.
Later this month, TensorRT-LLM will release version 0.6.0, which will add support for more widely used LLMs, including the recently released Mistral 7B and Nemotron-3 8B, and improve inference performance by up to 5x. These LLMs can run on some of the most portable Windows devices, bringing fast, accurate, local LLM capabilities to any GeForce RTX 30 Series and 40 Series GPU with 8GB of VRAM or more.
The latest version of TensorRT-LLM can be installed from the NVIDIA/TensorRT-LLM GitHub repository. New optimized models will be available on ngc.nvidia.com.
Chatting With Confidence
OpenAI’s Chat API is used by developers and hobbyists worldwide for a variety of tasks, including as generating documents and emails, summarizing web material, analyzing and visualizing data, and making presentations.
Such cloud-based AIs have a drawback in that users must submit their input data, which makes them unsuitable for handling huge datasets or private or proprietary data.
To address this issue, NVIDIA will shortly make TensorRT-LLM for Windows available through a new wrapper that provides an API interface similar to OpenAI's widely used Chat API. This gives developers the same workflow whether they are creating models and applications to run locally on an RTX-capable PC or in the cloud. Hundreds of AI-powered developer projects and applications can now take advantage of fast, local AI with only one or two lines of code changed. Users do not need to upload datasets to the cloud; they can keep their data locally on their PCs.
The greatest aspect is probably that a lot of these programs and projects are open source, which makes it simple for developers to use and expand their capabilities to promote the use of RTX-powered generative AI on Windows.
The wrapper, along with additional developer tools for dealing with LLMs on RTX, is being provided as a reference project on GitHub. It is compatible with any LLM that has been optimized for TensorRT-LLM, such as Llama 2, Mistral, and NV LLM.
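To illustrate the one-or-two-lines-of-code point, here is a hedged sketch that points the standard OpenAI Python client at a local endpoint; the port, model name, and API-key placeholder are assumptions, since the wrapper's exact defaults are not specified here.

```python
# Sketch: redirect an existing OpenAI Chat API client to a local
# TensorRT-LLM-backed endpoint. Base URL, port, and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # the one-line change: point at the local wrapper
    api_key="not-needed-locally",          # placeholder; no cloud key required
)

response = client.chat.completions.create(
    model="llama-2-13b-chat",              # hypothetical local model name
    messages=[{"role": "user", "content": "Summarize this web page in three bullets: ..."}],
)
print(response.choices[0].message.content)
```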
Acceleration of Models
Modern AI models are now available for developers to use, and a cross-vendor API facilitates deployment. As part of their continuous effort to enable developers, Microsoft and NVIDIA have been collaborating to speed up Llama on RTX using the DirectML API.
Adding to the news last month about these models’ fastest inference performance, this new cross-vendor deployment option makes bringing AI capabilities to PCs simpler than ever.
By downloading the most recent ONNX runtime, installing the most recent NVIDIA driver, and following Microsoft's installation instructions, developers and enthusiasts may take advantage of the most recent improvements.
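As a small, hedged example of the cross-vendor path, the DirectML backend is selected in ONNX Runtime by requesting its execution provider (shipped in the onnxruntime-directml package); the model path below is a placeholder.

```python
# Sketch: run an ONNX model (for example, an optimized Llama 2 export) through
# DirectML on any supported GPU. The model path is a placeholder.
import onnxruntime as ort

session = ort.InferenceSession(
    "llama-2-7b-optimized.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())   # confirms the DirectML provider was picked up
```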
These additional optimizations, models, and resources will speed up the creation and distribution of AI features and applications to the 100 million RTX PCs worldwide, joining the more than 400 partners already shipping RTX GPU-accelerated apps and games.
RTX GPUs will be essential for allowing consumers to fully utilize this potent technology as models become ever more available and developers add more generative AI-powered capabilities to RTX-powered Windows PCs.
Read more on Govindhtech.com
govindhtech · 2 years ago
Text
Research Generative AI Applications
Protecting privacy and security in generative AI
On device AI and privacy: AI boosts privacy and security
Welcome to AI on the Edge, a new weekly content series featuring the newest on device artificial intelligence insights from our most active subject matter experts in this dynamic, ever-expanding field.
The rising use of generative AI promises explosive creativity, ease, and productivity. Generative AI is already fulfilling these promises by providing more exact search results, attractive art, personalised advertising campaigns, and new software code using large language models (LLMs) and language-vision models (LVMs).
Must it compromise privacy and security?
Are AI and privacy incompatible?
Not necessarily. On device generative AI lets you enjoy the best of both worlds: AI with privacy and security on your smartphone, PC, or extended reality headset.
When running cloud hosted generative AI models, interactions might be public. The query, context, and data required to fine tune models can be revealed, raising AI and privacy concerns.
This includes private data or source code used as model queries or generated by the model, which is inappropriate for corporate use cases.
On device generative AI can improve AI privacy and security.
Why on device? Data privacy and security are improved by AI
On device AI protects user data by keeping inquiries on the device. Under specific conditions, edge devices like smartphones and PCs are trusted to secure sensitive personal and business data using data and communications encryption, password, and biometric access.
Therefore, on device generative AI models may use such security characteristics to increase query and output data security and privacy. Since inference and fine tuning use on device memory, storage, and processing resources, models may use local data to personalise and improve input and output with the same degree of confidence.
Travel ease with on device generative AI
Take this example: User is travelling and seeking for delicious dinners. Devices currently search the Internet and provide local meal alternatives using the user’s location, even with non generative AI. However, with a generative AI based solution, the user may want the chat assistant to use personal data like food and restaurant rating preferences, food allergies, meal plan data, budget, and calendar information to find a nearby four-star restaurant with nutritional options that fit their meal plan.
The user may want the assistant to reserve a table at a time in their schedule after finding an acceptable alternative. In this case, the assistant just uses the cloud to find a list of restaurants and make a reservation while keeping searches and personal information private.
Software developer assistant with on device generative AI
Software developers that need to write product source code benefit from on device generative AI. The generative AI model needs confidential corporate data and code to do this. Again, a coding helper on the developer’s laptop would help protect the company’s valuable intellectual property from cyberattacks.
Retirement planner using on device generative AI
Retirement planning is another broad use case for AI with privacy. By 2030, all baby boomers in the US will be 65 or older, a population of 73 million. Multiple generations of retirees after that have realized the value of a well-funded retirement portfolio. As more individuals reach retirement age worldwide and pensioners live longer, retirement costs rise.
Qualified financial advisers will be in demand as personal portfolio management becomes essential to maximising returns on investment. On device AI might put a retirement planning assistant in an investor’s hand to educate and give the first few tiers of support, streamlining the process once a trained financial adviser is involved.
The investor might tell the assistant their age, savings, present investments, real estate, income, costs, risk tolerance, and investment goals via a conversational interface. After reviewing this information, the assistant may ask questions to adjust input parameters.
With these factors, the assistant might give educational content, investing techniques, recommended funds, and other investment vehicles. The assistant may also give conversational and graphical scenario analysis based on investor inquiries like “What if she live into my 90s?” or “she just got a new job, how does this affect my current plan?”
The assistant may then utilise the investor’s location, investment level, and risk tolerance to recommend local financial specialists to help develop and implement these first ideas.
Consumer confidence in generative AI requires security and privacy.
All of these instances show how a user would not want a cloud hosted chatbot to access such sensitive information but would be happy with an on device generative AI model to make judgements based on local information. Running generative AI models on a device lets users benefit without revealing personal or confidential information.
Users may want both the results and the prompts that start inquiries protected. Thus, on device inference lets consumers employ AI without exposing their data to cloud hosted models.
Running generative AI models on a device uses current technology protections to use on device personal and business data without the security and privacy issues of cloud hosted models. On device generative AI delivers enhanced creativity, convenience, and productivity and improves on cloud based models.
Up next
How can the industry allow on device generative AI? AI on the Edge will investigate the elements that will increase on device generative AI adoption in future blog entries.