#TensorRTLLM
govindhtech · 11 months ago
New NVIDIA L40S GPU-accelerated OCI Compute Instances
Oracle Cloud Infrastructure Is Expanding NVIDIA GPU-Accelerated Instances for AI, Digital Twins, and Other Uses
To boost productivity, reduce costs, and spur innovation, businesses are rapidly adopting generative AI, large language models (LLMs), advanced graphics, and digital twins.
To use these technologies effectively, however, businesses need access to cutting-edge, full-stack accelerated computing platforms. To meet this demand, Oracle Cloud Infrastructure (OCI) today announced that NVIDIA L40S GPU bare-metal instances are available to order, and that a new virtual machine powered by a single NVIDIA H100 Tensor Core GPU is coming soon. The new VM expands OCI’s H100 portfolio, which already includes an NVIDIA HGX H100 8-GPU bare-metal instance.
Combined with NVIDIA networking and the NVIDIA software stack, these platforms deliver strong performance and efficiency, enabling businesses to advance generative AI.
You can now order the NVIDIA L40S GPU on OCI
The NVIDIA L40S is a universal data centre GPU designed to deliver multi-workload acceleration for generative AI, graphics, and video applications. With fourth-generation Tensor Cores and support for the FP8 data format, the L40S GPU excels at inference across a wide range of generative AI use cases, as well as at training and fine-tuning small- to mid-size LLMs.
For example, running Llama 3 8B with NVIDIA TensorRT-LLM at an input and output sequence length of 128, a single L40S GPU (FP8) can generate up to 1.4x as many tokens per second as a single NVIDIA A100 Tensor Core GPU (FP16).
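As a rough illustration of how a tokens-per-second figure like this is typically measured, here is a minimal sketch using TensorRT-LLM’s high-level Python LLM API; the model name, batch size, and the assumption that every request emits its full 128 output tokens are illustrative choices, not NVIDIA’s benchmark setup.

```python
# Minimal sketch: measuring generation throughput (tokens/sec) with the
# TensorRT-LLM high-level LLM API. Model name, batch size, and the "full
# 128 tokens per request" assumption are illustrative only.
import time
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumes the weights are accessible

prompts = ["Summarise the benefits of GPU-accelerated inference."] * 32  # 32 concurrent requests
params = SamplingParams(max_tokens=128)                                  # 128 output tokens each

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Rough throughput, assuming each request generates the full 128 tokens.
total_tokens = len(outputs) * 128
print(f"~{total_tokens / elapsed:.0f} output tokens/sec across the batch")
```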
Additionally, the NVIDIA L40S GPU offers media acceleration and best-in-class graphics. It is perfect for digital twin and complex visualisation applications because of its numerous encode/decode engines and third-generation NVIDIA Ray Tracing Cores (RT Cores).
With support for NVIDIA DLSS 3, the L40S GPU delivers up to 3.8x the real-time ray-tracing performance of its predecessor, enabling faster rendering and smoother frame rates. This makes the GPU ideal for building applications on the NVIDIA Omniverse platform, which enables real-time, photorealistic 3D simulations and AI-enabled digital twins. Using Omniverse on the L40S GPU, businesses can develop advanced 3D applications and workflows for industrial digitalisation, allowing them to design, simulate, and optimise products, processes, and facilities in real time before going into production.
NVIDIA L40S 48GB
The L40S GPU will be available in OCI’s BM.GPU.L40S bare-metal compute shape, which features four NVIDIA L40S GPUs, each with 48GB of GDDR6 memory. The shape also includes 1TB of system memory, 7.38TB of local NVMe storage, and 112-core 4th Gen Intel Xeon CPUs.
OCI’s bare-metal compute architecture eliminates virtualisation overhead for high-throughput, latency-sensitive AI and machine learning workloads. The NVIDIA BlueField-3 DPU in this accelerated compute shape offloads data centre tasks from the CPUs, improving server efficiency and accelerating networking, storage, and security workloads. With BlueField-3 DPUs, OCI extends its off-box virtualisation strategy across its entire fleet.
OCI Supercluster with NVIDIA L40S GPUs delivers ultra-high performance for up to 3,840 GPUs, with low latency and 800Gbps of internode bandwidth. OCI’s cluster network uses NVIDIA ConnectX-7 NICs over RoCE v2 to support latency-sensitive, high-throughput workloads such as AI training.
“We chose OCI AI infrastructure with bare-metal instances and NVIDIA L40S GPUs for 30% more efficient video encoding,” said Beamr Cloud CEO Sharon Carmel. “Videos processed with Beamr Cloud on OCI will use 50% or less of the network and storage traffic, resulting in 2x faster file transfers and higher end-user productivity. Beamr will offer video AI workflows to OCI customers, preparing them for the future of video.”
OCI to Feature Single-GPU H100 VMs Soon
The VM.GPU.H100.1 compute virtual machine shape, powered by a single NVIDIA H100 Tensor Core GPU, is coming soon to OCI. It will offer cost-effective, on-demand access for businesses looking to harness the power of NVIDIA H100 GPUs for generative AI and HPC workloads.
A single H100 GPU provides a solid platform for LLM inference and smaller workloads. For example, with NVIDIA TensorRT-LLM at an input and output sequence length of 128 and FP8 precision, a single H100 GPU can generate more than 27,000 tokens per second for Llama 3 8B (up to 4x the throughput of a single A100 GPU at FP16 precision).
The VM.GPU.H100.1 shape is well suited to a variety of AI workloads, with 13 cores of 4th Gen Intel Xeon processors, 246GB of system memory, and capacity for 2×3.4TB NVMe drives.
“Oracle Cloud’s bare-metal compute with NVIDIA H100 and A100 GPUs, low-latency Supercluster, and high-performance storage delivers up to 20% better price-performance for Altair’s computational fluid dynamics and structural mechanics solvers,” said Yeshwant Mummaneni, head engineer of data management analytics at Altair. “We are eager to use these GPUs in conjunction with virtual machines to power the Altair Unlimited virtual appliance.”
GH200 Bare-Metal Instances Available for Validation
OCI is also making the BM.GPU.GH200 compute shape available for customer testing. It features the NVIDIA Grace Hopper Superchip with NVLink-C2C, a high-bandwidth, cache-coherent 900GB/s connection between the NVIDIA Grace CPU and NVIDIA Hopper GPU. This provides more than 600GB of accessible memory, enabling applications that process terabytes of data to run up to 10x faster than on an NVIDIA A100 GPU.
Software That’s Optimised for Enterprise AI
Businesses can accelerate their AI, HPC, and data analytics workloads on OCI with a range of NVIDIA GPUs. Realising the full potential of these GPU-accelerated compute instances, however, requires an optimised software layer.
NVIDIA NIM, a set of easy-to-use microservices designed for high-performance AI model inference, enables world-class generative AI applications to be deployed securely and reliably. NIM is part of the NVIDIA AI Enterprise software platform, which is available on the OCI Marketplace.
NIM’s prebuilt containers, optimised for NVIDIA GPUs, give developers improved security, faster time to market, and a lower total cost of ownership. The NVIDIA API Catalogue offers NIM microservices for popular community models that can be easily deployed on Oracle Cloud Infrastructure (OCI).
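As a sketch of what using a deployed NIM looks like in practice, the example below calls a locally running NIM container through its OpenAI-compatible endpoint; the base URL and served model name are placeholders that depend on which NIM you deploy.

```python
# Minimal sketch: calling a NIM microservice through its OpenAI-compatible
# endpoint. The base URL and model name are placeholders (assumptions) for
# whatever NIM container you have deployed on OCI.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # address of your deployed NIM container (assumed)
    api_key="not-used",                   # local NIM deployments typically don't require a key
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",      # placeholder name of the model served by the NIM
    messages=[{"role": "user", "content": "Give me three uses for an L40S GPU."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```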
With the arrival of future GPU-accelerated instances, such as NVIDIA Blackwell and H200 Tensor Core GPUs, performance will only get better with time.
Contact OCI to order the L40S GPU and test the GH200 Superchip. To learn more, join Oracle and NVIDIA at SIGGRAPH, the world’s preeminent graphics conference, running through August 1.
NVIDIA L40S Price
Priced at approximately $10,000 USD, the NVIDIA L40S GPU is intended for data centre and AI workloads. It is an enhanced version of the L40, built specifically for AI applications rather than purely visualisation work. Powered by NVIDIA’s Ada Lovelace architecture, it handles a variety of high-performance applications, including large language model (LLM) training and inference, media acceleration, and 3D graphics rendering.
Read more on govindhtech.com
govindhtech · 1 year ago
NVIDIA Nemotron-4 340B Open LLMs for Synthetic Data Training
NVIDIA Nemotron-4 340B
NVIDIA unveiled Nemotron-4 340B, an open model family that allows developers to produce synthetic data for large language model (LLM) training in the industrial, retail, healthcare, and finance sectors, among other industries.
Robust training datasets are essential to the performance, accuracy, and response quality of a custom LLM, but they can be prohibitively expensive and difficult to obtain.
Through a uniquely permissive open model licence, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can be used to build robust LLMs.
Nemotron
The Nemotron-4 340B family includes base, instruct, and reward models that form a pipeline for generating synthetic data used to train and refine LLMs. The models are designed to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customisation, and evaluation. They are also optimised for inference with the open-source NVIDIA TensorRT-LLM library.
Nemotron-4 340B can be downloaded now from Hugging Face. The models will also be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.
Navigating Nemotron to Generate Synthetic Data
LLMs can help developers generate synthetic training data in situations where access to large, diverse labelled datasets is limited.
The Nemotron-4 340B Instruct model generates a variety of synthetic data that closely resembles real-world data, enhancing data quality to boost the robustness and performance of custom LLMs in a range of domains.
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used in a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimised for English-language single- and multi-turn chat, and supports a context length of 4,096 tokens.
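A minimal sketch of one step in such a synthetic data generation pipeline is shown below, assuming the instruct model is served behind an OpenAI-compatible endpoint (for example via a NIM microservice); the endpoint address, model name, topics, and prompt template are all illustrative assumptions.

```python
# Minimal sketch of a synthetic-data generation step with an instruct model
# served behind an OpenAI-compatible endpoint. Endpoint, model name, topics,
# and prompt format are placeholders for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

TOPICS = ["invoice processing", "patient intake forms", "retail returns policy"]

synthetic_examples = []
for topic in TOPICS:
    prompt = (
        f"Write one realistic customer question about {topic}, "
        "followed by a helpful, accurate answer. "
        "Format: Question: ... Answer: ..."
    )
    reply = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-instruct",   # placeholder served-model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,                           # some diversity in the generated data
        max_tokens=300,
    )
    synthetic_examples.append(reply.choices[0].message.content)

print(f"Generated {len(synthetic_examples)} synthetic Q&A pairs")
```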
The base model was pre-trained on a dataset of 9 trillion tokens, comprising a wide range of English-language texts, more than 50 natural languages, and more than 40 coding languages. The Nemotron-4-340B-Instruct model then underwent additional alignment steps, including:
Supervised Fine-Tuning (SFT)
Direct Preference Optimisation (DPO)
Reward-aware Preference Optimisation (RPO)
While over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO and RPO) was synthesised by NVIDIA’s data generation pipeline, the company relied on only about 20,000 human-annotated examples throughout the alignment process.
The result is a model that is aligned with human chat preferences, improves mathematical reasoning, coding, and instruction following, and can produce high-quality synthetic data for a range of use cases.
NVIDIA affirms under the terms of the NVIDIA Open Model Licence:
The models can be used commercially.
You are free to create and distribute derivative models.
NVIDIA claims no ownership of any outputs generated using the models or derivative models.
To improve the quality of the AI-generated data, developers can then use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward scores responses on five criteria: helpfulness, accuracy, coherence, complexity, and verbosity. It currently holds the top spot on the Hugging Face RewardBench leaderboard, created by AI2, which assesses the strengths, vulnerabilities, and safety of reward models.
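The sketch below illustrates the filtering idea, assuming a helper that queries a deployed reward-model endpoint and returns per-attribute scores; the helper, the example scores, and the cutoff threshold are all stand-ins for illustration.

```python
# Minimal sketch of reward-based filtering of synthetic data. The scoring
# helper stands in for however you query a deployed Nemotron-4 340B Reward
# endpoint; the attribute names mirror the five criteria above, and the
# threshold is an arbitrary assumption.
from statistics import mean

def score_with_reward_model(question: str, answer: str) -> dict[str, float]:
    # In practice this would call your deployed reward-model endpoint and
    # return its per-attribute scores; fixed values are used here so the
    # sketch runs on its own.
    return {"helpfulness": 3.9, "accuracy": 4.1, "coherence": 4.0,
            "complexity": 2.7, "verbosity": 1.8}

KEEP_THRESHOLD = 3.0  # arbitrary cutoff on the mean attribute score

def filter_synthetic_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only question/answer pairs the reward model rates highly."""
    kept = []
    for question, answer in pairs:
        scores = score_with_reward_model(question, answer)
        if mean(scores.values()) >= KEEP_THRESHOLD:
            kept.append((question, answer))
    return kept

print(filter_synthetic_pairs([("What is FP8?",
                               "An 8-bit floating-point format used to speed up inference.")]))
```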
By combining their proprietary data with the included HelpSteer2 dataset, researchers can also customise the Nemotron-4 340B Base model to build their own instruct or reward models.
Nemotron-4-340B-Base is a large language model that can likewise be used in a synthetic data generation pipeline to create training data for building LLMs. The model has 340 billion parameters and supports a context length of 4,096 tokens. It was pre-trained on a total of 9 trillion tokens, including a wide range of English-language texts, more than 50 natural languages, and more than 40 coding languages.
After an initial pre-training phase of 8 trillion tokens, a continued pre-training run of 1 trillion tokens was carried out on top of the pre-trained model to improve its quality. During continued pre-training, NVIDIA used a different data distribution from the one used at the start of training.
TensorRT-LLM Inference Optimisation, NeMo Fine-Tuning
Using the open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimise the efficiency of their instruct and reward models for generating synthetic data and scoring responses.
All Nemotron-4 340B models are optimised with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
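To make the idea concrete, here is a small NumPy sketch of column-wise tensor parallelism; real systems such as TensorRT-LLM shard the weights across physical GPUs and use collective communication, but the underlying arithmetic is the same.

```python
# Conceptual sketch of tensor parallelism: a single weight matrix is split
# column-wise across "devices", each device computes its slice of the matmul,
# and the partial results are concatenated (an all-gather in real systems).
import numpy as np

hidden, out_features, n_devices = 1024, 4096, 4

x = np.random.randn(8, hidden).astype(np.float32)              # a batch of activations
W = np.random.randn(hidden, out_features).astype(np.float32)   # full weight matrix

# Shard the weight matrix column-wise: each device holds out_features / n_devices columns.
shards = np.split(W, n_devices, axis=1)

# Each device multiplies the same input by its own shard (in parallel in practice).
partials = [x @ shard for shard in shards]

# Concatenating the partial outputs reproduces the full, unsharded result.
y_parallel = np.concatenate(partials, axis=1)
assert np.allclose(y_parallel, x @ W, atol=1e-3)
```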
The NeMo framework allows Nemotron-4 340B Base, which was trained on 9 trillion tokens, to be customised for specific use cases or domains. This fine-tuning process benefits from extensive pre-training data and yields more accurate outputs for particular downstream tasks.
The NeMo framework offers a range of customisation methods, including supervised fine-tuning and parameter-efficient fine-tuning techniques such as low-rank adaptation (LoRA).
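As a conceptual illustration of what low-rank adaptation does (not the NeMo API itself), the PyTorch sketch below freezes a base linear layer and adds a small trainable low-rank update; layer sizes, rank, and scaling are arbitrary example values.

```python
# Conceptual sketch of LoRA: the frozen base weight is augmented with a
# trainable low-rank update B @ A, so only a small number of parameters
# are updated during fine-tuning. Illustration only, not the NeMo API.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen base projection + scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable parameters vs", 1024 * 1024, "in the frozen base weight")
```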
To improve model quality, developers can align their models using NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a crucial phase in LLM training in which a model’s behaviour is refined using methods such as reinforcement learning from human feedback (RLHF), ensuring its outputs are safe, accurate, contextually appropriate, and consistent with the model’s intended goals.
NeMo and TensorRT-LLM are also available to businesses via the cloud-native NVIDIA AI Enterprise software platform, which offers rapid and effective runtimes for generative AI foundation models. This platform is ideal for those looking for enterprise-grade support and security for production environments.
Evaluating Model Safety and Getting Started
The Nemotron-4 340B Instruct model underwent a thorough safety evaluation, including adversarial testing, and performed well across a broad range of risk indicators. Users should still carefully evaluate the model’s outputs to ensure the synthetically generated data is suitable, safe, and accurate for their use case.
Read more on Govindhtech.com
govindhtech · 1 year ago
Is TensorRT Acceleration Coming For Stable Diffusion 3
NVIDIA TensorRT
Thanks to NVIDIA RTX and GeForce RTX technology, the AI PC era has arrived. With it comes a new vocabulary that can be hard to parse when choosing between the many desktop and laptop options, along with new ways of measuring performance for AI-accelerated tasks. This article is part of the AI Decoded series, which demystifies AI by making the technology more approachable while showcasing new RTX PC hardware, software, tools, and accelerations.
While PC gamers readily understand frames per second (FPS) and similar statistics, measuring AI performance requires new metrics.
Emerging as the Best
The first baseline is TOPS, or trillions of operations per second. The key word is trillions: the processing power required for generative AI tasks is truly enormous. Think of TOPS as a raw performance metric, akin to an engine’s horsepower rating.
Take Microsoft’s recently announced Copilot+ PC series, for example, which includes neural processing units (NPUs) capable of up to 40 TOPS. For many lightweight AI-assisted tasks, such as asking a local chatbot where yesterday’s notes are, 40 TOPS is sufficient.
Many generative AI tasks, however, are more demanding. NVIDIA RTX and GeForce RTX GPUs deliver unprecedented performance across all generative tasks; the GeForce RTX 4090 GPU offers more than 1,300 TOPS. That level of processing power is what tasks such as AI-assisted digital content creation, AI super resolution in PC gaming, generating images from text or video, and querying local large language models (LLMs) require.
Put in Tokens to Start Playing
TOPS is only the beginning of the story. LLM performance is measured by the number of tokens the model generates.
Tokens are the output of an LLM. A token can be a word in a sentence, or even a smaller fragment such as punctuation or whitespace. Performance for AI-accelerated tasks is measured in “tokens per second.”
Batch size, or the number of inputs processed simultaneously in a single inference pass, is another important factor. Since an LLM will sit at the core of many modern AI systems, the ability to handle multiple inputs (e.g., from a single application or across multiple applications) will be a key differentiator. Larger batch sizes improve performance for concurrent inputs but require more memory, especially when combined with larger models.
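The short sketch below makes these terms concrete: it shows how a tokenizer splits a sentence into tokens and how a tokens-per-second figure is computed from a batch of requests; the tokenizer choice and the timing numbers are arbitrary examples.

```python
# Minimal sketch of what a "token" is and how tokens-per-second is computed.
# The tokenizer name is an assumption; any Hugging Face tokenizer illustrates
# the same idea of splitting text into words and sub-word pieces.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "AI-accelerated tasks are measured in tokens per second."
ids = tok.encode(text)
print(tok.convert_ids_to_tokens(ids))        # words and smaller sub-word pieces
print(len(ids), "tokens in the prompt")

def tokens_per_second(total_generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput = tokens produced across the whole batch / wall-clock time."""
    return total_generated_tokens / elapsed_seconds

# Example: a batch of 8 concurrent requests, each producing 128 tokens in 2.0 s.
print(tokens_per_second(8 * 128, 2.0), "tokens/sec")
```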
NVIDIA TensorRT-LLM
RTX GPUs are exceptionally well suited to LLMs thanks to their large amounts of dedicated video random access memory (VRAM), Tensor Cores, and TensorRT-LLM software.
GeForce RTX GPUs offer up to 24GB of high-speed VRAM and NVIDIA RTX GPUs up to 48GB, which supports larger models and bigger batch sizes. RTX GPUs also benefit from Tensor Cores, dedicated AI accelerators that dramatically speed up the computationally intensive operations required by deep learning and generative AI models. An application can reach that peak performance by using the NVIDIA TensorRT software development kit (SDK), which enables the highest-performance generative AI on the more than 100 million Windows PCs and workstations powered by RTX GPUs.
Thanks to the combination of memory, dedicated AI accelerators, and optimised software, RTX GPUs achieve massive throughput gains, particularly as batch sizes increase.
Text to Image More Quickly Than Before
Performance can also be measured by how quickly images are generated. One of the most straightforward ways is with Stable Diffusion, a popular image-based AI model that lets users quickly turn text descriptions into complex visual representations.
With Stable Diffusion, users can easily create and refine images from text prompts to achieve the desired result, and these results are generated faster when the AI model runs on an RTX GPU rather than a CPU or NPU.
That performance improves even further with the TensorRT extension for the popular Automatic1111 interface. Using the SDXL Base checkpoint, RTX users can generate images from prompts up to 2x faster, significantly streamlining Stable Diffusion workflows.
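For context, this is roughly what prompt-to-image generation with the SDXL Base checkpoint looks like using the Hugging Face diffusers library in plain PyTorch; the TensorRT extension described above swaps in optimised engines for the additional speedup, and the prompt and settings here are arbitrary examples.

```python
# Minimal sketch of text-to-image generation with the SDXL Base checkpoint via
# the Hugging Face diffusers library. This is standard PyTorch inference for
# illustration, not the TensorRT-accelerated path described in the article.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")                                   # run on the RTX GPU

image = pipe(
    prompt="a photorealistic digital twin of a factory floor, golden hour lighting",
    num_inference_steps=30,                    # example sampling settings
    guidance_scale=7.0,
).images[0]
image.save("factory_twin.png")
```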
TensorRT Acceleration
TensorRT acceleration was added last week to ComfyUI, a popular Stable Diffusion user interface. RTX users can now generate images from prompts 60% faster, and can even convert these images to videos 70% faster using Stable Video Diffusion with TensorRT.
TensorRT acceleration can be tested in the new UL Procyon AI Image Generation benchmark, which delivers 50% faster speeds on a GeForce RTX 4080 SUPER GPU compared with the fastest non-TensorRT implementation.
TensorRT acceleration is also coming soon for Stable Diffusion 3, Stability AI’s much-anticipated text-to-image model, boosting performance by 50%. The new TensorRT-Model Optimizer enables even faster acceleration, resulting in a 70% speedup over the non-TensorRT implementation along with a 50% reduction in memory consumption.
Of course, the real test is the practical scenario of iterating on an initial prompt. On RTX GPUs, users can refine prompts and regenerate images far more quickly, taking seconds per iteration rather than the minutes it can take on a MacBook Pro M3 Max. Running locally on an RTX-powered PC or workstation, users also get both speed and security, with everything staying private.
The Results Are Available and Can Be Shared
Recently, the team of engineers and AI researchers behind the open-source Jan.ai project integrated TensorRT-LLM into their local chatbot app, then put these optimisations to the test on their own systems.
TensorRT-LLM
The researchers tested TensorRT-LLM’s implementation against the open-source llama.cpp inference engine across a range of GPUs and CPUs used by the community. They found that TensorRT is “30-70% faster than llama.cpp on the same hardware,” as well as more efficient on consecutive processing runs. The team also shared its methodology, inviting others to measure generative AI performance for themselves.
Read more on Govindhtech.com