#intelxeoncpu
govindhtech · 7 months ago
Intel oneDPL(oneAPI DPC++ Library) Offloads C++ To SYCL
Offload Standard Parallel C++ Code to a SYCL Device Using the Intel oneDPL (oneAPI DPC++ Library).
Enhance C++ Parallel STL methods with multi-platform parallel computing capabilities. C++ algorithms can be executed in parallel and vectorized with the Parallel Standard Template Library, commonly known as Parallel STL or pSTL.
Utilizing the cross-platform parallelism capabilities of SYCL and the computational power of heterogeneous architectures, you may improve application performance by offloading Parallel STL algorithms to several devices (CPUs or GPUs) that support the SYCL programming framework. Multiarchitecture, accelerated parallel programming across heterogeneous hardware is made possible by the Intel oneAPI DPC++ Library (oneDPL), which allows you to offload Parallel STL code to SYCL devices.
The code example in this article will show how to offload C++ Parallel STL code to a SYCL device using the oneDPL pSTL_offload preview function.
Parallel API
As outlined in ISO/IEC 14882:2017 (commonly referred to as C++17) and C++20, the Parallel API in the Intel oneAPI DPC++ Library (oneDPL) implements the C++ standard algorithms with execution policies. It provides data-parallel execution of these algorithms on accelerators supported by SYCL in the Intel oneAPI DPC++/C++ Compiler, as well as threaded and SIMD execution on Intel processors, built on top of OpenMP and oneTBB.
The Parallel API also offers parallel range algorithms that accept an execution policy, extending the range algorithm capabilities introduced in C++20.
Furthermore, oneDPL offers specialized variants of several algorithms, such as:
Segmented reduction
Segmented scan
Vectorized search algorithms
Key-value pair sorting
Conditional transformation
Iterators and function object classes are part of the utility API. The iterators include counting and discard iterators, as well as permutation, zip, and transform iterators that operate on other iterators. The function object classes provide identity, minimum, and maximum operations that can be passed to reduction or transform algorithms.
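As a rough illustration of how these utility pieces combine with a device execution policy, the hedged sketch below reduces a generated sequence using oneDPL's counting iterator and maximum function object; the header, policy, and class names follow the public oneDPL documentation and should be verified against your installed oneDPL version.

```cpp
// Minimal sketch (not from the article): reducing 0..N-1 on a SYCL device with
// oneDPL's counting iterator and the maximum function object.
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
#include <oneapi/dpl/functional>
#include <iostream>

int main() {
    namespace dpl = oneapi::dpl;
    const int n = 1000;

    // counting_iterator generates 0, 1, 2, ... on the fly, so no host buffer is needed.
    dpl::counting_iterator<int> first(0), last(n);

    // dpcpp_default is oneDPL's device execution policy (runs on the default SYCL device).
    int sum = std::reduce(dpl::execution::dpcpp_default, first, last, 0);

    // A function object such as dpl::maximum can be supplied to the reduction.
    int max_val = std::reduce(dpl::execution::dpcpp_default, first, last, 0, dpl::maximum<int>());

    std::cout << "sum=" << sum << " max=" << max_val << "\n";
    return 0;
}
```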
An experimental implementation of asynchronous algorithms is also included in oneDPL.
Intel oneAPI DPC++ Library (oneDPL): An Overview
When used with the Intel oneAPI DPC++/C++ Compiler, oneDPL speeds up SYCL kernels for accelerated parallel programming on a variety of hardware accelerators and architectures. With the help of its Parallel API, which offers range-based algorithms, execution policies, and parallel extensions of C++ STL algorithms, C++ STL-style programs can be efficiently executed in parallel on multi-core CPUs and offloaded to GPUs.
It supports parallel computing libraries that developers are already acquainted with, such as Boost.Compute and Parallel STL. Its SYCL-specific API aids in GPU acceleration of SYCL kernels. In addition, you can use oneDPL's Device Selection API to dynamically assign available computing resources to your workload in accordance with pre-established device execution policies.
For simple, automatic CUDA to SYCL code conversion for multiarchitecture programming free from vendor lock-in, the library easily interfaces with the Intel DPC++ Compatibility Tool and its open equivalent, SYCLomatic.
About the Code Sample 
With just a few code modifications, the pSTL offload code example demonstrates how to offload common C++ parallel algorithms to SYCL devices (CPUs and GPUs). Using the -fsycl-pstl-offload option with the Intel oneAPI DPC++/C++ Compiler, it exploits an experimental oneDPL capability.
To perform data parallel computations on heterogeneous devices, the oneDPL Parallel API offers the following execution policies:
unseq for vectorized (SIMD) execution
par for parallel (multithreaded) execution
par_unseq, which combines the effects of the par and unseq policies
The following three programs/sub-samples make up the code sample:
FileWordCount uses C++17 parallel algorithms to count the words in a file,
WordCount counts how many words are generated, also using C++17 parallel algorithms, and
ParSTLTests implements various STL algorithms with the aforementioned execution policies (unseq, par, and par_unseq).
The code example shows how to use the -fsycl-pstl-offload compiler option and standard header inclusion in existing code to automatically offload STL algorithms invoked with the std::execution::par_unseq policy to a selected SYCL device.
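For context, the pattern relied on here is ordinary C++17 Parallel STL code, as in the hedged sketch below (not taken from the sample itself); the compile line in the comment is an assumption based on the option named above and should be checked against the oneDPL documentation.

```cpp
// Plain C++17 Parallel STL code -- no SYCL-specific constructs required.
// Assumed compile line (verify against the oneDPL documentation):
//   icpx -fsycl -fsycl-pstl-offload=gpu pstl_offload_demo.cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
    std::vector<float> data(1'000'000, 1.0f);

    // Algorithms invoked with std::execution::par_unseq are the ones the
    // pSTL offload feature redirects to the chosen SYCL device.
    std::transform(std::execution::par_unseq, data.begin(), data.end(),
                   data.begin(), [](float x) { return x * 2.0f + 1.0f; });

    float sum = std::reduce(std::execution::par_unseq, data.begin(), data.end(), 0.0f);
    std::cout << "sum = " << sum << "\n";
    return 0;
}
```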
You may offload your SYCL or OpenMP code to a specialized computing resource or an accelerator (such as a CPU, GPU, or FPGA) by using specific device-selection environment variables offered by the oneAPI programming paradigm. One such environment variable is ONEAPI_DEVICE_SELECTOR, which restricts the selection of devices from among all the compute resources that may be used to run the code in SYCL- and OpenMP-based applications. Additionally, the variable enables the selection of sub-devices as separate execution devices.
The code example demonstrates how to use the ONEAPI_DEVICE_SELECTOR variable to offload the code to a selected target device; oneDPL is then used to implement the offloaded code. By default, if no device is specified through the variable, the code is offloaded to the default SYCL device.
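To confirm which device the selector resolved to at runtime, a small standard SYCL 2020 query such as the following sketch can be used; the ONEAPI_DEVICE_SELECTOR value shown in the comment is only an illustrative example.

```cpp
// Prints the device the default SYCL queue resolves to.
// Example run (value is illustrative): ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./device_check
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q;  // default selector; honors ONEAPI_DEVICE_SELECTOR when set
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}
```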
The example shows how to offload STL code to an Intel Xeon CPU and an Intel Data Center GPU Max. However, offloading C++ STL code to any SYCL device may be done in the same way.
What Comes Next?
To speed up SYCL kernels on the newest CPUs, GPUs, and other accelerators, get started with oneDPL and examine oneDPL code examples right now!
For accelerated, multiarchitecture, high-performance parallel computing, we also encourage you to investigate other AI and HPC technologies built on the unified oneAPI programming paradigm.
Read more on govindhtech.com
govindhtech · 11 months ago
New NVIDIA L40S GPU-accelerated OCI Compute Instances
Oracle Cloud Infrastructure is Expanding NVIDIA GPU-Accelerated Instances for AI, Digital Twins, and Other Uses
In order to boost productivity, cut expenses, and spur creativity, businesses are quickly using generative  AI, large language models (LLMs), sophisticated visuals, and digital twins.
But for businesses to use these technologies effectively, they must have access to cutting-edge, full-stack accelerated computing systems. To meet this demand, Oracle Cloud Infrastructure (OCI) today announced the imminent release of a new virtual machine powered by a single NVIDIA H100 Tensor Core GPU, as well as the availability of NVIDIA L40S GPU bare-metal instances that can now be ordered. The new virtual machine complements OCI's existing H100 offering, which includes an NVIDIA HGX H100 8-GPU bare-metal instance.
These platforms offer strong performance and efficiency when combined with NVIDIA networking and the NVIDIA software stack, allowing businesses to enhance generative  AI.
You can now order the NVIDIA L40S GPU on OCI
Designed to provide innovative multi-workload acceleration for generative AI, graphics, and video applications, the NVIDIA L40S GPU is a universal data centre GPU. With its fourth-generation Tensor Cores and FP8 data format support, the L40S GPU is an excellent choice for inference in a variety of generative AI use cases, as well as for training and optimising small- to mid-size LLMs.
For Llama 3 8B with NVIDIA TensorRT-LLM at an input and output sequence length of 128, for instance, a single L40S GPU (FP8) may produce up to 1.4 times as many tokens per second as a single NVIDIA A100 Tensor Core GPU (FP16).
Additionally, the NVIDIA L40S GPU offers media acceleration and best-in-class graphics. It is perfect for digital twin and complex visualisation applications because of its numerous encode/decode engines and third-generation NVIDIA Ray Tracing Cores (RT Cores).
With support for NVIDIA DLSS 3, the L40S GPU offers up to 3.8 times the real-time ray-tracing capabilities of its predecessor, resulting in quicker rendering and smoother frame rates. Because of this, the GPU is perfect for creating apps on the NVIDIA Omniverse platform, which enables AI-enabled digital twins and real-time, lifelike 3D simulations. Businesses may create sophisticated 3D apps and workflows for industrial digitalization using Omniverse on the L40S GPU. These will enable them to design, simulate, and optimise facilities, processes, and products in real time before they go into production.
NVIDIA L40S 48GB
OCI's BM.GPU.L40S shape will include the L40S GPU. This bare-metal compute shape features four NVIDIA L40S GPUs, each with 48GB of GDDR6 memory, along with 1TB of system memory, 7.38TB of local NVMe SSD storage, and 112-core 4th generation Intel Xeon CPUs.
With OCI's bare-metal compute architecture, these shapes do away with the overhead of any virtualisation for high-throughput and latency-sensitive AI or machine learning workloads. By offloading data centre tasks from the CPUs, the NVIDIA BlueField-3 DPU in the accelerated compute shape improves server efficiency and speeds up networking, storage, and security workloads. By utilising BlueField-3 DPUs, OCI is advancing its off-box virtualisation approach across its whole fleet.
OCI Supercluster with NVIDIA L40S allows for ultra-high performance for up to 3,840 GPUs with minimal latency and 800Gbps internode bandwidth. NVIDIA ConnectX-7 NICs over RoCE v2 are used by OCI’s cluster network to handle workloads that are latency-sensitive and high throughput, such as  AI training.
"For 30% more efficient video encoding, we chose OCI AI infrastructure with bare-metal instances and NVIDIA L40S GPUs," stated Beamr Cloud CEO Sharon Carmel. Videos processed with Beamr Cloud on OCI will use 50% less network and storage traffic, resulting in two times faster file transfers and higher end-user productivity. Beamr will offer video AI workflows to OCI clients, getting them ready for the future of video.
OCI to Feature Single-GPU H100 VMs Soon
Soon to be available at OCI, the VM.GPU.H100.1 compute virtual machine shape is powered by a single NVIDIA H100 Tensor Core GPU. For businesses wishing to use the power of NVIDIA H100 GPUs for their generative  AI and HPC workloads, this will offer affordable, on-demand access.
A single H100 is a solid platform for LLM inference and smaller workloads. For instance, with NVIDIA TensorRT-LLM at an input and output sequence length of 128 and FP8 precision, a single H100 GPU can produce more than 27,000 tokens per second for Llama 3 8B (up to 4x greater throughput than a single A100 GPU at FP16 precision).
The VM.GPU.H100.1 shape is well-suited to a variety of AI workloads, with 13 cores of 4th Gen Intel Xeon processors, 246GB of system memory, and capacity for two 3.4TB NVMe drives.
"Oracle Cloud's bare-metal compute with NVIDIA H100 and A100 GPUs, low-latency Supercluster, and high-performance storage delivers up to 20% better price-performance for Altair's computational fluid dynamics and structural mechanics solvers," claimed Yeshwant Mummaneni, head engineer of data management analytics at Altair. "We are eager to use these GPUs in conjunction with virtual machines to power the Altair Unlimited virtual appliance."
GH200 Bare-Metal Instances Are Available for Validation
The BM.GPU.GH200 compute form is also available for customer testing from OCI. It has the NVIDIA Grace Hopper Superchip and NVLink-C2C, which connects the NVIDIA Grace CPU and NVIDIA Hopper GPU at 900GB/s with high bandwidth and cache coherence. With more than 600GB of RAM that is available, apps handling terabytes of data can operate up to 10 times faster than they would on an NVIDIA A100 GPU.
Software That’s Optimised for Enterprise AI
Businesses can speed up their  AI, HPC, and data analytics workloads on OCI with a range of NVIDIA GPUs. But an optimised software layer is necessary to fully realise the potential of these GPU-accelerated compute instances.
World-class generative AI applications may be deployed securely and reliably with the help of NVIDIA NIM, a set of user-friendly microservices that are part of the NVIDIA AI Enterprise software platform that is available on the OCI Marketplace. NVIDIA NIM is designed for high-performance AI model inference.
NIM pre-built containers, which are optimised for NVIDIA GPUs, give developers better security, a quicker time to market, and a lower total cost of ownership. The NVIDIA API Catalogue offers NIM microservices for common community models, which can be easily deployed on Oracle Cloud Infrastructure (OCI).
With the arrival of future GPU-accelerated instances, such as NVIDIA Blackwell and H200 Tensor Core GPUs, performance will only get better with time.
Contact OCI to test the GH200 Superchip and order the L40S GPU. Join Oracle and NVIDIA at SIGGRAPH, the world's preeminent graphics conference, taking place until August 1st, to find out more.
NVIDIA L40S Price
Priced at approximately $10,000 USD, the NVIDIA L40S GPU is intended for use in data centres and  AI tasks. It is an improved L40 that was created especially for AI applications rather than visualisation jobs. This GPU can be used for a variety of high-performance applications, including media acceleration, large language model (LLM) training, inference, and 3D graphics rendering. It is driven by NVIDIA’s Ada Lovelace architecture.
Read more on govindhtech.com
govindhtech · 1 year ago
Aurora Supercomputer Sets a New Record for AI Speed!
Intel Aurora Supercomputer
Together with Argonne National Laboratory and Hewlett Packard Enterprise (HPE), Intel announced at ISC High Performance 2024 that the Aurora supercomputer has broken the exascale barrier at 1.012 exaflops and is now the world's fastest AI system for open science, achieving 10.6 AI exaflops. Additionally, Intel will discuss how open ecosystems are essential to the advancement of AI-accelerated high performance computing (HPC).
Why This Is Important:
From the beginning, Aurora was intended to be an AI-centric system that would enable scientists to use generative AI models to hasten scientific discoveries. Early AI-driven research at Argonne has advanced significantly. Among the many achievements are the mapping of the 80 billion neurons in the human brain, the improvement of high-energy particle physics by deep learning, and the acceleration of drug discovery and design using machine learning.
Analysis
The Aurora supercomputer has 166 racks, 10,624 compute blades, 21,248 Intel Xeon CPU Max Series processors, and 63,744 Intel Data Centre GPU Max Series units, making it one of the world's largest GPU clusters. Its 84,992 HPE Slingshot fabric endpoints make up the largest open, Ethernet-based supercomputing interconnect on a single system.
The Aurora supercomputer crossed the exascale barrier at 1.012 exaflops using 9,234 nodes, just 87% of the system, and came in second on the high-performance LINPACK (HPL) benchmark. Aurora placed third on the HPCG benchmark at 5,612 TF/s with 39% of the machine. This benchmark aims to evaluate more realistic situations that offer insights into memory access and communication patterns, two crucial components of real-world HPC systems. It provides a fuller perspective of a system's capabilities, complementing benchmarks such as LINPACK.
How AI is Optimized
The Intel Data Centre GPU Max Series is the brains behind the Aurora supercomputer. The core of the Max Series is the Intel Xe GPU architecture, which includes specialised hardware such as matrix and vector compute blocks that are ideal for AI and HPC applications. Because of the unmatched computational performance provided by the Intel Xe architecture, the Aurora supercomputer won the high-performance LINPACK-mixed precision (HPL-MxP) benchmark, which best illustrates the significance of AI workloads in HPC.
The parallel processing power of the Xe architecture excels at handling the complex matrix-vector operations that are an essential part of neural network AI computing. These compute cores are key to speeding up the matrix operations that deep learning models rely on. In addition to a rich collection of performance libraries, optimised AI frameworks, and Intel's suite of software tools, which includes the Intel oneAPI DPC++/C++ Compiler, the Xe architecture supports an open ecosystem for developers that is distinguished by adaptability and scalability across a range of devices and form factors.
Enhancing Accelerated Computing with Open Software and Capacity
Intel will stress the value of oneAPI, which provides a consistent programming model across a variety of architectures. OneAPI, which is based on open standards, gives developers the freedom to write code that works across a range of hardware platforms without requiring significant changes or vendor lock-in. To overcome proprietary lock-in, Arm, Google, Intel, Qualcomm, and others are working towards this goal through the Linux Foundation's Unified Acceleration Foundation (UXL), which is creating an open environment for all accelerators and unified heterogeneous compute on open standards. The UXL Foundation is expanding its coalition by adding new members.
Meanwhile, Intel Tiber Developer Cloud is growing its compute capacity by adding new, cutting-edge hardware platforms and new service features that enable developers and businesses to evaluate the newest Intel architectures, rapidly innovate and optimise AI workloads and models, and then deploy AI models at scale. Large-scale Intel Gaudi 2-based and Intel Data Centre GPU Max Series-based clusters, as well as previews of Intel Xeon 6 E-core and P-core systems for select customers, are among the new hardware offerings. Intel Kubernetes Service for multiuser accounts and cloud-native AI training and inference workloads is one of the new features.
Next Up
Intel's commitment to advancing HPC and AI is demonstrated by the new supercomputers being deployed with Intel Xeon CPU Max Series and Intel Data Centre GPU Max Series technologies. The Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA) CRESCO 8 system will help advance fusion energy; the Texas Advanced Computing Centre (TACC) system is fully operational and will enable analyses ranging from biology to supersonic turbulence flows and atomistic simulations of a wide range of materials; and the United Kingdom Atomic Energy Authority (UKAEA) will solve memory-bound problems that underpin the design of future fusion power plants. These systems also include the Euro-Mediterranean Centre on Climate Change (CMCC) Cassandra climate change modelling system.
The results of the mixed-precision AI benchmark will serve as the foundation for Falcon Shores, Intel's next-generation GPU for AI and HPC. Falcon Shores will combine the best features of Intel Gaudi with the next-generation Intel Xe architecture, an integration that enables a single programming interface.
In comparison to the previous generation, early performance results on the Intel Xeon 6 with P-cores and Multiplexer Combined Ranks (MCR) memory at 8800 megatransfers per second (MT/s) deliver up to 2.3x performance improvement for real-world HPC applications, such as Nucleus for European Modelling of the Ocean (NEMO). This solidifies the chip’s position as the host CPU of choice for HPC solutions.
Read more on govindhtech.com
govindhtech · 2 years ago
Intel Cloud Optimization Enhances AWS AI
Intel Cloud Optimization on AWS
Because it provides infrastructure and scalability, cloud computing is often used to create and operate large AI systems. Amazon Web Services (AWS), one of the largest and most prominent CSPs, offers hundreds of services for building any cloud application. The platform's purpose-built databases and tools for AI and machine learning let developers and enterprises innovate faster, more cheaply, and more agilely.
Developers may accelerate their innovation on popular hardware technologies and further boost model efficiency by utilizing pre-built optimizations and tools for a wide range of applications and use cases on AWS. It can take a lot of time and resources to find and implement the best tools and optimizations for your project. The pain of adding additional architectures to code can be mitigated for developers by providing comprehensive documentation and guides that make the implementation of these optimizations simple.
Intel Cloud Optimization Modules: What Are They?
Intel Cloud Optimization Modules are a set of cloud-native, open-source reference architectures designed with production AI developers in mind. They further optimize the potential of cloud-based solutions that easily connect with AI workloads. These modules enable developers to apply AI solutions that are optimized for Intel processors and GPUs, thereby increasing workload efficiency and achieving peak performance.
With specially designed tools to complement and enrich the cloud experience on AWS with pertinent codified Intel AI software optimizations, the cloud optimization modules are accessible for well-known cloud platforms like AWS. With end-to-end AI software and optimizations for a range of use cases, including computer vision and natural language processing, these optimizations provide numerous important advantages for driving AI solutions.
Every module has a content bundle that contains a whitepaper with additional details on the module and its contents as well as the open-source GitHub repository with all of the documentation. The content packages also include a cheat sheet that lists the most pertinent code for each module, a video series, practical implementation walkthroughs, and the opportunity to attend office hours if you have any special implementation-related issues.
Intel Cloud Optimization Modules for the AWS Cloud
AWS users can choose from a number of Intel Cloud Optimization Modules, which include optimizations for popular AWS tools like SageMaker and Amazon Elastic Kubernetes Service. You can learn more about the various AWS optimization modules below:
GPT2-Small Distributed Training
Generative pre-trained transformer (GPT) models are widely used across a range of fields as GenAI applications. Since compact models are easier to build and deploy, a smaller large language model (LLM) is often sufficient for many use cases. This module shows developers how to optimize a GPT2-small (124M parameter) model for high-performance distributed training on an AWS cluster of Intel Xeon CPUs.
Using software optimizations and frameworks such as the Intel Extension for PyTorch and the oneAPI Collective Communications Library (oneCCL) to speed up the process and improve model performance in an efficient multi-node training environment, the module walks through the whole lifecycle of fine-tuning an LLM on a configured AWS cluster. The end result is an LLM on AWS that generates text tuned to your particular task and dataset.
SageMaker with XGBoost
Amazon SageMaker, a popular tool for creating, training, and deploying machine learning applications on AWS, comes with built-in Jupyter notebook instances and commonly used, optimized machine learning algorithms for faster model building. Working through this module will teach you how to activate the Intel AI Tools for accelerated models and inject your own training and inference code into a prebuilt SageMaker pipeline. This module accelerates an end-to-end custom machine learning pipeline on SageMaker by leveraging the Intel Optimization for XGBoost. The Lambda container has all the parts needed to create custom AWS Lambda functions with XGBoost and Intel oneDAL optimizations, while the XGBoost oneDAL container comes with the oneAPI Data Analytics Library to speed up model algorithms.
XGBoost within Kubernetes
Amazon Elastic Kubernetes Service (EKS), an automatically managed service, makes it simple for developers to launch, operate, and scale Kubernetes applications on AWS. Using EKS and the Intel AI Tools, this module makes it easier for developers to create and launch accelerated AI applications on AWS. With Intel oneDAL optimizations, developers can learn how to construct an accelerated Kubernetes cluster that uses the Intel Optimization for XGBoost for AI workloads. The module makes use of Elastic Load Balancer (ELB), Amazon Elastic Container Registry (ECR), and Amazon Elastic Compute Cloud (EC2) in addition to EKS.
Use Intel Cloud Optimization Modules to improve your AI projects on AWS by leveraging Intel optimizations and containers for widely used tools. To further your projects, you can learn how to use strong software optimizations and construct accelerated models on your preferred AWS tools and services. Use these modules to maximize the potential of your AWS projects, and register for office hours if you have any inquiries concerning implementation!
We invite you to explore Intel's additional AI Tools and framework optimizations and to discover the oneAPI programming paradigm, a unified, open, standards-based framework that serves as the basis for Intel's AI Software Portfolio. Additionally, visit the Intel Developer Cloud to try the newest AI-optimized software and hardware to help create and deploy your next cutting-edge AI projects!
Read more on Govindhtech.com
govindhtech · 7 months ago
Intel Data Direct I/O Performance With Intel VTune Profiler
Improve Intel Data Direct I/O (DDIO) Workload Performance with Intel VTune Profiler.
Profile uncore hardware performance events in Intel Xeon processors with oneAPI
One hardware feature included in Intel Xeon CPUs is Intel Data Direct I/O (DDIO) technology. By making the CPU cache the primary point of entry and exit for I/O data going into and out of the Intel Ethernet controllers and adapters, it contributes to advances in I/O performance.
To monitor the effectiveness of DDIO and Intel Virtualization Technology (Intel VT) for Directed I/O (Intel VT-d), which permits the independent execution of several operating systems and applications, it is essential to monitor uncore events, or events that take place outside the CPU core. By analyzing uncore hardware events, you can improve the performance of Intel Data Direct I/O (DDIO) workloads using Intel VTune Profiler, a performance analysis and debugging tool powered by oneAPI.
We’ll talk about using VTune Profiler to evaluate and enhance directed I/O performance in this blog. Let’s take a quick look at Intel Data Direct I/O technology before we go into the profiling approach.
Overview of the Intel Data Direct I/O (DDIO) Technology
Intel DDIO, part of Intel Integrated I/O technology, was launched in 2012 with the Intel Xeon processor E5 and E7 v2 generations. It aims to increase system-level I/O performance by employing a new processor-to-I/O data flow.
Before the development of Data Direct I/O technology, I/O operations were sluggish and processor cache was a scarce resource. Any data arriving from, or departing to, an Ethernet controller or adapter had to be stored in and retrieved from the host processor's main memory, and the data then had to be moved from main memory into the cache before it could be worked on.
This led to a lot of read and write operations in the memory. This also caused some additional, speculative read operations from the I/O hub in some of the older designs. Excessive memory accesses often lead to higher system power consumption and deterioration of I/O performance.
Intel DDIO technology was created to rearrange the flow of I/O data by making the processor cache the primary source and destination of I/O data instead of the main memory, as the processor cache is no longer a restricted resource.
Depending on the kind of workload at the workstation or on the server, the DDIO approach offers benefits like:
Higher transaction rates, reduced power consumption, reduced latency, increased bandwidth, and more.
There is no industry enablement needed for the Data Direct I/O technology.
It doesn’t rely on any hardware, and it doesn’t need any modifications to your operating system, drivers, or software.
Boost DDIO Performance Using Intel VTune Profiler
An uncore event is a function carried out in a CPU's uncore section, outside the processor core itself, that nevertheless affects processor performance as a whole. For instance, these events may be connected to activity in the Intel Ultra Path Interconnect (UPI) block, the memory controller, or the I/O stack.
A new recipe in the VTune Profiler Cookbook explains how to count these kinds of uncore hardware events using the tool's Input and Output analysis feature. You can use the resulting data to better understand Peripheral Component Interconnect Express (PCIe) traffic and behavior, and to analyze Data Direct I/O and VT-d efficiency.
The recipe explains how to run the Input and Output analysis, evaluate the findings, and interpret the resulting I/O metrics. In essence, VTune Profiler v2023.2 or later and a first-generation (or later) Intel Xeon Scalable CPU are needed. Although the approach applies to the most recent Intel Xeon processors, the I/O metrics and events covered in the recipe are based on the third-generation Intel Xeon Scalable processor.
Perform I/O Analysis with VTune Profiler
Start by analyzing your application's input and output using VTune Profiler. With the analysis feature, you can examine CPU, bus, and I/O subsystem use through a variety of platform-level metrics. You can get data indicating Intel Data Direct I/O (DDIO) use efficiency by turning on the PCIe traffic analysis option.
Analyze the I/O Metrics
VTune Profiler Web Server or VTune Profiler GUI may be used to examine the report that is produced as a consequence of the input and output analysis. Using the VTune Profiler Web Server Interface, the recipe illustrates the examination of many I/O performance indicators, including:
A platform diagram showing utilization of the physical cores, DRAM, PCIe, and Intel UPI links.
PCIe Traffic Summary, which includes metrics for both outgoing (caused by the CPU) and incoming (caused by I/O devices) PCIe traffic.
These measurements aid in the computation of CPU/IO conflicts, latency for incoming read/write requests, PCIe bandwidth and efficient use, and other factors.
Metrics to assess the workload’s effectiveness in re-mapping incoming I/O device memory locations to various host addresses using Intel VT-d technology.
Usage of DRAM and UPI bandwidth.
Read more on Govindhtech.com
govindhtech · 11 months ago
Next-Gen Computing: Exploring the Dell PowerEdge XR8000
The Dell PowerEdge XR8000 is your edge hero: built for simplicity, efficiency, and flexibility, it can help you realize your edge computing vision.
The Dell PowerEdge XR8000 is a game-changer, allowing for the seamless integration of artificial intelligence (AI), User Plane Function (UPF) and Multi-access Edge Computing (MEC) to enable a multitude of functionality at the edge. For applications like autonomous vehicles, smart cities, and industrial automation, the XR8000’s MEC reduces latency and improves user experience by bringing processing capacity closer to the data source.
Because of its AI capabilities, enterprises may implement machine learning models, inferencing, and intelligent analytics right at the edge, resulting in operational efficiency and real-time decision-making. It can facilitate data traffic control for UPF workloads in 5G networks, enhancing network dependability and performance.
Combining these capabilities makes the Dell PowerEdge XR8000 a vital tool for businesses looking to remain ahead of the rapidly changing digital landscape. It provides a reliable solution that can be tailored to even the most demanding edge computing environments and is future-proof.
Multi-Access Edge Computing (MEC)
By using MEC and bringing processing capability closer to data creation, communications service providers (CSPs) can boost IoT applications and real-time analytics for enterprises. This deliberate move takes advantage of 5G’s low latency and high bandwidth and establishes new revenue streams by offering cutting-edge solutions that drive digital transformation in multiple industries.
STL expects the MEC addressable market will expand 48% to $445 billion by 2030. MEC benefits business, public utilities, gaming and entertainment, and healthcare with its many applications.
Collaboration amongst several ecosystem participants is necessary for the implementation of MEC, including CSPs, infrastructure providers, and third-party application providers. The efficiency of the MEC hardware at the edge and the third-party MEC apps that are essential for particular industrial verticals determine the success of a MEC solution. The fact that Dell Technologies is an authority in the business sector is a plus.
Dell PowerEdge XR8000 provides a computational infrastructure for the MEC platform that may be utilized to host the MEC applications thanks to its distinctive sled-based architecture. Better ROI for consumers is made possible by its support for L4 GPUs, best-in-class Network Interface Cards, and 12-year warranty after purchase. Gaming, video surveillance, and content delivery networks are a few of the main uses.
The Dell PowerEdge XR8000 fulfils every criteria a provider might have in terms of hardware to meet MEC regulations. Because of its small depth and ruggedized design (NEBS level 3 Certification), this platform may be installed in an edge environment with confidence. Dense computation, ease of deployment, and a safe cyber platform for client data at the edge are all features of the XR8000.
Artificial intelligence (AI)
 AI’s growing needs in the telecom industry highlight the need for edge computing solutions strengthened by more powerful GPUs, more cores, and more thermal design power (TDP). With a projected size of $20.39 billion in 2023 and a projected growth rate of 27.5% from 2024 to 2032, the worldwide edge AI market is expected to reach $186.44 billion by 2032.
The buzz around the newest AI capabilities for telecom companies, which promise significantly enhanced operations and open doors to new services, needs to be balanced with the need for a server platform built for AI in telecom networks.
Edge computing and artificial intelligence are two new technologies that are combined to create  AI at the edge. AI provides business intelligence to the processed data for business insights, and edge computing assists in processing data at the edge.
Because there is no need to send the data back to the core, AI at the edge offers amazing benefits like reduced latency, more security, and cheaper operating costs in addition to greater bandwidth efficiency. Less data transmission volume to the cloud and real-time data processing while preserving data security and integrity are further advantages.
The latest Intel Xeon CPUs are supported by the Dell PowerEdge XR8000, which is a ruggedised, AI-capable server for the edge thanks to its support for NVIDIA L4 GPUs. It is a processing powerhouse for AI and GenAI that can support up to six L4 GPUs in a 2U form factor, which greatly enhances computer vision, inference performance, and data analytics.
The ability of the Dell PowerEdge XR8000 for AI to handle several AI workloads on a single chassis is what sets it apart from the competition and allows CSPs to diversify their deployment to diverse AI telecom workloads while also improving return on investment. Because of its flexible, compute-dense sled architecture, CSPs will be able to quickly enable new AI capabilities and confidently and easily deploy solutions.
As an AI server, the Dell PowerEdge XR8000 can be used to target the automotive, manufacturing, healthcare, energy, and telecom industries.
User Plane Function (UPF)
5G brings numerous new services and performs faster, more reliably, and with lower latency than 4G deployments. The separation of the 5G core into control and user planes (CUPS) enables CSPs to deploy the UPF at different locations and on different platforms, even though the RAN plays a crucial role in helping 5G achieve its goals.
By taking advantage of this, a Distributed User Plane Function (D-UPF) allows CSPs to locate the UPF close to the edge, where data is created. This will lower backhaul networking costs for CSPs, allow them to diversify revenue streams, and let them charge more for differentiated services.
For optimal performance, the UPF should be hosted on a commercially available off-the-shelf (COTS) platform, which can take use of cloudification and virtualization. A hardware platform that has been ruggedized for the edge is necessary for the deployment of edge UPF. The Dell PowerEdge XR8000 platform is a NEBS level 3 certified system that is highly suitable for D-UPF due to its temperature tolerance range of -20 to 65 degrees Celsius.
The hot pluggable sled-based architecture of the PowerEdge XR8000 provides redundancy in both power and computing. CSPs choose it as their preferred platform for D-UPF.
Championing technology, the Dell PowerEdge XR8000 is the most optimized edge server platform available. By putting processing capacity closer to the data source, lowering latency, and enhancing real-time processing, it strengthens MEC. Because of its strong architecture, which can withstand powerful  AI and GenAI capabilities, it enables intelligent data analysis and edge decision-making, sparking innovation in a variety of industries.
The Dell PowerEdge XR8000 guarantees smooth data routing and network traffic management for UPF, which is crucial for 5G installations. Discover the Dell PowerEdge XR8000 hero and open up new revenue opportunities at the edge.
Read more on govindhtech.com
govindhtech · 1 year ago
IBM Cloud Bare Metal Servers for VPCs Use 4th Gen Intel Xeon
The range of IBM  Cloud Bare Metal Servers for Virtual Private Clouds is being shaken up by new 4th Gen Intel Xeon processors and dynamic network bandwidth.
IBM is thrilled to announce that fourth-generation Intel Xeon CPUs are now available on IBM Cloud Bare Metal Servers for Virtual Private Cloud. IBM customers can now provision Intel's most recent microarchitecture within their own virtual private cloud, giving them access to a variety of performance benefits, such as increased core-to-memory ratios (21 new server profiles) and dynamic network bandwidth that is only available through IBM Cloud VPC. For those keeping track, that is three times as many provisioning options as the current second-generation Intel Xeon CPUs. Take a look around.
Are these bare metal servers suitable for my needs?
In addition to offering rapid provisioning, excellent network speeds, and the most secure software-defined resources available within IBM, IBM Cloud Bare Metal Servers for Virtual Private Cloud are hosted on IBM's most recent and most developer-friendly platform. Every one of your CPUs would be based on 4th Gen Intel Xeon processors, which IBM first introduced on IBM Cloud Bare Metal Servers for classic infrastructure in conjunction with Intel's day-one product release.
IBM Cloud Virtual Private Cloud is distinct from the classic IBM Cloud infrastructure. The classic infrastructure is better suited to large, steady-state, predictable workloads that call for the highest possible level of customisation, whereas IBM Cloud Virtual Private Cloud is an excellent solution for high-availability and maximum-elasticity requirements. Take a look at this brief introduction video to get a better understanding of which environment best suits your workload requirements.
If IBM Cloud Bare Metal Servers for Virtual Private Cloud turn out to be your preferred choice, the customisation options available to you include five pre-set profile families, which determine your number of CPU instances, RAM, and bandwidth. What sets IBM Cloud apart from other cloud services is that each profile provides DDR5 memory and dynamic network bandwidth ranging from 10 to 200 Gbps. Compute profiles are the most effective solution for tasks that require a significant amount of CPU power, such as heavy web traffic operations, production batch processing, and front-end web servers.
Balanced profiles are designed to provide a combination of performance and scalability, making them a great choice for databases of a moderate size and cloud applications that experience moderate traffic.
Memory profiles are most effective when applied to workloads that require a significant amount of memory, such as large cache and database applications, as well as in-memory analytics.
When it comes to running small to medium in-memory databases and OLAP, such as SAP BW/4 HANA, very high profiles are the most effective solutions.
Large in-memory databases and online transaction processing workloads are both excellent for ultra-high profiles because they offer the most memory per core.
For these bare metal servers, what kinds of workloads do you propose they handle?
Over the course of this year, IBM’s beta programme was exposed to a wide variety of workloads; nonetheless, there were a few noteworthy success stories that particularly stood out:
Building on top of IBM Cloud with VMware Cloud Foundation: these workloads required high core performance, interoperability with VMware, licence portability, a smaller core-count variety, and the Generic operating system option that IBM recently launched. In a dedicated location, they conducted tests for both VMware-managed VMware Cloud Foundation (VCF) and build-your-own VCF deployments.
They were happy with the customisation freedom and benchmark performance enhancements that backed up their findings. During the second half of the year, these workloads will be accessible on Intel Xeon profiles of the fourth generation within the IBM  Cloud Virtual Private Cloud.
With regard to HPCaaS, this workload was one of a kind, and IBM believes it is a primary use case for this distribution. Terraform and IBM Storage Scale were used in their tests to see whether improved performance could be achieved, and they were delighted with the throughput improvement and the agile provisioning experience across platforms and networking.
The task of providing financial services and banking necessitated both powerful and dedicated system performance, as well as the highest possible level of security and compliance. After conducting tests to determine capacity expansion, user interface experience, security controls, and security management, they were thrilled to find that production times had been reduced.
Beginning the process
Bare metal servers powered by 4th Gen Intel Xeon processors are currently available in IBM Cloud's Dallas, Texas data centres, with additional sites to be added in the second half of 2024. The IBM Cloud Bare Metal Servers for Virtual Private Cloud catalogue allows you to view all of the pricing and provisioning options for the new 4th Gen Intel Xeon processors and save a quote to your account. As an alternative, you could start a chat and get answers right now. You can find more information in the getting started guides and tutorials in the IBM Cloud docs.
Spend one thousand dollars in IBM Cloud credits
If you are an existing customer who is interested in provisioning new workloads or if you are inquisitive about deploying your first workload on IBM Cloud VPC, then you should be sure to take advantage of their limited time promotion for IBM Cloud VPC. By entering the promotional code VPC1000 within either the bare metal or virtual server catalogues, you will receive USD 1,000 in credits that may be used towards the purchase of your new virtual private cloud (VPC) resources. These resources include computing, network, and storage components. Only profiles based on the second generation of Intel Xeon processors and profiles from earlier generations are eligible for this promotion, which is only available for a limited period.
Read more on Govindhtech.com