#HPCWorkloads | Explore Tumblr posts and blogs

govindhtech · 9 months ago

Text

Google Cloud Parallelstore Powering AI And HPC Workloads

Parallelstore

Businesses process enormous datasets, execute intricate simulations, and train generative models with billions of parameters using artificial intelligence (AI) and high-performance computing (HPC) applications for a variety of use cases, including LLMs, genomic analysis, quantitative analysis, and real-time sports analytics. Their storage systems are under a lot of performance pressure from these workloads, which necessitate high throughput and scalable I/O performance that keeps latencies under a millisecond even when thousands of clients are reading and writing the same shared file at the same time.

Google Cloud is thrilled to share that Parallelstore, which is unveiled at Google Cloud Next 2024, is now widely available to power these next-generation AI and HPC workloads. Parallelstore, which is based on the Distributed Asynchronous Object Storage (DAOS) architecture, combines a key-value architecture with completely distributed metadata to provide high throughput and IOPS.

Continue reading to find out how Google Parallelstore meets the demands of demanding AI and HPC workloads by enabling you to provision Google Kubernetes Engine and Compute Engine resources, optimize goodput and GPU/TPU utilization, and move data in and out of Parallelstore programmatically.

Optimize throughput and GPU/TPU use

It employs a key-value store architecture along with a distributed metadata management system to get beyond the performance constraints of conventional parallel file systems. Due to its high-throughput parallel data access, it may overwhelm each computing client’s network capacity while reducing latency and I/O constraints. Optimizing the expenses of AI workloads is dependent on maximizing good output to GPUs and TPUs, which is achieved through efficient data transmission. Google Cloud may also meet the needs of modest-to-massive AI and HPC workloads by continuously granting read/write access to thousands of virtual machines, graphics processing units, and TPUs.

The largest Parallelstore deployment of 100 TiB yields throughput scaling to around 115 GiB/s, with a low latency of ~0.3 ms, 3 million read IOPS, and 1 million write IOPS. This indicates that a large number of clients can benefit from random, dispersed access to small files on Parallelstore. According to Google Cloud benchmarks, Parallelstore’s performance with tiny files and metadata operations allows for up to 3.7x higher training throughput and 3.9x faster training timeframes for AI use cases when compared to native ML framework data loaders.

Data is moved into and out of Parallelstore using programming

For data preparation or preservation, cloud storage is used by many AI and HPC applications. You may automate the transfer of the data you want to import into Parallelstore for processing by using the integrated import/export API. With the help of the API, you may ingest enormous datasets into Parallelstore from Cloud Storage at a rate of about 20 GB per second for files bigger than 32 MB and about 5,000 files per second for smaller files.

gcloud alpha parallelstore instances import-data $INSTANCE_ID –location=$LOCATION –source-gcs-bucket-uri=gs://$BUCKET_NAME [–destination-parallelstore-path=”/”] –project= $PROJECT_ID

You can programmatically export results from an AI training task or HPC workload to Cloud Storage for additional evaluation or longer-term storage. Moreover, data pipelines can be streamlined and manual involvement reduced by using the API to automate data transfers.

gcloud alpha parallelstore instances export-data $INSTANCE_ID –location=$LOCATION –destination-gcs-bucket-uri=gs://$BUCKET_NAME [–source-parallelstore-path=”/”]

GKE resources are programmatically provisioned via the CSI driver

The Parallelstores GKE CSI driver makes it simple to effectively manage high-performance storage for containerized workloads. Using well-known Kubernetes APIs, you may access pre-existing Parallelstore instances in Kubernetes workloads or dynamically provision and manage Parallelstore file systems as persistent volumes within your GKE clusters. This frees you up to concentrate on resource optimization and TCO reduction rather than having to learn and maintain a different storage system.

ApiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: parallelstore-class provisioner: parallelstore.csi.storage.gke.io volumeBindingMode: Immediate reclaimPolicy: Delete allowedTopologies:

matchLabelExpressions:

key: topology.gke.io/zone values:

us-central1-a

The fully managed GKE Volume Populator, which automates the preloading of data from Cloud Storage straight into Parallelstore during the PersistentVolumeClaim provisioning process, will be available to preload data from Cloud Storage in the upcoming months. This helps guarantee that your training data is easily accessible, allowing you to maximize GPU and TPU use and reduce idle compute-resource time.

Utilizing the Cluster Toolkit, provide Compute Engine resources programmatically

With the Cluster Toolkit’s assistance, deploying Parallelstore instances for Compute Engine is simple. Cluster Toolkit is open-source software for delivering AI and HPC workloads; it was formerly known as Cloud HPC Toolkit. Using best practices, Cluster Toolkit allocates computing, network, and storage resources for your cluster or workload. With just four lines of code, you can integrate the Parallelstore module into your blueprint and begin using Cluster Toolkit right away. For your convenience, we’ve also included starter blueprints. Apart from the Cluster Toolkit, Parallelstore may also be deployed using Terraform templates, which minimize human overhead and support operations and provisioning processes through code.

Respo.vision

Leading sports video analytics company Respo. Vision is using it to speed up its real-time system’s transition from 4K to 8K video. Coaches, scouts, and fans can receive relevant information by having Respo.vision help gather and label granular data markers utilizing Parallelstore as the transport layer. Respo.vision was able to maintain low computation latency while managing spikes in high-performance video processing without having to make costly infrastructure upgrades because to Parallelstore.

The use of AI and HPC is expanding quickly. The storage solution you need to maintain the demanding GPU/TPUs and workloads is Parallelstore, with its novel architecture, performance, and integration with Cloud Storage, GKE, and Compute Engine.

Read more on govindhtech.com

#GoogleCloud #ParallelstorePoweringAI #AIworkloads #HPCWorkloads #artificialintelligence #AI #news #cloudstorage #GoogleCloudNext2024 #ClusterToolkit #ComputeEngine #GoogleKubernetesEngine #technology #technews #GOVINDHTECH

0 notes

govindhtech · 10 months ago

Text

PCS AWS: AWS Parallel Computing Service For HPC workloads

PCS AWS

AWS launching AWS Parallel Computing Service (AWS PCS), a managed service that allows clients build up and maintain HPC clusters to execute simulations at nearly any scale on AWS. The Slurm scheduler lets them work in a familiar HPC environment without worrying about infrastructure, accelerating outcomes.

AWS Parallel Computing

Run HPC workloads effortlessly at any scale.

Why AWS PCS?

AWS Parallel Computing Service (AWS PCS) is a managed service that simplifies HPC workloads and Slurm-based scientific and engineering model development on AWS. PCS AWS lets you create elastic computing, storage, networking, and visualization environments. Managed updates and built-in observability features make cluster management easier with AWS PCS. You may focus on research and innovation in a comfortable environment without worrying about infrastructure.

Benefits

Focus on labor, not infrastructure

Give users comprehensive HPC environments that scale to run simulations and scientific and engineering modeling without code or script changes to boost productivity.

Manage, secure, and scale HPC clusters

Build and deploy scalable, dependable, and secure HPC clusters via the AWS Management Console, CLI, or SDK.

HPC solutions using flexible building blocks

Build and maintain end-to-end HPC applications on AWS using highly available cluster APIs and infrastructure as code.

Use cases

Tightly connected tasks

At almost any scale, run concurrent MPI applications like CAE, weather and climate modeling, and seismic and reservoir simulation efficiently.

Faster computing

GPUs, FPGAs, and Amazon-custom silicon like AWS Trainium and AWS Inferentia can speed up varied workloads like creating scientific and engineering models, protein structure prediction, and Cryo-EM.

Computing at high speed and loosely linked workloads

Distributed applications like Monte Carlo simulations, image processing, and genomics research can run on AWS at any scale.

Workflows that interact

Use human-in-the-loop operations to prepare inputs, run simulations, visualize and evaluate results in real time, and modify additional trials.

AWS ParallelCluster

In November 2018, AWS launched AWS ParallelCluster, an AWS-supported open-source cluster management tool for AWS Cloud HPC cluster deployment and maintenance. Customers can quickly design and deploy proof of concept and production HPC computation systems with AWS ParallelCluster. Open-source AWS ParallelCluster Command-Line interface, API, Python library, and user interface are available. Updates may include cluster removal and reinstallation. To eliminate HPC environment building and operation chores, many clients have requested a completely managed AWS solution.

AWS Parallel Computing Service (AWS PCS)

PCS AWS simplifies AWS-managed HPC setups via the AWS Management Console, SDK, and CLI. Your system administrators can establish managed Slurm clusters using their computing, storage, identity, and job allocation preferences. AWS PCS schedules and orchestrates simulations using Slurm, a scalable, fault-tolerant work scheduler utilized by many HPC clients. Scientists, researchers, and engineers can log into AWS PCS clusters to conduct HPC jobs, use interactive software on virtual desktops, and access data. Their workloads can be swiftly moved to PCS AWS without code porting.

Fully controlled NICE DCV remote desktops allow specialists to manage HPC operations in one place by accessing task telemetry or application logs and remote visualization.

PCS AWS uses familiar methods for preparing, executing, and analyzing simulations and computations for a wide range of traditional and emerging, compute or data-intensive engineering and scientific workloads in computational reservoir simulations, electronic design automation, finite element analysis, fluid dynamics, and weather modeling.

Starting AWS Parallel Computing Service

AWS documentation article for constructing a basic cluster lets you try AWS PCS. First, construct a VPC with an AWS CloudFormation template and shared storage in Amazon EFS in your account for the AWS Region where you will try PCS AWS. AWS literature explains how to create a VPC and shared storage.

Cluster

Select Create cluster in the PCS AWS console to manage resources and run workloads.

Name your cluster and select your Slurm scheduler controller size. Cluster workload limits are Small (32 nodes, 256 jobs), Medium (512 nodes, 8,192 tasks), and Large (2,048 nodes, 16,384 jobs). Select your VPC, cluster launch subnet, and cluster security group in Networking.

A resource selection method parameter, an idle duration before compute nodes scale down, and a Prolog and Epilog scripts directory on launched compute nodes are optional Slurm configurations.

Create cluster. Provisioning the cluster takes time.

Form compute node groupings

After constructing your cluster, you can create compute node groups, a virtual grouping of Amazon EC2 instances used by PCS AWS to enable interactive access to a cluster or perform processes in it. You define EC2 instance types, minimum and maximum instance counts, target VPC subnets, Amazon Machine Image (AMI), purchasing option, and custom launch settings when defining a compute node group. Compute node groups need an instance profile to pass an AWS IAM role to an EC2 instance and an EC2 launch template for AWS PCS to configure EC2 instances.

Select the Compute node groups tab and the Create button in your cluster to create a compute node group in the console.

End users can login to a compute node group, and HPC jobs run on a job node group.

Use a compute node name and a previously prepared EC2 launch template, IAM instance profile, and subnets to launch compute nodes in your cluster VPC for HPC jobs.

Next, select your chosen EC2 instance types for compute node launches and the scaling minimum and maximum instance count.

Select Create. Provisioning the computing node group takes time.

Build and run HPC jobs

After building compute node groups, queue a job to run. Job queued until PCS AWS schedules it on a compute node group based on provisioned capacity. Each queue has one or more computing node groups that supply EC2 instances for processing.

Visit your cluster, select Queues, and click Create queue to create a queue in the console.

Select Create and wait for queue creation.

AWS Systems Manager can connect to the EC2 instance it creates when the login compute node group is active. Select your login compute node group EC2 instance in the Amazon EC2 console. The AWS manual describes how to create a queue to submit and manage jobs and connect to your cluster.

Create a submission script with job requirements and submit it to a queue with the sbatch command to perform a Slurm job. This is usually done from a shared directory so login and compute nodes can access files together.

Slurm may perform MPI jobs in PCS AWS. See AWS documents Run a single-node job with Slurm or Run a multi-node MPI task with Slurm for details.

Visualize with a fully managed NICE DCV remote desktop. Start with the HPC Recipes for AWS GitHub CloudFormation template.

After HPC jobs using your cluster and node groups, erase your resources to minimize needless expenses. See AWS documentation Delete your AWS resources for details.

Know something

Some things to know about this feature:

Slurm versions – AWS PCS initially supports Slurm 23.11 and enables tools to upgrade major versions when new versions are added. AWS PCS also automatically patches the Slurm controller.

On-Demand Capacity Reservations let you reserve EC2 capacity in a certain Availability Zone and duration to ensure you have compute capacity when you need it.

Network file systems Amazon FSx for NetApp ONTAP, OpenZFS, File Cache, EFS, and Lustre can be attached to write and access data and files. Self-managed volumes like NFS servers are possible.

Now available

US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) now provide AWS Parallel Computing Service.