Demystifying Orchestrators: Navigating the Landscape of Management and Orchestration Solutions
We often tell our stakeholders that Orchestration is a general term, and that we need to clarify which types of Orchestrators are available to apply in their network architecture.
In this blog, I will give an overview of various aspects of Orchestrators, since not all are created equal.
This can help decision makers choose what works for their organizational needs.
So what do orchestrators do? In short, they perform tasks such as provisioning resources, configuring them, and managing them through their lifecycle.
Next comes the scope of Orchestration functionality. This can be simplified into Day-0 or Day-N management. Some Orchestrators (typically the ones used on public clouds) usually perform only Day-0, which is to spin up the required resources for the operation of the function (Infrastructure, network functions, etc.). If any changes need to be made during the life of these resources (e.g., they need to be reconfigured, either manually or based on some configurable policies), there is no mechanism for doing so. The latter functionality is known as Day-N orchestration/management, also referred to as LCM (Life Cycle Management).
This itself has two distinct functions -- monitoring and taking actions (open loop or closed loop). Either the Orchestrator performs both of these functions, or it relies on other tools/utilities to monitor, and can simply perform corrective actions, such as the reconfigurations. In the virtualized world (VNF/CNFs), the reconfiguration could also include healing, scaling-in, scaling-out, etc.
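To make the open-loop versus closed-loop distinction concrete, here is a minimal, illustrative Python sketch of a Day-N closed loop. The metric, thresholds, and helper functions are hypothetical placeholders, not any particular Orchestrator's API.

```python
from typing import Optional

def get_cpu_utilization(nf_instance: str) -> Optional[float]:
    """Placeholder: query the monitoring system for a VNF/CNF metric."""
    ...

def scale_out(nf_instance: str) -> None:
    """Placeholder: ask the Orchestrator to add a replica of the function."""
    ...

def heal(nf_instance: str) -> None:
    """Placeholder: ask the Orchestrator to restart or redeploy the function."""
    ...

def closed_loop_step(nf_instance: str, scale_threshold: float = 0.80) -> None:
    util = get_cpu_utilization(nf_instance)
    if util is None:            # no data usually means the function is unhealthy
        heal(nf_instance)
    elif util > scale_threshold:
        scale_out(nf_instance)
    # In an open-loop setup the same logic would only raise an alert
    # and leave the corrective action to a human operator.
```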
Scalability is another important factor in choosing the Orchestrator. If your use cases need to run on Edge networks (Enterprise Edge or Cloud Edge, also known as the New Middle Mile), the Orchestrator needs to be far more scalable than Cloud-scale ones (those confined to a few public cloud locations and clusters). The number of Network services and Applications (and of vendors) is also an order of magnitude greater in the case of Edge networks (1000s as opposed to 10s).
There is another twist to this tale -- some Orchestrators may be vendor or cloud-specific, which work well for the proprietary vendor or cloud resources, but obviously cannot be used for other vendor products or clouds. So if your environment is multi-cloud/multi-vendor, this option may not be suitable for you.
Independent of the above (but somewhat related in reality) is the openness of the Orchestrator implementations -- whether they are closed source implementations (proprietary) or open source based. In theory, the Orchestrator could be working with proprietary target functions but could be open source based (or vice versa). But in practice, they go hand in hand.
Lastly, some orchestrators may be specific to a particular domain (e.g., 5G Core, Transport or RAN). O-RAN Service Management and Orchestration (SMO), as the name implies, is an example of an Orchestrator that is specific to the RAN domain.
There is another category of vendors who offer Cloud Networking, which are also sometimes referred to as Orchestrators. But their functions are typically meant for specific networking use cases such as SD-WAN, MPLS Replacement, etc.
Regardless of their specific functions (which may include any of those mentioned above), there is another nuance in how the solution is offered: either by a Technology Provider or by a Service Provider. The former provides a level of customization that may be necessary for specific use cases or environments, whereas the latter (though well-packaged) is typically fixed-function.
With so many variations and nuances, users need to think through their specific organizational needs (present and future) before deciding on which solution to choose from the choices available.
I hope this sheds some light on the topic and gives some clarity on how to go about choosing a vendor. At Aarna.ml, we offer AMCOP, an open source, zero-touch orchestrator (also offered as a SaaS, AES), for lifecycle management, real-time policy, and closed loop automation for edge and 5G services. If you’d like to discuss your orchestration needs, please contact us for a free consultation.
This content originally posted on https://www.aarna.ml/
Solving Edge Workload Orchestration and Management Challenges with Nephio
It’s been a little over a year since Nephio was first announced and the progress the community has made since that time has been inspiring. In my role on the Nephio TSC, I’ve had the chance to collaborate across the industry and build a community around this promising new approach to intent-based network orchestration.
Nephio is declarative and intent driven (an intent driven system monitors the current state and continuously reconciles it with the intended state) and presents a new approach to the industry for orchestration and management of network services, edge computing applications, and the underlying infrastructure. What's most exciting to me is using Nephio to overcome key challenges in edge workload orchestration and management.
Nephio takes on these challenges by using Kubernetes as a general purpose control plane. It constantly reconciles the current state with the intended state which is required for scaling large and complex networks. It is suitable for both infrastructure and workloads for on-demand infrastructure and clouds. And because it’s heterogeneous, it can handle both public and private clouds, multi-vendor environments, and third-party network functions and cloud native applications.
I’m also excited that Nephio uses the O2 interface (as defined in O-RAN architecture) as one of its stated use cases. In the recent LF Networking Developer & Testing Forum, I presented a session covering how O2 IMS (Infrastructure Management Service) can be achieved on Nephio architecture and how Nephio is influencing the evolution of radio networks with an O-RAN-O2 IMS K8s profile. With its simplicity and cloud native Kubernetes approach, I believe Nephio is a perfect fit for the management and orchestration of cloud native RAN workloads and I’m looking forward to further collaboration in this area.
Lastly, I’m very proud of the way the community came together to achieve consensus on the scope, produce the release artifacts and documentation with many new features and capabilities, and deliver R1 for the industry.
If you’ve been on the sidelines with Nephio, I encourage you to get involved in the community by learning more on the Nephio wiki. Feel free to contact me as well about any questions or ideas for Nephio you may have and I’d be happy to reserve some time to discuss them.
This content originally posted on https://www.aarna.ml/
Optimize O-RAN: Performance Management with O-RAN SMO
Performance Management is a key function of the O-RAN Service Management and Orchestration (SMO) framework. The O-RAN SMO provides a set of performance management functions that enable network operators to monitor, measure, and optimize the performance of their O-RAN networks.
The O-RAN SMO provides a rich set of performance metrics that cover various aspects of the network, including radio access, transport, and core network performance. These metrics are collected in real-time from different network elements and can be used for real-time monitoring, analysis, and troubleshooting.
We recently developed a new SMO Performance Dashboard and integrated it with AMCOP; you can learn more in my previous blog post.
The O-RAN SMO also provides advanced analytics capabilities that enable network operators to detect and predict performance issues before they occur. This helps operators to proactively address performance issues, minimize network downtime, and improve overall network quality.
In addition to real-time monitoring and analytics, the O-RAN SMO also provides historical performance analysis capabilities. This enables network operators to analyze performance trends over time, identify areas for improvement, and optimize network performance over the long term.
Overall, the O-RAN SMO's performance management capabilities are critical for ensuring the reliability, availability, and quality of O-RAN networks. By providing real-time monitoring, predictive analytics, and historical analysis, the O-RAN SMO enables operators to optimize network performance, reduce downtime, and improve customer satisfaction.
Aarna.ml offers the number one open source and vendor neutral O-RAN SMO as part of Aarna.ml Multi Cluster Orchestration Platform (AMCOP). Learn more about O-RAN SMO and contact us for a free consultation.
This content originally posted on https://www.aarna.ml/
Why Cloud GPU Instances Are the Smartest Compute Choice in 2025
In a world where AI, machine learning, and data analytics are reshaping entire industries, the need for massive compute power has become unavoidable. But gone are the days when that meant investing in expensive infrastructure. Today, with Cloud GPU Instances and Spot Instance Pricing, high-performance computing is accessible and remarkably affordable.
Cloud GPU Instances
Cloud GPU Instances are virtual machines equipped with powerful GPUs like NVIDIA A100 or H100, designed for compute-intensive tasks like AI model training, deep learning, graphics rendering, and scientific simulations.
You don’t need to purchase or maintain expensive physical GPUs. Instead, GPU-as-a-Service (GPUaaS) platforms let you pay as you go. No upfront capital, just raw power when you need it.
And the market is booming. In 2023, GPUaaS was valued at $3.3 billion. By 2032? It’s forecasted to reach $33.91 billion, growing at a CAGR of 29.42%.
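As a quick sanity check of those figures, compounding $3.3 billion at 29.42% per year over the nine years from 2023 to 2032 lands close to the quoted forecast:

```python
# Compound-growth check: value_2032 = value_2023 * (1 + CAGR)^years
value_2023 = 3.3          # USD billions
cagr = 0.2942
years = 2032 - 2023       # 9 years

value_2032 = value_2023 * (1 + cagr) ** years
print(f"Projected 2032 market size: ${value_2032:.2f}B")  # ~ $33.6B, in line with the $33.91B forecast
```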
AWS Spot Instances
According to Amazon, EC2 Spot Instances can save you up to 90% compared to On-Demand pricing. That’s because Spot Instances use spare EC2 capacity. Instead of letting unused infrastructure sit idle, AWS offers it at steep discounts.
But Spot Instances come with a twist: they can be interrupted with just a two-minute warning if AWS needs the capacity back. While Amazon notes that Spot Instances are terminated less than 5% of the time, this makes them better suited for fault-tolerant workloads, such as:
• Background processing
• Data analytics
• Batch jobs
• Optional or redundant tasks
For applications that demand consistent uptime, like mission-critical systems, On-Demand or Reserved Instances are often a better fit.
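For workloads that do run on Spot, the two-minute warning is delivered through the EC2 instance metadata service. A minimal polling sketch, using IMDSv2 and run on the instance itself, might look like this:

```python
import time
import requests

IMDS = "http://169.254.169.254"

def get_imds_token() -> str:
    # IMDSv2: request a short-lived session token first
    r = requests.put(
        f"{IMDS}/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    )
    return r.text

def spot_interruption_pending(token: str) -> bool:
    # Returns 200 with an action/time payload when an interruption is scheduled,
    # and 404 otherwise.
    r = requests.get(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=2,
    )
    return r.status_code == 200

if __name__ == "__main__":
    token = get_imds_token()
    while not spot_interruption_pending(token):
        time.sleep(5)   # poll every few seconds; you have roughly two minutes once it fires
    print("Interruption notice received: checkpoint and drain now.")
```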
Spot vs On-Demand vs Reserved Instances: What’s the Difference?
When evaluating AWS compute options, it’s essential to understand the distinctions between Spot, On-Demand, and Reserved Instances.
Spot Instances are the most cost-effective, offering up to 90% savings compared to On-Demand pricing. Their availability depends on unused EC2 capacity, and they come with the risk of interruptions—AWS can reclaim the instance with just a 2-minute warning. These instances require no long-term commitment and are best suited for flexible, fault-tolerant workloads like batch processing, analytics, and optional jobs. Billing is per second, but there’s no SLA guarantee. Launching Spot Instances is made easy with tools like Spot Fleet and Auto Scaling Groups.
On-Demand Instances offer consistent availability and require no upfront commitment, making them ideal for dynamic or unpredictable workloads. Though they have the highest hourly cost, they provide full control with no risk of termination. Billing is per second, and these instances can be launched through the EC2 Console or SDKs.
Reserved Instances strike a balance between cost and reliability. By committing to a 1- or 3-year term, users can save up to 72% over On-Demand pricing. These instances are guaranteed in availability and are ideal for predictable, mission-critical tasks. They can be paid upfront or in monthly installments and are managed through the Console or CLI.
What Exactly Is a Spot Instance?
In essence, a Spot Instance is an EC2 virtual machine that lets you tap into AWS’s spare compute capacity at a fraction of the price. You’re charged based on the current Spot Price, which fluctuates based on long-term trends in supply and demand—not real-time bidding.
If your maximum price exceeds the current Spot Price and capacity is available, your instance launches. But AWS can reclaim it at any time, issuing a 2-minute termination notice.
Despite the risk, Spot Instances are:
• Built on the same architecture as On-Demand or Reserved Instances
• Easily deployed using the same tools (Spot Fleet, EC2 Console, Auto Scaling, etc.)
• Perfect for experimentation, scale-out architectures, and budget-conscious operations
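For illustration, launching a Spot Instance programmatically is a one-parameter change from an On-Demand launch. A hedged boto3 sketch follows; the AMI ID, instance type, and price cap are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Same run_instances call as On-Demand; InstanceMarketOptions is what makes it Spot.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="g5.xlarge",          # placeholder GPU instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.50",                      # optional cap, in USD per hour
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```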
Spot vs Preemptible vs Azure Spot: Cloud Providers Compared
Spot-like instances from major cloud providers offer compelling cost savings, but each has unique characteristics.
AWS Spot Instances have a variable, demand-based pricing model and no time limits. Users receive a 2-minute termination warning before AWS reclaims capacity. Billing is per second (after the first 60 seconds), and they’re well-suited for AI, ML, batch, and data-intensive tasks. Integration tools include Spot Fleet and Auto Scaling Groups, and interruptions are relatively rare—less than 5%.
Google Cloud Preemptible VMs offer a fixed discount (up to 80%) with a maximum usage limit of 24 hours. Termination comes with a 30-second warning, and billing is per second. These are ideal for batch and stateless workloads, and integration is supported via GKE and Managed Instance Groups. The frequency of interruptions varies.
Azure Spot VMs also feature variable, capacity-based pricing with no usage time limits. Like GCP, they offer a 30-second termination warning and no SLA. Billing is per second, and they’re perfect for batch jobs and scale-out workloads. Integration is supported via VM Scale Sets and Azure Kubernetes Service (AKS), with flexible eviction policy options.
Each platform is interruption-prone and lacks SLA coverage, making them ideal only for resilient workloads.
Why Not Always Use Spot?
Despite the massive cost advantage, Spot Instances aren’t a universal solution. According to AWS, and best practice, you should avoid using Spot for:
• Mission-critical applications
• Real-time services requiring 100% uptime
• Stateful systems that can’t handle interruptions
Instead, use Spot Instances where resiliency and flexibility are baked into the architecture. For example, stateless microservices can restart without data loss. Large ML jobs can checkpoint progress. And autoscaling fleets can absorb interruptions without disruption.
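The checkpointing pattern mentioned above is straightforward in practice. Here is a minimal, illustrative PyTorch-style sketch; the checkpoint path and the surrounding training loop are placeholders.

```python
import os
import torch

CKPT = "/mnt/checkpoints/model.pt"   # placeholder path, ideally on durable storage (e.g. S3/EFS)

def save_checkpoint(model, optimizer, epoch):
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        CKPT,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0                      # nothing saved yet, start from scratch
    state = torch.load(CKPT)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1         # resume from the next epoch

# Training loop sketch: save after every epoch so a Spot interruption
# costs at most one epoch of work.
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer)
#     save_checkpoint(model, optimizer, epoch)
```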
Reserved Instances: Lock In Long-Term Value
If you know your compute needs ahead of time, Reserved Instances can save you up to 72% over On-Demand pricing. By committing to 1- or 3-year plans (paid upfront or monthly), you gain:
• Cost predictability
• Guaranteed availability
• Ideal conditions for critical, steady workloads
You can also opt for Convertible Reserved Instances, which allow some flexibility in instance family or OS type, even during the term.
On-Demand Instances: When You Need Control
On-Demand Instances offer:
• High availability
• Scalability without commitment
• Zero interruption risk
That makes them the go-to choice for:
• Apps with spiky, unpredictable workloads
• Early-stage testing or deployment
• Services where uptime is non-negotiable
They’re the most expensive option, but sometimes, that reliability is worth every penny.
Industry Outlook
The compute economy is shifting fast. In Q1 2025, Microsoft reported $42.4 billion in cloud revenue, with a 20% YoY increase in AI-related services. Capital expenditures by hyperscalers are projected to hit $251 billion in 2025, up from $218 billion in 2024.
AWS, GCP, and Azure are all expanding GPU-backed infrastructure. And newer players like Vultr (now valued at $3.5 billion) are raising capital to invest heavily in cloud GPU delivery.
Meanwhile, the data center GPU market is forecast to jump from $16.94 billion in 2024 to $192.68 billion by 2034, with cloud deployments leading the charge.
Trace Management: A Key Function in O-RAN SMO
Recently we blogged about optimizing O-RAN performance and management using O-RAN SMO and a new O-RAN SMO performance dashboard now available in AMCOP.
Another key function of the O-RAN Service Management and Orchestration (SMO) framework is Trace Management. The O-RAN SMO provides a set of trace management functions that enable network operators to collect, store, and analyze trace data from different network elements.
The O-RAN SMO's trace management capabilities include real-time trace collection and analysis, as well as long-term trace storage and retrieval. The trace data collected from different network elements can be used for a variety of purposes, including network troubleshooting, performance optimization, and security analysis.
The O-RAN SMO's trace management functions are designed to support the O-RAN Alliance's open and standardized interfaces and protocols. This ensures that trace data can be collected from different vendors' equipment and analyzed in a consistent and interoperable way.
The O-RAN SMO's trace management capabilities also support advanced analytics and machine learning techniques. This enables network operators to detect and analyze patterns in the trace data, identify performance issues, and predict potential problems before they occur.
Trace Data can be reported from the Network Function (NF) to the SMO via trace files or via a streaming interface. Currently, Aarna’s SMO offering supports file-based trace records that are available to the SMO and streaming support will be available soon. Trace management utilizes O-RAN file management standards to collect trace files from NF, and notifications can be transmitted through NETCONF or VES.
Let us take a closer look at the high-level workflow involved in using AMCOP SMO (Service Management and Orchestration) to enhance trace collection processes.
High-level Workflow:
1. User Initiates Trace Job Creation:
The workflow begins with a user leveraging AMCOP SMO to create one or more trace jobs. These trace jobs specify the criteria and parameters for collecting trace data from the RAN elements.
2. User Requests Trace Collection:
Once the trace jobs are configured, the user sends an RPC (Remote Procedure Call) to initiate the trace collection process. This RPC serves as the trigger for data collection.
3. RAN Element Starts Trace Collection:
Upon receiving the RPC, the RAN element responsible for trace collection commences its operations. It starts collecting trace data based on the predefined job parameters.
4. Notification to AMCOP SMO:
As the RAN element collects trace data, it communicates with the Management Service (MnS) Consumer, which is typically the AMCOP SMO component. This communication is achieved through a notification event known as "notifyFileReady."
5. File Generation and Notification:
The "notifyFileReady" event indicates that a Trace Measurement file has been successfully generated and is ready for upload. This notification acts as a signal for further actions in the workflow.
6. SMO Notifies VES Collector:
AMCOP SMO, upon receiving the notification, takes the next step by notifying the VES (Virtual Event Stream) Collector. The VES Collector is responsible for gathering events and data from various network elements.
7. DFC Polls VES Collector:
The Data File Collector (DFC), an integral part of the process, continuously polls the VES Collector. This polling aims to retrieve information about the generated trace files and their locations.
8. Trace File Upload to DFC:
Armed with the file information obtained from the VES Collector, the DFC proceeds to upload the trace files from the RAN element to its storage location within the DFC pod.
9. Storage Location in DFC Pod:
The trace files are stored in a specific directory structure within the DFC pod. This structure typically follows the pattern: /tmp/onap_datafile/<RAN Device name>/<file path from notification>. This organization simplifies data management and retrieval.
10. Optional SFTP Upload:
If a Secure File Transfer Protocol (SFTP) uploader is configured, the DFC can upload the trace files to an external SFTP server. This step enhances data redundancy and accessibility.
11. Data Retention Decision:
The DFC may decide whether to retain or delete the trace files after successful upload based on a predefined variable, typically referred to as "delete_datafile."
12. User-Controlled Trace Collection Termination:
To maintain flexibility, users have the ability to stop trace collection jobs at any time by sending a stop trace job RPC. This user-initiated action allows for the management of trace collection processes.
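To make steps 7 through 9 more concrete, the sketch below shows how a stripped-down file collector could fetch a trace file over SFTP and store it under the directory pattern from step 9. The host, credentials, and remote path are placeholders, and this is an illustration rather than AMCOP's actual DFC implementation.

```python
import os
import paramiko

def collect_trace_file(ran_host: str, username: str, password: str,
                       ran_device_name: str, remote_path: str) -> str:
    """Fetch one trace file announced by a notifyFileReady-style event."""
    local_dir = f"/tmp/onap_datafile/{ran_device_name}/{os.path.dirname(remote_path).lstrip('/')}"
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(remote_path))

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(ran_host, username=username, password=password)
    try:
        sftp = ssh.open_sftp()
        sftp.get(remote_path, local_path)   # download the trace file from the RAN element
        sftp.close()
    finally:
        ssh.close()
    return local_path

# Example call with placeholder values:
# collect_trace_file("10.0.0.5", "admin", "secret", "gnb-du-01", "/traces/trace_20240101.xml")
```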
Overall, the O-RAN SMO's trace management functions are critical for ensuring the reliability, availability, and security of O-RAN networks. By providing real-time trace analysis, long-term trace storage, and advanced analytics capabilities, the O-RAN SMO enables operators to quickly troubleshoot network issues, optimize network performance, and enhance network security.
This content originally posted on https://www.aarna.ml/
Navigating Cloud Adjacent Storage and Multi-Cloud Connectivity Trends
The cloud computing landscape is on the cusp of a seismic shift, with IDC forecasting that cloud services will soon surpass the remarkable milestone of $1 trillion USD. In this dynamic environment, it's no surprise that over 75% of enterprises are either already in the process of adopting or planning to embrace multi-cloud strategies starting in 2024.
As we look ahead, several trends are emerging as driving forces and new opportunities in this transformative wave. In this blog we’ll take a closer look at one of these key strategies — cloud adjacent storage.
Cloud Adjacent Storage at the Edge
The prediction that data creation is expected to reach 180 zettabytes by 2025 underscores the incredible growth in the volume of data being generated globally due to various factors such as 5G, FWA, big data, and AI.
This rapid increase poses challenges for traditional data storage and management approaches in the cloud, especially multi-cloud approaches that create complexity, security concerns, and governance questions. Another challenge is cost overruns, especially since egress costs can spike under changing usage conditions.
A solution to this can be found in distributed edge computing where data processing and storage are brought closer to the location where data is generated, rather than relying solely on centralized cloud data centers.
Critical Infrastructure and Workload Automation
Managing and optimizing multiple cloud environments is a daunting task that incurs additional OPEX, such as hiring staff skilled in multi-cloud, or training existing IT teams. These tasks slow down cloud adjacent storage adoption and optimization strategies.
Recognizing this need, Aarna.ml began the development of Aarna Edge Services (AES) in 2022. AES is a SaaS platform that simplifies management and orchestration of edge storage and multi-cloud connectivity infrastructure over the Equinix and Pure Storage platforms. It leverages policy-driven automation to simplify the deployment, configuration, and monitoring of cloud-native infrastructure and high-performance computing (HPC) stacks. This allows organizations to focus on their AI workloads without getting bogged down in infrastructure management.
Aarna Edge Services (AES)
AES automates the provisioning of multi-cloud AI workloads, ensuring they run efficiently across cloud-native technologies. This flexibility allows for the utilization of stateless CPU and GPU cloud spot instances on-demand, optimizing resource usage. This also simplifies the integration of off-the-shelf, mature cloud-based AI and big data solutions, enabling organizations to tap into advanced analytics, artificial intelligence, and big data tools without the complexities often associated with big data and AI application stack implementation. This of course leads to more data and more use of storage (a virtuous cycle in our case).
AES Beta testing has shown the ability to optimize enterprise cloud spending by more than 50%. This cost efficiency is achieved by intelligent resource allocation and dynamic utilization of cloud spot instances when available, reducing overall cloud expenses while maintaining high levels of reliability, performance, and security.
This content originally posted on https://www.aarna.ml/
O-RAN SMO Software Management
Software Management is a key function of the O-RAN Service Management and Orchestration (SMO) framework. O-RAN SMO provides a set of software management functions that enable network operators to manage the software lifecycle of different network elements in their O-RAN networks.
O-RAN software management functions are designed to support the O-RAN Alliance's open and standardized interfaces and protocols, ensuring that different vendors' equipment can be managed in a consistent and interoperable way. They also support automated software management workflows, enabling network operators to quickly and efficiently deploy and upgrade software across their networks.
O-RAN SMO software management capabilities include software download, provisioning, activation, deactivation, upgrade, and rollbacks for the network elements. This enables network operators to manage software versions and configurations across different network elements, ensuring that software updates are deployed in a controlled and secure manner.
Overall, the O-RAN SMO software management functions are critical for ensuring the reliability, availability, and security of O-RAN networks. By providing software download, provisioning, activation, deactivation, upgrade, and rollback capabilities, O-RAN SMO enables operators to quickly and efficiently manage the software lifecycle of their network elements, reducing network downtime, minimizing operational costs, and enhancing network agility.
Aarna Networks AMCOP contains a comprehensive O-RAN SMO – a cloud native application for orchestrating and managing O-RAN network functions – that allows network operators and vendors to manage multi-vendor RAN environments and choose best-of-breed network functions for validation and interoperability.
Aarna.ml Unveils AMCOP 3.4: Advancing Edge Orchestration
Aarna.ml today is announcing the release of Aarna.ml Multi Cluster Orchestration Platform (AMCOP) version 3.4, a pivotal milestone in advancing zero-touch edge orchestration. This release introduces a myriad of enhanced features, improvements, and additions, solidifying AMCOP's capabilities in managing complexity at scale.
Role-Based Access Control (RBAC)
RBAC emerges as a linchpin in security, regulating network access based on organizational roles within Service Management and Orchestration (SMO) in the O-RAN architecture. RBAC not only adds an extra layer of security but also efficiently distributes superuser capabilities across administrators through meticulous privilege management.
O1 Functions and NACM
In O-RAN deployments, the sensitivity of O1 functions necessitates adherence to zero-trust principles. The O1 interface enforces confidentiality, integrity, authenticity, and least-privilege access control through encrypted transport and the Network Configuration Access Control Model (NACM), thus ensuring secure network operations. This standards-based mechanism restricts user access to predefined NETCONF operations and content, integrating authentication and authorization seamlessly.
OAuth 2.0 for Access Management
OAuth takes the reins in generating authorization tokens, managing access for distinct roles within the system. This introduction of an authorization layer, separating the client's role from the resource owner's, ensures secure access to protected resources. Utilizing Access Tokens issued by an authorization server, OAuth adheres to industry standards, providing a robust mechanism for secure resource access.
Keycloak for Authentication and Authorization
Keycloak, a robust open-source identity and access management solution, stands as the AAA provider for Aarna SMO. Within Keycloak's administrative realms, the roles, such as 'system-admin,' 'fault-admin,' and 'performance-admin,' define permissions, ensuring secure authentication and authorization for contemporary web applications.
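In practice, a client obtains an access token from Keycloak's token endpoint and presents it as a bearer token on SMO API calls. A minimal sketch follows; the realm name, client credentials, and SMO URL are placeholders.

```python
import requests

KEYCLOAK = "https://keycloak.example.com"      # placeholder Keycloak address
REALM = "smo"                                  # placeholder realm
TOKEN_URL = f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token"

# OAuth 2.0 client-credentials grant: the client authenticates as itself,
# and the roles attached to it (e.g. 'fault-admin') end up in the token.
resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "fault-admin-client",      # placeholder client
        "client_secret": "change-me",           # placeholder secret
    },
    timeout=5,
)
access_token = resp.json()["access_token"]

# Call a protected SMO endpoint with the bearer token (URL is a placeholder).
api = requests.get(
    "https://smo.example.com/api/faults",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=5,
)
print(api.status_code)
```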
NETCONF Access Control Model (NACM)
NACM, a standardized approach, ensures robust access control mechanisms within the NETCONF Server. Adhering to industry standards outlined in RFC8341, NACM introduces predefined access control groups aligning with distinct NETCONF client roles, prioritizing compatibility, reliability, and adherence to established industry practices.
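For illustration, an RFC 8341 rule-list granting a 'fault-admin' group read-only access could be pushed to a NETCONF server roughly as follows; the group, rule names, and device address are placeholders rather than the exact AMCOP configuration.

```python
from ncclient import manager

# RFC 8341 (ietf-netconf-acm) payload: map the 'fault-admin' group to a
# read-only rule. Names and module scope here are illustrative placeholders.
NACM_CONFIG = """
<config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <nacm xmlns="urn:ietf:params:xml:ns:yang:ietf-netconf-acm">
    <groups>
      <group>
        <name>fault-admin</name>
        <user-name>fault-user</user-name>
      </group>
    </groups>
    <rule-list>
      <name>fault-admin-rules</name>
      <group>fault-admin</group>
      <rule>
        <name>read-only</name>
        <module-name>*</module-name>
        <access-operations>read</access-operations>
        <action>permit</action>
      </rule>
    </rule-list>
  </nacm>
</config>
"""

with manager.connect(host="192.0.2.10", port=830, username="admin",
                     password="admin", hostkey_verify=False) as m:
    m.edit_config(target="running", config=NACM_CONFIG)
```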
This release of AMCOP meets the O-RAN-specified RBAC/security requirements per O-RAN.WG11.Security-Requirements-Specification.O-R003-v06.00 and the M-Plane O-RU device requirements per O-RAN.WG4.MP.0-R003-v12.00. The solution architecture implements RBAC with users, roles, domains, and policies.
In conclusion, AMCOP v3.4 not only addresses security requirements but also enhances orchestration capabilities. The adoption of industry standards and the meticulous integration of access control mechanisms underscore Aarna.ml's commitment to providing users with a secure, interoperable, and globally accepted platform for network orchestration. For more details on device-level access requirements, refer to the O-RAN specifications - O-RAN.WG4.MP.0-R003-v12.00.
This release reaffirms Aarna.ml's dedication to innovation, security, and the seamless orchestration of multiple network elements, further solidifying its position as a leader in the evolving landscape of network management and orchestration.
Learn more about AMCOP and request a free trial.
This content originally posted on https://www.aarna.ml/
Empowering Edge Computing: The Imperative Role of Automation Solution
We are often asked why automation or orchestration is needed for Edge computing in general, since Edge computing is not a new concept. In this blog, you'll learn about the role of an orchestrator in unleashing the true potential of Edge computing environments.
Recapping the uniqueness of Edge environments, a blog 'Why Edge Orchestration is Different' by Amar Kapadia highlights the following attributes:
Scale
Dynamic in nature
Heterogeneity
Dependency between workloads and infrastructure
We will see the challenges in accomplishing them, and it will be obvious why automation plays a critical role in this process. Also, as explained in the previous blog, the Edge environments include both Infrastructure and the Applications that run on them (Physical or Virtual/Cloud-native). So all the above factors need to be considered for both of them in case of Edge computing.

The scale of Edge environments clearly prohibits manual mode of operating them since this will involve bringing up each environment, with its own set of initial (day-0) configurations, independent of each other. The problem is compounded when these environments need to be managed on an ongoing basis (day-N).

This also brings up the challenge of the dynamic nature of these environments, where the configurations can keep changing based on the business needs of the users. Each such change will result in potentially tearing down the previous environment and bringing up another one, possibly with a different set of configurations. Some of these environments may take days to bring up, with expertise from different domains (which is another challenge), and any change will mean a few more days to bring it up again, even for a potentially minor change.
Another challenge with Edge environments is their heterogeneous nature, unlike the public cloud, which is generally homogeneous. This means that multiple tools, possibly from different vendors, need to be used. These tools could be proprietary or standard tools such as Terraform, Ansible, Crossplane, and so on. Each vendor of the Infrastructure or the applications/network functions could be using a different tool, and even where they use standard tools, there may be multiple versions of the artifacts (e.g., Terraform plans, Ansible scripts) that need to be dealt with.

The workloads on the edge may need to talk to or integrate with applications in a central location or cloud. This requires setting up connectivity between the edge and other sites as desired, and the orchestrator should be able to provision this in an automated manner. Lastly, as we saw in the previous blog, there may be dependencies between the Infrastructure and the workloads, as well as between various workloads (e.g., Network Functions such as 5G that are used by other applications). This makes it extremely difficult to bring them up manually or with home-grown solutions.
All these challenges mean that unless the Edge environment is extremely small and contained, it will need a sophisticated automation framework or an orchestrator. The only scalable way to accomplish this is to specify the topology of the environment as an Intent, which is rolled out by the Orchestrator. In addition, the Orchestrator should constantly monitor the deployment and make necessary adjustments (constant reconciliation) to the deployment and management of the topologies in the Edge environment. When a change is required, a new intent (configuration) is specified, which should be rolled out seamlessly. The orchestrator should also be able to work with various tools such as Terraform/OpenTofu, Ansible, and so on, as well as provide ways to integrate with proprietary vendor solutions.
At Aarna.ml, we offer AMCOP, an open source, zero-touch, intent-based orchestrator (also offered as a SaaS, AES), for lifecycle management, real-time policy, and closed loop automation for edge and 5G services. If you’d like to discuss your orchestration needs, please contact us for a free consultation.
This content originally posted on https://www.aarna.ml/
Automating Cloud Infrastructure and Network Functions with Nephio
At the recent IEEE Workshop on Network Automation, I had the opportunity to share insights on the advancements in automating cloud infrastructure and network functions using Nephio. This blog post aims to encapsulate the essence of that presentation, delving into the transformative potential of Nephio in the telecommunications industry.
Recent Trends in Telco Network Automation - Move to Cloud Native
Telco Cloud Moving to IaaS
Scale
Telecommunications networks are evolving rapidly, driven by the increasing demand for faster, more reliable connectivity and the emergence of technologies like 5G and edge computing. Telco network automation plays a pivotal role in this evolution, enabling operators to streamline operations, enhance efficiency, and deliver superior services to end-users.
Challenges in Traditional Approaches:
Traditionally, telcos have relied on manual configurations and management of network infrastructure, leading to inefficiencies, human errors, and slow response times. The complexity of modern networks exacerbates these challenges, necessitating a paradigm shift towards automation to meet the demands of today's digital landscape.
Enter Nephio:
Nephio emerges as a game-changer in the realm of telco network automation, offering a comprehensive platform equipped with advanced capabilities to automate cloud infrastructure and network functions seamlessly. Nephio empowers operators to achieve unparalleled levels of agility, scalability, and performance in their networks.
In this talk, we took a deep dive into Nephio concepts and discussed the following in detail:
Config Injection
Package Specialization
Condition Choreography
Then we talked about some industry-relevant use cases, such as orchestrating bare-metal servers and deploying different kinds of workloads on top of them.
At the end we discussed the next steps and how we can use the power of AI along with Nephio.
The Kubernetes-based Nephio framework is ideal for using GenAI for human-machine interaction due to its declarative intent. We can use AI for:
Prompts to declare intent (instead of YAML files), as sketched after this list
Prompts to interact with logs/metrics (instead of looking at dashboards)
Prompts to get solutions to system anomalies
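As a toy illustration of the first item above (prompts instead of hand-written YAML), an LLM can be asked to emit a KRM manifest that a Nephio-style pipeline would then validate and apply. The model name and prompt are placeholders, and this is not a shipping Nephio feature.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM endpoint could be substituted

prompt = (
    "Produce a Kubernetes manifest (YAML only, no commentary) that declares "
    "a UPF network function deployment with 2 replicas in namespace 'edge-site-1'."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",            # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
manifest_yaml = resp.choices[0].message.content

# In a Nephio-style flow the generated manifest would be dropped into a package,
# reviewed and validated, and only then reconciled onto the target cluster.
print(manifest_yaml)
```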
In conclusion, the presentation on "Automating Cloud Infrastructure and Network Functions with Nephio" at IEEE Telco Network Automation underscored the significance of embracing innovative technologies like Nephio to navigate the complexities of modern telecommunications networks effectively. As the industry continues to evolve, Nephio stands at the forefront of driving digital transformation and empowering telcos to thrive in an increasingly competitive landscape.
This content originally posted on https://www.aarna.ml/
Dynamic AI-RAN Orchestration for NVIDIA Accelerated Computing Infrastructure
NVIDIA accelerated computing can significantly accelerate many different types of workloads. In this blog, I will explain how the same NVIDIA GPU computing infrastructure (all the way to fractional GPU) can be shared for different workloads, such as RAN (Radio Access Network) and AI/ML workloads, in a fully automated manner. This is the foundational requirement for enabling AI-RAN, a technology that is being embraced widely by the telecommunications industry, to fuse AI and RAN on a common infrastructure as the next step towards AI-native 5G and 6G networks. I will also show a practical use case that was demonstrated to a Tier-1 telco.
First, some background before diving into the details: the infrastructure requirements for a specific type of workload (e.g., RAN or AI/ML) will vary dynamically, and the workloads cannot be statically assigned to the resources. This is particularly aggravated by the fact that RAN utilization can vary wildly, with the average being between 20% and 30%. The unused cycles can be dynamically allocated to other workloads. The challenges in sharing the same GPU pool across multiple workloads can be summarized below:
Infrastructure requirements may be different for RAN/5G & AI workloads
Dependency on networking, such as switch re-configuration, IP/MAC address reassignment, etc.
Full isolation at infra level for security and performance SLAs between workloads
Multi-Instance GPU (MIG) sizing - Fixed partitions or dynamic configuration of MIG
Additional workflows that may be required, such as workload migration/scaling
This means that there is a need for an intelligent management entity, one capable of orchestrating both infrastructure and different types of workloads, and of switching the workloads dynamically. This is accomplished using AMCOP (Aarna Networks Multicluster Orchestration Platform), Aarna's orchestration platform for infrastructure, workloads, and applications.
The end-to-end scenario works as follows:
Create tenants for different workloads – RAN & AI. There may be multiple tenants for AI workloads if multiple user AI jobs are scheduled dynamically
Allocate required resources (servers or GPUs/fractional GPUs) for each tenant
Create network and storage isolation between the workloads
Provide an observability dashboard for the admin to monitor the GPU utilization & other KPIs
Deploy RAN components i.e. DU, CU, and NVIDIA AI Aerial (with Day-0 configuration) from RAN tenant
Deploy AI workloads (such as an NVIDIA AI Enterprise serverless API or NIM microservice) from AI tenant(s)
Monitor RAN traffic metrics
If the RAN traffic load goes below the threshold, consolidate RAN workload to fewer servers/GPUs/fractional GPUs
Deploy (or scale out) the AI workload (e.g. LLM Inferencing workload), after performing isolation
If the RAN traffic load exceeds the threshold, spin down (or scale in) AI workload, and subsequently, bring up RAN workload
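A heavily simplified sketch of the switching logic in steps 8 through 10 is shown below, using the Kubernetes Python client to scale an AI inferencing Deployment based on a RAN-load metric. The metric source, names, and thresholds are placeholders, not AMCOP's implementation.

```python
import time
from typing import Optional
from kubernetes import client, config

def get_ran_load() -> Optional[float]:
    """Placeholder: return current RAN utilization (0.0-1.0) from the observability stack."""
    ...

def scale_ai_workload(apps: client.AppsV1Api, replicas: int) -> None:
    # Scale the (placeholder) LLM inferencing Deployment in the AI tenant's namespace.
    apps.patch_namespaced_deployment_scale(
        name="llm-inference",
        namespace="ai-tenant",
        body={"spec": {"replicas": replicas}},
    )

def control_loop(high: float = 0.7, low: float = 0.3, period_s: int = 60) -> None:
    config.load_kube_config()            # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    while True:
        load = get_ran_load() or 0.0
        if load < low:
            scale_ai_workload(apps, replicas=3)   # RAN is quiet: give GPUs to AI jobs
        elif load > high:
            scale_ai_workload(apps, replicas=0)   # RAN needs the capacity back
        time.sleep(period_s)
```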
The demo showcasing a subset of this functionality using a single NVIDIA GH200 Grace Hopper Superchip is described below. It uses a single GPU, divided into fractional GPUs in a 3+4 MIG configuration, which are allocated to different workloads.
The following functionality can be seen in the demo, as part of the end-to-end flow.
Open the dashboard and show the RAN KPIs on the orchestrator GUI. Also, show the GPU and MIG metrics.
Show all the RAN KPIs and GPU + MIG metrics for the past duration (hours / days)
Show the updated RAN & GPU / MIG utilizations + AI metrics
Initiate the AI load/performance testing and then show the AI metrics and GPU/MIG utilizations on the dashboard
Query the RAG model (from a UE) from a custom GUI and show the response.
Next Steps:
Over the next few years, we predict that every RAN site will run on NVIDIA GPU-accelerated infrastructure. Contact us for help on getting started with sharing NVIDIA GPU compute resources within your infrastructure. Aarna.ml’s AI-Cloud Management Software (also known as AMCOP) orchestrates and manages GPU-accelerated environments, including support for NVIDIA AI Enterprise software and NVIDIA NIM microservices. Working closely with NVIDIA, we have deep expertise with the NVIDIA Grace Hopper platform, as well as NVIDIA Triton Inference Server and NVIDIA NeMo software.
This content originally posted on https://www.aarna.ml/
How aarna.ml GPU CMS Addresses IndiaAI Requirements
India is on the cusp of a transformative AI revolution, driven by the ambitious IndiaAI initiative. This nationwide program aims to democratize access to cutting-edge AI services by building a scalable, high-performance AI Cloud to support academia, startups, government agencies, and research bodies. This AI Cloud will need to deliver on-demand AI compute, multi-tier networking, scalable storage, and end-to-end AI platform capabilities to a diverse user base with varying needs and technical sophistication.
At the heart of this transformation lies the management layer – the orchestration engine that ensures smooth provisioning, operational excellence, SLA enforcement, and seamless platform access. This is where aarna.ml GPU Cloud Management Software (GPU CMS) plays a crucial role. By enabling dynamic GPUaaS (GPU-as-a-Service), aarna.ml GPU CMS allows providers to manage multi-tenant GPU clouds with full automation, operational efficiency, and built-in compliance with IndiaAI requirements.
Key IndiaAI Requirements and aarna.ml GPU CMS Coverage
The IndiaAI tender defines a comprehensive set of requirements for delivering AI services on cloud. While the physical infrastructure—hardware, storage, and basic network layers—will come from hardware partners, aarna.ml GPU CMS focuses on the management, automation, and operational control layers. These are the areas where our platform directly aligns with IndiaAI’s expectations.
Service Provisioning
aarna.ml GPU CMS automates the provisioning of GPU resources across bare-metal servers, virtual machines, and Kubernetes clusters. It supports self-service onboarding for tenants, allowing them to request and deploy compute instances through an intuitive portal or via APIs. This dynamic provisioning capability ensures optimal utilization of resources, avoiding underused static allocations.
Operational Management
The platform delivers end-to-end operational management, starting from infrastructure discovery and topology validation to real-time performance monitoring and automated issue resolution. Every step of the lifecycle—from tenant onboarding to resource allocation to decommissioning—is automated, ensuring that GPU resources are always used efficiently.
SLA Management
SLA enforcement is a critical part of the IndiaAI framework. aarna.ml GPU CMS continuously tracks service uptime, performance metrics, and event logs to ensure compliance with pre-defined SLAs. If an issue arises—such as a failed node, misconfiguration, or performance degradation—the self-healing mechanisms automatically trigger corrective actions, ensuring high availability with minimal manual intervention.
AI Platform Integration
IndiaAI expects the AI Cloud to offer end-to-end AI platforms with tools for model training, job submission, and model serving. aarna.ml GPU CMS integrates seamlessly with MLOps and LLMOps tools, enabling users to run AI workloads directly on provisioned infrastructure with full support for NVIDIA GPU Operator, CUDA environments, and NVIDIA AI Enterprise (NVAIE) software stack. Support for Kubernetes clusters, job schedulers like SLURM and Run:AI, and integration with tools like Jupyter and PyTorch make it easy to transition from development to production.
Tenant Isolation and Multi-Tenancy
A core requirement of IndiaAI is ensuring strict tenant isolation across compute, network, and storage layers. aarna.ml GPU CMS fully supports multi-tenancy, providing each tenant with isolated infrastructure resources, ensuring data privacy, performance consistency, and security. Network isolation (including InfiniBand partitioning), per-tenant storage mounts, and independent GPU allocation guarantee that each tenant’s environment operates independently.
Admin Portal
The Admin Portal consolidates all these capabilities into a single pane of glass, ensuring that infrastructure operators have centralized control while providing tenants with transparent self-service capabilities.
Conclusion
The IndiaAI initiative requires a sophisticated orchestration platform to manage the complexities of multi-tenant GPU cloud environments. aarna.ml GPU CMS delivers exactly that—a robust, future-proof solution that combines dynamic provisioning, automated operations, self-healing infrastructure, and comprehensive SLA enforcement.
By seamlessly integrating with underlying hardware, networks, and AI platforms, aarna.ml GPU CMS empowers GPUaaS providers to meet the ambitious goals of IndiaAI, ensuring that AI compute resources are efficiently delivered to the researchers, startups, and government bodies driving India’s AI innovation.
This content originally posted on https://www.aarna.ml/
Simplified Billing Management for AI Cloud Services with aarna.ml GPU CMS and Monetize360
Managing billing in a multi-tenant AI cloud environment can be complex — especially when handling diverse customers, varying resource usage patterns, and multiple service plans. With aarna.ml GPU Cloud Management Software (GPU CMS), this process is simplified through seamless integration with Monetize360, offering a single pane of glass experience for both cloud providers and tenants.
Integrated Billing
With aarna.ml GPU CMS, AI cloud providers and their customers do not need to switch between multiple portals to manage infrastructure and view billing information. Instead, all billing-related functions from Monetize360 are directly accessible within the aarna.ml GPU CMS interface, ensuring a smooth, uninterrupted user experience.
From the initial catalog pricing definition to tenant-level resource consumption tracking, invoice generation, and invoice visibility for tenant users, the entire billing lifecycle is integrated into the same UI that manages the cloud infrastructure. This eliminates the confusion caused by fragmented workflows and makes billing fully transparent.
Multi-Level User Experience
The billing integration supports different user personas, ensuring each user type gets the right level of visibility and control.
NCP Admin (Cloud Provider Admin) defines the pricing catalog, creates tenants, and configures billing preferences.
Tenant Admin manages tenant-specific resources and can view invoices for specific billing periods. They can download invoices directly from the aarna.ml GPU CMS portal, ensuring full visibility into usage and costs without needing to access Monetize360 separately.
Tenant Users view their allocated resources and usage metrics.
Automated Usage Tracking and Invoice Generation
The process starts when the NCP Admin sets up the service catalog, defining available AI compute instances and their hourly rates. When tenants allocate resources, all usage metrics are automatically collected and passed to Monetize360 through the integrated pipeline.
At any time, the NCP Admin can trigger invoice generation for the desired billing period. The system queries all resource usage data, generates the invoices in Monetize360, and makes them visible within aarna.ml GPU CMS for tenant users. The downloadable invoices follow a standard format with full breakdowns of allocated resources, rates, and total charges.
Real-Time Transparency for Tenants
Tenant users have direct access to their billing information without needing to rely on the NCP Admin or manually request invoices. Through the same portal where they manage their AI workloads, they can:
View current and historical invoices.
Check detailed usage and charges.
Download invoices for offline review or accounting purposes.
This transparent, self-service billing experience not only simplifies financial operations but also enhances trust between cloud providers and their customers.
This content originally published on https://www.aarna.ml/
Seamless Integration of aarna.ml GPU CMS with DDN EXAScaler for High-Performance AI Workloads
Managing external storage for GPU-accelerated AI workloads can be complex—especially when ensuring that storage volumes are provisioned correctly, isolated per tenant, and automatically mounted to the right compute nodes. With aarna.ml GPU Cloud Management Software (GPU CMS), this entire process is streamlined through seamless integration with DDN EXAScaler.
End-to-End Automation with No Manual Steps
With aarna.ml GPU CMS, end users don’t need to manually log into multiple systems, configure storage mounts, or worry about compatibility between compute and storage. The DDN EXAScaler integration is fully automated—allowing users to simply specify the desired storage size and the compute node where it should be mounted.
Everything else—from tenant-aware provisioning, storage policy enforcement, and network isolation to automatic mount point creation—is handled seamlessly by aarna.ml GPU CMS.
Simple and Efficient Flow
The process starts with the NCP admin (cloud provider admin) importing the entire GPU infrastructure (compute, storage, E⇔W network, N⇔S network) into the software and setting up a new tenant. Once the tenant is created, the tenant user can allocate a GPU bare-metal or VM instance and request external storage from DDN.
The tenant simply provides the desired storage size and the compute node where the storage should be mounted. Once these inputs are provided, aarna.ml GPU CMS handles all interactions with DDN, including configuring storage volumes, assigning tenant-specific quotas, applying network isolation, and creating the mount point on the target node.
This zero-touch integration eliminates any need for the tenant to interact with the DDN portal directly.
Real-Time Validation Across Systems
To ensure transparency and operational assurance, the NCP admin or tenant admin can view all configured storage volumes directly within aarna.ml GPU CMS. For additional verification, they can also cross-check the automatically created tenants, networks, policies, and mount points directly in the DDN admin portal.
All configurations are performed via APIs with no manual intervention.
Full Tenant Experience
Once the storage is provisioned, the tenant user can log directly into their allocated GPU compute node and immediately access the mounted DDN EXAScaler storage volume. Whether for large-scale AI training data, model checkpoints, or inference, this automated mount ensures data is available where and when the user needs it.
Key Benefits
The aarna.ml GPU CMS provides the following key benefits:
API-Driven Consistency: All configurations—from mount points to network overlays—are performed through automated APIs, ensuring accuracy and compliance with tenant policies.
This post originally published on https://www.aarna.ml
Seamless External Storage Integration with VAST Using aarna.ml GPU Cloud Management Software
Managing external storage for GPU-accelerated AI workloads can be complex—especially when ensuring that storage volumes are provisioned correctly, isolated per tenant, and automatically mounted to the right compute nodes. With aarna.ml GPU Cloud Management Software (GPU CMS), this entire process is streamlined through seamless integration with VAST external storage systems.
End-to-End Automation with No Manual Steps
With aarna.ml GPU CMS, end users don’t need to manually log into multiple systems, configure storage mounts, or worry about compatibility between compute and storage. The VAST integration is fully automated—allowing users to simply specify:
The desired storage size.
The bare metal node where the storage should be mounted.
Everything else—from tenant-aware provisioning to storage policy enforcement and automatic mount point creation—is handled seamlessly by aarna.ml GPU CMS in the background.
Simple and Efficient Flow
The process starts with the NCP admin (cloud provider admin) importing the compute node into the system and setting up a new tenant. Once the tenant is onboarded, the tenant user can allocate a GPU bare-metal instance and request external storage from VAST.
The tenant simply provides:
The desired storage size.
The specific compute node where the storage should be mounted.
Once these inputs are provided, aarna.ml GPU CMS handles all interactions with VAST, including:
Configuring storage volumes.
Assigning tenant-specific quotas.
Creating the mount point.
Ensuring the mount point is immediately available on the compute node.
This zero-touch integration eliminates any need for the tenant to interact with the VAST portal directly.
Real-Time Validation Across Systems
To ensure transparency and operational assurance, the NCP admin or tenant admin can view all configured storage volumes directly within aarna.ml GPU CMS. For additional verification, they can also cross-check the automatically created tenants, networks, policies, and mount points directly in the VAST admin portal.
This two-way visibility ensures that:
The tenant’s allocated storage matches the requested size.
The network isolation policies (north-south overlays) are correctly applied.
All configurations are performed via APIs with no manual intervention.
Full Tenant Experience
Once the storage is provisioned, the tenant user can log directly into their allocated GPU compute node and immediately access the mounted VAST storage volume. Whether for large-scale AI training data or model checkpoints, this automated mount ensures data is available where and when the user needs it.
To further validate, the tenant can create and save files to the external storage—confirming that the VAST integration is complete and the storage is fully accessible from their compute instance.
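That final check can be as simple as writing a file to the mount and reading it back; a tiny sketch (the mount path is a placeholder) follows.

```python
import shutil
from pathlib import Path

mount = Path("/mnt/vast-tenant01")          # placeholder mount point created by GPU CMS
test_file = mount / "storage_check.txt"

test_file.write_text("VAST mount is writable from this GPU node\n")
print(test_file.read_text())                # read back to confirm end-to-end access
print("free bytes:", shutil.disk_usage(mount).free)
```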
Key Benefits
End-to-End Automation: No manual steps—just specify size and compute node, and aarna.ml GPU CMS handles everything else.
Single Pane of Glass: Both compute and storage provisioning are managed from a single interface.
Full Tenant Isolation: Each tenant’s storage is isolated with tenant-specific quotas and network policies.
Real-Time Observability: Both admins and tenants can view and validate storage allocations directly within the aarna.ml GPU CMS portal.
API-Driven Consistency: All configurations—from mount points to network overlays—are performed through automated APIs, ensuring accuracy and compliance with tenant policies.
This content originally published on https://www.aarna.ml/
Automated InfiniBand Network Isolation with aarna.ml GPU Cloud Management Software
Managing network isolation in AI cloud environments is critical for ensuring tenant data security, performance consistency, and compliance. This becomes even more important in high-performance AI clusters that rely on InfiniBand fabric for ultra-low latency communication between GPU nodes.
With aarna.ml GPU Cloud Management Software (GPU CMS), cloud providers can achieve complete InfiniBand network isolation for every tenant—all through an automated, policy-driven process. This ensures each tenant’s data and traffic are fully segregated, with no manual intervention required.
Fully Automated InfiniBand Isolation
The aarna.ml GPU CMS achieves end-to-end isolation on InfiniBand fabrics by integrating seamlessly with NVIDIA UFM (Unified Fabric Manager).
This policy-based automation eliminates manual errors, guarantees secure isolation across the entire InfiniBand fabric, and ensures each tenant receives a fully segregated high-performance network.
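Under the hood, InfiniBand tenancy is typically expressed as partition keys (PKeys) on the fabric. The sketch below illustrates driving that through a UFM-style REST call; the endpoint path, payload fields, and GUIDs are assumptions for illustration only, not the exact UFM or GPU CMS interface.

```python
import requests

UFM = "https://ufm.example.com"                    # placeholder UFM address
AUTH = ("admin", "change-me")                      # placeholder credentials

def assign_tenant_partition(pkey: str, port_guids: list[str]) -> None:
    """Bind a tenant's HCA ports to a dedicated PKey so its IB traffic is isolated."""
    payload = {
        "pkey": pkey,                              # e.g. "0x7001" for tenant 1 (assumed format)
        "guids": port_guids,                       # the tenant's HCA port GUIDs
        "membership": "full",
        "index0": True,
    }
    # Endpoint path is an assumption based on UFM's REST resource model.
    r = requests.post(f"{UFM}/ufmRest/resources/pkeys", json=payload,
                      auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()

# assign_tenant_partition("0x7001", ["0xb8599f0300fa1234", "0xb8599f0300fa5678"])
```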
Seamless Visibility and Control
All discovery, tenant creation, and isolation enforcement actions are fully visible within the aarna.ml GPU CMS Admin Portal. Both NCP admins (cloud provider admins) and tenant admins can track these actions from a single view.
This centralized visibility ensures operational transparency and gives cloud providers the tools they need to enforce multi-tenant isolation at scale.
Key Benefits of InfiniBand Integration
Complete Network Isolation Across Ethernet & InfiniBand
While this blog focuses on InfiniBand isolation, aarna.ml GPU CMS also supports Ethernet network isolation, including full integration with NVIDIA Spectrum-X switches. Whether using Ethernet, InfiniBand, or a combination of both, aarna.ml GPU CMS ensures complete network separation between tenants—across both the control plane and data plane.
This content originally published on https://www.aarna.ml/
Why GPU PaaS Is Incomplete Without Infrastructure Orchestration and Tenant Isolation
GPU Platform-as-a-Service (PaaS) is gaining popularity as a way to simplify AI workload execution — offering users a friendly interface to submit training, fine-tuning, and inferencing jobs. But under the hood, many GPU PaaS solutions lack deep integration with infrastructure orchestration, making them inadequate for secure, scalable multi-tenancy.
If you’re a Neocloud, sovereign GPU cloud, or an enterprise private GPU cloud with strict compliance requirements, you are probably looking at offering job scheduling of Model-as-a-Service to your tenants/users. An easy approach is to have a global Kubernetes cluster that is shared across multiple tenants. The problem with this approach is poor security, as the underlying OS kernel, CPU, GPU, network, and storage resources are shared by all users without any isolation. Case in point: in September 2024, Wiz discovered a critical GPU container and Kubernetes vulnerability that affected over 35% of environments. Thus, doing just Kubernetes namespace or vCluster isolation is not safe.
You need to provision bare metal, configure network and fabric isolation, allocate high-performance storage, and enforce tenant-level security boundaries — all automated, dynamic, and policy-driven.
In short: PaaS is not enough. True GPUaaS begins with infrastructure orchestration.
The Pitfall of PaaS-Only GPU Platforms
Many AI platforms stop at providing:
A web UI for job submission
A catalog of AI/ML frameworks or models
Basic GPU scheduling on Kubernetes (see the sketch below)
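For context, the "basic GPU scheduling on Kubernetes" that such platforms rely on amounts to a namespaced pod requesting a GPU resource, as in this sketch (image and names are placeholders). This is exactly the shared-kernel, namespace-only model the rest of this post argues is insufficient on its own.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="tenant-a"),  # namespace-only "isolation"
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="nvcr.io/nvidia/pytorch:24.01-py3",        # placeholder image
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"},              # schedule onto a GPU node
            ),
        )],
    ),
)
core.create_namespaced_pod(namespace="tenant-a", body=pod)
```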
What they don’t offer:
Control over how GPU nodes are provisioned (bare metal vs. VM)
Enforcement of north-south and east-west isolation per tenant
Configuration and Management of Infiniband, RoCE or Spectrum-X fabric
Lifecycle Management and Isolation of External Parallel Storage like DDN, VAST, or WEKA
Per-Tenant Quota, Observability, RBAC, and Policy Governance
Without these, your GPU PaaS is just a thin UI on top of a complex, insecure, and hard-to-scale backend.
What Full-Stack Orchestration Looks Like
To build a robust AI cloud platform — whether sovereign, Neocloud, or enterprise — the orchestration layer must go deeper.
How aarna.ml GPU CMS Solves This Problem
aarna.ml GPU CMS is built from the ground up to be infrastructure-aware and multi-tenant-native. It includes all the PaaS features you would expect, but goes beyond PaaS to offer:
BMaaS and VMaaS orchestration: Automated provisioning of GPU bare metal or VM pools for different tenants.
Tenant-level network isolation: Support for VXLAN, VRF, and fabric segmentation across Infiniband, Ethernet, and Spectrum-X.
Storage orchestration: Seamless integration with DDN, VAST, WEKA with mount point creation and tenant quota enforcement.
Full-stack observability: Usage stats, logs, and billing metrics per tenant, per GPU, per model.
All of this is wrapped with a PaaS layer that supports Ray, SLURM, KAI, Run:AI, and more, giving users flexibility while keeping cloud providers in control of their infrastructure and policies.
Why This Matters for AI Cloud Providers
If you're offering GPUaaS or PaaS without infrastructure orchestration:
You're exposing tenants to noisy neighbors or shared vulnerabilities
You're missing critical capabilities like multi-region scaling or LLM isolation
You’ll be unable to meet compliance, governance, and SemiAnalysis ClusterMax1 grade maturity
With aarna.ml GPU CMS, you deliver not just a PaaS, but a complete, secure, and sovereign-ready GPU cloud platform.
Conclusion
GPU PaaS needs to be a complete stack with IaaS — it’s not just a model serving interface!
To deliver scalable, secure, multi-tenant AI services, your GPU PaaS stack must be expanded to a full GPU cloud management software stack to include automated provisioning of compute, network, and storage, along with tenant-aware policy and observability controls.
Only then is your GPU PaaS truly production-grade.
Only then are you ready for sovereign, enterprise, and commercial AI cloud success.
To see a live demo or for a free trial, contact aarna.ml
This post originally posted on https://www.aarna.ml/