#vLLM
Explore tagged Tumblr posts
govindhtech · 6 months ago
Text
How To Use Llama 3.1 405B FP16 LLM On Google Kubernetes Engine
How to set up and use large open models for multi-host generative AI on GKE
Access to open models is more important than ever for developers as generative AI grows rapidly thanks to advances in LLMs (Large Language Models). Open models are pre-trained foundational LLMs that are publicly available. Data scientists, machine learning engineers, and application developers already have easy access to open models through platforms like Hugging Face, Kaggle, and Google Cloud’s Vertex AI.
How to use Llama 3.1 405B
Today Google is announcing the ability to deploy and run open models such as the Llama 3.1 405B FP16 LLM on GKE (Google Kubernetes Engine), since some of these models demand robust infrastructure and deployment capabilities. With 405 billion parameters, Llama 3.1, released by Meta, shows notable gains in general knowledge, reasoning skills, and coding ability. To store and compute 405 billion parameters at FP16 (16-bit floating point) precision, the model needs more than 750 GB of GPU memory for inference. The GKE approach discussed in this article reduces the difficulty of deploying and serving such large models.
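As a rough check on that figure, 405 billion parameters at 2 bytes per FP16 value comes to roughly 810 GB (about 754 GiB) for the weights alone, before the KV cache and activations are counted.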
Customer Experience
As a Google Cloud customer, you can find the Llama 3.1 LLM by selecting the Llama 3.1 model tile in Vertex AI Model Garden.
Once you click the deploy button, you can choose the Llama 3.1 405B FP16 model and select GKE. (Image credit: Google Cloud)
The automatically generated Kubernetes yaml and comprehensive deployment and serving instructions for Llama 3.1 405B FP16 are available on this page.
Multi-host deployment and serving
The Llama 3.1 405B FP16 LLM poses significant deployment and serving challenges, demanding over 750 GB of GPU memory. The total memory requirement is influenced by a number of factors, including the memory used by the model weights, support for longer sequence lengths, and KV (Key-Value) cache storage. A3 virtual machines, currently the most powerful GPU option on Google Cloud, are made up of eight NVIDIA H100 GPUs with 80 GB of HBM (High-Bandwidth Memory) each. The only practical way to serve LLMs such as the FP16 Llama 3.1 405B model is to deploy and serve them across several hosts. To deploy on GKE, Google employs LeaderWorkerSet with Ray and vLLM.
LeaderWorkerSet
LeaderWorkerSet (LWS) is a deployment API created specifically to meet the workload demands of multi-host inference. It makes it easier to shard and run a model across numerous devices on numerous nodes. Built as a Kubernetes deployment API, LWS is accelerator- and cloud-agnostic and works with both GPUs and TPUs. As shown here, LWS uses the upstream StatefulSet API as its core building block.
Under the LWS architecture, a collection of pods is managed as a single unit. Every pod in this group is given a distinct index between 0 and n-1, with the pod at index 0 identified as the group leader. Every pod in the group is created simultaneously and shares the same lifecycle. At the group level, LWS makes rollouts and rolling upgrades easier. For rolling updates, scaling, and mapping to a particular topology for placement, each group is treated as a single unit.
Each group’s upgrade is carried out as a single, cohesive unit, guaranteeing that every pod in the group is updated at the same time. Topology-aware placement is optional; when it is used, all pods in the same group are co-located in the same topology. The group is also handled as a single entity when addressing failures, with optional all-or-nothing restart support: when enabled, if one pod in the group fails or one container within any of the pods is restarted, all of the pods in the group are recreated.
In the LWS framework, a unit consisting of a single leader and a set of workers is referred to as a replica. LWS supports two templates: one for the leader and one for the workers. By offering a scale endpoint for HPA, LWS makes it possible to dynamically scale the number of replicas.
Deploying multiple hosts using vLLM and LWS
vLLM is a well-known open-source model server that uses pipeline and tensor parallelism to provide multi-node, multi-GPU inference. It implements distributed tensor parallelism using Megatron-LM’s tensor-parallel technique, and it manages the distributed runtime for pipeline parallelism with Ray for multi-node inference.
Tensor parallelism divides the model horizontally across several GPUs, so the tensor parallel size equals the number of GPUs in each node. It is crucial to remember that this method requires a fast interconnect between the GPUs.
Pipeline parallelism, by contrast, divides the model vertically, layer by layer, and does not require constant high-speed connectivity between GPUs. The pipeline parallel size usually equals the number of nodes used for multi-host serving.
Several parallelism techniques must be combined to serve the complete Llama 3.1 405B FP16 model. To meet the model’s 750 GB memory requirement, two A3 nodes with eight H100 GPUs each provide a combined 1,280 GB of GPU memory. Along with supporting long context lengths, this setup supplies the buffer memory required for the key-value (KV) cache. For this LWS deployment, the pipeline parallel size is set to two and the tensor parallel size is set to eight.
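To make those settings concrete, here is a hedged sketch of launching vLLM with this topology through its Python API. The model ID is an assumption, a Ray cluster spanning both A3 nodes (as set up by the LWS leader and workers) is assumed, and depending on the vLLM version, pipeline parallelism may only be available through the OpenAI-compatible server entrypoint rather than the offline LLM class.

# Sketch: serving Llama 3.1 405B FP16 across 2 nodes x 8 H100 GPUs (model ID and flags are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed Hugging Face model ID
    tensor_parallel_size=8,               # shard each layer across the 8 GPUs in a node
    pipeline_parallel_size=2,             # split the layer stack across the 2 nodes
    distributed_executor_backend="ray",   # Ray coordinates the multi-node runtime
)

outputs = llm.generate(
    ["Explain multi-host LLM serving in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)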
In brief
In this blog we discussed how LWS provides the features needed for multi-host serving. This method maximizes the price-performance ratio and can also be used with models that have a smaller memory footprint, such as Llama 3.1 405B FP8, on more affordable accelerators. Check out its GitHub repository to learn more and contribute directly to LWS, which is open source and has a vibrant community.
As Google Cloud helps customers adopt generative AI workloads, you can visit Vertex AI Model Garden to deploy and serve open models via managed Vertex AI backends or DIY (Do It Yourself) GKE clusters. Multi-host deployment and serving is one example of how it aims to provide a seamless customer experience.
Read more on Govindhtech.com
2 notes · View notes
mysocial8onetech · 2 years ago
Text
vLLM: an open-source library that speeds up the inference and serving of large language models (LLMs) on GPUs. It does this by using PagedAttention, a new attention algorithm that stores key-value tensors more efficiently in the non-contiguous spaces of the GPU VRAM.
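For a sense of what that looks like in practice, here is a minimal offline-inference sketch using vLLM's Python API; the small model ID is just an illustrative choice.

# Minimal vLLM offline-inference sketch (the model ID is an illustrative choice).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model that is easy to test locally
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does PagedAttention optimize?"], params)
print(outputs[0].outputs[0].text)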
0 notes
canmom · 2 months ago
Text
I tried running the hot new small 32B-parameter reasoning model QwQ locally, and it's still just a bit too big for my rig to handle - using Ollama (whose backend is llama.cpp), it flooded my VRAM and a fair chunk of my RAM, and ended up generating at about 1 token/s (roughly, not measured), but unfortunately my mouse was also moving at about 1fps at the same time - though maybe whatever microsoft did to the task manager is the real culprit here since it surprisingly cleared up after a bit. in any case, I looked into VLLM, which is the qwen team's recommended server for inference, but it doesn't play nice on windows, so it will have to wait until I can get it running on Linux, or maybe get WSL up and running.
anyway, prompting it with a logic puzzle suggested by a friend -
I have three eggs. I eat one. One hatches into a chicken. The chicken lays an egg. How many eggs do I have?
led to it generating a seemingly infinitely long chain of thought as it kept going round and round nuances of the problem like whether you own a chicken that hatches from your egg. I decided to cut it off rather than let it terminate. here's a sample though.
(screenshot of a sample of the generated chain of thought)
#ai
10 notes · View notes
centlinux · 23 days ago
Text
VLLM Docker: Fast LLM Containers Made Easy
Learn how to deploy large language models (LLMs) efficiently with VLLM Docker. This guide covers setup, optimization, and best practices for fast, scalable LLM inference in containers. #centlinux #linux #docker #openai
0 notes
ecommerceupdate · 2 months ago
Text
Red Hat has just released new updates to Red Hat AI, its portfolio of products and services designed to accelerate the development and deployment of AI solutions in the hybrid cloud. Red Hat AI provides an enterprise AI platform for model training and inference, delivering greater expertise, flexibility, and a simplified experience for deploying systems anywhere in the hybrid cloud.
In the effort to reduce the cost of deploying large language models (LLMs) to serve a growing number of use cases, companies still face the challenge of integrating these systems with their proprietary data and accessing them from anywhere: whether in a data center, in the public cloud, or even at the edge.
By integrating both Red Hat OpenShift AI and Red Hat Enterprise Linux AI (RHEL AI), Red Hat AI addresses these concerns by providing an enterprise AI platform that lets organizations adopt more efficient and optimized models, tuned with business-specific data, which can be deployed across the hybrid cloud to train models on a wide range of compute architectures.
For Joe Fernandes, vice president and general manager of Red Hat's AI Business Unit, the update enables organizations to be accurate and cost-effective on their AI journeys. "Red Hat knows that companies will need ways to manage the growing cost of their generative AI deployments as they bring more use cases into production and operate at scale. Red Hat AI helps organizations address these challenges by letting them use more efficient, purpose-built models, trained on their data, that enable flexible inference in on-premises, cloud, and edge environments."
Red Hat OpenShift AI
Red Hat OpenShift AI provides a complete AI platform for managing predictive and generative AI (gen AI) lifecycles across the hybrid cloud, including machine learning operations (MLOps) and Large Language Model Operations (LLMOps) capabilities. The platform offers functionality for building predictive models and tuning gen AI models, along with tools that simplify AI model management, from data science pipelines and models to model monitoring, governance, and more.
The latest version of the platform, Red Hat OpenShift AI 2.18, adds new updates and capabilities to support Red Hat AI's goal of bringing more optimized and efficient AI models to the hybrid cloud. Key features include:
● Distributed serving: delivered through the vLLM inference server, distributed serving lets IT teams split model serving across multiple graphics processing units (GPUs). This helps ease the load on a single server, speeds up training and fine-tuning, and promotes more efficient use of compute resources, while helping distribute serving across nodes for AI models.
● End-to-end model tuning experience: using InstructLab and Red Hat OpenShift AI data science pipelines, this new capability helps simplify LLM fine-tuning, making it more scalable, efficient, and auditable in large production environments, while providing management through the Red Hat OpenShift AI dashboard.
● AI Guardrails: Red Hat OpenShift AI 2.18 helps improve the accuracy, performance, latency, and transparency of LLMs through a technology preview of AI Guardrails, which monitors and protects user inputs and model outputs. AI Guardrails offers additional detection capabilities to help IT teams identify and mitigate potentially hateful, abusive, or profane speech, personally identifiable information, competitor data, or other content restricted by corporate policies.
● Model evaluation: using the language model evaluation component (lm-eval) to provide important insights into overall model quality, model evaluation lets data scientists compare the performance of their LLMs across a range of tasks, from logical and mathematical reasoning to adversarial natural language, helping to create more effective, responsive, and tailored AI models.
RHEL AI
Part of the Red Hat AI portfolio, RHEL AI is a foundation model platform for developing, testing, and running LLMs more consistently, with the goal of powering enterprise applications. RHEL AI provides Granite LLMs and InstructLab model alignment tools, packaged in a bootable Red Hat Enterprise Linux image that can be deployed across the hybrid cloud.
Released in February 2025, RHEL AI 1.4 brought several improvements, including:
● Support for the Granite 3.1 8B model as the latest addition to the open-source-licensed Granite model family. The model adds multilingual support for inference and taxonomy/knowledge customization (developer preview), plus a 128k context window to improve the adoption of summarization results and Retrieval-Augmented Generation (RAG) tasks.
● A new graphical user interface for contributing skills and foundational knowledge, available as a developer preview, aimed at simplifying data ingestion and chunking and at letting users add their own skills and contributions to AI models.
● Document Knowledge-bench (DK-bench) to make it easier to compare AI models fine-tuned on relevant private data against the performance of the same untuned base models.
Red Hat AI InstructLab on IBM Cloud
Increasingly, companies are looking for AI solutions that prioritize accuracy and data security while keeping costs and complexity as low as possible. Red Hat AI InstructLab, available as a service on IBM Cloud, is designed to simplify and scale the training and deployment of AI systems and to help improve their security. By simplifying InstructLab model tuning, organizations can build more efficient platforms tailored to their unique needs while keeping control of their confidential information.
Free training on AI Foundations
AI is a transformative opportunity that is redefining how companies operate and compete. To support organizations in this dynamic landscape, Red Hat offers free online AI Foundations training.
The company is offering two AI learning certificates, aimed both at experienced senior leaders and at beginners, helping to educate users at all levels on how AI can help transform business operations, streamline decision-making, and drive innovation.
Availability
Red Hat OpenShift AI 2.18 and Red Hat Enterprise Linux AI 1.4 are now available. More information on additional features, improvements, bug fixes, and how to update your version of Red Hat OpenShift AI to the latest release can be found here, and the latest version of RHEL AI can be found here. Red Hat AI InstructLab on IBM Cloud will be available soon. Red Hat's AI Foundations training is now available to customers. Read the full article
0 notes
lowendbox · 2 months ago
Text
In today’s tech landscape, the average VPS just doesn’t cut it for everyone. Whether you're a machine learning enthusiast, video editor, indie game developer, or just someone with a demanding workload, you've probably hit a wall with standard CPU-based servers. That’s where GPU-enabled VPS instances come in. A GPU VPS is a virtual server that includes access to a dedicated Graphics Processing Unit, like an NVIDIA RTX 3070, 4090, or even enterprise-grade cards like the A100 or H100. These are the same GPUs powering AI research labs, high-end gaming rigs, and advanced rendering farms. But thanks to the rise of affordable infrastructure providers, you don’t need to spend thousands to tap into that power. At LowEndBox, we’ve always been about helping users find the best hosting deals on a budget. Recently, we’ve extended that mission into the world of GPU servers. With our new Cheap GPU VPS Directory, you can now easily discover reliable, low-cost GPU hosting solutions for all kinds of high-performance tasks. So what exactly can you do with a GPU VPS? And why should you rent one instead of buying hardware? Let’s break it down. 1. AI & Machine Learning If you’re doing anything with artificial intelligence, machine learning, or deep learning, a GPU VPS is no longer optional, it’s essential. Modern AI models require enormous amounts of computation, particularly during training or fine-tuning. CPUs simply can’t keep up with the matrix-heavy math required for neural networks. That’s where GPUs shine. For example, if you’re experimenting with open-source Large Language Models (LLMs) like Mistral, LLaMA, Mixtral, or Falcon, you’ll need a GPU with sufficient VRAM just to load the model—let alone fine-tune it or run inference at scale. Even moderately sized models such as LLaMA 2–7B or Mistral 7B require GPUs with 16GB of VRAM or more, which many affordable LowEndBox-listed hosts now offer. Beyond language models, researchers and developers use GPU VPS instances for: Fine-tuning vision models (like YOLOv8 or CLIP) Running frameworks like PyTorch, TensorFlow, JAX, or Hugging Face Transformers Inference serving using APIs like vLLM or Text Generation WebUI Experimenting with LoRA (Low-Rank Adaptation) to fine-tune LLMs on smaller datasets The beauty of renting a GPU VPS through LowEndBox is that you get access to the raw horsepower of an NVIDIA GPU, like an RTX 3090, 4090, or A6000, without spending thousands upfront. Many of the providers in our Cheap GPU VPS Directory support modern drivers and Docker, making it easy to deploy open-source AI stacks quickly. Whether you’re running Stable Diffusion, building a custom chatbot with LLaMA 2, or just learning the ropes of AI development, a GPU-enabled VPS can help you train and deploy models faster, more efficiently, and more affordably. 2. Video Rendering & Content Creation GPU-enabled VPS instances aren’t just for coders and researchers, they’re a huge asset for video editors, 3D animators, and digital artists as well. Whether you're rendering animations in Blender, editing 4K video in DaVinci Resolve, or generating visual effects with Adobe After Effects, a capable GPU can drastically reduce render times and improve responsiveness. Using a remote GPU server also allows you to offload intensive rendering tasks, keeping your local machine free for creative work. Many users even set up a pipeline using tools like FFmpeg, HandBrake, or Nuke, orchestrating remote batch renders or encoding jobs from anywhere in the world. 
With LowEndBox’s curated Cheap GPU List, you can find hourly or monthly rentals that match your creative needs—without having to build out your own costly workstation. 3. Cloud Gaming & Game Server Hosting Cloud gaming is another space where GPU VPS hosting makes a serious impact. Want to stream a full Windows desktop with hardware-accelerated graphics? Need to host a private Minecraft, Valheim, or CS:GO server with mods and enhanced visuals? A GPU server gives you the headroom to do it smoothly. Some users even use GPU VPSs for game development, testing their builds in environments that simulate the hardware their end users will have. It’s also a smart way to experiment with virtualized game streaming platforms like Parsec or Moonlight, especially if you're developing a cloud gaming experience of your own. With options from providers like InterServer and Crunchbits on LowEndBox, setting up a GPU-powered game or dev server has never been easier or more affordable. 4. Cryptocurrency Mining While the crypto boom has cooled off, GPU mining is still very much alive for certain coins, especially those that resist ASIC centralization. Coins like Ethereum Classic, Ravencoin, or newer GPU-friendly tokens still attract miners looking to earn with minimal overhead. Renting a GPU VPS gives you a low-risk way to test your mining setup, compare hash rates, or try out different software like T-Rex, NBMiner, or TeamRedMiner, all without buying hardware upfront. It's a particularly useful approach for part-time miners, researchers, or developers working on blockchain infrastructure. And with LowEndBox’s flexible, budget-focused listings, you can find hourly or monthly GPU rentals that suit your experimentation budget perfectly. Why Rent a GPU VPS Through LowEndBox? ✅ Lower CostEnterprise GPU hosting can get pricey fast. We surface deals starting under $50/month—some even less. For example: Crunchbits offers RTX 3070s for around $65/month. InterServer lists setups with RTX 4090s, Ryzen CPUs, and 192GB RAM for just $399/month. TensorDock provides hourly options, with prices like $0.34/hr for RTX 4090s and $2.21/hr for H100s. Explore all your options on our Cheap GPU VPS Directory. ✅ No Hardware CommitmentRenting gives you flexibility. Whether you need GPU power for just a few hours or a couple of months, you don’t have to commit to hardware purchases—or worry about depreciation. ✅ Easy ScalabilityWhen your project grows, so can your resources. Many GPU VPS providers listed on LowEndBox offer flexible upgrade paths, allowing you to scale up without downtime. Start Exploring GPU VPS Deals Today Whether you’re training models, rendering video, mining crypto, or building GPU-powered apps, renting a GPU-enabled VPS can save you time and money. Start browsing the latest GPU deals on LowEndBox and get the computing power you need, without the sticker shock. We've included a couple links to useful lists below to help you make an informed VPS/GPU-enabled purchasing decision. https://lowendbox.com/cheap-gpu-list-nvidia-gpus-for-ai-training-llm-models-and-more/ https://lowendbox.com/best-cheap-vps-hosting-updated-2020/ https://lowendbox.com/blog/2-usd-vps-cheap-vps-under-2-month/ Read the full article
0 notes
sailai · 2 months ago
Text
Deploying DeepSeek with vLLM: A Complete Guide Using Ubuntu 22.04 + RTX 4090 + Docker
Recently, deploying large language models (LLMs) has become a core skill that AI developers can no longer avoid. vLLM, a high-performance, low-latency inference engine, has risen rapidly in the large-model inference space. Today I will walk you through deploying the DeepSeek model from scratch on Ubuntu 22.04 + RTX 4090 + Docker and getting it running!
This article is for developers who want to get up and running with vLLM quickly. It covers installing the GPU driver, CUDA, and the Docker environment, as well as the complete vLLM workflow. Let's get started!
What is vLLM?
vLLM (Very Large Language Model Inference) is a high-performance large-model inference engine with optimized GPU memory management. Its goal is to maximize inference throughput while reducing GPU memory consumption, so that large language models (LLMs) run more efficiently on single-GPU or multi-GPU servers.
vLLM's core advantages:
High throughput: supports batched inference and reduces token-generation latency
Efficient KV cache management: optimizes GPU memory and supports longer contexts
Multi-GPU support: tensor parallelism accelerates inference
OpenAI API compatible: can run as a local API server (a brief client-side sketch follows this list)
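As a small illustration of that OpenAI-compatible mode, here is a hedged client-side sketch; the port, placeholder API key, and served model name are assumptions that depend on how the vLLM container was started.

# Client-side sketch for a vLLM OpenAI-compatible endpoint (port, key, and model name are assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed: vLLM's default server port
    api_key="EMPTY",                      # assumed: placeholder key for a local server
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed model ID served by the container
    messages=[{"role": "user", "content": "Briefly introduce vLLM."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)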
Continue reading the full article: Deploying DeepSeek with vLLM: A Complete Guide Using Ubuntu 22.04 + RTX 4090 + Docker
1 note · View note
tumnikkeimatome · 2 months ago
Text
Accelerating Large Language Model Inference with PagedAttention and KV-Cache Optimization: Performance Evaluation and Implementation Analysis of the vLLM Framework
Overview and innovations of the vLLM framework
As industrial applications of large language models (LLMs) advance rapidly, making inference more efficient has become an important technical challenge. vLLM is an open-source framework developed at UC Berkeley's Sky Computing Lab to accelerate large language model inference. With the innovative PagedAttention algorithm at its core, it achieves up to 24x higher throughput than Hugging Face Transformers and up to 3.5x higher throughput than Text Generation Inference. The name vLLM derives from "Virtual Large Language Model".
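To make the PagedAttention idea concrete, here is a small conceptual sketch of a paged KV cache with a per-sequence block table. This is not vLLM's implementation; the block size, pool size, and head dimension are illustrative assumptions.

# Conceptual sketch of PagedAttention-style KV-cache paging (not vLLM's actual code).
import numpy as np

BLOCK_SIZE = 16   # tokens per block (illustrative)
NUM_BLOCKS = 64   # physical blocks in the pre-allocated pool (illustrative)
HEAD_DIM = 8      # per-head hidden size (illustrative)

# The key/value pool is allocated once as fixed-size blocks instead of one
# contiguous buffer per sequence, so memory is not reserved for unused length.
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM), dtype=np.float16)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    """Tracks one request's logical-to-physical block mapping (its block table)."""
    def __init__(self):
        self.block_table = []  # logical block index -> physical block index
        self.num_tokens = 0

    def append_kv(self, key, value):
        # A new physical block is taken only when the last one is full, so at
        # most one partially filled block is wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.num_tokens // BLOCK_SIZE]
        offset = self.num_tokens % BLOCK_SIZE
        kv_pool[block, offset, 0] = key
        kv_pool[block, offset, 1] = value
        self.num_tokens += 1

seq = Sequence()
for t in range(40):  # simulate decoding 40 tokens
    seq.append_kv(np.full(HEAD_DIM, t, dtype=np.float16),
                  np.full(HEAD_DIM, -t, dtype=np.float16))
print(seq.block_table)  # 3 physical blocks back 40 tokens; they need not be contiguous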
0 notes
billtj · 3 months ago
Text
A Step-by-Step Guide to Install DeepSeek-R1 Locally with Ollama, vLLM or Transformers - DEV Community
#AI
0 notes
ttiikkuu · 5 months ago
Text
vLLM: a revolution in the world of large language models. You have probably already heard about large language models (LLMs). These things are real geniuses, but there is one problem: they love to "eat" resources. That is where the virtual LLM comes in, a library that makes them much smarter about performance. Let's figure out what's what!
What is vLLM? vLLM (Virtual Large Language Model) is an open-source Python library developed by students from UC Berkeley in 2023. It was created to optimize how large language models run, reducing latency and improving scalability. How do you like that, Elon Musk?
Why is optimization needed? Conventional processing methods waste 60% to 80% of an LLM's memory. It is as if you bought a huge refrigerator to store a single bottle of water. vLLM uses a new algorithm, PagedAttention, which cuts this "waste" down to just 4%! And what does that mean for performance? It rises 24-fold. Impressive, right?
How does vLLM work? The main secret of the virtual LLM is smart management of memory and computation. Let's see what that looks like in numbers.
Comparison of traditional methods and vLLM:
Method | Wasted memory | Performance
Traditional | 60%-80% | Low
vLLM | 4% | High (24x higher)
Key features of vLLM
- Support for NVIDIA and AMD GPUs.
- Compatibility with popular LLMs on the Hugging Face platform.
- The PagedAttention algorithm for optimal memory usage.
- A huge community, with more than 31.7K stars on GitHub.
Why is vLLM the future? vLLM is not just a library. It is part of a bigger trend: tools for training LLMs. Over the past year, interest in "LLM training" has grown by 60%, which shows that more and more companies and developers are diving into this field.
vLLM trend dynamics
What else is worth knowing about training LLMs?
- LLMs are usually trained on datasets of at least 1 TB.
- The number of parameters can reach hundreds of billions.
- The stages include data preparation, model configuration, and fine-tuning.
Trending startups in the LLM space
The LLM world is evolving so fast that companies offering solutions for training and tuning these models are already appearing. Here are a few of them:
Cohere: provides customizable LLMs for scaling AI in the cloud or on on-premises servers.
Run:AI: automates resource management for LLM training. A real find for developers.
Unstructured AI: turns "raw" data into formats that LLMs can work with.
Pareto AI: helps find specialists for model tuning and data work.
Frequently asked questions (FAQ)
What is vLLM? vLLM (Virtual Large Language Model) is an open-source Python library developed by UC Berkeley students in 2023. It was created to optimize how large language models run, reducing latency and improving scalability.
How does vLLM differ from traditional methods? Traditional processing methods lose 60% to 80% of memory, while vLLM, thanks to the PagedAttention algorithm, cuts the losses to 4%, increasing performance 24-fold.
What are vLLM's key features?
- Support for NVIDIA and AMD GPUs.
- Compatibility with LLMs on the Hugging Face platform.
- Use of the PagedAttention algorithm for optimal memory management.
- A large developer community with more than 31.7K stars on GitHub.
Why is vLLM considered the future of working with large language models? vLLM saves significant resources while boosting performance and scalability.
All of this makes the library an indispensable tool for developers working with large language models.
Which companies are already working with LLM training technologies?
- Cohere: customizable LLMs for cloud and on-premises servers.
- Run:AI: automated resource management for LLM training.
- Unstructured AI: data processing for use with LLMs.
- Pareto AI: sourcing specialists for model tuning and data processing.
Conclusion
If you work with large language models, vLLM is exactly what you need: optimization, resource savings, and a many-fold boost in performance. And most importantly, a community and tooling that make the work comfortable. Try it yourself and see that the future belongs to this library! Read the full article
0 notes
govindhtech · 6 months ago
Text
PyTorch/XLA 2.5: vLLM Support And Developer Improvements
PyTorch/XLA 2.5
PyTorch/XLA 2.5: enhanced development experience and support for vLLM
Machine learning engineers are enthusiastic about PyTorch/XLA, a Python package that connects the PyTorch deep learning framework to Cloud TPUs via the XLA deep learning compiler. PyTorch/XLA 2.5 has now arrived with a number of enhancements that improve the developer experience and add support for vLLM. This release’s features include:
An outline of the plan to replace legacy torch_xla API calls with their upstream PyTorch equivalents, which simplifies the development process; the migration of the distributed API serves as an illustration of this.
Several enhancements to the torch_xla.compile function that improve the debugging experience for developers while they work on a project.
Experimental vLLM support for TPUs, which lets you expand your current deployments and use the same vLLM interface across all of your TPUs.
Let’s examine each of these improvements.
Streamlining torch_xla API
With PyTorch/XLA 2.5, Google Cloud is making a big stride toward aligning the API with upstream PyTorch. Its goal is to make XLA devices easier to use by reducing the learning curve for developers who are already familiar with PyTorch. Where feasible, this means deprecating proprietary PyTorch/XLA API calls in favor of their more mature PyTorch equivalents and switching call sites over to them. Before the migration, several of these features still lived in the torch_xla Python module.
To make development easier, this release switches to existing PyTorch distributed API functions when running models on top of PyTorch/XLA, moving the majority of the distributed API calls from the torch_xla module to torch.distributed.
With PyTorch/XLA 2.4
import torch_xla.core.xla_model as xm
xm.all_reduce()
Supported after PyTorch/XLA 2.5
torch.distributed.all_reduce()
A better version of “torch_xla.compile”
This release also includes a few new compilation features to help you debug or identify potential problems in your model code. For instance, the “full_graph” mode raises an error when more than one compilation graph is produced, which helps catch problems caused by multiple compilation graphs early, at compile time.
You can now also specify how many recompilations you expect for a compiled function. This helps you troubleshoot performance issues when a function is recompiled more often than necessary, for example because it exhibits unexpected dynamism.
Additionally, you can now give compiled functions a meaningful name instead of an automatically generated one. Naming compiled targets gives you additional context in debugging messages, which makes it simpler to pinpoint the potential issue. Here’s an illustration of how that looks in practice:
named code
@torch_xla.compile
def dummy_cos_sin_decored(self, tensor):
    return torch.cos(torch.sin(tensor))
dumped HLO target files, renamed using the compiled function’s name
… module_0021.SyncTensorsGraph.4.hlo_module_config.txt module_0021.SyncTensorsGraph.4.target_arguments.txt module_0021.SyncTensorsGraph.4.tpu_comp_env.txt module_0024.dummy_cos_sin_decored.5.before_optimizations.txt module_0024.dummy_cos_sin_decored.5.execution_options.txt module_0024.dummy_cos_sin_decored.5.flagfile module_0024.dummy_cos_sin_decored.5.hlo_module_config.txt module_0024.dummy_cos_sin_decored.5.target_arguments.txt module_0024.dummy_cos_sin_decored.5.tpu_comp_env.txt …
In the listing above, you can see the difference between the original and the named outputs for the same file: the automatically generated name is “SyncTensorsGraph,” while the renamed files correspond to the small named-code example shown earlier.
vLLM on TPU (testing)
If you already serve models on GPUs using vLLM, you can now use TPU as a backend. vLLM is a memory-efficient, high-throughput inference and serving engine for LLMs. To make model testing on TPU easier, vLLM on TPU keeps the same vLLM interface that developers love, including direct integration with the Hugging Face Model Hub.
Switching your vLLM endpoint to TPU takes only a few configuration adjustments. Apart from the TPU image, everything stays the same: the model source code, load balancing, autoscaling metrics, and the request payload. Refer to the installation guide for further information.
Other vLLM capabilities added for TPU include Pallas kernels such as paged attention and flash attention, as well as dynamo-bridge speed optimizations. These are all now included in the PyTorch/XLA repository (code). Although PyTorch TPU users can now access vLLM, this work is still in progress, and more functionality and improvements are anticipated in upcoming releases.
Use PyTorch/XLA 2.5
You can start using these new capabilities by downloading the latest version through your Python package manager. If you are new to PyTorch/XLA, see the project’s GitHub page for installation instructions and more detailed information.
Read more on Govindhtech.com
0 notes
ai-news · 6 months ago
Link
Neural Magic has released the LLM Compressor, a state-of-the-art tool for large language model optimization that enables far quicker inference through much more advanced model compression. Hence, the tool is an important building block in Neural Mag #AI #ML #Automation
0 notes
y2fear · 10 months ago
Photo
Optimizing LLM Deployment: vLLM PagedAttention and the Future of Efficient AI Serving
0 notes
valianttimetravelcowboy · 1 year ago
Text
Selecting and Configuring Inference Engines for LLMs
Many optimization techniques have been developed to mitigate the inefficiencies that occur at different stages of the inference process, and it is difficult to scale inference using vanilla Transformers techniques alone. Inference engines wrap these optimizations into one package and ease the inference process.
For a very small amount of ad-hoc testing or for quick reference, we can use vanilla Transformers code to run inference.
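For example, a bare-bones Hugging Face Transformers generation loop (no batching or serving optimizations) might look like the following; the model ID is an assumption standing in for our fine-tuned Vicuna-7B checkpoint.

# Vanilla Transformers inference sketch for ad-hoc testing (the model ID is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # stand-in for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: Summarize paged attention in one sentence.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))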
The landscape of inference engines is evolving quickly, and since there are multiple choices, it is important to test and shortlist the best ones for specific use cases. Below are some of the inference-engine experiments we ran and the reasons each did or did not work for our case.
For our fine-tuned Vicuna-7B model, we tried:
TGI 
vLLM
Aphrodite 
Optimum-Nvidia
PowerInfer
LLAMACPP
Ctranslate2
We went through each GitHub page and its quick-start guide to set up these engines. PowerInfer, llama.cpp, and CTranslate2 are not very flexible, do not support many optimization techniques such as continuous batching and paged attention, and delivered sub-par performance compared to the other engines mentioned.
To obtain higher throughput, the inference engine/server should fully utilize memory and compute, and both client and server must serve requests in a parallel/asynchronous way to keep the server always busy. As mentioned earlier, without optimization techniques such as PagedAttention, FlashAttention, and continuous batching, performance will always be suboptimal.
TGI, vLLM, and Aphrodite are more suitable candidates in this regard, and through the multiple experiments described below we found the optimal configuration to squeeze maximum performance out of inference. Techniques such as continuous batching and paged attention are enabled by default, while speculative decoding had to be enabled manually in the inference engine for the tests below.
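As one concrete illustration of how such an engine was exercised, a rough vLLM throughput check over a batch of prompts looks like the sketch below; continuous batching and paged attention are active by default, and the model ID and prompt set are assumptions standing in for our fine-tuned Vicuna-7B and test workload.

# Rough vLLM throughput-check sketch (model ID and prompts are assumptions).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="lmsys/vicuna-7b-v1.5")  # stand-in for the fine-tuned model
prompts = [f"Question {i}: explain continuous batching." for i in range(64)]
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.time()
outputs = llm.generate(prompts, params)  # the engine batches these requests internally
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/sec across {len(prompts)} prompts")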
0 notes
hackernewsrobot · 2 years ago
Text
vLLM: 24x faster LLM serving than HuggingFace Transformers
https://vllm.ai/
0 notes
emasters · 2 years ago
Text
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM…
View On WordPress
0 notes