#VisionLanguageModels
govindhtech · 8 months ago
Text
NVIDIA AI Blueprints for Building Visual AI Agents in Any Sector
NVIDIA AI Blueprints
Businesses and government agencies worldwide are creating AI agents to improve the skills of workers who depend on visual data from an increasing number of devices, such as cameras, Internet of Things sensors, and automobiles.
Developers in almost any industry will be able to create visual AI agents that analyze image and video content with the help of a new NVIDIA AI Blueprint for video search and summarization. These agents can provide summaries, answer user questions, and trigger alerts for specific situations.
The blueprint is a configurable workflow that integrates NVIDIA computer vision and generative AI technologies and is a component of NVIDIA Metropolis, a suite of developer tools for creating vision AI applications.
The NVIDIA AI Blueprint for video search and summarization is being brought to businesses and cities around the world by global systems integrators and technology solutions providers such as Accenture, Dell Technologies, and Lenovo, launching the next wave of AI applications that can increase productivity and safety in factories, warehouses, shops, airports, traffic intersections, and more.
The NVIDIA AI Blueprint, which was unveiled prior to the Smart City Expo World Congress, provides visual computing developers with a comprehensive set of optimized tools for creating and implementing generative AI-powered agents that are capable of consuming and comprehending enormous amounts of data archives or live video feeds.
Because users can modify these visual AI agents with natural language prompts rather than rigid software code, it is easier to deploy virtual assistants across sectors and smart city applications.
NVIDIA AI Blueprint Harnesses Vision Language Models
Vision language models (VLMs), a subclass of generative AI models, enable visual AI agents to perceive the physical world and carry out reasoning tasks by fusing language comprehension and computer vision.
NVIDIA NIM microservices for VLMs like NVIDIA VILA, LLMs like Meta’s Llama 3.1 405B, and AI models for GPU-accelerated question answering and context-aware retrieval-augmented generation may all be used to configure the NVIDIA AI Blueprint for video search and summarization. The NVIDIA NeMo platform makes it simple for developers to modify other VLMs, LLMs, and graph databases to suit their particular use cases and settings.
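Because NIM microservices expose OpenAI-compatible endpoints, the question-answering piece of such an agent can be prototyped with a standard client. The sketch below is illustrative only: the endpoint URL, model identifier, and image-passing convention are assumptions that vary by deployment, so check the documentation for the specific NIM microservice you use.

```python
# Hypothetical sketch: asking a VLM NIM microservice a question about one video frame
# through its OpenAI-compatible endpoint. Base URL, model name, and image format are
# assumptions -- consult the NIM documentation for your deployment.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # assumed local NIM endpoint
    api_key="not-needed-for-local-nim",
)

with open("loading_dock_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/vila",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Are any workers in this frame missing safety vests?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```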
By using the NVIDIA AI Blueprints, developers may be able to avoid spending months researching and refining generative AI models for use in smart city applications. It can significantly speed up the process of searching through video archives to find important moments when installed on NVIDIA GPUs at the edge, on-site, or in the cloud.
An AI agent developed using this methodology could notify employees in a warehouse setting if safety procedures are broken. An AI bot could detect traffic accidents at busy crossroads and provide reports to support emergency response activities. Additionally, to promote preventative maintenance in the realm of public infrastructure, maintenance personnel could request AI agents to analyze overhead imagery and spot deteriorating roads, train tracks, or bridges.
Beyond smart places, visual AI agents could also be used to automatically summarize videos for visually impaired individuals and to classify large visual datasets for training other AI models.
The workflow for video search and summarization is part of a set of NVIDIA AI blueprints that facilitate the creation of digital avatars driven by AI, the development of virtual assistants for individualized customer support, and the extraction of enterprise insights from PDF data.
With NVIDIA AI Enterprise, an end-to-end software platform that speeds up data science pipelines and simplifies the development and deployment of generative AI, developers can test and download NVIDIA AI Blueprints for free. These blueprints can then be implemented in production across accelerated data centers and clouds.
AI Agents to Deliver Insights From Warehouses to World Capitals
With the assistance of NVIDIA’s partner ecosystem, enterprise and public sector clients can also utilize the entire library of NVIDIA AI Blueprints.
With its Accenture AI Refinery, which is based on NVIDIA AI Foundry and allows clients to create custom AI models trained on enterprise data, the multinational professional services firm Accenture has integrated NVIDIA AI Blueprints.
For smart city and intelligent transportation applications, global systems integrators in Southeast Asia, such as ITMAX in Malaysia and FPT in Vietnam, are developing AI agents based on the NVIDIA AI Blueprint for video search and summarization.
Using computing, networking, and software from international server manufacturers, developers can also create and implement NVIDIA AI Blueprints on NVIDIA AI systems.
To improve current edge AI applications and develop new edge AI-enabled capabilities, Dell will combine VLM and agent approaches with its NativeEdge platform. The NVIDIA AI Blueprint for video search and summarization and the Dell Reference Designs for the Dell AI Factory with NVIDIA will support VLM capabilities in specialized AI workflows for data center, edge, and on-premises multimodal enterprise use cases.
Lenovo Hybrid AI solutions powered by NVIDIA also utilize NVIDIA AI blueprints.
The new NVIDIA AI Blueprint will be used by businesses such as K2K, a smart city application provider in the NVIDIA Metropolis ecosystem, to create AI agents that can evaluate real-time traffic camera data. Thanks to this, city officials will be able to ask questions about street activity and get recommendations on how to improve operations. The company is also working with city traffic managers in Palermo, Italy, to deploy visual AI agents using NIM microservices and NVIDIA AI Blueprints.
Visit the NVIDIA booth at the Smart City Expo World Congress, being held in Barcelona through November 7, to learn more about the NVIDIA AI Blueprint for video search and summarization.
Read more on Govindhtech.com
mysocial8onetech · 1 year ago
Text
Embark on a journey with our new article that delves into the intricacies of MoAI, an innovative Mixture of Experts approach in an open-source Large Language and Vision Model (LLVM). Learn how MoAI leverages auxiliary visual information and multiple intelligences to revolutionize the field. Discover how this model aligns and condenses outputs from external CV models, efficiently using relevant information for vision language tasks. Understand the unique blend of visual features, auxiliary features from external CV models, and language features that MoAI brings together.
govindhtech · 1 month ago
Text
What Is NanoVLM? Key Features, Components And Architecture
The NanoVLM initiative develops VLMs for NVIDIA Jetson devices, specifically the Orin Nano. These models aim to improve interactive performance by increasing processing speed and decreasing memory usage. The documentation covers supported VLM families, benchmarks, and setup parameters such as Jetson device and JetPack compatibility, as well as video sequence processing, live-stream analysis, and multimodal chat via web or command-line interfaces.
What's nanoVLM?
NanoVLM is billed as the simplest and fastest repository for training and fine-tuning small VLMs.
Hugging Face built it as a streamlined teaching tool that aims to democratise vision-language model development through a simple PyTorch framework. Inspired by Andrej Karpathy's nanoGPT, nanoVLM prioritises readability, modularity, and transparency without compromising practicality. About 750 lines of code define and train nanoVLM, plus boilerplate for parameter loading and reporting.
Architecture and Components
NanoVLM is a modular multimodal architecture with a vision encoder, a lightweight language decoder, and a modality projection mechanism connecting the two. The vision encoder uses the transformer-based SigLIP-B/16 for reliable image feature extraction.
The visual backbone translates images into embeddings the language model can consume.
The textual side uses SmolLM2, an efficient and clear causal decoder-style transformer.
Vision-language fusion is handled by a simple projection layer that aligns image embeddings with the language model's input space.
Transparent, readable, and easy to change, the integration is suitable for rapid prototyping and instruction.
The compact code structure includes the VLM (~100 lines), Language Decoder (~250 lines), Modality Projection (~50 lines), Vision Backbone (~150 lines), and a basic training loop (~200 lines).
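To show how small the projection component really is, here is a toy sketch (not the repository's actual code) of a modality projection that maps SigLIP patch embeddings into the language model's embedding space; the 768 and 576 dimensions match SigLIP-B/16 and SmolLM2-135M respectively.

```python
# Illustrative sketch (not nanoVLM's actual implementation): a minimal modality
# projection aligning vision-encoder patch embeddings with the decoder's input space.
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    def __init__(self, vision_dim: int = 768, lm_dim: int = 576):
        super().__init__()
        # A single linear layer is enough to align the two embedding spaces;
        # nanoVLM keeps this component deliberately tiny (~50 lines).
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_patches, vision_dim)
        # returns:      (batch, num_patches, lm_dim), ready to be concatenated
        # with the text token embeddings before the decoder.
        return self.proj(image_embeds)

# Example: 196 SigLIP patch embeddings projected into the decoder's space.
x = torch.randn(1, 196, 768)
print(ModalityProjection()(x).shape)  # torch.Size([1, 196, 576])
```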
Sizing and Performance
Pairing the HuggingFaceTB/SmolLM2-135M language backbone with the SigLIP-B/16-224 (85M) vision backbone yields a 222M-parameter model, released as nanoVLM-222M.
NanoVLM is compact and easy to use yet delivers competitive results. The 222M model, trained for 6 hours on a single H100 GPU with 1.7M samples from the_cauldron dataset, reached 35.3% accuracy on the MMStar benchmark, matching SmolVLM-256M-level performance with fewer parameters and less compute.
NanoVLM is efficient enough for educational institutions or developers using a single workstation.
Key Features and Philosophy
NanoVLM is a simple yet effective VLM introduction.
It enables users to test what micro VLMs can do by changing settings and parameters.
Transparency helps users understand the logic and data flow through minimally abstracted, well-defined components, which is ideal for reproducibility research and education.
Its modularity and forward compatibility allow users to replace visual encoders, decoders, and projection mechanisms. This provides a framework for multiple investigations.
Get Started and Use
Users get started by cloning the repository and setting up the environment. uv is recommended over pip for package management. Dependencies include torch, numpy, torchvision, pillow, datasets, huggingface-hub, transformers, and wandb.
NanoVLM includes easy methods for loading and storing Hugging Face Hub models. VisionLanguageModel.from_pretrained() can load pretrained weights from Hub repositories like “lusxvr/nanoVLM-222M”.
Pushing trained models to the Hub creates a model card (README.md) and saves weights (model.safetensors) and configuration (config.json). Repositories can be private but are usually public.
Models can also be loaded and saved locally by passing local paths to VisionLanguageModel.from_pretrained() and save_pretrained().
To test a trained model, generate.py is provided. An example shows how to pass an image of a cat and the prompt “What is this?” to get a description.
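A hedged sketch of what such an inference script can look like, modeled loosely on generate.py; the module paths, helper functions, config fields, and generate() signature below are assumptions and may differ from the current repository:

```python
# Hedged sketch modeled on the repository's generate.py. Module paths, helper
# names, config fields, and the generate() signature are assumptions -- check
# the repo for exact code.
import torch
from PIL import Image
from models.vision_language_model import VisionLanguageModel   # assumed path
from data.processors import get_tokenizer, get_image_processor  # assumed path

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M").to(device).eval()

tokenizer = get_tokenizer(model.cfg.lm_tokenizer)              # assumed config field
image_processor = get_image_processor(model.cfg.vit_img_size)  # assumed config field

prompt = "What is this?"
tokens = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
image = image_processor(Image.open("cat.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    out = model.generate(tokens, image, max_new_tokens=40)      # assumed signature
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```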
“NanoVLM” is listed in the Models section of the NVIDIA Jetson AI Lab, although that content focuses on using NanoLLM to optimise VLMs such as LLaVA, VILA, and Obsidian for Jetson devices. This means Jetson and other platforms can benefit from the same small-VLM optimisation techniques.
Training
Train nanoVLM with the train.py script, which reads its settings from models/config.py. Training runs are typically logged to Weights & Biases (wandb).
VRAM specs
VRAM requirements are worth understanding before training.
Measurements of the default 222M model on a single NVIDIA H100 GPU show that peak VRAM use grows with batch size.
870.53 MB of VRAM is allocated after model loading.
Maximum VRAM used during training is 4.5 GB for batch size 1 and 65 GB for batch size 256.
Training with a batch size of 512 peaked at about 80 GB before running out of memory.
The results indicate that training with a batch size of 1 requires at least ~4.5 GB of VRAM, whereas a batch size of up to 16 requires roughly 8 GB.
Variations in sequence length or model architecture affect VRAM needs.
To test VRAM requirements on a system and setup, measure_vram.py is provided.
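As a rough illustration of what such a script does, the sketch below measures peak VRAM for one training step using PyTorch's built-in CUDA statistics; it is not the actual measure_vram.py, and the model-call convention is an assumption.

```python
# Generic sketch of peak-VRAM measurement for a single training step. This is
# not measure_vram.py; it only illustrates the idea using torch.cuda statistics.
import torch

def measure_peak_vram(model, batch, optimizer):
    # Requires a CUDA device; assumes the model returns an object with a .loss field.
    torch.cuda.reset_peak_memory_stats()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak VRAM for this step: {peak_gb:.2f} GB")
    return peak_gb
```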
Contributions and Community
NanoVLM welcomes contributions.
Contributions are welcome, but pure PyTorch implementations are preferred over adding dependencies such as transformers; DeepSpeed, Trainer, and Accelerate are not used. Open an issue to discuss new feature ideas, and submit bug fixes as pull requests.
Planned future work includes data packing, multi-GPU training, multi-image support, image splitting, and VLMEvalKit integration. Integration with the Hugging Face ecosystem allows use with Transformers, Datasets, and Inference Endpoints.
In summary
NanoVLM is a Hugging Face project that provides a simple, readable, and flexible PyTorch framework for building and testing small VLMs. It is designed for efficient use and education, with training, creation, and Hugging Face ecosystem integration paths.
govindhtech · 7 months ago
Text
Vision Language Models: Learning From Text & Images Together
Vision language models are models that learn from text and images at the same time to perform a variety of tasks, such as labeling photos and answering visual questions. This post covers the essentials of vision language models: getting a general idea of what they are, understanding how they operate, choosing the best model, using them for inference, and fine-tuning them quickly.
What is a Vision Language Model?
Multimodal models that learn from both text and images are often referred to as vision language models. They are a class of generative models that take image and text inputs and produce text outputs. Large vision language models can handle a variety of image types, including documents and web pages, and show good zero-shot capabilities and generalization.
Use cases include image captioning, document comprehension, visual question answering, image recognition through instructions, and image chat. Some vision language models can also capture spatial features in an image: when asked to detect or segment a specific subject, they can produce segmentation masks or bounding boxes, localize different objects, or answer questions about their absolute or relative positions. The current collection of large vision language models varies greatly in training data, image encoding methods, and, consequently, capabilities.
An Overview of Vision Language Models That Are Open Source
The Hugging Face Hub has a large number of open vision language models. The table below lists a few of the most well-known.
It includes base models as well as chat-tuned models that can be used in conversational mode.
“Grounding” is a characteristic of several of these models that lessens model hallucinations.
Unless otherwise indicated, all models are trained on English.
Table: a selection of open vision language models on the Hugging Face Hub.
Finding the right Vision Language Model
The best model for your use case can be chosen in a variety of ways.
Vision Arena is a leaderboard that is updated constantly and is entirely dependent on anonymous voting of model results. Users input a prompt and a picture in this arena, and outputs from two distinct models are anonymously sampled before the user selects their favorite output. In this manner, human preferences alone are used to create the leaderboard.
Another leaderboard that ranks different vision language models based on similar parameters and average scores is the Open VLM Leaderboard. Additionally, you can filter models based on their sizes, open-source or proprietary licenses, and rankings for other criteria.
The Open VLM Leaderboard is powered by the VLMEvalKit toolbox, which is used to execute benchmarks on visual language models. LMMS-Eval is an additional evaluation suite that offers a standard command line interface for evaluating Hugging Face models of your choosing using datasets stored on the Hugging Face Hub, such as the ones shown below:
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/
Both the Open VLM Leaderboard and the Vision Arena are restricted to the models submitted to them, and new models are added through updates. If you'd like to find more models, you can search the Hub for models under the task image-text-to-text.
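If you prefer to query the Hub programmatically rather than browse, here is a short sketch using the huggingface_hub client; the parameter names assume a recent library version.

```python
# Sketch: listing Hub models under the image-text-to-text task,
# sorted by downloads (assumes a recent huggingface_hub release).
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="image-text-to-text",
                             sort="downloads", direction=-1, limit=10):
    print(model.id)
```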
The leaderboards may present you with a variety of benchmarks to assess vision language models. We’ll examine some of them.
MMMU
The most thorough benchmark to assess vision language models is A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (MMMU). It includes 11.5K multimodal tasks that call for college-level topic knowledge and critical thinking in a variety of fields, including engineering and the arts.
MMBench
The MMBench evaluation benchmark consists of 3,000 single-choice questions covering 20 distinct skills, such as object localization and OCR. The paper also introduces CircularEval, an evaluation strategy in which the model is expected to give the correct answer each time the question's answer options are shuffled into different orderings. More specialized benchmarks exist in other domains, such as OCRBench for document comprehension, ScienceQA for science question answering, AI2D for diagram understanding, and MathVista for visual mathematical reasoning.
Technical Details
A vision language model can be pretrained in a number of ways. The key trick is to unify the image and text representations and feed them to a text decoder for generation. The most common and widely used models stack an image encoder, an embedding projector (often a dense neural network) to align image and word representations, and a text decoder. As for the training phases, different models have followed various methodologies.
In contrast to LLaVA-like pre-training, the creators of KOSMOS-2 decided to train the model fully end-to-end, which is computationally costly; the authors then fine-tuned on language-only instructions to align the model. Another example is Fuyu-8B, which has no image encoder at all: image patches are fed directly into a projection layer, and the sequence then passes through an auto-regressive decoder. In most cases you do not need to pretrain a vision language model yourself, because you can use an existing model or adapt it to your particular use case.
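To make the stacked-components description concrete, here is a toy, self-contained sketch of the wiring; the encoder and decoder are deliberately simplistic stand-ins and do not reflect any specific model's API.

```python
# Illustrative sketch of the common VLM stack described above: image encoder,
# embedding projector, text decoder. Toy stand-ins only -- no real model's API is implied.
import torch
import torch.nn as nn

VISION_DIM, LM_DIM, VOCAB = 768, 576, 32000

class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in "image encoder": in practice a ViT such as SigLIP producing patch features.
        self.image_encoder = nn.Linear(3 * 16 * 16, VISION_DIM)
        # Projector: a small dense network aligning image and word representations.
        self.projector = nn.Sequential(
            nn.Linear(VISION_DIM, LM_DIM), nn.GELU(), nn.Linear(LM_DIM, LM_DIM)
        )
        # Stand-in "text decoder": in practice a causal LM; here one transformer layer + head.
        self.embed = nn.Embedding(VOCAB, LM_DIM)
        self.decoder = nn.TransformerEncoderLayer(d_model=LM_DIM, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(LM_DIM, VOCAB)

    def forward(self, patches, input_ids):
        image_tokens = self.projector(self.image_encoder(patches))  # (B, P, LM_DIM)
        text_tokens = self.embed(input_ids)                          # (B, T, LM_DIM)
        seq = torch.cat([image_tokens, text_tokens], dim=1)          # image tokens prepended
        return self.lm_head(self.decoder(seq))                       # next-token logits

patches = torch.randn(1, 196, 3 * 16 * 16)    # 196 flattened 16x16 RGB patches
input_ids = torch.randint(0, VOCAB, (1, 12))  # a 12-token prompt
print(ToyVLM()(patches, input_ids).shape)     # torch.Size([1, 208, 32000])
```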
What Are Vision Language Models?
A vision-language model is a hybrid of natural language and vision models. By consuming images and their accompanying written descriptions, it learns to link information from the two modalities. The vision component extracts spatial features from the images, while the language model encodes information from the text.
Detected objects, image layout, and text embeddings are mapped between modalities. For example, the model will learn to associate a term from the text description with a bird in the picture.
In this way, the model learns to understand images and translate them into natural-language text, and vice versa.
VLM instruction
Creating VLMs involves pre-training foundation models and zero-shot learning. Transfer learning techniques such as knowledge distillation can then be used to adapt the models to more complex downstream tasks.
These are simpler techniques that, despite using fewer datasets and less training time, yield decent results.
On the other hand, modern frameworks use several techniques to provide better results, such as
Contrastive learning.
Masked language-image modeling.
Transformer-based encoder-decoder modules.
These designs can learn complex correlations between the modalities and produce state-of-the-art results. Let's take a closer look, starting with a sketch of the contrastive approach below.
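Here is a minimal sketch of a CLIP-style symmetric contrastive objective, which pulls matching image-text pairs together and pushes mismatched pairs apart; it assumes the encoders already produce fixed-size embeddings.

```python
# Minimal sketch of a CLIP-style contrastive objective (the first technique above).
# Image and text encoders are assumed to already output fixed-size embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) pairwise similarities
    targets = torch.arange(len(img))          # the diagonal holds the true pairs
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```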
Read more on Govindhtech.com
govindhtech · 1 year ago
Text
Introducing PaliGemma, Gemma 2 & Upgrades in Responsible AI
PaliGemma
At Google, the team believes that teamwork and open research can spur innovation, so they're happy to see that Gemma has received millions of downloads in the short time since its release.
This positive reaction has been tremendously motivating, as developers have produced a wide range of projects. From Octopus v2, an on-device action model, to Navarasa, a multilingual variant for Indic languages, developers are demonstrating Gemma’s potential to produce useful and approachable AI solutions.
Google cloud’s development of CodeGemma, with its potent code completion and generation capabilities, and RecurrentGemma, which provides effective inference and research opportunities, has likewise been driven by this spirit of exploration and invention.
The Gemma family of open models is composed of lightweight, cutting-edge models constructed using the same technology and research as the Gemini models. With the release of PaliGemma, a powerful open vision-language model (VLM), and the launch of Gemma 2, Google is thrilled to share more news about plans to grow the Gemma family. Furthermore, the commitment to responsible AI is being strengthened with upgrades to the Responsible Generative AI Toolkit, which give developers fresh and improved resources for assessing model safety and screening out inappropriate content.
Presenting the Open Vision-Language Model, PaliGemma
Inspired by PaLI-3, PaliGemma is a powerful open vision-language model. It is built on open components, such as the Gemma language model and the SigLIP vision model, and is intended to achieve class-leading fine-tune performance on a variety of vision-language tasks. This covers text recognition in images, object detection and segmentation, visual question answering, and captioning of images and short videos.
Both pretrained and fine-tuned checkpoints are offered at multiple resolutions, along with checkpoints tailored to a variety of tasks for immediate exploration.
PaliGemma is accessible through a number of platforms and tools to encourage open exploration and research. You can start exploring right now with free options such as Kaggle and Colab notebooks. Academic researchers who want to advance vision-language research can also apply for Google Cloud credits to support their work.
Start using PaliGemma today. It can be easily integrated using Hugging Face Transformers and JAX, and is available on GitHub, Hugging Face models, Kaggle, Vertex AI Model Garden, and ai.nvidia.com (accelerated with TensorRT-LLM), with Keras integration coming soon. You can also interact with the model through its Hugging Face Space.
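For a concrete starting point, here is a hedged sketch of PaliGemma inference through Hugging Face Transformers; the checkpoint name, task-prefix prompt, and call details follow common Transformers conventions but should be verified against the model card (the repository is gated and requires accepting the license).

```python
# Hedged sketch of PaliGemma inference with Hugging Face Transformers.
# Checkpoint name and prompt prefix are assumptions -- check the model card.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # assumed checkpoint name (gated repo)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street_scene.jpg")
prompt = "caption en"  # PaliGemma checkpoints respond to short task prefixes like this

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```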
Presenting Gemma 2: Efficiency and Performance of the Future
Google Cloud is excited to share the news that Gemma 2, the next generation of Gemma models, will soon be available. Gemma 2 boasts a revolutionary architecture intended for ground-breaking performance and efficiency, and will be offered in new sizes to accommodate a wide range of AI developer use cases.
Benefits include:
Class-Leading Performance: At 27 billion parameters, Gemma 2 delivers performance comparable to Llama 3 70B at less than half the size. This ground-breaking efficiency raises the bar for open models.
Lower Deployment Costs: Thanks to its efficient design, Gemma 2 fits on less than half the compute of comparable models. The 27B model is optimised to run on NVIDIA GPUs or can operate efficiently on a single TPU host in Vertex AI, making deployment more affordable and accessible for a wider range of users.
Flexible Toolchains for Tuning: Gemma 2 will give developers powerful tools for tuning across a wide range of platforms and resources. Fine-tuning is easier than ever with community tools like Axolotl and cloud-based solutions like Google Cloud. Additionally, smooth partner integration with Hugging Face and NVIDIA TensorRT-LLM, as well as Google's own JAX and Keras, lets you maximise performance and deploy across a range of hardware configurations.
Source: Hugging Face Open LLM Leaderboard (April 22, 2024) and Grok announcement blog
Gemma 2 is still pretraining. This chart shows performance from the latest Gemma 2 checkpoint along with benchmark pretraining metrics.
Watch this page for Gemma 2’s formal release in the upcoming weeks!
Expanding the Responsible Generative AI Toolkit
To help developers conduct more thorough model evaluations, Google is open-sourcing the LLM Comparator as part of the Responsible Generative AI Toolkit. The LLM Comparator is a new interactive, visual tool for running efficient side-by-side evaluations to gauge the quality and reliability of model responses. To see the LLM Comparator in action, visit the demo comparing Gemma 1.1 to Gemma 1.0.
The aim is for this tool to advance the toolkit's goal of helping developers create AI applications that are safe, responsible, and innovative.
Google is committed to creating a collaborative environment where state-of-the-art AI technology and responsible development coexist as the Gemma family of open models grows, and can't wait to see what you build with these new resources and how the community shapes the future of AI together.
Read more on govindhtech.com
mysocial8onetech · 2 years ago
Text
Discover GPT4RoI, the vision-language model that supports multi-region spatial instructions for detailed region-level understanding.