IBM SSM Transformer Speed Performance With Bamba Model
IBM built “Bamba” by crossing a transformer with a state-space model (SSM). In collaboration with CMU, Princeton, and the University of Illinois, IBM Research created an open-source LLM that combines the runtime performance of state-space models with the expressiveness of transformers, and key improvements from it are coming to IBM Granite 4.0.
IBM SSM
The transformer architecture behind today's large language models is what lets them generate human-like text. Much of that power comes from self-attention, which allows the model to weigh every word in the input sequence before producing a response.
The problem grows as a conversation drags on. Because the model keeps the entire running sequence in memory while it responds, the cost of generation rises quadratically: double the context window and the cost of processing and responding quadruples. This “quadratic bottleneck” delays responses to queries and produces duplicated computation. Even before ChatGPT popularised the transformer in 2022, researchers were looking for alternative architectures.
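To see where that scaling comes from, here is a small back-of-envelope sketch in Python. The layer count, head count, and head dimension are made-up illustrative values, not figures for any particular model.

```python
# Illustrative back-of-envelope scaling for vanilla self-attention.
# The model dimensions below are hypothetical and chosen only to show the trend.

def attention_cost(context_len: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2):
    """Return (attention FLOPs, KV-cache size in GiB) for one full sequence."""
    # The score matrix is context_len x context_len per head per layer: quadratic compute.
    flops = n_layers * n_heads * context_len ** 2 * head_dim
    # The KV cache stores keys and values for every token, head, and layer: linear memory,
    # but it is re-read on every generated token, so long contexts also slow decoding.
    kv_bytes = 2 * n_layers * n_heads * head_dim * context_len * bytes_per_elem
    return flops, kv_bytes / 2**30

base_flops, _ = attention_cost(4_096)
for ctx in (4_096, 8_192, 16_384, 32_768):
    flops, kv_gib = attention_cost(ctx)
    print(f"context {ctx:>6}: attention FLOPs x{flops / base_flops:3.0f}, KV cache ~{kv_gib:.0f} GiB")
```

Each doubling of the context quadruples the attention compute, which is the quadratic bottleneck described above; the KV cache grows only linearly, but it must stay in accelerator memory and be re-read for every generated token.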
State-space models offer one possible answer; hybrids that combine them with transformers offer another.
Bamba, IBM's first hybrid of this kind, layers state-space model (SSM) blocks with transformer blocks: it can parse long sequences the way a transformer does while running as quickly as an SSM, and it has just been released publicly. Its upgrades will reach IBM's Granite 4.0 models a few months from now. By dramatically reducing KV (key-value) cache memory demands, Bamba-9B runs at least twice as fast as comparable transformers while preserving accuracy. According to the IBM researcher leading the effort, everything comes down to that KV-cache reduction: greater context length, lower latency, and higher throughput.

The most important model you've never heard of

State-space models have been used to represent dynamic systems for decades, but they are nowhere near as well known as transformers. They are crucial to robotics, control theory, signal processing, and electrical engineering, and IBM researchers helped bring them into deep learning. An SSM can analyse time-series data in any discipline; the same mathematics can describe the weather, the stock market, or electrical activity in the brain. From its observations, an SSM infers a “hidden state” of a fixed size that captures the system's important properties. Think of the state as a summary of the past: when fresh information arrives, the hidden state is updated for the next prediction without growing.

SSMs entered deep learning in 2021, when Stanford researchers led by Albert Gu released S4, which applied state variables to language. Like the transformer and the RNNs before it, S4 processed word sequences well, but it handled long sequences faster and better than either. Whereas a transformer attends to every word in its context window, an SSM compresses the history into its hidden state; that selective retention speeds up inference and reduces memory overhead. S4 was difficult to implement, yet it posted surprising results on Long Range Arena, a benchmark of language models' ability to handle long sequences. Gupta, an IBM AI resident, helped Gu and his colleagues simplify the model with diagonal state spaces, reducing S4's roughly 1,000 lines of code to about 10. By then introducing a gating mechanism that filtered out irrelevant information, Gupta helped SSMs match transformers' “expressivity,” or sequence-modeling capacity, for the first time.

That work also pointed toward one of the first transformer hybrids. Gupta, who now works on IBM's Granite Vision models, went on to investigate hybrids in which standard attention blocks handle text with local dependencies while SSMs provide longer-range contextualisation. In 2023, Tri Dao at Princeton and Gu, by then a professor at CMU, released the gated SSM variant Mamba, followed by Mamba2, sparking a wave of hybrids such as Samba and MambaFormer. Nvidia then confirmed that these hybrids could speed up inferencing and outperform either architecture on its own, and announced its own hybrid, Nemotron-H.
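To make the fixed-size hidden state concrete, the sketch below implements the textbook discrete linear state-space update, h_t = A·h_{t-1} + B·x_t with output y_t = C·h_t. The matrices and dimensions are invented for illustration; this is the generic form of an SSM, not the gated, selective Mamba2 layer Bamba actually uses.

```python
import numpy as np

# A minimal discrete linear state-space model:
#   h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t
# Generic illustration only; Bamba's Mamba2 layers add selectivity and gating on top.

rng = np.random.default_rng(0)
d_state, d_in = 16, 8                  # state and input sizes are fixed up front
A = 0.9 * np.eye(d_state)              # a diagonal A, echoing the "diagonal state spaces" trick
B = 0.1 * rng.normal(size=(d_state, d_in))
C = 0.1 * rng.normal(size=(d_in, d_state))

def run_ssm(inputs: np.ndarray) -> np.ndarray:
    """Process a (seq_len, d_in) sequence step by step with O(d_state) memory."""
    h = np.zeros(d_state)              # the entire "summary of history" lives in this vector
    outputs = []
    for x_t in inputs:                 # memory use does not grow with sequence length
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

print(run_ssm(rng.normal(size=(32_000, d_in))).shape)   # a long sequence, one small state
```

However long the input grows, the model's memory of the past stays in that one fixed-size vector; interleaving layers of this kind with ordinary attention blocks is what lets a hybrid like Bamba keep transformer-style expressiveness while shedding most of the KV cache.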
Overcoming the KV-cache bottleneck
IBM Research's Granite LLMs for enterprise have always prioritised efficiency, and as Granite grew, researchers turned their attention to the quadratic bottleneck. After confirming Nvidia's findings internally, they built their own hybrid, Bamba-9B, on the Mamba2 architecture and released practically all of its components open-source, including the data, training recipes, IBM's data loader for large-scale distributed training, and a quantisation framework to reduce storage and inference costs.

Bamba was first trained on 2 trillion tokens (words and word fragments). Encouraged by the results, the team trained it on an additional trillion tokens, then quantised it from 16-bit floating-point precision to 8 bits, shrinking the model from 18 GB to 9 GB. Thanks to its design and training data, Bamba matches Meta's Llama-3.1 8B model, which was trained on seven times more data, on key benchmarks.

The next challenge was optimising SSM execution in vLLM, the “virtual LLM” that has become the most popular open-source inference server for large language models. The Bamba team worked with Red Hat to integrate the model, which is difficult because SSMs require customised state management. When Bamba was published late last year, Ganti, who leads the effort at IBM Research, asked the community to help improve it; Bamba's Hugging Face introduction read, “Let's work together to overcome the KV-cache bottleneck.” Although it was trained on 4,000-token sequences, Bamba can already handle 32,000-token conversations, and Ganti said that with full SSM support in vLLM it could reach one million tokens of context and run five times faster than a transformer.
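The 18 GB to 9 GB figure follows directly from the bit width: roughly nine billion parameters at 16 bits each occupy about 18 GB, and halving the precision halves the weight storage. The sketch below only reproduces that arithmetic; it counts weights alone, not activations or the KV cache.

```python
# Back-of-envelope weight-storage estimate for a 9B-parameter model.
# Rough arithmetic only: ignores activations, the KV cache, and runtime overhead.

N_PARAMS = 9e9

def weight_gb(bits_per_param: int) -> float:
    return N_PARAMS * bits_per_param / 8 / 1e9   # bits -> bytes -> GB (decimal)

print(f"16-bit weights: ~{weight_gb(16):.0f} GB")   # ~18 GB, as originally released
print(f" 8-bit weights: ~{weight_gb(8):.0f} GB")    # ~9 GB after quantisation
```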
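For readers who want to try the model, here is a minimal generation sketch using the Hugging Face transformers library. The repository name ibm-ai-platform/Bamba-9B is an assumption to be checked against the published model card, as is whether your installed transformers and vLLM versions include the hybrid Mamba2 support discussed above.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# The checkpoint name is assumed to be "ibm-ai-platform/Bamba-9B"; verify it on the model card,
# and make sure your transformers build supports Bamba's hybrid Mamba2 layers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-ai-platform/Bamba-9B"          # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "State-space models handle long sequences efficiently because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same checkpoint should also be servable through vLLM (for example with `vllm serve`) once the SSM state management described above is supported in your build.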