xLSTM : A Comprehensive Guide to Extended Long Short-Term Memory
For over two decades, Sepp Hochreiter’s pioneering Long Short-Term Memory (LSTM) architecture has been instrumental in numerous deep learning breakthroughs and real-world applications. From generating natural language to powering speech recognition systems, LSTMs have been a driving force behind the AI revolution.
However, even the creator of LSTMs recognized the inherent limitations that kept them from realizing their full potential. Shortcomings such as the inability to revise stored information, constrained memory capacity, and a lack of parallelizability paved the way for transformers and other architectures to surpass LSTMs on more complex language tasks.
But in a recent development, Hochreiter and his team at NXAI have introduced a new variant called extended LSTM (xLSTM) that addresses these long-standing issues. Presented in a recent research paper, xLSTM builds upon the foundational ideas that made LSTMs so powerful, while overcoming their key weaknesses through architectural innovations.
At the core of xLSTM are two novel components: exponential gating and enhanced memory structures. Exponential gating allows for more flexible control over the flow of information, enabling xLSTMs to effectively revise decisions as new context is encountered. Meanwhile, the introduction of matrix memory vastly increases storage capacity compared to traditional scalar LSTMs.
But the enhancements don’t stop there. By leveraging techniques borrowed from large language models like parallelizability and residual stacking of blocks, xLSTMs can efficiently scale to billions of parameters. This unlocks their potential for modeling extremely long sequences and context windows – a capability critical for complex language understanding.
The implications of Hochreiter’s latest creation are monumental. Imagine virtual assistants that can reliably track context over hours-long conversations. Or language models that generalize more robustly to new domains after training on broad data. Applications span everywhere LSTMs made an impact – chatbots, translation, speech interfaces, program analysis and more – but now turbocharged with xLSTM’s breakthrough capabilities.
In this deep technical guide, we’ll dive into the architectural details of xLSTM, examining its novel components: the scalar and matrix LSTM variants, exponential gating mechanisms, memory structures, and more. You’ll gain insights from experimental results showcasing xLSTM’s impressive performance gains over state-of-the-art architectures such as transformers and the latest recurrent models.
Understanding the Origins: The Limitations of LSTM
Before we dive into the world of xLSTM, it’s essential to understand the limitations that traditional LSTM architectures have faced. These limitations have been the driving force behind the development of xLSTM and other alternative approaches.
Inability to Revise Storage Decisions: One of the primary limitations of LSTM is its difficulty revising a stored value when a more relevant vector is encountered later in the sequence. This leads to suboptimal performance on tasks that require dynamic updates to stored information.
Limited Storage Capacities: LSTMs compress information into scalar cell states, which can limit their ability to effectively store and retrieve complex data patterns, particularly when dealing with rare tokens or long-range dependencies.
Lack of Parallelizability: The memory mixing mechanism in LSTMs, which involves hidden-hidden connections between time steps, enforces sequential processing, hindering the parallelization of computations and limiting scalability.
These limitations have paved the way for the emergence of Transformers and other architectures that have surpassed LSTMs in certain aspects, particularly when scaling to larger models.
The xLSTM Architecture
Extended LSTM (xLSTM) family
At the core of xLSTM lie two main modifications to the traditional LSTM framework: exponential gating and novel memory structures. These enhancements introduce two new variants of LSTM, known as sLSTM (scalar LSTM) and mLSTM (matrix LSTM).
sLSTM: The Scalar LSTM with Exponential Gating and Memory Mixing
Exponential Gating: sLSTM incorporates exponential activation functions for input and forget gates, enabling more flexible control over information flow.
Normalization and Stabilization: To prevent numerical instabilities, sLSTM introduces a normalizer state that sums each input gate activation weighted by the product of all subsequent forget gates.
Memory Mixing: sLSTM supports multiple memory cells and allows for memory mixing via recurrent connections, enabling the extraction of complex patterns and state tracking capabilities.
mLSTM: The Matrix LSTM with Enhanced Storage Capacities
Matrix Memory: Instead of a scalar memory cell, mLSTM utilizes a matrix memory, increasing its storage capacity and enabling more efficient retrieval of information.
Covariance Update Rule: mLSTM employs a covariance update rule, inspired by Bidirectional Associative Memories (BAMs), to store and retrieve key-value pairs efficiently.
Parallelizability: By abandoning memory mixing, mLSTM achieves full parallelizability, enabling efficient computations on modern hardware accelerators.
These two variants, sLSTM and mLSTM, can be integrated into residual block architectures, forming xLSTM blocks. By residually stacking these xLSTM blocks, researchers can construct powerful xLSTM architectures tailored for specific tasks and application domains.
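The residual stacking described above can be sketched in a few lines of NumPy. Each sLSTM or mLSTM block sits inside a pre-norm residual wrapper, and blocks are stacked by repeated application. The helper names here are illustrative, not from the official implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Simple layer normalization over the feature dimension."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, layer):
    """Pre-norm residual wrapper: y = x + layer(norm(x))."""
    return x + layer(layer_norm(x))

def xlstm_stack(x, layers):
    """Residually stack xLSTM blocks. Each entry of `layers` would be an
    sLSTM or mLSTM block; any shape-preserving callable works here."""
    for layer in layers:
        x = residual_block(x, layer)
    return x
```

Because each block only adds a residual update, gradients flow directly through the skip connections, which is what makes deep stacks of these blocks trainable at scale.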
The Math
Traditional LSTM:
The original LSTM architecture introduced the constant error carousel and gating mechanisms to overcome the vanishing gradient problem in recurrent neural networks.
The repeating module in an LSTM
The LSTM memory cell updates are governed by the following equations:
Cell State Update: ct = ft ⊙ ct-1 + it ⊙ zt
Hidden State Update: ht = ot ⊙ tanh(ct)
Where:
ct is the cell state vector at time t
ft is the forget gate vector
it is the input gate vector
ot is the output gate vector
zt is the candidate cell input (a tanh-squashed function of the current input and previous hidden state)
⊙ denotes element-wise multiplication
The gates ft, it, and ot control what information gets stored, forgotten, and outputted from the cell state ct, mitigating the vanishing gradient issue.
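The update equations above can be made concrete with a minimal NumPy sketch of a single LSTM step. The parameter layout (stacked gate weights) is an assumption for brevity; real implementations fuse these operations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One classic LSTM step. W, R, b stack the parameters of the
    input (i), forget (f), cell-input (z), and output (o) gates."""
    d = h_prev.shape[0]
    pre = W @ x + R @ h_prev + b      # stacked pre-activations, shape (4d,)
    i = sigmoid(pre[0*d:1*d])         # input gate
    f = sigmoid(pre[1*d:2*d])         # forget gate
    z = np.tanh(pre[2*d:3*d])         # candidate cell input
    o = sigmoid(pre[3*d:4*d])         # output gate
    c = f * c_prev + i * z            # ct = ft ⊙ ct-1 + it ⊙ zt
    h = o * np.tanh(c)                # ht = ot ⊙ tanh(ct)
    return h, c
```

Note how the forget gate f scales the previous cell state multiplicatively, which is the mechanism that lets gradients survive over many time steps.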
xLSTM with Exponential Gating:
The xLSTM architecture introduces exponential gating to allow more flexible control over the information flow. For the scalar xLSTM (sLSTM) variant:
Cell State Update: ct = ft ⊙ ct-1 + it ⊙ zt
Normalizer State Update: nt = ft ⊙ nt-1 + it
Hidden State Update: ht = ot ⊙ (ct / nt)
Input Gate: it = exp(W_i xt + R_i ht-1 + b_i)
Forget Gate: ft = σ(W_f xt + R_f ht-1 + b_f) or ft = exp(W_f xt + R_f ht-1 + b_f)
The exponential activation functions for the input (it) and forget (ft) gates, along with the normalizer state nt, enable more effective control over memory updates and revising stored information.
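Under those equations, a single sLSTM step can be sketched as below. This sketch omits the extra stabilizer state the authors use to keep the exponentials from overflowing, so it is illustrative only; the parameter naming is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, p):
    """One sLSTM step with an exponential input gate and a normalizer
    state. `p` maps gate names to (W, R, b) parameter triples."""
    def pre(g):
        W, R, b = p[g]
        return W @ x + R @ h_prev + b
    i = np.exp(pre("i"))       # exponential input gate
    f = sigmoid(pre("f"))      # sigmoid forget gate (exp is also allowed)
    z = np.tanh(pre("z"))      # candidate cell input
    o = sigmoid(pre("o"))      # output gate
    c = f * c_prev + i * z     # cell state update
    n = f * n_prev + i         # normalizer state update
    h = o * (c / n)            # normalized hidden state
    return h, c, n
```

Since i = exp(...) is strictly positive, the normalizer n stays strictly positive and the division is safe; in practice the exponential gates are additionally stabilized in log space.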
xLSTM with Matrix Memory:
For the matrix xLSTM (mLSTM) variant with enhanced storage capacity:
Cell State Update: Ct = ft ⊙ Ct-1 + it ⊙ (vt kt^T)
Normalizer State Update: nt = ft ⊙ nt-1 + it ⊙ kt
Hidden State Update: ht = ot ⊙ (Ct qt / max(|nt^T qt|, 1))
Where:
Ct is the matrix cell state
vt and kt are the value and key vectors
qt is the query vector used for retrieval
These key equations highlight how xLSTM extends the original LSTM formulation with exponential gating for more flexible memory control and matrix memory for enhanced storage capabilities. The combination of these innovations allows xLSTM to overcome limitations of traditional LSTMs.
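A single mLSTM step following these equations can be sketched as below. The gates are passed in as precomputed scalars to keep the example short (in the model they are produced from the input), and the keys are scaled by 1/√d as in attention; treat this as an illustrative sketch rather than the reference implementation:

```python
import numpy as np

def mlstm_step(q, k, v, C_prev, n_prev, i_gate, f_gate, o_gate):
    """One mLSTM step: matrix memory with a covariance-style update.
    Gates are scalars here; the model computes them from the input."""
    d = k.shape[0]
    k = k / np.sqrt(d)                             # key scaling
    C = f_gate * C_prev + i_gate * np.outer(v, k)  # Ct = ft Ct-1 + it vt kt^T
    n = f_gate * n_prev + i_gate * k               # normalizer update
    h_tilde = C @ q / max(abs(n @ q), 1.0)         # normalized retrieval
    return o_gate * h_tilde, C, n
```

Storing a single key-value pair and then querying with the same key recovers the stored value, which is exactly the associative-memory behavior the covariance update rule is designed to provide.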
Key Features and Advantages of xLSTM
Ability to Revise Storage Decisions: Thanks to exponential gating, xLSTM can effectively revise stored values when encountering more relevant information, overcoming a significant limitation of traditional LSTMs.
Enhanced Storage Capacities: The matrix memory in mLSTM provides increased storage capacity, enabling xLSTM to handle rare tokens, long-range dependencies, and complex data patterns more effectively.
Parallelizability: The mLSTM variant of xLSTM is fully parallelizable, allowing for efficient computations on modern hardware accelerators, such as GPUs, and enabling scalability to larger models.
Memory Mixing and State Tracking: The sLSTM variant of xLSTM retains the memory mixing capabilities of traditional LSTMs, enabling state tracking and making xLSTM more expressive than Transformers and State Space Models for certain tasks.
Scalability: By leveraging the latest techniques from modern Large Language Models (LLMs), xLSTM can be scaled to billions of parameters, unlocking new possibilities in language modeling and sequence processing tasks.
Experimental Evaluation: Showcasing xLSTM’s Capabilities
The research paper presents a comprehensive experimental evaluation of xLSTM, highlighting its performance across various tasks and benchmarks. Here are some key findings:
Synthetic Tasks and Long Range Arena:
xLSTM excels at solving formal language tasks that require state tracking, outperforming Transformers, State Space Models, and other RNN architectures.
In the Multi-Query Associative Recall task, xLSTM demonstrates enhanced memory capacities, surpassing non-Transformer models and rivaling the performance of Transformers.
On the Long Range Arena benchmark, xLSTM exhibits consistent strong performance, showcasing its efficiency in handling long-context problems.
Language Modeling and Downstream Tasks:
When trained on 15B tokens from the SlimPajama dataset, xLSTM outperforms existing methods, including Transformers, State Space Models, and other RNN variants, in terms of validation perplexity.
As the models are scaled to larger sizes, xLSTM continues to maintain its performance advantage, demonstrating favorable scaling behavior.
In downstream tasks such as common sense reasoning and question answering, xLSTM emerges as the best method across various model sizes, surpassing state-of-the-art approaches.
Performance on PALOMA Language Tasks:
Evaluated on 571 text domains from the PALOMA language benchmark, xLSTM[1:0] (a variant composed entirely of mLSTM blocks) achieves lower perplexity than Mamba on 99.5% of the domains, than Llama on 85.1%, and than RWKV-4 on 99.8%.
Scaling Laws and Length Extrapolation:
When trained on 300B tokens from SlimPajama, xLSTM exhibits favorable scaling laws, indicating its potential for further performance improvements as model sizes increase.
In sequence length extrapolation experiments, xLSTM models maintain low perplexities even for contexts significantly longer than those seen during training, outperforming other methods.
These experimental results highlight the remarkable capabilities of xLSTM, positioning it as a promising contender for language modeling tasks, sequence processing, and a wide range of other applications.
Real-World Applications and Future Directions
The potential applications of xLSTM span a wide range of domains, from natural language processing and generation to sequence modeling, time series analysis, and beyond. Here are some exciting areas where xLSTM could make a significant impact:
Language Modeling and Text Generation: With its enhanced storage capacities and ability to revise stored information, xLSTM could revolutionize language modeling and text generation tasks, enabling more coherent, context-aware, and fluent text generation.
Machine Translation: The state tracking capabilities of xLSTM could prove invaluable in machine translation tasks, where maintaining contextual information and understanding long-range dependencies is crucial for accurate translations.
Speech Recognition and Generation: The parallelizability and scalability of xLSTM make it well-suited for speech recognition and generation applications, where efficient processing of long sequences is essential.
Time Series Analysis and Forecasting: xLSTM’s ability to handle long-range dependencies and effectively store and retrieve complex patterns could lead to significant improvements in time series analysis and forecasting tasks across various domains, such as finance, weather prediction, and industrial applications.
Reinforcement Learning and Control Systems: The potential of xLSTM in reinforcement learning and control systems is promising, as its enhanced memory capabilities and state tracking abilities could enable more intelligent decision-making and control in complex environments.
Architectural Optimizations and Hyperparameter Tuning
While the current results are promising, there is still room for optimizing the xLSTM architecture and fine-tuning its hyperparameters. Researchers could explore different combinations of sLSTM and mLSTM blocks, varying the ratios and placements within the overall architecture. Additionally, a systematic hyperparameter search could lead to further performance improvements, particularly for larger models.
Hardware-Aware Optimizations: To fully leverage the parallelizability of xLSTM, especially the mLSTM variant, researchers could investigate hardware-aware optimizations tailored for specific GPU architectures or other accelerators. This could involve optimizing the CUDA kernels, memory management strategies, and leveraging specialized instructions or libraries for efficient matrix operations.
Integration with Other Neural Network Components: Exploring the integration of xLSTM with other neural network components, such as attention mechanisms, convolutions, or self-supervised learning techniques, could lead to hybrid architectures that combine the strengths of different approaches. These hybrid models could potentially unlock new capabilities and improve performance on a wider range of tasks.
Few-Shot and Transfer Learning: Exploring the use of xLSTM in few-shot and transfer learning scenarios could be an exciting avenue for future research. By leveraging its enhanced memory capabilities and state tracking abilities, xLSTM could potentially enable more efficient knowledge transfer and rapid adaptation to new tasks or domains with limited training data.
Interpretability and Explainability: As with many deep learning models, the inner workings of xLSTM can be opaque and difficult to interpret. Developing techniques for interpreting and explaining the decisions made by xLSTM could lead to more transparent and trustworthy models, facilitating their adoption in critical applications and promoting accountability.
Efficient and Scalable Training Strategies: As models continue to grow in size and complexity, efficient and scalable training strategies become increasingly important. Researchers could explore techniques such as model parallelism, data parallelism, and distributed training approaches specifically tailored for xLSTM architectures, enabling the training of even larger models and potentially reducing computational costs.
These are a few potential future research directions and areas for further exploration with xLSTM.
Conclusion
The introduction of xLSTM marks a significant milestone in the pursuit of more powerful and efficient language modeling and sequence processing architectures. By addressing the limitations of traditional LSTMs and leveraging novel techniques such as exponential gating and matrix memory structures, xLSTM has demonstrated remarkable performance across a wide range of tasks and benchmarks.
However, the journey does not end here. As with any groundbreaking technology, xLSTM presents exciting opportunities for further exploration, refinement, and application in real-world scenarios. As researchers continue to push the boundaries of what is possible, we can expect to witness even more impressive advancements in the field of natural language processing and artificial intelligence.