#LiveCodeBench
Text
NVIDIA Llama Nemotron Ultra Redefines Open AI Model Performance

AI today demands deep reasoning, complex problem-solving, and strong adaptability across business, finance, customer service, and healthcare. Producing words and graphics is no longer enough.
The latest NVIDIA Llama Nemotron Ultra reasoning model is now available. It improves compute efficiency and delivers leading accuracy among open-source reasoning and coding models. The model, its weights, and its training data are published on Hugging Face, supporting AI workflow automation, research assistants, and coding copilots.
NVIDIA Llama Nemotron Ultra excels at coding, math, and science
Llama Nemotron Ultra redefines what AI can achieve in scientific reasoning, coding, and math. Because high-impact AI requires both depth and adaptability, the model is built for real-world enterprise needs, including copilots, knowledge assistants, and automated workflows, and is post-trained for advanced reasoning, retrieval-augmented generation (RAG), human-aligned chat, and tool use.
Llama Nemotron Ultra builds on Llama 3.1, using synthetic and commercially viable data along with advanced training techniques. For agentic workflows, it offers cost-effective, high-performance AI with strong reasoning. NVIDIA has also released two post-training datasets to help the community construct reasoning models.
With these resources, the community can start building cost-effective, high-performing models of its own. NVIDIA's win in the Kaggle AI Mathematical Olympiad demonstrated the approach's effectiveness in competitive reasoning, and the data, techniques, and insights from that effort flowed into Llama Nemotron Ultra. Each of these areas is covered in depth below.
GPQA Diamond benchmark
Figures 1, 2, and 3 show that the Llama Nemotron Ultra reasoning model outperforms other open models on a scientific reasoning benchmark. The GPQA Diamond benchmark consists of 198 meticulously designed questions in biology, physics, and chemistry, written by PhD-level experts.
These graduate-level problems demand deep comprehension and multistep reasoning, not mere recall. While PhD experts typically reach about 65% accuracy on this challenging subset, Llama Nemotron Ultra set a new standard with 76% accuracy, making it the top open model in scientific reasoning, as the Vellum and Artificial Analysis leaderboards illustrate.
LiveCodeBench benchmark
Figures 4 and 5 show that Llama Nemotron Ultra performs well not only on complex scientific benchmarks but also on LiveCodeBench, a robust test of real-world coding ability. LiveCodeBench focuses on broad coding tasks such as code generation, debugging, self-repair, test-output prediction, and execution.
Every problem in LiveCodeBench is date-stamped to enable unbiased, out-of-distribution evaluation, and by prioritizing problem-solving over memorized code output, the benchmark measures true generalization. Both the LiveCodeBench GitHub leaderboard and Artificial Analysis show these results.
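In practice, this kind of contamination control reduces to a date filter: only problems released after a model's training cutoff count toward its score. A minimal sketch of the idea, with hypothetical field names and cutoff date:

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench's actual schema may differ.
problems = [
    {"id": "p1", "released": date(2024, 3, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
]
training_cutoff = date(2024, 6, 1)  # hypothetical model training cutoff

# Evaluate only on problems the model cannot have seen during training.
fresh = [p for p in problems if p["released"] > training_cutoff]
print([p["id"] for p in fresh])  # -> ['p2']
```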
AIME benchmark
Llama Nemotron Ultra also leads other open models on the AIME benchmark, which tests mathematical reasoning. The live LLM leaderboard tracks the current standings.
Open data and tools
One of Llama Nemotron Ultra's greatest strengths is its openness. NVIDIA published both the model and the two commercially viable datasets that powered its reasoning abilities, which became top-trending datasets on Hugging Face.
The OpenCodeReasoning dataset comprises over 735K Python examples drawn from 28K questions on popular competitive programming platforms. Designed for supervised fine-tuning (SFT), it lets enterprise developers build advanced reasoning into their own models, helping organizations create smarter, more robust coding capabilities for their AI systems.
The Llama-Nemotron-Post-Training Dataset was synthetically generated using open and public models, including DeepSeek-R1, Nemotron, Qwen, and Llama. Because it improves a model's performance on essential reasoning tasks, it is well suited to general reasoning, math, coding, and instruction following, and it helps developers build more capable and coherent AI systems by tuning models to understand and respond to complex, multistep instructions.
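Both datasets can be pulled straight from the Hugging Face hub for experimentation. A minimal sketch using the datasets library; the dataset IDs and split names below are inferred from the names in this post, so check the hub pages before running:

```python
from datasets import load_dataset

# IDs and splits are assumptions based on the dataset names above.
code_ds = load_dataset("nvidia/OpenCodeReasoning", split="train")
post_ds = load_dataset("nvidia/Llama-Nemotron-Post-Training-Dataset", split="train")

# Inspect sizes and fields before wiring either into an SFT pipeline.
print(len(code_ds), code_ds.column_names)
print(len(post_ds), post_ds.column_names)
```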
By releasing these datasets free on Hugging Face, NVIDIA aims to democratize reasoning-model training. Startups, research labs, and enterprises can now use the same resources as NVIDIA's internal teams, accelerating the adoption of agentic AI that can reason, plan, and act across complicated workflows.
Enterprise-ready speed, precision, and flexibility
As a commercially viable model, Llama Nemotron Ultra can power task-oriented assistants, autonomous research agents, customer-service chatbots, and coding copilots. Its strong performance on scientific reasoning and code benchmarks makes it well suited to real-world applications that demand precision, flexibility, and multistep problem-solving.
Llama Nemotron Ultra delivers the highest throughput and accuracy among open reasoning models, and throughput correlates closely with cost savings. Neural Architecture Search (NAS) was used to shrink the model's memory footprint while maintaining performance in the data center, allowing more workloads to run on fewer GPUs.
A robust post-training pipeline combining supervised fine-tuning and reinforcement learning (RL) improved the model's reasoning and non-reasoning skills. The model's reasoning "on" and "off" modes let businesses invoke reasoning only when needed, reducing overhead for simpler, non-agentic tasks.
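When the model is served through an OpenAI-compatible endpoint (for example, an NVIDIA NIM deployment), the toggle amounts to a system-prompt switch. A minimal sketch; the control phrase, endpoint, and model ID are assumptions, so consult the model card for exact usage:

```python
from openai import OpenAI

# Assumed local OpenAI-compatible endpoint (e.g., a NIM deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    # Assumed control phrase; the model card documents the exact wording.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-ultra-253b-v1",  # assumed model ID
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Cheap path for simple queries; full reasoning only when the task needs it.
print(ask("What is the capital of France?", reasoning=False))
print(ask("Prove that the sum of two odd integers is even.", reasoning=True))
```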
#technology#technews#govindhtech#news#technologynews#Llama Nemotron Ultra#Llama Nemotron#AIME benchmark#LiveCodeBench#GPQA Diamond benchmark#NVIDIA Llama Nemotron
Text
Techmeme: [Thread] A US paper shows the best frontier LLM models solve 0% of hard coding problems from Codeforces, ICPC, and IOI, domains where expert humans still excel (Rohan Paul/@rohanpaul_ai)
Rohan Paul / @rohanpaul_ai: [Thread] A US paper shows the best frontier LLM models solve 0% of hard coding problems from Codeforces, ICPC, and IOI, domains where expert humans still excel — This is really bad news for LLMs' coding skills. ☹️ LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI ("International [image] June 17, 2025 at 04:05AM
Text
China’s DeepSeek Upgrades Its R1 AI Model, Intensifying Global Competition
Chinese startup DeepSeek has discreetly launched an upgraded version of its widely discussed R1 artificial intelligence reasoning model. The update was released on the AI repository Hugging Face without any formal announcement, continuing the company’s pattern of quiet disruption in the competitive AI landscape.
DeepSeek captured global attention earlier this year when its open-source R1 model outperformed models from major tech players, including Meta and OpenAI. The model’s rapid development and minimal cost triggered market volatility, erasing billions in market value from U.S. tech stocks such as Nvidia. Although these losses were short-lived, they underscored the growing threat of leaner, faster-developing AI challengers.
The upgraded version of R1 is a reasoning model, designed to handle complex tasks using logical step-by-step processes. According to LiveCodeBench, a benchmarking platform for AI models, the new R1 version ranks just below OpenAI’s o4-mini and o3 models in performance.
Text
1️⃣ GPT o3
2️⃣ DeepSeek-R1-0528
3️⃣ Claude Opus 4
4️⃣ Gemini 2.5 PRO
This radar chart reveals how today's leading AI models compare across key capabilities that directly impact creative workflows, from code generation (LiveCodeBench, HumanEval), which can help automate repetitive tasks or create interactive experiences, to advanced reasoning (MATH, AIME_2024), which enables more sophisticated problem-solving in complex creative projects.
OpenAI's GPT o3 (shown in bright blue) demonstrates exceptional all-around performance, particularly excelling in general knowledge (MMLU) and coding tasks, making it a versatile choice for creators who need an AI assistant that can handle everything from conceptual brainstorming to technical implementation.
DeepSeek and Claude Opus 4 show distinctive strengths in mathematical and analytical tasks, which translates to better performance in data-driven creative work like generative art algorithms or music composition, while Gemini 2.5 PRO's balanced profile suggests reliability across diverse creative applications.
#machinelearning#artificialintelligence#art#digitalart#mlart#datascience#ai#algorithm#bigdata#gpt o3#deepseek#claude opus 4#gemini 2.5 PRO
Text
DeepCoder-14B: a new open-source coding model
DeepCoder is an innovative platform that leverages AI to revolutionize code generation. Its powerful open-source model, DeepCoder-14B-Preview, is fine-tuned for coding tasks and achieves a remarkable 60.6% Pass@1 accuracy on LiveCodeBench.
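Pass@1 here is the fraction of problems solved by the model's first sampled solution. For context, the standard unbiased estimator from the HumanEval paper generalizes this to pass@k, with Pass@1 as the k=1 case:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: n samples drawn, c of them correct.
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=6, k=1))  # equals c/n = 0.6 for k=1
```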
The DeepCoder-14B-Preview model is a strong code-reasoning LLM, fine-tuned from DeepSeek-R1-Distill-Qwen-14B using distributed reinforcement learning to scale to long context lengths. The model is the result of a collaboration between Together AI and Agentica. DeepCoder is designed to help developers write efficient code by generating solutions from problem statements instantly.
It supports various coding tasks, including competitive programming, code debugging, and algorithmic solutions. The platform is accessible through Ollama, allowing users to deploy it with simple commands.
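Once pulled, the model can be queried through Ollama's local REST API. A minimal sketch; the model tag below is an assumption based on this post, so check the Ollama library for the exact name:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "deepcoder",  # assumed tag; verify with `ollama list`
        "prompt": "Write a Python function that checks whether a string is a palindrome.",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```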
Text
LG AI Research EXAONE Deep language model test: another contender for the throne among "thinking" models? LG AI Research has introduced the EXAONE Deep family. Three main models were announced: 2.4B, 7.8B, and 32B parameters, optimized specifically for reasoning tasks.
- The smaller 2.4B and 7.8B models outperform other models of similar size. According to LG's results, the 7.8B model even beats OpenAI's o1-mini, a closed-source reasoning model.
- The largest model, the 32B, delivers competitive performance against leading open-source models such as QwQ-32B and DeepSeek-R1, and even beats the distilled versions of DeepSeek-R1.
- The models were trained mainly on reasoning-focused datasets containing long chains of thought, using proven techniques such as supervised fine-tuning (SFT), direct preference optimization (DPO), and online reinforcement learning (Online RL).
- The EXAONE Deep models are designed to reason step by step, as shown by the structured chain of thought they place between thought tags; the final answer is a concise summary of that reasoning.
- Performance was tested on a range of benchmarks covering math (MATH-500, AIME, CSAT), science (GPQA Diamond), coding (LiveCodeBench), and general knowledge (MMLU, MMLU-Pro). The results show that the EXAONE Deep models deliver improved reasoning capabilities at each size.
- The models are freely available for research purposes and can be downloaded from Hugging Face. Using them requires accepting the EXAONE AI Model License Agreement 1.1 - NC, which does not permit commercial use.
It is worth noting that, according to feedback on GitHub, the 7.8B model may have issues with instruction clarity, structured reasoning, and overcomplication. * Become a member of the Mp3Pintyo channel * https://www.youtube.com/channel/UC-3YkVvPQbZiApqrRXEOaPg/join *** DISCORD *** Mp3Pintyo server: https://ift.tt/r65ELW2 *** Support *** Patreon: https://ift.tt/bcw2YXv *** Links *** Research material: https://ift.tt/rQyNjlb LG AI Research Blog: https://ift.tt/OACh7Ne Hugging Face models: https://ift.tt/xPvGe1H GitHub: https://ift.tt/UAxFPM2 *** BUYING MY ARTS *** ► https://ift.tt/Cxgwzti ► https://ift.tt/2b46Qlm *** STAY ACTIVE FOR A FOLLOW *** ►TWITTER: https://twitter.com/Mp3Pintyo ►INSTAGRAM: https://ift.tt/7s1ZlTv ►PINTEREST: https://ift.tt/1m3ZVku ►SOUNDCLOUD: https://ift.tt/J3SuQEL This video demonstrates an application of artificial intelligence. AI makes our lives easier and helps us in many areas. #ai #mesterségesintelligencia #mi #mp3pintyo via YouTube https://www.youtube.com/watch?v=4YRaBG_o0KQ
Text
Steps to build a free GitHub Copilot environment with DeepSeek-Coder-V2
Features and performance of DeepSeek-Coder-V2: DeepSeek-Coder-V2 is an open-source code-generation AI language model with performance comparable to GPT-4 Turbo. It supports 338 programming languages and handles context lengths of up to 128K tokens. On LiveCodeBench it achieves a 74.2% Pass@1 score, outperforming GPT-4 Turbo's 71.8%. Execution performance and deployment benefits: DeepSeek-Coder-V2 is a high-performance AI coding assistant that runs in a local environment, on machines such as an M2 Pro MacBook or a Windows…
Text
The Sequence Radar #554 : The New DeepSeek R1-0528 is Very Impressive
The new model excels at math and reasoning.
Next Week in The Sequence:
In our series about evals, we discuss multiturn benchmarks. The engineering section dives into the amazing Anthropic Circuits for ML interpretability. In research, we discuss some of UC Berkeley’s recent work in LLM reasoning. Our opinion section dives into the state of AI interpretability.
You can subscribe to The Sequence below:
TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
📝 Editorial: The New DeepSeek R1-0528 is Very Impressive
This week, DeepSeek AI pushed the boundaries of open-source language modeling with the release of DeepSeek R1-0528. Building on the foundation of the original R1 release, this update delivers notable gains in mathematical reasoning, code generation, and long-context understanding. With improvements derived from enhanced optimization and post-training fine-tuning, R1-0528 marks a critical step toward closing the performance gap between open models and their proprietary counterparts like GPT-4 and Gemini 1.5.
At its core, DeepSeek R1-0528 preserves the powerful 671B Mixture-of-Experts (MoE) architecture, activating 37B parameters per forward pass. This architecture delivers high-capacity performance while optimizing for efficiency, especially in inference settings. One standout feature is its support for 64K-token context windows, enabling the model to engage with substantially larger inputs—ideal for technical documents, structured reasoning chains, and multi-step planning.
In terms of capability uplift, the model shows remarkable progress in competitive benchmarks. On AIME 2025, DeepSeek R1-0528 jumped from 70% to an impressive 87.5%, showcasing an increasingly sophisticated ability to tackle complex mathematical problems. This leap highlights not just better fine-tuning, but a fundamental improvement in reasoning depth—an essential metric for models serving scientific, technical, and educational use cases.
For software engineering and development workflows, R1-0528 brings meaningful updates. Accuracy on LiveCodeBench rose from 63.5% to 73.3%, confirming improvements in structured code synthesis. The inclusion of JSON-formatted outputs and native function calling support positions the model as a strong candidate for integration into automated pipelines, copilots, and tool-augmented environments where structured outputs are non-negotiable.
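For pipelines that depend on structured outputs, the new JSON mode can be exercised through an OpenAI-compatible client. A minimal sketch, assuming DeepSeek's hosted endpoint and model name; verify both against the current documentation:

```python
from openai import OpenAI

# Endpoint and model name are assumptions based on DeepSeek's
# OpenAI-compatible API; check the current docs before use.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    response_format={"type": "json_object"},  # JSON-formatted output, per the post
    messages=[
        {"role": "user", "content": 'Reply in JSON as {"answer": <int>}: what is 17 * 23?'},
    ],
)
print(resp.choices[0].message.content)  # e.g. {"answer": 391}
```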
To ensure broad accessibility, DeepSeek also launched a distilled variant: R1-0528-Qwen3-8B. Despite its smaller footprint, this model surpasses Qwen3-8B on AIME 2024 by over 10%, while rivaling much larger competitors like Qwen3-235B-thinking. This reflects DeepSeek’s commitment to democratizing frontier performance, enabling developers and researchers with constrained compute resources to access state-of-the-art capabilities.
DeepSeek R1-0528 is more than just a model upgrade—it’s a statement. In an ecosystem increasingly dominated by closed systems, DeepSeek continues to advance the case for open, high-performance AI. By combining transparent research practices, scalable deployment options, and world-class performance, R1-0528 signals a future where cutting-edge AI remains accessible to the entire community—not just a privileged few.
Join Me for a Chat About AI Evals and Benchmarks:
🔎 AI Research
Flex-Judge: Think Once, Judge Anywhere
AI Lab: KAIST AI
Summary: FLEX-Judge is a reasoning-first multimodal evaluator trained on just 1K text-only explanations, achieving zero-shot generalization across images, audio, video, and molecular tasks while outperforming larger commercial models. It leverages textual reasoning alone to train a judge model that generalizes across modalities without modality-specific supervision.
Learning to Reason without External Rewards
AI Lab: UC Berkeley & Yale University
Summary: INTUITOR introduces a novel self-supervised reinforcement learning framework using self-certainty as intrinsic reward, matching supervised methods on math and outperforming them on code generation without any external feedback. The technique proposes self-certainty as an effective intrinsic reward signal for reinforcement learning, replacing gold labels.
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
AI Lab: Google DeepMind & Northwestern University
Summary: This paper introduces BARL, a novel Bayes-Adaptive RL algorithm that enables large language models to perform test-time reflective reasoning by switching strategies based on posterior beliefs over MDPs. The authors show that BARL significantly outperforms Markovian RL approaches in math reasoning tasks by improving token efficiency and adaptive exploration.
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
AI Lab: Microsoft Research Asia
Summary: The authors present rStar-Coder, a dataset of 418K competitive programming problems and 580K verified long-reasoning code solutions, which drastically boosts the performance of Qwen models on code reasoning benchmarks. Their pipeline introduces a robust input-output test case synthesis method and mutual-verification mechanism, achieving state-of-the-art performance even with smaller models.
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
AI Lab: Fudan University, CUHK MMLab, Shanghai AI Lab
Summary: MME-Reasoning offers a benchmark of 1,188 multimodal reasoning tasks spanning inductive, deductive, and abductive logic, revealing significant limitations in current MLLMs’ logical reasoning. The benchmark includes multiple question types and rigorous metadata annotations, exposing reasoning gaps especially in abductive tasks.
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
AI Lab: Carnegie Mellon University, NOVA LINCS, INESC-ID
Summary: DeepResearchGym is an open-source sandbox providing reproducible search APIs and evaluation protocols over ClueWeb22 and FineWeb for benchmarking deep research agents. It supports scalable dense retrieval and long-form response evaluation using LLM-as-a-judge assessments across dimensions like relevance and factual grounding.
Fine-Tuning Large Language Models with User-Level Differential Privacy
AI Lab: Google Research
Summary: This study compares two scalable user-level differential privacy methods (ELS and ULS) for LLM fine-tuning, with a novel privacy accountant that tightens DP guarantees for ELS. Experiments show that ULS generally offers better utility under large compute budgets or strong privacy settings, while maintaining scalability to hundreds of millions of parameters and users.
🤖 AI Tech Releases
DeepSeek-R1-0528
DeepSeek released a new version of its marquee R1 model.
Anthropic Circuits
Anthropic open sourced its circuit interpretability technology.
Perplexity Labs
Perplexity released a new tool that can generate charts, spreadsheets and dashboards.
Codestral Embed
Mistral released Codestral Embed, an embedding model specialized in coding.
🛠 AI in Production
Multi-Task Learning at Netflix
Netflix shared some details about its multi-task prediction strategy for user intent.
📡AI Radar
Salesforce agreed to buy Informatica for $8 billion.
xAI and Telegram partnered to enable Grok for its users.
Netflix’s Reed Hastings joined Anthropic’s board of directors.
Grammarly raised $1 billion to accelerate sales and acquisitions.
Spott raised $3.2 million for its AI-native recruiting platform.
Buildots raised $45 million for its AI for construction platform.
Context raised $11 million to power an AI-native office suite.
Rillet raised $25 million to enable AI for mid-market accounting.
HuggingFace unveiled two new open source robots.
#2024#2025#Accessibility#accounting#acquisitions#agents#ai#algorithm#amazing#amp#anthropic#APIs#architecture#Art#audio#benchmark#benchmarking#benchmarks#billion#board#budgets#Building#Buildots#Carnegie Mellon University#charts#code#code generation#coding#Community#Competitive Programming
Text
Gemini Diffusion: Google’s new Experimental Research Model
Google DeepMind has developed Gemini Diffusion, an experimental research model described as a state-of-the-art text diffusion model.
The model uses diffusion to generate language, a method that differs from autoregressive language models. Traditional autoregressive models emit text sequentially, one token at a time, an arrangement that can limit output quality and coherence and slow generation.
Diffusion models, by contrast, learn to generate outputs by progressively refining noise. Instead of predicting text token by token, they iteratively transform noisy input into coherent output. This iterative refinement lets them converge on candidate solutions quickly and correct errors along the way, which makes them particularly strong at editing tasks in code and math. Like Google DeepMind's latest image and video synthesis models, they learn to produce outputs by turning random noise into intelligible text or code.
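The contrast between the two decoding styles can be sketched in a few lines of toy code. The snippet below is purely illustrative, not Google's implementation: the autoregressive loop commits to one token at a time, while the diffusion loop starts from noise over a whole block and refines every position across a fixed number of steps.

```python
import random

# Toy stand-ins: a real model would condition on context; here we only
# illustrate the control flow of the two decoding strategies.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def autoregressive_generate(n_tokens):
    tokens = []
    for _ in range(n_tokens):
        tokens.append(random.choice(VOCAB))  # strictly left to right, no revisiting
    return tokens

def diffusion_generate(n_tokens, n_steps):
    block = [random.choice(VOCAB) for _ in range(n_tokens)]  # start from pure noise
    for step in range(n_steps):
        # A real text diffusion model re-predicts every position from the
        # current noisy block; this toy just "commits" a growing prefix.
        k = int(n_tokens * (step + 1) / n_steps)
        block = [tok if i < k else random.choice(VOCAB) for i, tok in enumerate(block)]
    return block

print(autoregressive_generate(6))
print(diffusion_generate(6, n_steps=4))
```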
This diffusion approach gives Gemini Diffusion its main capabilities:
Rapid response: Gemini Diffusion generates text significantly faster than even Google's fastest model to date, averaging 1,479 tokens per second across all evaluated tasks, plus a fixed overhead of 0.84 seconds (see the latency sketch after this list).
More coherent text: unlike autoregressive models, Gemini Diffusion emits whole blocks of tokens at once, producing more cohesive language and more logical responses to user queries.
Iterative refinement: the model can correct errors during generation, yielding more reliable outputs.
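Taken together, the quoted figures imply a simple latency model: a fixed overhead plus the token count divided by sampling speed. A back-of-envelope sketch using the numbers above:

```python
# Rough response-time estimate from the figures quoted above:
# 0.84 s fixed overhead plus tokens at 1,479 tokens/second.
overhead_s = 0.84
tokens_per_s = 1479

for n_tokens in (100, 500, 2000):
    latency = overhead_s + n_tokens / tokens_per_s
    print(f"{n_tokens} tokens -> ~{latency:.2f} s")
```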
Benchmarks
Despite its speed, Gemini Diffusion's external benchmark scores are comparable to those of larger models, and it reportedly codes as fast as Google's fastest model. The benchmarks compare Gemini Diffusion to Gemini 2.0 Flash-Lite across several domains:
Code: Gemini Diffusion scores higher on LBPP (v2), MBPP (76.0% vs. 75.8%), and LiveCodeBench (v6). Gemini 2.0 Flash-Lite performs slightly better on BigCodeBench (45.8% vs. 45.4%), SWE-Bench Verified (28.5% vs. 22.9%), and HumanEval (90.2% vs. 89.6%). Note that SWE-Bench Verified uses a non-agentic evaluation (single-turn edit only) with a 32K prompt length.
Science: Gemini 2.0 Flash-Lite outperforms Gemini Diffusion on GPQA Diamond (56.5% vs. 40.4%).
Mathematics: Gemini Diffusion scores 23.3% on AIME 2025, versus 20.0% for Gemini 2.0 Flash-Lite.
Reasoning: Gemini 2.0 Flash-Lite scores 21.0% on BIG-Bench Extra Hard, versus 15.0% for Gemini Diffusion.
Multilingual: Gemini 2.0 Flash-Lite has a higher Global MMLU (Lite) score (79.0% vs. 69.1%). No majority voting was used, so all results are pass@1. For comparison, the Gemini 2.0 Flash-Lite numbers were gathered through the AI Studio API using the model ID gemini-2.0-flash-lite and default sampling parameters.
Gemini Diffusion is available for experimentation as a demo, which will help shape and refine future models. Anyone can join the demo waitlist.
The Gemini Diffusion exploration of diffusion for text generation aims to give users more control, creativity, and speed when writing, as Google DeepMind applies diffusion to make its models more efficient and effective.
#GeminiDiffusion#Gemini#GeminiDiffusionmodel#GoogleDeepMind#GoogleDeepMindresearchmodel#Googleresearchmodel#technology#technews#technologynews#news#govindhtech
Text
Alibaba Qwen QwQ-32B: Scaled reinforcement learning showcase
The Qwen team at Alibaba has unveiled QwQ-32B, a 32 billion parameter AI model that demonstrates performance rivalling the much larger DeepSeek-R1. This breakthrough highlights the potential of scaling Reinforcement Learning (RL) on robust foundation models.
The Qwen team have successfully integrated agent capabilities into the reasoning model, enabling it to think critically, utilise tools, and adapt its reasoning based on environmental feedback.
“Scaling RL has the potential to enhance model performance beyond conventional pretraining and post-training methods,” the team stated. “Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models.”
QwQ-32B achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated), a testament to the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. This remarkable outcome underscores the potential of RL to bridge the gap between model size and performance.
The model has been evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities.
The results highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Llama-70B, o1-mini, and the original DeepSeek-R1.
Benchmark results:
AIME24: QwQ-32B achieved 79.5, slightly behind DeepSeek-R1-671B’s 79.8, but significantly ahead of OpenAI-o1-mini’s 63.6 and the distilled models.
LiveCodeBench: QwQ-32B scored 63.4, again closely matched by DeepSeek-R1-671B’s 65.9, and surpassing the distilled models and OpenAI-o1-mini’s 53.8.
LiveBench: QwQ-32B achieved 73.1, with DeepSeek-R1-671B scoring 71.6, and outperforming the distilled models and OpenAI-o1-mini’s 57.5.
IFEval: QwQ-32B scored 83.9, very close to DeepSeek-R1-671B’s 83.3, and leading the distilled models and OpenAI-o1-mini’s 59.1.
BFCL: QwQ-32B achieved 66.4, with DeepSeek-R1-671B scoring 62.8, demonstrating a lead over the distilled models and OpenAI-o1-mini’s 49.3.
The Qwen team’s approach involved a cold-start checkpoint and a multi-stage RL process driven by outcome-based rewards. The initial stage focused on scaling RL for math and coding tasks, utilising accuracy verifiers and code execution servers. The second stage expanded to general capabilities, incorporating rewards from general reward models and rule-based verifiers.
“We find that this stage of RL training with a small amount of steps can increase the performance of other general capabilities, such as instruction following, alignment with human preference, and agent performance, without significant performance drop in math and coding,” the team explained.
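The outcome-based rewards described above can be pictured as simple verifier functions: exact-match checking for math, and test execution for code. The sketch below illustrates the idea only; it is not the Qwen team's actual pipeline.

```python
import subprocess

def math_reward(model_answer: str, reference: str) -> float:
    # Accuracy verifier: binary outcome reward on the final answer.
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(program: str, tests: list[tuple[str, str]]) -> float:
    # Stand-in for a code execution server: run the candidate program on
    # each test input and reward the fraction of matching outputs.
    passed = 0
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                ["python", "-c", program], input=stdin,
                capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            continue  # non-terminating programs earn no reward
        passed += result.stdout.strip() == expected.strip()
    return passed / len(tests)

print(math_reward("42", "42"))                                 # 1.0
print(code_reward("print(input()[::-1])", [("abc", "cba")]))   # 1.0
```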
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible via Qwen Chat. The Qwen team views this as an initial step in scaling RL to enhance reasoning capabilities and aims to further explore the integration of agents with RL for long-horizon reasoning.
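Loading the open weights follows the usual Hugging Face Transformers pattern. A minimal sketch, assuming the model ID Qwen/QwQ-32B from the release; note that a 32B-parameter model needs substantial GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # assumed Hugging Face ID; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many r's are in the word 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```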
“As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence (AGI),” the team stated.
See also: Deepgram Nova-3 Medical: AI speech model cuts healthcare transcription errors
Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.
Explore other upcoming enterprise technology events and webinars powered by TechForge here.
#agent#agents#AGI#ai#ai & big data expo#ai model#ai speech#Alibaba#amp#Apache#Apache 2.0 license#approach#artificial#Artificial General Intelligence#Artificial Intelligence#automation#benchmark#benchmarks#Big Data#billion#bridge#california#Cloud#code#coding#Companies#comparison#comprehensive#conference#cyber
Text
DeepSeek-R1 reasoning models rival OpenAI in performance
DeepSeek has unveiled its first-generation DeepSeek-R1 and DeepSeek-R1-Zero models that are designed to tackle complex reasoning tasks.
DeepSeek-R1-Zero is trained solely through large-scale reinforcement learning (RL) without relying on supervised fine-tuning (SFT) as a preliminary step. According to DeepSeek, this approach has led to the natural emergence of “numerous powerful and interesting reasoning behaviours,” including self-verification, reflection, and the generation of extensive chains of thought (CoT).
“Notably, [DeepSeek-R1-Zero] is the first open research to validate that reasoning capabilities of LLMs can be incentivised purely through RL, without the need for SFT,” DeepSeek researchers explained. This milestone not only underscores the model’s innovative foundations but also paves the way for RL-focused advancements in reasoning AI.
However, DeepSeek-R1-Zero’s capabilities come with certain limitations. Key challenges include “endless repetition, poor readability, and language mixing,” which could pose significant hurdles in real-world applications. To address these shortcomings, DeepSeek developed its flagship model: DeepSeek-R1.
Introducing DeepSeek-R1
DeepSeek-R1 builds upon its predecessor by incorporating cold-start data prior to RL training. This additional pre-training step enhances the model’s reasoning capabilities and resolves many of the limitations noted in DeepSeek-R1-Zero.
Notably, DeepSeek-R1 achieves performance comparable to OpenAI’s much-lauded o1 system across mathematics, coding, and general reasoning tasks, cementing its place as a leading competitor.
DeepSeek has chosen to open-source both DeepSeek-R1-Zero and DeepSeek-R1 along with six smaller distilled models. Among these, DeepSeek-R1-Distill-Qwen-32B has demonstrated exceptional results—even outperforming OpenAI’s o1-mini across multiple benchmarks.
MATH-500 (Pass@1): DeepSeek-R1 achieved 97.3%, eclipsing OpenAI (96.4%) and other key competitors.
LiveCodeBench (Pass@1-COT): The distilled version DeepSeek-R1-Distill-Qwen-32B scored 57.2%, a standout performance among smaller models.
AIME 2024 (Pass@1): DeepSeek-R1 achieved 79.8%, setting an impressive standard in mathematical problem-solving.
A pipeline to benefit the wider industry
DeepSeek has shared insights into its rigorous pipeline for reasoning model development, which integrates a combination of supervised fine-tuning and reinforcement learning.
According to the company, the process involves two SFT stages to establish the foundational reasoning and non-reasoning abilities, as well as two RL stages tailored for discovering advanced reasoning patterns and aligning these capabilities with human preferences.
“We believe the pipeline will benefit the industry by creating better models,” DeepSeek remarked, alluding to the potential of their methodology to inspire future advancements across the AI sector.
One standout achievement of their RL-focused approach is the ability of DeepSeek-R1-Zero to execute intricate reasoning patterns without prior human instruction—a first for the open-source AI research community.
Importance of distillation
DeepSeek researchers also highlighted the importance of distillation—the process of transferring reasoning abilities from larger models to smaller, more efficient ones, a strategy that has unlocked performance gains even for smaller configurations.
Smaller distilled iterations of DeepSeek-R1 – such as the 1.5B, 7B, and 14B versions – were able to hold their own in niche applications. The distilled models can outperform results achieved via RL training on models of comparable sizes.
🔥 Bonus: Open-Source Distilled Models!
🔬 Distilled from DeepSeek-R1, 6 small models fully open-sourced
📏 32B & 70B models on par with OpenAI-o1-mini
🤝 Empowering the open-source community
🌍 Pushing the boundaries of **open AI**!
🐋 2/n pic.twitter.com/tfXLM2xtZZ
— DeepSeek (@deepseek_ai) January 20, 2025
For researchers, these distilled models are available in configurations spanning from 1.5 billion to 70 billion parameters, supporting Qwen2.5 and Llama3 architectures. This flexibility empowers versatile usage across a wide range of tasks, from coding to natural language understanding.
DeepSeek has adopted the MIT License for its repository and weights, extending permissions for commercial use and downstream modifications. Derivative works, such as using DeepSeek-R1 to train other large language models (LLMs), are permitted. However, users of specific distilled models should ensure compliance with the licences of the original base models, such as Apache 2.0 and Llama3 licences.
(Photo by Prateek Katyal)
See also: Microsoft advances materials discovery with MatterGen
Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.
Explore other upcoming enterprise technology events and webinars powered by TechForge here.
Tags: ai, artificial intelligence, benchmark, comparison, deepseek, deepseek-r1, large language models, llm, models, reasoning, reasoning models, reinforcement learning, test
#2024#2025#ai#ai & big data expo#AI research#amp#Apache#applications#approach#Articles#artificial#Artificial Intelligence#automation#benchmark#benchmarks#Big Data#billion#california#Cloud#coding#Community#Companies#comparison#compliance#comprehensive#conference#cyber#cyber security#data#Deep & Reinforcement Learning