#Buildots AI | Explore Tumblr posts and blogs

jcmarchi · 20 days ago

The Sequence Radar #554: The New DeepSeek R1-0528 is Very Impressive

New Post has been published on https://thedigitalinsider.com/the-sequence-radar-554-the-new-deepseek-r1-0528-is-very-impressive/

The new model excels at math and reasoning.

Next Week in The Sequence:

In our series about evals, we discuss multi-turn benchmarks. The engineering section dives into the amazing Anthropic Circuits for ML interpretability. In research, we discuss some of UC Berkeley’s recent work in LLM reasoning. Our opinion section dives into the state of AI interpretability.

📝 Editorial: The New DeepSeek R1-0528 is Very Impressive

This week, DeepSeek AI pushed the boundaries of open-source language modeling with the release of DeepSeek R1-0528. Building on the foundation of the original R1 release, this update delivers notable gains in mathematical reasoning, code generation, and long-context understanding. With improvements derived from enhanced optimization and post-training fine-tuning, R1-0528 marks a critical step toward closing the performance gap between open models and their proprietary counterparts like GPT-4 and Gemini 1.5.

At its core, DeepSeek R1-0528 preserves the 671B-parameter Mixture-of-Experts (MoE) architecture of the original R1, activating 37B parameters per token. This architecture delivers high-capacity performance while optimizing for efficiency, especially in inference settings. One standout feature is its support for 64K-token context windows, which lets the model work with substantially larger inputs such as technical documents, structured reasoning chains, and multi-step plans.
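To make the idea of sparse activation concrete, here is a minimal sketch of generic top-k expert routing, the mechanism that lets an MoE model run only a small share of its weights (about 5.5% by the figures above) for each token. It is not DeepSeek's actual router: the function names, the toy experts, and the choice of `k=2` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' gates sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 2.0]  # hypothetical router scores for one token

selected = route_top_k(scores, k=2)       # experts 1 and 3, equal weights
y = moe_forward(1.0, experts, scores, 2)  # 0.5 * 2.0 + 0.5 * 4.0
print(selected, y)
```

Only two of the four toy experts are evaluated here; the other two contribute no compute, which is the efficiency argument in miniature.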

In terms of capability uplift, the model shows remarkable progress on competitive benchmarks. On AIME 2025, accuracy rose from 70% for the original R1 to 87.5% for R1-0528, a sign of an increasingly sophisticated ability to tackle complex mathematical problems. This leap suggests not just better fine-tuning but greater reasoning depth, an essential quality for models serving scientific, technical, and educational use cases.

For software engineering and development workflows, R1-0528 brings meaningful updates. Accuracy on LiveCodeBench rose from 63.5% to 73.3%, confirming improvements in structured code synthesis. The inclusion of JSON-formatted outputs and native function calling support positions the model as a strong candidate for integration into automated pipelines, copilots, and tool-augmented environments where structured outputs are non-negotiable.
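As a rough illustration of why structured outputs matter in automated pipelines, the sketch below parses a JSON-formatted tool call and dispatches it to a local function. The `{"name": ..., "arguments": ...}` shape, the `get_weather` tool, and the `dispatch` helper are hypothetical and are not taken from DeepSeek's API; a real integration would follow the provider's documented schema.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an external service."""
    return f"Sunny in {city}"

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str):
    """Parse a JSON tool call emitted by a model and invoke the matching tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"model output is not valid JSON: {err}") from err
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)

result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)
```

Because the output is machine-parseable, malformed or unexpected calls fail loudly at the boundary instead of propagating free-form text into downstream code.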

To ensure broad accessibility, DeepSeek also launched a distilled variant: R1-0528-Qwen3-8B. Despite its smaller footprint, this model surpasses Qwen3-8B on AIME 2024 by over 10%, while rivaling much larger competitors like Qwen3-235B-thinking. This reflects DeepSeek’s commitment to democratizing frontier performance, enabling developers and researchers with constrained compute resources to access state-of-the-art capabilities.

DeepSeek R1-0528 is more than just a model upgrade—it’s a statement. In an ecosystem increasingly dominated by closed systems, DeepSeek continues to advance the case for open, high-performance AI. By combining transparent research practices, scalable deployment options, and world-class performance, R1-0528 signals a future where cutting-edge AI remains accessible to the entire community—not just a privileged few.

🔎 AI Research

FLEX-Judge: THINK ONCE, JUDGE ANYWHERE

AI Lab: KAIST AI

Summary: FLEX-Judge is a reasoning-first multimodal evaluator trained on just 1K text-only explanations, achieving zero-shot generalization across images, audio, video, and molecular tasks while outperforming larger commercial models. It leverages textual reasoning alone to train a judge model that generalizes across modalities without modality-specific supervision.

Learning to Reason without External Rewards

AI Lab: UC Berkeley & Yale University

Summary: INTUITOR introduces a novel self-supervised reinforcement learning framework using self-certainty as intrinsic reward, matching supervised methods on math and outperforming them on code generation without any external feedback. The technique proposes self-certainty as an effective intrinsic reward signal for reinforcement learning, replacing gold labels.

Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning

AI Lab: Google DeepMind & Northwestern University

Summary: This paper introduces BARL, a novel Bayes-Adaptive RL algorithm that enables large language models to perform test-time reflective reasoning by switching strategies based on posterior beliefs over MDPs. The authors show that BARL significantly outperforms Markovian RL approaches in math reasoning tasks by improving token efficiency and adaptive exploration.

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

AI Lab: Microsoft Research Asia

Summary: The authors present rStar-Coder, a dataset of 418K competitive programming problems and 580K verified long-reasoning code solutions, which drastically boosts the performance of Qwen models on code reasoning benchmarks. Their pipeline introduces a robust input-output test case synthesis method and mutual-verification mechanism, achieving state-of-the-art performance even with smaller models.

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

AI Lab: Fudan University, CUHK MMLab, Shanghai AI Lab

Summary: MME-Reasoning offers a benchmark of 1,188 multimodal reasoning tasks spanning inductive, deductive, and abductive logic, revealing significant limitations in current MLLMs’ logical reasoning. The benchmark includes multiple question types and rigorous metadata annotations, exposing reasoning gaps especially in abductive tasks.

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

AI Lab: Carnegie Mellon University, NOVA LINCS, INESC-ID

Summary: DeepResearchGym is an open-source sandbox providing reproducible search APIs and evaluation protocols over ClueWeb22 and FineWeb for benchmarking deep research agents. It supports scalable dense retrieval and long-form response evaluation using LLM-as-a-judge assessments across dimensions like relevance and factual grounding.

Fine-Tuning Large Language Models with User-Level Differential Privacy

AI Lab: Google Research

Summary: This study compares two scalable user-level differential privacy methods (ELS and ULS) for LLM fine-tuning, with a novel privacy accountant that tightens DP guarantees for ELS. Experiments show that ULS generally offers better utility under large compute budgets or strong privacy settings, while maintaining scalability to hundreds of millions of parameters and users.

🤖 AI Tech Releases

DeepSeek-R1-0528

DeepSeek released a new version of its marquee R1 model.

Anthropic Circuits

Anthropic open-sourced its circuit interpretability technology.

Perplexity Labs

Perplexity released a new tool that can generate charts, spreadsheets and dashboards.

Codestral Embed

Mistral released Codestral Embed, an embedding model specialized in coding.

🛠 AI in Production

Multi-Task Learning at Netflix

Netflix shared some details about its multi-task prediction strategy for user intent.

📡AI Radar

Salesforce agreed to buy Informatica for $8 billion.

xAI and Telegram partnered to make Grok available to Telegram’s users.

Netflix’s Reed Hastings joined Anthropic’s board of directors.

Grammarly raised $1 billion to accelerate sales and acquisitions.

Spott raised $3.2 million for an AI-native recruiting firm.

Buildots raised $45 million for its AI-for-construction platform.

Context raised $11 million to power an AI-native office suite.

Rillet raised $25 million to enable AI for mid-market accounting.

HuggingFace unveiled two new open-source robots.

wolfliving · 5 years ago

Buildots AI for construction

AI that scans a construction site can spot when things are falling behind

Building sites in Europe are now using image recognition software made by Buildots that flags up delays or errors automatically.

by Will Douglas Heaven

October 16, 2020

https://www.technologyreview.com/2020/10/16/1010617/ai-image-recognition-construction-computer-vision-costs-delays/

Construction sites are vast jigsaws of people and parts that must be pieced together just so at just the right times. As projects get larger, mistakes and delays get more expensive. The consultancy McKinsey estimates that on-site mismanagement costs the construction industry $1.6 trillion a year. But typically you might only have five managers overseeing construction of a building with 1,500 rooms, says Roy Danon, founder and CEO of British-Israeli startup Buildots: “There’s no way a human can control that amount of detail.”

Danon thinks that AI can help. Buildots is developing an image recognition system that monitors every detail of an ongoing construction project and flags up delays or errors automatically. It is already being used by two of the biggest building firms in Europe, including UK construction giant Wates in a handful of large residential builds. Construction is essentially a kind of manufacturing, says Danon. If high-tech factories now use AI to manage their processes, why not construction sites?

AI is starting to change various aspects of construction, from design to self-driving diggers. Some companies even provide a kind of overall AI site inspector that matches images taken on site against a digital plan of the building. Now Buildots is making that process easier than ever by using video footage from GoPro cameras mounted on the hard hats of workers....
