LM Studio Accelerates LLM Inference with CUDA 12.8 and GeForce RTX GPUs

The latest desktop application update improves model controls, dev tools, and RTX GPU performance.
As AI use cases proliferate, developers and hobbyists want faster and more flexible ways to run large language models (LLMs), from document summarisation to custom software agents.
Running models locally on PCs with NVIDIA GeForce RTX GPUs enables high-performance inference, keeps data private, and gives users full control over how AI is deployed and integrated. Free applications like LM Studio let users explore and run LLMs on their own hardware.
LM Studio is a popular application for local LLM inference. Built on the high-performance llama.cpp runtime, it allows models to run entirely offline and to be served as OpenAI-compatible API endpoints for integration into custom workflows.
LM Studio 0.3.15 uses CUDA 12.8 to improve model load and response times on RTX GPUs. The update also adds developer-focused features, including a revised system prompt editor and finer control over tool use via the “tool_choice” parameter.
The latest LM Studio improvements boost both usability and speed, delivering maximum throughput on RTX AI PCs. The result is faster responses, snappier interactions, and better tools for building and integrating AI locally.
AI Acceleration Meets Common Apps
Thanks to its versatility, LM Studio suits everything from light experimentation to deep integration into custom workflows. Models can be used through the desktop chat interface or, with developer mode enabled, called through an OpenAI-compatible API, making it easy to wire local LLMs into custom desktop agents or workflows in tools such as Visual Studio Code.
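As a minimal sketch of that API workflow, the snippet below points the standard openai Python client at LM Studio's local server (http://localhost:1234/v1 is the default address; adjust it if you changed the port). The model name is a placeholder for whatever model you have loaded.

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
# Port 1234 is LM Studio's default; the API key can be any string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier of the model loaded in LM Studio
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the benefits of local LLM inference."},
    ],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, any editor extension or agent framework that accepts a custom base URL can use it unchanged.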
The popular markdown-based knowledge management tool Obsidian can also be integrated with LM Studio. Using community-developed plug-ins such as Text Generator and Smart Connections, users can query their notes, generate content, and summarise research with local LLMs. Because these plug-ins connect to LM Studio's local server, AI interactions stay fast and private, with no dependency on the cloud.
Developer enhancements in 0.3.15 include an updated system prompt editor for longer or more complex prompts, and more precise management of tool use through the “tool_choice” parameter.
The tool_choice parameter lets developers force a tool call, disable tool use entirely, or let the model decide how to interact with external tools. This flexibility benefits structured interactions, retrieval-augmented generation (RAG) workflows, and agent pipelines. Together, these upgrades expand what developers can do with LLMs in both experimentation and production.
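To sketch how tool_choice fits into a request, the example below defines a single hypothetical tool (get_weather, invented here purely for illustration) and forces the model to call it. It assumes the client object configured in the previous snippet.

```python
# Assumes `client` is the OpenAI client configured for LM Studio above.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="local-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
    tool_choice="required",  # force a tool call; "auto" and "none" are the alternatives
)
print(response.choices[0].message.tool_calls)
```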
LM Studio supports open models such as Gemma, Llama 3, Mistral, and Orca, in quantisation formats ranging from 4-bit to full precision.
Common use cases include RAG, document-based Q&A, multi-turn chat with long context windows, and local agent pipelines. Local inference servers powered by the NVIDIA RTX-accelerated llama.cpp library make it simple for RTX AI PC users to integrate local LLMs into their own applications.
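As an illustration of the multi-turn chat use case, here is a minimal sketch of a chat loop that carries the full conversation history on every request, so the model keeps context across turns (again assuming the client set up earlier):

```python
# Minimal multi-turn chat loop against LM Studio's local server.
# Assumes `client` is the OpenAI client configured earlier.
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user_input})
    reply = client.chat.completions.create(
        model="local-model",  # placeholder model identifier
        messages=history,     # sending the full history preserves context
    )
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Assistant:", answer)
```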
LM Studio gives you full control, speed, and privacy on RTX, whether you're optimising a modest PC for efficiency or a big desktop for throughput.
Maximise RTX GPU Throughput
LM Studio's acceleration is built on llama.cpp, the open-source runtime for inference on consumer hardware. NVIDIA collaborated with LM Studio and the llama.cpp community to improve performance on RTX GPUs.
Important optimisations include:
CUDA graph enablement: batches multiple GPU operations into a single CPU dispatch, reducing CPU overhead and boosting model throughput by up to 35%.
Flash Attention CUDA kernels: improve how attention is handled in transformer models, boosting throughput by up to 15% and enabling longer context windows without additional memory or compute.
Support for the newest RTX architectures: LM Studio's CUDA 12.8 update covers the full range of RTX AI PCs, from GeForce RTX 20 Series up to NVIDIA Blackwell-class GPUs, so users can deploy local AI workflows on everything from laptops to high-end desktops.
With a compatible driver installed, LM Studio automatically upgrades to the CUDA 12.8 runtime, improving model load times and overall performance.
These improvements deliver faster response times and smoother inference across all RTX AI PCs, from thin laptops to high-performance desktops and workstations.
Get Started With LM Studio
LM Studio is free to download for Linux, macOS, and Windows. With the 0.3.15 release and ongoing optimisations, local AI continues to become faster, more customisable, and easier to use.
Models can be loaded and used through the desktop chat interface, while developer mode exposes an OpenAI-compatible API.
Start immediately by downloading and launching the latest LM Studio.
Click the magnifying glass icon on the left to open the Discover menu.
After selecting the runtime options on the left, find the CUDA 12 llama.cpp (Windows) runtime in the availability list and click “Download and Install”.
After installation, set LM Studio to use this runtime by default by selecting CUDA 12 llama.cpp (Windows) from the Default Selections menu.
To optimise CUDA execution in LM Studio, load a model and click the gear icon to the left of it to open Settings.
Drag the “GPU Offload” slider all the way to the right to offload every model layer to the GPU, then enable “Flash Attention” from the settings menu.
With these options enabled and configured, you are ready to run inference locally on an NVIDIA GPU.
LM Studio supports model presets, a range of quantisation formats, and developer parameters such as tool_choice for fine-grained control over inference. For anyone who wants to contribute, the llama.cpp GitHub project is actively maintained and continues to evolve with performance enhancements from the community and NVIDIA.
LM Studio 0.3.15 Adds RTX 50-Series GPU Support and API Tool Use Improvements
LM Studio 0.3.15 is available as a stable release. It adds support for NVIDIA RTX 50-series GPUs (CUDA 12), UI changes including a revamped system prompt editor, the option to log each generation fragment to the API server logs, and improved tool use support in the API via the tool_choice parameter.
CUDA 12 Compatibility for RTX 50-Series GPUs
LM Studio's llama.cpp engines now support RTX 50-series GPUs with CUDA 12.8 on Windows and Linux. As expected, this speeds up first-time model load on RTX 50-series GPUs. If a suitable NVIDIA driver is present, LM Studio upgrades RTX 50-series systems to CUDA 12 automatically.
The minimum driver versions are:
Windows: 551.61 or newer
Linux: 550.54.14 or newer
If your driver meets the minimum version and an RTX 50-series GPU is present, LM Studio switches to CUDA 12 immediately; with an older driver, it falls back to CUDA 11 even on RTX 50-series hardware. Runtimes can be managed manually via the runtime settings (Command+Shift+R).
New System Prompt Editor UI
System prompts are an effective way to change model behaviour, and they can range from a few words to several pages. LM Studio 0.3.15 adds a larger visual space for editing long prompts; the smaller prompt editor in the sidebar still works as well.
Improved Tool Use API Support
The OpenAI-compatible REST API now supports tool_choice, which lets you configure how the model uses tools. The tool_choice parameter takes one of three values, illustrated in the sketch after the list:
“tool_choice”: “none” instructs the model to call no tools.
“tool_choice”: “auto” lets the model decide whether to call tools.
“tool_choice”: “required” forces the model to output only tool calls (llama.cpp engines only).
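As a sketch of what these values look like on the wire, the request below posts raw JSON to the local server's chat completions endpoint (http://localhost:1234/v1 is LM Studio's default address, and the model name and get_weather tool are placeholders invented for illustration):

```python
import requests

# Raw JSON request to LM Studio's OpenAI-compatible endpoint.
# localhost:1234 is the default server address; adjust if yours differs.
payload = {
    "model": "local-model",  # placeholder model identifier
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # or "none" / "required" ("required" needs llama.cpp engines)
}

resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"])
```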
This release also fixes a bug in LM Studio's OpenAI-compatibility mode that prevented the streamed chunk's “finish_reason” from being set to “tool_calls” when the model called a tool.
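To see where finish_reason surfaces, here is a minimal streaming sketch, reusing the assumed client and tools from the earlier examples, that watches the final chunk for “tool_calls”:

```python
# Stream a response and inspect each chunk's finish_reason.
# Assumes `client` and `tools` from the earlier sketches.
stream = client.chat.completions.create(
    model="local-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].finish_reason is not None:
        # With the fix, this reports "tool_calls" when the model invoked a tool.
        print("finish_reason:", chunk.choices[0].finish_reason)
```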
Community Presets (Preview)
Presets combine system prompts with model parameters.
Starting with LM Studio 0.3.15, you can download and share user-made presets online. You can also like and fork presets created by others.
Enable this feature under Settings > General > “Enable publishing and downloading presets”.
Once the feature is enabled, right-clicking a preset in the sidebar reveals a “Publish” button, letting you share your preset with the community.