#CUDA for ML
jcmarchi · 10 months ago
Master CUDA: For Machine Learning Engineers
New Post has been published on https://thedigitalinsider.com/master-cuda-for-machine-learning-engineers/
CUDA for Machine Learning: Practical Applications
Structure of a CUDA C/C++ application, where the host (CPU) code manages the execution of parallel code on the device (GPU).
Now that we’ve covered the basics, let’s explore how CUDA can be applied to common machine learning tasks.
Matrix Multiplication
Matrix multiplication is a fundamental operation in many machine learning algorithms, particularly in neural networks. CUDA can significantly accelerate this operation. Here’s a simple implementation:
__global__ void matrixMulKernel(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    if (row < N && col < N) {
        for (int i = 0; i < N; i++) {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Host function to set up and launch the kernel
void matrixMul(float *A, float *B, float *C, int N) {
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(A, B, C, N);
}
This implementation divides the output matrix into blocks, with each thread computing one element of the result. While this basic version is already faster than a CPU implementation for large matrices, there’s room for optimization using shared memory and other techniques.
Convolution Operations
Convolutional Neural Networks (CNNs) rely heavily on convolution operations. CUDA can dramatically speed up these computations. Here’s a simplified 2D convolution kernel:
__global__ void convolution2DKernel(float *input, float *kernel, float *output,
                                    int inputWidth, int inputHeight,
                                    int kernelWidth, int kernelHeight) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < inputWidth && y < inputHeight) {
        float sum = 0.0f;
        for (int ky = 0; ky < kernelHeight; ky++) {
            for (int kx = 0; kx < kernelWidth; kx++) {
                int inputX = x + kx - kernelWidth / 2;
                int inputY = y + ky - kernelHeight / 2;
                if (inputX >= 0 && inputX < inputWidth && inputY >= 0 && inputY < inputHeight) {
                    sum += input[inputY * inputWidth + inputX] * kernel[ky * kernelWidth + kx];
                }
            }
        }
        output[y * inputWidth + x] = sum;
    }
}
This kernel performs a 2D convolution, with each thread computing one output pixel. In practice, more sophisticated implementations would use shared memory to reduce global memory accesses and optimize for various kernel sizes.
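The example stops at the kernel itself; a host-side launcher analogous to the matrix-multiplication example might look like the sketch below. The 16×16 block size and the assumption that the pointers already refer to device memory are choices made here, not part of the original.

// Possible host-side launcher for the convolution kernel above (sketch).
// Assumes d_input, d_kernel, d_output are device pointers and a 16x16 block size.
void convolution2D(float *d_input, float *d_kernel, float *d_output,
                   int inputWidth, int inputHeight,
                   int kernelWidth, int kernelHeight) {
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((inputWidth  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (inputHeight + threadsPerBlock.y - 1) / threadsPerBlock.y);
    convolution2DKernel<<<numBlocks, threadsPerBlock>>>(
        d_input, d_kernel, d_output,
        inputWidth, inputHeight, kernelWidth, kernelHeight);
}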
Stochastic Gradient Descent (SGD)
SGD is a cornerstone optimization algorithm in machine learning. CUDA can parallelize the computation of gradients across multiple data points. Here’s a simplified example for linear regression:
__global__ void sgdKernel(float *X, float *y, float *weights, float learningRate, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float prediction = 0.0f;
        for (int j = 0; j < d; j++) {
            prediction += X[i * d + j] * weights[j];
        }
        float error = prediction - y[i];
        for (int j = 0; j < d; j++) {
            atomicAdd(&weights[j], -learningRate * error * X[i * d + j]);
        }
    }
}

void sgd(float *X, float *y, float *weights, float learningRate, int n, int d, int iterations) {
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int iter = 0; iter < iterations; iter++) {
        sgdKernel<<<numBlocks, threadsPerBlock>>>(X, y, weights, learningRate, n, d);
    }
}
This implementation updates the weights in parallel for each data point. The atomicAdd function is used to handle concurrent updates to the weights safely.
Optimizing CUDA for Machine Learning
While the above examples demonstrate the basics of using CUDA for machine learning tasks, there are several optimization techniques that can further enhance performance:
Coalesced Memory Access
GPUs achieve peak performance when threads in a warp access contiguous memory locations. Ensure your data structures and access patterns promote coalesced memory access.
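As a rough illustration (not from the original article), compare the two access patterns below: mapping threadIdx.x to the column of a row-major matrix gives adjacent threads adjacent addresses (coalesced), while mapping it to the row makes adjacent threads stride N elements apart.

__global__ void coalescedRead(const float *data, float *out, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive columns
    if (row < N && col < N)
        out[row * N + col] = data[row * N + col];     // neighbouring threads touch neighbouring addresses
}

__global__ void stridedRead(const float *data, float *out, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive rows
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        out[row * N + col] = data[row * N + col];     // neighbouring threads are N floats apart
}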
Shared Memory Usage
Shared memory is much faster than global memory. Use it to cache frequently accessed data within a thread block.
Understanding the memory hierarchy with CUDA
This diagram illustrates the architecture of a multi-processor system with shared memory. Each processor has its own cache, allowing for fast access to frequently used data. The processors communicate via a shared bus, which connects them to a larger shared memory space.
For example, in matrix multiplication:
#define TILE_SIZE 16  // tile width; must match the block dimensions (value assumed here, not given in the original)

__global__ void matrixMulSharedKernel(float *A, float *B, float *C, int N) {
    __shared__ float sharedA[TILE_SIZE][TILE_SIZE];
    __shared__ float sharedB[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    float sum = 0.0f;

    for (int tile = 0; tile < (N + TILE_SIZE - 1) / TILE_SIZE; tile++) {
        if (row < N && tile * TILE_SIZE + tx < N) {
            sharedA[ty][tx] = A[row * N + tile * TILE_SIZE + tx];
        } else {
            sharedA[ty][tx] = 0.0f;
        }
        if (col < N && tile * TILE_SIZE + ty < N) {
            sharedB[ty][tx] = B[(tile * TILE_SIZE + ty) * N + col];
        } else {
            sharedB[ty][tx] = 0.0f;
        }
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += sharedA[ty][k] * sharedB[k][tx];
        }
        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
This optimized version uses shared memory to reduce global memory accesses, significantly improving performance for large matrices.
Asynchronous Operations
CUDA supports asynchronous operations, allowing you to overlap computation with data transfer. This is particularly useful in machine learning pipelines where you can prepare the next batch of data while the current batch is being processed.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfers and kernel launches
cudaMemcpyAsync(d_data1, h_data1, size, cudaMemcpyHostToDevice, stream1);
myKernel<<<grid, block, 0, stream1>>>(d_data1, ...);
cudaMemcpyAsync(d_data2, h_data2, size, cudaMemcpyHostToDevice, stream2);
myKernel<<<grid, block, 0, stream2>>>(d_data2, ...);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
Tensor Cores
For machine learning workloads, NVIDIA’s Tensor Cores (available in newer GPU architectures) can provide significant speedups for matrix multiply and convolution operations. Libraries like cuDNN and cuBLAS automatically leverage Tensor Cores when available.
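As a hedged sketch of what that looks like in practice (assuming CUDA 11 or later, an Ampere-or-newer GPU, and column-major device matrices, as cuBLAS expects), a single cuBLAS call can be allowed to use TF32 Tensor Core math:

#include <cublas_v2.h>

void gemmTF32(const float *d_A, const float *d_B, float *d_C, int N) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);  // opt in to TF32 Tensor Cores
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);
    cublasDestroy(handle);
}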
Challenges and Considerations
While CUDA offers tremendous benefits for machine learning, it’s important to be aware of potential challenges:
Memory Management: GPU memory is limited compared to system memory. Efficient memory management is crucial, especially when working with large datasets or models.
Data Transfer Overhead: Transferring data between CPU and GPU can be a bottleneck. Minimize transfers and use asynchronous operations when possible (a pinned-memory sketch follows this list).
Precision: GPUs traditionally excel at single-precision (FP32) computations. While support for double-precision (FP64) has improved, it’s often slower. Many machine learning tasks can work well with lower precision (e.g., FP16), which modern GPUs handle very efficiently.
Code Complexity: Writing efficient CUDA code can be more complex than CPU code. Leveraging libraries like cuDNN, cuBLAS, and frameworks like TensorFlow or PyTorch can help abstract away some of this complexity.
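Returning to the data transfer point above, pinned (page-locked) host memory is one common way to keep transfers off the critical path, since cudaMemcpyAsync can only overlap with computation when the host buffer is pinned. The sketch below is illustrative only; stageBatch, h_batch_src, d_batch and stream are names invented for this example.

#include <cstring>

void stageBatch(float *d_batch, const float *h_batch_src, size_t bytes, cudaStream_t stream) {
    float *h_pinned;
    cudaMallocHost(&h_pinned, bytes);                  // pinned allocation instead of malloc()
    memcpy(h_pinned, h_batch_src, bytes);              // prepare the next batch on the CPU
    cudaMemcpyAsync(d_batch, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);                     // wait before freeing the pinned buffer
    cudaFreeHost(h_pinned);
}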
As machine learning models grow in size and complexity, a single GPU may no longer be sufficient to handle the workload. CUDA makes it possible to scale your application across multiple GPUs, either within a single node or across a cluster.
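A minimal data-parallel sketch of that idea on a single node is shown below; processChunk, d_chunk and h_chunk are placeholders standing in for whatever kernel and pre-partitioned buffers your application uses.

__global__ void processChunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                        // stand-in for real per-GPU work
}

void runOnAllGpus(float **d_chunk, float **h_chunk, size_t chunkBytes, int chunkElems) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    int block = 256;
    int grid = (chunkElems + block - 1) / block;
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                            // subsequent calls target this GPU
        cudaMemcpyAsync(d_chunk[dev], h_chunk[dev], chunkBytes, cudaMemcpyHostToDevice);
        processChunk<<<grid, block>>>(d_chunk[dev], chunkElems);
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();                       // wait for each GPU to finish
    }
}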
CUDA Programming Structure
To effectively utilize CUDA, it’s essential to understand its programming structure, which involves writing kernels (functions that run on the GPU) and managing memory between the host (CPU) and device (GPU).
Host vs. Device Memory
In CUDA, memory is managed separately for the host and device. The following are the primary functions used for memory management:
cudaMalloc: Allocates memory on the device.
cudaMemcpy: Copies data between host and device.
cudaFree: Frees memory on the device.
Example: Summing Two Arrays
Let’s look at an example that sums two arrays using CUDA:
__global__ void sumArraysOnGPU(float *A, float *B, float *C, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1024;
    size_t bytes = N * sizeof(float);

    float *h_A, *h_B, *h_C;
    h_A = (float*)malloc(bytes);
    h_B = (float*)malloc(bytes);
    h_C = (float*)malloc(bytes);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    sumArraysOnGPU<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
In this example, memory is allocated on both the host and device, data is transferred to the device, and the kernel is launched to perform the computation.
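One thing the example leaves out is error checking: every CUDA runtime call returns a status code that real applications should inspect. A small macro along the following lines (an addition for illustration, not part of the original example) is a common pattern:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_A, bytes));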
Conclusion
CUDA is a powerful tool for machine learning engineers looking to accelerate their models and handle larger datasets. By understanding the CUDA memory model, optimizing memory access, and leveraging multiple GPUs, you can significantly enhance the performance of your machine learning applications.
3acesnews · 2 days ago
NVIDIA Advances ML in Manufacturing with CUDA-X Data Science
vndta-vps · 8 days ago
GPU VPS Services – The Optimal Solution for AI, Rendering and Coin Mining
With digital technology developing as rapidly as it is today, demand for virtual private servers (VPS) keeps growing, especially in fields that require high processing performance such as artificial intelligence (AI), machine learning, 3D graphics, video rendering and cryptocurrency mining. GPU VPS services have therefore become the go-to solution, meeting these demands for computing power and graphics processing. This article will help you better understand GPU VPS, its benefits, and how to choose a reputable provider.
What is a GPU VPS?
A VPS (Virtual Private Server) is a virtual server partitioned from a physical server through virtualisation technology. Unlike an ordinary VPS, a GPU VPS is additionally equipped with a discrete graphics card (GPU – Graphics Processing Unit), providing far greater graphics and parallel-computing capability. GPUs are not only useful for displaying images; they also play a key role in tasks such as training AI models, deep learning, high-quality video rendering and cryptocurrency mining.
Key benefits of a GPU VPS service
Outstanding computing performance: A GPU VPS delivers processing power far beyond a traditional VPS thanks to powerful GPUs such as the NVIDIA Tesla, RTX A5000, A100 or the RTX 30xx series. This lets you process data quickly, saving time and money.
Optimised for AI and Machine Learning: GPUs accelerate the training and evaluation of machine learning models. This shortens the time to build AI products considerably while offloading complex work from the CPU.
Support for professional rendering and graphics processing: Graphics software such as Blender, Maya, Adobe After Effects and Cinema 4D all use the GPU to render faster and more smoothly. A GPU VPS is a strong fit for design studios, filmmakers and 3D artists.
Suitable for cryptocurrency mining: GPUs are the key hardware for mining coins such as Ethereum (ETH), Ravencoin (RVN) and Ergo (ERG). Renting a GPU VPS lets you mine steadily without the upfront hardware investment.
Who should use a GPU VPS service?
Developers and data scientists specialising in AI/ML.
Film production studios, 3D graphics engineers and visualisation technicians.
Game developers who need a powerful environment for graphics testing.
Cryptocurrency investors who need a stable, flexible mining setup.
Businesses deploying facial recognition, image analysis and similar systems.
Criteria for choosing a quality GPU VPS service
Powerful GPU configuration: Prefer plans using NVIDIA GPUs with large VRAM and support for CUDA and Tensor cores, optimised for AI and rendering.
Stable infrastructure: Choose a provider with data centres in Vietnam or abroad offering stable network connectivity and uptime of 99.9% or higher.
24/7 technical support: GPU VPS services require technical expertise, so the provider should have a highly skilled support team that responds quickly.
Reasonable, flexible pricing: Compare several GPU VPS plans to find one that fits your needs and budget. Some providers also offer hourly, daily or monthly billing.
Practical applications of a GPU VPS
AI training: Using TensorFlow, PyTorch or Keras with a GPU speeds up model training many times over compared with a CPU.
Video and 3D rendering: Build and export 4K and 8K video quickly.
AR/VR application development: An optimised test environment for virtual-reality applications.
Data analytics: Process large volumes of data with GPU-accelerated analysis.
Running demanding game servers: Host AAA games with high-end graphics.
Conclusion
A GPU VPS service is an ideal choice for individuals, businesses and organisations looking for a powerful server solution for graphics- and compute-intensive workloads. Whether you are an AI developer, a designer or a crypto investor, a GPU VPS can boost your productivity while saving time and money. Weigh the criteria around configuration, pricing and support carefully before choosing the provider that suits you best.
Learn more: https://vndata.vn/vps-gpu/
govindhtech · 22 days ago
Dell NERSC 10 Supercomputer With NVIDIA Vera Rubin & DOE
Our Nobel Prize-Winning Supercomputer Accelerates Science
Dell Technologies will build NERSC 10, the next flagship supercomputer of the National Energy Research Scientific Computing Centre (NERSC), under a new DOE contract. NERSC is hosted at Lawrence Berkeley National Laboratory, a DOE user site. Secretary of Energy Chris Wright made the announcement at Berkeley Lab.
NERSC, a DOE Office of Science user facility, will house NERSC 10, its tenth-generation computing system. NERSC's HPC facilities enable research in physics, climate, energy, biology, and materials science.
Overview of NERSC 10
NERSC 10 will replace Perlmutter, the centre's current flagship supercomputer.
Deployment of NERSC 10 is expected in 2026 or later.
It is being developed to meet the DOE Office of Science's growing data and computational demands through 2030.
NERSC 10 will again be housed at LBNL in California.
Main features
Main goals of NERSC 10:
Extreme Performance:
Near-exascale or exascale computing power.
Support for increasingly complex workloads.
Balanced Architecture:
High memory bandwidth and capacity.
I/O for rapid storage.
Complex interconnects for low-latency communication.
Integration of HPC and AI:
Support for AI and ML workloads in addition to simulations.
Built for hybrid workflows.
Energy Efficiency:
Improving power efficiency (FLOPS per watt).
Green technology like liquid cooling may be investigated.
User-centred design:
Maintaining the NERSC user experience and software stack.
A focus on usability and productivity for many scientists.
Purchase and Development
DOE normally issues an RFP years before system delivery.
Community Engagement: NERSC solicits scientific user community comments throughout system design to ensure practical research needs are met.
Strategic Importance
Supports hundreds of research projects and over 9,000 users throughout DOE mission areas.
Leadership Role: Unlike experimental exascale systems, NERSC systems are designed to be usable by a broad community of scientists.
The new system, expected in 2026, will be named “Doudna” after Berkeley Lab biochemist Jennifer Doudna, who won the 2020 Nobel Prize in Chemistry for developing CRISPR. Secretary Wright said he was surprised and pleased by the naming, praising her scientific achievements and the potential for computational power to speed cures for illness and tumours.
The Dell Technologies Doudna supercomputer will run on NVIDIA's next-generation Vera Rubin platform. Large-scale HPC workloads including high-energy physics, molecular dynamics, and AI training and inference are its focus. Simulation, data, and AI on one platform stabilise cutting-edge science workflows and expedite findings.
The system “represents DOE’s commitment to advancing American leadership in science, AI, and high-performance computing,” Secretary Wright said. He called Doudna a “powerhouse for rapid innovation” that will revolutionise quantum computing and supply cheap, abundant energy. Wright called AI “the Manhattan Project of the time,” emphasising that Doudna will help American scientists compete globally in AI.
The Doudna supercomputer will speed up numerous scientific workflows, said NERSC Director Sudip Dosanjh. NERSC is collaborating with NVIDIA and Dell to prepare its 11,000 users for the system's improved workflows. Doudna will be connected to DOE observational and experimental facilities via the Energy Sciences Network (ESnet), allowing scientists to stream and analyse data in real time. This integration makes the supercomputer an active participant in experimental workflows rather than a passive one.
Doudna could boost innovation in several areas. Because NERSC supports fusion research, it may accelerate the search for plentiful, usable energy. Its powerful GPUs will let DOE-funded researchers quickly integrate large-scale AI into their workflows, speeding up research in basic physics, biomolecular modelling, and advanced materials design. The system will also support modern quantum simulation tools, including NVIDIA's CUDA-Q platform, for co-designing future integrated quantum-HPC systems and scalable quantum algorithms.
The Vera-Rubin CPU-GPU platform and Dell's latest ORv3 direct liquid-cooled server technologies are used, according to NVIDIA. It will use Dell Integrated Rack Scalable Systems and PowerEdge servers with NVIDIA accelerators, NVIDIA Quantum-X800 InfiniBand networking, and high-performance data management and storage.
Doudna is expected to deliver more than ten times the scientific output of Perlmutter, NERSC's current flagship supercomputer, while drawing only two to three times the power, a three- to five-fold improvement in performance per watt. The goal is to substantially reduce the time needed for major scientific breakthroughs.
“We're not just developing a quicker computer,” said Nick Wright, NERSC's Doudna principal architect and advanced technologies group lead. “We're creating a framework to help researchers think broadly and discover faster.” He added that the system is designed to quickly address global concerns and encourage study in physics, chemistry, and fields not yet imagined.
sharon-ai · 1 month ago
Supercharge Your AI Workflows with the NVIDIA A40 – Powered by Sharon AI
In the age of artificial intelligence and high-performance computing, organizations demand powerful, flexible, and scalable GPU solutions. The NVIDIA A40 stands out as one of the most versatile graphics and compute accelerators on the market today. At Sharon AI, this next-generation GPU is at the core of enabling businesses to push the boundaries of innovation in machine learning, data science, deep learning, and visualization.
What Makes the NVIDIA A40 Unique?
The NVIDIA A40 is built on the Ampere architecture, delivering incredible performance across various workloads. Designed for professionals, researchers, and developers, the NVIDIA A40 offers unmatched versatility, allowing it to serve roles in data centers, AI development environments, and 3D rendering studios alike.
Equipped with 48 GB of GDDR6 memory, the NVIDIA A40 easily handles massive datasets and intricate models, making it ideal for complex deep learning tasks. Whether you're training neural networks or running inference workloads, this GPU can handle it all with efficiency and precision.
Enterprise-Grade Performance with Ampere Architecture
The Ampere architecture powering the NVIDIA A40 includes 10,752 CUDA cores, making it a compute-intensive powerhouse. With third-generation Tensor Cores and second-generation RT Cores, it accelerates AI training and inference while enabling advanced ray tracing capabilities for high-end rendering.
Professionals in industries such as architecture, medicine, and automotive design benefit from its real-time photorealistic rendering. For AI practitioners, the NVIDIA A40 supports both FP32 and TensorFloat-32 (TF32) formats, significantly increasing throughput in training and inference tasks.
A Perfect Fit for Sharon AI’s Vision
At Sharon AI, the integration of the NVIDIA A40 into its compute infrastructure offers customers access to cutting-edge hardware that is fully optimized for today's demanding AI and ML workflows. By incorporating this GPU, Sharon AI helps enterprises reduce training time, boost throughput, and deliver faster insights.
Sharon AI’s platform is designed for developers and data scientists seeking seamless scalability and enterprise-grade reliability. The NVIDIA A40 supports this mission by delivering the horsepower required for the most computationally heavy AI applications.
Scalable, Secure, and Future-Ready
The NVIDIA A40 is not only powerful—it’s also built with data center efficiency in mind. Its support for PCIe Gen 4 provides double the bandwidth of its predecessor, ensuring faster communication between the GPU and other system components. This results in lower latency and higher overall system performance.
The flexibility of the NVIDIA A40 allows it to be deployed in virtualized environments, supporting NVIDIA’s vGPU software for enterprises looking to deliver GPU-accelerated workloads in multi-user environments. Whether used in a dedicated workstation or virtualized across a fleet of systems, the A40 provides the performance and security that professionals require.
Transforming AI Development at Scale
AI models are growing in size and complexity. The NVIDIA A40 is designed for this new era of AI development, where large language models, generative AI, and transformer networks dominate. With Sharon AI’s cloud infrastructure supporting the A40, businesses can train these advanced models faster and more efficiently, without the need to invest in expensive on-premise hardware.
The fusion of Sharon AI’s advanced platform and the power of the NVIDIA A40 offers a compelling solution for anyone looking to modernize their AI and HPC workflows. From startups to enterprises, the A40 enables faster results, smarter solutions, and a future-proof approach to innovation.
Final Thoughts
As businesses continue to adopt AI at scale, the need for advanced GPU solutions grows. The NVIDIA A40 delivers the performance, memory, and scalability required for the most demanding workloads—and with Sharon AI integrating it into their infrastructure, users get the best of both worlds: powerful hardware and a platform built for AI success.
lakshmiglobal · 3 months ago
Best Workstations for Machine Learning & AI
Machine learning (ML) and AI workloads demand high-performance workstations with powerful GPUs, ample memory, and fast storage. Whether you're training deep learning models or running AI inference tasks, the right workstation setup can significantly impact performance.
Key Components of an AI/ML Workstation
Processor (CPU)
Recommended: AMD Threadripper, Intel Xeon, or Intel Core i9
High core count for parallel processing
Supports multi-threaded AI applications
Best balance between clock speed and core count

Graphics Processing Unit (GPU)
Recommended: NVIDIA RTX 4090, NVIDIA A100, NVIDIA H100
Essential for deep learning and AI workloads
CUDA and Tensor Core support for accelerated computing
Multi-GPU support for large datasets

Memory (RAM)
Recommended: 64GB to 256GB DDR5 ECC RAM
AI training models require significant memory
ECC (Error-Correcting Code) RAM for stability
Higher capacity prevents bottlenecks

Storage
Recommended: NVMe SSD (2TB–8TB) + HDD for backup
Fast NVMe SSD for model training and data access
Additional HDD storage for large datasets

Cooling System
Recommended: Liquid cooling or high-performance air cooling
Prevents thermal throttling during intensive computations

Best AI Workstation Configurations (2025)

Budget-Friendly Option (~$3,000)
CPU: AMD Ryzen 9 7950X
GPU: NVIDIA RTX 4090
RAM: 64GB DDR5
Storage: 2TB NVMe SSD + 4TB HDD
OS: Ubuntu Linux / Windows 11

High-Performance Workstation (~$8,000)
CPU: AMD Threadripper PRO 7995WX (96 Cores)
GPU: 2× NVIDIA RTX 4090
RAM: 128GB DDR5 ECC
Storage: 4TB NVMe SSD + 8TB HDD
OS: Linux / Windows 11 Pro

Enterprise-Grade AI Server (~$20,000+)
CPU: Dual Intel Xeon Platinum 8490H
GPU: 4× NVIDIA H100 Tensor Core
RAM: 256GB DDR5 ECC
Storage: 8TB NVMe SSD RAID + 20TB HDD
OS: Ubuntu Server

Prebuilt AI Workstations (2025)
Dell Precision 7960 Tower – High-end workstation for deep learning
HP Z8 Fury G5 – Dual CPU support for AI applications
Lambda Hyperplane – Designed specifically for AI & ML researchers

Final Thoughts
Choosing the right AI workstation depends on your budget, workload size, and future scalability. If you're training large deep-learning models, investing in multiple GPUs and high-speed storage is crucial. However, for smaller projects or prototyping, a single high-end GPU can be sufficient.
craigbrownphd · 1 year ago
Profiling CUDA using Nsight Systems: A Numba Example
#AI #ML #Tech https://towardsdatascience.com/profiling-cuda-using-nsight-systems-a-numba-example-fc65003f8c52?utm_source=dlvr.it&utm_medium=tumblr
exeton · 1 year ago
Exeton Launches Vector One, A New Single-GPU Desktop PC
The Exeton Vector One is now available for order. The new single-GPU desktop PC is built to tackle demanding AI/ML tasks, from fine-tuning Stable Diffusion to handling the complexities of Llama 2 7B. Exeton customers can now benefit from a more compact, quieter desktop PC at a price point of less than $5,500.
Vector One Specs
GPU: 1x NVIDIA GeForce RTX 4090, 24 GB, liquid-cooled
PROCESSOR: AMD Ryzen™ 9 7950X 16-core, 32-thread
SYSTEM RAM: 64 GB or 128 GB DDR5
STORAGE: OS — Up to 3.84 TB M.2 (NVMe) | Data — Up to 3 x 3.84 TB M.2 (NVMe)
NETWORK INTERFACE: 10Gb Ethernet
Key benefits of the Vector One
The Vector One offers Exeton customers a powerful deep learning solution to train neural networks right from their desktops.
Sleek Power that doesn’t Disturb
The Vector One has been meticulously designed with liquid cooling for both the CPU and GPU, ensuring optimal performance without the noise. Even under typical high workloads, it only emits a mere 39 dB SPL of sound, making it perfect for maintaining a quiet workspace.
Next-gen Graphics for Advanced AI/ML Tasks
Equipped with the cutting-edge NVIDIA GeForce RTX 4090 graphics card boasting 24 GB of VRAM, the Vector One stands ready to tackle demanding tasks. From fine-tuning Stable Diffusion to handling the complexities of Llama 2 7B, this machine ensures that high-intensity computations are a breeze.
Experience the Power of future-ready Architecture
At the heart of Vector One lies the state-of-the-art AMD Ryzen 9 7950X CPU, hosted on the advanced X670E chipset. This powerhouse supports both PCIe Gen 5 and DDR5 and offers up to twice the memory bandwidth of its predecessors. Dive into the future of computing with unrivaled speed and efficiency.
Delivering the Optimal Experience for AI/ML
Through rigorous research and experience, our engineers have crafted the ultimate system configuration tailored for AI/ML tasks. No more guesswork or configurations needed: the Vector One is fine-tuned to deliver unparalleled performance right out of the box. Additionally, every Vector One comes with a one-year warranty on hardware, with an option to extend to three years. For added peace of mind, choose to include dedicated technical support for Ubuntu and all ML frameworks and drivers that come pre-installed with your machine.
Pre-installed with the Software you Need
How to get started with Vector One
The Vector One is now available to purchase. Equipped with a single NVIDIA GeForce RTX 4090 graphics card boasting 24 GB of VRAM and pre-installed with Ubuntu, TensorFlow, PyTorch®, NVIDIA CUDA, and NVIDIA cuDNN, the Vector One is the optimal single-GPU desktop PC for deep learning. At less than $5,500, the desktop solution meets tighter budget requirements without sacrificing performance.
jcmarchi · 1 year ago
Mamba Explained
New Post has been published on https://thedigitalinsider.com/mamba-explained/
The State Space Model taking on Transformers
Right now, AI is eating the world.
And by AI, I mean Transformers. Practically all the big breakthroughs in AI over the last few years are due to Transformers.
Mamba, however, is one of an alternative class of models called State Space Models (SSMs). Importantly, for the first time, Mamba promises similar performance (and crucially similar scaling laws) as the Transformer whilst being feasible at long sequence lengths (say 1 million tokens). To achieve this long context, the Mamba authors remove the “quadratic bottleneck” in the Attention Mechanism. Mamba also runs fast – like “up to 5x faster than Transformer fast”1.
Mamba performs similarly (or slightly better than) other Language Models on The Pile (source)
Gu and Dao, the Mamba authors write:
Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Here we’ll discuss:
The advantages (and disadvantages) of Mamba (🐍) vs Transformers (🤖),
Analogies and intuitions for thinking about Mamba, and
What Mamba means for Interpretability, AI Safety and Applications.
Problems with Transformers – Maybe Attention Isn’t All You Need
We’re very much in the Transformer-era of history. ML used to be about detecting cats and dogs. Now, with Transformers, we’re generating human-like poetry, coding better than the median competitive programmer, and solving the protein folding problem.
But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions. For this lookback, we cache detailed information about each token in the so-called KV cache.
When using the Attention Mechanism, information from all previous tokens can be passed to the current token
This pairwise communication means a forward pass is O(n²) time complexity in training (the dreaded quadratic bottleneck), and each new token generated autoregressively takes O(n) time. In other words, as the context size increases, the model gets slower.
To add insult to injury, storing this key-value (KV) cache requires O(n) space.  Consequently, the dreaded CUDA out-of-memory (OOM) error becomes a significant threat as the memory footprint expands. If space were the only concern, we might consider adding more GPUs; however, with latency increasing quadratically, simply adding more compute might not be a viable solution.
On the margin, we can mitigate the quadratic bottleneck with techniques like Sliding Window Attention or clever CUDA optimisations like FlashAttention. But ultimately, for super long context windows (like a chatbot which remembers every conversation you’ve shared), we need a different approach.
Foundation Model Backbones
Fundamentally, all good ML architecture backbones have components for two important operations:
Communication between tokens
Computation within a token
The Transformer Block
In transformers, this is Attention (communication) and MLPs (computation). We improve transformers by optimising these two operations2.
We would like to substitute the Attention component3 with an alternative mechanism for facilitating inter-token communication. Specifically, Mamba employs a Control Theory-inspired State Space Model, or SSM, for Communication purposes while retaining Multilayer Perceptron (MLP)-style projections for Computation.
The Mamba Block
Like a Transformer made up of stacked transformer blocks, Mamba is made up of stacked Mamba blocks as above.
We would like to understand and motivate the choice of the SSM for sequence transformations.
Motivating Mamba – A Throwback to Temple Run
Imagine we’re building a Temple Run agent4. It chooses if the runner should move left or right at any time.
To successfully pick the correct direction, we need information about our surroundings. Let’s call the collection of relevant information the state. Here the state likely includes your current position and velocity, the position of the nearest obstacle, weather conditions, etc.
Claim 1: if you know the current state of the world and how the world is evolving, then you can use this to determine the direction to move.
Note that you don’t need to look at the whole screen all the time. You can figure out what will happen to most of the screen by noting that as you run, the obstacles move down the screen. You only need to look at the top of the screen to understand the new information and then simulate the rest.
This lends itself to a natural formulation. Let h be the hidden state, relevant knowledge about the world. Also let x be the input, the observation that you get each time. h’ then represents the derivative of the hidden state, i.e. how the state is evolving. We’re trying to predict y, the optimal next move (right or left).
Now, Claim 1 states that from the hidden state h, h’, and the new observation x, you can figure out y.
More concretely, h, the state, can be represented as a differential equation (Eq 1a):
$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$
Knowing h allows you to determine your next move y (Eq 1b):
$y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$
The system’s evolution is determined by its current state and newly acquired observations. A small new observation is enough, as the majority of the state can be inferred by applying known state dynamics to its previous state. That is, most of the screen isn’t new, it’s just a continuation of the previous state’s natural downward trajectory. A full understanding of the state would enable optimal selection of the subsequent action, denoted as y.
You can learn a lot about the system dynamics by observing the top of the screen. For instance, increased velocity of this upper section suggests an acceleration of the rest of the screen as well, so we can infer that the game is speeding up5. In this way, even if we start off knowing nothing about the game and only have limited observations, it becomes possible to gain a holistic understanding of the screen dynamics fairly rapidly.
What’s the State?
Here, state refers to the variables that, when combined with the input variables, fully determine the future system behaviour. In theory, once we have the state, there’s nothing else we need to know about the past to predict the future. With this choice of state, the system is converted to a Markov Decision Process. Ideally, the state is a fairly small amount of information which captures the essential properties of the system. That is, the state is a compression of the past6.
Discretisation – How To Deal With Living in a Quantised World
Okay, great! So, given some state and input observation, we have an autoregressive-style system to determine the next action. Amazing!
In practice though, there’s a little snag here. We’re modelling time as continuous. But in real life, we get new inputs and take new actions at discrete time steps7.
We would like to convert this continuous-time differential equation into a discrete-time difference equation. This conversion process is known as discretisation. Discretisation is a well-studied problem in the literature. Mamba uses the Zero-Order Hold (ZOH) discretisation8. To give an idea of what’s happening morally, consider a naive first-order approximation9.
From Equation 1a, we have
$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$
And for small ∆,
$h'(t) \approx \frac{h(t+\Delta) - h(t)}{\Delta}$
by the definition of the derivative.
We let:
$h_t = h(t)$
and
$h_{t+1} = h(t + \Delta)$
and substitute into Equation 1a giving:
$h_{t+1} - h_t \approx \Delta (\mathbf{A}h_t + \mathbf{B}x_t)$ $\Rightarrow h_{t+1} \approx (I + \Delta \mathbf{A})h_t + (\Delta \mathbf{B})x_t$
Hence, after renaming the coefficients and relabelling indices, we have the discrete representations:
The Discretised Version of the SSM Equation
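(The equation image is not reproduced here. Written out in the notation above, the discretised system takes the form $h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t$ and $y_t = \mathbf{C}h_t + \mathbf{D}x_t$, where under the first-order approximation above $\bar{\mathbf{A}} = I + \Delta\mathbf{A}$ and $\bar{\mathbf{B}} = \Delta\mathbf{B}$; the ZOH discretisation instead gives $\bar{\mathbf{A}} = \exp(\Delta\mathbf{A})$.)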
If you’ve ever looked at an RNN before10 and this feels familiar – trust your instincts:
We have some input x, which is combined with the previous hidden state by some transform to give the new hidden state. Then we use the hidden state to calculate the output at each time step.
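To make the recurrence concrete, here is a plain C++ sketch written for this post's discrete equations (an illustration, not code from the Mamba paper) that scans a 1-D input sequence with state dimension n:

#include <vector>

// One channel of a discretised SSM: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
// A_bar is n x n, B_bar and C have length n; x is the input sequence.
std::vector<float> ssmScan(const std::vector<std::vector<float>> &A_bar,
                           const std::vector<float> &B_bar,
                           const std::vector<float> &C,
                           const std::vector<float> &x) {
    const size_t n = B_bar.size();
    std::vector<float> h(n, 0.0f), y;
    y.reserve(x.size());
    for (float x_t : x) {
        std::vector<float> h_new(n, 0.0f);
        for (size_t i = 0; i < n; ++i) {
            for (size_t j = 0; j < n; ++j)
                h_new[i] += A_bar[i][j] * h[j];   // A_bar h_{t-1}
            h_new[i] += B_bar[i] * x_t;           // + B_bar x_t
        }
        h = h_new;
        float y_t = 0.0f;
        for (size_t i = 0; i < n; ++i)
            y_t += C[i] * h[i];                   // y_t = C h_t
        y.push_back(y_t);
    }
    return y;
}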
Understanding the SSM Matrices
Now, we can interpret the A, B, C, D matrices more intuitively:
A is the transition state matrix. It shows how you transition the current state into the next state. It asks “How should I forget the less relevant parts of the state over time?”
B is mapping the new input into the state, asking “What part of my new input should I remember?”11
C is mapping the state to the output of the SSM. It asks, “How can I use the state to make a good next prediction?”12
D is how the new input passes through to the output. It’s a kind of modified skip connection that asks “How can I use the new input in my prediction?”
Visual Representation of The SSM Equations
Additionally, ∆ has a nice interpretation – it’s the step size, or what we might call the linger time or the dwell time. For large ∆, you focus more on that token; for small ∆, you skip past the token immediately and don’t include it much in the next state.
(source)
And that’s it! That’s the SSM, our ~drop-in replacement for Attention (Communication) in the Mamba block. The Computation in the Mamba architecture comes from regular linear projections, non-linearities, and local convolutions.
Okay great, that’s the theory – but does this work? Well…
Effectiveness vs Efficiency: Attention is Focus, Selectivity is Prioritisation
At WWDC ‘97, Steve Jobs famously noted that “focusing is about saying no”. Focus is ruthless prioritisation. It’s common to think about Attention positively as choosing what to notice. In the Steve Jobs sense, we might instead frame Attention negatively as choosing what to discard.
There’s a classic intuition pump in Machine Learning known as the Cocktail Party Problem13. Imagine a party with dozens of simultaneous loud conversations:
Question:
How do we recognise what one person is saying when others are talking at the same time?14
Answer:
The brain solves this problem by focusing your “attention” on a particular stimulus and hence drowning out all other sounds as much as possible.
Transformers use Dot-Product Attention to focus on the most relevant tokens. A big reason Attention is so great is that you have the potential to look back at everything that ever happened in its context. This is like photographic memory when done right.15
Transformers (🤖) are extremely effective. But they aren’t very efficient. They store everything from the past so that they can look back at tokens with theoretically perfect recall.
Traditional RNNs (🔁) are the opposite – they forget a lot, only recalling a small amount in their hidden state and discarding the rest. They are very efficient – their state is small. Yet they are less effective as discarded information cannot be recovered.
We’d like something closer to the Pareto frontier of the effectiveness/efficiency tradeoff. Something that’s more effective than traditional RNNs and more efficient than transformers.
The Mamba Architecture seems to offer a solution which pushes out the Pareto frontier of effectiveness/efficiency.
SSMs are as efficient as RNNs, but we might wonder how effective they are. After all, it seems like they would have a hard time discarding only unnecessary information and keeping everything relevant. If each token is being processed the same way, applying the same A and B matrices as if in a factory assembly line for tokens, there is no context-dependence. We would like the forgetting and remembering matrices (A and B respectively) to vary and dynamically adapt to inputs.
The Selection Mechanism
Selectivity allows each token to be transformed into the state in a way that is unique to its own needs. Selectivity is what takes us from vanilla SSM models (applying the same A (forgetting) and B (remembering) matrices to every input) to Mamba, the Selective State Space Model.
In regular SSMs, A, B, C and D are learned matrices – that is
$\mathbf{A} = \mathbf{A}_\theta$ etc. (where θ represents the learned parameters)
With the Selection Mechanism in Mamba, A, B, C and D are also functions of x. That is $\mathbf{A} = \mathbf{A}_\theta(x)$ etc.; the matrices are context dependent rather than static.
Mamba (right) differs from traditional SSMs by allowing A,B,C matrices to be selective i.e. context dependent (source)
Making A and B functions of x allows us to get the best of both worlds:
We’re selective about what we include in the state, which improves effectiveness vs traditional SSMs.
Yet, since the state size is bounded, we improve on efficiency relative to the Transformer. We have O(1), not O(n) space and O(n) not O(n²) time requirements.
The Mamba paper authors write:
The efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension.
Humans (mostly) don’t have photographic memory for everything they experience within a lifetime – or even within a day! There’s just way too much information to retain it all. Subconsciously, we select what to remember by choosing to forget, throwing away most information as we encounter it. Transformers (🤖) decide what to focus on at recall time. Humans (🧑) also decide what to throw away at memory-making time. Humans filter out information early and often.
If we had infinite capacity for memorisation, it’s clear the transformer approach is better than the human approach – it truly is more effective. But it’s less efficient – transformers have to store so much information about the past that might not be relevant. Transformers (🤖) only decide what’s relevant at recall time. The innovation of Mamba (🐍) is allowing the model better ways of forgetting earlier – it’s focusing by choosing what to discard using Selectivity, throwing away less relevant information at memory-making time16.
The Problems of Selectivity
Applying the Selection Mechanism does have its gotchas though. Non-selective SSMs (i.e. A, B not dependent on x) are fast to compute in training. This is because the component of $y_t$ which depends on $x_i$ can be expressed as a linear map, i.e. a single matrix that can be precomputed!
For example (ignoring the D component, the skip connection):
$$y_2 = \mathbf{C}\mathbf{B}x_2 + \mathbf{C}\mathbf{A}\mathbf{B}x_1 + \mathbf{C}\mathbf{A}\mathbf{A}\mathbf{B}x_0$$
If we’re paying attention, we might spot something even better here – this expression can be written as a convolution. Hence we can apply the Fast Fourier Transform and the Convolution Theorem to compute this very efficiently on hardware as in Equation 3 below.
We can calculate Equation 2, the SSM equations, efficiently in the Convolutional Form, Equation 3.
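(The figure with Equation 3 is not reproduced here. In symbols, the precomputed kernel is $K = (\mathbf{C}\mathbf{B},\ \mathbf{C}\mathbf{A}\mathbf{B},\ \dots,\ \mathbf{C}\mathbf{A}^{L-1}\mathbf{B})$ for sequence length $L$, and the output is the convolution $y = x * K$.)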
Unfortunately, with the Selection Mechanism, we lose the convolutional form. Much attention is given to making Mamba efficient on modern GPU hardware using similar hardware optimisation tricks to Tri Dao’s Flash Attention17. With the hardware optimisations, Mamba is able to run faster than comparably sized Transformers.
Machine Learning for Political Economists – How Large Should The State Be?
The Mamba authors write, “the efficiency vs. effectiveness tradeoff of sequence models is characterised by how well they compress their state”. In other words, like in political economy18, the fundamental problem is how to manage the state.
🔁 Traditional RNNs are anarchic
They have a small, minimal state. The size of the state is bounded. The compression of state is poor.
🤖 Transformers are communist
They have a maximally large state. The “state” is just a cache of the entire history with no compression. Every context token is treated equally until recall time.
🐍Mamba has a compressed state
…but it’s selective about what goes in. Mamba says we can get away with a small state if the state is well focused and effective19.
Language Models and State Size
The upshot is that state representation is critical. A smaller state is more efficient; a larger state is more effective. The key is to selectively and dynamically compress data into the state. Mamba’s Selection Mechanism allows for context-dependent reasoning, focusing and ignoring. For both performance and interpretability, understanding the state seems to be very useful.
Information Flow in Transformer vs Mamba
How do Transformers know anything? At initialization, a transformer isn’t very smart. It learns in two ways:
Training data (Pretraining, SFT, RLHF etc)
In context-data
Training Data
Models learn from their training data. This is a kind of lossy compression of input data into the weights. We can think of the effect of pretraining data on the transformer kinda like the effect of your ancestor’s experiences on your genetics – you can’t recall their experiences, you just have vague instincts about them20.
In Context-Data
Transformers use their context as short-term memory, which they can recall with ~perfect fidelity. So we get In-Context Learning, e.g. using induction heads to solve the Indirect Object Identification task, or computing Linear Regression.
Retrieval
Note that Transformers don’t filter their context at all until recall time. So if we have a bunch of information we think might be useful to the Transformer, we filter it outside the Transformer (using Information Retrieval strategies) and then stuff the results into the prompt. This process is known as Retrieval Augmented Generation (RAG). RAG determines relevant information for the context window of a transformer. A human with the internet is kinda like a RAG system – you still have to know what to search but whatever you retrieve is as salient as short-term memory to you.
Information Flow for Mamba
Training Data acts similarly for Mamba. However, the lines are slightly blurred for in-context data and retrieval. In-context data for Mamba is compressed/filtered similar to retrieval data for transformers. This in-context data is also accessible for look-up like for transformers (although with somewhat lower fidelity).
Transformer context is to Mamba states what short-term is to long-term memory. Mamba doesn’t just have “RAM”, it has a hard drive21 22.
Swapping States as a New Prompting Paradigm
Currently, we often use RAG to give a transformer contextual information.
With Mamba-like models, you could instead imagine having a library of states created by running the model over specialised data. States could be shared kinda like LoRAs for image models.
For example, I could do inference on 20 physics textbooks and, say, 100 physics questions and answers. Then I have a state which I can give to you. Now you don’t need to add any few-shot examples; you just simply ask your question. The in-context learning is in the state.
In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges. And note that “training” a state doesn’t require any backprop. It’s more like a highly specialised one-pass fixed-size compression algorithm. This is unlimited in-context learning applied at inference time for zero-compute or latency23.
The structure of an effective LLM call goes from…
System Prompt
Preamble
Few shot-examples
Question
…for Transformers, to simply…
Inputted state (with problem context, initial instructions, textbooks, and few-shot examples)
Short question
…for Mamba.
This is cheaper and faster than few-shot prompting (as the state is infinitely reusable without inference cost). It’s also MUCH cheaper than finetuning and doesn’t require any gradient updates. We could imagine retrieving states in addition to context.
Mamba & Mechanistic Interpretability
Transformer interpretability typically involves:
understanding token relationships via attention,
understanding circuits, and
using Dictionary Learning for unfolding MLPs.
Most of the ablations that we would like to do for Mamba are still valid, but understanding token communication (1) is now more nuanced. All information moves between tokens via hidden states instead of the Attention Mechanism which can “teleport” information from one sequence position to another.
For understanding in-context learning (ICL) tasks with Mamba, we will look to intervene on the SSM state. A classic task in-context learning task is Indirect Object Identification in which a model has to finish a paragraph like:
Then, Shelby and Emma had a lot of fun at the school. [Shelby/Emma] gave an apple to [BLANK]
The model is expected to fill in the blank with the name that is not repeated in the paragraph. In the chart below we can see that information is passed from the [Shelby/Emma] position to the final position via the hidden state (see the two blue lines in the top chart).
Since it’s hypothesised that much of In-Context Learning in Transformers is downstream of more primitive sequence position operations (like Induction Heads), Mamba being able to complete this task suggests a more general In-Context Learning ability.
What’s Next for Mamba & SSMs?
Mamba-like models are likely to excel in scenarios requiring extremely long context and long-term memory. Examples include:
Processing DNA
Generating (or reasoning over) video
Writing novels
An illustrative example is agents with long-term goals.
Suppose you have an agent interacting with the world. Eventually, its experiences become too much for the context window of a transformer. The agent then has to compress or summarise its experiences into some more compact representation.
But how do you decide what information is the most useful as a summary? If the task is language, LLMs are actually fairly good at summaries – okay, yeah, you’ll lose some information, but the most important stuff can be retained.
However, for other disciplines, it might not be clear how to summarise. For example, what’s the best way to summarise a 2 hour movie?24. Could the model itself learn to do this naturally rather than a hacky workaround like trying to describe the aesthetics of the movie in text?
This is what Mamba allows. Actual long-term memory. A real state where the model learns to keep what’s important. Prediction is compression – learning what’s useful to predict what’s coming next inevitably leads to building a useful compression of the previous tokens.
The implications for Assistants are clear:
Your chatbot co-evolves with you. It remembers.
The film HER is looking better and better as time goes on 😳
Agents & AI Safety
One reason for positive updates in existential risk from AGI is Language Models. Previously, Deep-RL agents trained via self-play looked set to be the first AGIs. Language models are inherently much safer since they aren’t trained with long-term goals25.
The potential for long-term sequence reasoning here brings back the importance of agent-based AI safety. Few agent worries are relevant to Transformers with an 8k context window. Many are relevant to systems with impressive long-term memories and possible instrumental goals.
The Best Collab Since Taco Bell & KFC: 🤖 x 🐍
The Mamba authors show that there’s value in combining Mamba’s long context with the Transformer’s high fidelity over short sequences. For example, if you’re making long videos, you likely can’t fit a whole movie into a Transformer’s context for attention26. You could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency27.
This isn’t the end for Transformers. Their high effectiveness is exactly what’s needed for many tasks. But now Transformers aren’t the only option. Other architectures are genuinely feasible.
So we’re not in the post-Transformer era. But for the first time, we’re living in the post-only-Transformers era28. And this blows the possibilities wide open for sequence modelling with extreme context lengths and native long-term memory.
Two ML researchers, Sasha Rush (HuggingFace, Annotated Transformer, Cornell Professor) and Jonathan Frankle (Lottery Ticket Hypothesis, MosaicML, Harvard Professor), currently have a bet here.
Currently Transformers are far and away in the lead. With 3 years left, there’s now a research direction with a fighting chance.
All that remains to ask is: Is Attention All We Need?
1. See Figure 8 in the Mamba paper.
2. And scaling up with massive compute.
3. More specifically the scaled dot-product Attention popularised by Transformers.
4. For people who don’t see Temple Run as the cultural cornerstone it is 🤣 Temple Run was an iPhone game from 2011 similar to Subway Surfer.
5. Here we assume the environment is sufficiently smooth.
6. One pretty important constraint for this to be efficient is that we don’t allow the individual elements of the state vector to interact with each other directly. We’ll use a combination of the state dimensions to determine the output but we don’t e.g. allow the velocity of the runner and the direction of the closest obstacle (or whatever else was in our state) to directly interact. This helps with efficient computation and we achieve this practically by constraining A to be a diagonal matrix.
7. Concretely consider the case of Language Models – each token is a discrete step.
8. ZOH also has nice properties for the initialisations – we want A_bar to be close to the identity so that the state can be mostly maintained from timestep to timestep if desired. ZOH gives A_bar as an exponential so any diagonal element initialisations close to zero give values close to 1.
9. This is known as the Euler discretisation in the literature.
10. It’s wild to note that some readers might not have – we’re so far into the age of Attention that RNNs have been forgotten!
11. B is like the Query (Q) matrix for Transformers.
12. C is like the Output (O) matrix for Transformers.
13. Non-alcoholic options also available!
14. Especially as all voices roughly occupy the same space on the audio frequency spectrum. Intuitively this seems really hard!
15. Note that photographic memory doesn’t necessarily imply perfect inferences from that memory!
16. To be clear, if you have a short sequence, then a transformer should theoretically be a better approach. If you can store the whole context, then why not!? If you have enough memory for a high-resolution image, why compress it into a JPEG? But Mamba-style architectures are likely to hugely outperform with long-range sequences.
17. More details are available for engineers interested in CUDA programming – Tri’s talk, Mamba paper section 3.3.2, and the official CUDA code are good resources for understanding the Hardware-Aware Scan.
18. Or in Object Oriented Programming.
19. Implications to actual Political Economy are left to the reader but maybe Gu and Dao accidentally solved politics!?
20. This isn’t a perfect analogy as human evolution follows a genetic algorithm rather than SGD.
21. Albeit a pretty weird hard drive at that – it morphs over time rather than being a fixed representation.
22. As a backronym, I’ve started calling the hidden_state the state space dimension (or selective state dimension) which shortens to SSD, a nice reminder for what this object represents – the long-term memory of the system.
23. I’m thinking about this similarly to the relationship between harmlessness finetuning and activation steering. State swapping, like activation steering, is an inference time intervention giving comparable results to its train time analogue.
24. This is a very non-trivial problem! How do human brains represent a movie internally? It’s not a series of the most salient frames, nor is it a text summary of the colours, nor is it a purely vibes-based summary if you can memorise some lines of the film.
25. They’re also safer since they inherently understand (though don’t necessarily embody) human values. It’s not at all clear how to teach an RL agent human morality.
26. Note that typically an image (i.e. a single frame) counts as >196 tokens, and movies are typically 24 fps so you’ll fill a 32k context window in 7 seconds 🤯
27. Another possibility that I’m excited about is applying optimisation pressure to the state itself as well as the output to have models that respect particular use cases.
28. This is slightly hyperbolic: the TS-Mixer for time series, Gradient Boosting Trees for tabular data and Graph Neural Networks for weather prediction exist and are currently used, but these aren’t at the core of AI.
Author Bio
Kola Ayonrinde is a Research Scientist and Machine Learning Engineer with a flair for writing. He integrates technology and creativity, focusing on applying machine learning in innovative ways and exploring the societal impacts of tech advancements.
Acknowledgements
This post was originally posted on Kola’s personal blog.
Thanks to Gonçalo for reading an early draft, Jaden for the nnsight library used for the interpretability analysis, and Tessa for the Mamba patching visualisations. Also see: Mamba paper, Mamba Python code, Annotated S4, Nathan Labenz podcast
Citation
For attribution in academic contexts or books, please cite this work as
Kola Ayonrinde, "Mamba Explained," The Gradient, 2024
@article{Ayonrinde2024mamba,
  author = {Kola Ayonrinde},
  title = {Mamba Explained},
  journal = {The Gradient},
  year = {2024},
  howpublished = {\url{https://thegradient.pub/mamba-explained}},
}
vndta-vps · 3 months ago
Text
How to Optimise Performance When Using a Rented GPU Server
Rented GPU servers are becoming a popular solution for businesses and individuals that need to process big data, run AI and Machine Learning workloads, or do intensive graphics work. However, to get the most performance out of a GPU server, you need a sensible optimisation strategy. This article walks through how to optimise performance when using a rented GPU server.
Choose a suitable GPU configuration
Not every project needs the highest-end GPU configuration. Choosing a configuration that matches the workload saves money and optimises performance. (A short sketch after this list shows how to verify the hardware you actually got.)
AI & Machine Learning: Choose a GPU with many CUDA cores and plenty of VRAM, such as the NVIDIA A100, RTX 3090, or Tesla V100.
Graphics rendering & CGI: You need a GPU with high VRAM and wide memory bandwidth, such as the RTX 4090 or Quadro RTX.
Big data processing: Choose a GPU with high parallel-processing throughput, such as the NVIDIA H100 or AMD Instinct MI250.
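Once the server is provisioned, it is worth confirming that the GPU you are paying for matches the spec sheet. The minimal sketch below assumes PyTorch is installed on the server; it queries device properties (name, total VRAM, streaming-multiprocessor count) so you can compare them against the advertised configuration.

import torch

def describe_gpus():
    """Print basic properties of every CUDA device visible to PyTorch."""
    if not torch.cuda.is_available():
        print("No CUDA-capable GPU is visible to PyTorch.")
        return
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        print(f"GPU {idx}: {props.name}")
        print(f"  Total VRAM: {props.total_memory / 1024**3:.1f} GiB")
        print(f"  Streaming multiprocessors: {props.multi_processor_count}")
        print(f"  Compute capability: {props.major}.{props.minor}")

if __name__ == "__main__":
    describe_gpus()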
Optimise software and drivers
Installing the right software and drivers significantly improves GPU server performance.
Update to the latest drivers: Use a stable driver release from NVIDIA or AMD.
Configure CUDA/cuDNN properly: For AI/ML workloads, check that the CUDA/cuDNN versions are compatible with your framework (TensorFlow, PyTorch) – see the short check after this list.
Optimise the virtualisation environment: If you run in Docker, make sure containers can access the GPU directly via the NVIDIA Container Toolkit.
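As a quick sanity check for the CUDA/cuDNN point above, here is a small sketch, assuming PyTorch is the framework in use. It prints the CUDA and cuDNN versions the framework was built against and confirms that a simple tensor operation actually runs on the GPU.

import torch

def check_cuda_stack():
    """Report the CUDA/cuDNN versions PyTorch was built with and run a GPU smoke test."""
    print(f"PyTorch version:  {torch.__version__}")
    print(f"CUDA available:   {torch.cuda.is_available()}")
    print(f"Built with CUDA:  {torch.version.cuda}")
    print(f"cuDNN version:    {torch.backends.cudnn.version()}")

    if torch.cuda.is_available():
        # A tiny matmul on the GPU confirms the driver and runtime work together.
        x = torch.randn(1024, 1024, device="cuda")
        y = x @ x
        torch.cuda.synchronize()
        print(f"GPU smoke test OK on: {torch.cuda.get_device_name(0)}")

if __name__ == "__main__":
    check_cuda_stack()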
Use GPU resources efficiently
Optimising how you use GPU resources reduces load and speeds up processing.
Take advantage of Multi-Instance GPU (MIG): If you run many small tasks, partition the GPU into several independent instances.
Batch processing: Process data in large batches instead of many small pieces to make full use of memory bandwidth (a sketch of this follows the list).
Parallel computing: Write code that follows a parallel-processing model so the full GPU stays busy.
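To illustrate the batch-processing point, here is a minimal sketch assuming PyTorch and a toy in-memory dataset; the model and sizes are placeholders, not recommendations. It runs inference in large batches via a DataLoader rather than one sample at a time, which keeps the GPU's memory bandwidth and compute units saturated.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy data and model; in practice these would be your real dataset and network.
features = torch.randn(100_000, 256)
dataset = TensorDataset(features)
loader = DataLoader(dataset, batch_size=4096, num_workers=4, pin_memory=True)

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
model.eval()

outputs = []
with torch.no_grad():
    for (batch,) in loader:
        # non_blocking=True overlaps host-to-device copies with compute when pin_memory is set.
        batch = batch.to(device, non_blocking=True)
        outputs.append(model(batch).cpu())

predictions = torch.cat(outputs)
print(predictions.shape)  # torch.Size([100000, 10])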
Monitor and manage performance
To keep performance stable, monitor and tune the system regularly.
Use NVIDIA SMI or AMD ROCm tools: Track GPU temperature, memory usage, and utilisation (a small scripted example follows this list).
Set a sensible power limit: Cap GPU power draw to avoid thermal overload.
Integrate monitoring tools such as Prometheus + Grafana: Build a dashboard for real-time GPU performance tracking.
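For scripted monitoring on NVIDIA hardware, the sketch below uses the pynvml bindings (the library behind nvidia-smi; it assumes the nvidia-ml-py package is installed) to poll utilisation, memory, and temperature – the raw numbers you would export to Prometheus/Grafana.

import pynvml

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {idx}: {util.gpu}% busy, "
              f"{mem.used / 1024**2:.0f}/{mem.total / 1024**2:.0f} MiB VRAM, "
              f"{temp} C")
finally:
    pynvml.nvmlShutdown()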
Conclusion
Optimising the performance of a rented GPU server not only speeds up processing but also cuts operating costs. By choosing a suitable configuration, tuning the software stack, using resources efficiently, and monitoring the system closely, you can extract the full power of the GPU. If you need more help optimising your system, don't hesitate to contact GPU specialists!
See more: https://vndata.vn/may-chu-do-hoa-gpu/
govindhtech · 1 month ago
Text
AMD Radeon RX 7900 XTX Vs 4090 Performance Benchmarks
AMD RX 7900 XTX versus 4090
The RTX 4090 and RX 7900 XTX sit at the top of the consumer GPU performance charts. Both cards target AI, 4K gaming, 3D rendering, and scientific workloads, but their cost, feature sets, and performance efficiency suit different user categories.
Features and Tech
Ray Tracing
NVIDIA's ray tracing is more advanced and effective: the RTX 4090 has more ray tracing cores and a more efficient design. AMD's RDNA 3 has improved, but the RX 7900 XTX still trails in ray-traced workloads.
DLSS vs. FSR Upscaling
DLSS 3.5 (RTX 4090) includes frame generation for smoother gameplay at low native frame rates, and many AAA games support it.
FSR 3 (RX 7900 XTX) is more open but less widely adopted, and its frame generation lags behind DLSS.
AI and Compute Tasks
Tensor cores let the RTX 4090 outperform the RX 7900 XTX in deep learning, AI inference, and CUDA-optimised professional content creation apps such as Blender, DaVinci Resolve, and TensorFlow. AMD lacks a comparable AI-specific feature set.
Power Efficiency
RX 7900 XTX: 355 W TDP
RTX 4090: 450 W TDP
AMD's better efficiency per watt means less heat and a lighter load on the PSU.
Use Case Ideas
Choose the RTX 4090 if:
You need the best possible 4K or 8K performance.
You play modern games that use DLSS 3.5 and ray tracing.
You use Blender, CUDA, or AI/ML software.
Choose the RX 7900 XTX if:
You want top-tier gaming for under $1,000.
Your gaming is dominated by rasterised 4K or 1440p settings.
You want lower power draw and DisplayPort 2.1 support.
In GPU-intensive games and ray-traced environments, the RTX 4090 delivers 20–35% higher FPS. In rasterised (non-ray-traced) workloads the gap narrows, and the RX 7900 XTX provides high frame rates at a lower price.
The RTX 4090's additional CUDA and AI-optimised Tensor cores give it the edge in compute-intensive work. The RX 7900 XTX counters with better power efficiency and DisplayPort 2.1 support, versus DisplayPort 1.4a on the 4090.
In conclusion
The NVIDIA RTX 4090 dominates on performance. It excels in raw power, ray tracing, upscaling (DLSS), and AI workloads, making it ideal for professionals and enthusiasts who can justify its price. For pure gaming performance without ray tracing, the RX 7900 XTX offers 80–90% of the RTX 4090's performance at roughly 60% of its price.
If your workload relies on advanced ray tracing, CUDA, or NVIDIA's software ecosystem, the 4090 is worth the premium. The RX 7900 XTX is a great choice for gamers who want to play modern titles at high resolutions for under $1,000.
shalcool15 · 1 year ago
Text
Leveraging Machine Learning in Python with TensorFlow 2 and PyTorch
In the vast and ever-evolving landscape of machine learning (ML), Python stands as a beacon for developers and researchers alike, offering an intuitive syntax coupled with a robust ecosystem of libraries and frameworks. Among these, TensorFlow 2 and PyTorch have emerged as frontrunners, each with its unique strengths and community of supporters. This blog delves into how TensorFlow 2 and PyTorch can be harnessed to drive innovation and efficiency in ML projects, providing a practical guide for practitioners working with these powerful tools.
Introduction to TensorFlow 2
Developed by Google, TensorFlow 2 is an open-source library for research and production. It offers an ecosystem of tools, libraries, and community resources that allow developers to build and deploy ML-powered applications. TensorFlow 2 made significant improvements over its predecessor, focusing on simplicity and ease of use. Its eager execution mode, enabled by default, makes for more intuitive coding and immediate feedback, which is essential for debugging and experimentation.
Key Features of TensorFlow 2
Eager Execution: TensorFlow 2 executes operations immediately, making it easier to get started with and debug, and providing a more pythonic feel (a short example follows this list).
Keras Integration: Tight integration with Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow. This simplifies model creation and experimentation.
Distributed Training: TensorFlow 2 supports distributed training strategies out of the box, enabling models to be trained on multiple CPUs, GPUs, or TPUs without significant code changes.
Model Deployment: TensorFlow offers various tools like TensorFlow Serving, TensorFlow Lite, and TensorFlow.js for deploying models across different platforms easily.
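As a minimal illustration of eager execution and the Keras integration (assuming TensorFlow 2 is installed; the layer sizes are arbitrary), operations return concrete values immediately, and a Keras model can be defined and compiled in a few lines:

import tensorflow as tf

# Eager execution: the multiplication runs immediately and returns a concrete tensor.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(tf.matmul(a, b).numpy())  # [[19. 22.] [43. 50.]]

# Keras integration: a small model defined and compiled in a few lines.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()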
Introduction to PyTorch
PyTorch, developed by Facebook's AI Research lab, has rapidly gained popularity for its ease of use, efficiency, and dynamic computation graph that offers flexibility in ML model development. It is particularly favored for academic research and prototyping, where its dynamic nature allows for iterative and exploratory approaches to model design and testing.
Key Features of PyTorch
Dynamic Computation Graphs: PyTorch uses dynamic computation graphs, meaning the graph is built on the fly as operations are performed. This makes it easy to change how your network behaves at runtime with minimal code (a short example follows this list).
Pythonic Nature: PyTorch is deeply integrated with Python, making it more intuitive for developers who are already familiar with Python.
Extensive Libraries: It has a rich ecosystem of libraries and tools, such as TorchVision for computer vision tasks, making it easier to implement complex models.
Strong Support for CUDA: PyTorch offers seamless CUDA integration, ensuring efficient use of GPUs for training and inference, making it highly scalable and fast.
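Here is a small sketch of those two points, assuming PyTorch is installed and the module name and sizes are illustrative: the forward pass can contain ordinary Python control flow (the graph is rebuilt on every call), and moving the model and data to the GPU is a one-line .to(device).

import torch
from torch import nn

class BranchyNet(nn.Module):
    """The forward pass uses plain Python control flow; the graph is traced anew on every call."""
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(16, 4)
        self.large = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, x):
        # Data-dependent branching, something a static graph would need special ops for.
        if x.norm() > 10:
            return self.large(x)
        return self.small(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BranchyNet().to(device)        # CUDA support: move parameters to the GPU
x = torch.randn(8, 16, device=device)  # ...and create the input there too
print(model(x).shape)                  # torch.Size([8, 4])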
Comparing TensorFlow 2 and PyTorch
While both TensorFlow 2 and PyTorch are powerful in their rights, they cater to different preferences and project requirements.
Ease of Use: PyTorch is often praised for its more intuitive and straightforward syntax, making it a favorite among researchers and those new to ML. TensorFlow 2, with its integration of Keras, has significantly closed the gap, offering a much simpler API for model development.
Performance and Scalability: TensorFlow 2 tends to have an edge in deployment and scalability, especially in production environments. Its comprehensive suite of tools for serving models and performing distributed training is more mature.
Community and Support: Both top Python frameworks boast large and active communities. TensorFlow, being older, has a broader range of resources, tutorials, and support. However, PyTorch has seen rapid growth in its community, especially in academic circles, due to its flexibility and ease of use.
Practical Applications
Implementing ML projects with TensorFlow 2 or PyTorch involves several common steps: data preprocessing, model building, training, evaluation, and deployment. Here, we'll briefly outline how a typical ML project could be approached with both frameworks, focusing on a simple neural network for image classification, with a short code sketch after each outline.
TensorFlow 2 Workflow
Data Preprocessing: Utilize TensorFlow’s tf.data API to load and preprocess your dataset efficiently.
Model Building: Leverage Keras to define your model. You can use a sequential model with convolutional layers for a simple image classifier.
Training: Compile your model with an optimizer, loss function, and metrics. Use the model.fit() method to train it on your data.
Evaluation and Deployment: Evaluate your model’s performance with model.evaluate(). Deploy it using TensorFlow Serving or TensorFlow Lite for mobile devices.
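Putting those four steps together, here is a hedged sketch of the TensorFlow 2 workflow using the built-in MNIST dataset as a stand-in for a real image dataset; the layer sizes and epoch count are illustrative, not tuned.

import tensorflow as tf

# 1. Data preprocessing with tf.data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10_000).batch(128)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(128)

# 2. Model building with Keras: a small convolutional classifier
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# 3. Training
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)

# 4. Evaluation (deployment would then use TensorFlow Serving / TensorFlow Lite)
loss, acc = model.evaluate(test_ds)
print(f"Test accuracy: {acc:.3f}")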
PyTorch Workflow
Data Preprocessing: Use torchvision.transforms to preprocess your images. torch.utils.data.DataLoader is handy for batching and shuffling.
Model Building: Define your neural network class by extending torch.nn.Module. Implement the forward method to specify the network's forward pass.
Training: Prepare your loss function and optimizer from torch.nn and torch.optim, respectively. Iterate over your dataset, and use backpropagation to train your model.
Evaluation and Deployment: Evaluate the model on a test set. For deployment, you can export your model using TorchScript or convert it for use with ONNX for cross-platform compatibility.
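And the corresponding PyTorch sketch, again using MNIST via torchvision as a placeholder dataset; the download path, layer sizes, and epoch count are illustrative assumptions.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Data preprocessing with torchvision.transforms and DataLoader
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST("./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=256)

# 2. Model building: subclass nn.Module and define forward()
class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 13 * 13, 10),
        )

    def forward(self, x):
        return self.net(x)

model = Classifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# 3. Training loop with backpropagation
for epoch in range(3):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# 4. Evaluation (export via TorchScript/ONNX would follow for deployment)
correct = 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
print(f"Test accuracy: {correct / len(test_ds):.3f}")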
Conclusion
Both TensorFlow 2 and PyTorch offer unique advantages and have their place in the ML ecosystem. TensorFlow 2 stands out for its extensive deployment tools and scalability, making it ideal for production environments. PyTorch, with its dynamic computation graph and intuitive design, excels in research and rapid prototyping.
Your choice between TensorFlow 2 and PyTorch may depend on specific project needs, your comfort with Python, and the ecosystem you're most aligned with. Whichever you choose, both frameworks are continuously evolving, driven by vibrant communities and the shared goal of making ML more accessible and powerful.
In leveraging these frameworks, practitioners are equipped with the tools necessary to push the boundaries of what's possible with ML, driving innovation and creating solutions that were once deemed futuristic. As we continue to explore the potential of ML, TensorFlow 2 and PyTorch will undoubtedly play pivotal roles in shaping the future of technology.