Fastvideo SDK benchmarks on NVIDIA Quadro RTX 6000
Fastvideo SDK for Image and Video Processing on NVIDIA GPU offers very fast performance and high image quality. We have now tested Fastvideo SDK on the NVIDIA® Quadro RTX™ 6000, which is powered by the NVIDIA Turing™ architecture and the NVIDIA RTX™ platform. That technology brings the most significant advancement in computer graphics in over a decade to professional workflows, and the new hardware is intended to dramatically boost the performance of image and video processing. To check that, we've run benchmarks for the most frequently utilized image processing features.
We've measured processing time for the most frequently used image processing algorithms: demosaic, resize, denoise, JPEG encoding and decoding, JPEG2000 encoding, etc. This is just a small part of the Fastvideo SDK modules, but it is enough to understand the performance speedup on the new hardware.
To evaluate more complicated image processing pipelines, we suggest downloading and testing the Fast CinemaDNG Processor software, which is based on Fastvideo SDK. With that software you can build your own pipeline and check the benchmarks on your own images.
How we do benchmarking
As usual, performance benchmarks only give a general idea about processing speed: exact values depend on the OS, hardware, image content, resolution and bit depth, processing parameters, the approach to time measurements, etc. The nature of a particular image processing task may also call for a specific type of benchmarking.
To get maximum performance from any GPU software, we need to ensure maximum GPU occupancy, which is not easy to accomplish. That's why we evaluate maximum performance in the following ways (a minimal sketch of the copy/compute overlap approach follows the list):
Repetition of a particular function to get averaged computation time
Multithreading with copy/compute overlap
Software profiling with NVIDIA Visual Profiler to get total GPU time of all kernels for a particular image processing module
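As an illustration of the copy/compute overlap approach, here is a minimal timing sketch in plain CUDA. The kernel is just a dummy stand-in for a real Fastvideo SDK module, and the frame size and stream count are our assumptions, not the SDK's defaults.

```cpp
// Minimal copy/compute overlap benchmark sketch (plain CUDA, not the Fastvideo SDK API).
// processKernel is a dummy stand-in for a real image processing module.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void processKernel(const unsigned char* in, unsigned char* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];                         // dummy per-pixel work
}

int main()
{
    const size_t frameSize = 3840 * 2160 * 3;                // one 4K 24-bit frame (assumption)
    const int nStreams = 4, nFrames = 400;

    unsigned char *hIn, *hOut, *dIn[nStreams], *dOut[nStreams];
    cudaHostAlloc((void**)&hIn,  frameSize, cudaHostAllocDefault);   // pinned memory is required
    cudaHostAlloc((void**)&hOut, frameSize, cudaHostAllocDefault);   // for asynchronous copies

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void**)&dIn[s],  frameSize);
        cudaMalloc((void**)&dOut[s], frameSize);
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    for (int f = 0; f < nFrames; f++) {
        int s = f % nStreams;   // round-robin over streams, so H2D copy, kernel and D2H copy overlap
        cudaMemcpyAsync(dIn[s], hIn, frameSize, cudaMemcpyHostToDevice, streams[s]);
        processKernel<<<(unsigned int)((frameSize + 255) / 256), 256, 0, streams[s]>>>(dIn[s], dOut[s], frameSize);
        cudaMemcpyAsync(hOut, dOut[s], frameSize, cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average throughput: %.1f fps\n", nFrames * 1000.0f / ms);
    return 0;
}
```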
Hardware and software
CPU Intel Core i7-5930K (Haswell-E, 6 cores, 3.5–3.7 GHz)
GPU NVIDIA Quadro RTX 6000
OS Windows 10 (x64), version 1803
CUDA Toolkit 10
Fastvideo SDK 0.14.0
Demosaicing benchmarks
In the Fastvideo SDK we have three different GPU-based demosaicing algorithms at the moment:
HQLI - High Quality Linear Interpolation, window 5×5
DFPD - Directional Filtering and a Posteriori Decision, window 11×11
MG - Multiple Gradients, window 23×23
All these algorithms are implemented for 8-bit and 16-bit workflows, and they correctly handle pixels at image borders. To demonstrate the performance, we assume that the source and processed data reside in GPU memory. This is the case for complicated pipelines in raw image processing applications.
Demosaicing algorithm | 2K (1920 × 1080) | 4K (3840 × 2160)
HQLI (8-bit) | 30,000 fps | 9,300 fps
HQLI (16-bit) | 13,000 fps | 4,400 fps
DFPD (8-bit) | 12,600 fps | 4,700 fps
DFPD (16-bit) | 7,100 fps | 2,700 fps
MG (16-bit) | 3,400 fps | 1,200 fps
To check the image quality of each demosaicing algorithm in a real case, you can download Fast CinemaDNG Processor software from www.fastcinemadng.com together with sample DNG image series for evaluation.
JPEG encoding and decoding benchmarks
The JPEG codec from Fastvideo SDK offers very high performance both for encoding and decoding. To get better results, we need enough data to achieve maximum GPU occupancy; this is a very important condition for good results. Here we present the best total kernel time for JPEG encoding and decoding. JPEG compression quality is q = 90%, subsampling 4:2:0 (visually lossless compression), with the optimum number of restart markers.
Resolution | JPEG Encoding | JPEG Decoding
2K (1920 × 1080) | 3,500 fps | 1,380 fps
4K (3840 × 2160) | 1,900 fps | 860 fps
5K (5320 × 3840) | 1,100 fps | 520 fps
JPEG2000 encoding benchmarks
We have a high performance JPEG2000 codec on GPU in the Fastvideo SDK. This algorithm partially utilizes the CPU, so total performance is also CPU-dependent, but it's still much faster than CPU-based J2K codecs like OpenJPEG. In the tests we utilized the optimal number of threads, and the compression ratio corresponded to visually lossless compression.
JPEG2000 encoding parameters | Lossy encoding | Lossless encoding
2K image, 24-bit, cb 32×32 | 504 fps | 281 fps
4K image, 24-bit, cb 32×32 | 160 fps | 85 fps
8K image, 24-bit, cb 32×32 | 256 fps | 23 fps
Image resize
This is a frequently utilized feature, and here we present our results for GPU-based resize with the Lanczos algorithm.
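For reference, the Lanczos resampling kernel with window parameter $a$ (typically $a = 2$ or $a = 3$) is

$$L(x) = \begin{cases} 1, & x = 0 \\ \dfrac{a \sin(\pi x)\,\sin(\pi x / a)}{\pi^2 x^2}, & 0 < |x| < a \\ 0, & |x| \ge a \end{cases}$$

Each output pixel is a weighted sum of nearby input pixels with these weights, normalized so that they sum to one.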
"1/2 resolution" means 960 × 540 for 2K and 1920 × 1080 for 4K. "1 pixel" means 1919 × 1079 for 2K and 3839 × 2159 for 4K.
Resize BMP/PPM | 1/2 resolution | 1 pixel
2K image, 24-bit | 4,200 fps | 3,300 fps
4K image, 24-bit | 1,700 fps | 1,120 fps
Apart from that, we have benchmarked the following pipeline: JPEG decoding, resize, JPEG encoding, which is widely utilized in web applications.
Decode JPEG - Resize - Encode JPEG | 1/2 resolution | 1 pixel
2K jpg image, 24-bit | 996 fps | 845 fps
4K jpg image, 24-bit | 586 fps | 425 fps
To summarize, the Fastvideo SDK shows very high performance on the Quadro RTX 6000, though we still see possibilities to improve it by further optimization of our CUDA kernels for the Turing architecture.
See the original article at: https://www.fastcompression.com/blog/fastvideo-sdk-benchmarks-quadro-rtx-6000.htm
Web Resize on-the-fly: up to one thousand images per second on Tesla V100 GPU
Fastvideo has been developing a GPU-based image processing SDK since 2011, and we have achieved outstanding software performance on NVIDIA GPUs (mobile, laptop, desktop, server). We implemented the first JPEG codec on CUDA, which is still the fastest solution on the market. Apart from JPEG, we've also released a JPEG2000 codec on GPU and an SDK with high performance image processing algorithms on CUDA. Our SDK offers exceptional speed for many imaging applications, especially in situations when CPU-based solutions cannot offer sufficient performance or latency. Now we would like to introduce our resize-on-the-fly solution for JPEG images.
JPEG Resize on-the-fly
In various imaging applications we have to resize images, and quite often these are JPEG images. In that case the task gets more complicated, because we can't resize directly: the images are compressed. The solution is not difficult: we decompress the image, resize it and encode it again to get the resized image. Nevertheless, difficulties appear if we need to resize many millions of images every day, and then the question of performance optimization arises. We not only need to get it right, we have to do it very fast. The good news is that it can be done.
The standard set of demo applications supplied with Fastvideo SDK for NVIDIA GPUs includes a sample application for JPEG resize. It's supplied both as binaries and with source code to let users integrate it easily into their software. This application solves the problem of fast resize (JPEG resize on-the-fly), which is essential for many high performance applications, including high load web services. It performs JPEG resize very fast, and users can test the binary to check image quality and performance.
If we consider a high load web application as an example, we can formulate the following task: we have a big database of images in JPEG format, and we need to resize these images with minimum latency. This is also a problem for big sites with responsive design: how do we prepare a set of images with optimal resolutions to minimize traffic, and do that as fast as possible?
First we need to answer the question "Why JPEG?". Modern internet services get most of their images from users, who create them with mobile phones or cameras. In that situation JPEG is the standard and reasonable choice. Other formats exist on mobile phones and cameras, but they are not as widespread as JPEG. Many images are stored as WebP, but that format is still not as popular as JPEG; moreover, encoding and decoding of WebP images are much slower than JPEG, which is also very important.
Quite often, such high load web services keep multiple copies of the same image at different resolutions to get low latency responses. That approach leads to extra storage expenses, especially for high performance applications, web services and big image databases. The idea behind a better solution is quite simple: we can store just one JPEG image in the database instead of an image series and transform it to the desired resolution on the fly, which means very fast and with minimum latency.
How to prepare image database
We will store all images in the database in JPEG format, but it's not a good idea to utilize them "as is". It's important to prepare all images in the database for future fast decoding. That is why we need to pre-process all images offline and insert so-called JPEG restart markers into each image. The JPEG Standard allows such markers, and most JPEG decoders can easily process JPEG images with these markers. Most smartphones and cameras don't produce JPEGs with restart markers, so we add these markers with our software. This is a lossless procedure: we don't change the image content, though the file size becomes slightly larger.
To make the full solution efficient, we can utilize statistics about the most frequent user device resolutions. Since users view pictures on phones, laptops and PCs, and quite often these pictures occupy just a part of the screen, image resolutions don't need to be too big; this is the ground to conclude that most images in our database could have resolutions of not more than 1K or 2K. We will consider both choices to evaluate latency and performance. If a bigger resolution is needed on the user device, we can simply resize with an upscaling algorithm. It is also possible to choose a bigger default image resolution for the database; the general solution stays the same.
For practical purposes we consider JPEG compression with parameters which correspond to visually lossless compression: JPEG quality around 90% with subsampling 4:2:0 or 4:4:4. To evaluate the time of JPEG resize, we choose downscaling to 50% of both width and height for testing. In real life various scaling coefficients are used, but 50% can be considered the standard test case.
Algorithm description for JPEG Resize on-the-fly software
This is the full image processing pipeline for fast JPEG resize that we've implemented in our software:
Copy JPEG images from database to system memory
Parse JPEG and check EXIF sections (orientation, color profile, etc.)
If the JPEG image contains a color profile, we read it from the file header and save it for future use
Copy JPEG image from CPU to GPU memory
JPEG decoding
Image resize according to Lanczos algorithm (50% downscaling as an example)
Sharpening
JPEG encoding
Copy new image from GPU to system memory
Add previously saved color profile to the image header (to EXIF)
We could also implement the same solution with better precision. Before the resize we could apply reverse gamma to all color components of each pixel, in order to perform the resize in linear space, and then apply forward gamma right after the sharpening. The visual difference is not big, though it's noticeable; the computational cost of such a modification is low, so it can easily be done. We just need to add reverse and forward gamma stages to the image processing pipeline on GPU.
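A minimal sketch of such a reverse gamma stage on GPU (our illustration, assuming 16-bit per channel intermediate data and the standard sRGB transfer function; this is not the Fastvideo SDK implementation):

```cpp
// Convert 16-bit sRGB-encoded values to linear light before the resize; the forward
// transform applied after sharpening is the exact inverse of this kernel.
__global__ void srgbToLinear16(unsigned short* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i] / 65535.0f;
    // standard sRGB EOTF: linear segment near black, power 2.4 elsewhere
    float lin = (v <= 0.04045f) ? v / 12.92f : powf((v + 0.055f) / 1.055f, 2.4f);
    data[i] = (unsigned short)(lin * 65535.0f + 0.5f);
}
```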
There is one more interesting approach to the same task. We can do JPEG decoding on a multicore CPU with the libjpeg-turbo software: each image is decoded in a separate CPU thread, while all the rest of the image processing is done on GPU. With a sufficient number of CPU cores we can achieve high decoding throughput on the CPU, though latency degrades significantly. If latency is not a priority, that approach can also be very fast, especially when the original image resolution is small.
General requirements for fast jpg resizer
The main idea is to avoid storing several dozen copies of the same image at different resolutions. We create the image with the required resolution immediately, right after receiving an external request. This reduces storage size, because we need just one original image instead of a series of copies.
We have to accomplish the JPEG resize task very quickly. This is a matter of service quality: fast response to client requests.
The image quality of the resized version should be high.
To ensure precise color reproduction, we need to save the color profile from the EXIF of the original image.
The image file size should be as small as possible, and the image resolution should coincide with the window size on the client's device: a) If the image size differs from the window size, the client's device (smartphone, tablet, laptop, PC) applies hardware-based resize right after decoding the image. In OpenGL such a resize is always bilinear, which can create artifacts or moire on images with high-frequency detail. b) On-device resize consumes extra energy. c) With multiple image copies at fixed resolutions, in most cases we can't match the image resolution to the window size exactly, so we send more traffic than necessary.
Full pipeline for web resize, step by step
We collect images from users in any format and resolution
Offline, with ImageMagick (which supports various image formats), we transform the original images to standard 24-bit BMP/PPM, apply a high quality resize with downscale to 1K or 2K, then do JPEG encoding with restart markers embedded. The last step can be done either with the jpegtran utility on CPU or with the Fastvideo JPEG codec on GPU; both of them can work with JPEG restart markers.
Finally, we create a database of such 1K or 2K images to work with further.
After receiving user’s request, we get full info about required image and its resolution.
Find the required image in the database, copy it to system memory and notify the resizing software that a new image is ready for processing.
On GPU we do the following: decoding, resizing, sharpening, encoding. After that the software copies the compressed image to system memory and adds the color profile to the EXIF. Now the image is ready to be sent to the user.
We can run several threads or processes of the JPEG resize application on each GPU to ensure performance scaling. This is possible because GPU occupancy is not high while working with 1K and 2K images. Usually 2-4 threads/processes are sufficient to get maximum performance from a single GPU.
The whole system should be built on professional GPUs like NVIDIA Tesla P40 or V100. This is vitally important, because the NVIDIA GeForce GPU series is not intended for 24/7 operation at maximum performance over years. NVIDIA Quadro GPUs have multiple monitor outputs which are not necessary for fast JPEG resize. GPU memory requirements are very low, so we don't need GPUs with large memory.
As an additional optimization, we can also create a cache for the most frequently processed images to get faster access to them.
Software parameters for JPEG Resize
The width and height of the resized image can be arbitrary and are defined with one-pixel precision. It's a good idea to preserve the original aspect ratio of the image, though the software can work with any width and height.
We utilize JPEG subsampling modes 4:2:0 and 4:4:4.
Maximum image quality is achieved with 4:4:4, while minimum file size corresponds to the 4:2:0 mode. We can apply subsampling because the human visual system is more sensitive to luma than to chroma.
JPEG image quality and subsampling for all images in the database.
We do sharpening with a 3×3 window, and we can control the sigma (radius).
We need to specify the JPEG quality and subsampling mode for the output image as well; these parameters don't have to be the same as for the input image. Usually JPEG quality 90% is considered visually lossless, meaning the user can't see compression artifacts under standard viewing conditions. In general, one can raise the JPEG quality to 93-95%, but then file sizes grow for both input and output images.
Important limitations for Web Resizer
We can get very fast JPEG decoding on GPU only if our images contain built-in restart markers. Without restart markers, JPEG decoding can't be parallelized and we will not get high performance at the decoding stage. That's why we need to prepare the database with images which have a sufficient number of restart markers.
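A quick way to verify that an image in the database has been prepared is to count the restart markers (bytes 0xFFD0 to 0xFFD7) directly in the JPEG bitstream. A minimal C++ sketch (a hypothetical helper, not part of the SDK):

```cpp
// Count JPEG restart markers (0xFFD0 .. 0xFFD7) in a file to verify that an image
// has been prepared for parallel decoding. Note: the scan covers the whole file,
// so markers inside an embedded EXIF thumbnail are counted as well.
#include <cstdio>
#include <cstdint>
#include <vector>

static int countRestartMarkers(const std::vector<uint8_t>& jpeg)
{
    int count = 0;
    for (size_t i = 0; i + 1 < jpeg.size(); i++)
        if (jpeg[i] == 0xFF && jpeg[i + 1] >= 0xD0 && jpeg[i + 1] <= 0xD7)
            count++;
    return count;
}

int main(int argc, char** argv)
{
    if (argc < 2) { printf("usage: %s image.jpg\n", argv[0]); return 1; }
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<uint8_t> data((size_t)size);
    if (fread(data.data(), 1, data.size(), f) != data.size()) { fclose(f); return 1; }
    fclose(f);
    printf("%s: %d restart markers\n", argv[1], countRestartMarkers(data));
    return 0;
}
```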
We believe that at the moment JPEG is the best choice for this task, because the performance of a JPEG codec on GPU is much higher than for competing formats/codecs such as WebP, PNG, TIFF or JPEG2000. This is not just a matter of format choice; it is a matter of available high-performance codecs for these formats.
Standard image resolution for prepared database could be 1K, 2K, 4K or anything else. Our solution will work with any image size, but total performance could be different.
Performance measurements for resize of 1K and 2K jpg images
We’ve done testing on NVIDIA Tesla V100 (OS Windows Server 2016, 64-bit, driver 24.21.13.9826) on 24-bit images 1k_wild.ppm and 2k_wild.ppm with resolutions 1K and 2K (1280×720 and 1920×1080). Tests were done with different number of threads, running at the same GPU. To process 2K images we need around 110 MB of GPU memory per one thread, for four threads we need up to 440 MB.
At the beginning we encoded the test images to JPEG with quality 90% and subsampling 4:2:0 or 4:4:4. Then we ran the test application, which did decoding, resizing, sharpening and encoding with the same quality and subsampling. Input JPEG images resided in system memory, and we copied the processed images from GPU back to system memory as well. We measured the timing of that procedure.
Command line example to process 1K image: PhotoHostingSample.exe -i 1k_wild.90.444.jpg -o 1k_wild.640.jpg -outputWidth 640 -q 90 -s 444 -sharp_after 0.95 -repeat 200
Performance for 1K images
N | Quality | Subsampling | Resize | Threads | FPS
1 | 90% | 4:4:4 / 4:2:0 | 2 times | 1 | 868 / 682
2 | 90% | 4:4:4 / 4:2:0 | 2 times | 2 | 1039 / 790
3 | 90% | 4:4:4 / 4:2:0 | 2 times | 3 | 993 / 831
4 | 90% | 4:4:4 / 4:2:0 | 2 times | 4 | 1003 / 740
Performance for 2K images
N | Quality | Subsampling | Resize | Threads | FPS
1 | 90% | 4:4:4 / 4:2:0 | 2 times | 1 | 732 / 643
2 | 90% | 4:4:4 / 4:2:0 | 2 times | 2 | 913 / 762
3 | 90% | 4:4:4 / 4:2:0 | 2 times | 3 | 891 / 742
4 | 90% | 4:4:4 / 4:2:0 | 2 times | 4 | 923 / 763
JPEG subsampling 4:2:0 for the input image leads to slower performance, but input and output image sizes are smaller in that case. For subsampling 4:4:4 we get better performance, though image sizes are bigger. Total performance is mostly limited by the JPEG decoder module, and this is the key algorithm to improve to get a faster solution in the future.
Summary
From the above tests we see that on just one NVIDIA Tesla V100 GPU, resize performance can reach 1000 fps for 1K images and 900 fps for 2K images at the specified test parameters for JPEG resize. To get maximum speed, we need to run 2-4 threads on the same GPU.
A latency of around just one millisecond is a very good result. To the best of our knowledge, such latency can't be achieved on a CPU for that task, and this is one more important argument for GPU-based resize of JPEG images in high performance professional solutions.
To process one billion JPEG images at 1K or 2K resolution per day (around 11,600 images per second), we need up to 16 NVIDIA Tesla V100 GPUs for the JPEG resize on-the-fly task. Some of our customers have already deployed that solution at their facilities, and others are currently testing the software.
Please note that GPU-based resize could be very useful not only for high load web services. There are much more high performance imaging applications where fast resize could be really important. For example, it could be utilized at the final stage of almost any image processing pipeline before image output to monitor. That software can work with any NVIDIA GPU: mobile, laptop, desktop, server.
Benefits of GPU-based JPEG Resizer
Reduced storage size
Less infrastructure costs on initial hardware and software purchasing
Better quality of service due to low latency response
High image quality for resized images
Min traffic
Less power consumption on client devices
Fast time-to-market software development on Linux and Windows
Outstanding reliability and speed of heavily-tested resize software
No need to store multiple image resolutions, so there is no additional load on the file system
Fully scalable solution which is applicable both to a big project and to a single device
Better ROI due to GPU usage and faster workflow
To whom it may concern
Fast resize of JPEG images is definitely an issue for high load web services, big online stores, social networks, online photo management and sharing applications, e-commerce services and enterprise-level software. Fast resize offers better results in less time and at lower cost.
Software developers could benefit from a GPU-based library with latency in the range of several milliseconds to resize jpg images on GPU.
That solution could also be a rival to the NVIDIA DALI project for fast jpg loading at the training stage of Machine Learning or Deep Learning frameworks. We can offer super high performance for JPEG decoding together with resize and other image augmentation features on GPU, which makes that solution useful for fast data loading in CNN training. Please contact us if you are interested.
Roadmap for jpg resize algorithm
Apart from the JPEG codec, resize and sharpening, we can also add crop, color correction, gamma, brightness, contrast and rotation by 90/180/270 degrees; these modules are ready.
Advanced file format support (JP2, TIFF, CR2, DNG, etc.)
Parameter optimizations for NVIDIA Tesla P40 or V100.
Further JPEG Decoder performance optimization.
Implementation of batch mode for image decoding on GPU.
Useful links
Full list of features from Fastvideo Image Processing SDK
Benchmarks for image processing algorithms from Fastvideo SDK
Update
The latest version of the software offers 1400 fps performance on Tesla V100 for 1K images at the same testing conditions.
See the original article here: https://www.fastcompression.com/blog/web-resize-on-the-fly-one-thousand-images-per-second-on-tesla-v100-gpu.htm
JPEG Optimizer Library on CPU and GPU
Fastvideo has implemented the fastest JPEG codec and Image Processing SDK for NVIDIA GPUs. That software works at maximum performance on the full range of NVIDIA GPUs, from mobile Jetson to professional Quadro and Tesla server GPUs. Now we've extended these solutions to offer various optimizations of the standard JPEG algorithm. This is a vitally important issue: getting better image compression while retaining the same perceived image quality within the existing JPEG Standard.
Our expert knowledge of the JPEG Standard and GPU programming is proven by the performance benchmarks of our JPEG codec. This is also the basis for our custom software design to solve various time-critical tasks connected with JPEG images and corresponding services.
Our customers have been utilizing that GPU-based software for fast JPEG encoding and decoding and for JPEG resize in high load web applications, and they have asked us to implement more optimizations which are indispensable for web solutions. These are the most demanding tasks:
JPEG recompression to decrease file size without losing perceived image quality
JPEG optimization to get better user experience while loading JPEG images via slow connections
JPEG processing on users' devices
JPEG resize on-demand:
- to store just one source image (to cut storage costs)
- to match the resolution of the user's device (to exclude JPEG resize on the user's device)
- to minimize traffic
- to ensure minimum server response time
- to offer better user experience
Implementations of the JPEG Baseline, Extended, Progressive and Lossless parts of the Standard
Other tasks related to JPEG images
The idea of image optimization is very popular and it really makes sense. Since JPEG is so widespread on the web, we need to optimize JPEG images for the web as well. By decreasing image size, we save storage space, minimize traffic, improve latency, etc. There are many methods of JPEG optimization and recompression which bring a better compression ratio while preserving perceptual image quality. In our products we strive to combine all of them with high performance on multicore CPUs and on modern GPUs.
There is a great variety of image processing tasks connected with JPEG handling. They can be solved either on CPU or on GPU. We are ready to offer custom software design to meet the special requirements of our customers; please contact us with your task description.
JPEG Optimizer Library and other software from Fastvideo
JPEG Optimizer Library (SDK for GPU/CPU on Windows/Linux) to recompress and to resize JPEG images for corporate customers: high load web services, photo stock applications, neural network training, etc.
Standalone JPEG optimizer application - in progress
Projects under development
JPEG optimizer SDK on CPU and GPU
Mobile SDK on CPU for Android/iOS for image decoding and visualization on smartphones
JPEG recompression library that runs inside your web app and optimizes images before upload
JPEG optimizer API for web
Online service for JPEG optimization
Fastvideo publications on the subject
JPEG Optimization Algorithms Review
Web resize on-the-fly on GPU
JPEG resize on-demand: FPGA vs GPU. Which is the fastest?
Jpeg2Jpeg Acceleration with CUDA MPS on Linux
JPEG compress and decompress with CUDA MPS
See the original article at: https://www.fastcompression.com/products/jpeg-optimizer-library.htm
Subscribe to our mail list: https://mailchi.mp/fb5491a63dff/fastcompression
Fast RAW Compression on GPU
Author: Fyodor Serzhenko
Recording performance for RAW data acquisition is an essential issue for 3D/4D, VR and Digital Cinema applications. Quite often we need to do realtime recording to a portable SSD, and here we face questions about throughput, compression ratio, image quality, recording duration, etc. Since we need to store RAW data from a camera, the general approach to raw image encoding is not exactly the same as for color images. Here we review several methods to solve that task.
Why do we need Raw Image Compression on GPU?
We need to compress the raw stream from a camera (industrial, machine vision, digital cinema, scientific, etc.) in realtime at high fps, for example 4K (12-bit raw data) at 60 fps, 90 fps or faster. This is a vitally important issue for realtime applications, external raw recorders and in-camera raw recording. As an example, we can consider the RAW or RAW-SDI format to send data from a camera to a PC or to an external recorder.
Since most modern cameras have 12-bit dynamic range, it's a good idea to utilize JPEG compression, which can be implemented for 12-bit data. For 14-bit and 16-bit cameras this is not the case, and for such high bit depth cameras we would recommend utilizing either Lossless JPEG or JPEG2000 encoding. These algorithms are not fast, but they can process high bit depth data.
Lossy methods to solve the task of Fast RAW Compression
Standard 12-bit JPEG encoding for grayscale images
Optimized 12-bit JPEG encoding (double width, half height, Standard 12-bit JPEG encoding for grayscale images)
Raw Bayer encoding (split RGGB pattern to 4 planes and then apply 12-bit JPEG encoding for each plane)
The problem with standard JPEG for RAW encoding is evident: the image does not have slowly varying pixel values, and this can cause image quality problems due to the Discrete Cosine Transform which is part of the JPEG algorithm. In that case the main idea of JPEG compression is questionable, and we expect a higher level of distortion for RAW images compressed with JPEG.
The idea of "double width" is also well known. It works well in Lossless JPEG compression of RAW bayer data. After such a transform, vertically neighbouring pixels in two adjacent rows have the same color, which decreases high-frequency values after the DCT in standard JPEG. That method is also utilized in the Blackmagic Design BMD RAW 3:1 and 4:1 formats.
If we split the RAW image into 4 planes according to the bayer pattern, we get 4 downsized images, one for each bayer component. Here we get slowly varying intensity, but at halved resolution. That algorithm looks promising, though we could expect slightly slower performance because of the additional split step in the pipeline.
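Both raw-domain transforms are simple CUDA kernels. The sketch below is our illustration (assuming a 16-bit RGGB mosaic stored as one value per pixel), not the Fastvideo SDK code:

```cpp
// Sketch of the two RAW-domain transforms for a 16-bit RGGB bayer mosaic.
// Launch with a 2D grid covering the output dimensions, e.g. dim3(16, 16) thread blocks.

// Method 2: "double width, half height" - rows 2y and 2y+1 are placed side by side,
// so vertically adjacent pixels in the reshaped image always have the same bayer color.
__global__ void bayerToDoubleWidth(const unsigned short* src, unsigned short* dst,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;      // output row index, 0 .. height/2 - 1
    if (x >= width || y >= height / 2) return;
    dst[y * 2 * width + x]         = src[(2 * y)     * width + x];   // even source row -> left half
    dst[y * 2 * width + width + x] = src[(2 * y + 1) * width + x];   // odd source row  -> right half
}

// Method 3: split the RGGB mosaic into 4 quarter-resolution planes, one per bayer component.
__global__ void splitBayerPlanes(const unsigned short* src, unsigned short* r, unsigned short* g1,
                                 unsigned short* g2, unsigned short* b, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;      // coordinates in the quarter-resolution planes
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width / 2 || y >= height / 2) return;
    int i = y * (width / 2) + x;
    r [i] = src[(2 * y)     * width + 2 * x];           // R
    g1[i] = src[(2 * y)     * width + 2 * x + 1];       // G on red rows
    g2[i] = src[(2 * y + 1) * width + 2 * x];           // G on blue rows
    b [i] = src[(2 * y + 1) * width + 2 * x + 1];       // B
}
```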
We focus on JPEG-based methods because we have a high performance JPEG codec on CUDA. That codec is capable of working with the full range of NVIDIA GPUs: mobile Jetson Nano, TK1/TX1/TX2, AGX Xavier, laptop/desktop GeForce series and server Quadro and Tesla GPUs. The codec also supports 12-bit JPEG encoding, which is the key algorithm for this RAW compression task.
There is also an opportunity to apply JPEG2000 encoding instead of JPEG for all three cases, but here we will consider JPEG only because of the following reasons:
JPEG encoding on GPU is much faster than JPEG2000 encoding (approximately ×20)
Compression ratio is almost the same (it's bigger for J2K, but not too much)
There is a patent from the RED company on J2K encoding of split channels inside the camera
There are no open patent issues connected with the JPEG algorithm, and this is a serious advantage of JPEG. Nevertheless, the case of JPEG2000 compression is very interesting and we will test it later. That approach could give us lossless raw image compression on GPU, which can't be done with JPEG.
To solve the task of RAW image compression, we need to specify both a metric and criteria to measure image quality losses. We will try SSIM, which is considered to be much more reliable than PSNR and MSE. SSIM means structural similarity, and it's a well known metric that is widely used to evaluate image resemblance.
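For reference, SSIM for two image windows $x$ and $y$ is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu$, $\sigma^2$ and $\sigma_{xy}$ are local means, variances and covariance, and $C_1$, $C_2$ are small stabilizing constants. The per-window values are averaged over the image, and SSIM = 1 means the two images are identical.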
Quality and Compression Ratio measurements
To find the best solution among the chosen algorithms, we have run tests to calculate the Compression Ratio and SSIM for standard values of the JPEG Quality Factor. We've utilized the same standard JPEG quantization table and the same 12-bit RAW image. Since the Compression Ratio is content-dependent, this is just an example of what we could get in terms of SSIM and Compression Ratio.
For the testing we've utilized an uncompressed RAW bayer image from a Blackmagic Design URSA camera with resolution 4032×2192, 12-bit. The Compression Ratio was measured relative to the packed uncompressed 12-bit file size, where two pixel values are stored in 3 bytes: 4032 × 2192 × 1.5 bytes ≈ 12.6 MB.
Output RGB images were created with the Fast CinemaDNG Processor software. The output colorspace was sRGB, 16-bit TIFF, no sharpening, no denoising. SSIM measurements were performed on these 16-bit TIFF images: the source image was compared with the processed image, which had been encoded and decoded with each compression algorithm.
Table 1: Results for SSIM for encoding with standard JPEG quantization table
These results show that the SSIM metric is not really suitable for such tests. According to visual estimation, image quality Q = 80 and higher can be considered acceptable for all three algorithms, but the images from the third algorithm look better.
Table 2: Compression Ratio (CR) for encoding with standard JPEG quantization table
Performance for RAW encoding is the same for the first two methods, while for the third it's slightly lower (the performance drop is around 10-15%) because we need to spend additional time splitting the raw image into 4 planes according to the bayer pattern. Time measurements have been done with Fastvideo SDK for different NVIDIA GPUs. These are hardware-dependent results, and you can do the same measurements for your particular NVIDIA hardware.
How to improve image quality, compression ratio and performance
There are several ways to get even better results in terms of image quality, CR and encoding performance for RAW compression:
Image sensor calibration
RAW image preprocessing: dark frame subtraction, bad pixel correction, white balance, LUT, denoise, etc.
Optimized quantization tables for 12-bit JPEG encoding
Optimized Huffman tables for each frame
Minimum metadata in JPEG images
Multithreading with CUDA Streams to get better performance
Better hardware from NVIDIA
Useful links concerning GPU accelerated image compression
High Performance CUDA JPEG Codec
12-bit JPEG encoding on GPU
JPEG2000 Codec on GPU
RAW Bayer Codec on GPU
Lossless JPEG Codec on CPU
See the original article at: https://www.fastcompression.com/blog/fast-raw-compression.htm
Benchmark comparison for Jetson Nano, TX2, Xavier NX and AGX
Author: Fyodor Serzhenko
NVIDIA has released a series of Jetson hardware modules for embedded applications. NVIDIA® Jetson is the world's leading embedded platform for image processing and DL/AI tasks. Its high-performance, low-power computing for deep learning and computer vision makes it the ideal platform for mobile compute-intensive projects.
We've developed an Image & Video Processing SDK for NVIDIA Jetson hardware. Here we present performance benchmarks for the available Jetson modules. As an image processing pipeline, we consider a basic camera application as a good example for benchmarking.
Hardware features for Jetson Nano, TX2, Xavier NX and AGX Xavier
Here we present a brief comparison of Jetson hardware features to show the progress and variety of mobile solutions from NVIDIA. These units are aimed at different markets and tasks.
Table 1. Hardware comparison for Jetson modules
In camera applications, we can usually hide Host-to-Device transfers by implementing GPU Zero Copy or by overlapping GPU copy/compute. Device-to-Host transfers can be hidden via copy/compute overlap.
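A minimal sketch of the zero-copy approach on Jetson (plain CUDA, with a dummy kernel standing in for real processing; the frame size is our assumption): because the CPU and the integrated GPU share the same physical memory, a mapped pinned buffer filled by the camera driver can be read by a kernel directly, with no cudaMemcpy.

```cpp
// Zero-copy on Jetson: CPU and GPU share the same physical DRAM, so a mapped pinned
// buffer can be read by a CUDA kernel without an explicit Host-to-Device copy.
// Minimal sketch; error checking omitted.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void invert(const unsigned char* in, unsigned char* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];               // stand-in for real processing
}

int main()
{
    const size_t frameSize = 1920 * 1080 * 2;      // one 2K 16-bit bayer frame (assumption)
    cudaSetDeviceFlags(cudaDeviceMapHost);         // enable mapped pinned memory; call before the first CUDA allocation

    unsigned char *hFrame, *dFrame, *dOut;
    cudaHostAlloc((void**)&hFrame, frameSize, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dFrame, hFrame, 0);   // device-side alias of the same physical memory
    cudaMalloc((void**)&dOut, frameSize);

    // ... camera driver fills hFrame here ...
    invert<<<(unsigned int)((frameSize + 255) / 256), 256>>>(dFrame, dOut, frameSize);  // no cudaMemcpy needed
    cudaDeviceSynchronize();
    printf("frame processed without Host-to-Device copy\n");
    return 0;
}
```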
Hardware and software for benchmarking
CPU/GPU NVIDIA Jetson Nano, TX2, Xavier NX and AGX Xavier
OS L4T (Ubuntu 18.04)
CUDA Toolkit 10.2 for Jetson Nano, TX2, Xavier NX and AGX Xavier
Fastvideo SDK 0.16.4
NVIDIA Jetson Comparison: Nano vs TX2 vs Xavier NX vs AGX Xavier
For these NVIDIA Jetson modules, we've done performance benchmarking for the following standard image processing tasks which are specific for camera applications: white balance, demosaic (debayer), color correction, resize, JPEG encoding, etc. That's not the full set of Fastvideo SDK features, but it's just an example to see what kind of performance we could get from each Jetson. You can also choose a particular debayer algorithm and output compression (JPEG or JPEG2000) for your pipeline.
Table 2. GPU kernel times for 2K image processing (1920×1080, 16 bits per channel, milliseconds)
Total processing time is calculated for the values from the gray rows of the table. This is done to show the maximum performance benchmarks for a specified set of image processing modules which correspond to real-life camera applications.
Each Jetson module was run at maximum performance:
MAX-N mode for Jetson AGX Xavier
15W for Jetson Xavier NX and Jetson TX2
10W for Jetson Nano
Here we've compared just the basic set of image processing modules from Fastvideo SDK to let Jetson developers evaluate the expected performance before building their imaging applications. Image processing from RAW to RGB or RAW to JPEG is a standard task, and now developers can get detailed info about the expected performance of the chosen pipeline according to the table above. We haven't tested the Jetson H.264 and H.265 encoders and decoders in that pipeline. Since the H.264 and H.265 encoders work at the hardware level, encoding can be done in parallel with CUDA code, so we should be able to get even better performance.
We've done the same kernel time measurements for NVIDIA GeForce and Quadro GPUs. Here you can get the document with the benchmarks.
Software for Jetson performance comparison
We've released the software for a GPU-based camera application on GitHub; both binaries and source code for our gpu-camera-sample project are available for download. It's implemented for Windows 7/10, Linux Ubuntu 18.04 and L4T. Apart from a full image processing pipeline on GPU for still images from SSD and for live camera output, there are options for streaming and for glass-to-glass (G2G) measurements to evaluate the real latency of camera systems on Jetson. The software currently works with machine vision cameras from XIMEA, Basler, JAI, Matrix Vision, Daheng Imaging, etc.
To check the performance of Fastvideo SDK on a laptop/desktop/server GPU without any programming, you can download the Fast CinemaDNG Processor software with GUI for Windows or Linux. That software has a Performance Benchmarks window where you can see the timing of each stage of image processing. This is a more sophisticated method of performance testing, because the image processing pipeline in that software can be quite advanced, and you can test any module you need. You can also perform tests on images with different resolutions to see how much the performance depends on image size, content and other parameters.
Other blog posts from Fastvideo about Jetson hardware and software
Jetson Image Processing
Jetson Zero Copy
Jetson Nano Benchmarks on Fastvideo SDK
Jetson AGX Xavier performance benchmarks
JPEG2000 performance benchmarks on Jetson TX2
Remotely operated walking excavator on Jetson
Low latency H.264 streaming on Jetson TX2
Performance speedup for Jetson TX2 vs AGX Xavier
Source codes for GPU-Camera-Sample software on GitHub to connect USB3 and other cameras to Jetson
See the original article at: https://www.fastcompression.com/blog/jetson-benchmark-comparison.htm
Subscribe to our mail list: https://mailchi.mp/fb5491a63dff/fastcompression
Jetson image processing: ISP libargus and Fastvideo SDK
Jetson image processing for camera applications
Jetson hardware is an absolutely unique solution from NVIDIA. It is essentially a mini PC with extremely powerful and versatile hardware. Apart from the ARM processor, it has a sophisticated high performance GPU with CUDA cores, Tensor cores (on AGX Xavier), and software for CPU/GPU and AI.
Below you can see an example of how to build a camera system on Jetson. This is an important task if you want to create a realtime solution for a mobile imaging application. With a thoughtful design, one can even implement a multicamera system on just a single Jetson, and some NVIDIA partners showcase that this is in fact achievable.
How image processing could be done on NVIDIA Jetson
ISP inside Jetson (libargus library on the top of hardware solution)
V4L2 framework instead of argus/nvcamerasrc to get bayer data like v4l2-ctl
Image processing on CUDA (NPP library, Fastvideo SDK)
Image processing on ARM (C++, Python, OpenCV)
Hardware-based encoding and decoding with NVENC
AI on CUDA and/or Tensor cores
Here we consider just the ISP and CUDA-based image processing pipelines to describe how the task can be solved, which image processing algorithms can be utilized, etc. To begin with, we consider the NVIDIA camera architecture.
Camera Architecture Stack
The NVIDIA camera software architecture includes NVIDIA components for ease of development and customization:
Fig.1. Diagram from Development Guide for NVIDIA Tegra Linux Driver Package (31.1 Release, Nov.2018)
NVIDIA Components of the camera architecture
libargus - provides low-level API based on the camera core stack
nvarguscamerasrc - NVIDIA camera GStreamer plugin that provides options to control ISP properties using the ARGUS API
v4l2src - standard Linux V4L2 application that uses direct kernel IOCTL calls to access V4L2 functionality
NVIDIA provides the OV5693 Bayer sensor as a sample and tunes this sensor for the Jetson platform. The driver code, based on the media controller framework, is available at ./kernel/nvidia/drivers/media/i2c/ov5693.c. NVIDIA further offers additional sensor support for BSP software releases. Developers must work with NVIDIA certified camera partners for any Bayer sensor and tuning support.
The work involved includes:
Sensor driver development
Custom tools for sensor characterization
Image quality tuning
These tools and operating mechanisms are NOT part of the public Jetson Embedded Platform (JEP) Board Support Package release. For more information on sensor driver development, see the NVIDIA V4L2 Sensor Driver Programming Guide.
Jetson includes an internal hardware-based solution (ISP) which was created for realtime camera applications. To control these features on Jetson hardware, there is the libargus library.
Camera application API libargus offers:
low-level frame-synchronous API for camera applications, with per frame camera parameter control
multiple (including synchronized) camera support
EGL stream outputs
RAW output CSI cameras needing ISP can be used with either libargus or GStreamer plugin. In either case, the V4L2 media-controller sensor driver API is used.
Sensor driver API (V4L2 API) enables:
video decode
encode
format conversion
scaling functionality
V4L2 for encode opens up many features like bit rate control, quality presets, low latency encode, temporal tradeoff, motion vector maps, and more.
Libargus library features for Jetson ISP
Bad pixel correction
Bayer domain hardware noise reduction
Per-channel black-level compensation
High-order lens-shading compensation
3A: AF/AE/AWB
Demosaic
3x3 color transform
Color artifact suppression
Downscaling
Edge enhancement (sharp)
To summarize, ISP is a fixed-function processing block which can be configured through the Argus API, Linux drivers, or the Technical Reference Manual which contains register information for particular Jetson.
All information about the utilized algorithms (AF, AE, demosaicing, resizing) is closed, and the user needs to test them to evaluate quality and performance.
The ISP is a hardware-based solution for image processing on Jetson, designed for mobile camera applications with high performance and low latency.
How to choose the right camera
To be able to utilize the ISP, we need a camera with a CSI interface. NVIDIA partner Leopard Imaging manufactures many cameras with that interface, and you can choose one according to your requirements. The CSI interface is the key feature for sending data from a camera to Jetson with the possibility of utilizing the ISP for image processing.
If we have a camera without CSI support (for example, a GigE, USB-3.x, CameraLink, Coax, 10-GigE or PCIE camera), we need to create a CSI driver to be able to work with the Jetson ISP.
Even if we don't have a CSI driver, there is still a way to connect a camera to Jetson.
We just need to utilize a proper carrier board with the correct hardware interface, usually either USB-3.x or PCIE. There is a wide choice of USB3 cameras on the market, and one can easily choose any camera or carrier board needed, for example from NVIDIA partner XIMEA GmbH.
Fig.2. XIMEA carrier board for NVIDIA Jetson TX1/TX2
To work further with the camera, you need a camera driver for L4T and the ARM processor; this is the minimum requirement to connect a camera to Jetson via a carrier board.
However, keep in mind that in this case the ISP is not available. The next part deals with that situation.
How to work with non-CSI cameras on Jetson
Let's assume that we've already connected non-CSI camera to Jetson and we can send data from the camera to system memory on Jetson.
Now we can't access Jetson ISP and we need to consider other ways of image processing. The fastest solution is to utilize Fastvideo SDK for Jetson GPUs.
That SDK actually exists for Jetson TK1, TX1, TX2, TX2i and AGX Xavier.
We just need to send data to GPU memory and create a full image processing pipeline on CUDA. This is the way to keep the CPU free and to ensure fast processing, thanks to the excellent CUDA performance of the mobile Jetson GPU. Based on that approach you can create multicamera systems on Jetson with Fastvideo SDK together with USB-3.x or PCIE cameras.
For more info about realtime Jetson applications with multiple cameras, you can have a look at the site of NVIDIA partner XIMEA, which manufactures high quality cameras for machine vision, industrial and scientific applications.
Fig.3. NVIDIA Jetson with multiple cameras on TX1/TX2 carrier board from XIMEA
Image processing on Jetson with Fastvideo SDK
Fastvideo SDK is intended for camera applications and has a wide choice of features for realtime raw image processing on GPU. The SDK also exists for NVIDIA GeForce/Quadro/Tesla GPUs and consists of high quality algorithms which require significant computational power.
This is the key difference in comparison with any hardware-based solution. ISP/FPGA/ASIC image processing modules usually offer low latency and high performance, but because of hardware restrictions, the algorithms utilized are relatively simple and deliver moderate image quality.
Apart from image processing modules, Fastvideo SDK has high speed compression solutions: JPEG (8/12-bit), JPEG2000 (8-16-bit) and Bayer (8/12-bit) codecs implemented on GPU. These codecs run on CUDA and have been heavily tested, so they are reliable and very fast.
For the majority of camera applications, 12 bits per pixel is the standard bit depth, and it makes sense to store compressed images at least in 12-bit format or even at 16-bit.
The full image processing pipeline in Fastvideo SDK is done at 16-bit precision, while modules that require better precision are implemented in float.
Fig.4. Image processing workflow on CUDA at Fastvideo SDK for camera applications
To check the quality and performance of raw image processing with Fastvideo SDK, users can download a GUI application called Fast CinemaDNG Processor. The software is fully based on Fastvideo SDK and can be downloaded from www.fastcinemadng.com together with sample image series in DNG format.
That application has a benchmarks window where you can check the time measurements for each stage of the image processing pipeline on GPU.
High-resolution multicamera system for UAV Aerial Mapping
Application: 5K vision system for Long Distance Remote UAV
Manufacturer: MRTech company
Cameras
One XIMEA 20 MPix PCIe camera MX200CG-CM
Two XIMEA 3.1 MPix PCIe cameras MX031CG-SY
Hardware
NVIDIA Jetson TX2 or TX2i module with custom carrier board
NVMe SSD 960 PRO M.2 onboard
Jetson GPU image processing
Full processing workflow on CUDA: acquisition, black level, white balance, LUT, high quality demosaicing, etc.
H.264/265 encoding, RTSP streaming via radio channel
Streaming of 4K images at 25 fps and 2× Full HD 1080p (1920 × 1080) images at 30 fps simultaneously
Save high resolution snapshot images to SSD
Power usage 35W (including all cameras)
Fig.5. NVIDIA Jetson TX2 with XIMEA MX200CG-CM (20 MPix) and two MX031CG-SY (3.1 MPix) cameras.
You can find more information about MRTech solutions for Jetson image processing here.
AI imaging applications on Jetson
With the arrival of AI solutions, the following task needs to be solved: how to prepare high quality input data for such systems?
Usually we get images from cameras in realtime and if we need high quality images, then choosing a high resolution color camera with bayer pattern is justified.
Next we need to implement fast raw processing and after that we will be able to feed our AI solution with good pictures in realtime.
The latest Jetson AGX Xavier has high performance Tensor cores for AI applications and these cores are ready to receive images from CUDA software. Thus we can send data directly from CUDA cores to Tensor cores to solve the whole task very fast.
Links:
XIMEA cameras for Jetson applications
MRTech software solutions for Jetson imaging systems
Fastvideo Image & Video Processing SDK for NVIDIA Jetson
Low latency H.264 streaming on Jetson TX2
See the original article here: https://www.fastcompression.com/blog/jetson-image-processing.htm