JPEG XS – modern visually-lossless low-latency lightweight codec
Authors: Fyodor Serzhenko and Anton Boyarchenkov
JPEG XS is a recent image and video coding system developed by the Joint Photographic Experts Group and published as international standard ISO/IEC 21122 in 2019 [1] (second edition in 2022 [2]). Unlike many earlier standards developed by the JPEG committee, JPEG XS addresses video compression. What makes it stand out from other video compression techniques is a different set of priorities. Improving coding efficiency was the highest priority of previous approaches, while latency and complexity have been, at best, secondary goals. That is why uncompressed video streams have still been used for transmission and storage. Now JPEG XS has emerged as a viable alternative to the uncompressed form.

Background of JPEG XS
There is a continual tension between the benefits of uncompressed video and its very high delivery bandwidth requirements. Network bandwidth continues to increase, but so do the resolution and complexity of video. With the emergence of formats such as Ultra-High Definition (4K, 8K), High Dynamic Range, High Frame Rate and panoramic (360) video, both storage and bandwidth requirements are rapidly increasing [3] [4].
Instead of a costly upgrade or replacement of deployed infrastructure, we can consider using transparent compression to reduce the stream sizes of these demanding video formats. Naturally, such compression should be visually lossless, low-latency and low-complexity. However, the existing codecs (see the short review below) were not able to satisfy all these requirements simultaneously, because they were mostly designed with coding efficiency as the main goal.
But improving coding efficiency is not the only motivation for video compression. A lightweight compression scheme can achieve energy savings when the energy required for transmission is greater than the energy cost of compression. In addition, the delay can even be reduced if the compression overhead is less than the difference between the transmission times of uncompressed and compressed frames.
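To make that condition concrete, here is a back-of-the-envelope check with illustrative numbers of our own (a 1080p 4:2:2 10-bit frame on a 10 Gbps link; these figures are not from the standard):

    // Illustrative check: does 4:1 compression reduce end-to-end delay
    // for a 1080p 4:2:2 10-bit frame on a 10 Gbps link?
    #include <cstdio>

    int main() {
        const double bits_per_frame = 1920.0 * 1080 * 20;  // 4:2:2 at 10 bit -> 20 bits per pixel
        const double link_bps = 10e9;                      // 10 GbE, payload approximated
        const double t_uncompressed = bits_per_frame / link_bps;        // ~4.15 ms
        const double t_compressed   = bits_per_frame / 4.0 / link_bps;  // 4:1 ratio, ~1.04 ms
        // Compression pays off latency-wise if the encode+decode overhead
        // stays below the transmission-time difference (~3.1 ms here).
        printf("budget for codec overhead: %.2f ms\n",
               (t_uncompressed - t_compressed) * 1e3);
    }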
For non-interactive video systems, such as video playback, latency is not important as long as the decoder provides the required frame rate. Interactive video applications, by contrast, require low latency to be useful. When network latency is low enough, the video processing pipeline can become the bottleneck. Latency is even more important for fast-moving and safety-critical applications. Moreover, a sufficiently low delay opens up space for new applications, such as cloud gaming, extended reality (XR), or the internet of skills [5].
Use cases
The most common examples of uncompressed video transport are standard video links such as SDI and HDMI, or Ethernet. In particular, the massively deployed 3G-SDI link was introduced by SMPTE in 2006 and has a throughput of 2.65 Gbps, which is enough for a 1080p60 video stream. Video compression at a ratio of 4:1 would allow sending 4K/60p/4:2:2/10-bit video (requiring 10.8 Gbps) over 3G-SDI. 10G Ethernet (SMPTE 2022-6) has a throughput of 7.96 Gbps, while compression at a ratio of 5:1 would allow sending two 4K/60p/4:4:4/12-bit video streams (requiring 37.9 Gbps) over it [3] [4] [6].
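The raw bitrates quoted above can be sanity-checked directly from the video parameters. A minimal sketch (it counts active pixels only, so it slightly undershoots the quoted link figures, which include blanking and protocol overhead):

    // Raw video bitrate = width * height * fps * bit_depth * samples_per_pixel.
    // For 4:2:2 sampling samples_per_pixel = 2, for 4:4:4 it is 3.
    #include <cstdio>

    double raw_gbps(int w, int h, int fps, int bit_depth, double samples_per_pixel) {
        return double(w) * h * fps * bit_depth * samples_per_pixel / 1e9;
    }

    int main() {
        printf("4K/60p/422/10 bit: %.1f Gbps\n", raw_gbps(3840, 2160, 60, 10, 2.0)); // ~10 Gbps
        printf("4K/60p/444/12 bit: %.1f Gbps\n", raw_gbps(3840, 2160, 60, 12, 3.0)); // ~17.9 Gbps per stream
    }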
Embedded devices such as cameras use internal storage, which has limited access rates (4 Gbps for SSD drives, 400–720 Mbps for SD cards). Lightweight compression allows real-time storage of video streams with higher throughput. Omnidirectional video capture systems with multiple cameras covering different fields of view transfer their video streams to a front-end processing system. Applying lightweight compression to these streams reduces both the required storage size and the throughput demands [3] [4] [6].
Head-mounted displays (HMDs) are used for viewing omnidirectional VR and AR content. Given the computational (and power) constraints of such a display, it cannot be expected to receive the omnidirectional stream and process it locally. Instead, an external source should send to the HMD only the portion of the media stream that is within the viewer's field of view. An immersive experience also requires very high-resolution video, and the quality of experience is crucially tied to the latency [3] [4] [6].
Other target use cases include broadcasting and live production, frame buffer compression (inside video processing devices), industrial vision, ultra high frame rate cameras, medical imaging, automotive infotainment, video surveillance and security, low-cost visual sensors in the Internet of Things, etc. [6]
Emergence of the new standard
Addressing this challenge, several initiatives have been started. Among them is JPEG XS, launched by the JPEG committee in July 2015, with a Call for Proposals issued in March–June 2016 [6]. The evaluation process was structured into three activities: objective evaluations, subjective evaluations, and compliance analysis in terms of latency and complexity requirements. Based on the use cases described above, the following requirements were identified.
Visually lossless quality with imperceptible flickering between original and compressed image.
Multi-generation robustness (no significant quality degradation for up to 10 encoding-decoding cycles).
Multi-platform interoperability. In order to optimally support different platforms (CPU, GPU, FPGA, ASIC) the codec needs to allow for different kinds of parallelism.
Low complexity both in hardware and software.
Low latency. In live production and AR/VR use cases the cumulative delay required by all processing steps should be below the human perception threshold.
It is easy to see that none of the existing standards complies with the above requirements. JPEG and JPEG XT make precise rate control difficult and show a latency of one frame. With regard to latency, the versatility of JPEG 2000 allows configurations with an end-to-end latency around 256 lines, or even as low as 41 lines in hardware implementations [4], but it still requires many hardware resources. VC-2 is of low complexity but delivers only limited image quality. ProRes makes a low-latency implementation impossible and fast CPU implementations challenging.
Out of six proposed technologies, one was disqualified due to latency and complexity compliance issues, and two proposals were selected for the next step of the standardization process. It was decided that the JPEG XS coding system would be based on a merger of those two proposals. The new codec provides precise rate control with a latency below 32 lines and fits in a low-cost FPGA. At the same time, its fine-grained parallelism allows optimal implementation on different platforms, while its compression quality is superior to VC-2.
JPEG XS algorithm overview
The JPEG XS coding system is a classical wavelet-based still image codec (see a more detailed description in [4] or in the standard [1] [2]). It uses the reversible color transformation and the reversible discrete wavelet transformation (Le Gall 5/3) known from JPEG 2000. Here, however, the DWT is asymmetrical: the specification allows up to two vertical decomposition levels and up to eight horizontal levels.
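For illustration, below is a sketch of one level of the reversible Le Gall 5/3 lifting transform in one dimension, following the well-known JPEG 2000 formulation; the normative JPEG XS definition is in [1] [2].

    // One level of the reversible Le Gall 5/3 lifting transform (1-D).
    // Even samples -> lowpass s, odd samples -> highpass d.
    // Symmetric boundary extension; n assumed even for brevity.
    #include <vector>

    void dwt53_forward(const std::vector<int>& x, std::vector<int>& s, std::vector<int>& d) {
        const int n = (int)x.size();
        auto X = [&](int i) {                      // symmetric extension of the input
            if (i < 0) i = -i;
            if (i >= n) i = 2 * n - 2 - i;
            return x[i];
        };
        d.resize(n / 2);
        s.resize(n / 2);
        for (int k = 0; k < n / 2; ++k)            // predict step (highpass)
            d[k] = X(2 * k + 1) - ((X(2 * k) + X(2 * k + 2)) >> 1);
        auto D = [&](int k) { return d[k < 0 ? 0 : (k >= n / 2 ? n / 2 - 1 : k)]; };
        for (int k = 0; k < n / 2; ++k)            // update step (lowpass)
            s[k] = X(2 * k) + ((D(k - 1) + D(k) + 2) >> 2);
        // Note: >> implements floor division here (guaranteed for signed
        // integers since C++20, universal in practice).
    }

The inverse transform runs the update and predict steps in reverse order with the signs flipped, which is what makes the transform reversible in integer arithmetic.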
This restriction on number of vertical levels ensures that the end-to-end latency does not exceed maximum allowed value of 32 screen lines. In fact, algorithmic encoder-decoder latency due to DWT alone is 3 or 9 lines for one or two vertical decomposition levels, so there is a latency reserve for any form of rate allocation not specified in the standard.
The wavelet stage is followed by a pre-quantizer which chops off eight of the 20 least significant bit planes. It is not used for rate control but ensures that the following data path is 16 bits wide. After that, the actual quantization is performed. Unlike JPEG 2000, which uses a dead-zone quantizer, JPEG XS can optionally use a data-dependent uniform quantizer.
The quantizer is controlled by the rate allocator, which guarantees compression to an externally given target bit rate, a strict requirement in many use cases. In order to respect the target bit rate together with the maximum latency of 32 lines, JPEG XS divides the image into rectangular precincts. While precincts in JPEG 2000 are typically square regions, a precinct in JPEG XS spans one or two lines of wavelet coefficients of each band.
Due to the latency constraints, the rate allocator is not a precise but rather a heuristic algorithm without actual distortion measurement. Moreover, the specific operation of the rate allocator is not defined in the standard, so different algorithms can be considered. An algorithm that is ideal for a low-cost FPGA, where access to external memory should be avoided, can be suboptimal for a high-end GPU.
The next stage after rate allocation is entropy coding, which is relatively simple. The quantized wavelet coefficients are combined into coding groups of four coefficients. For each group, three datasets are formed: bit-plane counts, the quantized values themselves, and the signs of all nonzero coefficients. Of these datasets, only the bit-plane counts are entropy coded, because they consume a major part of the overall rate.
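As an illustration of the grouping, here is a minimal sketch of how a bit-plane count for one coding group of four quantized coefficients can be computed (our own illustrative code, not the normative definition):

    // For each group of four quantized coefficients, record how many bit
    // planes are needed to represent the largest magnitude (signs are
    // stored separately and omitted here).
    #include <cstdlib>
    #include <algorithm>

    int bitplane_count(const int g[4]) {
        int maxmag = 0;
        for (int i = 0; i < 4; ++i)
            maxmag = std::max(maxmag, std::abs(g[i]));
        int planes = 0;
        while (maxmag >> planes) ++planes;   // position of the highest set bit
        return planes;                       // 0 means the whole group is zero
    }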
The rate allocator is free to select between four regular prediction modes per wavelet band: prediction on/off and significance coding on/off. Besides, it can select between two significance coding methods, which specify whether zero predictions or zero counts are coded. A “raw fallback mode” allows disabling bit-plane coding and should be used when the regular coding modes would not pay off.
A smoothing buffer ensures a constant bit rate at the encoder output even if some regions of the input image are easier to compress than others. This buffer can have a different size depending on the selected profile, and this choice affects the rate control algorithm, which uses the buffer to smooth out rate variations.
JPEG XS profiles
Particular applications may impose additional constraints on the codec, such as even lower complexity or a buffer size limitation. The standard therefore defines several profiles to allow different levels of latency and complexity. In fact, the entire Part 2 of the standard (ISO/IEC 21122-2 “Profiles and Buffer Models” [7]) is devoted to the specification of profiles, levels and sublevels [4].
Each profile allows one to estimate the necessary number of logic elements and the memory footprint, and indicates whether chroma subsampling or an alpha channel is supported. Profiles are structured along the maximum bit depth, the quantizer type, the smoothing buffer size, and the number of vertical DWT levels. Other coding tools, such as the choice of embedded/separate sign coding or the method for insignificant coding groups, increase decoder complexity only marginally, so they are not restricted by the profile. The standard defines eight profiles, whose characteristics are summarized in Table 1.
The three “Main” profiles target all types of content (natural, CGI, screen) for broadcast, Pro-AV, frame buffer and display link use cases. The two “High” profiles allow a second vertical decomposition and target all types of content for high-end devices and cinema remote production. The two “Light” profiles are considered suitable for natural content only and target broadcast, industrial camera and in-camera compression use cases. Finally, the “Light-subline” profile, with minimal latency (due to zero vertical decompositions and the shortest smoothing buffer), is also suitable for natural content only and targets cost-sensitive applications.
Profiles determine the set of coding features, while levels and sublevels limit the buffer sizes. In particular, levels restrict them in the uncompressed image domain and sublevels in the compressed domain. Similar to HEVC levels, JPEG XS levels constrain the frame dimensions and the refresh rate (e.g., 1920p/60).
Table 1. Configuration of JPEG XS Profiles [4].
Performance evaluation
This section shows experimental results of a rate-distortion comparison against other compression technologies, with PSNR as the distortion measure. We focus on RGB 4:4:4 24-bit natural content here, as it has been shown that the results for subsampled images and images with higher bit depth are similar.
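For reference, PSNR is computed here over all three RGB channels jointly; a minimal sketch:

    // PSNR for 8-bit RGB data; n_samples = width * height * 3.
    #include <cmath>
    #include <cstdint>
    #include <cstddef>

    double psnr_rgb8(const uint8_t* a, const uint8_t* b, size_t n_samples) {
        double mse = 0.0;
        for (size_t i = 0; i < n_samples; ++i) {
            const double e = double(a[i]) - double(b[i]);
            mse += e * e;
        }
        mse /= double(n_samples);
        return 10.0 * std::log10(255.0 * 255.0 / mse);  // +inf for identical images
    }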
Figure 1. Image quality (PSNR) as a function of compression ratio for different codecs and profiles.
In Figure 1 we compare rate-distortion curves for JPEG XS and two classical image codecs: JPEG and JPEG 2000. The testing procedure was as follows. Our test image 4k_wild.ppm (3840 × 2160 × 24 bpp) with natural content was compressed multiple times with compression ratios in the range from 2:1 to 20:1. These ratios are exact for JPEG XS and JPEG 2000, which allows a direct comparison, but they differ for JPEG, because it has no precise rate control functionality. The highest point of the JPEG 2000 curve (with infinite PSNR) shows the compression ratio of the reversible (lossless) algorithm. The test image is visually lossless in all cases where PSNR is higher than 40 dB.
As the figure shows, among the three image codecs JPEG 2000 provides the highest quality (visually lossless even at a ratio of 30:1 for this image), but it comes with much greater computational complexity. At ratios of 6:1 or less the quality of classical JPEG is even higher (and it stays visually lossless up to 14:1), and JPEG has low complexity, but its lack of precise rate control can be critical in some applications, and its minimum latency is one frame. That is why it cannot substitute for uncompressed video or for JPEG XS. Although the JPEG XS curves lie below the curves of the other two codecs, the image quality is still high enough to be visually lossless at ratios below 10:1.
The average PSNR difference is 5.4 dB between JPEG 2000 and the “high” profile JPEG XS, and 4.5 dB between JPEG 2000 and the “main” profile JPEG XS (for compression ratios up to 10:1). The average difference is 0.75 dB between the “main” and “high” profiles and only 0.45 dB between the “main” and “light” profiles.
Patents and RAND
Please bear in mind that JPEG XS contains patented technology which is made available for licensing via the JPEG XS Patent Portfolio License (JPEG XS PPL). This license pool covers essential patents owned by Licensors for implementing the ISO/IEC 21122 JPEG XS video coding standard and is available under RAND terms. You can find more info at https://www.jpegxspool.com
We've implemented a high-performance JPEG XS decoder on GPU as an accelerated alternative to the reference software of the JPEG XS project at iso.org (Part 5 of the international standard, ISO/IEC 21122-5:2020 [8]), which is implemented for the CPU and performs well below real time. That was done to show the potential of GPU-based speedup for such software. We can offer our customers a high-performance JPEG XS decoder for NVIDIA GPUs, though all questions concerning the licensing of JPEG XS technology must be settled by the customer with the owners of the JPEG XS patents.
References
1. ISO/IEC 21122-1:2019 Information technology — JPEG XS low-latency lightweight image coding system — Part 1: Core coding system. https://www.iso.org/standard/74535.html
2. ISO/IEC 21122-1:2022 Information technology — JPEG XS low-latency lightweight image coding system — Part 1: Core coding system. https://www.iso.org/standard/81551.html
3. JPEG White paper: JPEG XS, a new standard for visually lossless low-latency lightweight image coding system, Version 2.0 // ISO/IEC JT1/SC29/WG1 WG1N83038 http://ds.jpeg.org/whitepapers/jpeg-xs-whitepaper.pdf
4. A. Descampe, T. Richter, T. Ebrahimi, et al. JPEG XS—A New Standard for Visually Lossless Low-Latency Lightweight Image Coding // Proceedings of the IEEE Vol. 109, Issue 9 (2021) 1559.
5. J. Žádník, M. Mäkitalo, J. Vanne, and P. Jääskeläinen. Image and Video Coding Techniques for Ultra-Low Latency // ACM Computing Surveys (accepted paper). https://doi.org/10.1145/3512342
6. WG1 (ed. A. Descampe). Call for Proposals for a low-latency lightweight image coding system // ISO/IEC JTC1/SC29/WG1 N71031, 71st Meeting – La Jolla, CA, USA – 11 March 2016. https://jpeg.org/downloads/jpegxs/wg1n71031-REQ-JPEG_XS_Call_for_proposals.pdf
7. ISO/IEC 21122-2:2022 Information technology — JPEG XS low-latency lightweight image coding system — Part 2: Profiles and buffer models. https://www.iso.org/standard/81552.html
8. ISO/IEC 21122-5:2020 Information technology — JPEG XS low-latency lightweight image coding system — Part 5: Reference software. https://www.iso.org/standard/74539.html
The original article is available at: https://fastcompression.com/blog/jpeg-xs-overview.htm
Subscribe to our mail list: https://mailchi.mp/fb5491a63dff/fastcompression
GPU HDR Processing for SONY Pregius Image Sensors
Author: Fyodor Serzhenko
The fourth generation of SONY Pregius image sensors (IMX530, IMX531, IMX532, IMX535, IMX536, IMX537, IMX487) is capable of working in HDR mode. That mode is called "Dual ADC" (Dual Gain): two raw frames originate from the same exposure and are digitized via two ADCs with different analog gains. If the ratio of these gains is around 24 dB, one can get a single 16-bit raw image from two 12-bit raw frames with different gains. This is the main idea of HDR for these image sensors: extended dynamic range of up to 16 bits from two 12-bit raw frames with the same exposure and different analog gains. That method guarantees that both frames have been exposed at the same time and are not spatially shifted.
The Dual ADC feature was originally introduced in the third generation of SONY Pregius image sensors, but HDR processing had to be implemented outside the image sensor. In the latest version, the HDR combination is done inside the image sensor, which makes it more convenient to work with. Dual ADC mode with on-sensor combination (combined mode) is available for high-speed sensors only.
In the Dual ADC mode we need to specify some parameters for the image sensor. There are two ways of getting the extended dynamic range from SONY Pregius image sensors:
In the combined mode, the image sensor outputs one 12-bit raw frame with the merge feature applied (two 12-bit frames with Low gain and High gain are combined) and simple tone mapping (a PWL curve applied to the 16-bit merged data). This approach requires minimum camera bandwidth, because the image size is minimal: just one 12-bit raw frame.
In the non-combined mode, the image sensor outputs two 12-bit raw images which can be processed later outside the image sensor. This is the worst case for camera bandwidth, but it is promising for a high-quality merge and sophisticated tone mapping.
Apart from that, there are two other options:
We can process just the Low gain or the High gain image, but it is quite evident that the dynamic range in that case will be no better than in the Dual ADC mode.
It's also possible to apply our own HDR algorithm to the results of the combined mode as an attempt to improve image quality and dynamic range.
Dual Gain mode parameters for image merge
Threshold - the intensity level at which we should start using Low gain data instead of High gain data
Low gain (AD1) and High gain (AD2) - the values of analog gain (0 dB, 6 dB, 12 dB, 18 dB, 24 dB); a sketch of a merge that uses these parameters is shown below
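A minimal sketch of such a threshold-based merge (our own illustrative logic; the actual on-sensor implementation may differ):

    // Merge two 12-bit frames (same exposure, different analog gain) into one
    // 16-bit linear frame. Illustrative only.
    #include <cstdint>
    #include <algorithm>

    // gain_ratio = High gain / Low gain, e.g. 16 for a 24 dB ratio;
    // threshold is expressed in the high-gain scale.
    uint16_t merge_pixel(uint16_t low, uint16_t high, uint16_t threshold, int gain_ratio) {
        if (high < threshold)            // highlights not clipped in the high-gain frame:
            return high;                 // use the low-noise high-gain sample as-is
        // otherwise take the low-gain sample and rescale it to the high-gain scale
        uint32_t v = uint32_t(low) * gain_ratio;
        return (uint16_t)std::min<uint32_t>(v, 65535u);
    }

In practice the transition region is often blended rather than switched hard, to avoid visible seams; that is one of the things a custom GPU merge can do better than a fixed on-sensor scheme.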
Dual Gain mode parameters for HDR
Two pairs of knee points for the PWL curve (gradation compression from the 16-bit range to 12 bits). They are derived from the Low gain and High gain values and from the parameters of the gradation compression.
Below is a picture with detailed info on the PWL curve, which is applied after the image merge inside the image sensor. It shows how gradation compression is implemented in the sensor.
This is an example of real parameters of the Dual ADC mode for the SONY IMX532 image sensor:
Dual ADC Gain Ratio: 12 dB
Dual ADC Threshold: 40%
Compression Region Selector 1:
Compression Region Start: 6.25%
Compression Region Gain: -12 dB
Compression Region Selector 2:
Compression Region Start: 25%
Compression Region Gain: -18 dB
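A generic PWL evaluator illustrating how such knee points can be applied is sketched below. The knee positions and slopes are one plausible reading of the example above (region start as a fraction of input full scale, region gain in dB converted to a slope of 2^(gain_dB/6)); the sensor documentation is authoritative.

    // Generic piecewise-linear (PWL) gradation compression: identity slope
    // before the first knee, then progressively flatter segments.
    struct Knee { double start; double slope; };   // start in [0,1] of input full scale

    double pwl_compress(double x_norm, const Knee* knees, int n) {
        double y = 0.0, x0 = 0.0, slope = 1.0;     // identity before the first knee
        for (int i = 0; i < n && knees[i].start < x_norm; ++i) {
            y += (knees[i].start - x0) * slope;    // accumulate the previous segment
            x0 = knees[i].start;
            slope = knees[i].slope;
        }
        return y + (x_norm - x0) * slope;
    }

    // Example, matching the parameters above under our interpretation:
    // Knee knees[] = { {0.0625, 0.25}, {0.25, 0.125} };   // -12 dB, -18 dB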
For further testing we will capture frames from the IMX532 image sensor using the XIMEA MC161CG-SY-UB-HDR camera with exactly these Dual ADC parameters.
If we compare images with a gain ratio of 16 (High gain is 16 times greater than Low gain) and an exposure ratio of 1/16 (long exposure for Low gain and short exposure for High gain), we clearly see that the images are alike, but the High gain image has two problems: more noise and more hot pixels due to the strong analog signal amplification. These issues should be taken into account.
Apart from the standard Dual ADC combined mode, there is a quite popular approach which can bring good results with minimum effort: we can use just the Low gain image and apply custom tone mapping instead of the PWL curve. In that case the dynamic range is lower, but the image can have less noise in comparison with images from the combined mode.
Why do we need to apply our own HDR image processing?
It makes sense if the on-sensor HDR processing in Dual ADC mode can be improved. This can be the way to better image quality, thanks to more sophisticated algorithms for image merge and tone mapping. GPU-based processing is usually very fast, so we can still process image series with HDR support in real time, which is a must for camera applications.
HDR image processing pipeline on NVIDIA GPU
We've implemented an image processing pipeline on NVIDIA GPU for Dual ADC frames from SONY Pregius image sensors. In fact, we've extended our standard pipeline to work with such HDR images. We can process on NVIDIA GPU any frames from SONY image sensors in HDR mode: one 12-bit HDR raw image (combined mode) or two 12-bit raw frames (non-combined mode). Our result can be better not only due to our merge and tone mapping procedures, but also due to high-quality debayering, which also influences the quality of the processed images. Why do we use the GPU? It is the key to much higher performance and image quality that can't be achieved on the CPU.
Low gain image processing
As we've already mentioned, this is the simplest method; it is widely used and is effectively the same as switching the Dual ADC mode off. The Low gain 12-bit raw image has less dynamic range, but it also has less noise, so we can apply either a 1D LUT or a more complicated tone mapping algorithm to that 12-bit raw image to get better results in comparison with the combined 12-bit HDR image that we can get directly from the SONY image sensor. This is a brief outline of the pipeline:
Acquisition of 12-bit raw image from a camera with SONY image sensor
BPC (bad pixel correction)
Demosaicing with MG algorithm (23×23)
Color correction
Curves and Levels
Local tone mapping
Gamma
Optional JPEG or J2K encoding
Monitor output, streaming or storage
Fig.1. Low gain image processing for IMX532
Image processing in the Combined mode
Though we can get a ready 12-bit raw HDR image from the SONY image sensor in Dual ADC mode, there is still a way to improve the image quality: we can apply our own tone mapping. That's what we've done, and the results are consistently better. This is a brief outline of the pipeline:
Acquisition of 12-bit raw HDR image from a camera with SONY image sensor
Preprocessing
BPC (bad pixel correction)
Demosaicing with MG algorithm (23×23)
Color space conversion
Global tone mapping
Local tone mapping
Optional JPEG or J2K encoding
Monitor output, streaming or storage
Fig.2. SONY Dual ADC combined mode image processing for IMX532 with a custom tone mapping
Low gain + High gain (non-combined) image processing
To get both raw frames from the SONY image sensor, we need to send them to a PC via the camera interface. This can be a problem for the interface bandwidth, and for some cameras the frame rate must be decreased to cope with bandwidth limitations. With PCIe, Coax or 10/25/50-GigE cameras, it is possible to send both raw images in real time without frame drops.
As soon as we get the two raw frames (Low gain and High gain), we need to start with preprocessing, then merge them into one 16-bit linear image and apply a tone mapping algorithm. Good tone mapping algorithms are usually more complicated than a simple PWL curve, so we can get better results, though it takes much more time. To solve that issue, high-performance GPU-based image processing is the best approach. That's exactly what we've done: we get better image quality and higher dynamic range in comparison both with the combined HDR image from SONY and with the processed Low gain image.
HDR workflow for Dual ADC non-combined image processing on GPU
Acquisition of two raw images in non-combined Dual ADC mode
Preprocessing of two images
BPC (bad pixel correction) for both images
RAW Histogram and MinMax for each frame
Merge for Low gain and High gain raw images
Demosaicing with MG algorithm (23×23)
Color space conversion
Global tone mapping
Local tone mapping
Optional JPEG or J2K encoding
Monitor output, streaming or storage
In that workflow the most important modules are merge, global/local tone mapping and demosaicing. We've implemented that image processing pipeline with the Fastvideo SDK, which runs very fast on NVIDIA GPU.
Fig.3. SONY Dual ADC non-combined (two-image) processing for IMX532
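As a stand-in for the global tone mapping step, here is a simple Reinhard-style operator on the 16-bit linear merged image (an illustrative sketch, not the Fastvideo SDK implementation); on the GPU it maps naturally onto a per-pixel kernel:

    // Simple global tone mapping (Reinhard, x / (1 + x)) on 16-bit linear data.
    #include <cstdint>
    #include <cmath>

    uint16_t tonemap_reinhard(uint16_t v_linear, double exposure /* scene-dependent scale */) {
        const double x = exposure * double(v_linear) / 65535.0;
        const double y = x / (1.0 + x);            // compresses highlights smoothly
        return (uint16_t)std::lround(y * 65535.0);
    }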
Summary for Dual ADC mode on GPU
Better image quality
Sophisticated merge for Low gain and High gain images
Global and local tone mapping
High quality demosaicing
Better dynamic range
Fewer artifacts in brightness and color
Less noise
High performance processing
We believe that the best results for image quality can be achieved in the following modes:
Simultaneous processing of two 12-bit raw images in the non-combined mode.
Processing of one 12-bit raw frame in the combined mode with a custom tone mapping algorithm.
If we work in the non-combined mode, we can get good image quality, but the camera bandwidth limitation and the processing time can be a problem. If we work with the results of the combined mode, the image quality is comparable, the processing pipeline is less complicated (so the performance is better), and we need less bandwidth, so it can be recommended for most use cases. With a proper GPU, image processing can be done in real time at the maximum fps.
The above frames were captured from the SONY IMX532 image sensor in Dual ADC mode. The same approach is applicable to all high-speed SONY Pregius image sensors of the 4th generation which can work in the Dual ADC combined mode.
Processing benchmarks on Jetson AGX Xavier and GeForce RTX 2080TI in the combined mode
We've measured kernel times to evaluate the performance of the solution in the combined mode. This is the way to get high dynamic range and very good image quality, so knowledge about the performance is valuable. Below we publish timings for several image processing modules, because the full pipeline can differ in the general case.
Table 1. GPU kernel time in ms for IMX532 raw frame processing in the combined mode (5328×3040, bayer, 12-bit)
This is just a part of the full image processing pipeline, and it shows how fast the processing can be on the GPU.
The original article is available at: https://fastcompression.com/blog/gpu-hdr-processing-sony-pregius-image-sensors.htm
Image Processing Framework on Jetson
Author: Fyodor Serzhenko
Nowadays quite a lot of tasks for image sensors and camera applications are solved with a centralized computing architecture for image processing. Only a minor part of the image processing features is implemented on the image sensor itself; all the rest is done on a CPU/GPU/DSP/FPGA which can reside very close to the image sensor. The latest achievements in hardware and software allow us to get enhanced computing performance and to enlarge the scope of tasks to be solved.
From that point of view, the NVIDIA Jetson series is suited exactly for the task of high-performance image processing from RAW to YUV. An image sensor or camera module can be connected directly to any Jetson via MIPI CSI-2 (2-lane/4-lane), USB3 or PCIe interfaces. Jetson offers high-performance computations either on the ISP or on the GPU. Below we show what can be done on the GPU. We believe that raw image processing on the GPU offers more flexibility, better performance and quality, and easier management in comparison with the hardware-based ISP for many applications.
What is Image Processing Framework for Jetson?
To get high quality and maximum performance in image processing tasks on Jetson, we've implemented a GPU-based SDK for raw processing. Now we are expanding that approach by creating an effective framework to control all system components, including hardware and software. For example, image sensor control should be included in the workflow in real time to become a part of the general control algorithm.
Image processing framework components
Image Sensor Control (exposure, gain, awb)
Image Capture (driver, hardware/software interface, latency, zero-copy)
RAW Image Processing Pipeline (full raw to rgb workflow)
Image Enhancement
Image/Video Encoding (JPEG/J2K and H.264/H.265)
Compatibility with third-party libraries for image processing, ML, AI, etc.
Image Display Processing (smooth player, OpenGL/CUDA interoperability)
Image/Video Streaming (including interoperability with FFmpeg and GStreamer)
Image/Video Storage
Additional features for the framework
Image Sensor and Lens Calibration
Quality Control for Image/Video Processing
CPU/GPU/SSD balance load, performance optimization, profiling
Implementing image sensor control in the workflow brings additional essential features. For example, integrated exposure and gain control allows getting better quality under varying illumination. Apart from that, calibration data usually depends on exposure/gain, which means we will be able to utilize the correct processing parameters at any moment and for any viewing conditions.
In general, the standard RAW concept lacks internal camera parameters and full calibration data. We can solve that problem by including image sensor control both in calibration and in image processing. We can utilize an image sensor abstraction layer to take into account the full metadata for each frame.
Such a solution depends on the image sensor used and the task to be solved, so we can configure and optimize the Image Processing Framework for a particular image sensor from SONY, Gpixel or CMOSIS. These solutions on Jetson have already been implemented by the teams of Fastvideo and MRTech.
Integrated Image Sensor Control
Exposure time
AWB
Gain
ROI (region of interest)
Full image sensor control also includes bit depth, FPS (frames per second), raw image format, bit packing, mode of operation, etc.
GPU image processing modules on Jetson for 16/32-bit pipeline
Raw image acquisition from image sensor via MIPI/USB3/PCIe interfaces
Frame unpacking
Raw image linearization
Dark frame subtraction
Flat field correction
Dynamic bad pixel removal
White balance
RAW and RGB histograms as an indication to control image sensor exposure time
Demosaicing with L7, DFPD, MG algorithms
Color correction
Denoising with wavelets
Color space and format conversions
Curves and Levels
Flip/Flop, Rotation to 90/180/270 or to arbitrary angle
Crop and Resize (upscale and downscale)
Undistortion via Remap
Local contrast
Tone mapping
Gamma
Realtime output via OpenGL
Trace module for debugging and bug fixing
Stream-per-thread support for better performance
Additional modules: tile support, image split into separate planes, RGB to Gray transform, defringe, etc.
Time measurements for all SDK modules
Image/Video Encoding modules on GPU
RAW Bayer encoding
JPEG encoding (visually lossless image compression with 8-bit or 12-bit per channel)
JPEG2000 encoding (lossy and lossless image compression with 8-16 bits per channel)
H264 encoder/decoder, streaming, integration with FFmpeg (8-bit per channel)
H265 encoder/decoder, streaming, integration with FFmpeg (8/10-bit per channel)
Is it better or faster than NVIDIA ISP for Jetson?
There are a lot of situations where the answer is YES. The NVIDIA ISP for Jetson is a great product: it's free, versatile and reliable, and it takes less power and compute load from Jetson. But we have our own advantages which are also of great importance for our customers:
Processing performance
Image quality
Flexibility in building custom image processing pipeline
Wide range of available image processing modules for camera applications
Image processing with 16/32-bit precision
High-performance codecs: JPEG, JPEG2000 (lossless and lossy)
High-performance 12-bit JPEG encoder
Raw Bayer Codec
Dynamic bad pixel suppression
High quality demosaicing algorithms
Wavelet-based denoiser on GPU for Bayer and RGB images
Flexible output with desired image resolution, bit depth, color/grayscale, rotation, according to ML/AI requirements
We've built that software from scratch and have been working in this field for more than 10 years, so we have the experience to offer reliable solutions and support. Apart from that, we offer custom software design to solve almost any problem in a timely manner.
What are benefits of that approach?
That approach allows us to create embedded image processing solutions on Jetson with high quality, exceptional performance, low latency and full image sensor control. A software-based solution combined with GPU image processing on NVIDIA Jetson helps our customers create their imaging products with minimum effort and maximum quality and performance.
Other blog posts about Jetson hardware and software
Benchmark comparison for Jetson Nano, TX2, Xavier NX and AGX
Jetson Image Processing
Jetson Zero Copy
Jetson Nano Benchmarks on Fastvideo SDK
JPEG2000 performance benchmarks on Jetson TX2
Jetson AGX Xavier performance benchmarks
Remotely operated walking excavator on Jetson
Low latency H.264 streaming on Jetson TX2
Performance speedup for Jetson TX2 vs AGX Xavier
Fastvideo SDK vs NVIDIA NPP Library
The original article is available at: https://fastcompression.com/blog/jetson-image-processing-framework.htm
Subscribe to our mail list: https://mailchi.mp/fb5491a63dff/fastcompression
Fastvideo SDK vs NVIDIA NPP Library
Author: Fyodor Serzhenko
Why is Fastvideo SDK better than NPP for camera applications?
What is Fastvideo SDK?
Fastvideo SDK is a set of software components which make up a high-quality image processing pipeline for camera applications. It covers all image processing stages, from raw image acquisition from the camera to JPEG compression with storage to RAM or SSD. All image processing is done completely on the GPU, which leads to real-time performance, or even several times faster than real time, for the full pipeline. We can also offer a high-speed imaging SDK for non-camera applications on NVIDIA GPUs: offline raw processing, high-performance web, digital cinema, video walls, FFmpeg codecs and filters, 3D, AR/VR, DL/AI, etc.
Who are Fastvideo SDK customers?
Fastvideo SDK is compatible with Windows/Linux/ARM and is mostly intended for camera manufacturers and system integrators developing end-user solutions containing video cameras as a part of their products.
The other type of Fastvideo SDK customers are developers of new hardware or software solutions in various fields: digital cinema, machine vision and industrial, transcoding, broadcasting, medical imaging, geospatial, 3D, AR/VR, DL, AI, etc.
All the above customers need faster image processing with higher quality and better latency. In most cases CPU-based solutions are unable to meet such requirements, especially for multicamera systems.
Customer pain points
According to our experience and expertise, when developing end-user solutions, customers usually have to deal with the following challenges.
Before starting to create a product, customers need to know the image processing performance, quality and latency for the final application.
Customers need reliable software which has already been tested and will not glitch when it is least expected.
Customers are looking for an answer on how to create a new solution with higher performance and better image quality.
Customers need external expertise in image processing, GPU software development and camera applications.
Customers have limited (time/human) resources to develop end-user solutions bound by contract conditions.
They need a ready-made prototype as a part of the solution to demonstrate a proof of concept to the end user.
They want immediate support and answers to their questions regarding the fast image processing software's performance, image quality and other technical details, which can be delivered only by industry experts with many years of experience.
Fastvideo SDK business benefits
Fastvideo SDK as a part of complex solutions allows customers to gain competitive advantages.
Customers are able to design solutions which earlier may have seemed to be impossible to develop within required timeframes and budgets.
The product helps to decrease the time to market of end-user solutions.
At the same time, it increases overall end-user satisfaction with reliable software and prompt support.
As a technology solution, Fastvideo SDK improves both image quality and processing performance at the same time.
Fastvideo serves customers as a technology advisor in the field of fast image processing: the team of experts provides end-to-end service to customers. That means that all customer questions regarding Fastvideo SDK, as well as any other technical questions about fast image processing are answered in a timely manner.
Fastvideo SDK vs NVIDIA NPP comparison
NVIDIA NPP can be described as a general-purpose solution: NVIDIA implemented a huge set of functions intended for applications in various industries, with a focus on generic image processing tasks. Moreover, NPP lacks consistency in feature delivery, as some specific image processing modules are absent from the NPP library. This leads us to the conclusion that NPP is a good solution for basic camera applications only. It is just a set of functions which users can utilize to develop their own pipeline.
Fastvideo SDK, on the other hand, is designed to implement a full 16/32-bit image processing pipeline on GPU for camera applications (machine vision, scientific, digital cinema, etc.). Our end-user applications are based on Fastvideo SDK, and we collect customer feedback to improve the SDK's quality and performance. We are armed with profound knowledge of customer needs and offer exceptionally reliable and heavily tested solutions.
Fastvideo uses a specific approach in Fastvideo SDK which is based on components (not on functions, as in NPP). It is easier to build a pipeline from components, as the components' inputs and outputs are standardized. Every component executes a complete operation and can have a complex internal architecture, whereas in NPP the same result may require combining several functions. It is important to emphasize that developing an application with the built-in Fastvideo SDK is much less complex than creating a solution based on NVIDIA NPP.
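To illustrate the component idea (this is a hypothetical interface for explanation only, not the actual Fastvideo SDK API):

    // Hypothetical illustration: each stage consumes and produces a
    // standardized image buffer, so stages chain without glue code.
    #include <vector>
    #include <memory>
    #include <cstdint>

    struct ImageBuffer { int width, height, channels; std::vector<uint16_t> data; };

    struct Component {                              // standardized stage interface
        virtual ImageBuffer process(const ImageBuffer& in) = 0;
        virtual ~Component() = default;
    };

    ImageBuffer run_pipeline(ImageBuffer img,
                             const std::vector<std::unique_ptr<Component>>& stages) {
        for (const auto& s : stages) img = s->process(img);  // e.g. BPC -> debayer -> ...
        return img;
    }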
The Fastvideo JPEG codec and many other SDK features have been heavily tested by our customers for many years, with a total performance benchmark of more than a million images per second. This is a question of software reliability, and we consider it one of our most important advantages.
Most of the Fastvideo SDK components (debayers and codecs) offer both high performance and high image quality at the same time, leaving the NPP alternatives behind. What's more, this is also true for embedded solutions on Jetson, where computing performance is quite limited. For example, NVIDIA NPP only has a bilinear debayer, so it can be regarded as a low-quality solution, best suited for software prototype development.
Summing up this section, we need to specify the following technological advantages of the Fastvideo SDK over NPP in terms of image processing modules for camera applications:
High-performance codecs: JPEG, JPEG2000 (lossless and lossy)
High-performance 12-bit JPEG encoder
Raw Bayer Codec
Flat-Field Correction together with dark frame subtraction
Dynamic bad pixel suppression in Bayer images
Four high quality demosaicing algorithms
Wavelet-based denoiser on GPU for Bayer and RGB images
Filters and codecs on GPU for FFmpeg
Other modules like color space and format conversions
To summarize, Fastvideo SDK offers an image processing workflow which is standard for digital cinema applications, and could be very useful for other imaging applications as well.
Why should customers consider Fastvideo SDK instead of NVIDIA NPP?
Fastvideo SDK provides better image quality and processing performance for implementing key algorithms for camera applications. The real-time mode is an essential requirement for any camera application, especially for multi-camera systems.
Over the last few years, we've tested NPP intensively and encountered software bugs which weren't fixed. In contrast, if customers come to us with any bug in Fastvideo SDK, we fix it within a couple of days, because Fastvideo possesses all the source code and the image processing modules are implemented by the Fastvideo development team. Support is our priority: that's why our customers can rely on our SDK.
We offer custom development to meet our customers' specific requirements. Our development team can build GPU-based image processing modules from scratch according to the customer's request, whereas NVIDIA provides nothing of the kind.
We are focused on high-performance camera applications, we have years of experience, and our solutions have been heavily tested in many projects. For example, our customer vk.com has been processing 400,000 JPG images per second for years without any issue, which means our software is extremely reliable.
Software downloads to evaluate the Fastvideo SDK
GPU Camera Sample application with source codes including SDKs for Windows/Linux/ARM - https://github.com/fastvideo/gpu-camera-sample
Fast CinemaDNG Processor software for Windows and Linux - https://www.fastcinemadng.com/download/download.html
Demo applications (JPEG and J2K codecs, Resize, MG demosaic, MXF player, etc.) from https://www.fastcompression.com/download/download.htm
Fast JPEG2000 Codec on GPU for FFmpeg
You can test your RAW/DNG/MLV images with Fast CinemaDNG Processor software. To create your own camera application, please download the source codes from GitHub to get a ready solution ASAP.
Useful links for projects with the Fastvideo SDK
1. Software from Fastvideo for GPU-based CinemaDNG processing is 30-40 times faster than Adobe Camera Raw:
http://ir-ltd.net/introducing-the-aeon-motion-scanning-system
2. Fastvideo SDK offers high-performance processing and real-time encoding of camera streams with very high data rates:
https://www.fastcompression.com/blog/gpixel-gmax3265-image-sensor-processing.htm
3. GPU-based solutions from Fastvideo for machine vision cameras:
https://www.fastcompression.com/blog/gpu-software-machine-vision-cameras.htm
4. How to work with scientific cameras with 16-bit frames at high rates in real-time:
https://www.fastcompression.com/blog/hamamatsu-orca-gpu-image-processing.htm
The original article is available at: https://fastcompression.com/blog/fastvideo-sdk-vs-nvidia-npp.htm
Subscribe to our mail list: https://mailchi.mp/fb5491a63dff/fastcompression
GPU vs CPU in Image Processing. Why is GPU much faster than CPU?
Author: Fyodor Serzhenko
I. INTRODUCTION
Over the past decade, there have been many technical advances in GPUs (graphics processing units), so they can successfully compete with established solutions (for example, CPUs, or central processing units) and be used for a wide range of tasks, including fast image processing.
In this article, we will discuss the capabilities of GPUs and CPUs for performing fast image processing tasks. We will compare two processors and show the advantages of GPU over CPU, as well as explain why image processing on a GPU can be more efficient when compared to similar CPU-based solutions.
In addition, we will go through some common misconceptions that prevent people from using a GPU for fast image processing tasks.
II. ABOUT FAST IMAGE PROCESSING ALGORITHMS
For the purposes of this article, we’ll focus specifically on fast image processing algorithms that have such characteristics as locality, parallelizability, and relative simplicity.
Here’s a brief description of each characteristic:
Locality. Each pixel is calculated based on a limited number of neighboring pixels.
Good potential for parallelization. Each pixel does not depend on the data from the other processed pixels, so tasks can be processed in parallel.
16/32-bit precision arithmetic. Typically, 32-bit floating point arithmetic is sufficient for image processing and a 16-bit integer data type is sufficient for storage.
Important criteria for fast image processing
The key criteria which are important for fast image processing are:
Performance
Maximum performance of fast image processing can be achieved in two ways: either by increasing hardware resources (specifically, the number of processors), or by optimizing the software code. When comparing the capabilities of GPU and CPU, GPU outperforms CPU in the price-to-performance ratio. It’s possible to realize the full potential of a GPU only with parallelization and thorough multilevel (both low-level and high-level) algorithm optimization.
Image processing quality
Another important criterion is the image processing quality. There may be several algorithms for exactly the same image processing operation that differ in resource intensity and in the quality of the result. Multilevel optimization is especially important for resource-intensive algorithms and yields essential performance benefits. After multilevel optimization is applied, advanced algorithms return results within a reasonable time, comparable to the speed of fast but crude algorithms.
Latency
A GPU has an architecture that allows parallel pixel processing, which leads to a reduction in latency (the time it takes to process a single image). CPUs provide only modest latency improvements, since parallelism on a CPU is implemented at the level of frames, tiles, or image lines.
III. GPU vs CPU: KEY DIFFERENCES
Let's have a look at the key differences between GPU and CPU.
1. The number of threads on a CPU and GPU
CPU architecture is designed in such a way that each physical CPU core can execute two threads on two virtual cores. In this case, each thread executes the instructions independently.
At the same time, the number of GPU threads is tens or hundreds of times greater, since these processors use the SIMT (single instruction, multiple threads) programming model. Here, a group of threads (usually 32) executes the same instruction. Thus, a group of threads in a GPU can be considered the equivalent of a CPU thread, or in other words a genuine GPU thread.
2. Thread implementation on CPU and GPU
One more difference between GPUs and CPUs is how they hide instruction latency.
A CPU uses out-of-order execution for these purposes, whereas a GPU uses genuine thread rotation, launching instructions from different threads every cycle. The method used on the GPU is more efficient as a hardware implementation, but it requires the algorithm to be parallel and the load to be high.
It follows that many image processing algorithms are ideal candidates for implementation on a GPU.
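A minimal CUDA example of such a per-pixel algorithm: every output pixel depends only on the corresponding input pixel, so the image maps directly onto tens of thousands of GPU threads (illustrative code; the kernel and its parameters are our own):

    // Per-pixel gain applied to a 16-bit image, one thread per pixel.
    __global__ void apply_gain(const unsigned short* in, unsigned short* out,
                               int width, int height, float gain) {
        const int x = blockIdx.x * blockDim.x + threadIdx.x;
        const int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;     // guard against partial blocks
        const float v = in[y * width + x] * gain;
        out[y * width + x] = (unsigned short)fminf(v, 65535.0f);
    }

    // Launch with one thread per pixel, e.g.:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // apply_gain<<<grid, block>>>(d_in, d_out, width, height, 1.5f);

For a 4K frame this launch creates roughly eight million threads, which is exactly the degree of parallelism a GPU needs to be fully loaded.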
IV. ADVANTAGES OF GPU OVER CPU
Our own lab research has shown that if we compare ideally optimized software for the GPU and for the CPU (with AVX2 instructions), the GPU advantage is tremendous: GPU peak performance is around ten times higher than CPU peak performance for hardware of the same year of production, both for 32-bit and 16-bit data types. The GPU memory subsystem bandwidth is significantly higher as well.
If we make a comparison with non-optimized CPU software without AVX2 instructions, then GPU performance advantage could reach 50-100 times.
All modern GPUs are equipped with shared memory, or memory that is simultaneously available to all the cores of one multiprocessor, which is essentially a software-controlled cache. This is ideal for algorithms with a high degree of locality. The bandwidth of the shared memory is several times faster than the bandwidth of CPU’s L1 cache.
The other important feature of a GPU compared to a CPU is that the number of available registers can be changed dynamically (from 64 to 256 per thread), thereby reducing the load on the memory subsystem. To compare, x86 and x64 architectures use 16 universal registers and 16 AVX registers per thread.
There are several hardware modules on a GPU for simultaneous execution of completely different tasks: image processing (ISP) on Jetson, asynchronous copy to and from GPU, computations on GPU, video encoding and video decoding (NVENC and NVDEC), tensor kernels for neural networks, OpenGL, DirectX, and Vulkan for rendering.
Still, all these advantages of a GPU over a CPU involve a high demand for parallelism of algorithms. While tens of threads are sufficient for maximum CPU load, tens of thousands are required to fully load a GPU.
Embedded applications
Another type of task to consider is embedded solutions. In this case, GPUs are competing with specialized devices such as FPGAs (Field-Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits).
The main advantage of GPUs over these devices is significantly greater flexibility. A GPU is a serious alternative for some embedded applications, since powerful multi-core processors don’t meet requirements like size and power budget.
V. USER MISCONCEPTIONS
1. Users have no experience with GPUs, so they try to solve their problems with CPUs
One of the main user misconceptions is associated with the fact that 10 years ago GPUs were considered inappropriate for high-performance tasks.
But technologies are developing rapidly, and while GPU image processing integrates well with CPU processing, the best results are achieved when fast image processing is done on a GPU.
2. Multiple data copy to GPU and back kills performance
This is another bias among users regarding GPU image processing.
As it turns out, this is a misconception as well: the best solution is to implement all processing on the GPU within one task. The source data can be copied to the GPU just once, and the computation results are returned to the CPU at the end of the pipeline; the intermediate data remains on the GPU. Copying can also be performed asynchronously, in parallel with computations on the next/previous frame.
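A sketch of this copy-once pattern in CUDA (illustrative: the placeholder kernel stands in for a real pipeline stage, and host buffers are assumed to be allocated by the caller):

    #include <cuda_runtime.h>

    __global__ void pipeline_stage(const unsigned short* in, unsigned short* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                 // placeholder for a real stage
    }

    void process_frame(const unsigned short* h_raw, unsigned short* h_out, int n) {
        unsigned short *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(unsigned short));
        cudaMalloc(&d_out, n * sizeof(unsigned short));
        cudaStream_t s;
        cudaStreamCreate(&s);
        // one upload per frame; all intermediate data stays in GPU memory
        cudaMemcpyAsync(d_in, h_raw, n * sizeof(unsigned short), cudaMemcpyHostToDevice, s);
        pipeline_stage<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);  // ...more stages here
        cudaMemcpyAsync(h_out, d_out, n * sizeof(unsigned short), cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);  // host buffers should be pinned (cudaHostAlloc) for true overlap
        cudaStreamDestroy(s);
        cudaFree(d_in); cudaFree(d_out);
    }

With two streams and pinned host memory, the upload of frame N+1 can overlap the kernels of frame N, hiding the transfer time completely.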
3. Small shared memory capacity, which is just 96 kB for each multiprocessor
Despite the small capacity of GPU shared memory, a size of 96 KB per multiprocessor may be sufficient if the shared memory is managed efficiently. This is the essence of software optimization for CUDA and OpenCL. It is not possible to simply transfer software code from a CPU to a GPU without taking the specifics of the GPU architecture into consideration.
4. Insufficient size of the global GPU memory for complex tasks
This is an essential point. First of all, it is addressed by manufacturers, who release new GPUs with larger memory sizes. Secondly, it is possible to implement a memory manager to reuse GPU global memory.
5. Libraries for processing on the CPU use parallel computing as well
CPUs can work in parallel through vector instructions such as AVX or via multithreading (for example, via OpenMP). In most cases, parallelization is done in the simplest way: each frame is processed in a separate thread, and the software code for processing one frame remains sequential. Using vector instructions involves the complexity of writing and maintaining code for different architectures, processor models, and systems. Vendor-specific libraries, like Intel IPP, are highly optimized. Issues arise when the required functionality is not in the vendor libraries and you have to use third-party open-source or proprietary libraries, which can lack optimization.
Another aspect which is negatively affecting the performance of mainstream libraries is the widespread adoption of cloud computing. In most cases, it’s much cheaper for a developer to purchase additional capacity in the cloud than to develop optimized libraries. Customers request quick product development, so developers are forced to use relatively simple solutions which aren’t the most effective.
Modern industrial cameras generate video streams with extremely high data rates, which often preclude the possibility of transmitting data over the network to the cloud for processing, so local PCs are usually used to process the video stream from the camera. The computer used for processing should have the required performance and, more importantly, it must be purchased at the early stages of the project. Solution performance depends both on hardware and software. During the initial stages of the project, you should also consider what kind of hardware you’re using. If it’s possible to use mainstream hardware, any software can be used. If expensive hardware is to be used as a part of the solution, the price-performance ratio is rapidly increasing, and it requires using optimized software.
Processing data from industrial video cameras involves a constant load. The load level is determined by the algorithms used and camera bitrate. The image processing system should be designed at the initial stages of the project in order to cope with the load within a guaranteed margin, otherwise it will be impossible to process the streams without data loss. This is a key difference from web systems, where the load is unbalanced.
VI. SUMMARY
Summing up, we come to the following conclusions:
1. GPU is an excellent alternative to CPU for solving complex image processing tasks.
2. The performance of optimized image processing solutions on a GPU is much higher than on a CPU. As a confirmation, we suggest that you refer to other articles on the Fastvideo blog, which describe other use cases and benchmarks on different GPUs for commonly used image processing and compression algorithms.
3. GPU architecture allows parallel processing of image pixels which, in turn, leads to a reduction of the processing time for a single image (latency).
4. High GPU performance software can reduce hardware cost in such systems, and high energy efficiency reduces power consumption. The cost of ownership of GPU-based image processing systems is lower than that of systems based on CPU only.
5. A GPU has the flexibility, high performance, and low power consumption required to compete with highly specialized FPGA / ASIC solutions for mobile and embedded applications.
6. Combining the capabilities of CUDA / OpenCL and hardware tensor kernels can significantly increase performance for tasks using neural networks.
Addendum #1: Peak performance comparison for CPU and GPU
We will make the comparison for the float type (a 32-bit real value), which suits most image processing tasks well. We will evaluate the performance per single core. In the case of the CPU everything is simple: we are talking about the performance of a single physical core. For the GPU, things are somewhat more complicated. What is commonly called a GPU core is essentially an ALU, or SP (Streaming Processor) in NVIDIA terminology. The real analog of the CPU core is the SM (Streaming Multiprocessor in NVIDIA terminology). The number of streaming processors in a single multiprocessor depends on the GPU architecture: NVIDIA Turing graphics cards contain 64 SPs in one SM, while NVIDIA Ampere has 128 SPs.
One SP can execute one FMA (Fused Multiply-Add) instruction per clock cycle. The FMA instruction is selected here for comparison because it is used in convolution filters; its integer counterpart is called MAD. The instruction (one of its variants) performs the following action: B = AX + B, where B is the accumulator that accumulates the convolution values, A is the filter coefficient, and X is the pixel value. Such an instruction performs two operations, a multiplication and an addition. This gives us the performance per clock cycle for one SM: Turing - 2*64 = 128 FLOP, Ampere - 2*128 = 256 FLOP.
Modern CPUs can execute two FMA instructions from the AVX2 instruction set per clock cycle. Each such instruction operates on 8 float operands, i.e. performs 16 FLOP. In total, one CPU core performs 2*16 = 32 FLOP per clock cycle.
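To make this concrete, here is a minimal sketch of one such instruction in host C++ code (the function name is ours; it requires a CPU and compiler flags with AVX2/FMA support):

#include <immintrin.h>

// One AVX2/FMA instruction processes 8 packed floats: b = a * x + b,
// i.e. 8 multiplications + 8 additions = 16 FLOP per instruction.
// A core that issues two of these per clock reaches 2 * 16 = 32 FLOP/clock.
__m256 fma8(__m256 a, __m256 x, __m256 b)
{
    return _mm256_fmadd_ps(a, x, b); // compile with -mavx2 -mfma
}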
To get performance per unit of time, we multiply the number of operations per clock cycle by the device frequency. On average, GPU frequency is in the range of 1.5-1.9 GHz, while a CPU with all cores loaded runs at around 3.5-4.5 GHz. The FMA instructions from the AVX2 set are quite heavy for the CPU: executing them engages a large part of the chip and greatly increases heat generation, which causes the CPU to lower its frequency to avoid overheating. The amount of frequency reduction differs between CPU series; for example, according to this article, we can estimate a decrease to 0.7 of the maximum. Below we will use the coefficient 0.8, which corresponds to newer generations of CPUs.
We can assume that the CPU frequency is 2.5 times higher than that of the GPU. Taking into account the frequency reduction factor for AVX2 instructions, we get 2.5*0.8 = 2. The relative FMA performance compared with a CPU core is then: Turing SM: 128 / (2.0*32) = 2 times, Ampere SM: 256 / (2.0*32) = 4 times, i.e. one SM is more powerful than one CPU core.
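The same arithmetic, written out as code (all constants are the estimates from the text above, not measured values):

// Peak per-clock FLOP figures and the resulting ratios, per the estimates above.
constexpr int   cpu_flop_per_clock   = 2 * 16;      // two AVX2 FMAs, 16 FLOP each
constexpr int   turing_sm_flop_clock = 2 * 64;      // 64 SPs, 2 FLOP per FMA
constexpr int   ampere_sm_flop_clock = 2 * 128;     // 128 SPs, 2 FLOP per FMA
constexpr float freq_ratio           = 2.5f * 0.8f; // CPU/GPU clock ratio x AVX2 derating

constexpr float turing_vs_core = turing_sm_flop_clock / (freq_ratio * cpu_flop_per_clock); // = 2
constexpr float ampere_vs_core = ampere_sm_flop_clock / (freq_ratio * cpu_flop_per_clock); // = 4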
Let's estimate L1 throughput per CPU core. Modern CPUs can load two 256-bit registers from the L1 cache in parallel, i.e. 64 bytes per clock cycle. The GPU has a unified shared memory/L1 block. Shared memory throughput is the same for the Turing and Ampere architectures: 32 float values per clock cycle, or 128 bytes per clock cycle. Taking the frequency ratio into account, we get a throughput ratio of 128 (bytes per clock on the GPU) / (2 (CPU-to-GPU frequency ratio) * 64 (bytes per clock on the CPU)) = 1.
Let's also compare the L1 and shared memory sizes for the CPU and GPU. For the CPU, the standard L1 data cache size is 32 KB. A Turing SM has 96 KB of unified shared memory/L1 (shared memory takes up to 64 KB), and an Ampere SM has 128 KB of unified shared memory/L1 (shared memory takes up to 100 KB).
To evaluate overall performance, we need the number of cores per device or socket. For desktop CPUs we take 16 cores (AMD Ryzen, Intel i9), which can be considered an average core count for high-performance CPUs on the market. The NVIDIA Quadro RTX 6000 (Turing architecture) has 72 SMs; the NVIDIA RTX A6000 (Ampere architecture) has 84 SMs. The ratio of GPU SMs to CPU cores is therefore 72/16 = 4.5 (Turing) and 84/16 = 5.25 (Ampere).
Based on this, we can evaluate overall performance for the float type. For top Turing graphics cards we get 4.5 (ratio of GPU to CPU core count) * 2 (ratio of single-SM to single-core performance) = 9 times. For Ampere graphics cards we get 5.25 * 4 = 21 times.
Let's estimate the relative throughput of CPU L1 versus GPU shared memory: 4.5 (core count ratio) * 1 (single-SM to single-core throughput ratio) = 4.5 times for Turing, and 5.25 * 1 = 5.25 times for Ampere. The ratios vary slightly for specific CPU models, but the order of magnitude stays the same.
The result reflects a significant advantage of the GPU over the CPU in both arithmetic performance and on-chip memory throughput for the kinds of computations image processing relies on.
It is important to bear in mind that these results hold for the CPU only when AVX2 instructions are used. With scalar instructions, CPU performance drops by a factor of 8, both in arithmetic throughput and in memory throughput. For modern CPUs, software optimization is therefore of particular importance.
A few words about the newer AVX-512 instruction set for the CPU. This is the next generation of SIMD instructions, with the vector length increased to 512 bits. In the long run it is expected to double performance compared with AVX2; current CPUs provide a speedup of up to 1.6 times, since these instructions require an even larger frequency reduction than AVX2. AVX-512 is not yet widespread in the mass-market segment, but that is likely to change. The downsides of this approach are the need to adapt algorithms to the new vector length and to recompile the code to support it.
Let's try to compare system memory bandwidth. Here we see a significant spread of values. For mass-market CPUs the starting point is 50 GB/s (a 2-channel DDR4-3200 controller). In the workstation segment, CPUs with four-channel controllers dominate, with a bandwidth of around 100 GB/s. Server CPUs with 6-8 channel controllers can exceed 150 GB/s.
For the GPU, global memory bandwidth varies over a wide range: from 450 GB/s for the Quadro RTX 5000 up to 1550 GB/s for the latest A100. In comparable segments the throughputs thus differ substantially, by up to an order of magnitude.
From all of the above we can conclude that the GPU significantly (sometimes by almost an order of magnitude) outperforms a CPU executing optimized code. Against non-optimized CPU code the difference can be even greater, up to 50-100 times. All this makes a strong case for GPU acceleration in widespread image processing applications.
Addendum #2: Memory-bound and compute-bound algorithms
When we talk about these types of algorithms, it is important to understand that we always mean a specific implementation of an algorithm on a specific architecture. Each processor has a certain peak arithmetic performance. If the implementation can reach the processor's peak performance on the target instructions, it is compute-bound; otherwise the main limitation is memory access, and the implementation is memory-bound.
The memory subsystem of every processor is hierarchical, consisting of several levels. The closer a level is to the processor, the smaller and faster it is. The first level is the L1 data cache; the last level is RAM (or, on the GPU, global memory).
An algorithm can be compute-bound at the first level of the hierarchy and become memory-bound at higher levels.
Consider the task of summing two arrays and writing the result to a third: X = Y + Z, where X, Y, Z are arrays. Suppose we use AVX instructions on the CPU. Per vector of elements we then need two reads, one addition, and one write. A modern CPU can perform two reads and one write to the L1 cache simultaneously, but at the same time it can also execute two arithmetic instructions, of which we use only one. This means the array summation algorithm is memory-bound already at the first level of the memory hierarchy.
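A sketch of the inner loop just described (AVX2 host code; for simplicity the array length is assumed to be a multiple of 8):

#include <immintrin.h>

// X = Y + Z: per 8 elements we issue two loads, one add and one store.
// The L1 port budget (two reads + one write per clock) is saturated,
// while only one of the two ALU issue slots is used, so the loop is
// memory-bound already at the L1 level.
void add_arrays(const float* y, const float* z, float* x, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 vz = _mm256_loadu_ps(z + i);
        _mm256_storeu_ps(x + i, _mm256_add_ps(vy, vz));
    }
}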
Now consider a second algorithm: image filtering with a 3×3 window. Image filtering is based on convolving a pixel neighborhood with the filter coefficients. The MAD (or FMA, depending on the architecture) instruction is used to compute the convolution; for a 3×3 window we need 9 such instructions. Each performs B = A*X + B, where B is the accumulator storing the convolution values, A is the filter coefficient, and X is the pixel value. A and B live in registers, and the pixel values are loaded from memory, so one load is required per FMA instruction. Here the CPU can feed both FMA ports thanks to its two load ports, and the processor is fully utilized. This algorithm can be considered compute-bound.
Let's look at the same algorithm at the RAM access level, taking the most memory-efficient implementation, where a single read of a pixel updates all 9 windows it belongs to. In this case there are 9 FMA instructions per read operation. Thus a single CPU core processing float data at 4 GHz requires 2 (instructions per clock cycle) × 8 (floats per AVX register) × 4 (bytes per float) × 4 (GHz) / 9 ≈ 28.4 GB/s. A dual-channel DDR4-3200 controller has a peak throughput of 50 GB/s and can feed only about two CPU cores in this task. Therefore, on an 8-16 core processor this algorithm is memory-bound, even though it is balanced at the lower level.
Now consider the same algorithm implemented on the GPU. The GPU clearly has a less balanced architecture at the SM level, with a bias toward computation: for the Turing architecture the ratio of arithmetic rate (in float) to shared memory load throughput is 2:1, and for Ampere it is 4:1. Thanks to the larger register file on the GPU, the optimization described above can be implemented in GPU registers, which balances the algorithm even on Ampere, so at the shared memory level the implementation remains compute-bound. At the top memory level (global memory), the calculation for the Quadro RTX 5000 (Turing) gives 64 (FMA operations per clock cycle) × 4 (bytes per float) × 1.7 (GHz) / 9 ≈ 48.3 GB/s per SM. The ratio of total global memory throughput to per-SM demand is 450 / 48.3 ≈ 9.3, while the Quadro RTX 5000 has 48 SMs, so global memory can feed only about 9 of them. That is, at the top level, the filtering algorithm on the GPU is memory-bound.
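For reference, here is a minimal CUDA sketch of the 3×3 convolution discussed above (unoptimized: border pixels are skipped, and a real implementation would stage the window in registers or shared memory, but the 9 fmaf operations per output pixel are the same):

// 3x3 convolution: 9 FMA operations per output pixel.
// c[9] holds the filter coefficients; pitch is the row stride in floats
// (assumed equal for src and dst in this sketch).
__global__ void conv3x3(const float* src, float* dst, int w, int h,
                        int pitch, const float* c)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc = fmaf(c[(dy + 1) * 3 + (dx + 1)],
                       src[(y + dy) * pitch + (x + dx)], acc);
    dst[y * pitch + x] = acc;
}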
As the window size grows, the algorithm becomes more computationally complex and shifts toward compute-bound. Most image processing algorithms are memory-bound at the global memory level, and since the GPU's global memory bandwidth is in many cases an order of magnitude greater than the CPU's, this translates into a comparable performance gain.
Addendum #3: SIMD and SIMT models, or why there are so many threads on GPU
To improve CPU performance, SIMD (single instruction, multiple data) instructions are used. One such instruction performs several identical operations on a vector of data. The advantage of this approach is that it increases performance without significantly modifying the instruction pipeline. All modern CPUs, both x86 and ARM, have SIMD instructions. The disadvantage is the complexity of programming. The main approach to SIMD programming is intrinsics: built-in compiler functions, each containing one or more SIMD instructions plus instructions for preparing parameters. Intrinsics form a low-level language very close to assembler, which is hard to use, and each instruction set comes with its own set of intrinsics. As soon as a new instruction set comes out, everything must be rewritten; switching to a new platform (say, from x86 to ARM) also means rewriting all the software, and sometimes even a compiler change requires adjustments.
The software model of the GPU is called SIMT (single instruction, multiple threads): a single instruction is executed synchronously by multiple threads. The approach can be seen as a further development of SIMD. The scalar programming model hides the vector nature of the hardware, automating and simplifying many operations. That is why most software engineers find it easier to write ordinary scalar code in SIMT than vector code in pure SIMD.
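To illustrate the difference: the array sum from Addendum #2, written as a SIMT kernel, is plain scalar code per thread; the hardware vectorizes it across the 32 threads of a warp (a minimal sketch):

// Each thread handles one element; no vector registers, no intrinsics.
__global__ void add_arrays_simt(const float* y, const float* z, float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = y[i] + z[i];
}

// Launch with enough threads to cover the array, e.g.:
// add_arrays_simt<<<(n + 255) / 256, 256>>>(y, z, x, n);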
The CPU and GPU solve the problem of instruction latency in the pipeline in different ways. Instruction latency is the number of clock cycles the next instruction has to wait for the result of the previous one. For example, if an instruction has a latency of 3 and the CPU can issue 4 such instructions per clock, then in 3 clock cycles the processor can execute 2 dependent instructions or 12 independent ones. To avoid pipeline stalls, all modern CPUs use out-of-order execution: the processor analyzes data dependencies within the out-of-order window and issues independent instructions out of program order.
The GPU takes a different approach based on multithreading. The GPU keeps a pool of threads. Each clock cycle, one thread is selected, one instruction from that thread is issued for execution, and on the next clock cycle the next thread is selected, and so on. Once one instruction has been issued from every thread in the pool, the GPU returns to the first thread. This approach hides the latency of dependent instructions by executing instructions from other threads in between.
When programming the GPU, we have to distinguish two levels of threads. The first level is what SIMT execution is built from: on NVIDIA GPUs, groups of 32 adjacent threads called warps. A Turing SM supports up to 1024 resident threads, which form 32 warps; SIMT execution is organized within each warp. Unlike the threads inside a warp, different warps can execute different instructions at the same time.
Thus, a Turing streaming multiprocessor is a vector machine with a vector length of 32 and 32 independent warps, while a CPU core with AVX is a vector machine with a vector length of 8 and two independent (SMT) threads.
The original article is available at https://www.fastcompression.com/blog/gpu-vs-cpu-fast-image-processing.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Part 2: JPEG2000 solutions in science and healthcare. JP2 format limitations
Author: Fyodor Serzhenko
In the first part of the article, JPEG 2000 in science, healthcare, digital cinema and broadcasting, we discussed the key technologies of JPEG2000 and focused on its application in digital cinema.
In this second part, we will continue examining the features of JPEG2000, review its main drawback, and discuss other application areas where the format is in high demand. At the end we will present a solution that makes working with the format simpler and far more convenient.
1. JPEG2000 in science and medicine
Support for window mode (decoding only a selected region) is one of the handy features that make JPEG2000 attractive. Scientists often have to work with files of enormous resolution, whose width and height can exceed 40,000 pixels, when only a small part of the image is of interest. With standard JPEG you would have to decode the entire image to work with it, while JPEG2000 lets you decode only the selected area.
JP2 is also used for space photography: those wonderful pictures of Mars taken, for example, with the HiRISE camera are available in JP2 format. The data link from space to Earth is subject to interference, so errors may occur during transfer, and entire data packets may even be lost. When a special mode is enabled, however, JPEG2000 is somewhat error-resilient, which helps when communication channels or storage devices are unreliable. This mode allows errors caused by data loss during transmission to be detected. The image is divided into small blocks (for example, 32x32 or 64x64 pixels) and, after preliminary transformations, each bit plane of a block is encoded separately. A lost bit therefore most likely spoils only some of the less significant bit planes, which usually has little effect on overall quality. In JPEG, by contrast, the loss of a single bit can significantly distort a large part of the image or even all of it.
As for how this special integrity-check mode works: additional information is added to the compressed file to verify the correctness of the data. Without it, we often cannot tell during decoding whether an error has occurred, and decoding continues as if nothing had happened; as a result, even one erroneous bit may spoil quite a large part of the image. With this mode enabled, an error is detected as soon as it appears, and its effect on other parts of the image can be limited.
The JPEG2000 format also plays an important role in healthcare. In this field it is extremely important to preserve the full bit depth of the source data so that every subtle detail of the examined area of the body can be captured. JPEG2000 is used in CT, X-ray, MRI and other modalities.
Also, in accordance with FDA (Food and Drug Administration) requirements, medical images must be stored in their original form (without loss). The JPEG2000 format is an ideal solution in this case.
Another interesting feature of JPEG2000 is the compression of three-dimensional data arrays, which is highly relevant both in science and in medicine (for example, 3D tomography results). Part 10 of the JPEG2000 standard, JP3D (volumetric imaging), is devoted to compressing such data.
2. JP2 format limitations
Unfortunately, JP2 (JPEG2000) isn't so simple: in fact, it's not supported by most web browsers (with the exception of Safari). The format is computationally complex, and the existing open source codecs have for years been too slow for active use. Even now, when processor speed grows with each generation and codecs keep being optimized and accelerated, their capabilities still leave something to be desired. To illustrate the importance of codec speed, let's return to digital cinema for a moment, specifically to the creation of DCPs (Digital Cinema Packages), the file sets we enjoy in cinemas. JPEG2000 is the standard for digital cinema and is therefore required to create a DCP package. Unfortunately, its computational complexity makes this task quite resource-intensive and time-consuming. Moreover, existing open source codecs can't decode movies at the required 25, 30 or 60 fps for 12-bit data even at 2K resolution, let alone 4K.
3. How to speed up processing with the JP2 format
JPEG2000 provides modes for operating at higher speed, but this comes at the expense of a slight reduction in quality or compression ratio. For some application areas, however, even the slightest reduction in image quality is unacceptable.
To speed up work with JPEG2000, we at Fastvideo have developed our own implementation of the JPEG2000 codec. Our solution is based on NVIDIA CUDA technology, which makes a parallel implementation of the encoder and decoder possible, using all available CPU and GPU cores.
As a consequence, the Fastvideo solution performs much better than the competition and provides fundamentally new capabilities for users. We believe it will encourage more people to use the JP2 format and significantly speed up JP2 processing for those who already use it. Our goal is to make high-quality images much more accessible to specialists in fields where original image quality is required by default (e.g., science and healthcare).
Other info from Fastvideo concerning JPEG 2000 solutions
JPEG2000 codec on GPU
JPEG2000 vs JPEG vs PNG: What's the Difference?
J2K encoding benchmarks
J2K decoding benchmarks
Fast FFmpeg J2K decoder on NVIDIA GPU
MXF Player
The original article is available at https://www.fastcompression.com/blog/jpeg2000-applications-part2.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
JPEG2000 in science, healthcare, digital cinema and broadcasting
Author: Fyodor Serzhenko
This article is devoted to the JPEG2000 algorithm and will be presented in two parts. In the first part, we will discuss the key technologies of the algorithm and explain why it has become so popular in digital cinema and broadcasting. In the second part, we will talk about other application areas and important features of JPEG2000. We will also discuss its main drawback and present a solution that can significantly improve the usability of JPEG2000.
Part 1: JPEG2000 in digital cinema and broadcasting. Features of JP2 format
The cinema captured the hearts and minds of people all over the world from the very beginning. Comedy movies by Charlie Chaplin and horror films by Alfred Hitchcock left no one indifferent. It took just a little bit more than a century for the industry to evolve from black-and-white silent cinema to IMAX movies, the quality of which leaves a deep impression the moment a spectator watches one for the first time.
Okay, but are you aware of what makes IMAX movies so captivating? And why do they differ so much in video quality from what we are used to watching on standard TV channels? The answer is the compression algorithm and image format used.
JP2 is the file format for images compressed with the JPEG2000 algorithm
1. JPEG2000 in digital cinema
The JP2 format (among others) has been actively used in digital cinema for a long time. The algorithm was developed in 2000 and, in 2004, selected as the digital cinema standard by the Digital Cinema Initiatives (DCI) group, which includes the Disney, Fox, Paramount, MGM, Sony Pictures Entertainment, Universal and Warner Bros. studios. The same year, amendments relating to digital cinema were added to the first part of the JPEG2000 standard.
Good compression was simply necessary for digital cinema: an hour-and-a-half movie in 2K or 4K resolution, with 12-bit color channels at 24 fps, compressed with JPEG2000 at the standard bitrate of 250 Mbit/s, takes up about 160 gigabytes.
The JPEG2000 compression algorithm, thanks to which we can enjoy vivid images in IMAX, is based on two key technologies, the discrete wavelet transform (DWT) and embedded block coding with optimal truncation (EBCOT), each of which has its own role:
DWT creates a multi-scale image representation to select the spatial and frequency components of the image. It makes it possible, for example, to watch a 4K movie in 2K resolution.
EBCOT arranges the data about the pixels of each coded block by importance, providing a smooth degradation of the picture quality as the compression ratio increases.
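To make the DWT side concrete, here is one level of the reversible CDF 5/3 lifting transform used for lossless JPEG2000, as a 1D sketch in host C++ (function name ours; JPEG2000 Part 1 defines the full 2D, multi-level procedure and the exact border extension rules):

#include <vector>

// One level of the reversible CDF 5/3 lifting DWT (1D, integer),
// as used by lossless JPEG2000. Outputs: s = lowpass, d = highpass.
// Symmetric border extension; n is assumed even for simplicity.
// Right shifts implement floor division (arithmetic shift assumed).
void dwt53_forward(const std::vector<int>& x,
                   std::vector<int>& s, std::vector<int>& d)
{
    const int n = (int)x.size();
    const int half = n / 2;
    s.resize(half); d.resize(half);

    auto X = [&](int i) {                  // symmetric extension of the input
        if (i < 0) i = -i;
        if (i >= n) i = 2 * n - 2 - i;
        return x[i];
    };

    // Predict step: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    for (int i = 0; i < half; ++i)
        d[i] = X(2 * i + 1) - ((X(2 * i) + X(2 * i + 2)) >> 1);

    // Update step: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4)
    for (int i = 0; i < half; ++i) {
        int dm1 = (i > 0) ? d[i - 1] : d[0]; // symmetric extension of d
        s[i] = X(2 * i) + ((dm1 + d[i] + 2) >> 2);
    }
}

Because every step is an integer addition of a floored term, each lifting step can be inverted exactly, which is what makes the 5/3 path mathematically lossless.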
2. Format features: 12-bit and lossless compression option
In this section, we will discuss the JPEG2000 format itself, its features and applications. So why are images in JP2 format so fascinating? The answer is simple: color depth. One of the most important advantages of the format is its support for high-bit-depth data. In other words, JP2 is designed to describe each pixel with more bits than a typical consumer monitor can display, and thereby to store more information about color. If you compare a standard JPEG image (8 bits per channel) with an IMAX-format image (12 bits per channel), you'll see that an 8-bit image simply cannot convey the same range of color and brightness as a 12-bit one. This is why IMAX image quality is fundamentally different.
Another important advantage of the JPEG2000 algorithm is the relationship between compression ratio and image quality (by whatever metric it is measured). The file size and the transmission speed depend on the compression ratio, and so does the quality of the restored image; clearly, no one is delighted by artifacts.
Thanks to the wavelets (DWT), images in JP2 don't acquire artifacts at high compression ratios as conspicuous as those of its predecessor JPEG, where the boundaries of the 8x8-pixel blocks become visible. Artifacts cannot be avoided entirely, but visually they are much less noticeable. As a result, JPEG2000 lets you compress images harder while losing much less quality than JPEG does at the same compression ratios. You can find a more detailed comparison of JPEG2000 with JPEG in one of our articles.
It’s worth noting that JPEG2000 was developed to provide both lossy and mathematically lossless compression in a single compression architecture. Depending on its contents, an image can be compressed up to 2.5 times without any quality loss, while its data footprint is decreased to 60%. However, there are always exceptions: some images can’t be reduced in size using lossless compression or compression ratio would be close to 1, but it’s quite achievable for the majority of them. Anyway, such compression capabilities are in great demand wherever it’s necessary to store a large amount of data in a compressed form for a long time (e.g., documentation, images, and video), while maintaining the possibility of lossless recovery. For example, it can be quite useful in libraries, museums, etc.
Lossless compression is of great use in the following situations:
when advanced image analysis or multi-stage processing is performed or is supposed to be performed, and each stage can introduce an additional quality loss.
when minor details captured at the camera's sensitivity limit can be of great significance.
For example, early detection of diseases, research of nano-objects and processes at the sensitivity limit of a microscope, study of extremely distant space objects.
3. JPEG2000 in broadcasting
One more use case of the JPEG2000 format worth mentioning is sports broadcasting, such as football and basketball tournaments. During a broadcast, the still-uncompressed video is transmitted from the camera to an add-on device, which compresses the images with JPEG2000. They are then transmitted in JP2 format to the server, where re-encoding produces a video suitable for the audience. Here both fast image transmission and quality preservation are essential. JPEG2000 uses EBCOT coding, which makes it possible to choose the interleaving order of resolutions, quality layers, color components and positions within the compressed bytestream. Thanks to EBCOT, JPEG2000 supports dynamic quality distribution: it can automatically adjust the amount of transmitted data to the bandwidth of the channel. Thus, images of the highest quality possible for a given IP channel are quickly delivered to the servers.
To be continued…
Other info from Fastvideo concerning JPEG2000
JPEG2000 codec on GPU
JPEG2000 vs JPEG vs PNG: What's the Difference?
J2K encoding benchmarks
J2K decoding benchmarks
Fast FFmpeg J2K decoder on NVIDIA GPU
MXF Player
Remote color grading
The original article is available at https://www.fastcompression.com/blog/jpeg2000-applications-part1.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Remote Color Grading and Editorial Review
When you are shooting a movie on a tight schedule and need to accelerate your post production, a remote collaborative approach is a good choice. You don't need all professionals on-site: with a remote approach you can collaborate with your teammates wherever they are. The industry trend toward remote solutions is quite clear, and it is not driven by the coronavirus alone. The idea of accelerating post production through remote operation is viable, and companies keep removing limitations of the conventional workflow: professionals can now choose where and when to work remotely.
Nowadays there are quite a lot of software solutions offering reliable remote access over local networks or the public internet. Most of them, however, were not built with professional post production use in mind. In color grading and editorial review we need professional hardware that can display 10-bit and 12-bit frames. Most existing video conferencing solutions (Skype, Zoom, OBS) are not capable of that, so we've implemented software to solve this task.
Remote color grading with existing hardware appliances
There are quite a lot of hardware units (encoding-decoding and IP streaming solutions) which, together with software, offer a high-performance, low-latency workflow for remote color grading. These are fully managed remote collaboration solutions for high-quality, realtime color grading, editing, digital intermediates and approvals:
Sohonet ClearView
Nevion Virtuoso
Streambox Chroma HD HDR, 4K HDR and DCI
Nimbra Media Gateway
VF-REC (Village Island)
These fast and quite expensive hardware appliances are not always available, especially when working from home. Below we present a software solution that runs on a conventional PC and meets all requirements for remote color grading in terms of image quality, performance and latency.
How do we do Remote Color Grading?
The user has two screens: one for shared content and one for video conferencing. The first can display 10/12-bit images to show the result of color grading; the other provides access to the remote PC where the color grading software is running.
We offer a cost-effective software solution that can record, encode, transmit, receive, decode and display various transport streams and SDI signals. To ensure 24/7 operation and the ability to create and process 2K live SDI streams with visually lossless encoding, we use the JPEG2000 (J2K) compression algorithm, which can run very fast on NVIDIA GPUs.
This is our basic workflow for remote color grading: Video Source (Baseband Video) -> Capture device (Blackmagic DeckLink or AJA Kona) -> SDI unpacking on GPU -> J2K Encoder on GPU -> Facility Firewall -> Public Internet -> Remote Firewall -> J2K Decoder on GPU -> SDI packing on GPU -> Output device (Blackmagic DeckLink or AJA Kona) -> Video Display (Baseband Video).
Here is more info on the live workflow:
Capture baseband video streams via HD-SDI or 3G-SDI frame grabber (Blackmagic DeckLink 8K Pro, AJA Kona 4 or Kona 5)
Live encoding with J2K codec that supports 10-bit YUV 4:2:2 and 8/10/12-bit 4:4:4 RGB
Send the encoded material to the receiver/decoder - point-to-point transmission over ethernet or public internet
Stream decoding - Rec.709/Rec.2020, 10-bit 4:2:2 YUV or 10/12-bit 4:4:4 RGB
Send stream to baseband video playout device (Blackmagic or AJA frame grabber) to display 10-bit YUV 4:2:2 or 8/10/12-bit 4:4:4 RGB material on external professional display
This basic workflow covers only the task of precise color visualization. Color grading itself is done via remote access to a PC with grading software installed. This is not difficult to set up, but we must be able to check image quality on a remote professional monitor with high bit depth.
Values for Remote Color Grading
Reduce the cost of remote production
Cut travel and rent costs for the team
Low cost and high quality solution on conventional PC to work from home for videographers and editors
Your team will work on multiple projects (time saving and multi-tasking)
Remote work allows you to choose the best professionals to work with
Technical requirements
High speed acquisition and realtime processing of SD, HD and 3G-SDI streams
Input and output SDI formats: RGB, RGBA, v210, R10B, R10L, R12L
Fast JPEG2000 encoding and decoding (lossy or lossless) on NVIDIA GPU
High image quality
Color control and preview on professional monitor
Maximum possible bit depth (10-bit or 12-bit per channel)
Fast and reliable data transmission over internal or public network
Low latency
OS Linux Ubuntu/Debian/CentOS (Windows version is coming soon)
Recommended grabbers:
- Blackmagic 6G SDI: DeckLink Studio 4K, DeckLink SDI 4K
- Blackmagic 12G SDI: DeckLink 4K Extreme 12G, DeckLink 8K Pro
Recommended GPU: NVIDIA Quadro RTX 4000 / 5000 / 6000
J2K Streamer: j2k encoder - transmitter - receiver - j2k decoder
High performance implementation of lossless or lossy J2K algorithm
8-bit, 10-bit and 12-bit color depth
4:2:2 and 4:4:4 color subsampling
Color spaces Rec.709, DCI P3, Rec.2020
SD, HD, 3G 2K resolutions and frame rates (support for UHD and 4K is also available)
Security and content protection
AES 128-bit encryption with symmetric key both for video and audio
It's possible to encrypt both video and audio with 128-bit AES with a symmetric key without any increase in stream latency. Please note that encryption currently ensures only confidentiality: CRC-32 is used as the checksum, so cryptographic integrity is not guaranteed.
Low latency transport for realtime streaming
From 300 ms to 1 sec end-to-end latency
Public internet and/or fiber networking for remote sessions
10 to 250 Mbps bit rates
Maximum performance for JPEG2000 compression and decompression is achieved with multithreading in batch mode, which is needed to realize the massive parallelism of the J2K algorithm. In batch mode, however, we have to collect several images before processing, which hurts end-to-end latency. So there is a trade-off between performance and latency for JPEG2000 encoding and decoding. For remote color grading we want minimum latency, so each J2K frame is processed separately, without batching. In most other cases it's better to choose an acceptable latency and get the best performance with batching and multithreading.
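As a rough illustration of this trade-off (a sketch with a made-up cost model, not a benchmark of our codec): assume a batch of B frames is encoded in t0 + B * t1 milliseconds, where t0 is a fixed per-launch overhead.

// Throughput rises with batch size because t0 is amortized...
float batch_fps(int B, float t0_ms, float t1_ms)
{
    return 1000.0f * B / (t0_ms + B * t1_ms); // frames per second
}

// ...but latency grows, because B - 1 frame intervals are spent
// just collecting the batch before encoding can start.
float batch_added_latency_ms(int B, float frame_interval_ms)
{
    return (B - 1) * frame_interval_ms;
}

B = 1 minimizes latency; larger B trades (B - 1) frame times of extra delay for higher fps.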
Performance measurements
Currently our J2K encoder is faster than the J2K decoder, so total performance is limited by J2K decoding. On an NVIDIA Quadro RTX 6000 the software achieves 24 fps and more for 4K resolution at 12 bits with 4:4:4 subsampling; for 2K resolution it exceeds 60 fps. The performance depends on the GPU model, the J2K encoding parameters, etc. We suggest testing network bandwidth and software latency to choose the best parameters.
Competitors
Our software offers an approach to low-latency remote color grading. Please note that it is not itself color grading software: it is a solution for working remotely with conventional grading, VFX and post production software such as Blackmagic DaVinci Resolve, Adobe Premiere Pro, AVID Media Composer, Baselight, etc. We don't compete with these color grading applications at all.
We can recommend TeamViewer, AnyDesk, Google Remote Desktop, Ammyy Admin, Mikogo, ThinVNC, UltraVNC, WebEx Meetings and LogMeIn Pro for remote access to color grading software, but they work with only 8-bit color frames instead of 12-bit. This is the key difference: since high-quality post production requires 12-bit color, that requirement is essential, and a low-latency solution with an acceptable compression ratio becomes very important. Still, any software from the above list is useful for accessing a remote PC.
Hardware-based solutions like Nevion, Streambox and Sohonet are our competitors as well. They are reliable but very expensive. Our approach needs less hardware and offers a high-quality, low-latency and cheaper solution for remote color grading and post production.
Other info from Fastvideo about J2K and digital cinema applications
JPEG2000 codec on GPU
Fast FFmpeg J2K decoder on NVIDIA GPU
MXF Player
Fast CinemaDNG Processor software on GPU
BRAW Player and Converter for Windows and Linux
The original article is available at https://www.fastcompression.com/products/remote-color-grading.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
How to break the speed barrier for machine vision image processing?
Author: Fyodor Serzhenko
What do we usually overlook to speed up real-time image processing?
Machine vision cameras are widely used in industry, science, and robotics. However, when working with them, the same question invariably arises: "How to process the data received?" And that's a really good question. But why does it arise at all?
The point is that cameras usually transmit raw data (RAW) at high frame rates, which takes up a lot of memory and has to be converted to the required image format in real time. Image processing algorithms must provide the quality and speed the task at hand requires, and unfortunately it is sometimes not easy to ensure both at once.
That’s why, whenever there’s a task which requires processing a lot of images in real time, experts put a high priority on optimizing computer vision-related algorithms. It’s even more important when there’s a limited budget, or the physical size or power consumption of the device is constrained for practical reasons.
Generally, high-quality algorithms that perform computations on Intel/AMD processors do well with this task. However, there are special cases:
Case 1. Processing of images from high data rate machine vision cameras, which is the case for high image resolution or a high frame rate.
Case 2. Multi-camera system with real-time image processing.
For such situations the capabilities of a CPU are not enough: it simply can't handle the huge data stream quickly enough (for example, gigapixels per second), which leads to unavoidable data loss. Unfortunately, it's difficult to speed things up further on a CPU without trading away quality.
So, how can we speed up image processing without losing quality? The main idea for the solution was to transfer most of the computations from the central processor (CPU) to the graphics processor (GPU). To solve that task, we utilized our in-house developed Fastvideo SDK, which works on NVIDIA GPU. This approach has significantly accelerated the necessary algorithms and greatly simplified the software architecture, because computations in this case no longer interfere with system functions based on the CPU.
Let's look at the advantages of image processing on a GPU instead of a CPU:
A graphics card is a more specialized device than a CPU, and due to its architecture, it can perform many tasks much faster.
Memory access speed — A graphics processor can offer significantly faster memory access than a central processor.
The algorithms can be run in parallel — Graphics cards are known to have a much greater ability to perform parallel computing than a central processor.
Transferring computations to the graphics card does not mean that the CPU is completely free. The CPU is responsible for I/O and system control. The proposed solution is heterogeneous, since it uses all the available resources of both the CPU and GPU for image processing, which in turn leads to high performance.
In addition to increasing the speed of image processing, using a graphics processor has allowed us to implement more complicated algorithms to increase the image quality and color reproduction. Our workflow is similar to that used in filmmaking, where the colors in the frame are given special attention.
Fig.1. XIMEA xiB high performance machine vision cameras
One of the best examples where this solution can be applied is image processing for XIMEA cameras. XIMEA manufactures high-quality industrial cameras with the latest image sensors, which provide exceptionally high data rates. The Fastvideo SDK solution offers an excellent approach for real-time image processing for high performance cameras.
Fig.2. Menzi Muck walking excavator
XIMEA cameras are used, for example, in the Menzi Muck remote-controlled walking excavator. For this particular project, the Fastvideo SDK solution allowed:
up to 60 fps of synchronized image acquisition from each of the two 3.1 MP XIMEA cameras;
real-time processing with H.264/H.265 encoding and streaming (including black level, white balance, demosaicing, auto exposure, etc.);
glass-to-glass video latency over 4G/5G network ~50 ms.
Fig.3. Wind turbine inspection drone from Alerion with XIMEA camera
Let's take as an example another project using XIMEA cameras: the wind turbine inspection drone from Alerion. This drone is intended to fully automate the inspection of wind turbines for damage. For this task, it is very important to ensure good quality of images, based on which a 3D model is subsequently built. Using XIMEA cameras in conjunction with the GPU image processing solution made it possible to achieve the required image quality and high processing speed, which in turn made it possible to automate the inspection process. As a result, the time spent on inspection of one turbine was reduced from 2-3 hours to 10 minutes. Here, of course, process automation played a big role. However, this would not have been possible without high processing speed and excellent image quality that allows even very small damage to be noticed.
In conclusion, it’s worth noting the versatility of the Fastvideo SDK for GPU image processing: it can work on both stationary and mobile graphics cards. However, when choosing a solution for your task, don’t forget about the price-performance ratio. If you configure the solution to meet your needs (download source codes from GitHub), you’ll get high performance and high quality software for real time applications and avoid unnecessary costs for hardware.
The original article is available at https://www.fastcompression.com/blog/high-performance-machine-vision-image-processing.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
J2K codec performance on Jetson TX2
NVIDIA Jetson TX2 hardware is very promising for imaging and other embedded applications. This high-performance, low-power hardware is used in autonomous solutions, especially the industrial version Jetson TX2i. Since J2K compression is a common task in UAV (unmanned aerial vehicle) applications, here we evaluate such a solution and its limitations.
Detailed info about our testing approach for JPEG2000 encoding and decoding on desktop/server NVIDIA GPUs can be found at the corresponding links. Here we follow exactly the same procedure, applied to the Jetson hardware.
J2K encoding/decoding parameters
File format – JP2
Lossy JPEG2000 compression with CDF 9/7 wavelet
Lossless JPEG2000 compression with CDF 5/3 wavelet
Compression ratio (for lossy algorithm) ~ 12.0:1 which corresponds to visually lossless encoding
Subsampling mode – 4:4:4
Number of DWT resolutions – 7
Codeblock size – 32×32
MCT – on
PCRD – off
Tiling – off
Window – off
Quality layers – one
Progression order – LRCP (L = layer, R = resolution, C = component, P = position)
Modes of operation – single or multithreaded batch
2K test image (24-bit) – 2k_wild.ppm
4K test image (24-bit) – 4k_wild.ppm
Obviously, in many cases the compression ratio for visually lossless encoding can be much higher with the JPEG2000 algorithm, so we suggest testing different parameters to achieve the best compression ratio at acceptable image quality. By decreasing the quality coefficient one gets not only better compression but also a higher framerate, both for encoding and decoding. Our benchmarks show the performance for the above images and parameters; this is not the maximum performance, which can be better in many other cases.
Hardware and software
NVIDIA Jetson TX2
CUDA Toolkit 10.2
JPEG2000 codec benchmarks on NVIDIA Jetson TX2
Jetson TX2 has a 4-core ARM Cortex-A57 @ 2 GHz and a 2-core Denver2 @ 2 GHz. These two types of cores have different performance, which should be taken into account. Since the Tier-2 stage of the JPEG2000 algorithm runs on the CPU, the framerate is determined by the performance of both the CPU and GPU cores. From that point of view multithreading is useful (we use up to 12 threads), but in single-threaded mode the performance depends on which CPU core is used, so we need to set the affinity mask to make sure the fastest CPU core is utilized.
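On Linux, pinning a thread to a particular core can be done with sched_setaffinity (a minimal sketch; the numbering of the Denver2 and Cortex-A57 cores differs between board support packages, so the core ID is just an example to measure and adjust):

#define _GNU_SOURCE
#include <sched.h>

// Pin the calling thread to one CPU core (e.g. the fastest one).
// Returns 0 on success, -1 on error (see errno).
int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return sched_setaffinity(0, sizeof(set), &set); // pid 0 = current thread
}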
In the tests discussed we restricted memory usage to 2 GB, under the assumption that a Jetson TX2 may have only 4 GB of memory; this is an important limitation for the whole image processing solution.
We haven't considered J2K-to-H.264 transcoding on Jetson here. That task requires additional tests, although from our previous experience with desktop/server GPUs the transcoding performance should not differ significantly, because Jetson has hardware H.264 encoding (separate from the GPU) that is accessible via the V4L2 interface and can run simultaneously with the JPEG2000 decoder.
On request we can offer Fastvideo SDK for Jetson for evaluation - please fill in the form below and send it to us.
Other info from Fastvideo concerning JPEG2000 and Jetson
JPEG2000 codec on GPU
JPEG2000 vs JPEG vs PNG: What's the Difference?
J2K encoding benchmarks
J2K decoding benchmarks
Fast FFmpeg J2K decoder on NVIDIA GPU
MXF Player
Jetson Benchmark Comparison: Nano vs TX2 vs Xavier
Jetson image processing for camera applications
The original article is available at https://www.fastcompression.com/blog/j2k-codec-on-jetson-tx2.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Low-latency software for remote collaborative post production
Fastvideo is a team of professionals in GPU image processing, realtime camera applications, digital cinema and high performance imaging solutions. Fastvideo has been helping production companies for quite a long time, and recently we've implemented low-latency software for collaborative post production.
Today, with restrictions on in-person collaboration, shipping delays and travel limitations, a single point of ingest and delivery for an entire production becomes vitally important. The main goal is to offer all services both on-premises and remotely. We believe that in the near future we will see virtual and distributed post production finishing.
When you are shooting a movie on a tight schedule and need to accelerate your post production workflow, a remote collaborative approach is the right solution. You don't need all professionals on-site: with a remote approach you can collaborate in real time wherever your teammates are. The industry trend toward remote production is clear, and it is not driven by the coronavirus alone. The idea of accelerating post production through remote operation is viable, and companies strive to remove limitations of the conventional workflow: professionals can now choose where and when to work on post production remotely.
Nowadays there are quite a lot of software solutions offering reliable remote access over local networks or the public internet. Still, most of them were built without professional use in mind for tasks like colour grading, VFX, compositing and more. In post production we need professional hardware that can display 10-bit or 12-bit footage. Skype, Zoom and many other video conferencing solutions are not capable of that, so we've implemented software to solve this problem.
Business goals to achieve at remote collaborative post production
You will share content in realtime for collaborative workflows in post production
Lossless or visually lossless encoding guarantees high image quality and exact colour reproduction
Reduced travel and rent costs for the team due to remote colour grading and reviewing
Remote work allows you to choose the best professionals for the production
Your team will work on multiple projects (time saving and multi-tasking)
Goals from technical viewpoint
Low latency software
Fast and reliable data transmission over internal or public network
Fast acquisition and processing of SD/HD-SDI and 3G-SDI streams (unpacking, packing, transforms)
Realtime J2K encoding and decoding (lossy or lossless)
High image quality
Precise colour reproduction
Maximum bit depth (10-bit or 12-bit per channel)
Task to be solved
The post industry needs a low-latency, high-quality video encode/decode solution for remote work, built around the following pipeline:
Capture baseband video streams via HD-SDI or 3G-SDI frame grabber (Blackmagic DeckLink 8K Pro, AJA Kona 4 or Kona 5)
Live encoding with J2K codec that supports 10-bit YUV 4:2:2 and 10/12-bit 4:4:4 RGB
Send the encoded material via TCP/UDP packets to a receiver/decoder - point-to-point transmission over ethernet or public internet
Decode from stream at source colorspace/bit-depth/resolution/subsampling - Rec.709/Rec.2020, 10-bit 4:2:2 YUV or 10/12-bit 4:4:4 RGB
Send stream to baseband video playout device (Blackmagic/AJA frame grabber) to display 10-bit YUV 4:2:2 or 10/12-bit 4:4:4 RGB material on external display
Latency requirements: sub 300 ms
Basic hardware layout: Video Source (Baseband Video) -> Capture device (DeckLink) -> SDI unpacking on GPU -> J2K Encoder on GPU -> Facility Firewall (IPsec VPN) -> Public Internet -> Remote Firewall (IPsec VPN) -> J2K Decoder on GPU -> SDI packing on GPU -> Output device (DeckLink) -> Video Display (Baseband Video)
Hardware/software/parameters
HD-SDI or 3G-SDI frame grabbers: Blackmagic DeckLink 8K Pro, AJA Kona 4, AJA Kona 5
NVIDIA GPU: GeForce RTX 2070, Quadro RTX 4000 or better
OS: Windows-10 or Linux Ubuntu/CentOS
Frame Size: 1920×1080 (DCI 2K)
Frame Rates: 23.976, 24, 25, 29.97, 30 fps
Bit-depth: 8/10/12 (encode - ingest), 8/10/12 (decode - display)
Pixel formats: RGB or RGBA, v210, R12L
Frame compression: lossy or lossless
Colour Spaces for 8/10-bit YUV or 8/10/12-bit RGB: Rec.709, DCI-P3, P3-D65, Rec.2020 (optional)
Audio: 2-channel PCM or more
How to encode/decode J2K images fast?
CPU-based J2K codecs are quite slow. FFmpeg-based software solutions, for example, rely on the J2K codec from libavcodec (mj2k) or on OpenJPEG, which are far from fast; just test that software to check the latency and performance. This is not surprising, since the J2K algorithm has very high computational complexity. Even with multiple threads/processes on the CPU, the performance of the libavcodec J2K solution is still insufficient. This is a problem even for 8-bit frames at 2K resolution, and for 4K images (12-bit, 60 fps) the performance is much worse.
The reason FFmpeg and other packages are not fast at this task is obvious: they run on the CPU and are not optimized for high performance. Here you can see a benchmark comparison of J2K encoding and decoding for the OpenJPEG, JasPer, Kakadu, J2K-Codec, CUJ2K and Fastvideo codecs on images with 2K and 4K resolutions (lossy and lossless J2K algorithms).
Maximum performance for J2K encoding and decoding in streaming applications is achieved in multithreaded batch mode, which is a must for the massive parallel processing the JPEG2000 algorithm allows. Batch processing means collecting several images, which is bad for latency; multithreading improves performance but makes latency worse still. This is the trade-off between performance and latency in J2K encoding and decoding. For remote color grading we need minimum latency, so each J2K frame is processed separately, without batching and without multithreading. In most other cases it's better to choose an acceptable latency and get the best performance with batching and multithreading.
Other info from Fastvideo about J2K and digital cinema applications
JPEG2000 codec on GPU
Fast FFmpeg J2K decoder on NVIDIA GPU
MXF Player
Fast CinemaDNG Processor software on GPU
BRAW Player and Converter for Windows and Linux
The original article is available at https://www.fastcompression.com/blog/remote-post-production-software.htm. Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Fastvideo SDK vs NVIDIA NPP Library
Author: Fyodor Serzhenko
Why is Fastvideo SDK better than NPP for camera applications?
What is Fastvideo SDK?
Fastvideo SDK is a set of software components that make up a high-quality image processing pipeline for camera applications. It covers all image processing stages, from raw image acquisition from the camera to JPEG compression and storage to RAM or SSD. All image processing is done entirely on the GPU, which yields real-time performance or faster for the full pipeline. We also offer a high-speed imaging SDK for non-camera applications on NVIDIA GPUs: offline raw processing, high performance web, digital cinema, video walls, FFmpeg codecs and filters, 3D, AR/VR, AI, etc.
Who are Fastvideo SDK customers?
Fastvideo SDK is compatible with Windows/Linux/ARM and is mostly intended for camera manufacturers and system integrators developing end-user solutions containing video cameras as a part of their products.
The other type of Fastvideo SDK customers are developers of new hardware or software solutions in various fields: digital cinema, machine vision and industrial, transcoding, broadcasting, medical, geospatial, 3D, AR/VR, AI, etc.
All the above customers need faster image processing with higher quality and better latency. In most cases CPU-based solutions are unable to meet such requirements, especially for multicamera systems.
Customer pain points
According to our experience and expertise, when developing end-user solutions, customers usually have to deal with the following obstacles.
Before starting to create a product, customers need to know the image processing performance, quality and latency for the final application.
Customers need reliable software which has already been tested and will not glitch when it is least expected.
Customers are looking for an answer on how to create a new solution with higher performance and better image quality.
Customers need external expertise in image processing, GPU software development and camera applications.
Customers have limited (time/human) resources to develop end-user solutions bound by contract conditions.
They need a ready-made prototype as a part of the solution to demonstrate a proof of concept to the end user.
They want immediate support and answers to their questions regarding the fast image processing software's performance, image quality and other technical details, which can be delivered only by industry experts with many years of experience.
Fastvideo SDK business benefits
Fastvideo SDK as a part of complex solutions allows customers to gain competitive advantages.
Customers are able to design solutions that might earlier have seemed impossible to develop within the required timeframes and budgets.
The product helps to decrease the time to market of end-user solutions.
At the same time, it increases overall end-user satisfaction with reliable software and prompt support.
As a technology solution, Fastvideo SDK improves image quality and processing performance.
Fastvideo serves customers as a technology advisor in the field of fast image processing: the team of experts provides end-to-end service to customers. That means that all customer questions regarding Fastvideo SDK, as well as any other technical questions about fast image processing are answered in a timely manner.
Fastvideo SDK vs NVIDIA NPP comparison
NVIDIA NPP can be described as a general-purpose solution: the company implemented a huge set of functions for applications in various industries, mainly focused on generic image processing tasks. Moreover, NPP lacks consistency in feature delivery, as some specific image processing modules are absent from the library. This leads us to the conclusion that NPP is suitable only for basic camera applications: it is just a set of functions from which users have to build their own pipeline.
Fastvideo SDK, on the other hand, is designed to implement a full 16/32-bit image processing pipeline on GPU for camera applications (machine vision, scientific, digital cinema, etc). Our end-user applications are based on Fastvideo SDK, and we collect customer feedback to improve the SDK’s quality and performance. We are armed with profound knowledge of customer needs and offer an exceptionally reliable and heavily tested solution.
Fastvideo uses a different approach in Fastvideo SDK, based on components rather than functions as in NPP. It is easier to build a pipeline from components, as their inputs and outputs are standardized. Each component performs a complete operation and can have a complex internal architecture, where NPP would require combining several functions. Developing an application with the Fastvideo SDK built in is therefore much less complex than creating a solution based on NVIDIA NPP.
The Fastvideo JPEG codec and many other SDK features have been heavily tested by our customers for years, with a total processing rate of more than a million images per second. This is a matter of software reliability, which we consider one of our most important advantages.
Most Fastvideo SDK components (demosaicing and codecs) offer both higher performance and better image quality than the NPP alternatives. What's more, this also holds for embedded solutions on Jetson, where computing performance is quite limited. For example, NVIDIA NPP only has a bilinear debayer, so it can be regarded as a low-quality option, best suited only for software prototype development.
Summing up this section, we need to specify the following technological advantages of the Fastvideo SDK over NPP in terms of image processing modules for camera applications:
High-performance codecs: JPEG, JPEG2000 (lossless and lossy)
High-performance 12-bit JPEG encoder
Raw Bayer Codec
Flat-Field Correction together with dark frame subtraction
Dynamic bad pixel suppression in Bayer images
Four high quality demosaicing algorithms
Wavelet-based denoiser on GPU for Bayer and RGB images
Filters and codecs on GPU for FFmpeg
Other modules like color space and format conversions
To summarize, Fastvideo SDK offers an image processing workflow which is standard for digital cinema applications, and could be very useful for other imaging applications as well.
Why should customers consider Fastvideo SDK instead of NVIDIA NPP?
Fastvideo SDK provides better image quality and processing performance for implementing key algorithms for camera applications. The real-time mode is an essential requirement for any camera application, especially for multi-camera systems.
Over the last few years, we've tested NPP intensively and encountered software bugs which weren't fixed. Meanwhile, if customers come to us with any bug in Fastvideo SDK, we fix it within a couple of days, because Fastvideo owns all the source code and the image processing modules are implemented by the Fastvideo development team. Support is our priority: that's why our customers can rely on our SDK.
We offer custom development to meet our customers' specific requirements. Our development team can build GPU-based image processing modules from scratch according to the customer's request, whereas NVIDIA provides nothing of the kind.
We are focused on high-performance camera applications, we have years of experience, and our solutions have been heavily tested in many projects. For example, our customer vk.com has been processing 400,000 JPEG images per second for years without any issues, which speaks to the reliability of our software.
Software downloads to evaluate the Fastvideo SDK
GPU Camera Sample application with source codes including SDKs for Windows/Linux/ARM - https://github.com/fastvideo/gpu-camera-sample
Fast CinemaDNG Processor software for Windows and Linux - https://www.fastcinemadng.com/download/download.html
Demo applications (JPEG and J2K codecs, Resize, MG demosaic, MXF player, etc.) from https://www.fastcompression.com/download/download.htm
Fast JPEG2000 Codec on GPU for FFmpeg
You can test your RAW/DNG/MLV images with the Fast CinemaDNG Processor software. To create your own camera application, please download the source codes from GitHub as a ready-made starting point.
Useful links for projects with the Fastvideo SDK
1. Software from Fastvideo for GPU-based CinemaDNG processing is 30-40 times faster than Adobe Camera Raw:
http://ir-ltd.net/introducing-the-aeon-motion-scanning-system
2. Fastvideo SDK offers high-performance processing and real-time encoding of camera streams with very high data rates:
https://www.fastcompression.com/blog/gpixel-gmax3265-image-sensor-processing.htm
3. GPU-based solutions from Fastvideo for machine vision cameras:
https://www.fastcompression.com/blog/gpu-software-machine-vision-cameras.htm
4. How to work with scientific cameras with 16-bit frames at high rates in real-time:
https://www.fastcompression.com/blog/hamamatsu-orca-gpu-image-processing.htm
The original article can be found at: https://www.fastcompression.com/blog/fastvideo-sdk-vs-nvidia-npp.htm
Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Fast FFmpeg J2K decoder on NVIDIA GPU
FFmpeg is a great piece of software offering a huge number of options for image and video processing and for handling multimedia files and streams. It supports many formats, codecs and filters for various tasks, which is why it is so widespread. Many applications are based on FFmpeg, and their flexibility and performance are really impressive. FFmpeg itself is a command-line application which is also capable of video transcoding and video post-production. The name FFmpeg comes from the MPEG video standards group, with "FF" standing for "fast forward".
To get started, users can download FFmpeg from ffmpeg.org or zeranoe.com. To build a custom version, users can fetch the source code for the latest version from Git and build FFmpeg with all the necessary options.
How can FFmpeg decode J2K?
To start with, let's answer a very simple question: which JPEG2000 codec does FFmpeg use by default? Surprisingly, it is not the OpenJPEG codec. FFmpeg has its own native J2K codec. In the FFmpeg documentation we can read the following: "The native jpeg2000 encoder is lossy by default, the -q:v option can be used to set the encoding quality. Lossless encoding can be selected with -pred 1".
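Based on that quoted documentation, usage of the native encoder looks as follows (file names here are placeholders for illustration):
#lossy encoding with the native jpeg2000 encoder, quality set via -q:v
./ffmpeg -i input.png -c:v jpeg2000 -q:v 7 out_lossy.jp2
#lossless encoding with the native jpeg2000 encoder
./ffmpeg -i input.png -c:v jpeg2000 -pred 1 out_lossless.jp2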
The native codec is not a good choice, so we can install the OpenJPEG library (libopenjpeg) as the FFmpeg codec for J2K encoding and decoding on the CPU. OpenJPEG is a reliable and sophisticated solution with a wide set of features from the JPEG2000 Standard. The OpenJPEG codec is a very interesting product, but it works on the CPU only, and since the J2K algorithm has very high computational complexity, OpenJPEG remains rather slow even after its recent boost from optimization and multithreading. Here you can see JPEG2000 benchmarks on CPU and GPU for J2K encoding and decoding with the OpenJPEG, Jasper, Kakadu, J2k-Codec, CUJ2K and Fastvideo codecs, to check the performance for images with 2K and 4K resolutions (both for lossy and lossless algorithms).
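Assuming FFmpeg was built with --enable-libopenjpeg, the codec can be selected explicitly (file names are placeholders):
#encode with libopenjpeg instead of the native encoder
./ffmpeg -i input.png -c:v libopenjpeg out.jp2
#force libopenjpeg for decoding
./ffmpeg -c:v libopenjpeg -i out.jp2 decoded.png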
How does FFmpeg work internally?
FFmpeg usage is based on the idea of consecutive software modules applied to your data. Since most FFmpeg codecs and filters work on the CPU, both the input and output of each processing module reside in CPU memory, though FFmpeg is also capable of working with the GPU-based NVENC encoder and NVDEC decoder on NVIDIA GPUs. These NVIDIA codecs support H.264, H.265 and more.
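For example, a GPU-accelerated transcode with the built-in NVIDIA codecs might look like this (input file name is a placeholder):
#decode H.264 with NVDEC (h264_cuvid) and encode H.265 with NVENC (hevc_nvenc)
./ffmpeg -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:v hevc_nvenc -b:v 5M output.mp4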
To create a conventional FFmpeg codec for fast J2K encoding and decoding on GPU, we took into account the architecture of FFmpeg applications and FFmpeg codecs. We've implemented an FFmpeg J2K decoder which runs on the GPU and processes batches of images in multithreaded mode to achieve maximum performance. Externally it looks like a conventional decoder with internal multithreading. That J2K decoder can now be utilized in FFmpeg and included in an FFmpeg processing workflow as part of any complicated task.
That FFmpeg decoder is fully based on the Fastvideo J2K decoder implemented on NVIDIA GPUs, and it can be used in many FFmpeg applications in the standard way. To follow that standard approach, the user just needs to build FFmpeg with the J2K library.
How to build FFmpeg with Fastvideo J2K decoder
1. Download the FFmpeg sources: https://ffmpeg.org/download.html#get-sources. Version 4.2.4 was used for testing on Ubuntu 18.04.
You can retrieve the source code through Git by using the following command: git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg
To get the Fastvideo SDK, please send your request via the form at the bottom of that page.
This is the link to download the test video file: SNOWMAN-DCP3D.rar
2. Install the NVENC headers with install_nvenc.sh. We used NVIDIA driver 440.33.01 and NVENC version 9.1.23. These are the FFmpeg versions of the headers required to interface with NVIDIA's codec APIs, corresponding to NVIDIA Video Codec SDK 9.1.23.
3. Install the yasm package required for the FFmpeg build.
4. Copy the fastvideo_sdk folder (including the inc and lib folders) into the root of the FFmpeg source folder. Copy make_sl.sh from the root of the archive to fastvideo_sdk/lib and execute it to create all symbolic links for the *.so files.
5. Update the following files:
- libavcodec/allcodecs.c: add
extern AVCodec ff_jpeg2000_cuda_decoder;
- libavcodec/avcodec.h: after AV_CODEC_ID_JPEG2000, insert
AV_CODEC_ID_JPEG2000_CUDA,
- libavcodec/codec_desc.c: after the existing AV_CODEC_ID_JPEG2000 descriptor, insert
{
    .id         = AV_CODEC_ID_JPEG2000_CUDA,
    .type       = AVMEDIA_TYPE_VIDEO,
    .name       = "jp2k_cuda",
    .long_name  = NULL_IF_CONFIG_SMALL("JPEG 2000 (Fastvideo)"),
    .props      = AV_CODEC_PROP_INTRA_ONLY | AV_CODEC_PROP_LOSSY | AV_CODEC_PROP_LOSSLESS,
    .mime_types = MT("image/jp2"),
    .profiles   = NULL_IF_CONFIG_SMALL(ff_jpeg2000_cuda_profiles),
},
- libavcodec/profiles.h: add
extern const AVProfile ff_jpeg2000_cuda_profiles[];
- libavcodec/profiles.c: after the existing ff_jpeg2000_profiles[] array, insert
const AVProfile ff_jpeg2000_cuda_profiles[] = {
    { FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_0,  "JPEG 2000 codestream restriction 0" },
    { FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_1,  "JPEG 2000 codestream restriction 1" },
    { FF_PROFILE_JPEG2000_CSTREAM_NO_RESTRICTION, "JPEG 2000 no codestream restrictions" },
    { FF_PROFILE_JPEG2000_DCINEMA_2K,             "JPEG 2000 digital cinema 2K" },
    { FF_PROFILE_JPEG2000_DCINEMA_4K,             "JPEG 2000 digital cinema 4K" },
    { FF_PROFILE_UNKNOWN },
};
- libavcodec/Makefile: after the line
OBJS-$(CONFIG_JPEG2000_DECODER) += jpeg2000dec.o jpeg2000.o jpeg2000dsp.o \
insert
jpeg2000dec_cuda.o \
Update the following files to install the Fastvideo resizer for 10/16-bit video:
- libavfilter/allfilters.c: add
extern AVFilter ff_vf_scale_fastvideo;
- libavfilter/Makefile: add
OBJS-$(CONFIG_SCALE_FASTVIDEO_FILTER) += vf_scale_fastvideo.o
6. Copy the src/include folder to fastvideo_sdk/inc. Copy the src/libavcodec folder to libavcodec to install the J2K decoder from Fastvideo. Copy the src/libavfilter folder to libavfilter to install the resize filter.
7. Configure FFmpeg with the minimum options listed below; this list can be extended by the end user. The CUDA path is the default for version 10.1.
./configure --cc="gcc -m64" --enable-ffplay --enable-ffmpeg --disable-doc --enable-shared --disable-static --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --prefix=./bin/ --arch=amd64 --extra-cflags="-MD -I/usr/local/cuda-10.1/include/ -I./fastvideo_sdk/inc/" --extra-ldflags="-L/usr/local/cuda-10.1/lib64/ -L./fastvideo_sdk/lib/" --extra-libs="-lcudart -lfastvideo_sdk -lfastvideo_j2kFfmpegWrapper -lfastvideo_decoder_j2k"
Alternatively, copy default.configure.sh from the root of the archive to the root of the FFmpeg folder and run it.
8. make
9. make install
10. Update LD_LIBRARY_PATH for the FFmpeg and Fastvideo libraries.
Or copy export.library.path.sh from the root of the archive to the FFmpeg folder and run it. The script prints an export LD_LIBRARY_PATH command with the correct path.
11. Copy the video folder to the FFmpeg bin folder and run run.snowman.sh to test.
Fastvideo J2K decoder parameters: threads and fv_batch_size
The Fastvideo J2K decoder on GPU for FFmpeg has two additional parameters that influence performance: -threads and -fv_batch_size.
The "threads" parameter is a standard FFmpeg option. It defines the number of concurrent CPU threads used for processing, and it is available for the J2K decoder through frame-level multithreading.
The "fv_batch_size" parameter defines the number of frames processed by one decoder in parallel. FFmpeg supports batch mode only within a single decoder, not across multiple decoders, which is not enough for the Fastvideo J2K decoder to reach its best performance.
To work around this limitation, the Fastvideo JPEG2000 decoder for FFmpeg uses an internal client-server architecture. A client is an FFmpeg worker thread that takes a bytestream from FFmpeg and sends it to a J2K decoder. The number of real J2K decoders is the number of FFmpeg worker threads divided by the batch size; therefore, the number of worker threads has to be divisible by fv_batch_size.
To get the best performance, the batch size (fv_batch_size) should be at least 4 and the number of worker threads at least 8, which results in two real J2K decoders (see the example below). Increasing fv_batch_size and the number of J2K decoders also increases GPU memory usage.
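As a minimal illustration (a decode-only run with the output discarded; snowman.mxf is the test file mentioned above):
#8 worker threads / fv_batch_size of 4 = two concurrent J2K decoder instances on the GPU
./ffmpeg -y -c:v jp2k_cuda -threads 8 -fv_batch_size 4 -i snowman.mxf -f null -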
Fast J2K transcoding with FFmpeg and NVENC from MXF to MP4
These are command-line examples of how we could decode the snowman.mxf file from the current folder and create an MP4 video file with H.264 or H.265 encoding in the same pipeline, with different sets of parameters:
#source: XYZ 12 bit -> dest: h265 444 10 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 1 -fv_batch_size 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.10bits.mp4
#source: XYZ 12 bit -> dest: h265 444 10 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.10bits.mp4
#source: XYZ 12 bit -> dest: h265 420 10 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_420 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.420.10bits.mp4
#source: XYZ 12 bit -> dest: h265 444 8 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.8bits.mp4
#source: XYZ 12 bit -> dest: h265 420 8 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -fv_convert_to_420 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.420.8bits.mp4
#source: XYZ 12 bit -> dest: h264 444 8 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -i snowman.mxf -c:v h264_nvenc -b:v 5M out.h264.444.8bits.mp4
#source: XYZ 12 bit -> dest: h264 420 8 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -fv_convert_to_420 1 -i snowman.mxf -c:v h264_nvenc -b:v 5M out.h264.420.8bits.mp4
Basically, we read and parse frames from the snowman.mxf file, decode them on the GPU with id = 0 (fv_batch_size = 2, four CPU threads), encode the stream to H.264/H.265 at 5 Mbit/s and save it to an *.mp4 file in the current folder.
The maximum fv_batch_size and number of threads depend on the amount of free GPU memory. If the chosen values are too big, the user gets a warning to decrease fv_batch_size or to utilize a better GPU with more memory.
Simple benchmarks for FFmpeg J2K transcoding to H.264 on GPU
The task of J2K transcoding to H.264 is quite common, though it is not possible to get realtime performance with the OpenJPEG codec in FFmpeg. The Fastvideo JPEG2000 decoder together with NVIDIA NVENC can solve the full task of J2K transcoding on the GPU, much faster than realtime. The resulting performance depends on many factors, but here we indicate a standard case:
According to our preliminary benchmarks on NVIDIA GeForce RTX 2080ti and Quadro RTX 6000, such a solution can transcode MXF (J2K, 10-bit, 4:2:2, 200 Mbit/s, TR-01 compliant) or TS files/streams to MP4 (H.264, 15 Mbit/s) at around 320-350 fps. Full processing is done on the GPU (apart from audio processing, which stays on the CPU), both for J2K decoding (Fastvideo J2K codec) and H.264 encoding (NVENC).
Fast J2K decoding with FFmpeg from MXF to RGB or YUV frames
The Fastvideo J2K decoder supports multiple output formats: NV12, P010, YUV444, YUV444P10, RGB24 and RGB48. The formats NV12, P010, YUV444 and YUV444P10 are native for NVENC. By default, the decoded frame is placed into a device (GPU) buffer, which cannot be consumed by most FFmpeg filters and codecs. The parameter -fv_export_to_host 1 forces the J2K decoder to place the frame into a host buffer instead. The device buffer is used for NVENC to remove extra device-to-host and host-to-device copies, while the host buffer is used for integration with other FFmpeg codecs and filters. The formats NV12, P010, YUV444 and YUV444P10 support both buffer types; RGB24 and RGB48 support only the host buffer type.
NV12 is the native NVENC format: in contrast to classic planar YUV420, it stores the U and V samples interleaved in a single UV plane. P010 is the 16-bit-per-element version of NV12.
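As a reference, here is a minimal sketch of these layouts in C; it reflects the general definition of NV12/P010 rather than any Fastvideo-specific API:

#include <stddef.h>

/* NV12: full-resolution luma (Y) plane followed by a half-resolution
 * interleaved chroma (UV) plane. P010 has the same layout with 16 bits
 * per element (10 significant bits), i.e. bytes_per_element = 2. */
size_t nv12_like_buffer_size(size_t width, size_t height, size_t bytes_per_element)
{
    size_t y_plane  = width * height * bytes_per_element;                 /* luma, full resolution */
    size_t uv_plane = (width / 2) * (height / 2) * 2 * bytes_per_element; /* U+V, 2x2 subsampled */
    return y_plane + uv_plane;  /* 1.5 * width * height elements in total */
}

/* NV12: nv12_like_buffer_size(w, h, 1); P010: nv12_like_buffer_size(w, h, 2) */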
These are examples of how we could decode the snowman.mxf file from the current folder and create a series of RGB or YUV images:
#source: XYZ 12 bit -> dest: RGB 8 bits
./ffmpeg -y -report -c:v jp2k_cuda -fv_convert_to_8bit 1 -fv_convert_to_rgb 1 -fv_export_to_host 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf out%d.8.t1.b1.ppm
#source: XYZ 12 bit -> dest: YUV 444 10 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.yuv444.10.yuv
#source: XYZ 12 bit -> dest: YUV 444 8 bits
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_8bit 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.yuv444.8.yuv
#source: XYZ 12 bit -> dest: YUV P010
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_420 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.P010.yuv
#source: XYZ 12 bit -> dest: YUV NV12
./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_420 1 -fv_convert_to_8bit 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.NV12.yuv
We might need such a solution if we are going to do the final video encoding on the CPU. If we compare CPU-based H.264 or H.265 encoding performance with J2K decoding on the GPU, we can see that the J2K decoding performance is much higher, so we can decode multiple streams on the GPU and then encode them on the CPU. Usually we need one CPU thread per stream for encoding, so a multicore CPU is a must here. This is essentially a live transcoding task where we combine GPU and CPU to build a high-performance solution.
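A hedged example of that hybrid scheme, assuming FFmpeg was also built with libx264 support (the fv_* options are the decoder parameters described above):
#decode J2K on GPU, export 8-bit 4:2:0 frames to host memory, encode H.264 on CPU with libx264
./ffmpeg -y -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_export_to_host 1 -fv_convert_to_8bit 1 -fv_convert_to_420 1 -i snowman.mxf -c:v libx264 -b:v 15M out.h264.cpu.mp4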
If we don't offer an output format that you need, please let us know and we will add it. We do both J2K decoding and format conversions on the GPU to improve the total performance, which is very important when processing multiple streams.
Fast J2K decoding with FFmpeg for MXF Player
If you have ever tried to play MXF or TS files (150-200 Mbit/s) with J2K content in the VLC player, you probably know the result. Unfortunately, any high-bitrate MXF or TS video with J2K frames is too demanding for CPU-based VLC software, and you could hardly achieve even 1 fps playback, which is not acceptable.
A DCP package contains J2K frames inside, and it is quite a difficult task to offer smooth preview of that content on the CPU via VLC. Now you can decode J2K frames on the GPU and show the results via ffplay or any other player connected to the FFmpeg output.
Apart from J2K decoding on GPU, we will soon release an FFmpeg-based J2K encoder on GPU which could be utilized to create DCPs with very high performance, much faster than OpenJPEG.
The original article can be found at: https://www.fastcompression.com/blog/jetson-benchmark-comparison.htm
Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
High Performance GPU Software for Machine Vision Cameras
Cameras for machine vision, industrial and scientific applications are quite widespread nowadays. These solutions utilize USB3, GigE, 5GigE, 10GigE, 25GigE, CameraLink, Coax, Thunderbolt and PCI-Express interfaces to send data from the camera to the PC. Usually cameras transfer RAW data, and we need to implement a rather complicated image processing pipeline to convert RAW to RGB in realtime. This is a computationally heavy task, especially for high-speed and high-data-rate cameras.
Realtime image processing for machine vision cameras can be done on Intel/AMD CPUs, but such solutions are difficult to speed up further, which is a problem for multicamera systems. To overcome that bottleneck, one could implement simplified algorithms on the CPU to stay within the time budget, but that is not a good solution: most high-quality algorithms are slow even on multicore CPUs. The slowest algorithms on the CPU are demosaicing, denoising, color grading, undistortion, resize, rotation to an arbitrary angle, compression, etc.
To solve that task we utilize the Fastvideo SDK, which runs on NVIDIA GPUs. With that SDK we can implement a software solution where the full image processing pipeline is done on the graphics processing unit (GPU). In that case the software architecture is much simpler, because the processing part is fully done on the GPU and no longer interferes with CPU-based features.
Fast CinemaDNG Processor software
As a good example of high-performance, high-quality raw image processing, we can recommend the Fast CinemaDNG Processor software for Windows, where all computations are done on an NVIDIA GPU and whose core is based on the Fastvideo SDK engine. With that software we get high-quality image processing according to the digital cinema workflow. One should note that Fast CinemaDNG Processor offers image quality comparable to the results of raw processing in RawTherapee, Adobe Camera Raw and Lightroom, but significantly faster: the total speedup can be estimated at 10-20 times or even more.
To check GPU-based performance for your machine vision camera with the Fast CinemaDNG Processor software, you can convert RAW images to DNG format with our open-source PGM2DNG converter, which can be downloaded from Github. After such a conversion the software is able to work both with digital cinema cameras like Blackmagic Design and with machine vision / industrial cameras.

GPU Camera Sample Project
Here we will show how to implement software with a GPU image processing pipeline for any machine vision camera. To accomplish the task, we need the following:
Camera SDK (XIMEA, MATRIX VISION, Basler, JAI, Daheng Imaging, Imperx, Baumer, Flir, etc.) for Windows
Optional GenICam package with camera vendor GenTL producer (.cti)
Fastvideo SDK (demo) ver.0.15.0.0 for Windows
NVIDIA CUDA-10.1 for Windows and the latest NVIDIA driver
Qt ver.5.13.1 for Windows
Compiler MSVC 2017
Source codes and links to supplementary libraries for the GPU Camera Sample project can be found on Github.
As a starting point we've implemented software to capture raw images from XIMEA cameras: we utilized a sample application from the XIMEA SDK and incorporated that code into our software. The software sets default camera parameters to focus on GPU-based image processing, though you can add any GUI to control camera parameters as well.
You can download binaries for Windows to work with XIMEA cameras or with raw images in PGM format with GPU image processing.
Simple image processing pipeline on GPU for machine vision applications
Raw image capture (8-bit, 12-bit packed/unpacked, 16-bit, monochrome or bayer)
Import to GPU
Raw data conversion and unpacking
Linearization curve
Bad pixel removal
Dark frame subtraction
Flat-field correction
White Balance
Exposure correction (brightness control)
Debayer with HQLI (5×5 window), DFPD (11×11), MG (23×23) algorithms
Wavelet-based denoising
Gamma
JPEG compression
Output to monitor with minimum latency
Export from GPU to CPU memory
Storage of compressed data to SSD or streaming via FFmpeg RTSP
It's possible to modify that image processing pipeline according to your needs, since the source codes are available. There are many more image processing options in the Fastvideo SDK that could be added to such software for GPU-based image processing in camera applications.
The software has the following architecture:
Thread for GUI and visualization (app main thread)
Thread for image acquisition from a camera
Thread to control CUDA-based image processing
Thread for OpenGL rendering
Thread for async data writing to SSD or streaming
With that software one can also build a multi-camera solution for any machine vision or industrial cameras with image processing on an NVIDIA GPU. In the simplest case, the user can run several processes (one per camera) at the same time. In a more sophisticated approach, it would be better to create one image loader which collects frames from different cameras for further processing on the GPU.
There is also an opportunity to utilize different compression options on GPU at the end of the pipeline: JPEG (Motion JPEG), JPEG2000 (MJ2K), H.264 and H.265 encoders. Please note that H.264 and H.265 are implemented via the hardware-based NVIDIA NVENC encoder, so that video encoding can run in parallel with CUDA code.
From the benchmarks on NVIDIA GeForce RTX 2080ti we can see that GPU-based raw image processing is very fast and can offer very high quality at the same time. The total performance can reach 4 GPix/s, though it strongly depends on the complexity of the pipeline. Multiple-GPU solutions can significantly improve the performance.
The software can also work with raw images in PGM format (Bayer or grayscale) stored on an external SSD. This is a good method for software evaluation and testing without a camera. Users can download the source codes from Github or ready binaries for Windows from the GPU Camera Sample project page.
We've done some tests with raw frames in PGM format from the Gpixel GMAX3265 image sensor, which has a resolution of 9433 × 7000. For 8-bit mode we got a total processing time on GPU of around 12 ms, which is more than 4 GPix/s. The pipeline includes data copy from host to device, dark frame, FFC, linearization, BPC, white balance, HQLI debayer, sRGB gamma, 16/8-bit transform, JPEG compression (subsampling 4:2:0, quality 90), viewport texture copy and monitor output at 30 fps. The same pipeline for a 12-bit raw image gives around 14 ms processing time.
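As a quick sanity check of that throughput figure (a rough estimate based on the numbers above):
$$\frac{9433 \times 7000\ \text{pixels}}{0.012\ \text{s}} \approx \frac{66\ \text{MPix}}{0.012\ \text{s}} \approx 5.5\ \text{GPix/s},$$
which is indeed above 4 GPix/s.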
Here we can see truly impressive performance for JPEG encoding on an NVIDIA GPU: a 65 MPix color image (24-bit) can be compressed within 3.3 ms on a GeForce RTX 2080ti.
We recommend utilizing that software as a testing tool to evaluate image quality and performance. Users can also test different NVIDIA GPUs to choose the best hardware in terms of price, quality and performance for a particular task.
Glass-to-Glass Time Measurements
To check system latency, we've implemented G2G tests in the gpu-camera-sample application.
We have the following choices for G2G tests:
The camera captures frames showing the current time from a high-resolution timer displayed on the monitor; we send the data from the camera to the software, do the image processing on the GPU and then show the processed image on the same monitor, next to the timer window. If we stop the software, we see two different timestamps on the screen, and their difference is the system latency.
We have implemented a more complicated solution: after image processing on the GPU we do JPEG encoding (MJPEG on CPU or GPU), then send the MJPEG stream to a receiver process which does MJPEG parsing and decoding and outputs the frames to the monitor. Both processes (sender and receiver) run on the same PC.
The same solution as in the previous approach, but with H.264 encoding/decoding (CPU or GPU); both processes run on the same PC.
We can also measure the latency for the case when we stream compressed data from one PC to another over the network. The latency depends on the camera frame rate, monitor fps, NVIDIA GPU performance, network bandwidth, complexity of the image processing pipeline, etc.
Custom software design
The GPU Camera Sample project is just a simple application which shows how a user can quickly integrate the Fastvideo SDK into a real project with machine vision, industrial and scientific cameras. We can create custom solutions with a specified image processing pipeline on NVIDIA GPUs (mobile, laptop, desktop, server) which are much more sophisticated than that project. We can also build custom GPU-based software to handle multicamera systems for machine vision or industrial applications.
Roadmap for GPU Camera Sample project
GPU pipeline for monochrome cameras - done
GenICam Standard support - done
Support for XIMEA, MATRIX VISION, Basler, JAI and Daheng Imaging cameras (USB3) - done
Video streaming option (MJPEG via RTSP) - done
Linux version - done
Software for NVIDIA Jetson Nano, TX2, AGX Xavier - done
Glass-to-Glass (G2G) test - done
H.264/H.265 encoders on GPU - done
Support for Imperx, EVT, IDS, Baumer, FLIR, IOI, Mikrotron cameras - in progress
GenICam option for Fast CinemaDNG Processor
CUDA JPEG2000 encoder
Transforms to Rec.601 (SD), Rec.709 (HD), Rec.2020 (4K)
3D LUT for HSV and RGB with cube size 17, 33, 65, 96, 256, etc.
Interoperability with external FFmpeg and GStreamer
References
Source codes and binaries for Windows, Linux and L4T for GPU Camera Sample project
Ximea SDK
Fastvideo SDK for Image & Video Processing
Fast CinemaDNG Processor software
Other blog posts on GPU software for camera applications
Realtime image processing for XIMEA CB500
Fast RAW Compression on GPU
Fastvideo SDK benchmarks on NVIDIA Quadro RTX 6000
Gpixel GMAX3265 Image Sensor Processing
GPU Software for Camera Applications
Software for Hamamatsu ORCA Processing on GPU
Software for Hamamatsu ORCA Processing on GPU
Author: Fyodor Serzhenko
Scientific research demands modern cameras with low noise, high resolution, high frame rate and high bit depth. Such imaging solutions are indispensable in microscopy, experiments with cold atom gases, astronomy, photonics, etc. Apart from outstanding hardware, there is a need for high-performance software to process streams in realtime with high precision.
The Hamamatsu Photonics company is a world leader in scientific cameras, light sources, photodiodes and advanced imaging applications. For high-performance scientific cameras and advanced imaging applications, Hamamatsu introduced ORCA cameras with outstanding features. ORCA cameras are high-precision instruments for scientific imaging thanks to on-board FPGA processing enabling intelligent data reduction, pixel-level calibrations, increased USB 3.0 frame rates, purposeful and innovative triggering capabilities, patented lightsheet readout modes and individual camera noise characterization.
ORCA-Flash4.0 cameras have always provided the advantage of low camera noise. In quantitative applications, like single molecule imaging and super resolution microscopy imaging, fully understanding camera noise is also important. Every ORCA-Flash4.0 V3 is carefully calibrated to deliver outstanding linearity, especially at low light, to offer improved photo response non-uniformity (PRNU) and dark signal non-uniformity (DSNU), to minimize pixel differences and to reduce fixed pattern noise (FPN).
The ORCA-Flash4.0 V3 includes patented Lightsheet Readout Mode, which takes advantage of sCMOS rolling shutter readout to enhance the quality of lightsheet images. When paired with W-VIEW GEMINI image splitting optics, a single ORCA-Flash4.0 V3 camera becomes a powerful dual wavelength imaging device. In "W-VIEW Mode" each half of the image sensor can be exposed independently, facilitating balanced dual color imaging with a single camera. And this feature can be combined with the new and patented "Dual Lightsheet Mode" to offer simultaneous dual wavelength lightsheet microscopy.
Applications for Hamamatsu ORCA cameras
There are quite a lot of scientific imaging tasks which could be solved with Hamamatsu ORCA cameras:
Digital Microscopy
Light Sheet Fluorescence Microscopy
Live-Cell Microscopy and Live-Cell Imaging
Laser Scanning Confocal Microscopy
Biophysics and Biophotonics
Biological and Biomedical Sciences
Bioimaging and Biosensing
Neuroimaging
Hamamatsu ORCA-Flash4.0 V3 Digital CMOS camera (image from https://camera.hamamatsu.com/jp/en/product/search/C13440-20CU/index.html)
Hamamatsu ORCA-Flash4.0 V3 Digital CMOS camera: C13440-20CU
Image processing for Hamamatsu ORCA-Flash4.0 V3 Digital CMOS camera
That camera generates quite a high data rate. The maximum throughput of the Hamamatsu ORCA-Flash4.0 V3 can be evaluated as 100 fps * 4 MPix * 2 Byte/Pix = 800 MByte/s. Since these are 16-bit monochrome frames, such a high data rate could be a bottleneck when saving streams to SSD in a two-camera system for long-term recording, which is quite usual in microscopy applications.
If we consider a one-day recording session, storage for such a stream could be a problem: a two-camera system generates 5.76 TB of data per hour, so it could be a good idea to implement realtime compression to cut storage costs. To compress 16-bit frames, we can utilize neither JPEG nor H.265 encoding, because they don't support more than 12-bit data. The best choice here is the JPEG2000 compression algorithm, which works natively with 16-bit images. On NVIDIA GeForce GTX 1080 we got around 240 fps for lossy JPEG2000 encoding with a compression ratio around 20. This is a result we can't achieve on the CPU, because the corresponding JPEG2000 implementations (OpenJPEG, Jasper, J2K, Kakadu) are much slower. Here you can see a JPEG2000 benchmark comparison for widespread J2K encoders.
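To double-check the arithmetic with the figures above:
$$2 \times 800\ \text{MB/s} \times 3600\ \text{s} = 5.76\ \text{TB per hour},$$
and with lossy JPEG2000 at a compression ratio around 20 this drops to roughly $5.76\ \text{TB} / 20 \approx 288\ \text{GB}$ per hour for the two-camera system.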
The JPEG2000 lossless compression algorithm is also available, but it offers a much smaller compression ratio, usually in the range of 2-2.5 times. Still, it's a useful option to store the original data without any losses, which could be mandatory for a particular image processing workflow. In any case, lossless compression reduces the data rate, which always helps with storage and performance.
The optimal compression ratio for lossy JPEG2000 encoding should be defined by checking different quality metrics and their correspondence to the particular task to be solved. Still, there is no good alternative for fast JPEG2000 compression of 16-bit data, so JPEG2000 looks like the best fit. We would also recommend adding the following image processing modules to the full pipeline to get better image quality:
Dynamic Bad Pixel Correction
Data linearization with 1D LUT
Dark Frame Subtraction
Flat Field Correction (vignette removal)
White/Black Points
Exposure Correction
Curves and Levels
Denoising
Crop, Flip/Flop, Rotate 90/180/270, Resize
Geometric transforms, Rotation to an arbitrary angle
Sharp
Gamma Correction
Realtime Histogram and Parade
Mapping and monitor output
Output JPEG2000 encoding (lossless or lossy)
The above image processing pipeline can be fully implemented on the GPU to achieve realtime performance or even faster. It can be done with the Fastvideo SDK and an NVIDIA GPU. The SDK is supplied with sample applications in source code, so users can create their own GPU-based applications very quickly. Fastvideo SDK is available for Windows, Linux and L4T.
There is also the gpu-camera-sample application, which is based on the Fastvideo SDK. You can download source codes and/or binaries for Windows from the following link on Github: gpu camera sample. The binaries can work with raw images in PGM format (8/12/16-bit), even without a camera. Users can add support for Hamamatsu cameras to process images in realtime on an NVIDIA GPU.
Fastvideo SDK to process on GPU raw images from Hamamatsu ORCA sCMOS cameras
The performance of the JPEG2000 codec strongly depends on the GPU, image content, encoding parameters and the complexity of the full image processing pipeline. To scale performance, users can also utilize several GPUs at the same time: multiple-GPU processing is part of the Fastvideo SDK.
If you have any questions, please fill in the form below with your task description and send us your sample images for evaluation.
Links
Hamamatsu ORCA-Flash4.0 V3 Digital sCMOS camera
GPU Software for camera applications
JPEG2000 Codec on NVIDIA GPU
Image and Video Processing SDK for NVIDIA GPUs
GPU Software for machine vision and industrial cameras
The original article can be found at: https://www.fastcompression.com/blog/hamamatsu-orca-gpu-image-processing.htm
Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
JPEG Optimizer Library on CPU and GPU
Fastvideo has implemented the fastest JPEG Codec and Image Processing SDK for NVIDIA GPUs. That software works at maximum performance with the full range of NVIDIA GPUs, from mobile Jetson to professional Quadro and Tesla server GPUs. Now we've extended these solutions to offer various optimizations to the standard JPEG algorithm. This is a vitally important issue: getting better image compression while retaining the same perceived image quality within the existing JPEG Standard.
Our expert knowledge of the JPEG Standard and GPU programming is proven by the performance benchmarks of our JPEG codec. It is also the foundation of our custom software design for various time-critical tasks involving JPEG images and related services.
Our customers have been utilizing that GPU-based software for fast JPEG encoding and decoding and for JPEG resize in high-load web applications, and they have asked us to implement more optimizations which are indispensable for web solutions. These are the most demanding tasks:
JPEG recompression to decrease file size without losing perceived image quality
JPEG optimization to get a better user experience when loading JPEG images via slow connections
JPEG processing on users' devices
JPEG resize on-demand:
  to store just one source image (to cut storage costs)
  to match the resolution of the user's device (to exclude JPEG resize on the user's device)
  to minimize traffic
  to ensure minimum server response time
  to offer a better user experience
Implementations of the JPEG Baseline, Extended, Progressive and Lossless parts of the Standard
Other tasks related to JPEG images
The idea of image optimization is very popular, and it really makes sense. Since JPEG is so widespread on the web, we need to optimize JPEG images for the web as well. By decreasing image size, we can save storage space, minimize traffic, improve latency, etc. There are many methods of JPEG optimization and recompression which can bring better compression ratios while preserving perceptual image quality. In our products we strive to combine all of them with high performance on multicore CPUs and modern GPUs.
There is a great variety of image processing tasks connected with JPEG handling, which can be solved either on the CPU or on the GPU. We are ready to offer custom software design to meet any special requirements our customers may have. Please fill in the form below and send us your task description.
JPEG Optimizer Library and other software from Fastvideo
JPEG Optimizer Library (SDK for GPU/CPU on Windows/Linux) to recompress and resize JPEG images for corporate customers: high-load web services, photo stock applications, neural network training, etc.
Standalone JPEG optimizer application - in progress
Projects under development
JPEG optimizer SDK on CPU and GPU
Mobile SDK on CPU for Android/iOS for image decoding and visualization on smartphones
JPEG recompression library that runs inside your web app and optimizes images before upload
JPEG optimizer API for web
Online service for JPEG optimization
Fastvideo publications on the subject
JPEG Optimization Algorithms Review
Web resize on-the-fly on GPU
JPEG resize on-demand: FPGA vs GPU. Which is the fastest?
Jpeg2Jpeg Acceleration with CUDA MPS on Linux
JPEG compress and decompress with CUDA MPS
The original article can be found at: https://www.fastcompression.com/products/jpeg-optimizer-library.htm
Subscribe to our mailing list: https://mailchi.mp/fb5491a63dff/fastcompression
Fast RAW Compression on GPU
Author: Fyodor Serzhenko
Recording performance for RAW data acquisition is an essential issue for 3D/4D, VR and digital cinema applications. Quite often we need to do realtime recording to a portable SSD, and here we face questions about throughput, compression ratio, image quality, recording duration, etc. When we need to store RAW data from a camera, the general approach to raw image encoding is not exactly the same as for color images. Here we review several methods to address that.
Why do we need Raw Image Compression on GPU?
We need to compress the raw stream from a camera (industrial, machine vision, digital cinema, scientific, etc.) in realtime at high fps, for example 4K (12-bit raw data) at 60 fps, 90 fps or faster. This is a vitally important issue for realtime applications, external raw recorders and in-camera raw recording. As an example, consider the RAW or RAW-SDI format used to send data from a camera to a PC or to an external recorder.
Since most modern cameras have 12-bit dynamic range, it's a good idea to utilize JPEG compression, which can be implemented for 12-bit data. For 14-bit and 16-bit cameras this is not the case, and for high-bit-depth cameras we would recommend either Lossless JPEG or JPEG2000 encoding. These algorithms are not as fast, but they can process high-bit-depth data.
Lossy methods to solve the task of Fast RAW Compression
Standard 12-bit JPEG encoding for grayscale images
Optimized 12-bit JPEG encoding (double width, half height, Standard 12-bit JPEG encoding for grayscale images)
Raw Bayer encoding (split the RGGB pattern into 4 planes, then apply 12-bit JPEG encoding to each plane)
The problem with Standard JPEG for RAW encoding is evident: a raw Bayer image doesn't have slowly varying pixel values, and this causes quality problems due to the Discrete Cosine Transform, which is part of the JPEG algorithm. In that case the main idea of JPEG compression is questionable, and we expect a higher level of distortion for RAW images compressed with JPEG.
The "double width" idea is also well known; it works well in Lossless JPEG compression of RAW Bayer data. After such a transform, two adjacent rows contain vertically neighbouring pixels of the same color, which decreases the high-frequency coefficients after the DCT in Standard JPEG. That method is also utilized in the Blackmagic Design BMD RAW 3:1 and 4:1 formats.
If we split the RAW image into 4 planes according to the Bayer pattern, we get 4 downsized images, one per Bayer component. Here we get slowly varying intensity, but for images with halved resolution. That algorithm looks promising, though we can expect slightly lower performance because of the additional split step in the pipeline (see the sketch below).
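A minimal sketch of that split step in C (our illustration, not Fastvideo SDK code), assuming a 12-bit-in-16-bit RGGB frame with even dimensions:

#include <stddef.h>
#include <stdint.h>

/* Split an RGGB Bayer frame into four half-resolution planes, one per Bayer
 * component, so each plane can be fed to a 12-bit grayscale JPEG encoder. */
void split_rggb(const uint16_t *raw, size_t width, size_t height,
                uint16_t *r, uint16_t *g1, uint16_t *g2, uint16_t *b)
{
    size_t half_w = width / 2;
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            size_t dst = (y / 2) * half_w + (x / 2);
            r [dst] = raw[ y      * width + x    ];  /* R  at even row, even col */
            g1[dst] = raw[ y      * width + x + 1];  /* G1 at even row, odd col  */
            g2[dst] = raw[(y + 1) * width + x    ];  /* G2 at odd row, even col  */
            b [dst] = raw[(y + 1) * width + x + 1];  /* B  at odd row, odd col   */
        }
    }
}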
We focus on JPEG-based methods because we have a high-performance JPEG codec on CUDA. That codec is capable of working with the full range of NVIDIA GPUs: mobile Jetson Nano, TK1/TX1/TX2, AGX Xavier, laptop/desktop GeForce series, and Quadro and Tesla server GPUs. The codec also supports 12-bit JPEG encoding, which is the key algorithm for this RAW compression task.
There is also an opportunity to apply JPEG2000 encoding instead of JPEG in all three cases, but here we consider JPEG only, for the following reasons:
JPEG encoding on GPU is much faster than JPEG2000 encoding (approximately ×20)
Compression ratio is almost the same (slightly better for J2K, but not by much)
There is a patent from the RED company covering J2K encoding of split channels inside a camera
There are no open patent issues connected with the JPEG algorithm, and this is a serious advantage of JPEG. Nevertheless, the case of JPEG2000 compression is very interesting and we will test it later: that approach could give us lossless raw image compression on GPU, which can't be done with JPEG.
To solve the task of RAW image compression, we need to specify both a metric and criteria to measure image quality losses. We will try SSIM, which is considered much more reliable than PSNR and MSE. SSIM stands for structural similarity; it is a well-known image quality metric widely used to evaluate image resemblance.
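For reference, the SSIM index between two image patches $x$ and $y$ is commonly computed as
$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$
where $\mu$ denotes local means, $\sigma^2$ local variances, $\sigma_{xy}$ the covariance, and $c_1$, $c_2$ are small constants that stabilize the division.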
Quality and Compression Ratio measurements
To find the best solution among the chosen algorithms, we ran tests calculating the compression ratio and SSIM for standard values of the JPEG quality factor, using the same Standard JPEG quantization table and the same 12-bit RAW image. Since the compression ratio is content-dependent, this is just an example of what we could get in terms of SSIM and compression ratio.
For the testing we used an uncompressed RAW Bayer image from a Blackmagic Design URSA camera with resolution 4032×2192, 12-bit. The compression ratio was measured relative to the packed uncompressed 12-bit file size of 12.6 MB, where two pixel values are stored in 3 bytes.
Output RGB images were created with the Fast CinemaDNG Processor software. The output colorspace was sRGB, 16-bit TIFF, no sharpening, no denoising. SSIM measurements were performed on these 16-bit TIFF images: the source image was compared with the processed image after encoding and decoding with each compression algorithm.
Table 1: Results for SSIM for encoding with standard JPEG quantization table
These results show that the SSIM metric is not really suitable for such tests. According to visual estimation, we can conclude that quality Q = 80 and higher can be considered acceptable for all three algorithms, although the images from the third algorithm look better.
Table 2: Compression Ratio (CR) for encoding with standard JPEG quantization table
The RAW encoding performance is the same for the first two methods, while for the third it is slightly lower (the performance drop is around 10-15%) because of the additional time needed to split the raw image into 4 planes according to the Bayer pattern. Time measurements were done with the Fastvideo SDK on different NVIDIA GPUs. These results are hardware-dependent, and you can repeat the same measurements on your particular NVIDIA hardware.
How to improve image quality, compression ratio and performance
There are several ways to get even better results in terms of image quality, CR and encoding performance for RAW compression:
Image sensor calibration
RAW image preprocessing: dark frame subtraction, bad pixel correction, white balance, LUT, denoise, etc.
Optimized quantization tables for 12-bit JPEG encoding
Optimized Huffman tables for each frame
Minimum metadata in JPEG images
Multithreading with CUDA Streams to get better performance
Better hardware from NVIDIA
Useful links concerning GPU-accelerated image compression
High Performance CUDA JPEG Codec
12-bit JPEG encoding on GPU
JPEG2000 Codec on GPU
RAW Bayer Codec on GPU
Lossless JPEG Codec on CPU
The original article can be found at: https://www.fastcompression.com/blog/fast-raw-compression.htm