#vector iteration is also good for assembling other commands
my coworker lent me his super fancy oscilloscope for testing, so now i know:
sending a vector encodes the data weirdly
sending a long int DOES work, but the numbers arrive in reverse order (??)
iterating over the vector works great! i can see a response on the oscilloscope! now i just have to figure out how to actually receive it.
#tütensuppe
#vector iteration is also good for assembling other commands
#edit: i can receive responses now! was making it too complicated actually.
#however the response is in binary and welp
#when i read it out i get one (1) corrupted character. and a K
#(the last chunk in hex is '4b' and 4b = 75 = ascii value of uppercase k)
#edit 2: agh that was too easy. just cast the contents of the char buffer to int.....
#now to parse that
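for reference, roughly what the working version looks like. this is a minimal c++ sketch -- sendByte()/readByte() are stand-ins for whatever the real serial api is, and the command bytes are made up:

#include <cstdint>
#include <cstdio>
#include <vector>

// stand-ins for the real serial API
void sendByte(uint8_t b) { std::printf("-> 0x%02x\n", (unsigned)b); }
uint8_t readByte() { return 0x4b; } // pretend the device answered 'K'

// sending the vector object whole just dumps its internals (and a long
// int goes out least-significant byte first, hence "reverse order").
// iterating sends each byte explicitly, in the order you choose:
void sendCommand(const std::vector<uint8_t>& cmd) {
    for (uint8_t b : cmd)
        sendByte(b);
}

int main() {
    sendCommand({0x01, 0x02, 0x03}); // hypothetical command bytes
    // the response bytes are values, not text: cast to int to see 75
    // instead of a "corrupted" char / 'K' (0x4b)
    int value = static_cast<int>(readByte());
    std::printf("<- %d\n", value);
}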
The best drawing and digital art apps available for everyone in 2022
Drawing on paper with a traditional pen or pencil has a certain satisfying feel to it. But Command+Z doesn't work on paper when you need to undo a mistake. You can count on technology to help you out. With the best drawing apps on a portable device, artists and designers can make quick changes and iterations on the go.
Which of the top apps for digital drawing should you use?
If you're in the market for one of those amazing drawing apps, it's worth your time to do some research to find the best fit for your creative workflow before making a purchase. Some of the things that modern digital art can do and how it works may come as a pleasant surprise.
Here are the best—there's something for everyone!
Drawing Apps for Beginners
Here are some easy-to-use drawing apps for beginners, with plenty of learning resources and large communities to support new artists as they learn.
1. Procreate
This app runs on an iOS device and can be scary for a designer who is just starting out. Explore the Procreate forums for more information on how to use the program. Anyone interested in digital art can find a lot of resources online, such as tutorials and videos from very skilled artists. Check out the animation tool; it will give your projects a whole new dimension.
2. Adobe Photoshop Sketch
The app is compatible with both iOS and Android devices. This drawing program has been used by artists for years. Like other Adobe products, Photoshop Sketch has a beautiful interface that is also straightforward and easy to use. The brush panel on the left and the handful of design tools on the right leave a large digital canvas free for painting. If you're a fine artist looking to incorporate digital elements into your work, you'll love this app! It's made with you in mind, with streamlined features that let you focus on making art.
3. Adobe Illustrator Draw
This program is intended for the graphic designer who already uses other Adobe products. Adobe Illustrator Draw on the iPad makes it simple to transfer your work to an open Illustrator file using CC Libraries.
The app takes a more minimalist approach than Adobe Photoshop Sketch, and it's gorgeous to look at, but the learning curve can be steep for those not used to working with design software. The good news is that users of any experience level can take advantage of the app's extensive tutorial platform and active community. It works on both iOS and Android.
4. Adobe Fresco
There are some unique features in this program, such as live brushes that can be dragged across the screen, stunning vector graphics, and 3D-looking blending effects. Adobe Fresco can be sent to Photoshop for more advanced editing, and its simple interface makes it good for designers and illustrators who want to get their ideas down quickly without getting bogged down by extra features.
Digital painting and drawing have never been easier than with this beginner-friendly app. You'll need an Adobe ID to access it; it's available on the App Store and Windows.
5. Inspire Pro
An excellent way to get started with digital drawing is with Inspire Pro. The interface is easy to use and gives you access to a wide range of powerful editing and design tools. This app is a strong contender for the best drawing app on the App Store due to its quick rendering times. This app is designed for Apple's iOS and provides a basic set of tools for digital drawing.
6. Pixelmator Pro
Even though this app is more of an "image editor," it can still be considered one of the best digital drawing apps. It can add different effects to text layers, use vector graphics, change colors, and make clipping masks. This little app can make a graphic designer's or photographer's wildest dreams come true.
Pixelmator Pro is an image editing app for the Mac that, like Lightroom and Photoshop, focuses on providing a variety of options for graphic design.
7. Assembly
This app is classified as a "Mobile Vector Design" application. Assembly comes with elegant shape packs and a lot of other useful tools that would be welcome additions to the toolbox of any graphic designer or illustrator. The minimal design makes the most of the available canvas area. There is a sizable group of talented people already familiar with the app who are eager to share their knowledge, and there are also some entertaining tutorials available to help you get started.
With this iOS app, you can make vector graphics and use shapes to make illustrations.
Top-Rated Design Apps for Serious Artists
These apps have more features and are easy to use in a professional setting. People who have never dabbled in design before may find the learning curve more challenging, but the payoff is substantial once they climb it.
8. Autodesk Sketchbook
This is the app to look into if you need to create technical drawings or focus on minute details. It can scan your artwork and digitize it for you. Many industrial designers, architects, and other professionals use this handy little tool as part of their workflow. It makes it easy to draw and gives new users a smooth start with the Autodesk family.
Architects, industrial designers, and illustrators can all benefit from the precision and accuracy of Sketchbook's technical drawing tools. It's compatible with both iOS and Android operating systems.
9. Affinity Designer
When compared to Adobe's suite of design applications, Affinity Designer holds its own. The possibilities for its use are seemingly endless because of its powerful structure. More and more artists are using this tool in their work, and the tool itself has a huge collection of tutorials. This app has caught the attention of many people because it offers the best value for money among digital drawing apps, requiring only a single payment to unlock both the mobile and desktop versions.
This drawing app is an add-on to the desktop version of Affinity Designer, and it supports both vector and pixel art. It was developed for use with layouts intended for print or digital media.
Discover the best drawing app for your needs!
Whether you're an industrial engineer or a fine artist, you can find a drawing application that suits your needs in today's market. These art apps liberate artists from their desktops so they can create wherever they happen to be. With one of the aforementioned drawing apps, you can now make digital drawings wherever and whenever you like!
Musings on Vega / GCN Architecture
Originally posted to /r/AMD, but someone asked that I copy/paste it here to /r/hardware.
In this topic, I'm just going to stream some ideas about what I know about Vega64. I hope I can inspire some programmers to try programming their GPUs! Also, if anyone has more experience programming GPUs (NVidia ones even), please chime in!
For the most part, I assume that the reader is a decent C Programmer who doesn't know anything about GPUs or SIMD.
Vega Introduction
Before going further, I feel like it's important to define a few things for AMD's Vega architecture. I will come back later to better describe some concepts.
64 CUs (Compute Units) -- 64 CUs on Vega64. 56 CUs on Vega56.
16kB L1 (Level 1) data-cache per CU
64kB LDS (Local Data Store) per CU
4-vALUs (vector Arithmetic Logic Unit) per CU
16 PE (Processing Elements) per vALU
4 x 256 vGPRs (vector General Purpose Registers) per PE
1-sALU (scalar Arithmetic Logic Unit) per CU
8GB of HBM2 RAM
Grand Total: 64 CUs x 4 vALUs x 16 PEs == 4096 "shaders", just as advertised. I'll go into more detail later on what vGPRs and sGPRs are, but let's first cover the programmer model.
GPU Programming in a nutshell
Here's some simple C code. Let's assume "x" and "y" are the inputs to the problem, and "output" is the output:
for(int i=0; i<1000000; i++){ // One Million Items
    output[i] = x[i] + y[i];
}
"Work Items" (SIMD Threads in CUDA) are the individual units of work that the programmer wishes to accomplish in parallel with each other. Given the example above, a good work item would be "output[i] = x[i] + y[i]". You would have one million of these commands, and the programmer instinctively knows that all 1-million of these statements could be executed in parallel. OpenCL, CUDA, HCC, and other massively-parallel languages are designed to help the programmer specify millions of work-items that can be run on a GPU.
"NDRange" ("Grid" in CUDA) specifies the size of your work items. In the example "for loop" case above, 1000000 would be the NDRange. Aka: there are 1-million things to do. The NDRange or Grid may be 2-dimensional (for 2d images) or 3-dimensional (for videos).
"Wavefronts" ("Warp" in CUDA) are the smallest group of work-items that a GPU can work on at a time. In the case of Vega, 64 work-items constitute a Wavefront. In the case of the for-loop described earlier, one wavefront would execute iterations [0, 1, 2, 3 ... 63] together. A 2nd wavefront would execute [64, 65, 66, 67 ... 127] together (and probably in parallel).
"Workgroups" ("Thread Blocks" in CUDA) are logical groups that the programmer wants to work together. While Wavefronts are what the system actually executes, the Vega system can combine up to 16-wavefronts together and logically work as a single Workgroup. Vega64 supports workgroups of size 1 through 16 Wavefronts, which correlates to 64, 128, ... 1024 WorkItems (1024 == 16 WaveFronts * 64 Threads per Wavefront).
In summary: OpenCL / CUDA programmers set up their code like this. First, they specify a very large number of work items (or CUDA Threads), which represents the parallelism. For example: perhaps you want to calculate something on every pixel of a picture, or calculate individual "Rays" of a Raytracer. The programmer then groups the work items into workgroups. Finally, the GPU itself splits workgroups into Wavefronts (64 threads on Vega).
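To make that concrete, here's the million-item loop from earlier written as an OpenCL kernel. Just a sketch: the kernel and buffer names are mine, not anything official.

__kernel void vec_add(__global const int* x,
                      __global const int* y,
                      __global int* output) {
    int i = get_global_id(0);  // each work-item owns exactly one index
    output[i] = x[i] + y[i];   // the body of the original for-loop
}

The host enqueues this with a 1-dimensional NDRange of 1000000, and the GPU carves that range into 64-item wavefronts on its own.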
SIMD Segue
Have you ever tried controlling multiple characters with only one controller? When you hook up one controller, but somehow trick the computer into thinking it is 8 different controllers? SIMD (Single Instruction, Multiple Data) is the GPU technique for actually executing these thousands of threads efficiently.
The chief "innovation" of GPUs is just this multi-control concept, but applied to data instead. Instead of building huge CPU cores which can execute different threads, you build tiny GPU cores (or shaders) which are all forced to run the same program. Instead of 8x wide (like in the GIF I shared), it's 64x wide on AMD.
To handle "if" statements or "loops" (which may vary between work-items), there's an additional "execution mask" which the GPU can control. If the execution-mask is "off", an individual thread can be turned off. For example:
if(foo()){
    doA(); // Lets say 10 threads want to do this
} else {
    doB(); // But 54 threads want to do this
}
The 64-threads of the wavefront will be forced to doA() first, with the 10-threads having "execution mask == on", and with the 54-remaining threads having "execution mask == off". Then, doB() will happen next, with 10-threads off, and 54-threads on. This means that any "if-else" statement on a GPU will have BOTH LEGS executed by all threads.
In general, this is called the "thread divergence" problem. The more your threads "split up", the more legs of if-statements (and more generally: loops) have to be executed.
Before I reintroduce Vega's Architecture, keep the multiple-characters / one-controller concept in mind.
Vega Re-Introduction
So here's the crazy part: a single Vega CU doesn't execute just one wavefront at a time. The CU is designed to run up to 40 wavefronts (x 64 threads, so 2560 threads total). These threads don't all execute simultaneously: the 40 wavefronts are there to give the GPU something to do while it waits for RAM.
Vega's main memory controller can take 350ns or longer to respond. For a 1200MHz system like Vega64, that is 420 cycles of waiting whenever something needs to be fetched from memory. That's a long time to wait! So the overall goal of the system, is to have lots of wavefronts ready to run.
With that out of the way, let's dive back into Vega's architecture, this time focusing on CUs, vALUs, and sALUs.
64 CUs (Compute Units) -- 64 CUs on Vega64.
4-vALUs (vector Arithmetic Logic Unit) per CU
16 PE (Processing Elements) per vALU
4 x 256 vGPRs (vector General Purpose Register) per PE
1-sALU (scalar Arithmetic Logic Unit) per CU
The sALU is the easiest to explain: sALUs are what handle those "if" statements and "while" statements I talked about in the SIMD section above. sALUs track which threads are "executing" and which aren't. sALUs also handle constants and a couple of other nifty things.
Second order of business: vALUs. The vALUs are where Vega actually gets all of its math power. While sALUs are needed to build the illusion of wavefronts, vALUs truly execute the wavefront. But how? With only 16 PEs per vALU, how does a wavefront of size 64 actually work?
And btw: your first guess is likely wrong. It is NOT simply 4 vALUs x 16 PEs. Yes, that number is 64, but it's an utterly wrong explanation, and it tripped me up the first time.
The dirty little secret is that each PE repeats itself 4 times in a row, across 4 cycles, a fact hidden deep in AMD's documentation. In any case, 4 cycles x 16 PEs == 64 work-items per vALU. x4 vALUs == 256 work-items per Compute Unit (every 4 clock cycles).
Why repeat themselves? Because if a simple addition takes 4 clock cycles to operate, then Vega only has to perform ~105 math operations while waiting for RAM (remember: ~350ns, or 420 clock cycles, for RAM to respond). Repeating commands over-and-over again helps Vega hide the memory-latency problem.
Full Occupancy: 4-clocks x 16 PEs x 4 vALUs == 256 Work Items
Full Occupancy, or more like "Occupancy 1", is when each CU (compute unit) has one-work item for each physical thread that could run. Across the 4-clock cycles, 16 PEs, and 4 vALUs per CU, the Compute Unit reaches full occupancy at 256 work items (or 4-Wavefronts).
Alas: RAM is slow. So very, very slow. Even at Occupancy 1 with super-powered HBM2 RAM, Vega would spend too much time waiting for RAM. As such, Vega supports "Occupancy 10"... but only IF the programmer can split the limited resources between threads.
In practice, programmers typically reach "Occupancy 4". At occupancy 4, the CU still only executes 256-work items every 4-clock cycles (4-wavefronts), but the 1024 total items (16-wavefronts) give the CU "extra work" to do whenever it notices that one wavefront is waiting for RAM.
Memory hiding problem
Main memory latency is incredibly high, and also variable. RAM may take 350 or more cycles to respond. Even the LDS may respond in a variable amount of time (depending on how many atomic operations are going on, or bank conflicts).
AMD has two primary mechanisms to hide memory latency.
Instruction Level -- AMD's assembly language requires explicit wait-states to hold the pipeline. The "s_waitcnt lgkmcnt(0)" instruction you see in the assembly is just that: wait for the local/global/konstant/message counter to be (zero). Careful use of the s_waitcnt instruction can be used to hide latency behind calculations: you can start a memory load into some vGPRs, and then calculate with other vGPRs before waiting (see the sketch just after this list).
Wavefront Level -- The wavefronts at a system-level allow the CU to find other work, just in case any particular wavefront gets stuck on a s_waitcnt instruction.
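Here's roughly what that instruction-level trick looks like in GCN assembly. Treat this as an illustrative sketch I wrote by hand, not compiler output:

s_load_dword   s8, s[4:5], 0x0   ; kick off a scalar load (lgkmcnt goes up)
v_mul_f32      v2, v0, v1        ; unrelated vALU math runs while the load is in flight
s_waitcnt      lgkmcnt(0)        ; only NOW stall until the load has landed
v_add_f32      v3, s8, v2        ; safe to use the loaded value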
While CPUs use out-of-order execution to hide latency and search for instruction-level parallelism... GPUs require the programmer (or compiler) to explicitly put the wait-states in. It is far less flexible, but a far cheaper option.
Wavefront level latency hiding is roughly equivalent to a CPU's SMT / Hyperthreading. Except instead of 2-way hyperthreading, the Vega GPU supports 10-way hyperthreads.
Misc. Optimization Notes
On AMD systems, 64 is your magic minimum number. Try to have at least 64 threads running at any given time. Ideally, have your workload evenly divisible by 64. For example, 100 threads will be run as a 64-thread wavefront + a 36-thread wavefront (with 28 wasted vALU slots!). 128 threads is more efficient.
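A trivial helper makes the padding rule concrete (my own sketch, in plain C):

#include <stdio.h>

/* pad a work size up to a multiple of the 64-thread wavefront so the
   last wavefront has no masked-off lanes */
static int round_up_to_wavefront(int items) {
    return ((items + 63) / 64) * 64;
}

int main(void) {
    printf("%d\n", round_up_to_wavefront(100)); /* 128: two full wavefronts */
    return 0;
}

The padded threads still run, so either give them harmless work or mask them off with an early bounds check in the kernel.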
vGPRs (vector General Purpose Registers) are your most precious resource. Each vGPR is 32 bits of storage that can be operated on at the full speed of Vega (one operation every 4 clock cycles). Any add, subtract, or multiply in any work-item will have to travel through a vGPR before it can be manipulated. vGPRs roughly correlate to "OpenCL Private Memory", or "CUDA Local Memory".
At Occupancy 1, you can use all 256 vGPRs (1024 bytes). However, Occupancy 1 is not good enough to keep the GPU busy while it waits for RAM. The extreme case of Occupancy 10 gives you only 25 vGPRs to work with (256/10, rounded down). A reasonable occupancy to aim for is Occupancy 4 or above (64 vGPRs at Occupancy 4).
FP16 Packed Floats will stuff 2x16-bit floats per vGPR. "Pack" things more tightly to save vGPRs and achieve higher occupancy.
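In OpenCL, that packing looks something like this (a sketch, assuming the device exposes the cl_khr_fp16 extension):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_packed(__global half2* data) {
    int gid = get_global_id(0);
    half2 v = data[gid];           // two 16-bit floats in one 32-bit vGPR
    data[gid] = v * (half2)(2.0h); // one packed multiply covers both halves
}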
The OpenCL compiler, as well as the HCC, HIP, and Vulkan compilers, will overflow OpenCL Private Memory into main memory (Vega's HBM2) if it doesn't fit into vGPRs. There are compiler flags to tune how many vGPRs the compiler will target. However, your code will be waiting for RAM on an overflow, which is counterproductive. Expect a lot of compiler-tweaking to figure out the optimal vGPR count for your code.
sGPRs (scalar General Purpose Registers) are similarly precious, but Vega has a lot more of them. I believe Vega has around 800 SGPRs per SIMD unit. That is 4x800 SGPRs per CU. Unfortunately, Vega has an assembly-language limit of 102 SGPRs allocated per wavefront. But an occupancy 8 Vega system should be able to hold 100 sGPRs per wavefront.
sGPRs implement the OpenCL Constant memory specification (also called CUDA Constant memory). sGPRs are more flexible in practice: as long as a value is uniform across the 64-item wavefront, an sGPR can be used instead of 64 individual, precious vGPRs. This can implement a uniform loop (like for(int i=0; i<10; i++) {}) without using a precious vGPR.
If you can branch using sGPR registers ("constant" across the whole 64-item wavefront), then you will not need to execute the "else". Effectively, sGPR branching never has a divergence problem. sGPR-based branching and looping has absolutely no penalty on the Vega architecture. (In contrast, vGPR-based branching will cause thread-divergence).
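A quick illustration of the difference (the kernel shape is mine; the point is which register file holds the branch condition):

__kernel void branch_demo(__global float* out, const int mode) {
    int gid = get_global_id(0);

    if (mode == 0)        // uniform: every lane agrees, sALU branch, no mask
        out[gid] = 1.0f;
    else
        out[gid] = 2.0f;

    if ((gid & 1) == 0)   // divergent: both legs execute under an execution mask
        out[gid] += 1.0f;
    else
        out[gid] -= 1.0f;
}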
The sALU can operate on 64-bit integers. sGPRs are of size 32-bits, and so any 64-bit operation will use two sGPRs. There is absolutely no floating-point support on the sALU.
LDS (Local Data Store) is the 2nd fastest RAM, and is therefore the 2nd most important resource after vGPRs. LDS RAM correlates to "OpenCL Local" and "CUDA Shared". (Yes, "Local" means different things between CUDA and OpenCL. It's very confusing). There is 64kB of LDS per CU.
LDS can share data between anything within your workgroup. The LDS is the primary reason to use a large 1024-thread workgroup: the workgroup can share the entire LDS space. LDS has full support of atomics (ex: CAS) to provide a basis of thread-safe communications.
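For example, here's the classic workgroup-sum pattern staged through LDS (a sketch; the names and the reduction shape are mine):

__kernel void block_sum(__global const float* in,
                        __global float* out,
                        __local float* scratch) {
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];   // stage through fast LDS
    barrier(CLK_LOCAL_MEM_FENCE);          // sync the whole workgroup

    // pairwise tree-sum within the workgroup
    for (int s = (int)get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; // one partial sum per workgroup
}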
LDS is roughly 32-banks (per CU) of RAM which can respond within 2-clock ticks under ideal circumstances. (It may be as slow as 64-clock ticks however). At 1200 MHz (Vega64 base clock), the LDS has 153GBps of bandwidth per compute unit. Across the 64-CUs of Vega64, that's a grand total of 9830.4 GBps bandwidth (and it goes faster as Vega boost-clocks!). Compared to HBM2, which is only 483.8 GBps, you can see why proper use of the LDS can accelerate your code.
Occupancy will force you to split the LDS. The exact calculation is harder to formulate, because the LDS is shared by workgroups (and there can be 1 to 16 wavefronts per workgroup). If you have 40 workgroups (1 wavefront per workgroup), the 64kB LDS must be split into 1638-byte chunks between workgroups. However, if there are 5 workgroups (8 wavefronts, aka 512 work-items, per workgroup), the 64kB LDS only needs to be split into 13107-byte chunks between the 5 workgroups, even at max Occupancy 10.
As a rule of thumb: bigger workgroups that share more data will more effectively use the LDS. However, not all workloads allow you to share data easily.
The minimum workgroup size of 1 wavefront / 64 work-items is treated as special: barriers and synchronization never have to happen! A workgroup of 1 wavefront (64 work-items) by definition executes synchronously with itself. Still, use barrier instructions (and let the compiler figure out that it can turn the barriers into no-ops).
A secondary use of LDS is to use it as a manually managed cache. Don't feel bad if you do this: the LDS is faster than L1 cache.
L1 vector data cache is 16kB, and slower than even the LDS. In general, any program serious about speed will use the LDS explicitly, instead of relying upon the L1 cache. Still, it's helpful to know that 16kB of global RAM will be cached for your CU.
L1 scalar data cache is 16kB, shared between 4 CUs (!!). While this seems smaller than vector L1 Cache, remember that each sALU is running 64-threads / work items. In effect, the 40-wavefronts (x4 == 160 wavefronts max across 4 CUs) represents 10240 threads. But any sALU doesn't store data per-thread... it stores data per wavefront. Despite being small, this L1 scalar data cache can be quite useful in optimized code.
Profile your code. While the theoretical discussion of this thread may be helpful to understanding why your GPGPU code is slow, you only truly understand performance if you read the hard-data.
HBM2 Main Memory is very slow (~420 cycles to respond), and relatively low bandwidth ("only" ~480 GBps). At Occupancy 1, there will be a total of 16384 work-items (or CUDA Threads) running on your Vega64. The 8GB of HBM2 main memory can therefore be split up into 512kB per work-item.
As Bill Gates used to say, 640kB should be enough for everyone. Unfortunately, GPUs have such huge amounts of parallelism, you really can't even afford to dedicate that much RAM even in an "Occupancy 1" situation. The secret to GPUs is that your work-items will strongly share data with each other.
Yeah yeah yeah, GPUs are "embarrassingly parallel", or at least are designed to work that way. But in practice, you MUST share data if you want to get things done. Even at Occupancy 1, the 512kB of HBM2 RAM per work-item is too small to accomplish most embarrassingly parallel tasks.
References
AMD OpenCL Optimization Guide
AMD GCN Crash Course
Advanced Shader Programming on GCN
GCN Assembly Tutorial -- Seeing the assembly helps understand how sGPR or vGPRs work, and solidify your "wavefront" ideas.
Vega Assembly Language Manual -- 247 pages of dense, raw, assembly language.