Hi all, welcome back to part 2 of my Vega pre-release coverage.
Today I will cover some of the more interesting features of the Vega architecture and share
my thoughts on them.
Note that AMD has released only scant information on Vega, just basic high-level details, so a lot of this presentation will be my interpretation of, and speculation about, these new features.
Again, just to remind you, I could be very wrong so please do not use this presentation
as justification for any personal biases and as always, I encourage you to do your own
research for further clarity.
Let's start with a basic fact: a graphics chip has a lot of fixed-function hardware, often referred to as the front-end, back-end or fixed-function blocks, and when AMD calls their current architecture GCN, semantics aside, the name really only applies to the Compute Units.
AMD refers to Vega's Compute Units as New Compute Units, or NCUs, but don't be fooled: they are still very much GCN, just with some very useful new features.
Where Vega is a revolution compared to previous GCN, with changes major enough to warrant a new architecture name, is actually within the front-end and back-end.
In the last video, I ended it on the most important change with Vega by suggesting it
has a form of Tile-Based Rasterization.
Now, a caveat here: GPUs already perform various Tiling optimizations, so let me be very clear, the Draw Stream Binning Rasterizer in Vega is not likely to be Tile-Based Rendering as mobile GPUs do it, due to incompatibilities with PC APIs.
The same also applies to NVIDIA's Tiling technique, so while a lot of folks equate Maxwell & Pascal's rendering with that of real Tile-Based Rendering GPUs, without official info from NVIDIA we simply do not know the extent to which it is implemented.
In general, Tiling optimizations all have an identical purpose: to improve performance while saving bandwidth and power, and they do this by breaking the workload down into very small fragments.
The principle is to work on smaller sets of data that can be read and written quicker or, better still, fit within on-GPU cache rather than off-GPU VRAM.
When most people think of the graphics capabilities of a GPU, they tend to focus on the Stream Processors or CUDA Cores, but graphics work starts in the front-end, before primitives even take shape and while the scene still only exists as code and draw calls.
The Geometry Processor or PolyMorph Engine fetches vertices and executes Vertex Shaders to generate triangles for the next stage, rasterization, which generates pixel information.
These pixels are then processed with all manner of pixel shaders to complete the frame, which is then sent to the back-end, or ROPs, for output to the display.
This pipeline is known as Immediate Mode Rendering, because as each set of vertices is transformed, it gets rasterized to pixels immediately and finished before the next primitive gets rasterized.
In Tile-Based Rasterization, all of the geometry is processed as a first step, which then gets
culled & binned into separate smaller tiles.
Each tile is then rasterized to become pixels and proceeds to further stages.
As each tile is finished, the next tile proceeds, and so on; once all the tiles are done, the frame is presented to the display.
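To make that flow concrete, here is a minimal sketch of the bin-then-rasterize idea. This is not AMD's actual algorithm; the 32-pixel tile size and the simple bounding-box binning are my own illustrative assumptions.

```python
TILE = 32  # assumed tile edge in pixels, purely illustrative

def bin_triangles(triangles):
    """Pass 1: after geometry processing, drop each triangle into every tile its bounding box touches."""
    bins = {}
    for tri in triangles:                          # tri = three (x, y) screen-space vertices
        xs, ys = zip(*tri)
        for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                bins.setdefault((tx, ty), []).append(tri)
    return bins

# Pass 2: rasterize and shade one tile at a time so its colour/depth data can stay on chip.
scene = [((5, 5), (60, 10), (20, 70)), ((200, 40), (250, 90), (210, 120))]
for tile, tris in bin_triangles(scene).items():
    print(f"tile {tile}: {len(tris)} triangle(s) to rasterize")
```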
Now, there are two interesting questions here. One, why do we need to process all the geometry first before Tiling?
Because without geometry data, we simply do not know the location of a primitive in order to place it in a tile.
The other question is why do we have to process a tile entirely before moving to the next?
It's all about the on-GPU cache: it's just not big enough to hold more than a tile's worth of data at a time.
This is obviously a more convoluted approach so why would it be advantageous at all?
During the pixel formation process, steps such as blending, depth testing and stencil testing require access to large sets of pixel data, which has to travel to and from VRAM.
The consequence of this is not only a performance limitation due to bandwidth, but also increased power usage due to increased Memory Controller activity.
If bandwidth is a limiting factor, the GPU's design must therefore go with a bigger bus and higher-clocked VRAM, and both of these lead to higher power usage.
By processing small tiles instead of the full scene, the data can fit within on-GPU cache.
On-GPU cache is faster, an order of magnitude lower in latency, and cheaper in power to utilize.
If the GPU uses less of the main VRAM and more of its on-chip cache, performance is not as limited by VRAM bandwidth, so to achieve a certain level of performance, a leaner chip with a smaller bus will suffice.
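As a rough back-of-envelope comparison (my own illustrative tile size and pixel format, not Vega's actual parameters):

```python
bytes_per_pixel = 4 + 4                 # assumed 32-bit colour + 32-bit depth/stencil
tile_bytes = 32 * 32 * bytes_per_pixel  # assumed 32x32-pixel tile
print(tile_bytes)                       # 8192 bytes -> 8 KB, trivially fits in on-chip cache
frame_bytes = 3840 * 2160 * bytes_per_pixel
print(frame_bytes / 2**20)              # ~63 MB for a full 4K frame -> has to live in VRAM
```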
Now, remember that Tiling techniques can be used for each step, rasterization, pixel shading
& render back-end processing.
In a true Tile-Based Rendering architecture, all of these steps happen in sequence, a tile
from the front-end goes all the way to completion before the next tile is rendered.
NVIDIA GPUs likely have a mix of traditional Immediate Mode & Tiling optimizations for
some steps.
With Vega, AMD is implementing an improved Tiling step for the Rasterizer, or during
pixel formation, an important step for today's high resolution gaming.
For all of the aforementioned advantages, why not implement more Tiling years ago?
There are a few disadvantages to Tiling, particularly in the early front-end stages.
The first one is requiring a dedicated tiler and on-chip binning cache which adds to transistor
count and die size.
Here, one could argue the performance benefit will outweigh the added die cost, but if bandwidth
and power is not a limiting factor, why not just go with a simpler Immediate Mode Rendering?
We have to view this from the perspective of a decade ago: the foundations of current architectures were still very light on power usage back then, and they hadn't yet hit the limits of power delivery and cooling capability.
The other important point is that back then, the average PC gaming resolution was much lower, so bandwidth and power were not stressed like they are today.
Thus it was just easier to continue with Immediate Mode and raise the bandwidth & power limit
as required.
The second major disadvantage of Tiling occurs after the vertex shader step: the output of geometry processing, the per-vertex data and tiler intermediate state, is a large dataset that has to be sent to VRAM.
This dataset is then read back during pixel processing, so there's an overhead in VRAM capacity and bandwidth that is paid during these steps.
These days, gaming resolutions have increased along with an emphasis on post effects, and the burden on frame buffer bandwidth is much greater than the Tiling overhead, so this trade-off makes sense.
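To put a very rough number on that overhead (all figures here are my own assumptions, purely for illustration):

```python
vertices         = 2_000_000   # assumed visible vertices in a frame
bytes_per_vertex = 32          # assumed post-transform attributes (position plus a few varyings)
fps              = 60
# Binned geometry is written out after vertex shading, then read back for pixel work.
traffic_gb_per_s = vertices * bytes_per_vertex * 2 * fps / 1e9
print(traffic_gb_per_s)        # ~7.7 GB/s of extra VRAM traffic under these assumptions
```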
The third downside of Tile-Based Rasterization is the immense load it places on the geometry
engines of a GPU because it requires the entire scene of vertices be processed first before
other rendering steps can proceed.
On mobile and early PC graphics, a typical scene might only have a hundred thousand triangles, so basic culling along with brute-forcing the geometry made Tiling possible.
These days, a million primitives per viewable location is on the low end, with modern games often having tens of millions, even approaching a staggering 220 million in the case of Deus Ex: Mankind Divided.
As such, the pre-requisites for Rasterizer Tiling to be successful are very effective
hardware geometry setup and culling to rapidly handle hugely complex primitive counts.
This requirement is simply not met by current GCN iterations.
The reason is of course GCN's focus on being a general-compute GPU, and this is reflected in GCN's weaker ability to process triangles relative to its shader power.
As an example, most GCN GPUs have a fixed four Geometry Engines, capable of processing 4 triangles per clock in total.
With NVIDIA, ever since Fermi, geometry processing has scaled with the PolyMorph Engines, one within each SM, their equivalent of a Compute Unit.
Typically, a mid-range chip may have 16 PolyMorph Engines, each processing a triangle every other clock, so the effective rate is twice that of even AMD's biggest GCN GPUs, and higher still once clock speeds are factored in.
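Putting those rates side by side (clock speeds ignored, using the per-clock figures just mentioned):

```python
gcn_engines, gcn_rate = 4, 1.0     # big GCN: 4 geometry engines, 1 triangle per clock each
nv_engines, nv_rate   = 16, 0.5    # mid-range NVIDIA: 16 PolyMorph Engines, 1 triangle every 2 clocks
print(gcn_engines * gcn_rate)      # 4 triangles per clock
print(nv_engines * nv_rate)        # 8 triangles per clock -> twice the setup rate before clock speed
```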
But raw throughput by itself is simply not enough; the Geometry Engine also has to be smart enough to remove, or cull, invisible and degenerate primitives to reduce the workload.
This particular hardware function is one in which NVIDIA has held a lead since Fermi, again thanks to their scalable PolyMorph Engines.
NVIDIA GPUs have always had a huge advantage in geometry processing benchmarks, both visible
and occluded triangles.
GCN, meanwhile, does not cull many and relies more on brute force, which can quickly lead to bottlenecks when geometry complexity rises, as with heavy Tessellation usage.
These hardware design differences also manifest themselves in lower-resolution benchmarks, where pixel shading becomes lightweight but the geometry load remains similar, so as a portion of the frame rendering time, geometry processing dominates over other tasks.
This factor is why low-resolution benchmarks can be misleading, with reviewers assuming the lower AMD GPU performance is due to mysterious driver overhead.
Conversely, AMD GPUs have tended to perform relatively better at higher resolutions, and again it's wrongly assumed to be down to bandwidth or fill-rate advantages.
It's simply that higher resolutions shift the frame's load towards pixel shading, where GCN's potent Stream Processor arrays get to flex their power.
Let's return to Vega and how the shift to Rasterizer Tiling will force its front-end to drastically boost geometry processing.
AMD claims that Vega's Geometry Engines have been improved to offer more than 2x the triangle throughput.
When we look at the footnotes, they specify Vega is capable of 11 triangles per clock, roughly a 2.75x increase over the current 4.
Why 11?
It's an odd number: the Geometry Engines can each process either 1 or 2 triangles per clock, which, with four engines, results in 4 or 8 triangles per clock.
Unless of course it's still the same 4 triangles per clock from the fixed-function engines, but with a new Primitive Shader, which replaces the Vertex & Geometry Shader steps, Vega can boost triangle throughput by tapping into its big Stream Processor arrays to process and cull geometry much faster.
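A quick enumeration shows why 11 per clock can't fall out of the fixed-function engines alone:

```python
engines = 4
for per_engine in (1, 2):              # the two plausible fixed-function rates
    print(engines * per_engine)        # 4 or 8 -- neither matches the quoted 11
# The extra throughput would therefore have to come from programmable Primitive
# Shaders culling and processing geometry on the Stream Processor arrays.
```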
Importantly, AMD refers to their Primitive Shader as a Compute Shader.
As such, it is not far-fetched to suggest parallel dispatching via the Asynchronous Compute Engines.
This should, in theory, free up the main Graphics Command Processor for other tasks while Primitive Shaders are executing.
One of the concerns is whether this new Primitive Shader will require specific game coding to access, and there's evidence to suggest both yes and no.
Yes, in that direct usage by developers could potentially yield the best performance gains; and no, in that even if developers don't use it, AMD can still apply it within their drivers on a per-application basis.
Being a flexible programmable Compute Shader gives AMD options to extract higher geometry
processing & culling from Vega's new front-end.
Even if we assume lower-than-expected gains, any increase in triangles per clock is a significant gain, especially alongside the improved primitive culling techniques carried over from Polaris.
Regardless of whether Vega raises the rate from 4 up to 11 triangles per clock, it's also important to remember that these figures are per clock, and so clock speeds matter.
For Vega, AMD was very vague during their teaser, saying only that the NCU is optimized for higher clock speeds and higher IPC.
Let's cover the clock speed first.
With Polaris, I had to guess its clock speed mostly based on GloFo's claims about their 14nm FinFET node; I guessed 1.5GHz, which was wrong, as Polaris launched last year at around 1.25GHz.
With the recent Polaris refresh, the RX 580, shipping many models at 1.4GHz and above, it's clear that GloFo's initial claims about their node were off the mark and they needed time to fine-tune the process to hit their performance targets.
With Vega, we have more concrete info on the clock speeds directly from AMD.
The Vega powered Radeon Instinct MI25's specs list 25 TFlops of FP16 performance, which
equates to 12.5 TFlops FP32.
With the engineering sample Vega from various leaks and device ID specifications in Linux
drivers, we know it has 4,096 Stream Processors.
As it's an HPC card, the rated TFlops would have to come from its base clock.
Therefore, the MI25 would need a base clock of around 1,525MHz to achieve its rated compute performance.
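Here's the arithmetic behind that figure:

```python
fp16_tflops = 25.0
fp32_tflops = fp16_tflops / 2               # packed FP16 runs at twice the FP32 rate
stream_processors = 4096
flops_per_sp_per_clock = 2                  # one fused multiply-add counts as 2 FLOPs
clock_mhz = fp32_tflops * 1e12 / (stream_processors * flops_per_sp_per_clock) / 1e6
print(clock_mhz)                            # ~1525.9 MHz base clock implied by the rated TFlops
```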
How high would the boost clocks be?
We simply do not know but it is obviously configurable based on the card's TDP limits.
What we do know is that historically, AMD's professional GPUs were down-clocked compared
to the consumer variant.
There's evidence to suggest MI25 has a rated TDP of 225W for better HPC compatibility,
which is still low enough for the consumer Vega to have higher clocks.
My guess is that the high-powered consumer Vega would have around 1.6Ghz base clocks,
with variable boost clocks from there onwards towards a peak of around 1.8Ghz.
Some of you would think that's too high given what you know about GCN and I say to you,
ditch what you know since it does not apply.
Since GCN's debut, the core design has not changed much; there has been no emphasis on redesigning it to operate at higher frequencies.
As an example, the first GCN, Tahiti, had a peak clock of around 1.2Ghz when overclocked.
Later GCN iterations such as Hawaii, Tonga and Fiji have this same peak.
Polaris is clocked higher purely due to the benefits of 14nm FF when compared to the previous
GCN on 28nm.
Vega being designed for higher clocks tells me one thing: Vega has a longer pipeline.
The trade-off would be higher latency, as well as more register & cache usage, but this is mostly offset by the higher performance.
As for the increased IPC, don't count on anywhere near a 2x increase.
AMD is referring to specific use cases, such as Rapid Packed Math accelerating certain operations.
The next Vega feature is the new Hardware Scheduler, now with a cooler name, the Intelligent
Workgroup Distributor.
Almost every single reviewer glossed over it with the same "load balancing to better
utilize resources" non-statement.
My guess is two potential improvements. The first is the simplest, and it relates to scaling performance.
As you may know, GCN was originally designed with the capability to scale up to a max of
4 Shader Engines.
A Shader Engine is similar to NVIDIA's Graphics Processing Clusters, a group of fixed function
units along with Compute Units and the arrays of Stream Processors.
Tahiti & Tonga have 4 Shader Engines, each with 512 Stream Processors.
Hawaii, a very capable chip, also has 4 Shader Engines, each with 704 Stream Processors.
These GPUs were considered balanced: performance scaled relative to their Stream Processor counts, so there were no work distribution bottlenecks.
Fiji, however, also has 4 Shader Engines, but with 1,024 Stream Processors each, and unsurprisingly this resulted in under-utilization, certainly in DX11 where the ACEs cannot participate in distributing work.
Vega has the same Stream Processor count, and if it's distributed over 4 Shader Engines, that depth would leave the old scheduler bottlenecked again.
As such, an improved scheduler is required to fully tap all the Stream Processors and allow Vega's performance to scale properly.
The other point relates to something very few people talk about: GCN was designed from the start to break the screen space up into quadrants to match its 4 Geometry Engines.
Yes, GCN actually already uses a form of synchronized screen tiling for rasterization.
This, however, is not to the same extent as Vega: the current approach is not about saving bandwidth or power, as the tile partitions are too big; it is more about distributing work to the 4 separate Shader Engines.
There's a potential flaw in this approach, however, because not all quadrants of a scene are equal in geometry complexity.
With static partitioning, the 4 Rasterizers in GCN may end up with very uneven workloads, so some may finish ahead of the others, leaving Rasterizers either idle or bottlenecked.
A simplified example: imagine your typical first-person shooter. During intense firefights, huge explosions go off on one side of the screen and your frame rate drops due to all the new transparent smoke particles & effects.
When you analyze the scene, the increase in primitive & pixel complexity is there, but it's not overwhelming, or at least it shouldn't be, given the capabilities of your powerful GPU.
But divide that power by 4, and suddenly there is much more potential for these bottlenecks to occur.
What's happening is that some of the Rasterizers are idling, while the one processing those explosions slows down due to the increased complexity of its quadrant.
With more coherency between the Shader Engines, the Intelligent Workgroup Distributor is able
to spread the load more evenly to prevent individual Geometry Engines bottlenecking.
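A toy model (the work-unit numbers are entirely made up) shows why evening out the load matters:

```python
# Per-quadrant "work units" for one frame; one quadrant is hit by heavy explosion effects.
quadrant_work = [100, 110, 95, 400]
static_time   = max(quadrant_work)          # each Rasterizer owns one fixed quadrant
balanced_time = sum(quadrant_work) / 4      # ideal case: work spread evenly across all four
print(static_time, balanced_time)           # 400 vs 176.25 -- the frame is gated by the busiest quadrant
```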
The other big feature that AMD claims is revolutionary for Vega is the new High Bandwidth Cache Controller
or HBCC.
What's new with the HBCC?
Firstly, Vega is capable of addressing up to 512 TB of virtual address space.
This in and of itself is not a novel development, because NVIDIA's Pascal GP100 is also capable of this feat.
However, it's interesting to note that this feature is absent in lower tier Pascal and
so this could be an advantage for AMD in some HPC markets.
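For reference, 512 TB in the binary sense corresponds to a 49-bit virtual address space:

```python
print(2**49)            # 562949953421312 bytes
print(2**49 / 2**40)    # 512.0 TB (binary terabytes), i.e. 49 address bits
```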
The HBCC is also designed so that the HBM2 memory is capable of acting like a true cache, with assets streamed in, fine-grained, from various external sources including RAM, Non-Volatile Memory and even network storage.
GP100 is capable of unified memory operation and can access other GPUs' VRAM as well as system RAM, but I do not know whether it's able to access data directly from other sources like Vega can, as it's not something NVIDIA talks about.
With AMD's past GPGPU efforts, they offered these markets similar or better raw compute performance for less money, but it has not led to major market penetration.
The reason, as I've mentioned, is the entrenched nature of CUDA: companies and institutions are not willing to switch just for similar or slightly better performance.
The cost of GPU accelerators aren't a big issue in these markets.
It's all about software and being able to do something special that your competitor
cannot.
Realistic renders of high-resolution scenes use hundreds of GBs of assets, and this is why they are restricted to CPU clusters with large system RAM capacity; even then, the result is a slideshow.
Vega, in my opinion, will dominate this kind of large-dataset acceleration.
In fact, it's something the HPC market has often demanded: for GPU accelerators to shed the shackles of their limited VRAM capacity, which is simply too small for the world of big data.
An interesting thing to watch out for is how Vega relates to AMD's Zen architecture: Zen focuses on efficiency and lacks performance on the more intensive AVX instructions which some HPC markets require.
Those workloads also happen to run very well on GPUs, but their use is often restricted to CPU clusters, again due to limitations in software as well as GPU VRAM capacity.
It's going to take effort from AMD's part, but I think they will have an excellent synergy
with Zen Naples and Vega Instinct, which could penetrate into this HPC niche.
Moving on, the HBCC enables Vega's HBM2 to be more effective in terms of capacity, with AMD quoting a 2x uplift, so 4GB of HBM2 effectively equals 8GB of GDDR5 for game assets.
The reason this claim isn't just PR smoke, and that it can work as described, is down to a special characteristic of HBM2 itself.
It's not about bandwidth or access latency, as GDDR5X will potentially, and its successors certainly, match twin-stack HBM2 on these metrics.
In order to do what AMD claims, streaming fine-grained data into memory while the GPU is processing data already in it, constant read and write activations of the memory's banks are required.
Think of the banks in your memory chips as individual storage arrays which have to be activated before they can be read or written.
Activating and accessing that data takes roughly similar latency on GDDR5 and HBM2, but with GDDR5, once banks have been activated they enter a wait stage, a cool-down timer of sorts, before further activations can be issued.
This wait period, governed by the Four Activate Window timing or tFAW, can be as long as or longer than the time it takes for the data itself to be read or written.
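As a purely illustrative example of how much that window can cost (the timing value below is an assumption, not a vendor spec):

```python
tfaw_ns = 30.0                     # assumed four-activate window for a GDDR5-class device
activations = 1_000_000            # scattered row activations generated by cache-style traffic
min_time_ms = activations / 4 * tfaw_ns / 1e6   # at most 4 activates per rolling tFAW window
print(min_time_ms)                 # 7.5 ms spent just pacing activations, before any data moves
```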
This is not an issue for the way VRAM is currently used: most game assets are loaded into VRAM once, then the GPU reads them as needed, so there is very little consecutive read/write activity to the same banks.
When VRAM is used like a cache, it needs to handle rapid read/write cycles to the same banks, as data that's no longer required has to be evicted and replaced with new data streaming in.
Delays due to wait states would likely result in a drop in performance or frame stutter.
This feature isn't entirely new though; Fiji also has this capability, and with driver tuning, the Fury X with only 4GB of HBM1 is able to cope with games at 4K resolution that need 6 or 8GB of assets.
Remember the Radeon Pro SSG, a Fiji card with an M.2 SSD onboard to handle large datasets?
The improvement in Vega's ability, I think, comes down to what AMD calls fine-grained data movement.
Fiji could use its HBM as a cache, but I suspect it evicts and streams in larger chunks, which is fine for games or 8K video encoding, but it may suffer in some scientific workloads.
This feature, while AMD sold it for games, is very much HPC functionality.
For gaming, I suspect a 4GB Vega would be a tough sell even if AMD talks up its caching ability.
There is an advantage that HBM2 offers gamers though, and that's down to power and die-space savings.
HBM2 moves a portion of the memory controller logic into the base die of each stack, so the on-GPU part can be smaller, freeing up space for other performance-enhancing features.
Power-wise, my guess is that 8GB of HBM2 saves around 20W; it's not much, but that's around 10% better performance per watt when it comes to a high-end GPU.
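The back-of-envelope behind that 10% figure (the board power is my own assumption):

```python
board_power_gddr5 = 225.0          # assumed total board power with a GDDR5-style setup, in watts
hbm2_saving       = 20.0           # the saving guessed above
perf_per_watt_gain = hbm2_saving / (board_power_gddr5 - hbm2_saving)
print(perf_per_watt_gain)          # ~0.098 -> roughly 10% better performance per watt at equal performance
```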
A big change that I am excited about with Vega is Rapid Packed Math, or two FP16 operations per cycle, and while most have focused on its advantages for HPC, in particular deep-learning training, I'm excited about its potential in gaming.
Games currently use the standard FP32 format for all of their shaders, but there are many effects which do not need this level of accuracy and will work fine with FP16.
Now, a game is not suddenly going to run 2x faster, but the specific effects which often incur large performance penalties can indeed run twice as fast.
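The packing idea itself is simple; here's a minimal sketch using NumPy half-precision types to show two FP16 values sharing one 32-bit lane, which is what lets a 32-bit-wide ALU pass carry two operations.

```python
import numpy as np

pair = np.array([1.5, -2.25], dtype=np.float16)   # two FP16 shader values
packed = pair.view(np.uint32)[0]                  # both halves live in one 32-bit word
print(hex(packed))                                # 0xc0803e00 on little-endian: low half 1.5, high half -2.25
```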
You may think it's not going to be a big deal because most GPUs on the PC would be crippled by FP16 shaders.
You would be partially right.
All of NVIDIA's consumer GPUs emulate FP16, often at very slow rates, so if game developers use FP16 shaders, they would also have to include a fallback to regular FP32 shaders for these GPUs.
AMD's recent GCN GPUs can handle FP16 at full speed, so there's more flexibility there, but the dominant market share belongs to NVIDIA; as such, FP16 adoption should be minimal, in theory.
In practice, in the next few years, what we will see is a big push from game studios towards
FP16 shaders.
The consoles have around 4 and 6 Teraflop GPUs, but gamers demand fluid 4K gaming, so developers must come up with better optimizations to extract more performance out of the hardware.
By switching to FP16 shaders, suddenly the PS4 Pro has over 8 Teraflops.
This is definitely the path towards 4K at 60 FPS on high settings for these consoles.
Of course, you could argue, as I have, that NVIDIA could sponsor the PC port and have FP16 shaders removed.
It's not only possible, it makes the most tactical sense for NVIDIA to slow down FP16 adoption in PC gaming, as it only benefits AMD.
But here's where AMD can fight back: game studios want more performance out of the consoles, and AMD engineers can be vital in the shift to FP16 within their game engines.
AMD just needs to reach out, promote and help game studios achieve this, and the studios will reciprocate by including FP16 alongside FP32 shaders when they release their games on the PC.
Do you think it was a coincidence that Polaris was modified for fast FP16 performance for the consoles?
It's no coincidence, and AMD stands to gain from this, but they need to push quickly, seeding their engineers and forming partnerships with major game studios to take advantage of the potentially huge performance gains.
Think of it like untapped power reserves within Vega, and if FineWine is used to describe
GCN longevity or future proofing, Vega takes this to the next level as it becomes a 25+
Teraflops gaming monster.
However, as a counterpoint: in the longer term, what's stopping NVIDIA's consumer Volta from offering 2x FP16 performance?
Money.
NVIDIA loves to segment their GPUs for consumer vs HPC by neutering these unique capabilities,
whether it's FP64 or FP16 performance.
NVIDIA is unlikely to offer twice the FP16 performance on consumer GPUs because it has
the potential to cannibalize their Tesla sales.
As such, I think gaming FP16 is potentially AMD's advantage for the next few years.
At least until the next NVIDIA architecture beyond Volta.
The last major change in Vega is the shift to coherent pixel & texture memory, where the render back-ends become clients of the large L2 cache.
In game engines that use render-to-texture or deferred shading, this is a performance and power-efficiency gain.
Most modern game engines are actually deferred, so this change is much needed for gaming, and in particular for VR performance.
The really interesting stuff about this change is in a slide which AMD didn't talk much about: the Infinity Fabric linking Vega, directly from its L2 cache, to external CPUs & PCIe.
This has major implications for HPC and HSA performance, though that's beyond the scope of this video, and it has gone on long enough.
In summary, Vega should be a leap forward in GCN efficiency, both in terms of performance per mm² and performance per watt.
Vega will clock higher, offer better geometry & pixel culling and throughput, and higher Stream Processor utilization, and it also brings a renewed focus on HPC performance.
We know that Vega is approximately 10 to 15% larger than Pascal GP102 which powers the
1080Ti and Titan Xp.
With all of its architectural improvements, Vega has a very good chance to take the lead, but ultimately it's going to come down to whether AMD's driver team is able to tap into this new architecture at launch.
Because it is a major change, performance is even more reliant on optimized drivers, so let's hope for the best for competition's sake; but since it's AMD, expect an average launch with future drivers lifting Vega's performance.
Finally, price-wise, there are a few things which make me think Vega is not going to be priced at ridiculous levels, in particular that it's limited to only 2 stacks of HBM2, and the talk from AMD about cost pressures and their desire to regain market share.
For gamers, Vega can only disrupt the GPU market & regain market-share if it offers
exceptional bang for buck, very much like Ryzen trading blows with Intel's 6900K at
half the price.
So, feel free to get Hyped!
But, also wish AMD's driver team well, because as great as hardware is, without good software
it can be a dud.
These videos take me much longer to make than I expected so apologies for the abrupt ending
last time.
Hopefully this has been insightful for you; thank you for watching and see you next time.