Intel’s “Ponte Vecchio” GPU Had Better Not Be A Bridge Too Far

August 24, 2021



It’s pretty obvious to everyone who watches the IT market that Intel needs an architectural win that leads to a product win in datacenter compute. And it’s pretty clear that the top brass at Intel are putting a lot of chips down on the felt table that the considerably delayed “Ponte Vecchio” Xe HPC GPU accelerator, which brings to bear all the technologies and techniques the company can muster to create a powerful device, is the big bet for big iron.

With Intel’s Architecture Day behind us and the Hot Chips 33 conference happening right now, it is a good time to take a hard look at the Ponte Vecchio device and see just what Intel is doing and what kinds of results the chiplet approach, with 2D EMIB stacking and 3D Foveros stacking combined in a single package, is yielding for a complex and powerful device.

This thing is a Byzantine beast, but like every other complex device ever made, it is made from components that make sense in their own right, interconnected in clever ways to deliver what looks like a monolith. What was true of the literal Ponte Vecchio bridge spanning the Arno River in Florence, Italy at the turn of the 10th century or a cathedral built at the turn of the 13th century was as true for the Apollo moon launch in 1969 and is as true for the amazing processing devices being built during the AI revolution of the early 21st century.

Chips are just another kind of stained glass, and the projects that drive them are also moonshots. NASA’s Apollo project landed people on the Moon, and was the culmination of a $28 billion effort by the United States government to be first to do so.

That number might seem like a lot, and if you adjust the total budget of the lunar missions between 1960 and 1973 to current dollars, it is more like $280 billion. Which seems like a lot. But perspective on what moonshots really cost is important. Between 1960 and 1973, the US federal budget was a cumulative $2.06 trillion, and the lunar missions cost under 1.4 percent of that; US gross domestic product from that same period adds up to $11.88 trillion, and the lunar program represented 24/10,000ths of GDP. Granted, the 1960s saw the rise of social program spending in America as well as the Vietnam War, both of which ballooned the federal budget and therefore make the lunar program seem to cost less than it might have otherwise.
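Those two ratios are easy to check with the article’s own figures; a quick sanity-check sketch:

```python
# Sanity check on the Apollo cost comparisons above.
# All figures are the article's, in billions of nominal dollars.
apollo_cost = 28.0          # lunar program spending, 1960-1973
federal_budget = 2_060.0    # cumulative US federal budget, 1960-1973
gdp = 11_880.0              # cumulative US GDP, 1960-1973

budget_share = apollo_cost / federal_budget
gdp_share = apollo_cost / gdp

print(f"share of federal budget: {budget_share:.1%}")    # ~1.4%
print(f"share of GDP: {gdp_share * 10_000:.0f}/10,000")  # ~24/10,000
```

Both figures line up with the percentages quoted above.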

Chip makers are by their very nature less diverse than a national economy, and their moonshots are undoubtedly more costly relative to their size. For startups, everything is, relatively speaking, a moonshot. In any event, this is the same metaphor that Jensen Huang, co-founder and chief executive officer of Nvidia, used when talking about the development of the “Volta” Tesla GPU accelerator that debuted in May 2017, and it is the language that Raja Koduri, Intel’s chief architect and now head of its new Accelerated Computing Systems and Graphics Group, used to describe Intel’s architectural approach for future products coming next year and beyond.

“The first step in making progress is to admit we have a problem,” explained Koduri in the section of Architecture Day discussing Ponte Vecchio. “At Intel, we had a problem – almost a decade-long problem. We were behind on throughput compute density and support for high bandwidth memories, both of which are essential metrics for HPC and AI and are the cornerstones of GPU architecture.” And then he flashed up this chart, showing where Intel has been and where it is going with Ponte Vecchio:

The blue line is Intel and the green line is Nvidia.

“We really wanted to close this gap in one shot, so we needed a moonshot,” Koduri continued. “We set for ourselves some very ambitious goals. We started a brand new architecture built for scalability, designed to take advantage of the most advanced silicon technologies. And we leaned in fearlessly.”

None of us believe the fearless part. Having delayed manufacturing processes at the same time you are changing architectures is plenty scary. But being afraid should not be paralyzing, and swallowing its pride and shifting to Taiwan Semiconductor Manufacturing Co to make some of the components of the Ponte Vecchio package has probably saved Intel’s efforts to be a real player in GPU compute. If Intel Foundry Services gets its manufacturing act together and offers a fair price for etching chips, then it can regain that business. But Intel’s foundry cannot hold up Intel’s chip business ever again. This much is clear. Breaking the foundry out, as chief executive officer Pat Gelsinger has done, makes it accountable to the other parts of Intel as well as opening it up to co-design and manufacture chips for other companies, particularly those in the United States and Europe, where it has foundry operations.

What is clear right now is that Intel needs several successful moonshots to get itself back into contender positions in CPUs, GPUs, and FPGAs. The competition is pressing in on all sides. But if Ponte Vecchio is delivered early next year as planned, then Intel has a shot at getting its rightful piece of a new market.

The first test is the “Aurora” A21 supercomputer being installed at Argonne National Laboratory, which will have tens of thousands of these Ponte Vecchio GPU accelerators installed, representing the overwhelming majority of the floating point processing power in a system that was expected to have at least 1.1 exaflops and that is rumored to have 1.3 exaflops of aggregate compute. (These are peak theoretical performance ratings.) This machine has been delayed numerous times, and its latest incarnation, with a pair of “Sapphire Rapids” Xeon SP processors (possibly with HBM memory) and six Ponte Vecchio GPU accelerators per node, was expected to make it onto the June 2021 Top500 supercomputer rankings so it would have some time on the list before the 1.5 exaflops “Frontier” hybrid CPU-GPU system based on AMD motors would take over on the November 2021 list. Here is an actual picture of the Aurora node:

There is some chatter that the A21 node will use HBM2 memory on the Sapphire Rapids processors, and while this may be true – you can’t see the CPU in the picture above because it has been inverted in the two sockets behind the board – the A21 node clearly has eight DDR5 DIMM slots per socket. You can use both HBM and DRAM, and we would certainly do that if we were Argonne. Particularly if all of the HBM2 memory on the CPUs and GPUs can be clustered over Xe Link interconnects, as we suspect it can be.

Building Ponte Vecchio One Block At A Time

Being a parallel processor, a GPU accelerator can be assembled out of myriad compute elements that are wrapped up into bundles and then packaged and stacked to scale out its parallel performance. And that is precisely what the Ponte Vecchio Xe GPU does, as Hong Jiang, Intel Fellow and director of GPU compute architecture in the Accelerated Computing Systems and Graphics Group, explained during his Architecture Day presentation. It all starts with the Xe core, which Intel calls the Xe HPC Core, and it looks like this:

This is similar to the Xe cores used in other GPU variants created by Intel for regular laptops, regular desktops, gaming machines, or scientific workstations, but those have different compute elements and a different balance of elements and scale, just as Intel’s CPUs do.

The Xe HPC Core has eight matrix math units that are 4,096 bits wide each and eight vector math units that are similar in concept to the AVX-512 units on the past several generations of Xeon SP CPUs. These are all front-ended by a load/store unit that can handle an aggregate of 512 bytes per clock cycle. That load/store unit is itself fronted by a 512 KB shared local memory (SLM) that acts as an L1 cache, plus a separate instruction cache. We think that Intel is laying down FP64 units and FP32 units distinctly on the vector engine, much as Nvidia does with the CUDA cores on its GPUs. It looks like there are 256 of the FP64 units and 256 of the FP32 units on each vector engine, and the FP32 units can be double pumped to deliver 512 operations per clock in FP16 mode. Yes, Intel could have just created an FP64 unit and carved it up into two or four pieces to get FP32 and FP16 modes, but this way, an intelligent, multitasking dispatcher can allocate work to two kinds of units at the same time. (As Nvidia has done with its GPUs for a while.)

The matrix math engines, which are roughly analogous to the Tensor Core units in the “Volta” GV100 and “Ampere” GA100 GPUs from Nvidia, support mixed precision modes, including Nvidia’s own TensorFloat (TF32) format as well as Google’s BFloat (BF16) format, in addition to the FP16 and INT8 formats commonly used in machine learning training and inference.

The vector engine thinks in 512-bit data chunks and the matrix engine thinks in 4,096-bit chunks at FP16 precision, it looks like.
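If we assume the lane count of each engine is simply its datapath width divided by the operand size – an inference on our part, not an Intel disclosure – the widths translate into per-clock lane counts like so:

```python
# Back-of-the-envelope lane math for the Xe HPC Core datapaths.
# Assumes lanes = datapath width / operand width, which is our
# reading of the widths above, not an Intel-published breakdown.
VECTOR_WIDTH_BITS = 512
MATRIX_WIDTH_BITS = 4_096

def lanes(width_bits: int, precision_bits: int) -> int:
    """Number of operands of a given precision that fit in one datapath pass."""
    return width_bits // precision_bits

# One 512-bit vector engine pass:
print(lanes(VECTOR_WIDTH_BITS, 64))  # 8 FP64 lanes
print(lanes(VECTOR_WIDTH_BITS, 32))  # 16 FP32 lanes
# One 4,096-bit matrix engine pass at FP16:
print(lanes(MATRIX_WIDTH_BITS, 16))  # 256 FP16 lanes
```

That 256-wide FP16 figure is consistent with the matrix engine being the dense-math heavyweight of the core.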

The Xe HPC Cores are stacked up with ray tracing units – this is still a graphics card and will often be used for rendering and visualization in addition to compute, and often within the same application, which is a benefit – into what Intel calls a slice. Here is what an Xe HPC Slice looks like:

This particular slice has 16 cores and 16 ray tracing units, paired one apiece to each core, with a total of 8 MB of L1 cache across the slice. That slice has its own distinct hardware context within the Ponte Vecchio GPU complex.

An Xe HPC Stack is a complex comprised of four slices that has a total of 64 Xe HPC Cores and 64 ray tracing units with four hardware contexts, all linked to a shared L2 cache, four HBM2E memory controllers, a media engine, and eight Xe Link ports to connect out to adjacent GPU complexes for coherent memory as well as to CPUs that have Xe Link ports. Here is what the Xe HPC Stack looks like:
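The core-to-slice-to-stack hierarchy rolls up by straight multiplication; a quick sketch using the counts above:

```python
# Rolling up the per-block counts from the article into a stack
# and a two-stack device.
CORES_PER_SLICE = 16
RT_UNITS_PER_SLICE = 16
SLICES_PER_STACK = 4
STACKS_PER_DEVICE = 2   # the two-stack chiplet complex described below

stack_cores = CORES_PER_SLICE * SLICES_PER_STACK
stack_rt_units = RT_UNITS_PER_SLICE * SLICES_PER_STACK
device_cores = stack_cores * STACKS_PER_DEVICE

print(stack_cores, stack_rt_units)  # 64 64, matching the stack spec
print(device_cores)                 # 128 cores across a two-stack device
```

Those 128 cores per device are the same total that shows up again when we get to the package-level tile counts.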

We strongly suspect that the Sapphire Rapids Xeon SP with HBM2E memory has Xe Link ports on it, and if it doesn’t, then it should. The Xe HPC Stack has a PCI-Express 5.0 controller on it as well as a copy engine, and the diagram shows three media engines, not one. Go figure. There is also a stack-to-stack interconnect that hangs off of the L2 cache memory for the stack, just as the Xe Link ports do. Think of this like the NUMA interconnect for server chips, where some NUMA links are local and some are remote. The stack-to-stack interconnect is for local links, presumably. It looks like Intel has a two-stack chiplet or chiplet complex, which is shown interlinked like this:

The Xe Link interconnect is interesting in that it includes high speed SerDes as well as a link fabric sublayer attached to a switch and a bridge that allows for up to eight Xe HPC GPUs to be interlinked gluelessly to each other with full load/store, bulk data transfer, and sync semantics across all eight GPUs. This Xe Link chip has 90 Gb/sec SerDes and an eight-port switch with eight links per tile. It is etched using TSMC’s N7 7 nanometer process.

As we said, we think the reason that the A21 node has only six GPUs is that two of the Xe Link ports in the Xe Link switch fabric are used to bring the two Sapphire Rapids sockets into the shared HBM2E memory complex. Anyway, here is what an eight GPU machine, where each GPU is comprised of two interconnected stacks, would look like in terms of its aggregate vector and matrix compute:

That diagram above is a statement of the possibilities of the Xe HPC architecture. Here is what the Ponte Vecchio device does to implement that architecture. First, here is the top view of the Ponte Vecchio package:

And then here is the exploded view that shows how EMIB and Foveros stacking come into play to make these chiplets work together like one big monolith. Have a look:

Each of the compute tiles shown at the top has eight Xe HPC Cores with 4 MB of L1 cache and a 36 micron bump pitch on the Foveros 3D stacking interconnect. These Xe HPC Core tiles are etched using TSMC’s N5 5 nanometer process, which started risk production in March 2019 and which ramped into production in April 2020. There are two distinct tile complexes on the Ponte Vecchio package, for a total of 128 Xe cores and 64 MB of L1 cache across those cores.

There are two base tiles, which have a PCI-Express 5.0 interface, 144 MB of L2 cache, and links to the HBM2E memory as well as EMIB interconnects between the eight Xe tiles. So the whole Ponte Vecchio complex has two PCI-Express 5.0 controllers and 288 MB of L2 cache. There are two HBM2E tiles, but we do not know how much memory capacity they have.

The whole Ponte Vecchio complex has chiplets that make use of five different manufacturing process nodes across Intel and TSMC, a total of 47 chiplets, and over 100 billion transistors in the aggregate across those tiles. The A0 silicon (the initial stepping of the chip) has just been assembled, and it delivers more than 45 teraflops at FP32 precision and has more than 2 TB/sec of memory fabric bandwidth and over 2 TB/sec of connectivity bandwidth, according to Intel.
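If we take each Xe HPC Core to deliver 256 FP32 operations per clock – one possible reading of the vector engine figures above, and very much our assumption rather than an Intel disclosure – the 45 teraflops claim implies a clock speed in the neighborhood of 1.4 GHz. A hypothetical sketch of that math, plus the cache totals from the tile counts:

```python
# Hypothetical sanity check: what clock would the A0 silicon need to hit
# Intel's ">45 teraflops FP32" claim? Assumes 128 Xe HPC Cores and
# 256 FP32 ops per core per clock -- our inference, not an Intel figure.
CORES = 128
FP32_OPS_PER_CORE_PER_CLOCK = 256
TARGET_TFLOPS = 45.0

implied_clock_ghz = (TARGET_TFLOPS * 1e12
                     / (CORES * FP32_OPS_PER_CORE_PER_CLOCK)
                     / 1e9)
print(f"implied clock: {implied_clock_ghz:.2f} GHz")  # ~1.37 GHz

# Package cache totals from the tile counts above:
l1_total_mb = 16 * 4    # 16 compute tiles x 4 MB L1 each = 64 MB
l2_total_mb = 2 * 144   # 2 base tiles x 144 MB L2 each = 288 MB
print(l1_total_mb, l2_total_mb)
```

A clock in the high-1.3 GHz range would be plausible for a big GPU on N5, which gives us a little confidence in that reading, but we stress it is arithmetic from inferred unit counts, not a spec.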

Intel is packaging Ponte Vecchio up in a form factor that looks familiar – it is the Open Accelerator Module form factor that Facebook and Microsoft announced two years ago. OAM will support PCI-Express and Xe Link variants, of course, and we can expect standalone PCI-Express cards as well, even though Intel is not showing them. Here is the Ponte Vecchio package for real:

And here are some of the other server packaging options Intel has lined up aside from the six-way configuration being used in the Aurora A21 machine:

Before we can make any intelligent comparisons to Nvidia and AMD GPUs, we need to know whether that is matrix or vector FP32 throughput for the device, and we need to know whether Intel will have matrix sparsity support in its devices. We also need to know what it will cost. Once we know these things, then we can make some real comparisons. And we are excited to do just that.
