• blueredscreen@alien.topB

    Jensen Huang said it best: a GPU is the perfect balance between being so specialized that it isn’t worthwhile, and so general that it becomes just another CPU. And Nvidia does add custom silicon when necessary, like the tensor cores, but again, that’s in addition to, not a replacement of, the existing hardware. Considering the hundreds of AI accelerator startups (a few of which have already failed), he’s right.

    • theQuandary@alien.topB

      He’s only right in the short term when the technology isn’t stable and the AI software architectures are constantly changing.

      Once things stabilize, we’re most likely switching to either analog compute-in-memory or silicon photonics, both of which will be far less generic than a GPU, but with such a massive power, performance, and cost advantage that GPUs simply cannot compete.

        • theQuandary@alien.topB

          First up, here’s a Veritasium breakdown of why a lot of next-gen AI hardware is leaning into analog computing to save space and power while increasing total computations per second.

          https://www.youtube.com/watch?v=GVsUOuSjvcg

          The unreliability of analog makes it unsuited for the deterministic algorithms we normally run on computers, but it doesn’t have large negative effects on AI algorithms because of their low-fidelity nature (and for some algorithms, getting some free entropy is actually a feature rather than a bug).

          Here’s an Asianometry breakdown of silicon photonics.

          https://www.youtube.com/watch?v=t0yj4hBDUsc

          Silicon photonics is the use of light, rather than electrical signals, to move data within and between chips. It’s been in research for decades and is already seeing limited use in some networking applications. IBM in particular has been researching it for a very long time in hopes of solving some chip-communication issues, but there are a lot of technical problems to solve before you can put billions of these components in a CPU.

          AI changed the equation because it allows analog compute. A digital multiply generally takes 4-5 cycles, with each cycle doing a bunch of shift-then-add operations in series. With silicon photonics, a multiply is as simple as turning on two emitters, merging the light, and recording the output. If you want to multiply 10 numbers together, you can do it in ONE cycle instead of the 40-50 a normal chip would need (not counting all the setup instructions that normal multiplier circuit likely requires).
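
          To make the shift-then-add point concrete, here is a minimal Python sketch (purely illustrative, not a model of any particular chip) of how a simple digital multiplier steps through one operand bit by bit. The step count grows with the width of the operands, whereas the photonic version described above is conceptually a single merge-and-measure operation.

          ```python
          # Shift-and-add multiplication, roughly the way simple digital hardware does it:
          # one conditional add plus a shift per bit of the second operand.
          def shift_add_multiply(a, b):
              """Multiply two non-negative ints; return (product, number of steps)."""
              product, steps = 0, 0
              while b:
                  if b & 1:          # is the current bit of b set?
                      product += a   # add the shifted multiplicand
                  a <<= 1            # shift for the next bit position
                  b >>= 1
                  steps += 1
              return product, steps

          product, steps = shift_add_multiply(183, 97)
          print(f"183 * 97 = {product} after {steps} shift/add steps")
          # A photonic multiplier would instead encode the operands in light,
          # merge the signals, and read the result out in a single measurement.
          ```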

          Here’s a quick IBM explainer on in-memory compute.

          https://www.youtube.com/watch?v=BTnr8z-ePR4

          Basically, it takes several times more energy to move two numbers into a CPU than it does to add them together. Ohm’s law lets us do analog multiplication instead: store one number as a conductance, apply the other as a voltage, and the current you measure at the output is their product.

          You can use this to do calculations, and the beauty is that your data hardly has to travel at all, and you were already spending energy refreshing it regularly anyway. The total clock speed is far lower due to the physical limitations of capacitors, but if you can be computing on every single cell of a multi-terabyte matrix at the same time, that really doesn’t matter: your total compute power will be massively higher in aggregate AND use several times less power.
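
          Here is a minimal sketch of that idea, assuming NumPy; the array sizes and noise level are made up for illustration. The weights sit in the array as conductances, the inputs are applied as voltages, Ohm’s law does the multiplies, and Kirchhoff’s current law does the sums.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          weights = rng.normal(size=(256, 512))   # values "stored" in the memory array
          inputs = rng.normal(size=512)           # activation vector applied as voltages

          # Ohm's law: each cell's current is conductance * voltage, so every cell
          # does its multiply at the same time, right where the data already lives.
          cell_currents = weights * inputs
          # Kirchhoff's current law: currents on a shared wire sum automatically,
          # giving one dot product per row with essentially no data movement.
          analog_output = cell_currents.sum(axis=1)

          # Analog circuits are imprecise; model that as a bit of additive noise.
          noisy_output = analog_output + rng.normal(scale=0.05 * analog_output.std(),
                                                    size=analog_output.shape)

          exact = weights @ inputs
          error = np.mean(np.abs(noisy_output - exact)) / np.mean(np.abs(exact))
          print(f"typical error as a fraction of output magnitude: {error:.3f}")
          ```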

          Of course, all these analog alternatives have absolutely nothing in common with modern GPUs, but simple operations are massively more power efficient with in-memory compute and complex operations are massively more power efficient with silicon photonics.

      • blueredscreen@alien.topB

        He’s only right in the short term when the technology isn’t stable and the AI software architectures are constantly changing.

        Once things stabilize, we’re most likely switching to either analog compute-in-memory or silicon photonics, both of which will be far less generic than a GPU, but with such a massive power, performance, and cost advantage that GPUs simply cannot compete.

        That’s what they said. Nothing about AI is going to stabilize. The pace of innovation is impossible to keep up with. I’m sure things were happy at SambaNova too, until they went bye-bye and Nvidia itself hired their lead architect.

        • theQuandary@alien.topB

          I heard this same stuff in the 90s about GPUs. “GPUs are too specialized and don’t have the flexibility of CPUs”.

          Startups failing doesn’t prove anything. There are dozens of startups and there will only be 2-4 winners. Of course MOST are going to fail. Moving in too early before things have settled down or too late after your competitors are too established are both guaranteed ways to fail.

          In any case, algorithms and languages have a symbiotic relationship with hardware.

          C is considered fast, but did you know that it SUCKS for old CISC ISAs? They are too irregular and make a lot of assumptions that don’t mesh well with the compute model of C. C plus x86 is where things changed: x86 could be adapted to run C code well, C compilers then adapted to be fast on x86, then x86 adapted to run that compiled C code better, and the loop goes round and round.

          This is true for GPUs too. Apple’s M1/M2 GPU design isn’t fundamentally bad, but it is different from AMD’s and Nvidia’s, so programmers’ hardware assumptions and usual optimizations aren’t effective. The same applies to some extent to Intel Xe, where they’ve been spending huge amounts to “optimize” various games (most likely literally writing new code to replace the original game code with versions optimized for their ISA).

          The same will happen to AI.

          Imagine that one of those startups gets compute-in-SSD working. Now you can do compute on models that would require terabytes of RAM on a GPU. You could get massive amounts of TOPS on massive working sets using just a few watts of power, on a device costing just a few hundred dollars. That’s in stark contrast to a GPU that costs tens of thousands of dollars, costs a fortune in power to run, and still can’t work on a model that big because its memory hierarchy is too slow.

          Such a technology would warp the algorithms around it. You’ll simply be told to “make it work”, and creative people will find a way to harness that compute power, especially as it is already naturally tuned to AI needs. Once that loop gets started in earnest, the cost of switching algorithms and running them on a GPU will be far too high. Over time it will be not just cost, but also ecosystem lock-in.

          I’m not saying that compute-in-memory will be the winner, but I’m quite certain that the GPU is not, because literally ALL of the prominent algorithms get faster and use less power with their own specific ASIC accelerators.

          Even if we accept the worst-case scenario and 2-4 approaches rise to the top and each requires a separate ASIC, the situation STILL favors the ASIC approach. We can support dozens of ISAs for dozens of purposes. We can certainly support 2-4 ISAs with 1-3 competitors for each.

          • blueredscreen@alien.topB

            I heard this same stuff in the 90s about GPUs. “GPUs are too specialized and don’t have the flexibility of CPUs”.

            Startups failing doesn’t prove anything. There are dozens of startups and there will only be 2-4 winners. Of course MOST are going to fail. Moving in too early before things have settled down or too late after your competitors are too established are both guaranteed ways to fail.

            It’s rather convenient to blame your failure on being too smart too early instead of just facing the genuine lack of demand for your product.

            C is considered fast, but did you know that it SUCKS for old CISC ISAs? They are too irregular and make a lot of assumptions that don’t mesh well with the compute model of C. C plus x86 is where things changed: x86 could be adapted to run C code well, C compilers then adapted to be fast on x86, then x86 adapted to run that compiled C code better, and the loop goes round and round.

            Nothing about modern x86 architectures constitutes any classic model of “CISC” under the hood; the silicon decodes the machine code into ops that, for all intents and purposes, could be abstracted to any ISA.

            This is true for GPUs too. Apple’s M1/M2 GPU design isn’t fundamentally bad, but it is different from AMD’s and Nvidia’s, so programmers’ hardware assumptions and usual optimizations aren’t effective. The same applies to some extent to Intel Xe, where they’ve been spending huge amounts to “optimize” various games (most likely literally writing new code to replace the original game code with versions optimized for their ISA).

            What?

            Even if we accept the worst-case scenario and 2-4 approaches rise to the top and each requires a separate ASIC, the situation STILL favors the ASIC approach. We can support dozens of ISAs for dozens of purposes. We can certainly support 2-4 ISAs with 1-3 competitors for each.

            Again, they all said that before you, and look where they are now. (hint hint: Nvidia)

    • GomaEspumaRegional@alien.topB

      Most HW startups fail because they never get the SW story right.

      At the end of the day, hardware is used to run software. So unless you have access to a large software library from the get-go (by accelerating a known entity or architecture), or you truly have a fantastic value proposition in terms of being orders of magnitude faster than the established competition along with a solid roadmap for both HW and SW, the best most HW startups can hope for is an exit where their IP is bought by a bigger player.

      HW people sometimes miss the boat: if something is 2x as fast but takes 2x as long to develop for, you’re not giving your customers much of a leadership window. So they’ll stay with the known entity, even if it’s less efficient or performant on paper.

      • norcalnatv@alien.topB

        no.

        Raja designed Ponte Vecchio (and family) for the Aurora supercomputer, collaborating with Jim Keller while Keller was at Intel for a short time.

        Gaudi came out of Habana Labs, an Israeli company Intel recently purchased.

    • bubblesort33@alien.topB

      RDNA1 and 2 were pretty successful. Vega was very successful in APUs; it just didn’t scale well for gaming, but was still successful in the data center. You can’t hit them all, especially when you have a fraction of your competition’s budget.

      Also, he ran graphics divisions, not a Walmart. People don’t fail upwards in these industries at these levels. When they fail upwards in other industries, they fail into middle management: somewhere out of the spotlight and the public eye, where you don’t get to make final decisions; somewhere to push you out of the way. Leading one of fewer than a handful of graphics divisions in the world is not where you land.

      • Exist50@alien.topB

        RDNA1 and 2 were pretty successful

        AMD’s dGPU market share has been doing terribly these last few years. Clearly they weren’t successful enough.

        Vega was very successful in APUs

        Compared to what? Intel?

        but was still successful in the data center

        AMD’s market share in the data center is pretty darn negligible.

        Also, he ran graphics divisions, not a Walmart. People don’t fail upwards in these industries at these levels.

        Why do you think it’s any different? There are plenty of high-profile examples, even at the CEO level.

        The fact is that every notable initiative Raja’s been a part of for the past decade-ish has been a failure vs original targets. At a certain point, one has to acknowledge a common factor.

  • DuranteA@alien.topB

    Specialized hardware can make sense for inference of known networks or, a bit more broadly, known network structures. But for training and research, the structure of models still seems to be in too much flux for specialization much beyond the level of a modern GPU to make sense. For now, at least.

    The (research) ecosystem would have to settle down a bit before you can get a really substantial improvement out of specialization and be confident that your architecture can still run the latest and greatest ~2 years after you designed it.

    • capn_hector@alien.topB

      this has been my take; it’s an obvious case of the 80-20 rule. During times of breakthrough/flux, NVIDIA benefits from having both the research community onboard and a full set of functionality, great tooling, etc. When things slow back down, you’ll see Google come out with a new TPU, Amazon will have a new Graviton, etc.

      it’s not that hard in principle to staple an accelerator to an ARM core; actually, that’s kind of a major marketing point for ARM. And nowadays you’d want an interconnect too. There are a decently large number of companies who can sustain such a thing at reasonably market-competitive prices. So once the market settles, the margins will decline.

      On the other hand, if you are building large, training-focused accelerators, etc., it is also going to be a case of convergent evolution. In the abstract, we are talking about massively parallel accelerator units with some large memory subsystem to keep them fed, and some type of local command processor to handle the low-level scheduling and latency-hiding. Which, gosh, sounds like a GPGPU.

      If you give it any degree of general programmability, then it just starts to look very much like a GPU. If you don’t, then you risk falling off the innovation curve the next time someone has a clever idea, just like previous generations of “ASICs”. And you are doing your tooling, infrastructure, and debugging all from scratch too, with much less support and fewer resources. GPGPU is turnkey at this stage: do you want your engineers building CUDA, or do you want them building your product?

      • theQuandary@alien.topB

        It’s also a technology issue. We have companies working on compute-in-memory, which should offer such massive power and cost savings that companies might warp their algorithms around it to save all that money. The same goes for silicon photonics.

        It’s way too early to be certain about anything, so companies are going with the least risky option, and idiots are pouring billions into adding “AI” in places where it doesn’t belong, just like they did with “big data” a few years ago (we’re currently doing that at my company).

  • bubblesort33@alien.topB

    “Why? I am still learning, but my observations so far: the ‘purpose’ of purpose-built silicon is not stable. AI is not as static as some people imagined and trivialize [like] ‘it is just a bunch of matrix multiplies’.”

    But it is stable in a lot of cases, is it not? I mean, if you’re training a system for autonomous driving, or training a system for image generation, it seems pretty stable. But for gaming it certainly needs flexibility. If we want to add half a dozen features to games that rely on ML, it seems you need a flexible system.

    That does remind me of how Nvidia abandoned Ampere and Turing when it comes to frame generation, because they claim the optical flow hardware isn’t strong enough. What exactly is “optical flow”? Is it a separate type of machine learning hardware? Or is it not related to ML at all?

    • XYHopGuy@alien.topB

      Model architectures change, not the training objective (e.g., self-driving cars).

      An optical flow accelerator speeds up computing the direction and magnitude of motion between frames of a moving image. Its output is used as an input to ML; it’s related to ML in the same way a camera is.
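
      If you want to see what that looks like in practice, here is a small sketch using OpenCV’s dense Farneback optical flow on a pair of synthetic frames; it’s a rough software stand-in for what a hardware optical-flow unit computes, and the frames and parameters here are made up purely for illustration. The output is a per-pixel motion vector field, which is the kind of signal a frame-generation model consumes alongside the frames themselves.

      ```python
      import cv2
      import numpy as np

      rng = np.random.default_rng(0)

      # Two synthetic grayscale frames: the second is the first shifted 3 px to the right.
      frame1 = (rng.random((120, 160)) * 255).astype(np.uint8)
      frame1 = cv2.GaussianBlur(frame1, (7, 7), 0)   # smooth so there is trackable texture
      frame2 = np.roll(frame1, shift=3, axis=1)

      # Dense optical flow: one (dx, dy) motion vector per pixel.
      # Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
      flow = cv2.calcOpticalFlowFarneback(frame1, frame2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

      dx, dy = flow[..., 0], flow[..., 1]
      print(f"mean motion: dx = {dx.mean():.2f} px, dy = {dy.mean():.2f} px (expected roughly 3 and 0)")
      ```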