
I'm not a computer architect (so my opinion shouldn't count in this thread), but as someone who did a lot of numerical programming over the years, I really thought Itanium looked super promising. The idea that you can indicate a whole ton of instructions can be run in parallel seemed really scalable for FFTs and linear algebra. Instead of more cores, give me more ALUs. I know "most" software doesn't have enough work between branches to fill up that kind of pipeline, but machine learning and signal processing can certainly use long branchless basic blocks if you can fit them in icache.
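To make that concrete, here's the kind of branchless block I mean, as a toy C sketch (the function name and the 4-way split are just illustrative): four independent dependency chains, so a wide machine has four multiply-adds it could issue in parallel every iteration.

    #include <stddef.h>

    /* Four independent accumulators = four dependency chains; a wide
       machine (or a VLIW bundle) can issue all four multiply-adds of
       an iteration in parallel. Remainder loop omitted for brevity. */
    double dot4(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }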

At the time, it seemed (to me at least) that it really only died because the backwards compatibility mode was slow. (I think some of the current perception of Itanium is revisionist history.) It's tough to say what it could've become if AMD64 hadn't eaten its lunch by running precompiled software better. It would've been interesting if Intel and compiler writers could've kept focus on it.

Nowadays, it's obvious GPUs are the winners for horsepower, and it's telling that we're willing to use new languages and strategies to get that win. However, GPU programming really feels like you're locked outside of the box - you shuffle the data back and forth to it. I like to imagine a C-like language (analogous to CUDA) that would pump a lot of instructions to the "Explicitly Parallel" architecture.

Now we're all stuck with the AMD64 ISA for our compatibility processor, and it seems like another example where the computing world isn't as good as it should be.



There's no free parallelism™️ though.

> Author: Agner | Date: 2015-12-28 01:46

> Ethan wrote:

> > Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility do not matter?

> The advantage of the Itanium instruction set was of course that decoding was easy. The biggest problem with the Itanium instruction set was indeed that it was almost impossible to write a good compiler for it. It is quite inflexible because the compiler always has to schedule instructions 3 at a time, whether this fits the actual amount of parallelism in the code or not. Branching is messy when all instructions are organized into triplets. The instruction size is fixed at 41 bits and 5 bits are wasted on a template. If you need more bits and make an 82 bit instruction then it has to be paired with a 41 bit instruction.

(https://www.agner.org/optimize/blog/read.php?i=425)

Besides, the memory consistency model of Itanium is also a brain teaser, used in interviews as a counterexample to poorly-synchronized solutions.


Agner is obviously brilliant, but I think maybe he's looking at general purpose applications.

If I'm doing a million point FFT, I can easily give you 2 million operations in a row without a loop/branch. Maybe 1000 of those at a time could be run in parallel before the results needed to commit for the next 1000. I'd be willing to pay for 1 or 2 nops in the last bundle of every 1000 operations. I admit, the idea might not be awesome for a word processor or spreadsheet, but I did specify signal processing and machine learning.


A common complaint about Itanium was that it was an overpriced DSP masquerading as a general-purpose CPU.


I actually kind of like that characterization :-)


For signal processing and machine learning, maybe you'd be better off with a systolic array processor or at least a bunch of deep Cray-style vector pipelines? And, like you said, GPUs seem to be doing better at those, in a vaguely Tera-like way, than the Itanic ever could have.


I never got to play on a Cray, but I remember working on a Convex for a couple semesters in school. I had no idea what I was doing back then.

Nowadays, it's pretty clear GPUs are the winner, but like I said, they just don't feel like you actually live and breathe inside of them the way you do with CPU code (you shuffle your data over and shuffle it back), and I'm kind of just imagining an alternative timeline where Intel and compiler writers got a chance to run with the EPIC idea.


The dark silicon era will presumably also be the heterogeneous hardware era.


> I admit, the idea might not be awesome for a word processor or spreadsheet,

Or a web server or browser. In fact, it pretty much only helps for your use case. Which is why people are converging on specialised hardware for it, and why Itanium was a commercial failure.


The computationally heavy part of a web browser is the rendering/video engine, which is standard super-parallel graphics stuff. Web servers are dominantly I/O-bound (or running wastefully slow scripting languages).

I know it's all a fantasy now, but I wonder what the world might have been like if there wasn't such a split between what you can only do on a CPU and what you can only do on a GPU. Maybe you like heterogeneous computing and gluing C++ to CUDA, but I think it's ugly. Stretch outside of the current box a little bit and imagine a hybrid somewhere in the middle ground of CPUs and GPUs. I think a variable sized VLIW could've gotten there if the market had any more imagination than it does. It's Ford's "faster horses" problem.


Check out Xeon Phi (not VLIW, though), which executed unmodified amd64 instructions on many cores. It never became a mainstream product, however, and Intel recently killed it. There are many reasons behind its sunset: tooling, the difficulty of programming it to reach max throughput, perf/cost ratio, ...

Related discussion: https://news.ycombinator.com/item?id=17606037


Itanium is essentially a VLIW architecture and... well, as the bottom of the page mentions, VLIW architectures tend to turn out to be bad ideas in practice.

GPUs showed two things: one, you can relegate kernels to accelerators instead of having to maximize performance in the CPU core; and two, you can convince people to rewrite their code, if the gains are sufficiently compelling.


Specifically, the promise of VLIWs (scalar parallelism) was overhyped--the Multiflow compiler couldn't find enough scalar parallelism in practice to keep all 28 ALUs busy (in the widest model we built), or even all 7 ALUs (in the narrowest).

(I ended up running the OS group at Multiflow before bailing right before they hit the wall.)


We need new programming models that make it easier to expose static parallelism to the compiler. Doing it all in plain old C/C++, or even in "managed" VM-based languages, cannot possibly work - and even conventional multi-threading is way too coarse-grained by comparison to what's most likely needed. Something based on a dataflow-oriented description of the code would probably work well, and be possible to integrate well enough with modern functional-like paradigms.


I reached the same conclusion. We need a way to be able to explicitly say "this piece of code here is a standalone unit that is independent of everything else until we have to coalesce results of computation", or "this unit manages those other units and coalesces their results".

Erlang's OTP captures this pretty okay-ish with their concept of processes (preemptively scheduled green threads that get aggressively multiplexed on all CPU cores) but I feel we can go a little bit further than that and have some sort of shorter markers in the code, say `actor { ... }` or `supervises(a1, a2, a3) { ... }` or something.
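In today's C, the closest spelling of that shape is thread plumbing like the sketch below (every name in it is made up for illustration); the whole point of dedicated `actor`/`supervises` markers would be to let the compiler see the independence directly instead of it being buried in library calls.

    #include <pthread.h>

    /* Hypothetical sketch: each "actor" is an independent unit of work,
       and the "supervisor" is the only place results are coalesced. */
    typedef struct { const int *data; int len; long sum; } Actor;

    static void *actor_run(void *arg)          /* the `actor { ... }` body */
    {
        Actor *a = arg;
        a->sum = 0;
        for (int i = 0; i < a->len; i++)
            a->sum += a->data[i];
        return NULL;
    }

    static long supervise(Actor *as, int n)    /* `supervises(a1, ..., an)` */
    {
        pthread_t t[n];
        long total = 0;
        for (int i = 0; i < n; i++)
            pthread_create(&t[i], NULL, actor_run, &as[i]);
        for (int i = 0; i < n; i++) {
            pthread_join(t[i], NULL);          /* the coalesce point */
            total += as[i].sum;
        }
        return total;
    }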


VISC seems at least potentially promising.

https://www.anandtech.com/show/10025/examining-soft-machines...


Yes, but VLIW's premise was that all that coordination that the VISC architecture is doing at runtime in hardware could be computed at compile-time in software.


Not in general; VLIWs work great in DSP architectures like the Hexagon. They tend to fall down in the presence of unpredictable memory accesses, though, and while Itanium had facilities to try to mitigate that, they didn't work well enough.


I think that another issue with VLIW as a general purpose ISA is that for it to be worthwhile the compiler has to have deep understanding of the underlying implementation (deep enough to always generate hazard-free code) such that the CPU does not have to contain any scheduling logic. This is the case for most embedded/DSP VLIW architectures. The issue with that is that there cannot reasonably be any kind of backward compatibility on the machine code level.


15 years ago I thought Itanium was the coolest thing ever. As a compilers student, a software-scheduled superscalar processor was kind of like a wet dream. The only problem is that that dream never materialized, due to a number of reasons.

First, compilers just could never seem to find enough (static) ILP in programs to fill up all the instructions in a VLIW bundle. Integer and pointer-chasing programs are just too full of branches, and loops can't be unrolled enough before register pressure kills you (which, btw, is why Itanium had a stupidly huge register file).

Second, it exposes microarchitectural details that can (and maybe should) change quite rapidly. The width of the VLIW is baked into the ISA. Processors these days have 8 or even 10 execution ports; no way one could even have space for that many instructions in a bundle.

Third, all those wasted slots in VLIW words and huge 6-bit register indices take up a lot of space in instruction encodings. That means I-cache problems, fetch bandwidth problems, etc. Fetch bandwidth is one of the big bottlenecks these days, which is why processors now have big u-op caches and loop stream detectors.

Fourth, there are just too many dynamic data dependencies through memory and too many cache misses to statically schedule code. Code in VLIW is scheduled for the best case, which means a cache miss completely stalls out the carefully constructed static schedule. So the processor fundamentally needs to go out of order to find some work (from the future) to do right now, otherwise all those execution units are idle. If you are going out of order with a huge number of execution ports, there is almost no point in bothering with static instruction scheduling at all. (E.g. our advice from Intel in deploying instruction scheduling for TurboFan was to not bother for big cores--it only makes sense on Core and Atom that don't have (as) fancy OOO engines).

There is one exception though, and that is floating point code. There, kernels are so much different from integer/pointer programs that one can do lots of tricks from SIMD to vectors to lots of loop transforms. The code is dense with operations and far easier to parallelize. The Itanium was a real superstar for floating point performance. But even there I think a lot of the scheduling was done by hand with hand-written assembly.


> The Itanium was a real superstar for floating point performance.

No. Itanium was never a superstar. It was merely competitive at its best, and even that only if you ignored price/performance; some of its versions were pretty bad in absolute performance and nowhere near competitive, and plain abysmal if you consider price/performance. Also, the majority of practical, important numerical computations were memory-bandwidth bound, so it didn't matter as much whether you could pack the loop perfectly. And Itanium was almost never the highest-memory-bandwidth machine during its lifetime, partially due to its many delays.

Itanium would be my choice for the worst architecture, as it successfully killed other CPUs and produced a lot of not-very-useful research.

MIPS (and Hennessy and Patterson) would be my first choice, for upending the architecture design. Honorable mentions from me would be the IBM 801 (it led to much research), the Intel iAPX 432 (for capability architecture, which I think will come back at some point), the Z80 and 8501 for ushering in computing power everywhere, and x86-64 for "the best enduring hack".

Anyway, the article itself is great, and I wish I had asked the same question to some of those folks mentioned in the article, and many other architects and researchers when I met them...


> MIPS (and Hennessy and Patterson) would be my first choice, for upending the architecture design.

I agree that it was a revolutionary design for the time. From today's perspective, some of the choices made did not age all that well (in particular branch and load delay slots).

For assembly level programming/debugging, my favorite architecture by far was POWER/PowerPC. Looking at x86 code vs PowerPC after Apple's switch made me almost cry, although the performance benefits were undeniable.


David Patterson didn't work on MIPS.


> There is one exception though, and that is floating point code.

I wish I had read to the end before replying (and then deleting) responses to each of your items. :-)

I think we're mostly in agreement, except for the following minor tidbits:

> The width of the VLIW is baked into the ISA [...] no way one could even have space for that many instructions in a bundle.

My understanding was that your software/compiler could indicate as many instructions as possible in parallel, and then stuff a "stop" into the 5-bit "template" to indicate the previous batch needed to commit before proceeding. So you wouldn't be limited to 3 instructions per bundle, and if done well (hopefully), your software would automatically run faster as the next generation came out with more parallel execution units.
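Concretely, a bundle is 128 bits: a 5-bit template plus three 41-bit slots, and the template encodes both the slot types and where the stops fall. A rough C picture, purely illustrative (real bundles obviously aren't addressable as a bitfield struct):

    /* IA-64 bundle layout, roughly (5 + 3*41 = 128 bits). The template
       selects slot types (M/I/F/B/...) and stop positions; a
       dependency-free instruction *group* can span many bundles until
       a stop ends it. */
    struct ia64_bundle {
        unsigned long long tmpl  : 5;   /* slot types + stop positions */
        unsigned long long slot0 : 41;
        unsigned long long slot1 : 41;
        unsigned long long slot2 : 41;
    };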

> all those wasted slots in VLIW words and huge 6-bit register indices take up a lot of space in instruction encodings

128 registers, so I'd think it'd be 7-bit indices. Each instruction was 41 bits, which is roughly 5 bytes. Most SSE/AVX instructions end up being 4-5 bytes (assuming no immediates or displacements), and that's for just 4-bit register indices. So it doesn't seem much worse than we have now.


> Instead of more cores, give me more ALUs.

It kinda didn't work that way though.

In practice, all of your ALUs, including your extra ones, were waiting on cache fetches or latencies from previous ALU instructions.

Modern x86 CPUs have 2-4 ALUs, dispatched to in parallel 4-5 instructions wide, and these dispatches are aware of cache fetches and previous latencies in real time. VLIW can't compete here.

VLIW made sense when main memory was as fast as the CPU and all instructions shared the same latency. History hasn't been kind to these assumptions. I doubt we'll see another VLIW arch anytime soon.

I accept the idea that x86 is a local minimum, but it's a deep, wide one. Itanium or other VLIW architectures like it were never deep enough to disrupt it.


> In practice, all of your ALUs, including your extra ones, were waiting on cache fetches

I think if anyone could get around memory bandwidth problems they would, but for some very interesting and useful algorithms, I can tell you way in advance exactly when I'll need each piece of memory. For these problems, VLIW/EPIC with prefetch instructions would be a win over all the speculation and cleverness.
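For example, with GCC/Clang you can already spell this out by hand with __builtin_prefetch; a minimal sketch (the PF_AHEAD look-ahead distance is just an illustrative tuning knob):

    #include <stddef.h>

    /* With a large, known stride the hardware prefetcher tends to give
       up, but this code knows its address sequence exactly, so it can
       request lines a few iterations ahead of use. */
    enum { PF_AHEAD = 8 };

    void scale_strided(float *x, size_t n, size_t stride, float k)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + PF_AHEAD < n)
                __builtin_prefetch(&x[(i + PF_AHEAD) * stride], 1, 0);
            x[i * stride] *= k;
        }
    }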

> Itanium or other VLIW architectures like it were never deep enough to disrupt it.

History is what it is, but I'm just imagining an alternative timeline where all the effort spent making Pentium/AMD64 fast was pumped into Itanium instead, and compiler writers and language creators got to target an architecture that didn't act like a 64 bit PDP-11.


Are these sorts of prefetch instructions widely supported in VLIW, though? AIUI, one of the "tricks" the Mill folks had to come up with as part of designing a generally-usable VLIW-like architecture is making RAM accesses inherently "async", making it easy to statically schedule other instructions in case the RAM access stalls.


Rather than individual-memory-address prefetch instructions, how would you feel about sending a DMA program to a controller on-board a memory DIMM, that would then enable you to send short external commands to the memory that would be translated by the DMA program into custom “vector requests”, to which the memory could respond with long streams of fetch responses—shaped somewhat like the output of a CCD’s shift-register—where this stream of fetch responses would then entirely overwrite the calling CPU’s cache lines with the retrieved values?


Random thought: what’s stopping main memory from being as fast as CPUs (in throughput, not necessarily latency)? TDP? The unwillingness to pay $100s per DIMM?


As far as throughput is concerned, mostly the idea that there is a DIMM at all. That means there has to be some wiring between the CPU/MC and the actual DRAM array that has a manageable number of wires and reasonable RF/EMC characteristics.


The bandwidth between the CPU and the DIMM (and any overhead from ser/des narrowing in the physical signalling layer used to connect the two) is only a constraint if the DIMMs are, to put it in a funny way, RISC—if you have to send them a stream of low-level retrieval requests to describe a high-level piece of state you’d like to know. Which does describe most modern DIMMs, but not all of them.

https://en.wikipedia.org/wiki/Content-addressable_memory (CAM), as used in network switches, isn’t under the same constraints as regular RAM. The requests you make to CAM are CISC—effectively search queries—putting the whole memory-cell array to work at 100% utilization on each bus cycle.

But even CAM is still slower than the CPU. Even when it’s on the same SoC package as the CPU, it’s still clocked in such a way that it takes multiple CPU cycles to answer a query. So, at least in this case, bus bandwidth is not “the” constraint.


The whole idea of DRAM is about making the whole thing cheaper by limiting the outside bandwidth (DRAM chips don't have a multiplexed address bus just to save pins; supplying the address in two phases is inherent to how a DRAM array works).

There is nothing that prevents you from making an SRAM/CAM array running at the same or even a higher clock speed than a CPU made with the same semiconductor technology, except the cost of the thing. And in fact, an n-way associative L1 cache (for n>1) is exactly such a CAM array.


The problem is latency, not throughput. Memories do have the bandwidth to keep the processor caches full, but they can not handle random access.

Anyway, you increase throughput by just adding more hardware. That's easy and widely done.


DRAM is pretty fast in throughput. The problem is the "random access" nature; every piece of indirection or pointer chaining is an unpredictable access. Every time you have a "." in your favourite object-oriented language, every step in a linked list: if that's a cache miss, you have to wait for it to come back ... and then potentially cache miss on the next lookup as well.
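The contrast in a toy C example: the array walk's next address is known way ahead of time, while the list walk can't even compute the next address until the current load comes back.

    #include <stddef.h>

    struct node { int value; struct node *next; };

    long sum_array(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];              /* next address is predictable */
        return s;
    }

    long sum_list(const struct node *p)
    {
        long s = 0;
        for (; p; p = p->next)      /* next address depends on this load */
            s += p->value;          /* every hop is a potential miss */
        return s;
    }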


I'm not sure that "." does it, at least in C/C++. What's on the right should be very close in address to what's on the left, unless it's a reference.

In Java... not so much. I believe that "." in Java is the same as "->" in C/C++, unless what's on the right is a primitive data type rather than an object.
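In C terms, the distinction is just this (toy sketch); a Java field of object type always behaves like the second member.

    struct Point { double x, y; };

    struct Line {
        struct Point a;    /* embedded: l.a.x is a contiguous access      */
        struct Point *b;   /* pointer:  l.b->x adds a pointer chase, like
                              every object-typed field access in Java     */
    };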


Several times in the last few months we've seen that various AMD CPUs performed better on certain benchmarks with overclocked RAM. I am not sure the RAM that could help the CPU saturate the transfer channels even exists today.


I worked on a project with machines that used PA-RISC CPUs. The importance of optimized compilers (and math libraries) can't be overstated; they made those machines really shine. My understanding was that Itanium (which basically replaced PA-RISC in HP's Unix machine lineup) never got the compiler support to realize the architecture's strengths, so everyone looked to the safer bet in 64-bit computing.

It's hard to compete with the scale of x86. Like software, I feel the industry tends toward one architecture (the more people use the architecture, the better the compilers, the more users ...). Even Apple abandoned PowerPC chips.


> so everyone looked to the safer bet in 64-bit computing.

Itanic was the safer bet in 64-bit computing. It just sucked. Intel didn't switch to AMD64 until underdog AMD was already eating their lunch.

Today there are probably more aarch64 CPUs being sold every month than amd64 CPUs (including Intel's).


If anyone stood a chance to compete against x86, I'd think Intel would be it :-)


But Intel failed here with the same mistake they've made many times before: the people who actually buy Intel kit basically want a faster 8088/80386/Pentium, not a novel, cleaner, sexier new architecture. See: iAPX 432, i860, i960MX, and recently Itanium.

Linus Torvalds has an interesting take on this (whether or not you agree): https://yarchive.net/comp/linux/x86.html


I think Linus is very focussed on the stuff he cares about, and that shows in his complaint about PAE, which is pretty invisible to us userland folk.

> the people who actually buy Intel kit basically want a faster 8088/80386/Pentium

My company is small fries compared to most folks, but what we really wanted at the time was cheaper DEC Alphas.


> I think Linus is very focussed on the stuff he cares about

I think he'd be the first to agree with that.

> PAE, which is pretty invisible to us userland folk

Perhaps, but the original conversation is about architecture, and PAE was a pretty grungy architectural wart, and the sort of thing that's very visible to the OS folks (as you say, what Linus cares about).

> we really wanted at the time was cheaper DEC Alphas

There were 'cheap' Alphas: the 21066/21068 like in the Multia. But they were dogs; to be cheap, they had to give up big cache and wide paths to main memory. Expensive system support level stuff was required for fast Alphas (complex support chipsets, 128-bit wide (later 256-bit) memory buses). More commodity inertia would have fixed that over time, but they never got there. Intel on the other hand was way down the road reaping commodity benefits, and it ran the software commodity folks wanted.


> Perhaps, but the original conversation is about architecture, and PAE was a pretty grungy architectural wart, and the sort of thing that's very visible to the OS folks (as you say, what Linus cares about).

I suspect if he was into writing compilers (or graphics, or numerics, or ...), some of the other grungy architectural warts of x86 might annoy him too.


I suspect if you read the linked thread, you'd see exactly what he thought. He was at Transmeta at the time, and isn't exactly unfamiliar with what compiler writers get annoyed with. You may or may not agree with what he says, but he has a cogent, interesting perspective.


Ahh, I only read the one post. My bad.


Well, I read the rest of his comments. There are little tidbits which are good observations, but mostly I continue to think he just disregards things which aren't in his field of interest. If I was more cynical, I might think he wanted the continuance of x86 specifically because he was working at Transmeta.


I don't think x86 compatibility mattered. When it launched in 2001 it was supposed to replace HP's PA-RISC architecture, which is totally different anyway. Sun and its SPARC processors were very much alive, Google was only three years old, and AWS was five years away - the idea that a massive array of cheap x86 processors would outperform enterprise-class servers simply hadn't occurred to most people yet.

Of course, the joke is that cheap x86 processors did outperform Itanium (and every other architecture, eventually).


> When it launched in 2001...the idea that a massive array of cheap x86 processors would outperform enterprise-class servers simply hadn't occurred to most people yet

When do you mean? In 2001?

It occurred to Yahoo, whose site had been run that way since almost the beginning, on FreeBSD. It occurred to Google, whose site was run that way since the beginning. It occurred to anyone who was watching the Top500 list, which was already crawling with Beowulfs — admittedly, not at the top of the list yet. It should have occurred to Intel, who were presumably the ones selling those servers to Yahoo and Penguin Computing and VA Research (who had IPOed in 1999 under the symbol LNUX). It had occurred to Intergraph, who had switched from their own high-performance graphics chips to Intel's by 1998. It had occurred to Jim Clark, who had jumped off the sinking MIPS ship at SGI. In 1994. It occurred to the rest of SGI by 1998, when they launched the SGI Visual Workstation, then announced they were going to give up on MIPS and board the Itanic.

I mean, yes, it hadn't occurred to most people yet. Because most people are stupid, and most of the ones who weren't stupid weren't paying attention. But it hadn't occurred to most people at Intel? You'd think they'd have a pretty good handle on how much ass they were already kicking.

> the joke is that cheap x86 processors did outperform Itanium (and every other architecture, eventually).

Even on my Intel laptop, more of the computrons come from the Intel integrated GPU. In machines with ATI (cough) and NVIDIA cards, it's no contest; the GPU is an order of magnitude beefier.


> When it launched in 2001...the idea that a massive array of cheap x86 processors would outperform enterprise-class servers simply hadn't occurred to most people yet

The idea was certainly around in the early-to-mid 1980s, when some former Intel engineers founded Sequent.

The Balance 8000, released in 1984, supported up to 12 processors on dual-CPU boards, while the Balance 21000, released in 1986, supported up to 30.

I interviewed the founder, Casey Powell, and he was explicit about multiple Intel microprocessors replacing large systems. He was targeting minicomputers at the time, of course, but we all anticipated that bigger sets of more powerful CPUs would eventually surpass even the biggest "big iron".

Powell was a great guy. However, his company got taken over by IBM. In the end, he didn't get to change the world.

"It's hard to be the little guy on the block and have really great technology and get beaten, just because the other guy is big." https://www.cnet.com/news/sequent-was-overmatched-ceo-says/


Right! Some friends of mine spent a lot of time programming a Symmetry in the early 1990s. Also around the same time, 1988, Sun introduced the Sun386i, which could even run multiple MS-DOS programs at once — but it wasn't a huge success, and they stuck with SPARC. I think Sequent and the Sun386i were just too early, say by about six or seven years.

An interesting question is: what are the structural advantages of bigness? When Control Data produced the world's fastest computer, some people at IBM wondered how it could happen that a much smaller company could beat them to the punch that way; others believed that that smallness was precisely the reason.


The advantage of smallness is that you can be faster than the big guys. You also can go into smaller niches.

The advantages of bigness are that you can use scale to make the same thing less expensive, and that you can make at least one mistake without it killing you, and that you can chase more than one "next big things" at once.


> I don't think x86 compatibility mattered.

It did. At that time, Linux and open source were not the clear winners in the server space that they are now, and people were not used to (or able to) recompiling their code.

Windows took a long time to support Itanium and companies wouldn't buy it because they had nothing to run on those machines. They got x86 machines instead and amd64 when it became available.


> I don't think x86 compatibility mattered.

I'll admit I had a limited worldview, but not running Excel as quickly seemed like the kind of criticism I saw in the trade rags at the time.

> the idea that a massive array of cheap x86 processors would outperform enterprise-class servers simply hadn't occurred to most people yet

Oh, I don't know. There was a really common Slashdot cliche running around at that time: "Can you imagine a Beowulf cluster of these?"


Yeah, it had clearly occurred to everyone on Slashdot. On the other hand, we liked to "imagine Beowulf clusters" of all kinds of recondite hardware. Elbrus 2000 is real power!

HOT GRITS!


/. was another world, one that I miss sometimes. I'm probably viewing it through rose-tinted nostalgia, but I don't remember the vitriol and hatred that so many social sites are soaked in these days.

The Mozilla open source announcement, the Microsoft anti-trust case, Linux exploding in popularity, it all felt like we were changing the world for the better.


/. was great... and then it wasn't. The troll population went way up, the "first post" thing was just noise, and eventually I just quit going there.

HN is also less than it used to be, but I think that the mods have kept it better than /. became - so far, at least.


That seems right on such important sorts of computation, disregarding other factors. On the history, actual HPC numbers for Itanium have appeared in the Infamous Annual Martyn Guest Presentation over the years. An example that came to hand from 15 years ago to compare with bald statements is https://www.researchgate.net/profile/Martyn_Guest/publicatio...

Regarding GPUs, Fujitsu may not agree (for Fugaku and spin-offs) depending on the value of "horsepower" relevant for HPC systems, even if an A64FX doesn't have the peak performance of a V100. They have form from the K Computer, and if they basically did it themselves again, there was presumably "co-design" for the hardware and software which may be relevant here; I haven't seen anything written about that, though.


That C-like language is called "Verilog". (Yes, I know it's an HDL, but the point still stands. FPGAs are commodity these days.)


Verilog isn't really very C-like. It is slightly more C-like than its main competitor, VHDL. It is more C-like than Lisp. But really, neither of those is saying much.

Ways it is not C-like:

- begin...end instead of curly braces

- parallelism with assign, always, initial

- tasks and functions and their subtle differences

- non-blocking assignment

- bit-oriented variables and operations

- 4-state logic (0, 1, X, Z)

- other constructs for modeling hardware like time delays, tri-state wires, drive strengths, etc.

Not to mention that SystemVerilog has taken over Verilog and adds OOP with classes, a complicated (sorry, powerful) assertion mini-language, constrained randomization, a streaming operator, and so on.


Heh, I suspect hardware folks would like a CPU which programmed well with Verilog or VHDL, and I know that trying to make a hardware description language accessible to software folks has been a pipe dream for at least 25 years. However, I don't think Verilog was the solution for Itanium, and the useful niche for FPGAs seems increasingly limited to low size, weight, and power (SWaP) realms.

The FPGA projects I've seen (using very high-end FPGAs, not commodity/cheap ones) seem like they're always bumping up against clock-rate limits and struggling to make timing as soon as they try to do anything approaching what you can do on a CPU or GPU. Of course there are exceptions where the FPGA does really simple and parallel things, but FPGAs aren't a panacea.


There was a company called reconfigure.io that came the closest to an accessible HDL (they compiled Go to VHDL/Verilog), but it seems to have died, and their founder is now with ARM.


I recently worked with another company trying to solve the same problem, but I suspect I should keep my mouth shut due to NDA crap. Regardless, I suspect we're easily another decade out before anyone makes FPGAs accessible to the masses, and I can't think of many problems where I would take an FPGA over a GPU, but there are still some.


Or you could prototype the algorithm in matlab then convert to HDL. https://www.mathworks.com/products/hdl-coder.html



