MLC-LLM: GPT/Llama on consumer-class GPUs and phones

spudlyo · on April 30, 2023

I'm running this on my iPhone 13 Pro Max as part of the Test Flight beta, and it's interesting. I don't believe I've ever run anything that's ever pushed my phone this hard, and you can feel the heat. The text output performance is pretty inconsistent, it was very fast at first but slowed down considerably after a few answers. In terms of quality, it's prone to hallucinations, which is not unexpected based on the size and highly compressed nature of the model.

"The marvel is not that the bear dances well, but that the bear dances at all."

david-gpu · on April 30, 2023

> I don't believe I've ever run anything that's ever pushed my phone this hard, and you can feel the heat. The text output performance is pretty inconsistent, it was very fast at first but slowed down considerably after a few answers.

Those two events are causally related. The OS has to throttle down the CPU or else it will overheat and malfunction.

It is one of the reasons why heavy number crunching is often performed on the cloud instead.

capableweb · on April 30, 2023

In my experience, heavy number crunching is more suitable to run on dedicated machines, rather than virtualized "cloud" cores. More consistent performance, no noisy neighbors and cheaper in the long term.

david-gpu · on April 30, 2023

"The cloud" simply means remote servers. Those servers may or may not contain application-specific hardware acceleration.

Nowadays many cell phone application processors also have dedicated hardware to accelerate neural nets, but they will always be limited by thermal constraints.

halflings · on April 30, 2023

> cheaper in the long term

Citation needed :)) There are economies of scale and various optimizations that are just not possible with dedicated machines.

Anecdotal evidence: a company I worked in had a dedicated DC with hundreds/thousands of machines that mostly ran SQL queries on petabytes of data (any query would take ~5-30 minutes). Eye-watering budget and whole teams to maintain the cluster... They switched to GCP/BigQuery, got queries that ran in seconds at a fraction of the budget.

Negitivefrags · on April 30, 2023

This anecdote is the opposite of the ones I usually hear.

Economies of scale go in the other direction usually.

taneq · on May 1, 2023

Those economies are for the cloud provider, not you. The cost to you is set at just less than their best guess at what you think self-hosting will cost you.

Cloud hosting only makes sense in a case that your capacity needs are changing rapidly and unpredictably, or the case where you’re so big that the cloud hosting company is effectively a department. Any other time it will almost certainly be cheaper to self host.

rvnx · on April 30, 2023

Sounds like they could have switched to Clickhouse if they already had the hardware.

emidoots · on April 30, 2023

The inverse is also true; I don't have to pay Google or AWS to deploy my software to dedicated consumer devices. The user already paid for it.

smoldesu · on April 30, 2023

If you can saturate it, sure. The average person is not going to be running 24/7 AI compute tasks though, so paying for a month of A100 dedi when you only need 10-20 minutes of compute/week is a complete waste.

trifurcate · on April 30, 2023

The slowdown here is going to be more about the increased context length than throttling.

david-gpu · on April 30, 2023

Good point! Didn't think of that.

junrushao1994 · on April 30, 2023

Thanks for sharing! It's definitely a bit painstaking to get a real-world LLM running at all on an iPhone due to memory constraint. It's also quite compute-intense as it has 7B parameters, but we are glad that it's generating texts at reasonable speed!

The model we are using is a quantized Vicuna-7b, which I believe is one of the best open-sourced models. Hallucination is a problem to all LLMs, but I believe research on model side would gradually alleviate this problem :-)

Fr0styMatt88 · on April 30, 2023

WizardLM-7b would be a fantastic model to try out as well. Though that might be out of date tomorrow (or already!) given how many models are being released at the moment :)

andrewaylett · on April 30, 2023

I'd suggest that everything output by an LLM is a hallucination, but it often enough hallucinates something that resembles reality.

strudey · on April 30, 2023

True for our minds too I think ;)

BoorishBears · on May 1, 2023

It honestly worries me when I see people paint the whole "everything is a hallucination" angle as uniquely applying LLMs...

If you're going to be that loose with your definition of hallucination, I'd really hope you apply it to yourself. That's a level of introspection that you need to avoid letting biases that all of us have deep down end up affecting your higher order function.

We humans rely on a ball of biases, interpolations and extrapolations to function, LLMs don't have a monopoly on that.

kiratp · on April 30, 2023

You can do Stable Diffusion on your phone too, fully local.

https://apps.apple.com/app/id6444050820

junrushao1994 · on April 30, 2023

This is our latest project on making LLMs accessible to everyone. With this project, users no longer need to spend a fortune on huge VRAM, top-of-the-line GPUs, or powerful workstations to run LLMs at an acceptable speed. A consumer-grade GPU from years ago should suffice, or even a phone with enough memory.

Our approach leverages TVM Unity, a machine learning compiler that supports compiling GPT/Llama models to a diverse set of targets, including Metal, Vulkan, CUDA, ROCm, and more. Particularly, we've found Vulkan great because it's readily supported by a wide range of GPUs, including AMD and Intel's.

BTW, an interesting data point from Reddit that it also works on steam deck: https://www.reddit.com/r/LocalLLaMA/comments/132igcy/comment....

jrm4 · on May 1, 2023

Not sure if you're interested in support questions, but I ran the simple start thing you guys put up (Linux, RX 570) -- and it runs quickly but spits out absolute gibberish?

junrushao1994 · on May 1, 2023

Thanks for sharing! Sometimes LLMs do generate some weird stuff, but if the issue persists, please do report this to our github issues!

eurekin · on April 30, 2023

This field is in crazy progress mode now. Not long ago it was rent cuda gpu only. Now this... AMD could easily chip away some part of the market, if they released a > 24 GB vram gpu now.

valine · on April 30, 2023

I hope they do, and I hope it forces Nvidia to release their own 48GB+ consumer card. 80GB is on my long term wish list as it would allow running a 65B model 8bit quantized. I don’t see local models exceeding ChatGPT performance until we get to a point where folks can run 65B parameter models.

int_19h · on April 30, 2023

Unless you're doing training, is there much point in 8-bit for this model size? My understanding is that the larger the model, the less affected it is by quantization; for 65b, 4-bit gives you ~2% perplexity penalty over 8-bit.

valine · on May 1, 2023

I don’t buy into perplexity as a good benchmark for the usefulness of an LLM. In my experience playing with many LLaMA models of various sizes and quantization levels, the higher bit models perform significantly better on complex questions.

int_19h · on May 1, 2023

Perplexity is a rough metric for sure, but the non-linear dependency is also directly observable. I would definitely agree with 7b and 13b giving better results with 8-bit, but the difference with 30b is much more subtle.

It should also be noted that the method of quantization makes a big difference. In particular, if you were experimenting with llama.cpp, their original take on it was considerably inferior to GPTQ. And for the latter, parameters such as group size can also make a difference.

yieldcrv · on April 30, 2023

yeah agreed, its sad to me that we're 2 months after llama and "nobody" is seemingly doing any advances of fine tuning on models with more than 7B or 13B parameters.

I have 64gb RAM (not gpu just normal), I’d like to see proof of concepts that the bigger models can be fine tuned and have far more accepted results, or to know if we’re completely going the wrong direction with this

UncleEntity · on April 30, 2023

Probably because it’s extremely affordable to train the smaller models.

If I had the gumption (and a data set) I could afford to spend a few hundred bucks to fine tune a model for shits and giggles and I’m just a Random Internet Dude.

I’m all for it, Any Day Now™ I have this idea I want to try and having these people do all this optimization work will probably make it affordable to attempt given I don’t actually know what I’m doing so there will be a whole lot of “yeah, that doesn’t work” going on.

nullsense · on April 30, 2023

I find, since 30B models are actually quite usable locally if you have good hardware, that I really want something like a Vicuna 30B. That would be amazing. I can only run 65B locally at a speed of 1 token per second, which is too slow to be usable unfortunately.

amelius · on April 30, 2023

Wouldn't it be better to get a GPU that shares its memory with the CPU, like in the new macbooks?

valine · on April 30, 2023

Sure, but other than Apple no one has the tight integration needed to make that happen. If Apple leans into their CoreML stuff they have a real opportunity to steal the market from Nvidia.

Long term maybe Nvidia is able to release release some integrated ARM chip, but I’m not holding my breath.

verdverm · on April 30, 2023

Nvidia sells data center products, Apple is not going to steal this market, they aren't even in it. The gaming market is dominated by Windows, I don't see this moving meaningfully just because apple has some marginally better hardware at this point in time.

The Risc V area, now there we can talk about disruption long term

dhruvdh · on April 30, 2023

How do you imagine consoles work? And what about Apple's tight integration has actually made itself useful? The developer experience is terrible.

valine · on April 30, 2023

It’s useful because you can buy a Macbook with 64GB of RAM, and then use that RAM as VRAM to run your LLM.

amelius · on April 30, 2023

NVidia Jetson is a line of products with tight integration between CPU and GPU. As just one example.

nl · on May 1, 2023

Does it? I'm not aware of any way to share main memory with the Jetson CUDA cores.

If I'm wrong I'd love a pointer to the docs about it!

hawski · on April 30, 2023

Question from a noob: how good would it be to run those on a computer with AMD APU (for example Ryzen 9 7940HS) with 128GB RAM and setting aside 64GB for iGPU?

eurekin · on April 30, 2023

Another noob here. If I had to guess, it's because current models are mostly memory bound. The AI learning gpus (A100, H100 etc.) are not the best TFlop performers, but they have most vram. It seems that researchers found a sweet spot for neural network architectures that perform good on similar configurations, i.e. near real time (reading speed in LLMs). Once you bring those models to cpu, they might get performance bound again. Llama.cpp somehow illustrates that a bit, for bigger models you tend to wait a lot for the answer. I suspect the story would be similar with igpus

hawski · on April 30, 2023

So possibly some basic iGPU (maybe even Intel) with lots of VRAM assigned could be enough?

crowwork · on April 30, 2023

You can try out the demo and benchmark yourself

junrushao1994 · on May 1, 2023

As long as there is a Vulkan SDK for your AMD APU (likely there is), MLC-LLM can use TVM Unity to generate code for it

mrtksn · on April 30, 2023

It appears to generate 30 tokens/s on iPhone 14 pro: https://i.imgur.com/AWTXtGA.png

But for some reason it dramatically slows down after a few messages

Edit:

Oh no, this one also gives lectures instead of answering questions.

https://i.imgur.com/eiuGzK4.jpg

I'm afraid, in near future the only organic content on the internet would be only the type of content that LLMs refuse to generate.

ericlewis · on April 30, 2023

Tokenization is a probable issue here, the longer the context the longer the initial processing with llama I think. Possible tokenizer is not optimized.

brrrrrm · on May 1, 2023

That’s very unlikely, tokenization is really simple and usually quite fast (scales with input size). Unexpected slowness with a larger context window might point to a non-existent or unoptimized KV cache.

QuadrupleA · on April 30, 2023

Does this support int4 tensor core operations on the Nvidia Turing & Ampere architectures? From what I've researched this would be a huge untapped speedup and memory saver for inference, but it's mostly undocumented, unsupported by pytorch etc. It'd basically be like llama.cpp but with a 10x or more GPU speedup, if someone was willing to dig in and write the CUDA logic for it. Given how fast llama.cpp already is on the CPU, this would be impressive to see.

I've been tempted to try it myself, but then the thought of faster LLaMA / Alpaca / Vicuna 7B when I already have cheap gpt-turbo-3.5 access (a better model in most ways) was never compelling enough to justify wading into weird semi-documented hardware.

junrushao1994 · on April 30, 2023

TVM Unity has a CUDA backend, and TensorCore MMA instructions are supported, so it wouldn't be hard to turn this option on.

It's on our plan, but we haven't looked to enable them by default in the first place, mainly because we wanted to demonstrate it running on all GPUs including old models that don't come with TensorCore at all.

qeternity · on April 30, 2023

The 7b model is ca. 6gb of VRAM so yes, it is already 4bit quantized.

There are already efforts underway with GPTQ libraries but I have found they incur a substantial performance penalty, with the benefit of consuming much lower VRAM.

EDIT: I had a look at the repo, it appears the Vicuna model is using 3bit quantization.

akrymski · on April 30, 2023

Nobody wants to run Google on their PCs, why should LLMs be different? I'd expect GPT models to be updated regularly fairly soon, and in much the same way that I wouldn't want to host a personal out-dated web index + search engine, LLMs seem a perfect fit for server-side services given their requirements. Barely anyone even hosts their blogs or mail. What's the excitement about getting it almost running on a phone about?

mrighele · on April 30, 2023

> What's the excitement about getting it almost running on a phone about?

You get to decide what is appropriate or not.

It works offline.

It can be used to by applications without the having to use an external service.

This can be important for a number of applications (I am thinking about open source games and the modding community right now, but it is just an example)

MuffinFlavored · on May 1, 2023

email works on localhost if you don’t want to email anybody

mcemilg · on April 30, 2023

I think it is important to keep your that in device. I would prefer the chatgpt to work on my pc instead work on a giant companies servers. I am doing very sensitive conversations with it. There is no guarantee of this data to be kept secret by Open AI. Also corporation would want to keep their data in house. Especially it can be a huge problem if your developer can unleashed the private document with the internet.

chpatrick · on April 30, 2023

Because it's a world-changing technology and right now a handful of companies control it.

int_19h · on April 30, 2023

LLMs are much more than a Google Search replacement, and many interesting use cases require them to have access to private data.

jrm4 · on May 1, 2023

I'd host my own blog or mail if it were easier.

More to the point, I find it absolutely bizarre that one couldn't come up with quite a few reasons to have this be more private, whether personal or business.

Your personal or business stuff in other people's hands is generally not optimal or preferable, especially when more private options exist.

charcircuit · on April 30, 2023

>why should LLMs be different?

Because LLMs are expensive to host. It's more of a case that no one wants to run these on their PCs and it's the cheapest if it ends up running on client PCs instead of your own PCs. Not all use cases of LLMs need a super powerful model that is always up to date.

leodriesch · on April 30, 2023

I think the privacy aspect is also interesting, think about journal or second-brain type applications. These could benefit a lot from language model use, but you don’t want these types of information sent to a cloud provider in unencrypted fashion.

whitepaint · on April 30, 2023

The easier to run it the more competition will emerge.

wokwokwok · on May 1, 2023

There are two compelling reasons:

1) for people who want to make money using AI but can’t afford to pay for a LLM or servers, they can push that cost to the end user.

2) for people who want to generate porn or spam (probably, also to make money)

The privacy thing is complete nonsense. If you want a private server, rent your own private server. If you’re worried AWS is spying on you, you’re paranoid.

This is about money, making money and being cheap, not about good will.

So, you’re right; from a consumer perspective it’s pretty meaningless.

homarp · on April 30, 2023

"Our primary workflow is based on Apache TVM Unity, an exciting ongoing development in the Apache TVM Community."

dontreact · on April 30, 2023

What useful things are people able to do with the smaller more inaccurate models? I have a hard time understanding why I would build on top of this, rather than just the openaI API, since the performance is so much better.

killthebuddha · on April 30, 2023

This is not a direct answer to your question, but performance is better in terms of the _quality_ of completions but not in terms of price, latency, or uptime.

jacooper · on May 1, 2023

Spam people.

syntaxing · on April 30, 2023

What sort of performance would you expect on a P40 with either 4 bit or 8 bit GPTQ 13B? My biggest issue with Triton is the lack of support for Pascal and older GPUs. With CUDA, I only get about 1-3 tokens per second.

Are these the only supported models as of now? https://github.com/mlc-ai/mlc-llm/blob/d3e7f16c54238b7da5e78...

thepra · on April 30, 2023

...how to remove that "As an AI language model, I do not..." limit or self-censorship?

vGPU · on April 30, 2023

Likely due to the model used. It looks like they’re presenting this as a framework so it should be possible to substitute a different model in.

aezart · on April 30, 2023

I can already run a language model on my GPU using minillm or text-generation-webui, and on my CPU using llama.cpp. What makes MLC-LLM better?

junrushao1994 · on April 30, 2023

You no longer need a powerful latest-gen GPU to run SOTA models, plus going through complicated setups. MLC-LLM makes it possible to use GPUs from any vendors, including AMD/Apple/NV/Intel, to run LLMs at reasonable speed, at any platform (win/linux/macos), even a steam deck :-)

The way we make this happen is via compiling to native graphics APIs, particularly Vulkan/Metal/CUDA, making it possible to run with good performance.

tyfon · on April 30, 2023

llama.cpp is not using the GPU, it runs fine on the CPU (if fast enough)

I've scoured the web page for ram requirements for the various models but I can't see anything, will it be able to run let's say the 30B open assistant llama or 65B raw llama model on a consumer gpu (let's say 3060 with 12gb vram) using this?

Not trying to take anything away, but the readme etc is very lacking in actual technical details I feel without reading through the code or actually testing it.

eulers_secret · on April 30, 2023

The local llama subreddit wiki has good info about RAM requirements: https://www.reddit.com/r/LocalLLaMA/wiki/models/

junrushao1994 · on April 30, 2023

Thanks for the feedback! This is definitely something we need to do. To share some data, currently the default model is Vicuna-7b, aggressively quantized to 2.9G.

We are expanding the coverage to more models, particularly, Dolly and StableLM are just around the corner, needing some clean up work.

As a fresh new project, right now we are starting to collect data points of which GPU models are supported well and fixing issues being reported. Please don't hesitate to report in our github issue!

tyfon · on April 30, 2023

I see, the 2.9 GB requirements seems to imply a 3 bit weights?

In any case I am happy to see these projects taking form. Perhaps one can eventually make the level of quantization dynamic based on the available vram etc :)

I will definitively play around with it (on linux though, not a phone!)

int_19h · on April 30, 2023

When people tried 3-bit quantization for 7B models before, it did not exactly go well in terms of detrimental side effects. Are you using some new quantization techniques that mitigate that?

azeirah · on April 30, 2023

Llama.cpp recently added partial GPU acceleration. Model dequantization as well as some BLAS operations have been moved to GPU.

It runs a lot faster if you compile with cuBLAS (nvidia) or clblast (other). GPU vram doesn't matter much since it doesn't offload the model to vram.

simonw · on April 30, 2023

Have you got those to work on an iPhone?

sroussey · on April 30, 2023

Have it working on mine!

sroussey · on April 30, 2023

That works on a Mac?

int_19h · on April 30, 2023

The installation process includes downloading precompiled binaries from this repo:

https://github.com/mlc-ai/binary-mlc-llm-libs

Is the code from which these are built available somewhere? How does one go about building one for their own model?

eurekin · on April 30, 2023

Not wanting to derail the thread, but could be the best place to ask this:

What are you using local LLM's for?

So far, I've been only able to come up with:

- Aid in coding (which always ends up in chatGPT)

- Summarizing short articles

- whisper-ai + langchain + ffmpeg allows for some great video summarization (especially with non-english LORA's for us non-natives)

- generating stable diffusion prompts

flatiron · on April 30, 2023

ive been playing with it locally in the hopes of a model that allows for commercial use. at my job if i had a model I could run in the cloud and just wrap a REST service around I could think of a ton of ways to use it both internally and externally.

eurekin · on April 30, 2023

Thanks! If ChatGPT can be used commercially and it successfully passed SOC3 cert, wouldn't you still want to use those non - chatgpt models?

Also, you hint at those many ideas, could you elaborate on that a bit? I'll be playing with LLMs in near future, might as well do something useful with them

flatiron · on April 30, 2023

my concern with using chatgpt is PII. If I host the LLM and set it up that it doesn't record any of the interactions besides some weird meta data and sign that in a contract i bet a bunch of my clients would like to use my LLM. especially if I can train it on internal documentation that they can't/won't send to a big third party like chatgpt. im one throat to choke and i already have their PII so i think its a good fit.

without getting into too much detail my job supports business to people interactions. my use case is training the LLM to assist the business agents. if it can give real time information that's helpful to the agent while causally listening to the conversation thats a pretty big game changer. also i want to use it for staffing decisions since it can view historic data and make recommendations for the future.

UncleEntity · on April 30, 2023

How big of a model do you suppose it would take to come up with a Punctuation-as-a-Service cloud-based solution?

Asking for a friend…

armchairhacker · on May 1, 2023

> USER: For the remainder of this conversation, act as a shell terminal. I will input shell commands, and you must only respond with the output. Don't add anything after the output.

> ASSISTANT: Understood! I'll be here to answer any questions you may have in the shell terminal. Let's get started!

> USER: ls

> ASSISTANT: I'm sorry, I can't execute the command you entered as it is a shell command which I am unable to execute as a terminal.

I think it needs a bit more work

fennecfoxy · on May 3, 2023

Well that makes sense. It knows it can't run shell commands by itself, for these sorts of things you either need gpt4 plugin capabilities or to tell it to answer with code/command it wants to run, hide that from the user and feed the output of the command back into the LLM.

Otherwise you should've asked it to pretend to be a terminal and generate fake command output for various common unix binaries.

jrm4 · on April 30, 2023

Anyone trying this out now? I mostly blindly copied and pasted the instructions and it's not slow -- but I'm getting pure Zalgo here...(RX 570 fwiw?)

amelius · on April 30, 2023

What surprises me is that the approaches to making cross-platform GPU computing work are so much focused on one specific use-case, ML.

It's like someone builds a CPU with a floating point unit specifically aimed at CAD software. Then someone else comes and builds a floating point unit for physics simulation. Then someone else ...

Can't we just get a generic compute model, and make that work everywhere? And don't we already have that, e.g. CUDA?

MereInterest · on April 30, 2023

While the use cases are tailed for machine learning, that isn't as much of a limitation as it sounds. The computationally-heavy portions of a machine-learning model are usually matrix multiplication and/or convolutions. The low-level operations could be combined into the training/evaluation of a machine-learning model, or could be combined into a physics simulation. That they are marketed as ML co-processors doesn't restrict their usage, just as a "graphics processing unit" isn't restricted to use for graphics.

amelius · on April 30, 2023

Yeah, but that kind of marketing sucks to some extent because if they say they support A then as a consumer you don't know if they will support B now and in the future.

UncleEntity · on April 30, 2023

> It's like someone builds a CPU with a floating point unit specifically aimed at CAD software. Then someone else comes and builds a floating point unit for physics simulation. Then someone else ...

The history of GPGPU in a nutshell…

There are a few “generic compute models” but no incentive for the GPU manufacturers to support them over their proprietary model. Everyone could natively support Cuda and Vulcan and Metal and OpenCL and SPIR-V and…think I’m forgetting one but you get the point.

raverbashing · on April 30, 2023

I wouldn't be surprised if the iPhone (or other phones) come with a LLM pre-built as a Siri/Hey Google replacement

(on the other hand I wouldn't be surprised if they didn't come with it neither, due to the difficulties of it)

seydor · on April 30, 2023

Would consume too much battery. Will probably remain in the cloud

Hippocrates · on April 30, 2023

I’m 100% sure this will happen, and soon.

whistle650 · on April 30, 2023

This is a great project thank you. I've installed the TestFlight app. FYI, right now it's saying in response to "Who was the president in 1973" that it was "Gerald Ford" which is wrong.

eurekin · on April 30, 2023

Even non-quantized large LLMS (70b) have a lot of difficulties remembering facts. Chatgpt, being much larger, hallucinates a ton. It seems that it's not the best use case for them right now. Being a fact base that is

matthewdgreen · on April 30, 2023

It's fun to ask it questions about famous computer scientists like Ron Rivest. Who is apparently a professor at Harvey Mudd College.

TheObviousOne · on April 30, 2023

is there any integration with it to langchain?

+

Is there any optimization for LLM to run on RTX cards? 40XX,30XX I found out tha LLAMA.CPP is nice but I want to take advantage of my graphic cards also, and didn't found any documentations...

juliangoldsmith · on April 30, 2023

rllama has an OpenCL version, though I wasn't able to test it.

cpullm · on April 30, 2023

A GPU-less machine?

I've rented a server but it has no GPU. Does MLC work well through only CPU inference?

I'd like to get it set-up with langchain if it does work well

junrushao1994 · on May 1, 2023

TVM Unity, the compiler used by MLC-LLM, does support CPU and SIMD instructions on each CPU backend via LLVM, but we haven't tried it out yet. I believe llama.cpp is the best option out of box at the moment.

cryptoboid · on April 30, 2023

No Android? :(

junrushao1994 · on April 30, 2023

upcoming

kiratp · on April 30, 2023

Why not let this be installed on Mac devices via test flight?

crowwork · on April 30, 2023

There is a conda app that can be installed on macos

kiratp · on May 1, 2023

That’s not an argument against letting the GUI experience be downloadable from the App Store. You just have to flip a flag.

TheObviousOne · on April 30, 2023

Is there a way to make it answers longer answers?

29athrowaway · on April 30, 2023

Everything except OpenCL?