
On a technical level, they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a linear layer, and train just the tiny layer on some datasets of image-text pairs.

But the results are pretty amazing. It completely knocks OpenFlamingo and even the original BLIP2 models out of the park. And best of all, it arrived before OpenAI's GPT-4 image modality did. A real win for open-source AI.

The repo's default inference code is kind of bad -- Vicuna is loaded in fp16, so it can't fit on any consumer hardware. I created a PR on the repo to load it with int8, so hopefully by tomorrow it'll be runnable by 3090/4090 users.
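For context, the back-of-envelope weight-memory math (weights only -- this ignores activations, the KV cache, and the ViT/Q-former, so real usage is higher):

```python
# Rough weight-only memory math for a 13B-parameter model.
params = 13_000_000_000
gib = 1024 ** 3

fp16_gib = params * 2 / gib   # 2 bytes/weight: ~24 GiB, basically fills a 24 GB card
int8_gib = params * 1 / gib   # 1 byte/weight:  ~12 GiB, leaves room for everything else

print(f"fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
```

Which is why fp16 won't fit on a 24 GB 3090/4090 once you add the rest, but int8 will.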

I also developed a toy discord bot (https://github.com/152334H/MiniGPT-4-discord-bot) to show the model to some people, but inference is very slow so I doubt I'll be hosting it publicly.



Indeed, really simple. And yes, the results are shockingly good. But what I find most remarkable about this is that the ViT-L+Q-former's hidden states are related by only a linear projection (plus bias) to the Vicuna-13B's token embeddings:

  emb_in_vicuna_space = emb_in_qformer_space @ W + B
These two models are trained independently of each other, on very different data (RGB images vs integer token ids representing subwords), and yet somehow they learn to embed different data in feature vectors that are so... similar. WHY should that be the case?
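Concretely, the whole bridge between the two models amounts to something like this (an illustrative numpy sketch: the 32 query outputs per image are BLIP-2's default, and the weights here are random placeholders, not the trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dims quoted in this thread: Q-former hidden size 768, Vicuna-13B hidden size 5120.
emb_in_qformer_space = rng.normal(size=(32, 768))   # one image's Q-former outputs
W = rng.normal(size=(768, 5120)) * 0.02             # the single trained layer (placeholder)
B = np.zeros(5120)

# 32 "soft tokens" the LLM consumes as if they were word embeddings.
emb_in_vicuna_space = emb_in_qformer_space @ W + B
```

Everything trainable in the whole system is just that W and B.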

It suggests to me there may be something universal about the embedding layers and hidden states of all trained deep learning models.


I think it’s just that affine transforms in high dimensions are surprisingly expressive. Since the functions are sparsely defined they’re much less constrained compared to the low dimensional affine transformations we usually think of.


Good point. Didn't think of that. It's a plausible explanation here, because the dimensionality of the spaces is so different, 5120 vs 768. Not surprisingly, the trained weight matrix has rank 768: it's using every feature in the lower-dimensional space.

Still, it's kind of shocking that it works so well!

I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.


It would be full rank, because all of the embedding space is used. There are no large unused pockets.


The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," "greenness," etc. in the input space.

If both spaces have the same dimensionality, the learned weight matrix would be full-rank only if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
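A toy numpy sketch of that point (entirely made-up data): if the targets only span k independent directions reachable from the inputs, the learned matrix comes out with rank k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 6

X = rng.normal(size=(200, n))                          # samples in the input space
P = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))  # rank-k map by construction
Y = X @ P                                              # targets span only k directions

# "Learn" the linear map by least squares, then inspect its rank.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
rank = np.linalg.matrix_rank(W, tol=1e-8)
print(rank)  # k, not n
```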


A random nxn matrix is full rank... So it's kinda the default: any amount of noise in the embedding is going to result in full-rank transformations.
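Easy to check numerically (a quick numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Numerical rank of a batch of random square matrices: full rank every time.
ranks = [np.linalg.matrix_rank(rng.normal(size=(64, 64))) for _ in range(10)]
print(ranks)
```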

So it's really less-than-full rank which would require an explanation - ie, why does this image representation project into this perfectly isolated subspace of the language representation (or vice versa)?

If that happened I would start looking for things like a vocabulary of smell which is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...


> But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities...

That's a really good point. Thank you!


>somehow they learn to embed different data in feature vectors that are so... similar

At its core, BLIP2 already projects RGB inputs into text token space, and Vicuna (or rather LLaMA) uses such tokens as both inputs and outputs. The only reason a linear layer is needed at all is that they are not trained at the same time, so you still have to move text embeddings from one space to another. But it should not be surprising at all that one hidden linear layer suffices to do just that (see the universal approximation theorem [1]). This approach is just an efficient way to combine different models for downstream fine-tuning tasks while keeping their weights frozen, but it is neither new nor particularly surprising.

[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...


Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.


>I don't see how the universal approximation theorem is relevant here. The fact that deep models

The universal approximation theorem is precisely not about deep models. Deep means many layers. But in the simplest (and proven) case, a single-hidden-layer perceptron is all the UAT needs. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.


You'd need to increase width (dimensionality) if you make these models shallow.

My point still stands: The fact that models with sufficient capacity can approximate any function does not imply that two models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.


The UAT states that depth is fundamentally not important, at least theoretically. It only has immense practical uses. So adding an intermediate linear layer + some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping. At least as long as it's somewhat continuous.
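A quick sketch of that expressivity claim (random frozen ReLU features plus a fitted linear readout; this demonstrates capacity on a grid of points, not training dynamics or generalization):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]   # a grid of inputs
y = np.sin(x).ravel()                  # an arbitrary smooth target function

# One hidden layer: random frozen ReLU features, width N.
N = 1000
w = rng.uniform(-4, 4, size=(1, N))
b = rng.uniform(-4, 4, size=N)
H = np.maximum(x @ w + b, 0)

# Fit only the linear readout, by least squares.
c, *_ = np.linalg.lstsq(H, y, rcond=None)
err = np.abs(H @ c - y).max()          # near zero on this grid
```

One wide nonlinear layer plus a linear map nails the function, without the hidden weights ever being trained.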


BLIP2 is a contrastive Image-Language model. The embeddings from the BLIP2 image model are already both aligned with text, and linear. It should not be a surprise that only a projection is required to translate it to LLaMA's embedding space.


apparently you can project directly with CLIP. See here - https://llava-vl.github.io/. This seems pretty wild to me.


That seems pretty wild to me too.


This is the best answer. It makes sense to me. Thank you :-)


See also https://llava-vl.github.io/. Just found this paper from a few months ago that demonstrated the same thing (that language and vision models somehow learn representations similar enough that a linear projection suffices): https://arxiv.org/abs/2209.15162


Thank you for sharing this. I would not have expected that. It does seem pretty wild.


Man you need to look at this - https://llava-vl.github.io/. They project with a linear layer from Clip directly. With blip-2, you could say it already converts RGB into token space.


>so hopefully by tomorrow it'll be runnable by 3090/4090 users.

Taking a step back, this is just a wild statement. I know there's some doom and gloom out there, but in certain aspects, it's an awesome time to be alive.


It already runs!


I've never seen anything quite like it.


Then it's an impressive demonstration of how modular neural networks can be. Maybe we don't even need to train monoliths


Maybe a distributed trainer? AI@Home?


> they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a linear layer, and train just the tiny layer on some datasets of image-text pairs

Oh yes. Simple! Jesus, this ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.


> This ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.

Just like any discussion between advanced web devs would make any humble woodworker feel?

And just like any discussion between advanced woodworkers would make a humble web dev feel?

"It's really simple, they're just using a No. 7 jointer plane with a high-angle frog and a PM-V11 blade to flatten those curly birch boards, then a No. 4 smoother plane with a Norris-type adjuster and a toothed blade for the final pass."

Whut?

"You could use Webpack to bundle your HTML, CSS and Babel-transpiled TypeScript 5 down to shim-included Ecmascript 4", "They're just using OAuth2 authentication with Passport.js and JWT tokens, which easily gets you CSRF protection", "Our e-learning platform uses LMS.js and xAPI.js, plus SCORM for course packaging and Moodle as the LMS backend.", ...

There was a time you didn't know what any of that meant.

Just because you don't know what the words mean shouldn't make it sound difficult. Not saying AI is easy, just that the jargon is not a good indication of difficulty and we should know better than to be so easily mystified.


Hey, guys. Hey. Ready to talk plate processing and residue transport plate funneling? Why don't we start with joust jambs? Hey, why not? Plates and jousts. Can we couple them? Hell, yeah, we can. Want to know how? Get this. Proprietary to McMillan. Only us. Ready? We fit Donnely nut spacing grip grids and splay-flexed brace columns against beam-fastened derrick husk nuts and girdle plate Jerries, while plate flex tandems press task apparati of ten vertipin-plated pan traps at every maiden clamp plate packet. Knuckle couplers plate alternating sprams from the t-nut to the SKN to the chim line. Yeah. That is the McMillan way. And it's just another day at the office.


This post is double great and I will never forgive Amazon for canceling that show.

For those that don't know this is from a show called Patriot.

https://en.wikipedia.org/wiki/Patriot_(TV_series)

Scene: https://youtube.com/watch?v=-F-IHvF5OCA


I remember seeing someone link to that scene recently as a joke on Twitter (about Twitter trying to explain Twitter Blue). Within a few days I’d watched the entire series… absolutely phenomenal show.

Edit: ah I actually saw the prior scene where Leslie was explaining to John what he expected (which is the setup for the linked bit): https://www.youtube.com/watch?v=G7Do2tlYLhs


Just tell me do we need a turbo encabulator or not?


I'll take 2


Talk dirty to me!


runtime polymorphism


The thing is, machine learning sorta requires a few math prerequisites: linear algebra, differential equations, and to some degree vector calculus. Most web developers don’t have this background.


If you want to understand the theory, that's true. If you want to develop an intuitive understanding without having to understand all the nuts and bolts (and I understand that can be a big ask for how some people learn/understand), give this a try: https://karpathy.ai/zero-to-hero.html


The irony is Karpathy presents the limit/epsilon definition of derivatives in the first half hour (quite well IMO and he never actually says “epsilon”) which is very much a nuts and bolts kind of explanation in calculus.

That said, when most people say differential equations they’re usually thinking of analytical solutions which is very much not necessary for practical ML.


I dunno, I just watched it and he almost immediately throws out the formality of the delta-epsilon definition and starts using infinitesimals.

Thank god.


I would say the limit-epsilon derivative is exactly the sort of thing the grandparent post is talking about. It's quite intuitive and hardly requires any mathematical foundation at all, other than basic geometry and algebra. You can understand topics that build on that simple concept without understanding the more formal derivative definitions.


Web devs have become blue collar!? =P

Great idea, actually. I do hope for a curriculum that enables kids on the trade school path to learn more about programming. Why not Master/Journeyman/Apprentice style learning for web dev??


That's kind of how I think about bootcamps pumping out web devs. They're like trade schools, teaching you just enough fundamentals to know how to use existing tools.


I kind of agree, but I'd add that I don't think it's a bad thing.


Mostly agree... though I don't think the bootcamps get enough fundamentals in. Not to mention that succeeding as a productive employee in this space takes the type of person who will go above and beyond what has been assigned. I'm self-taught, and in the first years of my career I spent countless hours reading, practicing, and solving problems. I still spend a good 10-15 hours a week reading and exploring software development, and try to at least keep up with what's out there. In the end, the best you can do is be aware of what options are out there, or even just that they are.

I can't imagine starting out today...


You make a good point. Except that a number of these concepts and tools in the ML world have been slingshotted into the forefront in a relatively short time, and it has been hard to play catch-up. For example, someone said "frozen Vicuna" below - what does that mean?



Okay, I won't mention how much is wrong in the webdev statement... :-D


I love your analysis.


> take BLIP2's ViT-L+Q-former

This thing takes an image and creates a representation matrix.

> connect it to Vicuna-13B with a linear layer

Vicuna is an open LLM, pretty good quality, not as good as GPT3.5 though.

This is the beautiful part - a mere multiplication is enough to convert the image tensor to text tensor. One freaking line of code, and a simple one.

> and train just the tiny layer on some datasets of image-text pairs

You then get a shitload of image-text pairs and train the model to describe the images in text. But keep both the image and text model frozen. Is that hard? No, just flip a flag. So this "linear projection layer" (a matrix multiplication) is the only learned part. That means it takes less time to train, needs fewer examples and requires less memory.

Training the image and text models was much more difficult. But here these models aren't trained; they're used as ready-made parts. It's a hack on top of two unrelated models, so it is cheap.

In the end, the finishing touch: they label 3,500 high-quality image-text pairs and fine-tune on them. Now the model becomes truly amazing. It has broad visual intelligence, and it scooped OpenAI, who haven't released GPT-4's image modality in the API yet.

The important lesson to take is that unrelated models can be composed together with a bit of extra training for the glue model. And that open AI is just as powerful as "Open"AI sometimes. It's breathing down their necks, just one step behind. This model is also significant for applications - it can power many automations in a flexible way.


> This is the beautiful part - a mere multiplication is enough to convert the image tensor to text tensor. One freaking line of code, and a simple one.

I thought they were creating image tokens based on the queries during finetuning and appending them to the language model's input. They are not text tokens.


In practice, it's a lot more like web dev than you might imagine.

The above means that the approach is web-dev like gluing, almost literally just,

    from existingliba import someop
    from existinglibb import anotherop
    from someaifw import glue

    a = someop(X)
    b = glue(a)
    Y = anotherop(b)


And just like webdev, each of those were done in a different platform and require arcane incantations and 5h of doc perusing to make it work on your system.


This is why the Hugging Face transformer ecosystem is so good, as each of those blocks will roughly have the same unified API.


You can just ask GPT how to do it. Much like a lot of web dev!


And the code GPT gives you won't work, much like a lot of web dev? ;P


Maybe it's because of how I use it, but the code ChatGPT gives me has always been super helpful and 99% correct. But we have a policy at work not to use it for work product, so I have to spend time changing enough of it that it's different, and I'm never copy/pasting anything: enough changes to the structure and variables that it can't be considered pasting company data into GPT. Then I ask my question(s), see what comes back out, refactor/type it manually into my IDE, and test.

I'd say one out of every 8-9 times I get something objectively wrong - a method that doesn't exist, something not compiling, etc. But it's faster than using Google/DDG, especially with some prompting so that it just spits back code and not 5th-grade-level explanatory paragraphs before and after. And well over half the time it does exactly what I need, or close enough that my initial refactoring step gets me the rest of the way.


Would you say that this satisfies the spirit of the company policy? Or is it a bit of a hack to get around it?

I ask because we are about to produce a similar policy at work. We can see the advantages of it, but likewise, we can't have company data held in their systems.


If I use it I also make sure it’s something completely non-core business, like an arcane piece of sorting or ugly rxjs construction.

I get the IP angst, but some companies think their GetGenericObjectFromDB() REST bs is secret sauce.


To the average VC a computer switching on is secret sauce enough, the rest is really just an implementation detail.


The policy is to not send any "sensitive company data" into ChatGPT, which I 100% agree with. How we implement a given Vue component or a particular API isn't sensitive or particularly novel so if I strip the business logic out I do honestly believe I'm complying with the spirit of the policy.


all the more proof that AI has reached the point where it can replace most web dev. You get what you train for ;P


At some point someone will make a service where you can let AI take over your computer directly. Easier that way! Curling straight to shell, taken to the next level.


So...AutoGPT? Now with command-line access! Have fun :)

https://github.com/Significant-Gravitas/Auto-GPT/


Found my next hobby project


Buddy this ain't 2022 anymore, ask chatgpt (with a plugin that can read docs).


It's more like gardening:

    1. plant seed
    2. ...wait a very long time...
    3. observe completely unexpected but cool result
The unexpected part of step 3 is what makes this very different from any kind of engineering, even webdev.

Of course, there is a lot of engineering involved in good ML, but that is more comparable to agricultural engineering in the sense that it's just a lot of dumb plumbing that any engineer can do without knowledge of the actual application.


I mean, for me, the unexpected part of 3 is what got me into programming in general. The first time you type a mysterious incantation into an editor and a few more mysterious incantations into the console and the console prints "Hello, world" like it was supposed to, it's unexpected because it's hard to believe that any of this mysterious incantation stuff actually works at all.

As you get better at programming you have to take on harder problems to create the surprise of something working, because you gain confidence, and as you gain confidence, you start expecting your code to work. It's only when you've compiled the thing 6 times with small corrections and gotten segfaults each time and the 7th time you finally find the place you weren't updating the pointer and you correct it, but this is the 7th error you've corrected without the segfault going away, so you don't really expect it to fix the problem, but then you run it and it's fixed!

And then you get a job and the reality is that most of the jobs you're just writing CRUD apps and for a little while you can get some surprise out of learning the frameworks, but eventually you actually get really, really knowledgeable about the Postrgres/Django/React stack and nothing surprises you any more, but because nothing surprises you any more, you're really effective and you start being able to bill the big bucks but only for work on that stack because it takes time to struggle enough to get surprised, and the time that takes means your time is worth less to your clients. Money ruins everything. And if you don't do anything non-billable, it's easy to forget what programming felt like when you didn't know how your tools all worked inside and out. Not everyone takes this path but it's certainly the easiest path to take.

I think for a lot of folks who have been doing this for a long time, the reason ML is so exciting is it's getting them back out of their comfort zone, and into a space where they can experience surprise again.

But that surprise has always been available if you continue to find areas of programming that push you out of your comfort zone. For me it's been writing compilers/interpreters for programming languages. Crafting Interpreters was awesome: for the first time I benchmarked a program written in my language against a Python program, and my program was faster: I never expected I'd be able to do that! More recently, I wrote a generational GC. It's... way too memory-intensive to be used in my language which uses one-GC-per-thread for potentially millions of threads, but it certainly was a surprise when that worked.

Personally, I'm keeping track of ML enough to know broad strokes of things but I'm not getting my hands dirty with code until there are some giants to stand on the shoulders of. Those may already exist but it's not clear who they are yet. And I've got very little interest in plugging together opaque API components; I know how to make an API call. I want to write the model code and train it myself.


I like how you've expressed this insight, and it is so true.

Becoming great at a particular technology stack means modelling it in great detail in your head, so you can move through it without external assistance. But that leaves an arena without discovery, where you just reinforce the same synapses, leading to rigidity and an absence of awe.


count me in :)


And repeat that ~4 times to make it look like LangChain


There is a little more to it than that. Abstractions in ML are very leaky.


I've only been reading ML stuff for a few months and I kind of understand what it's saying. This stuff isn't as complex as it's made out to be.

It's just a bunch of black boxes AKA "pure functions".

BLIP2's ViT-L+Q-former AKA

    // Give it a picture of a plate of lobster and it will say "A plate of lobster".
    getTextFromImage(image) -> Text

Vicuna-13B AKA

    // Give it a prompt and it returns a completion, ChatGPT style.
    getCompletionFromPrompt(text) -> Text

We want to take the output of the first one and then feed in a prompt to the LLM (Vicuna) that will help answer a question about the image. However the datatypes don't match. Lets add in a mapper.

    getAnswerToQuestion(image, question) -> answer
        text = getTextFromImage(image)
        prompt = mapTextToPrompt(text)
        return getCompletionFromPrompt(prompt)

Now where did this mapTextToPrompt come from?

This is the magic of ML. We can just "learn" this function from data. And they plugged in a "simple" layer and learned it from a few examples of (image, question) -> answer. This is what frameworks like Keras and PyTorch allow you to do. You can wire up these black boxes with some intermediate layers, pass in a bunch of data, and voila, you have a new model. This is called differentiable programming.

The thing is you don't need to convert to text and then map back into numbers to feed into the LLM. You skip that and use the numbers it outputs and multiply directly with an intermediate matrix.

    getAnswerToQuestion(image, question) -> answer
        image_embedding = getEmbeddingFromImage(image)
        llm_embedding = mapEmbeddingToInputEmbeddingForLLM(image_embedding)
        return getCompletionForEmbedding(llm_embedding)
Congratulations, you now understood that sentence.
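If you want to see the "learn only the glue" idea end to end, here's a purely illustrative numpy sketch: the "image features" and "LLM-space targets" are random stand-ins for the two frozen models, and only the projection W and bias B are updated:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_llm, n = 8, 12, 500

# Stand-ins for the frozen parts: made-up "image features" X and the
# "LLM-space" targets Y they should map to (generated from a hidden
# ground-truth affine map, just so there is something to learn).
X = rng.normal(size=(n, d_img))
true_W = rng.normal(size=(d_img, d_llm))
true_B = rng.normal(size=d_llm)
Y = X @ true_W + true_B

# The only trainable parameters: the glue projection W and bias B.
W = np.zeros((d_img, d_llm))
B = np.zeros(d_llm)
lr = 0.1
for _ in range(1000):
    resid = (X @ W + B - Y) / n        # gradient of 0.5 * mean squared error
    W -= lr * X.T @ resid
    B -= lr * resid.sum(axis=0)

mse = np.mean((X @ W + B - Y) ** 2)    # ~0: the glue layer has been learned
```

Both "models" stay frozen the whole time; gradient descent only ever touches the tiny projection.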


Interesting, so the LLM is "just" getting your question plus a normal text description of the image (as vectors)?


At a high level yes.

More precisely - it gets the question, after the image's text description has been passed through a matrix that transforms it so the LLM can "understand" it.

It maps from the space of one ML model to the other.


This feels like such an accessible explanation.


Thank you for the insightful breakdown. Cheers!


Just get rid of all the abbreviations in your mind - they seem to be very intimidating. I really liked the explanation that Stephen Wolfram did on ChatGPT:

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

Maybe someone has resources to understand machine-learning on an ELI5 level.


Wow, he waits until halfway through the article to mention A New Kind of Science. Usually he works it into the first couple of paragraphs!


I known it’s hard to believe but I sense LLMs have slightly knocked his ego down and injected a small dose of humility.

https://youtu.be/z5WZhCBRDpU

I pick that up in above video and also in the post above.

Definitely healthy for him which just to be clear I’m a huge Wolfram fan and the ego doesn’t really bother me, it’s just part of who he is, however I do find it nice that LLMs are having him self reflect more than typical.


Not a big Wolfram fan myself. I gave him the benefit of the doubt and bought "A New Kind of Science" (freakin' expensive when it first came out), and read the whole 1280 pages cover to cover ... Would have been better presented as a short blog post.

I find it funny how despite being completely uninvolved in ChatGPT he felt the need to inject himself into the conversation and write a book about it. I guess it's the sort of important stuff that he felt an important person like himself should be educating the plebes on.

Predictably he had no insight into it and will have left the plebes thinking it's something related to MNIST and cat-detection.


I just happen to read this article of him, which I found easy to understand. I'm neither a huge proponent nor opponent of the likes of his work. Or, bluntly speaking: I don’t know much else about his reputation in the community.


Seriously, ChatGPT was the thing that gave me a foothold into the AI/machine learning world... because it gave me hope that a mere mortal can achieve something reasonable with this tech without a crazy amount of work and educational background.


There are really great resources now from eli5 about all of this tech to books like ‘the little learner’ which any programmer can get into. Yes, it takes effort but it is a great time for it.


I don't have much experience myself. I only started ~10 months ago -- just a month or two before Stable Diffusion.

You just have to do it every day. It's fun!


Can you recommend what kind of small daily activities would help a web dev get into it?


Regardless of what you want to learn, "small daily activities" is a bit hard. You can learn some stuff by osmosis, following the feeds of AI devs && AI channels, but the bulk of what I learn comes from starting projects & digging into code & reading papers.

If you can hold attention span over several days (I can't), work on a project bit-by-bit. Just make sure it uses modern AI stuff, and that you have smart people to talk around with.


Big "a monad is just a monoid in the category of endofunctors" vibes from this one.


But that's literally what it is...


I was where you're at about ... oh wow, it's been almost ten years since I jumped into machine learning. Mind you, I've been learning on the side most of this time other than a theoretical class at the University of Minnesota. But, that aside, and depending on where you're at in your understanding, this is a great resource for catching up if you're really interested: https://karpathy.ai/zero-to-hero.html it was posted on HN a couple of weeks ago and I have to say it's a really good introduction and Andrej Karpathy is a passionate and excellent teacher. You may want to brush up on some intro Calculus, but it's very understandable.


Only because of big complicated sounding terms, that also exist in web dev.


> like a dog trying to read Tolstoy

this got a chuckle out loud from me. great visual.


This could be a great prompt to test the limits of txt2img models. The astronaut riding a horse got boring already :)


FWIW I work in LLMs and I consistently fail to do simple webdev stuff


Maybe you're just holding it wrong: You're not supposed to let your LLM rest or chat idly while you do the webdev stuff yourself, but to make your LLM do the webdev stuff for you ;P


Web stuff probably makes ML devs feel the same way.

ML is just a different field, using a different set of technologies from those you’re familiar with.


The best ML PhDs can’t do what frontend devs can: understand CSS :D


Arf!


It sounds like a BLIP2 with an extra linear layer for finetuning (or aligning the Q-former with a new LLM?). What makes it more powerful than BLIP2?


It's better because

1. It's using Vicuna as a base.

2. It has a pretty high quality fine-tuning dataset. I initially missed this, and it's a very important advantage.

3. (speculatively) it doesn't collapse to extremely short responses (as BLIP2 and other models trained on image-text caption pairs do), because of how small/simple the adapter is.

I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).


Can any of this realistically run on CPU at some point?

(Not training obviously)


I'm developing a framework [1] in Golang with this goal in mind :) It successfully runs relatively big LLMs right now, and diffusion models will be the next step

[1] https://github.com/gotzmann/llama.go/


Yes, you can run inference at decent speeds on CPU with llama.cpp. A token is about 0.75 words, so you can see lots of people getting 4-8 words/s on their CPUs: https://github.com/ggerganov/llama.cpp/issues/34

There are a lot of optimizations that can be done. Here's one with potentially a 15X AVX speedup, for example: https://github.com/ggerganov/llama.cpp/pull/996


quantized Vicuna runs ok-ish on my 16GB i7 laptop (onboard graphics) and the output is usable

see this comparison: https://old.reddit.com/r/LocalLLaMA/comments/12ezcly/compari...

these models quantized to 4-bit should run on CPU setups with 16GB of RAM + 16GB of swap (Linux), and other setups may run similarly


I've run LLaMA models on my CPU before; ViT-L and the Q-former are two transformer models as well, so I can't see why they wouldn't run on a CPU.


Someone is probably going to port it to llama.cpp soon.


It will do, probably quite soon. Many people are trying.


This opens up huge possibilities. We could likely plug in Stable Diffusion with a linear layer as well, plus Whisper and some TTS, and get a back-to-back mixed image/sound/text engine running on a laptop.

I wonder if there's a ViT model powerful enough to do OCR.


> I created a PR on the repo to load it with int8, so hopefully by tomorrow it'll be runnable by 3090/4090 users.

How about 2x3090? Can it be run on multiple gpus?


With fp8, would 4GB be enough or is 6GB more like it?


Thanks for a useful comment.

Do you reckon the 4bit quantized Vicuna just won't do here? https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-1...

I think with this, everything OpenAI demonstrated ~5 weeks ago has been recreated by actually-open AI. Even if it runs much, much slower on prosumer hardware and with worse results, at least it is de-magicked.


It'll work! I just haven't touched any of the 4bit stuff myself, so I don't personally know how to add it. Great low-hanging fruit for anyone else to take on.
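For anyone curious, the storage idea behind 4-bit weights is just round-to-nearest with a per-row scale (a toy sketch; GPTQ, as in the linked checkpoint, is smarter about compensating the rounding error, but stores the same kind of thing):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)      # a pretend weight matrix

# Symmetric round-to-nearest int4: one scale per row, integer values in -8..7.
scale = np.abs(w).max(axis=1, keepdims=True) / 7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)

w_hat = q * scale                                    # dequantize at inference time
max_err = np.abs(w - w_hat).max()                    # bounded by scale / 2
```

You store q (4 bits each) plus one scale per row, then dequantize on the fly; that's where the ~4x memory saving over fp16 comes from.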


The magic is in the quality of GPT-4 output. That hasn’t been recreated yet.


Open-source AI still hasn't exactly reached the level of GPT-3.5. GPT-4 is way ahead of anything else.



