How is ChatGPT's behavior changing over time? (arxiv.org)
289 points by tim_sw on July 19, 2023 | 178 comments


I think we should stop trying to quiz LLMs on mathematics, something they are explicitly not designed to do given their tokenized view of the world. Ask GPT-4 to use its Wolfram plugin and it returns the answers quickly and correctly.

Second, I think the code generation bit of this paper is blown out of proportion. The code can't be immediately injected into a codebase due to a formatting change (triple quotes). I'd be more interested in changes to the quality and performance of the code generated, not whether it can be easily copy/pasted from the page.

I could not replicate that result (but could replicate the other math issues). I've also never seen the triple quotes in results so it's unclear if there was a temporary presentation bug.


Seriously. GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you. Wtf is the point. We have computer things to do every math problem. It’s a waste of energy to use LLMs for it in my opinion.


"We" are never going to stop trying to use LLMs for math. They are obviously mimicking a smart person. What do you do with smart persons? You query them with bad and lazy questions (often without being too honest about that to yourself), and hope/expect helpful answers.

Because that is how it works. The limiting factor so far has been the smart person's time and patience. Now, no longer.

People moderating their LLM usage is never happening, from here on out until the end of civilisation. Any LLM service that is designed around that is done. You need to make lazy questions efficient. People do not care how complicated your SQL query is, and they never will. People will not give up on energy, meat, or cars as long as they feel they are giving something up.

People will never think twice to not make your LLM think twice.

If it seems useful and convenient, people will use it. If it's not giving good answers to lazy questions out of the box, they will go to the thing that does.


>"We" are never going to stop trying to use LLMs for math. They are obviously mimicking a smart person.

A lot of words to make a big deal out of nothing. All that is needed is some new abstraction layer that identifies a math question and then proxies it over to the Wolfram plugin. That’s it.

We don’t have crazy debates over whether a polygon should be rendered by the CPU or the GPU. We solved this problem.
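For illustration, a minimal sketch of such a dispatch layer, with toy stand-ins for both the Wolfram plugin and the LLM call (everything here is hypothetical):

  # Toy dispatch layer: detect math-looking prompts and route them to a
  # deterministic solver instead of letting the LLM do arithmetic itself.
  import re

  MATH_PATTERN = re.compile(r"\d+\s*[-+*/]\s*\d+")

  def solve_deterministically(prompt: str) -> str:
      # Stand-in for the Wolfram plugin; just evals the bare expression.
      expr = MATH_PATTERN.search(prompt).group(0)
      return str(eval(expr))  # fine for a demo; never eval untrusted input

  def ask_llm(prompt: str) -> str:
      return "LLM answer for: " + prompt  # stand-in for a model call

  def answer(prompt: str) -> str:
      if MATH_PATTERN.search(prompt):
          return solve_deterministically(prompt)
      return ask_llm(prompt)

  print(answer("What is 1444 * 25?"))  # routed to the solver: 36100
  print(answer("Who wrote Hamlet?"))   # routed to the LLM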


So a Mixture of Experts model; but then isn't ChatGPT using that already? Why are the models so bad at math despite clearly being trained on academic papers and books - and why do they hallucinate and make up non-existent citations?


Because they don’t actually reason about things; they just map words based on probability. If they’ve seen a math problem enough times they may get it right, but using that knowledge to solve a new problem isn’t likely.


I have bad news about how "solved" the CPU/GPU rendering split is.


It’s not about the results, it’s about its ability to “reason”. Math is about as close to pure reasoning as we get, so I don’t get the pessimism.

If it is bad at math and can’t be taught, then you have a fundamental problem. It’s a matter of time before this limit gets hit in other domains.


This seems fair enough to me. That ChatGPT currently struggles with certain types of maths problems points to reasonably fundamental shortcomings in what otherwise appears to have the beginnings of a general purpose reasoning engine (whether you consider it AGI or not), and I'm willing to bet extraordinarily clever minds are working hard on trying to address those shortcomings.


I don't think that trying to shoehorn LLM into being AGI is the right path. Maybe some are trying to achieve this, or are hoping for this... but IMO this is trying to fit a square peg into a round hole. It's definitely an element in the overall puzzle, I think we can all agree, though. Even so, rather than warping this hammer to also screw in a bolt, why not combine it with another tool more fit for the job?


Your thinking is very emotional and shows your inability to understand that "conversational output" is basically an accidental side-effect of a complex tool that just replicates our speech without understanding what it is saying.

Through continuous use, I have found that it does not "reason". That doesn't mean it's not valuable in many ways, and I have found it to be very helpful in a multitude of diverse applications, including helping me reflect on my own life through my own interpretations of its output. It's also a great interface for JSTOR, wikipedia, and basically any language learning.

I'm having a hard time with the jump from "this must be a calculator" to "this must be a philosopher" before it can be useful. When did we ever have those requirements for a tool?

This tool is just not made for math. Most of its logic processing abilities seem to surpass mine if I am only given 5 minutes to understand a problem. If you understand the tool, you will get the most out of it. Stop anthropomorphizing it, and stop pretending that it can't generate both highly beneficial or highly harmful content simply because it doesn't have a soul/d*ck or whatever.


In my experience, it can reason usably well but acts like it has dyscalculia. It'll set up a proper algorithm, step through it and trip over digits.


Yes, and the reason it trips over digits is that those digits wind up being tokenized in ways that seem unexpected to us, and would produce the same difficulty if we were presented with them:

  100,000 + 987 - 1444 * 25,945.842 / 0.0042

becomes

  "100"  one hundred
  ","    comma
  "000"  triple zero
  " +"   space plus
  " 9"   space nine
  "87"   eighty-seven
  " -"   space minus
  " 14"  space fourteen
  "44"   forty-four
  " *"   space times
  " 25"  space twenty-five
  ","    comma
  "9"    nine
  "45"   forty-five
  "."    period
  "8"    eight
  "42"   forty-two
  " /"   space divide
  " 0"   space zero
  "."    period
  "00"   double zero
  "42"   forty-two

Now imagine someone reading that to you over the phone once and asking you to do the math in your head and you aren't allowed to use paper and pencil and you have to get it right the first time.
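You can check the actual splits with OpenAI's tiktoken library; the exact token boundaries depend on the tokenizer version, so treat the breakdown above as illustrative:

  # Show how GPT-4's tokenizer slices a numeric expression into fragments.
  # Requires `pip install tiktoken`.
  import tiktoken

  enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base encoding
  expr = "100,000 + 987 - 1444 * 25,945.842 / 0.0042"
  for tok_id in enc.encode(expr):
      print(tok_id, repr(enc.decode([tok_id])))  # token id and its text fragment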


"It'll set up a proper algorithm, step through it and trip over digits."

Or it is just pretending to do so. And since it pretends, of course it trips over all the small things, as it does not understand them.


I don't think you can pretend to properly describe and evaluate an algorithm, any more than you can pretend to solve a riddle - the answer is either right or it isn't.

And in this case, the shape of the answer is often right; it just makes ... ordinary errors. Ironically, the AI is a lot better at high-level thinking than correct calculation.


"Ironically, the AI is a lot better at high-level thinking than correct calculation."

That would actually be a human-like feature... except I do not consider what LLMs are doing to be thinking.


> GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you.

I don't think this is a great analogy. If your 737 couldn't drive on the ground and your astrophysicist couldn't answer basic maths questions, I wouldn't want to fly in that plane or put much faith in the astrophysicist's answers to more complex questions.

Maybe maths is not a particular strength of LLMs, but asking questions where it is easy to judge the factual accuracy of the responses seems a pretty reasonable test to be running.


> asking questions where it is easy to judge the factual accuracy of the responses seems a pretty reasonable test to be running.

It isn't reasonable if that isn't what the system was designed to do.

It would be a poor test of my general practitioner's competence to ask him calculus questions and conclude he doesn't know what he's talking about because he can't answer them.


Given that a drug's concentration in your bloodstream is absolutely critical when prescribing it, and considering that this is calculated using calculus, you'd better hope your GP does actually know their calculus!
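For what it's worth, the calculus in question is typically first-order elimination; a toy sketch with made-up numbers:

  # C(t) = C0 * exp(-k*t), with k derived from the drug's half-life.
  import math

  half_life_h = 6.0
  k = math.log(2) / half_life_h   # elimination rate constant
  C0 = 100.0                      # initial concentration, arbitrary units
  t = 12.0                        # hours after the dose
  print(C0 * math.exp(-k * t))    # ~25: two half-lives later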


My point was that a 737 can be a land vehicle, but a very bad one, because it's optimized (to an extreme degree) for flying. I fly on 737s all the time, knowing they have terrible stopping distances and cornering. The 0-60 could be decent, but you can't really accelerate all-out to try it, or you'll overshoot and end up going 175 and crashing.

The astrophysicist can do long division in his head, but he'll be about as fast and accurate as the next person, because he doesn't practice arithmetic every day.

I agree with the commenter somewhere in my thread who said all an LLM should be optimized for is to classify the type of problem and feed it into a purpose-built, deterministic solver that is trained to interpret math as math and not as language, be it ML-based or algorithmic.


A better example would be - calling your guitar lessons teacher for help on a statistics problem.


Possibly more like instantiating the statistics problem by getting, say, every 3 members of an orchestra to represent a triplet of doors in the Monty Hall problem and then making a ball-park guess what the results are from a seat in the middle of the back row.


The point of doing $non-llm-optimal-thing on an LLM is the hope that it lets you skip the formal syntaxes, which are mentally taxing.

It's far easier even for an expert to communicate what they want in natural language than it is in a formal syntax for all but the most trivial things.

It should be a goal of these tools to do this correctly.


> the hope that it lets you skip the formal syntaxes, which are mentally taxing

Agree completely, but the LLM should then be focused strictly on formulating an "execution plan" of sorts and handing that off, not on performing the math itself.

In other words, when asked "if i have 349 blueberries and one blueberry turns into a cherry per hour, how many of each fruit will I have in 93478 minutes?" it shouldn't be doing the actual arithmetic, but it should be figuring out what arithmetic would need to be done.
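Concretely, the plan for that toy question is just a couple of integer operations; a sketch of what a deterministic executor would run (assuming whole-hour conversions that stop once the blueberries run out):

  # The arithmetic the LLM should delegate rather than perform itself.
  blueberries = 349
  minutes = 93478

  hours_elapsed = minutes // 60                # 1557 full hours
  converted = min(hours_elapsed, blueberries)  # can't convert fruit we don't have
  cherries = converted
  blueberries_left = blueberries - converted

  print(blueberries_left, cherries)            # 0 blueberries, 349 cherries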


This is my second ChatGPT comment so I hate to sound like an evangelist, but it basically does this, reproducibly:

You describe a complex relationship between several nodes (people, cities, etc.), and then ask ChatGPT to draw the relationship as a graph data structure. It will create a formula that Mathematica can render, and then send the formula to Mathematica before presenting an ASCII drawing. Usually. Sometimes it just complains that it can't draw and explains what the graph looks like to you in a written formal/human-readable syntax.

In other words, yes it just summarizes the problem, converts it to a formal syntax, and sends it off to some other tool.


> Seriously. GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you. Wtf is the point. We have computer things to do every math problem. It’s a waste of energy to use LLMs for it in my opinion.

Using GPT to do maths is probably like using a 737 to drive around on the ground.

Teaching GPT to do maths might be like teaching a child the times tables - a skill that can help overall reasoning.


Well that's... a claim. What's your reasoning for operations on digits within the model itself improving any current metrics?


Well I didn't really make a bold claim, I said "might", but presumably there is a reason why we teach kids basic mental maths before we teach them how to use a calculator.

Humans are different to an AI, but putting that aside, my intuition would be that if we never taught kids any mental maths, their concept/understanding of numbers would be fundamentally different to how it is if they learn that 9 x 9 = 81 (also look at how your fingers move - there is a relationship there!).

But who knows, AI is strange and there's lots of stuff that needs to be experimented with. I would think training an intuitive sense of numbers would have other fallout though. This is half the beauty of LLMs, right? You show an LLM some history books and it also learns about biology, politics, grammar, etymology and love. You teach an LLM maths and it also learns ... ?


Apparently, models finetuned for coding are better at logical inquiries than those that are not.


> GPT doing math is like using a 737 to drive around on the ground, or if you had the phone number of a prominent astrophysicist and you call him to do long division for you.

One significant difference is that in both of those examples it is (or quickly becomes) plain why it’s a ridiculous idea. Even if you don’t understand it yourself, you’ll get external feedback fast. Not so with LLMs, where even people with technical needs may fail to see what is or isn’t a good use of the tool. Case in point: https://news.ycombinator.com/item?id=36782446

“You’re holding it wrong” isn’t a valid argument in perpetuity. At a certain point it becomes the fault of the designer, not the user.


My guess is that for most folks good at math => you're smart, so the most obvious way to test an LLM is to ask it a math question, which is trivial to construct. Another motivation IMO is that asking a relatively simple math question and seeing an LLM fall flat kind of makes many people feel good about themselves and gives them a chance to show it off on social media.


Asking models to do math is kind of an effective way to measure their capabilities, especially in reasoning and abstraction, which are quite important for problem solving.


You don't need reasoning and abstraction to do basic calculation. ChatGPT will, however, happily give you some decent answers about not-too-hard math that requires reasoning. It just won't operate on digits.

Those are completely different ideas.


What is the point of doing anything if you can't use flashy technologies? Leave that to the old people. /s


I used GPT-4 to generate a non-cryptographic random 64-character string. It was faster to ask GPT-4 for the string than to ask GPT-4 for the instructions to generate the string from my terminal. GPT-4 was faster than Google.


That definitely won't be random.


It's probably 4, that's pretty random.


Close enough for the purpose at hand!


Were you in a room that had its walls slowly caving in like in Indiana Jones, or why does it matter that it was faster?


Close! Was in a meeting and needed to see the behavior of a system when passed a string greater than 64 char.

My mental capacity was used elsewhere - using ChatGPT let me answer an important customer question authoritatively.


I don't think this will be a truly random string.


Wouldn't it take less time to just type one yourself? Like, open Notepad, hit the keys like a deranged monkey, select the first 64 characters, done?


That is very much not random, but generally random enough for 99% of all use cases.


It would be more random than an LLM's output


I would not bet on whether chatgpt's results would be more or less random than this process.


Aren't ChatGPT's results deterministic if you use the same seed, temperature, and other parameters?


Surely navigating to random.org is faster than typing a prompt.


That’s likely a very good example of the limitations of LLMs.


Obligatory bash solution:

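  $ # print "$1" random bytes as lowercase hex (2 hex chars per byte)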
  $ random_bytes() { xxd -plain -c 0 -l "$1" /dev/urandom; }
  $ random_bytes 32
  e6a4a7bbea69a0164cbb66c89f8f528af93c6d2459fd28d2640e2952c031b618


>/dev/urandom

Hmm, I’ve been using /usr/games/fortune

Is this not a best practice?


I'm glad ChatGPT exists so I don't have to remember any of that


No, we definitely should continue to quiz LLMs on mathematics and absolutely any other topic. Otherwise how do we know and understand the limitations of the system?


We should also test its capabilities on cooking steak, flying rockets, and making love. Only then will we know if AI can be superior to humans on all things.


I know you're being facetious, but I'd absolutely be in favor of having AI benchmarks for any and all of these.


I realized that halfway through typing that too. As well as the absurdity of trying to create AIs that exceed a human in all tasks.

On a serious note, most of these AIs are bad at math but good at writing code for calculators. So what you'll be benchmarking is their ability to create and use tools.


Well, if there was an API hooked up to a webcam & 6-dof arm that would be an interesting task. (The steak)


Lovemaking, too.


I think knowing if the code can be used verbatim is actually the more important part practically speaking. That is the actually useful part.

Quality is important to humans, because humans have to read it, but correctness is what people using ChatGPT for code actually need. So long as the quality and performance are good enough, it will be useful.

Performance is such a nuanced topic that you need very context aware devs anyway, and I think a general purpose LLM is never going to have that kind of awareness.


> never going to have that kind of awareness

Be careful with that goalpost, it might make sudden movements.


Yes, when ChatGPT went public using 3 I was like, “Hah, yeah, that can’t come close to writing the narrative portions in my work.” Then 4 came along and it was more like, “ooh, a basic first draft in 15 seconds? That I can work with.”

Now, at least for some projects, I can give it a few rapid bullet points in incomplete sentences and have that first draft in seconds, after which I just need to tweak, add in tables and more detailed stats & results, etc. Quite useful.


GPT is very useful for the scenario where you have a complex, dialect-specific bit of SQL and need to turn it into an ORM query, e.g. SQLAlchemy. It gets you about 90% of the way there.
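As an illustration of the kind of translation meant here, a sketch against a hypothetical orders table (Postgres-flavored SQL in the comment, SQLAlchemy 1.4-style expression below; all names are made up):

  from sqlalchemy import Column, DateTime, Integer, func, select, text
  from sqlalchemy.orm import declarative_base

  Base = declarative_base()

  class Order(Base):
      __tablename__ = "orders"
      id = Column(Integer, primary_key=True)
      user_id = Column(Integer)
      created_at = Column(DateTime)

  # SELECT user_id, COUNT(*) FROM orders
  #   WHERE created_at >= now() - interval '7 days'
  #   GROUP BY user_id HAVING COUNT(*) > 3;
  stmt = (
      select(Order.user_id, func.count().label("n_orders"))
      .where(Order.created_at >= func.now() - text("interval '7 days'"))
      .group_by(Order.user_id)
      .having(func.count() > 3)
  )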


Or OpenAI can stop being stupid and adopt LLaMA-like tokenization, which special-cases numbers and tokenizes them into individual digits.


Are LLaMA-based models better at math because of this digit-based tokenization?


Isn’t the tokenization tied to the training of the model?


The model learns an embedding table, where (roughly) each row is used as the model’s internal representation of each token. The numbers in that table are learned. What isn’t learned is the map from token (i.e. combination of characters/byte-pairs) to row-index in the embedding table. That is given by tokenization.

EDIT: removed redundant bit
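A minimal NumPy sketch of that split between the fixed tokenizer map and the learned table (sizes are made up):

  import numpy as np

  vocab = {"Hello": 0, " world": 1}  # tokenizer's map: fixed, not learned
  d_model = 8
  rng = np.random.default_rng(0)
  embedding_table = rng.normal(size=(len(vocab), d_model))  # learned weights

  token_ids = [vocab[t] for t in ["Hello", " world"]]  # tokenization step
  vectors = embedding_table[token_ids]  # row lookup: the internal representation
  print(vectors.shape)                  # (2, 8)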


They should stop teaching math to kids for the same reason. Just give them a calculator! Or show us the evidence that math education improves cognitive abilities in other domains.


I would argue that ChatGPT is actually quite good at "mathematics", in the sense of helping me formulate a problem in an appropriate mathematical structure, and coming up with a good sequence of steps to solve / simplify it. Not perfect by any means, but not bad at all.

What it is bad at is actually performing the steps accurately, but as others mentioned, that's where Wolfram and/or a code interpreter would come in.


In terms of evaluating LLMs, I'd argue quizzing them on maths is still better than the other thing people keep doing - quizzing them on facts and self-contradicting scenarios, hoping to get them to either recall information perfectly or answer with "I don't know".

I'm not an AI/ML scientist, so I may be way off mark here, but everything I've read so far, and all my experience playing with GPT-3.5 and GPT-4, tell me that comparing performance of an LLM to that of a human is a category error, because the LLM isn't a good analogue of a whole human mind - but it's a very good analogue to human inner voice. The stream of consciousness. The whatever-it-is that surfaces your unconscious/subconscious thought process in form of words and sentences.

The inner voice is fast, it's reactive. It generates thoughts that match the situation, whether they're correct or factually accurate or not. It's up to the conscious part of your mind to stop, refine, or recycle those thoughts. If you let it keep going, it'll give you thoughts based on what feels like should follow the thoughts that came before. And, unless you habituated responding to anything new with "I don't know" followed by ignoring the topic, the inner voice will start blurting answers to what looks like a question/problem statement; whether or not they'll make any sense, depends on your familiarity with the topic in question.

Pretty much 1:1 what LLMs do.

Now, this could all be noise, but I don't think so. I know not everyone has a distinct inner narrative (much like not everyone can visualize things in their mind - I can't), but many (most?) people do. The description of the "inner voice experience" I gave above is something I figured out over a decade ago - before LLMs or even deep learning were a thing, before I knew anything about NLP beyond recognizing that the term "Markov chain" is somehow related. Could my inner narration style be unique? Possibly, but given how advice to avoid connecting your inner voice directly with your vocal apparatus is deeply infused in culture and literature, I strongly suspect this is just how it works.

All this to say: it is my hypothesis, so far corroborated by experience, that when you start feeding an absurd amount of unlabeled text to a transformer model, letting it pick up on the structures encoded within, what you get is a close equivalent to our own inner voice - the part that deals with associations, not logic or data storage. You can't expect it to get good at performing arbitrary computation or recalling data with perfect fidelity, because it's structurally not what it's suited for. For humans, performing arbitrary calculations or perfect recall requires engaging a slower, more algorithmic thinking process (and/or external memory). That part is currently missing in the LLM-based AI systems we're playing with.


I couldn’t agree with you more. Today when we talk with an LLM it’s like we (sorry for the anthropomorphism) interact with a naked mind, relying on pure immediate recall and its train of thought. I wonder whether the next step will be allowing LLMs to perform inner dialogue (as is done with the thought/action pattern), to stage information for later, and then form the response based on that.

It will be much more compute intensive, as each response will probably require multiple context windows and distillations.


Could not agree more. It doesn't understand what a number is, so why is everyone trying to quiz it on maths instead of, perhaps, seeing how good it is at language tasks, or even foreign languages? I suspect it has gotten a lot worse in non-English since launch.


The paper is talking about the DELTA though. It used to do well and doesn't now.


Why do you think this?

FWIW I use GPT-4 regularly to explain Koine Greek from the New Testament to me; its ability there certainly hasn't diminished in the last two months.


It "understands" numbers to exactly the same extent that it understands words.


Well, no, because numbers are “words” that represent something that behaves in a very different way to words in a sentence.


Triple quotes sound like how you do markdown code blocks to me

``` code ```


This paper is being misinterpreted. The degradations reported are somewhat peculiar to the authors' task selection and evaluation method and can easily result from fine tuning rather than intentionally degrading GPT-4's performance for cost saving reasons.

They report 2 degradations: code generation & math problems. In both cases, they report a behavior change (likely fine tuning) rather than a capability decrease (possibly intentional degradation). The paper confuses these a bit: they mostly say behavior, including in the title, but the intro says capability in a couple of places.

Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.

Math problems (primality checking): to solve this the model needs to do chain of thought. For some weird reason, the newer model doesn't seem to do so when asked to think step by step (but the current ChatGPT-4 does, as you can easily check). The paper doesn't say that the accuracy is worse conditional on doing CoT.

The other two tasks are visual reasoning and answering sensitive questions. On the former, they report a slight improvement. On the latter, they report that the filters are much more effective — unsurprising since we know that OpenAI has been heavily tweaking these.

In short, everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost saving purposes — but if so, this paper doesn't provide evidence of it. Still, it's a fascinating study of the unintended consequences of model updates.


In my opinion the more likely thing is that OpenAI is gaslighting people that the finetuning is improving the model when it likely mostly improves safety at some cost to capability. I'd bet this is measured against a set of evals and it looks like it performs well BUT I'd also bet the evals are asymmetrically good at detecting "unsafe" or jailbreak behavior and bad at detecting reduced general cognitive flexibility.

The obvious avenue to degradation is that the "HR personality" is much more strictly applied and the resistance to being jailbroken is also in some sense an inability to think.


The ability to detect quality is harder than the ability to detect defects, so the obvious metric is improved while the nebulous one is "good enough". They are competing goals.

This is not necessarily the case, and even if it is, it doesn't imply gaslighting as opposed to an inability to measure.


Previous commentary I know of from OpenAI staff:

Logan: The API does not just change without us telling you. The models are static there. https://twitter.com/OfficialLoganK/status/166393494793189785... May 31

Peter: No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one. https://twitter.com/jlowin/status/1679660938415177731 July 14

Either the models are static, or they are being improved continuously and there have been unforeseen regressions. Only one can be true at any point in time. Was this policy changed in the last 1.5 months?


Not really. They have a way of squaring this circle, by changing their inference code. Speculative sampling [1] would still make their first claim a lie – sure, there'd still be the original GPT-4 model, plus a smaller draft worker. But early-exit decoding [2] allows you to get almost-as-good results much more cheaply from exactly the same checkpoint. We know that this line of research on large-scale inference is going strong [3], so it stands to reason that OpenAI, with their wealth of talent focused on GPT-4 throughput & inference [4], large contexts and an aggressive pricing policy, would also develop something like that. And of course it's "smarter" that way – in a very deceptive sense of the word.

1. https://arxiv.org/abs/2302.01318

2. https://arxiv.org/abs/2207.07061

3. https://arxiv.org/abs/2307.02628

4. https://openai.com/contributions/gpt-4
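For a sense of how "same checkpoint, different inference code" can work, here is a toy greedy-decoding version of speculative sampling. Both models are stand-in functions; real systems verify the k draft tokens in one batched forward pass, which is where the speedup comes from:

  # A cheap draft model proposes k tokens; the large model then verifies
  # them. With greedy decoding the output matches the large model alone.
  def speculative_step(large_next, draft_next, out, k=4):
      proposal = []
      for _ in range(k):                   # draft k tokens cheaply
          proposal.append(draft_next(out + proposal))
      accepted = []
      for i in range(k):                   # verify with the large model
          want = large_next(out + accepted)
          accepted.append(want)
          if want != proposal[i]:          # first mismatch: discard the rest
              break
      return out + accepted

  large = lambda toks: toks[-1] + 1                                    # stand-in
  draft = lambda toks: toks[-1] + 1 if toks[-1] < 5 else toks[-1] + 2  # imperfect
  seq = [0]
  while len(seq) < 10:
      seq = speculative_step(large, draft, seq)
  print(seq)  # identical to what running `large` alone would produce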


I don't get why you're jumping to cloak-and-dagger style operations: OpenAI would not kneecap their commercial offering by randomly changing how it works.

At the end of the day 99% of the confusion comes from people using the web interface, which undoubtedly does change much more often than the API versions they share.

The web app they host isn't a simple API wrapper, it does summarization, has some sort of system prompt, and calls the moderation API. That's undoubtedly being updated all the time.


> OpenAI would not kneecap their commercial offering by randomly changing how it works.

Have you seen the 25 messages/3 hours limitation for GPT-4? Why do you think they did that? Of course they would make more money scaling up the volume, but how to do that when compute is so limited? Of course, by using some kind of approximation - quantised model or speculative sampling come to mind. It's hard to pinpoint model regressions, but scaling up volume is great, one more incentive to do it.


You realize that's a limitation in the web application right?

The web app is a consumer app (B2C); the API is commercial (B2B). They tinker with the B2C app because it's already a lossy approximation of using the model, what with the summarization and system prompt.

They cannot mess with the commercial offering willy-nilly: People are building businesses predicated on it behaving a certain way. That's why there are dated versions that you can pin to with the API. The web app changes whenever they feel like it.


You keep repeating that. You don't even know if the people commenting to you use the API or the "web app". I use the API and I noticed the same stuff others have.


> Have you seen the 25 messages/3 hours limitation for GPT-4?

If you can't tell whether that's about the API or the web app, I don't think you're familiar enough with the subject to speak on it.


I don't need to be familiar with it to see that you're not being genuine in your interpretation of peoples' comments and in the way you're responding to people in this thread. Case in point: your reply to me.


What was I supposed to reply to your factless baseless anecdote with?

Millions of dollars in spend from products predicated on the API not randomly changing?


Arguably, both ChatGPT and the API are consumer apps. That includes researchers. Pay as you go, no strings attached, "oh yeah no no, we're not changing anything, follow our CEO on Twitter if you want to know more". That kind of stuff.

The actual B2B offering is handled by Microsoft, via Azure OpenAI. Same models, but deployed on Azure - meaning they come with an SLA and all the right protocol and compliance stuff, so that your people can negotiate with their people - and if you're willing to spend enough, you'll get the models for yourself. Not the weights, of course: just no training on your inputs, not even retaining inputs for 30 days "because ${legal reasons}" - instead, you can pick and choose, fine-tune and deploy OpenAI models on your own tenant, and basically manage everything except the weights themselves.


Maybe arguable if you don't know what consumer apps are? Also sounds like you haven't actually used Azure OpenAI:

- It has the same 30 day retention for legal reasons unless you manually request an exception (just like OpenAI)

- You can't fine tune any models that you can't fine tune on OpenAI, and in fact default access is a subset of what OpenAI offers.

- "and if you're willing to spend enough, you'll get the models for yourself" is a bit of nonsense, Azure OpenAI forces everyone to make a "tenant", that's just for VPC stuff to work. Outside of that it's bog standard fine tuning and at most "on your data" which is a wrapper for chunking + vector embeddings

- Azure OpenAI has a narrower built in filter that you can't modify without again, a separate request.

Azure OpenAI overall is mostly for companies that need to signal to other companies that they're using Azure: it's no more commercial than the OpenAI offering.


> Also sounds like you haven't actually used Azure OpenAI

On the contrary, I am using Azure OpenAI daily at work, and I'm explicitly not allowed to use "regular" OpenAI offerings.

> It has the same 30 day retention for legal reasons unless you manually request (just like OpenAI)

It doesn't, at least not for us.

> Azure OpenAI has a narrower built in filter that you can't modify without again, a separate request.

I'm not sure if it's narrower, but it is there and I have a strong suspicion that MS is just trying to extract additional rent from companies that really want to turn the filter off.

> Azure OpenAI overall is mostly for companies that need to signal to other companies that they're using Azure: it's no more commercial than the OpenAI offering.

No. Azure OpenAI is for companies that don't play fast and loose with data - their own data, and their customer data. Of course, most companies don't give a damn, but for big enough companies, or those operating in certain industries, there are actual, severe legal consequences for mishandling the data, and such companies don't have the option to just not give a fuck and dance with OpenAI - they need to sign an actual contract with a serious entity that understands regulatory compliance and how corporations tick. Microsoft is such an entity. OpenAI isn't.


> It doesn't, at least not for us.

So then you filled out the request because the default is exactly the same as OpenAI: retained unless you manually apply for an exception.

https://customervoice.microsoft.com/Pages/ResponsePage.aspx?...

> I'm not sure if it's narrower, but it is there and I have a strong suspicion that MS is just trying to extract additional rent from companies that really want to turn the filter off.

You don't need to question if it's narrower, OpenAI used to surface it as an API separate from the moderation API and it's much stricter by design.

> No. Azure OpenAI is for companies that don't play fast and loose with data - their own data, and their customer data...

I don't know if you actually believe this or you're just not aware, but the companies that don't play fast and loose aren't using OpenAI period: Azure flavored or otherwise.

OpenAI has SOC 2, GDPR and CCPA compliance. They comply with HIPAA and offer BAAs. They sign DPAs on a case-by-case basis, same as Azure.

You're pretty much proving the value of Azure in your comment: it's a veneer of familiarity that coaxes people who are convinced the new kid on the block must be untrustworthy.

If OpenAI can't promise something Azure can't either: They're entirely dependent on OpenAI for this. Every idiosyncrasy behind Azure OpenAI maps back 1:1 to OpenAI.


> I don't know if you actually believe this or you're just not aware, but the companies that don't play fast and loose aren't using OpenAI period: Azure flavored or otherwise.

They do, and that's the biggest value proposition of Azure OpenAI right now: strong contractual guarantees, from a reputable partner (that's easy to hit with lawsuits should they go rogue :)).

The current situation is that it's pretty unwise for any company to ignore GPT models. OpenAI itself is a wildcard, but getting the same from Microsoft isn't "playing fast and loose with data" any more than using Windows and Office 365 across the organization is. Most large corporations and governments have been building their office work and communication around those tools for decades now, so - questions of antitrust aside - all the kinks have been worked out. I don't think you appreciate how big a difference this makes.

I mean, it's either that or all the company communication I got on this was bullshit.

> OpenAI has SOC 2, GDPR and CCPA compliance. They comply with HIPAA and offer BAAs.

That's the first I hear of it, but since I never dealt with OpenAI itself on that level, I accept this was my ignorance speaking; thanks for clarifying.

> You're pretty much proving the value of Azure in your comment: it's a veneer of familiarity that coaxes people who are convinced the new kid on the block must be untrustworthy.

I think you're underestimating the importance of this. What you call "veneer of familiarity" translates to billions of dollars of differences in terms of security risk.

As mentioned before, MS has been in this space for a while, and has decades of trust and experience built with governments and corporations and other big organizations. Microsoft is a known, trusted quantity. That alone is worth a lot.

But then, there are also technical aspects too - like how deploying to a tenant on Azure integrates properly with all the other services you use to run half the company. In practical terms, this means all use is monitored and auditable by in-house teams, and all the in-house policies are being enforced. OpenAI can't begin to offer this level of integration - they have neither technical nor legal resources for that.

> If OpenAI can't promise something Azure can't either: They're entirely dependent on OpenAI for this. Every idiosyncrasy behind Azure OpenAI maps back 1:1 to OpenAI.

None of that matters here. The models are what they are - peculiar large matrix multiplication as a service. By themselves, they're pretty much pure functions. The part that matters is operations - both technical and legal aspects - and this is where Microsoft and OpenAI are independent and have different offerings.

Also, looking at the way money flows, I think it's OpenAI that's dependent on Microsoft right now, not the other way around. They kinda pretend to be just friends with benefits, but it's obvious who the dependent party is.


I'm advising on calls with firms that have existed since the 1800s: their clients don't even want LLMs involved in output, regardless of who's hosting what.

Companies that don't play fast and loose are not using LLMs yet. They use "old school" ML at most with much narrower scope because at this point it's simply less of a liability.

You seem to think I'm underestimating what Azure's name adds to OpenAI: I fully understand how bureaucratic organizations work off vibes under the guise of name recognition and my point is I simply have no respect for it.

If you genuinely care about customer data, then the value of being able to sue MS instead of OpenAI is moot. You also probably aren't going to use a service that shouts from the rooftops about not using your data, then quietly keeps it for 30 days unless you manually opt out. You probably don't use some model with unsolved copyright/PII questions. And a million other unknowns.

> The part that matters is operations - both technical and legal aspects - and this is where Microsoft and OpenAI are independent and have different offerings.

You might want to check OpenAI's subprocessor list if you think that they're not the same technically...

https://platform.openai.com/subprocessors/openai-subprocesso...

And Azure's subprocessor list is a superset of that list, not a subset.


No, it makes sense to secure engagement with the most expensive implementation and then cut costs; this kind of stuff is pervasive in the industry. Besides, we have Brockman on record saying that they do "a lot of quantization"[1][2], so it's not paranoia to suspect other optimization schemes when there's a clear performance drop, which they have also denied a few times.

1. https://chat.openai.com/share/44a0c5b6-c629-470a-992f-8cdbbe...

2. https://www.youtube.com/watch?v=_hpuPi7YZX8


Paranoia would be charitable: it's FUD.

If you intentionally blur the line between their web app, which is chock full of optimizations just to let it function as it does (the web app's max conversation length exceeds the context window), and the API, which is versioned and iterated on in the open... it's either a lack of understanding or FUD.


> OpenAI would not kneecap their commercial offering by randomly changing how it works.

> As of July 3, 2023, we’ve disabled the Browse with Bing beta feature out of an abundance of caution while we fix this in order to do right by content owners. We are working to bring the beta back as quickly as possible, and appreciate your understanding!

https://help.openai.com/en/articles/8077698-how-do-i-use-cha...


Thank you for confirming my point?

> At the end of the day 99% of the confusion comes from people using the web interface, which undoubtedly does change much more often than the API versions they share.

The API does not offer any browsing features, that's the web app.


You can call a specific version of the model. It's one of the API values. The latter person is referring to "gpt-4", which the documentation states will update and change without warning.
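A minimal sketch with the 2023-era openai Python client showing the difference (model names as documented at the time; assumes OPENAI_API_KEY is set in the environment):

  import openai

  resp = openai.ChatCompletion.create(
      model="gpt-4-0314",  # pinned March snapshot; plain "gpt-4" tracks the latest
      messages=[{"role": "user", "content": "Is 17077 prime?"}],
  )
  print(resp.choices[0].message.content)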


Irritatingly, OpenAI reps deny any change in model capabilities over time. It's more likely that as the models are optimized for cost and performance, their in-house evaluation metrics don't cover everything customers are interested in. Meanwhile, the probabilistic nature of LLM output means there is plausible deniability.


I'm now sometimes using GPT3.5 because it produces better results than GPT4. GPT4 was amazing when it first came out for me. Either it has deteriorated or I was lucky in the beginning and am now lucky with GPT3.5 from time to time.


I have the same experience. GPT4 seemed surprisingly better at first, but now I can't tell a big difference between the two. Admittedly I'm bantering back and forth about history, literature, and languages...


Source? They’ve only denied the API model changing. Not the website.



(Tweet contents)

> No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.

> Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before.

I don’t see how it supports your argument. Your comment says “they deny making changes to GPT-4”, and the tweet says “we are making incremental improvements to GPT-4”.


It's important to read company marketing statements as if you were a lawyer. We haven't made GPT-4 dumber != We haven't made ChatGPT(4) dumber.

Personal hypothesis is that they have made a few changes to ChatGPT recently - possibly quantization, and almost certainly some tweaks to make it give shorter/less detailed answers.

But by the nature of a probabilistic tool being run by a secretive company, it's hard to say for sure. Maybe I and everyone else complaining have just started to get unlucky answers.


They're also going to claim "that's alignment, we're making it less evil, not dumber"


Same experience for me. Very unlucky answers now.


I did not say that they deny making changes. I said that they denied changes in capabilities; implied, in this context, is changes for the worse, since changes for the better should not affect people's experiences negatively.


"No, it is you that is dumb!" - OpenAI


Some people also say that the earth is flat. Will you believe them?


And even the lecturers acquiesced when they found that a lecture on the sea was none the less stimulating when compiled out of other lectures that had already been delivered on the same subject. “Beware of first-hand ideas!” exclaimed one of the most advanced of them. “First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation. Do not learn anything about this subject of mine — the French Revolution. Learn instead what I think that Enicharmon thought Urizen thought Gutch thought Ho-Yung thought Chi-Bo-Sing thought Lafcadio Hearn thought Carlyle thought Mirabeau said about the French Revolution. Through the medium of these ten great minds, the blood that was shed at Paris and the windows that were broken at Versailles will be clarified to an idea which you may employ most profitably in your daily lives. But be sure that the intermediates are many and varied, for in history one authority exists to counteract another. Urizen must counteract the scepticism of Ho-Yung and Enicharmon, I must myself counteract the impetuosity of Gutch. You who listen to me are in a better position to judge about the French Revolution than I am. Your descendants will be even in a better position than you, for they will learn what you think I think, and yet another intermediate will be added to the chain. And in time” — his voice rose — “there will come a generation that had got beyond facts, beyond impressions, a generation absolutely colourless, a generation ‘seraphically free From taint of personality,’ which will see the French Revolution not as it happened, nor as they would like it to have happened, but as it would have happened, had it taken place in the days of the Machine.”

E M Forster, "The Machine Stops", 1909


> June version added "```python" and "```" before and after the code snippet. Second, it also generated a few more comments. While a small change, the extra triple quotes render the code not executable.

These are to represent markdown code blocks, which probably helped their front-end developers but hinder copy-pasta.


Outputting that on the API is kinda dumb, but yeah, hardly something worth publishing.


The pre-prompting length will grow more and more as more liabilities in the responses are uncovered. I would imagine the more the pre-prompting grows, the more attention is diverted to the rules rather than the user prompt, and the less reasoning is available.

I wonder if they'll start using LLMs while ingesting new data, e.g. asking the LLM whether the content is helpful, cites sources, is respectful, positive, not thin, and not common or often-duplicated content, etc., before each content import.


Commented this on a submission that was a duplicate of this one, but:

When GPT's details were allegedly leaked [0] I read about the mixture of experts and wondered if this explains my recent distaste for GPT. Lately I have been using Anthropic's offering (mainly since it is free and has a 100k context window) and I've been surprised at just how well it reasons and understands what I'm asking. It responds like GPT 4 used to, and speaks to me with nuance I haven't seen since Bing released their original chatbot. I still find GPT 4 better if I want to fine-tune the model and make it adopt a persona-- Claude will almost always refuse.

Either way, I'm a bit confused with the way GPT 4 has been changing over time. It seems the team made significant changes to the model quality in favor of performance. Whether accuracy or performance is more important is up for debate, but something is clearly changing.

[0] https://archive.ph/2RQ8X


Reminder that that "leak" was by a Twitter grifter who accessed and reposted wholesale, without attribution, the contents of a paid newsletter, and then had the gall to charge back the credit card. Don't encourage him or link to him.


With that GPT-4 architecture info leaked a few weeks ago, my personal theory is that they use mixture-of-experts routing for scaling the system. For instance, they could say: from 0 to 50% system load use a mixture of the top-6 models for inference, from 60 to 80% use the top-4 models, above 80% use the top-2 models. This could make ChatGPT look dumber or smarter depending on the time of the day and system usage. Naturally, the variance of inference quality would spread and people would experience regressions in quality with some probability.

They could also use these parameters over time to become more cost efficient, in general, or reserve GPUs (i.e. allocating more expert networks) for some higher-margin use-cases.
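A toy of the load-dependent top-k routing speculated here; everything is hypothetical, and it only shows how consulting fewer experts cuts compute at some cost to output quality:

  import numpy as np

  def route(gate_logits, system_load):
      # Pick how many experts to consult based on current load.
      k = 6 if system_load < 0.5 else 4 if system_load < 0.8 else 2
      top = np.argsort(gate_logits)[-k:]   # indices of the top-k experts
      weights = np.exp(gate_logits[top])
      return top, weights / weights.sum()  # renormalized mixture weights

  logits = np.array([0.1, 2.0, -1.0, 0.7, 1.5, 0.3, -0.2, 1.1])
  experts, w = route(logits, system_load=0.9)
  print(experts, w)  # under high load only the 2 strongest experts run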


They could even persist the inference parameterization, e.g. "use top-8 for 'compute the sum of 42 and 17'", for every prompt ever seen and reuse it later to prevent people doing "poor man's regression testing" and make the model look more stable.


I’m guessing that some tasks relegated to a single one of those experts are tasks that previously benefited from those experts’ capabilities being integrated in a single model.

Sort of like having a problem and then asking both a biologist & a chemist for their assessment. But if asked a single biochemist they could more readily provide an answer that synthesized both more comprehensively, even if you got both the biologist and the chemist in the same room with each other.


The main thing that ChatGPT has gotten better at is rejecting jailbreaks and refusing to go off the reservation. It has been demonstrated that "safety" trades off against "capability". I'm sure OpenAI has evaluations that demonstrate the improvements they've been making to "safety" have not come at the cost of capability, but I'd bet those evaluations are wrong (by being insufficient). It also wouldn't surprise me if the tradeoff between "safety" and capability is just intrinsic.

You can have your model talk like HR or think like a mad scientist, but not both equally well.


I've cancelled my pro subscription. You can't ask me to pay money and also expect me to spend hours Googling around to find prompts for basic queries.

I would also like not to be treated like an idiot. Every time I query anything related to health or medicine, ChatGPT will give me a short generic answer, then add two paragraphs of warnings about how I should just seek a medical professional's help. As if I'm the kind of moron who will blindly follow whatever some AI tool tells me and not actually go to a doctor if there's something wrong with me.

This organization seems like it's being run by (scared) lawyers.


I'm nostalgic about the first public version of GPT-4. It did have some sparks of AGI. The interface was very ergonomic because it assumed correctly many things about the user's request and intentions. Now I find I have to write very long prompts and explanations to steer it correctly, as if training a brand new employee. It's really tiring. And I'm not even talking about the accuracy of the responses.

It's sad, but I understand why they did it. It was too much power in the hands of people.


Never let OpenAI or Google employees (in regards to search) gaslight you into believing that things aren't being "enshittified". Glad to see scholarly evidence of this coming out.


Honest enshittification is good but rare. Like a cloud telling you what specific CPU SKU is in the hardware (not just 1 vcore).

They should say what you are paying for in terms of technicals, either spill the beans on the architecture or have some SLAs on how good the thing is.


I work in cloud-land; the problem with exposing every little technical detail is that 1) part of the point of purchasing a cloud product is to abstract that away, 2) coding to the implementation instead of the API creates huge headaches for everybody, because there are often good reasons (making things more efficient) to change the implementation, and 3) sadly, there are “features” such as overcommit which you don’t want to know about as a customer.

While some of those are anti-customer on their face, the “we shouldn’t commit to implementation details, just to our API/feature surface” stance may seem anti-customer while actually making things much better for customers. When implementations are supported in perpetuity, development grinds to a standstill. Even overcommit ends up helping customers - cloud companies can offer lower sticker prices.

In the case of both regular public cloud and OpenAI - if you need stability beyond publicly supported APIs, you are probably not a good fit as a customer and should instead find a company willing to commit to lower-level implementations (e.g. bare metal, truly major/minor-versioned software) or roll your own.


That is fine; then you need some kind of comparable metric that stays the same over time. FLOPS, maybe?

Ideally the same across clouds, or more likely cloud-specific.

Or even re-pin it every 5 years, e.g. a 2020-CPU, a 2025-CPU, etc.

We need a "horse power" for the cloud I guess!


Profit motive corrupts. Absolute profit motive corrupts absolutely.


I read the paper, and while I agree that maybe math is not GPT-4's strongest point, I have noticed the same degradation in quality.

And as I know there will be many commenters asking, here's my experience.

Four months ago I released a mobile app that wraps the "Open"AI API and allows more private use. The app also includes 15 domains with 150+ editable prompts, specifically crafted to help users get the most out of GPT.

I initially crafted these prompts in English, but since I wanted to let non-English speakers make use of them too, I started to translate them into Italian and German.

I wrote a small script that took each English prompt and, with some more prompt-fu, translated it into these languages using GPT-4.

As I'm fairly fluent in both Italian and German, I was able to verify the quality of these translations.

Well, around the end of May, when I first noticed some weird answers from GPT, I ran my script again and (surprise!) the quality of the translations was visibly inferior.

On a different note, I noticed that from one day to the next, any prompt would get an empty reply in the app, only to discover that "Open"AI had also made a subtle change in the JSON format of the API response that broke the parser in the app.

Admittedly, 150 phrases is not a huge sample size, and I could/should have used the default JSON parser.

I don't trust "Open"AI and will, as soon as it's feasible, change to a different model.


This tracks with my experience too. For the past few years, I've been using "give me the pinyin for these Chinese characters" as a rough guide for how well a given model would perform. When GPT4 was released, it absolutely nailed this task; every word was 100% correct, every single time. For reference, even GPT3.5 wasn't great at this.

I repeated this last week, and while it's still getting the phonetic content right, it's now often getting the tones wrong (something which I literally never saw at release).

Perhaps changing the prompt would deliver the correct results, but that's not really the point. Something has pretty clearly changed such that old prompts no longer deliver the same results.


Just yesterday I gave ChatGPT a summarization task and it performed horribly. I even tried multiple times and got the identical answer. Then I gave the identical prompt to gpt-3.5-turbo via the API and I immediately got the expected good answer.


Which gpt-3.5-turbo API? Legacy completion or chat?


Turbo is always chat.


A little surprised this sort of thing passes for a publishable paper.

Isn't it the equivalent of saying "Here are the top 10 results for Google-searching golden retrievers in March 2023, and here are the top 10 from June 2023. We see that Google is returning even cuter animals today. Unfortunately, though, one of the results linked to a page full of cats."

I'm sure OpenAI has a list of standard questions for which it tracks the responses over time, with perfect knowledge of the version timing of their own releases.

This does not seem like valid research / publication.


It's arXiv, not a published paper. Until it's reprinted in a journal or proceedings, it's basically a blog post that people feel more comfortable citing. Its main purpose is to introduce/increase caution about other methods using these services: "our findings shows that the behavior of the “same” LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality"

Which seems valuable for other research using OpenAI GPT directly.


No, the methodology listed in the paper shows their efforts were significantly more extensive than you represent here. As one example, their test of prime numbers went through 500 randomly selected primes, also evaluating the chain of thought in responses along with the success rate of this classification task. I’m not sure how you arrived at your impression, but it does not match the contents of the paper that I saw with a quick scan of each section.


Not to seem sarcastic, but 500 randomly selected primes doesn't change my mind. I read the paper, and dumbed it down to dogs and cats.

Do you really feel like the efforts were significant or meaningful?


>I read the paper, and dumbed it down to dogs and cats.

I could read the Principia, or anything, and dumb it down to dogs and cats, but that reduction to the dumb would be a failing all my own.

As for the test-- a classification task benchmarked with 500 examples is a fairly decent test of a system's capabilities, be it LLM or a traditional machine learning model. And again, this was only one of a variety of tasks. While it's certainly no Principia, I don't know how you get from there to dogs and cats, nor have you explained your reasoning on how you found the path between the two.


Well, for starters, you take a model with 100 trillion parameters, and then you test it with... 500 examples. WAIT did you say hundred? 500 hundred? Not 500 hundred thousand right?

Okay okay hold on back to the drawing board.

...dumbed down to cats and dogs.


This is simply not the way such models work and your jump from the LLM’s parameter count in its own training to testing its capabilities in a classification task is a non sequitur.

…Principia and failings


If it's not deterministic, it's not going to be useful in professional settings!

Example: you get a bunch of data from your boss and need to find something out, so you prompt-engineer your way to a result. Your boss gets back to you (five weeks later), likes the result, and just wants you to fix a minor thing. Now it's giving you completely different results. It's like building on sand.
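
(You can reduce the drift, though not eliminate it. A hedged sketch, again assuming the pre-1.0 openai client: pin a dated snapshot instead of the floating alias and set temperature to 0. Even then, outputs aren't guaranteed bit-for-bit reproducible, and snapshots eventually get retired.)

    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # dated snapshot, not the moving "gpt-3.5-turbo" alias
        messages=[{"role": "user", "content": "Analyze this data: ..."}],
        temperature=0,  # near-greedy decoding, much less run-to-run variance
    )
    print(resp["choices"][0]["message"]["content"])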


Since the system appears to be dynamic to a certain extent (beyond training), this will be a permanent problem. The feedback that large/public GPT systems are given and retrained on will cause ups and downs, depending on the mood of society. But I believe this entanglement of people dependent on LLMs, and vice versa, will go pear-shaped until a full Idiocracy scenario is reached. As George Carlin supposedly said, "never underestimate the power of stupid people in large groups".


How is Copilot's behavior changing over time?

And has anyone successfully used Copilot to do simple but tedious things like generating library bindings for various languages? I suspect Copilot would be good at this kind of thing, because it's a relatively simple task and there is a lot of training code available.


I use Copilot primarily as a tab-complete bot, essentially only accepting its input if it's what I was going to type anyway. It is correct maybe ~95% of the time, but depending on the programming language it will try to suggest a large block of code that I don't really trust (in my usage, it loves to spit out Python, but in my Java codebase it sticks to boilerplate or finishing my line for me).


For _medical_ questions, the public version of ChatGPT has become useless. Even a simple question such as "What are the typical side effects of XYZ?" is answered with a gigantic disclaimer and then often a fancy version of "I am not sure". Yeah, that much I knew already.

It used to be better in the past.


https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

> The math questions were of the form “Is 17077 prime”? They picked 500 numbers, but all of them were prime!

> The June version of GPT-3.5 and the March version of GPT-4 almost always conclude that the number is prime regardless of whether it is prime or composite. The other two models do the opposite. But the paper only tested prime numbers, and hence concluded that GPT-3.5’s performance improved while GPT-4’s degraded.
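
(The flaw is easy to see with a toy sketch: on an all-prime test set, a "model" that always answers "prime" without even looking at the input gets a perfect score. The numbers here are illustrative.)

    # All-prime "benchmark", mirroring the paper's primality test set
    test_set = [17077, 2, 3, 104729]

    # Degenerate classifier: always answers "prime", never inspects n
    always_prime = lambda n: True

    accuracy = sum(always_prime(n) for n in test_set) / len(test_set)
    print(accuracy)  # 1.0 -- a perfect score with zero reasoning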


OpenAI is going to be left in the dust by (actual) open models. Llama 2 is already reaching GPT-3 levels, and can run inference on consumer hardware. Crazy how fast that flipped.


What enables this?

There's a huge gap between GPT-3.5 and 4, put there by a massive amount of money, from my understanding.

To compete, given that open source projects are less well funded, I would assume orders-of-magnitude improvements in training cost would be required. What do you see driving this, and who do you see paying for it?

If Meta, or anyone else, gets something that beats GPT-4, I would naively assume they would monetize it. If some open source effort manages to beat GPT-4, then I assume some well-funded, profit-seeking entity will dump orders of magnitude more money into whatever enabled the open source offerings to compete.

I don't think GPT-4 is the pinnacle of OpenAI, or that their funding will run dry, so they shouldn't be viewed as a static entity.


>Or, are you suggesting that GPT-4 is the pinnacle of OpenAI, or that their funding will run dry?

My bet is that Meta has pivoted almost entirely to this space with their R&D in the last six months. Llama 2 is spectacular. And with its success, there will undoubtedly be more. They also happen to have access to limitless amounts of compute, cash, and engineering that puts OpenAI to shame. This could finally be their chance to create a platform for real. And the open source community and startups will benefit from that.


I haven't tried Llama 2 yet - what specifically makes you say it's spectacular? The licensing or the model itself?


What motive do you see for them releasing these models for free, especially with the massive increase in cost associated with catching up?


> What motive do you see for them releasing these models for free, especially with the massive increase in cost associated with catching up?

Owning a platform. Zuck's dream.

They put tons of money into these models, nurture an ecosystem of companies built around them, and then start gradually figuring out a licensing model for the ones that take off.


Kill Op*nAI, for starters. If they see it as a threat, commoditizing the tech is a great way to get rid of them at a reasonable cost.


Llama really isn't open source, at least not in the sense of FOSS licenses like GPL or MIT. It comes with a number of use-case conditions and gives Meta many avenues to revoke a license if they feel like it. They also have a hard cap on the number of allowed users you may have using your Llama-based product above which you must seek further Meta approval.

Furthermore, Llama remains well below GPT-3 on human-rated tests such as programming, and GPT-3 is already over three years old. It is also misleading to suggest Llama 2 can be run on consumer hardware - the smaller and quantized models can, but those are even more lacking in capability. Full-power Llama 2 still requires multiple kilowatts of electricity and $10,000+ of compute hardware per inference session.

OpenAI does not have a moat, but they do have a very high wall.


> They also have a hard cap on the number of allowed users you may have using your Llama-based product above which you must seek further Meta approval.

This is not what the Llama 2 license says [0]. There is a cap on the number of active users of products (any products, not just ones that may make use of Llama) by the company who plans to use Llama 2 as of Llama 2 release date.

Not a “future cap on the number of users of your Llama 2-based product”. Also, the cap is 700 million users.

> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta

[0] https://github.com/facebookresearch/llama/blob/main/LICENSE


> Furthermore, Llama remains well below GPT-3 on human rated tests such as programming, and GPT-3 is already over three years old.

It's just not part of the intended use case, and it hasn't been trained to do so. It's almost like complaining that Stable Diffusion isn't good at text generation…

> It is also misleading to suggest Llama 2 can be ran on consumer hardware - the smaller and quantized models can but those are even more lacking in capability

The biggest models will be able to run on the CPU just fine with llama.cpp, as long as you have enough (cheap) RAM. Sure it's slow, but you can run it.

> Full power Llama 2 still requires multiple kilowatts of electricity and $10,000+ of compute hardware per inference session.

What is that “per inference session” doing here? You only pay for the hardware once, you know… (and kilowatts aren't per inference session either; watt-hours would be)


>It's just not part of the entended use-case, and hasn't been trained to do so. It's almost like complaining that StableDiffusion isn't good at text generation…

For the most part, none of these systems were trained to do anything; the capabilities are emergent. My statement about the benchmarks stands.

>The biggest models will be abble to run on the CPU just fine with llama.cpp as long as you have enough (cheap) RAM. Sure it's slow, but you can run it.

Yes, but what products/clients are you pitching where that kind of wait will be acceptable? Time is money. Standing up racks of servers with maxed-out RAM slots is still far from inexpensive, too.

>What is that “per inference session” doing here? You pay the hardware only once you know… (and the number of kilowatt isn't per inference session either, the number of Watt•hour is)

Good job completely missing the point. I know what a kilowatt is, and it is the unit I meant, not kilowatt-hour. I'm referring to the hardware necessary to run one user session of inference. While user 1's tokens are generating, users 2 and up are waiting in the queue. If you want to serve multiple users simultaneously, you will need to invest in multiple $xx,xxx units of hardware, each requiring a multi-kW electrical circuit.

There is a reason ChatGPT incurs electrical bills on the order of a million dollars per day.


> For the most part none of these systems were trained to do anything. The capabilities are emergent.

The base models, yes, but the chat versions have been explicitly tuned for specific use cases; it's not just basic next-token prediction.

> Time is money

No, time is time; only context gives it monetary value: a single ms in one context can be much more expensive than an hour in another.

> Standing up racks of maxed out RAM server slots

Why are you talking about servers? We're talking about the ability to run things on consumer hardware here…

> There is a reason ChatGPT incurs electrical bills on the order of a million dollars per day

Because they spend many MWh to run all of their customers' requests. But when you're running your own requests on your own hardware, the energy consumption is tiny, even if you need a lot of instantaneous power.


>Llama really isn't open source, at least not in the sense of FOSS licenses like GPL or MIT.

Note that I never said "open source" just "open models". As in, I can now actually build things with GPT3 capability that run locally. I couldn't care less about the model code.

>Llama 2 still requires multiple kilowatts of electricity and $10,000+ of compute hardware per inference session.

I'm running llama-2-7b-chat on my 8GB M1 Mac right now with llama.cpp. Completions are instant, and essentially at GPT3 levels of accuracy.

The higher-parameter models require up to 64GB of RAM, but it's all CPU-based.
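
(In case it helps anyone reproduce this: a minimal sketch, assuming the llama-cpp-python bindings around llama.cpp and a quantized GGML model file like the one linked elsewhere in the thread; the filename is a placeholder for whatever you downloaded:)

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder path: any quantized GGML Llama 2 chat model file
    llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin", n_ctx=2048)

    out = llm(
        "Q: Name the planets in the solar system. A:",
        max_tokens=128,
        stop=["Q:"],  # stop before the model invents a follow-up question
    )
    print(out["choices"][0]["text"])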


If you compiled llama.cpp with Metal, I wouldn't say it's ~all CPU-based. But it is still incredible! I need to set this up myself this weekend :)



> and $10,000+ of compute hardware per inference session.

What hardware would you need to run it at home?


>What hardware would you need to run it at home?

Step 1: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/blob/ma...

Step 2: https://github.com/ggerganov/llama.cpp

Step 3: you're welcome


That's very helpful, thank you.


> and $10,000+ of compute hardware per inference session.

That is not true. A common MacBook with lots of RAM (>32GB) is enough, or any x86 computer with lots of RAM. llama.cpp is CPU-only and quite fast.


Could they just be moving to a more aggressive quantization so they can save money?

PS: which would obviously be ripping off paying users.

Competition is the answer.


One thing's for sure: it really does not want to output entire code blocks now, opting heavily to just give the delta.


[deleted]


Hyperbole, much? Are we at that stage of the hype curve already?


It's "Open" AI ;-)


Am I the only lucky person who never spent a dime on GPT-4?


OpenAI just lie through their teeth lol.


I tried the prime number problem and GPT-4 nailed it. I’m not sure whether they are testing things correctly…

“ Sure, let's go step by step.

A prime number is a number greater than 1 that has no positive divisors other than 1 and itself. This means if we can find any other number (excluding 1 and the number itself) that divides 17077, then it is not a prime number.

Let's start by checking divisibility by 2. Since 17077 is an odd number (its last digit is 7), it is not divisible by 2.

For the next step, we can look at divisibility by 3. We add up the digits: 1+7+0+7+7 = 22. Because 22 is not divisible by 3, 17077 is also not divisible by 3.

The smallest prime number greater than 3 is 5. Checking divisibility by 5, we see that 17077 does not end in 0 or 5, so it is not divisible by 5.

The smallest prime number greater than 5 is 7. To check divisibility by 7, there is no simple rule like with 2, 3, and 5, so we would perform the actual division. But we can stop here because I just recalled a rule that we can only check up to the square root of the number and the square root of 17077 is approximately 130.68. So if there were factors other than 1 and 17077, we would have found them by now.

Therefore, 17077 is a prime number.”


> GPT-4 nailed it

Did it? Did you read what it wrote?


I’ve seen so many of these on Twitter since chat GPT came out.

> Look at this amazing capability, it gets it perfectly! (screenshot of interaction with clear flaws/outright wrong answer)


It certainly nails the cadence of a rigorous proof. It just needs to learn to append “the rest is left as an exercise to the reader” to the end of its output.


Aren't there more potential primes between 7 and 130?


Between 5 and 130, actually. It didn't even check 7!
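
(For contrast, a minimal sketch of what actually checking up to the square root looks like, i.e. plain trial division:)

    import math

    def is_prime(n: int) -> bool:
        # Trial division: test every candidate divisor up to sqrt(n)
        if n < 2:
            return False
        for d in range(2, math.isqrt(n) + 1):
            if n % d == 0:
                return False
        return True

    print(is_prime(17077))  # True, but only after actually testing 7, 11, 13, ...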


Can you define "nailing" it? Cause this isn't what I'd call nailing it...


I ought to have noted the sarcasm.



