It’s interesting to me how many people I’ve argued with about LLMs. They vehemently insist that this is a world-changing technology and the start of the singularity.
Meanwhile whenever I attempt to use one professionally it has to be babied and tightly scoped down or else it goes way off the rails.
And structurally, LLMs seem like they’ll always be vulnerable to that. They’re only useful because they bullshit, but that also makes them impossible to rely on for anything else.
I’ve been using LLMs pretty extensively in a professional capacity, and with the proper grounding work they become very useful and reliable.
LLMs on their own are not the world-changing tech; LLMs plus grounding (what is now being called a Cognitive Architecture) are the world-changing tech. So while LLMs can be vulnerable to bullshitting, there is a lot of work around them that can qualitatively change their performance.
I’m a few months out of date on the latest in the field, and I know it’s changing quickly. What progress has been made towards solving hallucinations? Feeding output into another LLM for evaluation never seemed like a tenable solution to me.
Essentially, you don’t ask them to use their internal knowledge. In fact, you explicitly ask them not to. The technique is generally referred to as Retrieval-Augmented Generation (RAG). You take the context/user input, you retrieve relevant information from the net/your DB/vector DB/whatever, and you give it to an LLM along with instructions for how to transform that information (summarize it, answer a question, etc.).
So you try as much as you can to “ground” the LLM with knowledge that you trust, and to have it use only that information to perform the task.
You end up with a system that can do a really good job of transforming the data you have into the right shape for the task(s) you need to perform, without requiring your LLM to act as a source of information, only as a great data massager.
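Here is a minimal sketch of that retrieve-then-generate flow in Python. Everything in it is illustrative: the DOCS snippets are invented, the retriever is just a toy word-overlap ranker, and call_llm is a hypothetical stand-in for whatever completion API you actually use.

    # Toy retrieve-then-generate flow. call_llm is a hypothetical stand-in for a
    # real completion API, and DOCS is an invented in-memory "knowledge base".
    import re

    DOCS = [
        "The warranty covers manufacturing defects for 24 months from purchase.",
        "Returns are accepted within 30 days with the original receipt.",
        "Battery replacements are free during the first year of ownership.",
    ]

    def tokens(text):
        # Lowercase word tokens with punctuation stripped.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def retrieve(query, docs, k=2):
        # Toy retriever: rank documents by word overlap with the query.
        q = tokens(query)
        return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

    def grounded_prompt(query, snippets):
        # Tell the model to use ONLY the retrieved context, not its own "knowledge".
        context = "\n".join("- " + s for s in snippets)
        return ("Answer using ONLY the context below. If the answer is not in the "
                "context, say you don't know.\n\nContext:\n" + context +
                "\n\nQuestion: " + query + "\nAnswer:")

    def call_llm(prompt):
        # Placeholder: swap in your actual LLM call here.
        return f"[model output for a {len(prompt)}-char grounded prompt]"

    query = "How long is the warranty?"
    print(call_llm(grounded_prompt(query, retrieve(query, DOCS))))

In practice the retrieval step is usually an embedding search over a vector store rather than word overlap, but the shape is the same: the model only ever sees the retrieved context plus an instruction for what to do with it.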
That seems like it should work in theory, but having used Perplexity for a while now, I can say it doesn’t quite solve the problem.
The fundamental problem is that it doesn’t understand, in any meaningful capacity, what it is saying. It can try to restate something it sourced from a real website, but because it doesn’t understand the content it doesn’t always preserve the essence of what the source said. It will also frequently repeat or contradict itself within as little as two paragraphs, based on two sources, without acknowledging it, which further confirms the severe lack of understanding. No amount of grounding can overcome this.
Then there is the problem that LLMs don’t understand negation. You can’t reliably reason with it using negated statements. You also can’t ask it to tell you about things that do not have a particular property. It can’t filter based on statements like “the first game in the series, not the sequel”, or “Game, not Game II: Sequel” (however you put it, results pertaining to the sequel will often sneak in).
Yeah, it’s just back exactly to the problem the article points out: refined bullshit is still bullshit. You still need to teach your LLM how to talk, so it still needs that vast bullshit input in its “base” before you feed it the “grounding” or whatever… And since it doesn’t actually understand any of that grounding, it’s just yet more bullshit.
Definitely a good use for the tool: NLP is what LLMs do best, and pinning the inputs down to only rewording or compressing ground truth avoids hallucination.
I expect you could use a much smaller model than GPT to do that, though. Even Llama might be overkill, depending on how tightly scoped your DB is.
They are useful when you need to generate quasi-meaningful bullshit in large volumes easily.
LLMs are being used in medicine now, not to help with diagnosis or correlate seemingly unrelated health data, but to write responses to complaint letters or generate reflective portfolio entries for appraisal.
Don’t get me wrong, outsourcing the bullshit and waffle in medicine is still a win; it frees up time and energy for the highly trained organic general intelligences to do what they do best. I just don’t think it’s the exact outcome the industry expected.
I think it’s the outcome anyone really familiar with the tech expected, but that rarely translates to marketing departments and c-suite types.
I did an LLM project in school, and while that was a limited introduction, it was enough for me to doubt most of the claims coming from LLM orgs. An LLM is good at matching its corpus and that’s about it. So it’ll work well for things like summaries, routine text generation, and similar tasks (and it’s surprisingly good at forming believable text), but it’ll always disappoint with creative work.
I’m sure the tech can do quite a bit more than my class went through, but the limitations here are quite fundamental to the tech.
That’s kinda the point of my above comment: they’re useful for bullshit, and that’s why they’ll never be trustworthy.
It’s a computer that understands my words and can reply, even complete tasks upon request, never mind the result. To me that’s pretty groundbreaking.
It’s a probabilistic network that generates a response based on your input.
No understanding required.
Same
Ask it to write code that replaces every occurrence of “me” in every file name in a folder with “us”, but excluding occurrences that are part of a word (medium should not become usdium), and it will give you code that does exactly that.
You can ask it to write code that does a heat simulation in a plate of aluminum with one side heated and the other cooled. It will get there with some help. It works. That’s absolutely fucking crazy.
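For reference, the renaming task does have a compact solution. The sketch below is illustrative of what such generated code tends to look like, not a transcript of any model’s actual output, and it assumes the files live in the current directory.

    # Replace whole-word "me" with "us" in file names in the current folder;
    # the \b word boundaries keep names like "medium" untouched.
    import re
    from pathlib import Path

    pattern = re.compile(r"\bme\b")
    for path in Path(".").iterdir():
        if path.is_file():
            new_name = pattern.sub("us", path.name)
            if new_name != path.name:
                path.rename(path.with_name(new_name))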
Maybe, but that really depends on whether that task or a very similar one exists in sufficient amounts in its training set. Basically, you could get essentially the same result by searching online for code examples; the LLM might just make it a little faster (and probably introduce some errors as well).
An LLM can only generate text that exists in its training data. That’s a pretty important limitation, which has all kinds of copyright-related issues associated with it (e.g. I can’t just copy a code example from GitHub in most cases).
No, it does not depend on preexisting tasks, which is why I gave you those two random examples. You can come up with new, never-before-seen questions if you want to. How to stack a cable, a car battery, a beer bottle, a welding machine, and a teapot to get the highest tower. Whatever. It is not always right, but it is also much more capable than you think.
It is dependent on preexisting tasks; you’re just describing encoded latent space.
It’s not explicit, but it’s implicitly encoded.
And you still can’t trust it because the encoding is intrinsically lossy.
It can come up with new solutions.
Ask it to finish writing the code to fetch a permission, and it will make a request with a non-existent code. Ask it to implement an SNS API invocation, and it’ll make up calls that don’t exist.
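For contrast with the invented calls described here, this is roughly what a real Amazon SNS publish looks like with boto3. The topic ARN is a placeholder, and actually running it requires valid AWS credentials and an existing topic.

    # Publish to an Amazon SNS topic with boto3 (the ARN below is a placeholder).
    import boto3

    sns = boto3.client("sns", region_name="us-east-1")
    response = sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:example-topic",
        Message="Deployment finished",
        Subject="CI notification",
    )
    print(response["MessageId"])  # publish returns the new message's ID

The failure mode the comment points at is a plausible-looking method name that simply isn’t in the SDK.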
Regurgitating code that someone else wrote for an aluminum simulation isn’t the flex you think it is: that’s just an untrustworthy search engine, not a thinking machine.
Yet it can outperform humans on some tests involving logic. It will never be perfect, but that implies you can test its IQ.
Not consistently, and not across truly logical tests. They abjectly fail at abstract reasoning. They do well only in very specific cases.
IQ is an objectively awful measure of human intelligence. Why would it be useful for artificial intelligence?
For these tests that are so centered around specific facts: of course a model that has had the entirety of the Internet encoded into it has the answers. The shocking thing is that the model is so lossy that it doesn’t ace the test.
IQ correlates with a good number of things, though. It’s not perfect, but it’s not meaningless either.
And global warming correlates with the decline in piracy rates. IQ is a garbage statistic invented by early 20th century eugenicists to prove that white people were the best.
You can’t boil down the nuance of the most complex object in the known universe to a single number.
Not perfectly, you can’t. But similarly to how people’s SAT scores predict their future success, IQ tests in aggregate do have predictive power.
IQ is objectively a good measure of human intelligence. High-IQ people have higher educational achievement, income, etc.
Don’t take the effect and make it the cause, my guy.
I never said it’s the cause. We’re trying to find a measure that correlates well with actual intelligence, g.
IQ correlates with g, but income/education also correlate with g, because smarter people do better on these metrics.
IQ doesn’t make you smarter, but smarter people can do better on IQ tests.
Smarter by what measure? IQ?
You’re using circular reasoning here.
Income and education levels are not the same thing as intelligence, nor are smarter people higher earners or better educated.
IQ correlates best with educational achievement. Educational achievement is best predicted by your zip code. Poverty creates sharp educational disadvantages.
Intelligence, measured as the maximum level of knowledge and skills you could attain, is something that the majority of people will never test the limits of.
IQ tests do not measure that maximum, only how far along that trajectory you might have come compared to your “peers”.
Therefore: IQ tests are one step removed from just asking someone where they grew up, how much college they attended, and how much money their parents made.
It has nothing to do with measuring that underlying factor and everything to do with measuring socioeconomic status.
It was a crude tool invented by eugenicists to promote genocide and you should stop using or respecting it at all.
“Test it’s IQ”. The fact that you think IQ is a useful test for intelligence tells me everything I need to know
The fact you went out of your way to write it’s when I wrote the correct “its” tells me everything I need to know about your educational achievement
That is exactly what it doesn’t do. There is no “understanding”, and that is exactly the problem. It generates output that is similar to what it has already seen in the dataset it’s been fed, output that might correlate with your input.