ChatGPT sucks at describing images

Tom has discovered ChatGPT and its “describe an image” function, and has been feeding it images. It is really bizarre how bad it is at this. Here are a couple of them:

Granted, this one is very hard and a lot of people won’t know what it is:

This is a frostweed plant at our place after the hard freeze. I believe this is why they’re called frostweed, because they explode into cool ribbons of ice.

But here is how ChatGPT described it:

And here’s one that is really mystifying. I’ll give you the AI’s description first so you can try to guess what the picture is:

It really wrote a lot about this.

The image it's describing is one of my 3D artworks, which I'll hide because it contains nudity.

I am SO unimpressed with AI.


This is the LAST thing I expected after reading that description :joy:


We were stunned. This is why it’s hard for me to take AI seriously as more than a flash in the pan, right now at least. Between ChatGPT, Alexa (who is totally useless), and my Roomba (which I feel equal parts annoyed with and sorry for) my AI experiences have not given me any feeling that it’s poised to do anything useful.


Funny this topic should come up here … I’m not sure how much people outside of the visually-impaired community know about this, but app developers have been trying to harness AI’s “ability” to recognize and describe images to help bridge some of the accessibility gaps that result from living in a world which often prioritizes people with functioning eyeballs. One app in particular, Be My AI, is sort of a trailblazer in this arena; you can have it describe pictures to you as you take them, or you can ask it to analyze ones that are already in your device’s photo gallery. After using the app myself for a few months, I can say that it does a fairly decent job most of the time. It’s not the greatest with recognizing text, and it hallucinates pretty badly at times, but it can at least give me some idea of what my friends’ social media posts with no alt text captions are all about. And no, it’s definitely not a replacement for human-created image descriptions.


While we were doing this, I wondered about how useful this could be to visually impaired folks, and got a little concerned about it. I mean, after the colossal fails of harder images, we just gave it a picture of Tom in tan shorts and a white T-shirt in front of a gallery wall of paintings. It correctly identified that Tom is a person, but said he was in a black suit and in a room full of windows (which I guess were the paintings?). There’s nothing black in the whole photo. There isn’t even a black logo on his shirt. His bare arms and legs are there. Where did this black suit come from?

Is that what this is? It’s just creating its own reality?


Sounds like typical AI hallucination to me: “seeing” things that aren’t there to begin with and spinning out non-existent details when you ask it for more information. I don’t think Be My AI has ever messed up that badly for me, although it might be worth mentioning that it runs on GPT-4, which is supposed to be “smarter” and therefore more accurate than GPT-3.5 (what the free version of ChatGPT is currently using).


As I understand it, there is no “reality”: it’s just statistically likely text, supercharged autocorrect. It mimics the form of a plausible answer, using words the way the humans in its training data do, but without any actual understanding or world model behind it. “Hallucination” is one word people are using for that. Emily Bender et al. coined the term “stochastic parrots” for these large-language-model applications.

There’s no concept of truth or falsehood here: if you trained a language model on a whole pile of, say, medical research paper titles, you wouldn’t expect it to give you only existing titles back; it would give you things that look like research paper titles. Think of it as a (much, much smarter) version of the IF Name Generator, which mashes together the beginnings and ends of existing IF titles.
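You can see the “looks like a title, isn’t a real title” effect with even the dumbest possible version of this: a bigram model that just records which word tends to follow which. Everything below (the corpus and the `generate` helper) is invented for illustration; real models are vastly more sophisticated, but the principle is the same.

```python
import random

# Toy corpus of made-up paper-ish titles (not real papers).
titles = [
    "effects of caffeine on sleep quality",
    "effects of exercise on heart rate",
    "impact of caffeine on heart rate variability",
    "impact of sleep on memory consolidation",
]

# Record which word tends to follow which (a bigram model).
follows = {}
for t in titles:
    words = t.split()
    for a, b in zip(words, words[1:]):
        follows.setdefault(a, []).append(b)

def generate(start, length=6):
    """Walk the bigram table, picking a statistically likely next word."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

print(generate("effects"))  # e.g. "effects of caffeine on memory consolidation"
```

Every output is statistically plausible given the corpus, but most of them are titles that never existed. That’s the whole trick, scaled down.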

And in the ways that they do have some kind of “meaning” attached to words, it’s all derived from how the words appear in relation to each other in the training text. Very roughly, they encode words into a vector space with a huge number of dimensions, arranged so that words that occur in similar contexts end up close together. That lets you do cool things like take “king”, subtract “male”, and add “female” to get “queen.” But it’s a statistical process, so they don’t necessarily make the same connections and categories that humans would.

And they have trouble telling things apart that use the same (or similar) words, so you get things like the Baltimore Orioles effect, where they mix up the baseball team and the bird, or this wine blog’s article about How To Make a 32-Bit Wine Bottle because it mixed up the beverage with the Windows OS emulator.

Allison Parrish’s Nonsense Laboratory plays with this kind of meaning-embedding (really context embedding) at the sound level rather than the word level, and I think it’s a cool way to play around and get a feel for how these kinds of systems work, besides being fun and silly. You might also like playing Contexto for a couple of days (there’s a target word of the day, and you have to find it by guessing words; it tells you how close each guess is to the target in the encoding). I found that very illuminating about word2vec stuff, though that may be more about the curse of dimensionality than anything about the difference between meaning and context…?
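For the curious, the “how close was my guess?” mechanic boils down to ranking the whole vocabulary by cosine similarity to a hidden target word. A toy sketch with invented vectors (the real game uses learned word embeddings over a huge vocabulary):

```python
import numpy as np

# Toy Contexto-style round with made-up 3-dimensional vectors.
vecs = {
    "piano":  np.array([0.9, 0.1, 0.0]),
    "guitar": np.array([0.8, 0.2, 0.1]),
    "apple":  np.array([0.0, 0.9, 0.3]),
    "pear":   np.array([0.1, 0.8, 0.4]),
}
target = "guitar"  # the hidden word of the day

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank the vocabulary by similarity to the hidden word; a guess's
# position in this ranking is the feedback the game gives you.
ranking = sorted(vecs, key=lambda w: -cos(vecs[w], vecs[target]))
for rank, word in enumerate(ranking, 1):
    print(rank, word)
```

Notice that “piano” ranks close to “guitar” while the fruits rank far away: that’s context similarity, not meaning, doing the work.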


Just FYI, @HanonO and everyone— my original post, and all those in reply, were NOT about AI generated art. They were about AI describing photos or art that we showed it. The images in my post are real photographs of plants and real art by me, taken by Tom.

Edit: I know very well that AI-generated art is an upsetting topic to people I respect here, and I’d never post images of AI-generated art for that reason. So I’d appreciate a topic change to “AI sucks at describing images” or some such.

Edit # 2: Oh, hey! Because I spend too much time here, I can edit it myself. Little pencil there and everything. My powers are vast.


Josh, or anyone else, do you know how the image description feature works at all? Like why would it say someone’s wearing a black suit when they’re wearing a white shirt? Where is it getting that from?


Usually the way these things work is that first a web bot scrapes the internet for photos that have captions describing them. This could come from social media sites with alt text or even volunteers on big projects.

It then breaks the data down into its essential components using things like eigenvectors (you can think of this like splitting an image into its red/green/blue channels individually, or checking whether it’s composed more like a diagonal line or a circle, although it’s a lot more complex than that).
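Very roughly, that “essential components” step can look like principal component analysis: the eigenvectors of the pixel covariance pick out the main directions of variation, so each image boils down to a handful of numbers. A toy sketch on fake image data (an illustration of the idea, not what any production vision system actually does):

```python
import numpy as np

# Treat each fake 4x4 grayscale "image" as a flat 16-number vector.
rng = np.random.default_rng(0)
images = rng.random((20, 16))          # 20 fake flattened images

centered = images - images.mean(axis=0)
cov = np.cov(centered, rowvar=False)   # how pixels vary together
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the 3 eigenvectors with the largest eigenvalues;
# each image is now described by just 3 component scores.
top = eigvecs[:, np.argsort(eigvals)[::-1][:3]]
features = centered @ top
print(features.shape)  # (20, 3)
```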

Then the computer looks for patterns in which images with which components have which kinds of words in their description and in what order.

They know that perspectives can be confusing, so they often make “bad” copies of the images in the database to get the model used to them: they stretch, shrink, or skew the images or change the color balance, but keep the descriptions the same, and feed it all back into the training data. The results get boiled down into “archetypes,” like “an image with circular composition and a red tint is likely an apple.” (Again, a massive oversimplification.)

Then, given a new image it’s never seen, it breaks the image down into its components, finds which archetype it’s closest to, and takes the words that match that archetype. It’s hard to get full sentences out of that, so from what I’m seeing, they likely feed the output into a large language model like ChatGPT to make it sound good.
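The “nearest archetype” idea is basically what’s called a nearest-centroid classifier. A toy sketch, with fake two-number features and invented captions, just to make the mechanics concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake training features: "apple" images cluster near (1, 0),
# "banana" images near (0, 1). Real features have thousands of numbers.
apples  = rng.normal([1.0, 0.0], 0.1, size=(10, 2))
bananas = rng.normal([0.0, 1.0], 0.1, size=(10, 2))

# Each archetype is just the average of its examples' features.
archetypes = {
    "a round red fruit, likely an apple":     apples.mean(axis=0),
    "a curved yellow fruit, likely a banana": bananas.mean(axis=0),
}

def describe(features):
    """Return the caption of the nearest archetype."""
    return min(archetypes,
               key=lambda k: np.linalg.norm(archetypes[k] - features))

print(describe(np.array([0.9, 0.1])))   # the apple-ish caption
```

The failure mode falls out of this immediately: a new image near the wrong centroid gets the wrong caption with total confidence, because “closest archetype” is all the system has.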

So the suit just means this looks like an archetype of someone wearing a suit. It might be off because it might look similar to a “bad” version of a suit picture (like the white and gold dress from a few years back that was actually black and blue) or it might be that the large language model they fed the results into got creative and imagined it.

Or everything I said could be wrong. There are lots of ways of doing this stuff and what I’ve described is only one possibility. I was on a data science team once and applied for some jobs but I hated image AI and didn’t look into it much.


No problem. I received a request to split it out of the positive/neutral topic since it fostered parallel discussion.


Yeah, sorry, I didn’t think: I should have quoted into a separate thread from the “positive/neutral thing today” one, since I was going more into “AI discussion” territory.

@alyshkalia I don’t know much about the algorithmic details of the image classification and creation stuff except that, again, it’s going to be some sort of process where they train a pattern-recognition algorithm on lots and lots of data (presumably human-labelled images). And that algorithm is going to figure out its own categories and weights and you can maybe sort of feel out the shapes of them after the fact but they’re not necessarily ones that a human would come up with. So they’re susceptible to the same kinds of confusions and “context, not meaning” issues.

One example from a few years ago that gets thrown around a lot is someone took a photo of a green field with white sheep on it, which was recognized correctly, but if you paint the sheep orange, then it gets recognized as flowers instead. Or… with some of them they recognize text pretty well, and they don’t necessarily think about contrast and color differences the way people do, so you can put almost-completely-transparent text on an image, so a human wouldn’t notice it, but the algorithm will take it as a label and tell you it’s a picture of an apple when it’s actually a piano or whatever… so I wonder if the connection is “people in art gallery photos online are usually dressed up” or something?

So yeah. One problem with any of these techniques is that it’s not like writing a program where you can (theoretically) debug it and change the bit that’s not working: you train it on a big pile of examples, it converges on the patterns for itself, and you just have a giant array of numbers (what are we up to now? I think the small ones you can run on your own computer are around 7 billion parameters) with no real idea of what those numbers mean. So it’s hard to control. Not good for when you need reliability.

And there are all kinds of weird corner cases that are turning up, like people recently found with the text ones that if you give it a prompt that’s just one word over and over (or if you ask it to repeat a single word endlessly? Can’t remember, didn’t pay that much attention) then it’ll often start just dumping text from the training data, including like names and addresses and stuff. And that was a big deal because one of the arguments for training AIs not being copyright infringement is that the data isn’t reproduced in the model, the model just kind of abstractly learns from it.

Anyway. If you can live with the companies not paying for the training data, the icky labor practices (having workers in Kenya or wherever hand-tag and review all the bad stuff to try and filter it out, for terrible pay), and the ludicrous electricity usage… they do some impressive things. And with venture capitalists footing the bill, it’s easy to see why people get really into this stuff. I know a bunch of programmers who use it to get in the ballpark of what they need to know for a new library: they know it’s probably going to be wrong in some ways, but it’ll usually get close enough that they learn what the functions are called and what the code more-or-less looks like, and then they know what to search for in the real documentation. So it’s faster to start by having ChatGPT summarize the specific thing they’re looking for than to wade through a bunch of long-winded tutorials that are more likely to be fully correct.

Dunno. I follow some people who do research on these things (on both sides of the issue) because the math and algorithms are fascinating, but generally don’t touch it myself if I can help it, so who knows if I have even the broad strokes roughly correct here.


Yet it didn’t mention the paintings and instead said the room was filled with windows, which must be the paintings since the room in question is windowless.

Maybe it’s just that Tom is such a commandingly handsome and manly figure that it compared him with male models in Prada ads or something.

It hasn’t gotten anything we’ve shown it even remotely right, from abstract-looking things like the frostweed to a simple photo of a guy in a t-shirt. The closest we got was showing it a photo Tom took of (death content ahead) a very dead brown bird on a paper towel. It knew it was a bird, but it said the bird was on a colorful blanket, that it was a very colorful bird with red, yellow, and blue feathers (it was plain brown), and that its eyes were sparkling with intelligence (its eyes were gone), and it went into long rhapsodies over its little feet, which were not visible in the picture. Sorry for the grossness, everyone, but Tom likes photographing dead birds.
I mean, it doesn’t get ANYTHING right for us. And it goes on and on and on about these details that aren’t in the photo. Impressive in a weird way, but really unimpressive as a useful tool.


Now I’m kind of curious. Would you mind if I took one of the photos you posted in this thread and ran it through Be My AI to see what GPT 4 makes of it?


Sure thing. We’re always up for an experiment. Will report back.

Edit: Oops, misread that you wanted us to do it. Totally OK for you to do it, though.


OK, I screenshotted the photo of your art piece—the one hidden under a spoiler tag—and got this:
The image shows a three-dimensional artwork within a wooden box frame. The artwork features a figure with a bird-like head and a humanoid body, covered in newspaper print. The figure is perched atop a rugged, textured structure resembling rocks or coral. The background is a solid teal color. The figure has red circles around its eyes and a red smudge on its chest, resembling a beak and feathers. The overall impression is of a bird-human hybrid in a contemplative pose. The image is displayed on a smartphone screen with the time 5:01 and the website “” visible at the top. The text “ChatBox” and the question “What is shown in this image?” are displayed above the artwork.


Wow, that is so much better than the original!


Well, the data’s in on this one: GPT-4 clearly blows GPT-3 away. It also doesn’t go into long rambling writing but stays nicely concise. I wonder if it’s forbidden to mention nudity?

Tom’s still trying with GPT-3. He fed it a quilted artwork of mine that should be pretty obvious and got another really weird description, although at least it knew it was art this time.

I'll hide the pic of the artwork even though it's not gross unless you're scared of bugs. But maybe you want to read its description first.

It’s a quilted giant cockroach.

The AI’s description:


Let’s see what I can get this machine to tell me :wink:
Can you describe the figure in more detail?
The figure has a slender, humanoid form with long, thin limbs. Its head is elongated and bird-like, with a prominent beak and large, circular eyes that are accentuated with red circles. The body and limbs are covered in newspaper print, giving it a textured, collage-like appearance. The figure’s pose is dynamic, with one arm extended as if reaching out or balancing, and the legs positioned as though it is crouching or ready to leap. The red smudge on the chest adds a splash of color to the otherwise monochromatic figure, suggesting the presence of feathers or a heart. The overall effect is both whimsical and eerie, as the figure seems to be a fusion of avian and human characteristics.
Is the figure naked?
Yes, the figure appears to be naked, but it is not detailed in a realistic human manner. The body is covered in newspaper print, which obscures any human-like anatomical details. The figure lacks clothing or any other adornments, focusing attention on its bird-like head and the newspaper print texture of its body.


Transcript of Amanda’s screenshot from ChatBox:

In this captivating painting, a solitary figure stands at the center, holding a magnificent clock in their hands. The person’s features are obscured, hinting at the anonymity that time affects equally upon all. The clock itself is a work of art, adorned with intricate details and delicate craftsmanship that captivate the viewer’s attention. Its golden hands elegantly sweep across the face, marking the passage of time with precision and purpose. The figure’s stance exudes a sense of reverence and contemplation, as if they are aware of the profound significance of each passing moment. The painting’s composition is masterful, with a subtle play of light and shadow that adds depth and dimension to the scene. The background, bathed in muted hues, creates a sense of timelessness, emphasizing the eternal nature of time itself. Through this powerful imagery, the artist invites us to reflect on the transient nature of our existence and the importance of cherishing every fleeting second.

(In case anyone needs the plain text)