Multimodal Analysis of Google Bard: Experiments in Visual Reasoning

Authors

David Noever and Samantha Elizabeth Miller Noever, PeopleTec, Inc., USA

Abstract

Addressing the gap in understanding visual comprehension in Large Language Models (LLMs), we designed a challenge-response study, subjecting Google Bard to 64 visual tasks spanning categories such as "Visual Situational Reasoning" and "Next Scene Prediction." Previous models, such as GPT-4, leaned heavily on optical character recognition tools like Tesseract, whereas Bard, like Google Lens and the Visual API, employs deep learning techniques for visual text recognition. However, our findings spotlight Bard's limitations: while proficient at solving visual CAPTCHAs that stump ChatGPT, it falters when recreating visual elements such as ASCII art or analyzing Tic Tac Toe grids, suggesting an over-reliance on educated visual guesses. Prediction from visual inputs proves particularly challenging: current "next-token" multimodal models offer no common-sense guesses for next-scene forecasting. This study provides experimental insights into the current capacities of multimodal LLMs and the areas where they need improvement.

Keywords

Transformers, Text Generation, Image Analysis, Generative Pre-trained Transformers, GPT

Volume 13, Number 22