Towards LLMCI - Multimodal AI for LLM-Vision UI Operation

Author:

Barham Husam 1, Fasha Mohammed 1

Affiliation:

1. University of Petra

Abstract

Human-computer interaction (HCI) has evolved significantly, yet it still largely depends on visual communication through screens and manual input devices. While this paradigm is likely to remain dominant for the foreseeable future, this research suggests that existing user interfaces (UIs) can also be leveraged by Large Language Models (LLMs) to interact with computers. By integrating vision models into a multimodal framework, LLMs can gain the ability to understand and operate UI elements, enabling them to retrieve information, run functions, and perform various tasks just as humans do. The framework uses a vision model to communicate UI components and information to the LLM, which then leverages its language understanding capabilities to retrieve information and operate keyboard and mouse inputs. This paper introduces a new element to HCI, called LLM-Computer Interaction (LLMCI), which combines LLMs with computer vision via intelligent agents. These agents process user text commands and use visual perception to recognize the visual and textual elements of computer interfaces, allowing the multimodal AI to independently perform complex tasks and navigate applications in a way that resembles human behavior. We present a proof-of-concept framework that illustrates how the agent uses LLMs and computer vision to handle interface elements, complete tasks, and support users according to their instructions. This strategy closely imitates human interaction and suggests a path forward for enhancing HCI practices.
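The perceive-plan-act cycle the abstract describes (vision model detects UI elements, LLM plans an action, the agent executes it via keyboard and mouse) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `UIElement` and `Action` types, the word-overlap planner standing in for the LLM, and the `detect`/`execute` callbacks are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data types; the paper does not specify a concrete API.
@dataclass
class UIElement:
    label: str   # text recognised on or near the element by the vision model
    kind: str    # e.g. "button", "textbox"
    x: int       # centre coordinates of the element on screen
    y: int

@dataclass
class Action:
    kind: str            # "click" or "type"
    target: UIElement
    text: str = ""       # text to enter when kind == "type"

def plan_action(command: str, elements: List[UIElement]) -> Action:
    """Stand-in for the LLM planner: choose the element whose label
    shares the most words with the user's command (lower-cased)."""
    words = set(command.lower().split())
    best = max(elements, key=lambda e: len(words & set(e.label.lower().split())))
    if best.kind == "textbox":
        return Action("type", best, text=command)
    return Action("click", best)

def run_step(command: str,
             detect: Callable[[], List[UIElement]],
             execute: Callable[[Action], None]) -> Action:
    """One perceive-plan-act cycle: vision model -> planner -> input device.
    `detect` would wrap the vision model; `execute` would drive mouse/keyboard."""
    action = plan_action(command, detect())
    execute(action)
    return action
```

In a real agent, `detect` would run a UI object-detection model over a screenshot and `execute` would synthesize OS-level mouse and keyboard events; here both are left as callbacks so the control loop itself stays visible.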

Publisher

Springer Science and Business Media LLC

