Towards LLMCI - Multimodal AI for LLM-Vision UI Operation

Author:

Barham Husam 1, Fasha Mohammed 1

Affiliation:

1. University of Petra

Abstract

Human-computer interaction (HCI) has evolved significantly, yet it still largely depends on visual communication through screens and manual input devices. While this paradigm is likely to remain dominant for the foreseeable future, this research suggests that existing user interfaces (UIs) can also be leveraged by Large Language Models (LLMs) to interact with computers. By integrating vision models into a multimodal framework, LLMs can gain the ability to understand and operate UI elements, enabling them to retrieve information, run functions, and perform various tasks just as humans do. The framework uses a vision model to communicate UI components and information to the LLM, which then leverages its language understanding capabilities to retrieve information and operate keyboard and mouse inputs. This paper introduces a new element to HCI, called LLM-Computer Interaction (LLMCI), which combines LLMs with computer vision via intelligent agents. These agents process user text commands and use visual perception to recognize the visual and textual elements of computer interfaces, allowing the multimodal AI to independently perform complex tasks and navigate applications in a way that resembles human behavior. We present a proof-of-concept framework that illustrates how the agent uses LLMs and computer vision to handle interface elements, complete tasks, and support users according to their instructions. This strategy closely imitates human interaction and suggests a path forward for enhancing HCI practices.
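The perceive-plan-act cycle the abstract describes (vision model detects UI elements, LLM plans an action, the agent executes it via keyboard and mouse) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `UIElement` and `Action` types, the word-overlap planner standing in for the LLM, and the `detect`/`execute` callbacks are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data types; the paper does not specify a concrete API.
@dataclass
class UIElement:
    label: str   # text recognised on or near the element by the vision model
    kind: str    # e.g. "button", "textbox"
    x: int       # centre coordinates of the element on screen
    y: int

@dataclass
class Action:
    kind: str            # "click" or "type"
    target: UIElement
    text: str = ""       # text to enter when kind == "type"

def plan_action(command: str, elements: List[UIElement]) -> Action:
    """Stand-in for the LLM planner: choose the element whose label
    shares the most words with the user's command (lower-cased)."""
    words = set(command.lower().split())
    best = max(elements, key=lambda e: len(words & set(e.label.lower().split())))
    if best.kind == "textbox":
        return Action("type", best, text=command)
    return Action("click", best)

def run_step(command: str,
             detect: Callable[[], List[UIElement]],
             execute: Callable[[Action], None]) -> Action:
    """One perceive-plan-act cycle: vision model -> planner -> input device.
    `detect` would wrap the vision model; `execute` would drive mouse/keyboard."""
    action = plan_action(command, detect())
    execute(action)
    return action
```

In a real agent, `detect` would run a UI object-detection model over a screenshot and `execute` would synthesize OS-level mouse and keyboard events; here both are left as callbacks so the control loop itself stays visible.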

Publisher

Springer Science and Business Media LLC

