Abstract: Automatically generating data visualizations in response to human utterances on datasets necessitates a deep semantic understanding of the utterance, including implicit and explicit references to data attributes, visualization tasks, and necessary data preparation steps. Natural Language Interfaces (NLIs) for data visualization have explored ways to infer such information, yet challenges persist due to inherent uncertainty in human speech. Recent advances in Large Language Models (LLMs) provide an avenue to address these challenges, but their ability to extract the relevant semantic information remains unexplored. In this study, we evaluate four publicly available LLMs (GPT-4, Gemini-Pro, Llama3, and Mixtral), investigating their ability to comprehend utterances even in the presence of uncertainty and to identify the relevant data context and visual tasks. Our findings reveal that LLMs are sensitive to uncertainties in utterances. Despite this sensitivity, they are able to extract the relevant data context. However, LLMs struggle with inferring visualization tasks. Based on these results, we highlight future research directions on using LLMs for visualization generation. Our supplementary materials have been shared in this repository.
This repository contains the codebase used for prompting the LLMs and performing comparisons with human annotations. We evaluated two proprietary and two open-source LLMs.
Proprietary LLMs. We evaluated OpenAI's GPT4-Turbo and Google's Gemini-Pro. GPT4-Turbo has a training data cutoff of December 2023, and Gemini-Pro's training data cutoff is described as "early 2023" according to Google AI documentation. We used the Application Programming Interfaces (APIs) of both models to generate responses for the 500 utterances in our corpus.
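For reference, here is a minimal sketch of how such API calls can be made. It assumes the `openai` and `google-generativeai` Python packages with API keys stored in environment variables; the actual prompt templates and response parsing live in the GPT_Gemini_Prompting_Scripts folder.

```python
# Sketch of prompting both proprietary models via their Python SDKs.
# Assumes OPENAI_API_KEY and GOOGLE_API_KEY are set in the environment;
# the utterance below is an illustrative placeholder.
import os

from openai import OpenAI
import google.generativeai as genai

utterance = "Show the average price by neighborhood"  # example utterance

# GPT4-Turbo via the OpenAI chat completions API
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
gpt_response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": utterance}],
)
print(gpt_response.choices[0].message.content)

# Gemini-Pro via the Google Generative AI SDK
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-pro")
gemini_response = gemini_model.generate_content(utterance)
print(gemini_response.text)
```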
Open Source LLMs. We evaluated two open-source LLMs, Llama3 and Mixtral, using the LLaMA-Factory codebase. Llama3 has 70 billion parameters and a context length of 8,000 tokens, with a knowledge cutoff of December 2023. Mixtral-8x7B-Instruct has 46.7 billion parameters and similarly has a knowledge cutoff of December 2023.
Our experimental setup for the open-source models used an NVIDIA H100 GPU paired with a 48-core Intel Sapphire Rapids CPU and 100GB of system memory. Both models were run in 4-bit quantization with flash attention enabled to speed up inference. Inference took approximately 2 hours for Llama3 and about 3 hours for Mixtral.
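As an illustration of this setup, the sketch below loads a model in 4-bit quantization with flash attention via Hugging Face transformers. This is an assumption-laden example rather than the exact LLaMA-Factory configuration we used; the model ID and generation settings are illustrative.

```python
# Illustrative sketch: load an open-source model in 4-bit quantization with
# flash attention using Hugging Face transformers. Requires the
# `bitsandbytes` and `flash-attn` packages on a supported GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # hypothetical model choice

# 4-bit quantization to fit the model on a single GPU
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # flash attention to expedite inference
    device_map="auto",
)

inputs = tokenizer("Show the average price by neighborhood", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```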
To ensure you have all the required packages for this project, please run

```
pip install -r ./requirements.txt
```

in the terminal. You should then be able to run the necessary scripts in this repo.
This project contains the following folders:
Datasets: Contains all 37 datasets referenced by the 500 utterances used in this study. All files are in .csv format.
GPT_Gemini_Prompting_Scripts: Contains the scripts used to prompt the proprietary LLMs (GPT4-Turbo and Gemini-Pro) evaluated in this study.
Llama_Mixtral_Prompting_Scripts: Contains the scripts used to prompt the open-source LLMs evaluated in this study. Also contains JSON results from the Llama3 and Mixtral runs.
Output Analysis: Contains the scripts used to evaluate the responses generated by all LLMs evaluated in this study.