AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors

Saelyne Yang, Jo Vermeulen, George Fitzmaurice, Justin Matejka

January 2024 · Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI)

Abstract

Tutorial videos are a popular help source for learning feature-rich software. However, getting quick answers to questions about tutorial videos is difficult. We present an automated approach for responding to tutorial questions. By analyzing 633 questions found in 5,944 video comments, we identified different question types and observed that users frequently described parts of the video in questions. We then asked participants (N=24) to watch tutorial videos and ask questions while annotating the video with relevant visual anchors. Most visual anchors referred to UI elements and the application workspace. Based on these insights, we built AQuA, a pipeline that generates useful answers to questions with visual anchors. We demonstrate this for Fusion 360, showing that we can recognize UI elements in visual anchors and generate answers using GPT-4 augmented with that visual information and software documentation. An evaluation study (N=16) demonstrates that our approach provides better answers than baseline methods.


Figures

Figure 1: Overall architecture of our question-answer pipeline AQuA, which generates useful responses to questions made in software tutorial videos. Questions are accompanied by visual anchors, which a

Figure 2: Categories and Types of questions identified from the analysis. Each row represents a category and each block represents a type. Under each block, the areas on the left and right represent live chat and comment data, respectively. Our focus is on Content and User questions, as these are vital for comprehending the tutorial and can often be answered without the involvement of the tutorial authors or software vendor.

Figure 3: The system used for collecting questions with visual references. (A) Users can draw anchors on parts of the video they want to ask questions about, (B) which will be added to a temporary gallery. (C) Users can refer to each anchor in their questions.

Figure 4: Our Visual Recognition Module is composed of Image Captioning, UI Element Detection, and Optical Character Recognition (OCR). We use BLIP-2 [34] to obtain a general description of the visual

Figure 5: The system used in our pipeline evaluation study. The participant can see the question, the video that the question was asked about at the right timestamp and with the visual anchor highlight

Figure 6: Distribution of Likert scale responses on Correctness and Helpfulness. Full Pipeline shows the highest correctness and helpfulness scores in both batches. Responses of "neither agree nor disagree" are omitted from the chart for clarity and readability.

Figure 7: Results of the favorite answer selection. Answers generated from the Full Pipeline were selected as the favorite most often.

Figure 8: We envision that our pipeline could be leveraged in the future to develop a tutorial video system that supports conversational, chat-like question and answering. Learners could ask questions by referring to specific parts of the video. The system would then generate responses to these questions, while also allowing users to easily ask follow-up questions.

BibTeX

@inproceedings{10.1145/3613904.3642752,
  author = {Yang, Saelyne and Vermeulen, Jo and Fitzmaurice, George and Matejka, Justin},
  title = {AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors},
  year = {2024},
  isbn = {9798400703300},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3613904.3642752},
  doi = {10.1145/3613904.3642752},
  abstract = {Tutorial videos are a popular help source for learning feature-rich software. However, getting quick answers to questions about tutorial videos is difficult. We present an automated approach for responding to tutorial questions. By analyzing 633 questions found in 5,944 video comments, we identified different question types and observed that users frequently described parts of the video in questions. We then asked participants (N=24) to watch tutorial videos and ask questions while annotating the video with relevant visual anchors. Most visual anchors referred to UI elements and the application workspace. Based on these insights, we built AQuA, a pipeline that generates useful answers to questions with visual anchors. We demonstrate this for Fusion 360, showing that we can recognize UI elements in visual anchors and generate answers using GPT-4 augmented with that visual information and software documentation. An evaluation study (N=16) demonstrates that our approach provides better answers than baseline methods.},
  booktitle = {Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems},
  articleno = {928},
  numpages = {19},
  keywords = {generative AI, large language models, question answering, software learning, tutorial videos},
  location = {Honolulu, HI, USA},
  series = {CHI '24},
}