mrCAD: Multimodal Communication to Refine Computer-aided Designs

William P McCarthy, Saujas Vaduguru, Karl D.d. Willis, Justin Matejka, Judith E Fan, Daniel Fried, Yewen Pu

January 2025 · Findings of the Association for Computational Linguistics: EMNLP 2025 (EMNLP)

Abstract

In collaborative creation tasks, people steer artifacts towards specific goals by \_refining\_ them with \_multimodal\_ communication over multiple rounds of interaction. In contrast, generative AI excels at creating artifacts in a single turn but can struggle to make precise refinements that match our design intent. To close this gap, we present mrCAD, a dataset of multi-turn interactions in which pairs of humans iteratively created and refined computer-aided designs (CADs). In each game, a \_Designer\_ sent instructions to a \_Maker\_, explaining how to create and subsequently refine a CAD to match a target design that only the \_Designer\_ could see.

Figures

Figure 1: We present mrCAD, a dataset of humans playing a multi-turn, multimodal communication game in a 2D CAD environment. A pair of participants collaborated to recreate a target CAD over multiple rounds. The target design is known only to the Designer, who instructs the Maker using drawing and text. The Maker manipulates the current CAD based on these instructions.

Figure 3: A: reconstruction accuracy for the 4 communication conditions: multimodal + refinement, text only + refinement, drawing only + refinement, and multimodal + generation only. Text only and generation only were less effective, while drawing only and multimodal performed comparably. B: usage of text across rounds. In the multimodal condition, participants used more text in the later refinement rounds, suggesting that text was used in conjunction with drawings to communicate refinements. C: usage of drawing across rounds. In the multimodal condition, participants used more drawing in the generation round and less in the refinement rounds.

Figure 4: The mrCAD dataset contains three subsets: the coverage set of 2249 CADs with 1-2 successful rollouts, the dense set of 698 CADs with 3+ successful reconstructions, and the very-dense set of 27 CADs with 30+ successful reconstructions.

Figure 5: A: Example rollouts from the dataset. Target CADs (top-center) were shown to Designers, who created instructions (left columns) that Makers followed (right columns). Dyads iteratively refined their CADs across a series of rounds (rows). B: Examples of multimodal refinement instructions. Language and drawing mutually constrain and inform each other's semantics. Many instructions don't make sense without the accompanying drawings, and vice versa.

Figure 6: A: Designers' instructions to generate CADs (round 1) involved lots of drawing and little text, whereas instructions to refine CADs (rounds 2+) used a balance of modalities. B: The proportions of the types of root words in the dependency parse tree of instruction text. More verbs are used over rounds, and these verbs become more imperative. C: Samples of 20 generation drawings and 20 refinement drawings highlight the rich detail in generation instructions and the more targeted modifications in refinement.

Figure 7: A: Comparison of human and model movement towards the target when following instructions, normalized by the distance at the start of the round. Only humans make reliably positive changes in response to refinement instructions. Models made positive steps in generation but largely destructive changes when refining. B: Comparison of human and model responses.
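
The normalization described in the Figure 7 caption can be illustrated with a minimal sketch. It assumes that "movement towards target" means the reduction in distance to the target over one round, divided by the distance at the start of that round. The function name, the argument names, and the example distances are hypothetical and are not taken from the paper or its code; the distance measure itself is left unspecified.

```python
def normalized_movement(dist_start: float, dist_end: float) -> float:
    """Fraction of the starting distance to the target closed in one round.

    Assumed definition (not confirmed by the source):
        (dist_start - dist_end) / dist_start
    Positive values mean the edit moved the CAD toward the target;
    negative values mean it moved away (a destructive change).
    """
    if dist_start <= 0:
        raise ValueError("dist_start must be positive to normalize")
    return (dist_start - dist_end) / dist_start


# Hypothetical distances, not values from the dataset.
print(normalized_movement(0.40, 0.10))  # positive: most of the gap is closed
print(normalized_movement(0.20, 0.30))  # negative: the edit moved away
```

Under this reading, a round that leaves the CAD unchanged scores zero, and dividing by the starting distance makes early (far from target) and late (close to target) rounds comparable.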

BibTeX

@inproceedings{mccarthy-etal-2025-mrcad,
    title = "mr{CAD}: Multimodal Communication to Refine Computer-aided Designs",
    author = "McCarthy, William P  and
      Vaduguru, Saujas  and
      Willis, Karl D.d.  and
      Matejka, Justin  and
      Fan, Judith E  and
      Fried, Daniel  and
      Pu, Yewen",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1248/",
    doi = "10.18653/v1/2025.findings-emnlp.1248",
    pages = "22905--22921",
    ISBN = "979-8-89176-335-7",
    abstract = "In collaborative creation tasks, people steer artifacts towards specific goals by {\_}refining{\_} them with {\_}multimodal{\_} communication over multiple rounds of interaction. In contrast, generative AI excels at creating artifacts in a single turn but can struggle to make precise refinements that match our design intent. To close this gap, we present mrCAD, a dataset of multi-turn interactions in which pairs of humans iteratively created and refined computer-aided designs (CADs). In each game, a {\_}Designer{\_} sent instructions to a {\_}Maker{\_}, explaining how to create and subsequently refine a CAD to match a target design that only the {\_}Designer{\_} could see. mrCAD consists of 6,082 communication games, 15,163 instruction-execution rounds, played between 1,092 pairs of human players. Crucially, {\_}Designers{\_} had access to two communication modalities {--} text and drawing. Analysis finds that players relied more on text in refinement than in initial generation instructions, and used different linguistic elements for refinement than for generation. We also find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay the foundation for modeling multi-turn, multimodal communication not captured in prior datasets."
}