Sina J. Semnani, Violet Z. Yao*, Heidi C. Zhang*, Monica S. Lam
In Findings of the Association for Computational Linguistics: EMNLP 2023, Resorts World Convention Centre, Singapore, December 6-10, 2023.
[Paper] [Github]
This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus.
WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill the GPT-4-based WikiChat into a 7B-parameter LLaMA model with minimal loss of quality, significantly improving its latency, cost, and privacy, and facilitating research and deployment.
Using a novel hybrid human-and-LLM evaluation methodology, we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, beating GPT-4 by 3.9%, 38.6%, and 51.0% on head, tail, and recent knowledge, respectively. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM.
WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments.
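The generate-then-ground loop described above can be sketched as follows. All function names are illustrative stand-ins rather than WikiChat's actual API, and the support check is a naive substring test standing in for the paper's LLM-based fact verification:

```python
def extract_claims(response):
    """Split an LLM response into individual factual claims (toy stub)."""
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim, passages):
    """Keep a claim only if some retrieved passage supports it.
    Stub: naive substring check instead of LLM-based verification."""
    return any(claim.lower() in p.lower() for p in passages)

def wikichat_respond(history, llm_generate, retrieve_passages):
    # 1. Generate a candidate response from the LLM.
    draft = llm_generate(history)
    # 2. Retrieve evidence from the corpus (e.g. Wikipedia).
    passages = retrieve_passages(history)
    # 3. Retain only the grounded claims from the draft.
    grounded = [c for c in extract_claims(draft) if is_supported(c, passages)]
    # 4. Combine the grounded claims into the final factual reply.
    return ". ".join(grounded) + "." if grounded else ""
```

For example, a draft containing one supported and one hallucinated claim would be pruned to just the supported one.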
Mehrad Moradshahi*, Tianhao Shen*, Kalika Bali, Monojit Choudhury, Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, Sina J. Semnani, Jiwon Seo, Vivek Seshadri, Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong, Monica S. Lam
[Paper] [Github]
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ dataset into four languages (English, French, Hindi, and Korean) plus code-mixed English-Hindi. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances per language, and unlike most prior multilingual work, is an end-to-end dataset for building fully-functioning agents.
The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks.
We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
Mehrad Moradshahi, Sina J. Semnani, Monica S. Lam
In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), Dubrovnik, Croatia, May 2023.
[Paper] [Github]
Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice a lot of accuracy for data efficiency, and largely fail in creating a usable dialogue agent.
We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e. zero-shot) or a small training set (i.e. few-shot). Unlike most prior work in cross-lingual ToD that only focuses on Dialogue State Tracking (DST), we build an end-to-end agent.
We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents.
We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatic filtering of noisy translations.
We evaluate our approach on the recent bilingual dialogue dataset BiToD.
In Chinese to English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting where 10% of the data in the target language is used, we improve the state-of-the-art by 15.2% and 14.0%, coming within 5% of full-shot training.
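Steps (2) and (3) above can be illustrated with a small sketch of entity-aware translation plus automatic filtering. The placeholder scheme and the `translate` callable are illustrative stand-ins; the actual system combines a neural translator with alignment:

```python
def translate_with_alignment(utterance, entities, translate, target_entities):
    """Translate an utterance while keeping slot values intact.
    `translate` is a hypothetical MT stand-in."""
    # Replace each slot value with a numbered placeholder so the
    # translator cannot distort it.
    masked = utterance
    for i, ent in enumerate(entities):
        masked = masked.replace(ent, f"<e{i}>")
    translated = translate(masked)
    # Restore entities in the target language; discard the noisy
    # translation if any placeholder was lost (automatic filtering).
    for i, ent in enumerate(target_entities):
        if f"<e{i}>" not in translated:
            return None
        translated = translated.replace(f"<e{i}>", ent)
    return translated
```

Returning `None` for translations that dropped a placeholder models the filtering of noisy translations.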
Mehrad Moradshahi, Victoria C. Tsai, Giovanni Campagna, Monica S. Lam
In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), Dubrovnik, Croatia, May 2023.
[Paper] [Github]
Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that the succinct representation reduces the compounding effect of translation errors, without harming the accuracy in practice.
We evaluate our approach on several dialogue state tracking benchmarks. On the RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets, we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for these datasets, showing that erroneous annotations can obscure judgments on the quality of the model.
Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations.
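The succinct context representation described above (formal slots and values plus only the last agent and user utterances) can be sketched as a small serialization step. The token layout and slot names below are illustrative, not the paper's exact format:

```python
def parser_input(state, last_agent, last_user):
    """Build the parser's input from the formal dialogue state and only
    the two most recent utterances, keeping the context succinct so that
    translation errors in earlier turns cannot compound."""
    slots = " ; ".join(f"{slot} = {value}" for slot, value in sorted(state.items()))
    return f"<state> {slots} <agent> {last_agent} <user> {last_user}"
```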
Giovanni Campagna, Sina J. Semnani, Ryan Kearns, Lucas Jun Koba Sato, Silei Xu, Monica S. Lam
In Findings of the Association for Computational Linguistics (ACL), May 2022.
[Paper] [Github]
Previous attempts to build effective semantic parsers for Wizard-of-Oz (WOZ) conversations suffer from the difficulty in acquiring a high-quality, manually annotated training set. Approaches based only on dialogue synthesis are insufficient, as dialogues generated from state-machine based models are poor approximations of real-life conversations. Furthermore, previously proposed dialogue state representations are ambiguous and lack the precision necessary for building an effective agent.
This paper proposes a new dialogue representation and a sample-efficient methodology that can predict precise dialogue states in WOZ conversations. We extended the ThingTalk representation to capture all information an agent needs to respond properly. Our training strategy is sample-efficient: we combine (1) few-shot data sparsely sampling the full dialogue space and (2) synthesized data covering a subset of the dialogue space, generated by a succinct state-based dialogue model. The completeness of the extended ThingTalk language is demonstrated with a fully operational agent, which is also used in training data synthesis.
We demonstrate the effectiveness of our methodology on MultiWOZ 3.0, a reannotation of the MultiWOZ 2.1 dataset in ThingTalk. ThingTalk can represent 98% of the test turns, while the simulator can emulate 85% of the validation set. We train a contextual semantic parser using our strategy, and obtain 79% turn-by-turn exact match accuracy on the reannotated test set.
Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2020.
[Paper] [Github]
We propose Semantic Parser Localizer (SPL), a toolkit that leverages Neural Machine Translation (NMT) systems to localize a semantic parser for a new language. Our methodology is to (1) generate training data automatically in the target language by augmenting machine-translated datasets with local entities scraped from public websites, (2) add a few-shot boost of human-translated sentences and train a novel XLMR-LSTM semantic parser, and (3) test the model on natural utterances curated using human translators.
Best performance is achieved with a few-shot approach, where a small proportion of the training set consists of natural human translations of utterances from the English development set.
We assess the effectiveness of our approach by extending the current capabilities of Schema2QA, a system for English Question Answering (QA) on the open web, to 10 new languages for the restaurants and hotels domains.
Our models achieve an overall test accuracy between 61% and 69% for the hotels domain and between 64% and 78% for the restaurants domain, which compares favorably to the 69% and 80% obtained for an English parser trained on gold English data and a few examples from the validation set.
We show our approach outperforms the previous state-of-the-art methodology by more than 30% for hotels and 40% for restaurants with localized ontologies for the subset of languages tested.
Our methodology enables any software developer to add a new language capability to a QA system for a new domain, leveraging machine translation, in less than 24 hours.
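Step (1) of the SPL methodology, augmenting machine-translated data with locally scraped entities, can be sketched as a simple cross product. The `ENTITY` placeholder name is hypothetical, chosen only for illustration:

```python
import itertools

def localize(translated_templates, local_entities):
    """Expand machine-translated templates into training sentences by
    substituting entities scraped from local websites for the
    (hypothetical) ENTITY placeholder."""
    return [template.replace("ENTITY", entity)
            for template, entity in itertools.product(translated_templates,
                                                      local_entities)]
```

A single translated template thus yields one training sentence per local entity, which is how the synthesized target-language dataset grows cheaply.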
Silei Xu*, Sina J. Semnani*, Giovanni Campagna, and Monica S. Lam
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2020.
[Paper] [Website]
We propose AutoQA, a methodology and toolkit to generate semantic parsers that answer questions on databases, with no manual effort. Given a database schema and its data, AutoQA automatically generates a large set of high-quality questions for training that covers different database operations. It uses automatic paraphrasing combined with template-based parsing to find alternative expressions of an attribute in different parts of speech. It also uses a novel filtered auto-paraphraser to generate correct paraphrases of entire sentences.
We apply AutoQA to the Schema2QA dataset and obtain an average logical form accuracy of 62.9% when tested on natural questions, which is only 6.4% lower than a model trained with expert natural language annotations and paraphrase data collected from crowdworkers. To demonstrate the generality of AutoQA, we also apply it to the Overnight dataset. AutoQA achieves 69.8% answer accuracy, 16.4% higher than the state-of-the-art zero-shot models and only 5.2% lower than the same model trained with human data.
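The filtered auto-paraphraser idea can be sketched as follows: a candidate paraphrase is kept only if a parser maps it back to the original logical form, discarding paraphrases whose meaning drifted. The `paraphrase` and `parse` callables are hypothetical stand-ins for the neural paraphraser and the parser trained on synthesized data:

```python
def filtered_paraphrases(sentence, logical_form, paraphrase, parse):
    """Generate paraphrases of `sentence` and keep only those that the
    parser still maps to the same logical form (the filtering step)."""
    kept = []
    for candidate in paraphrase(sentence):
        # Discard paraphrases whose meaning drifted from the original.
        if parse(candidate) == logical_form:
            kept.append(candidate)
    return kept
```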
Silei Xu, Giovanni Campagna, Jian Li, and Monica S. Lam
In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), October 2020.
[Paper] [Website]
Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model.
We use Schema2QA to generate Q&A systems for five Schema.org domains (restaurants, people, movies, books, and music), and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant to the hotel domain, obtaining a 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all these assistants by at least 18% on more complex, long-tail questions.
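The key synthesis concept above can be sketched as crossing generic query templates with the per-field annotations supplied for the schema. The template format and annotation structure below are illustrative, not Schema2QA's actual template language:

```python
def synthesize(templates, annotations):
    """Cross generic query templates with per-field natural-language
    annotations to produce in-domain training questions."""
    questions = []
    for template in templates:
        for phrases in annotations.values():
            for phrase in phrases:
                questions.append(template.format(prop=phrase))
    return questions
```

Even this toy version shows how a small number of annotations per field multiplies into broad coverage of compound queries.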
Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S. Lam
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), July 2020.
[Paper] [Github]
This paper proposes a new zero-shot transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show training with only synthesized in-domain data on the SUMBT model can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21%.
Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, and Monica S. Lam
In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Phoenix, AZ, June 2019.
[Paper] [Github]
To understand diverse natural language commands, virtual assistants today are trained with numerous labor-intensive, manually annotated sentences. This paper presents a methodology and the Genie toolkit that can handle new compound commands with significantly less manual effort.
We advocate formalizing the capability of virtual assistants with a Virtual Assistant Programming Language (VAPL) and using a neural semantic parser to translate natural language into VAPL code. Genie needs only a small realistic set of input sentences for validating the neural model. Developers write templates to synthesize data; Genie uses crowdsourced paraphrases and data augmentation, along with the synthesized data, to train a semantic parser. We also propose design principles that make VAPL languages amenable to natural language translation.
We apply these principles to revise ThingTalk, the language used by the Almond virtual assistant. We use Genie to build the first semantic parser that can support compound virtual assistant commands with unquoted free-form parameters. Genie achieves a 62% accuracy on realistic user inputs. We demonstrate Genie's generality by showing a 19% and 31% improvement over the previous state of the art on a music skill, aggregate functions, and access control.
Giovanni Campagna, Silei Xu, Rakesh Ramesh, Michael Fischer, and Monica S. Lam
In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 2018.
[Paper]
This paper proposes a novel approach to let consumers share data from their existing web accounts and devices easily, securely, and with fine granularity of control. Our proposal is to have our personal virtual assistant be responsible for sharing our digital assets. The owner can specify fine-grain access control in natural language; the virtual assistant executes access requests on behalf of the requesters and returns the results, if the requests conform to the owner's access control policies.
Specifically, we allow a virtual assistant to share any ThingTalk command--an event-driven task composed of skills drawn from Thingpedia, a crowdsourced repository with over 200 functions currently. Access control in natural language is translated into TACL, a formal language we introduce to let users express for whom, what, when, where, and how ThingTalk commands can be executed. TACL policies are in turn translated into SMT (Satisfiability Modulo Theories) formulas and enforced using a provably correct algorithm. Our Distributed ThingTalk Protocol lets users access their own and others' data through their own virtual assistant, while enabling sharing without disclosing information to a third party.
The proposed ideas have been incorporated and released in the open-source Almond virtual assistant. 18 of the 20 users in a study say that they like the concept proposed, and 14 like the prototype. We show that users are more willing to share their data given the ability to impose TACL constraints, that 90% of enforceable use cases suggested by 60 users are supported by TACL, and that static and dynamic conformance of policies can be enforced efficiently.
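The enforcement step above can be illustrated with a toy conformance check: a request is executed only if it satisfies at least one of the owner's policies. Here each policy is a plain Python predicate over request attributes; the paper instead compiles TACL policies into SMT formulas and checks conformance with a solver:

```python
def allowed(request, policies):
    """Return True if the request conforms to at least one of the owner's
    access control policies. Each policy is a predicate over the request's
    who/what/when/where/how attributes (toy stand-in for SMT checking)."""
    return any(policy(request) for policy in policies)
```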
Giovanni Campagna, Rakesh Ramesh, Silei Xu, Michael Fischer, and Monica S. Lam
In Proceedings of the 26th International World Wide Web Conference (WWW), Perth, Australia, April 2017.
[Paper]
This paper presents the architecture of Almond, an open, crowdsourced, privacy-preserving and programmable virtual assistant for online services and the Internet of Things (IoT). Included in Almond is Thingpedia, a crowdsourced public knowledge base of open APIs and their natural language interfaces. Our proposal addresses four challenges in virtual assistant technology: generality, interoperability, privacy, and usability. Generality is addressed by crowdsourcing Thingpedia, while interoperability is provided by ThingTalk, a high-level domain-specific language that connects multiple devices or services via open APIs. For privacy, user credentials and user data are managed by our open-source ThingSystem, which can be run on personal phones or home servers. Finally, we create a natural language interface, whose capability can be extended via training with the help of a menu-driven interface.
We have created a fully working prototype, and crowdsourced a set of 187 functions across 45 different kinds of devices. Almond is the first virtual assistant that lets users specify trigger-action tasks in natural language. Despite the lack of real usage data, our experiment suggests that Almond can understand about 40% of the complex tasks when uttered by a user familiar with its capability.
Michael H. Fischer, Giovanni Campagna, Euirim Choi, and Monica S. Lam
In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), June 2021
[Paper]
While Alexa can perform over 100,000 skills, its capability covers only a fraction of what is possible on the web. Individuals need and want to automate a long tail of web-based tasks which often involve visiting different websites and require programming concepts such as function composition, conditional, and iterative evaluation. This paper presents DIYA (Do-It-Yourself Assistant), a new system that empowers users to create personalized web-based virtual assistant skills that require the full generality of composable control constructs, without having to learn a formal programming language.
With DIYA, the user demonstrates their task of interest in the browser and issues a few simple voice commands, such as naming the skills and adding conditions on the action. DIYA turns these multi-modal specifications into voice-invocable skills written in the ThingTalk 2.0 programming language we designed for this purpose. DIYA is a prototype that works in the Chrome browser. Our user studies show that 81% of the proposed routines can be expressed using DIYA. DIYA is easy to learn, and 80% of users surveyed find DIYA useful.
Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James Landay, Monica S Lam
In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), June 2021.
[Paper]
Grounding natural language instructions on the web to perform previously unseen tasks enables accessibility and automation. We introduce a task and dataset to train AI agents from open-domain, step-by-step instructions originally written for people. We build RUSS (Rapid Universal Support Service) to tackle this problem. RUSS consists of two models: First, a BERT-LSTM with pointers parses instructions to ThingTalk, a domain-specific language we design for grounding natural language on the web. Then, a grounding model retrieves the unique IDs of any webpage elements requested in ThingTalk. RUSS may interact with the user through a dialogue (e.g. ask for an address) or execute a web operation (e.g. click a button) inside the web runtime. To augment training, we synthesize natural language instructions mapped to ThingTalk. Our dataset consists of 80 different customer service problems from help websites, with a total of 741 step-by-step instructions and their corresponding actions. RUSS achieves 76.7% end-to-end accuracy predicting agent actions from single instructions. It outperforms state-of-the-art models that directly map instructions to actions without ThingTalk. Our user study shows that RUSS is preferred by actual users over web navigation.
Jackie (Junrui) Yang, Monica S. Lam, James A. Landay
In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), October 2020.
[Paper]
Many computing tasks, such as comparison shopping, two-factor authentication, and checking movie reviews, require using multiple apps together. On large screens, "windows, icons, menus, pointer" (WIMP) graphical user interfaces (GUIs) support easy sharing of content and context between multiple apps. So, it is easy to see the content from one application and write something relevant in another application, such as looking at the map around a place and typing walking instructions into an email. However, although today's smartphones also use GUIs, they have small screens and limited windowing support, making it hard to switch contexts and exchange data between apps.
We introduce DoThisHere, a multimodal interaction technique that streamlines cross-app tasks and reduces the burden these tasks impose on users. Users can use voice to refer to information or app features that are off-screen and touch to specify where the relevant information should be inserted or is displayed. With DoThisHere, users can access information from or carry information to other apps with less context switching.
We conducted a survey to find out what cross-app tasks people are performing or wish to perform on their smartphones. Among the 125 tasks that we collected from 75 participants, we found that 59 of these tasks are not well supported currently. DoThisHere is helpful in completing 95% of these unsupported tasks. A user study, where users are shown the list of supported voice commands when performing a representative sample of such tasks, suggests that DoThisHere may reduce expert users' cognitive load; the Query action, in particular, can help users reduce task completion time.
Jackie Yang, Gaurab Banerjee, Vishesh Gupta, Monica S. Lam, and James A. Landay
In CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, April 2020.
[Paper]
Although state-of-the-art smart speakers can hear a user's speech, unlike a human assistant these devices cannot figure out users' verbal references based on their head location and orientation. Soundr presents a novel interaction technique that leverages the built-in microphone array found in most smart speakers to infer the user's spatial location and head orientation using only their voice. With that extra information, Soundr can figure out users' references to objects, people, and locations based on the speaker's gaze, and also provide relative directions.
To provide training data for our neural network, we collected 751 minutes of data (50x that of the best prior work) from human speakers leveraging a virtual reality headset to accurately provide head tracking ground truth. Our results achieve an average positional error of 0.31m and an orientation angle accuracy of 34.3 degrees for each voice command. A user study to evaluate user preferences for controlling IoT appliances by talking at them found this new approach to be fast and easy to use.
Michael H. Fischer, Richard R. Yang, and Monica S. Lam
ArXiv Preprint
[Paper]
This paper presents ImagineNet, a tool that uses a novel neural style transfer model to enable end-users and app developers to restyle GUIs using an image of their choice. Prior neural style transfer techniques are inadequate for this application because they produce GUIs that are illegible and hence nonfunctional. We propose a neural solution by adding a new loss term to the original formulation, which minimizes the squared error in the uncentered cross-covariance of features from different levels in a CNN between the style and output images. ImagineNet retains the details of GUIs, while transferring the colors and textures of the art. We presented GUIs restyled with ImagineNet as well as other style transfer techniques to 50 evaluators and all preferred those of ImagineNet. We show how ImagineNet can be used to restyle (1) the graphical assets of an app, (2) an app with user-supplied content, and (3) an app with dynamically generated GUIs.
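The added loss term can be written down concretely. The sketch below computes the uncentered cross-covariance between feature maps from two CNN levels (each map given as a list of channel vectors flattened over spatial positions) and the squared error between the style and output statistics; actually extracting the features from a CNN is assumed to happen elsewhere:

```python
def cross_cov(feats_a, feats_b):
    """Uncentered cross-covariance between two feature maps. Each map is
    a list of channels; each channel is a list of activations over the
    flattened spatial positions."""
    n = len(feats_a[0])
    return [[sum(a[k] * b[k] for k in range(n)) / n for b in feats_b]
            for a in feats_a]

def structure_loss(style_i, style_j, out_i, out_j):
    """Squared error between the uncentered cross-covariances of level-i
    and level-j features for the style image vs. the output image."""
    cs = cross_cov(style_i, style_j)
    co = cross_cov(out_i, out_j)
    return sum((s - o) ** 2
               for s_row, o_row in zip(cs, co)
               for s, o in zip(s_row, o_row))
```

Minimizing this term pushes the output's cross-level feature statistics toward the style image's, which is what preserves GUI structure while transferring texture.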
Michael Fischer, Giovanni Campagna, Silei Xu, and Monica S. Lam
In 20th International Conference on Human-Computer Interaction with Mobile Devices and Services. (MobileHCI), 2018.
[Paper]
This paper presents Brassau, a graphical virtual assistant that converts natural language commands into GUIs. A virtual assistant with a GUI has the following benefits compared to text- or speech-based virtual assistants: users can monitor multiple queries simultaneously, it is easy to re-run complex commands, and users can adjust settings using multiple modes of interaction. Brassau introduces a novel template-based approach that leverages a large corpus of images to make GUIs visually diverse and interesting. Brassau matches a command from the user to an image to create a GUI. This approach decouples the commands from GUIs and allows for reuse of GUIs across multiple commands. In our evaluation, users prefer the widgets produced by Brassau over plain GUIs.