Human-Robot Friendship



DT2140HT211 - Multimodal interaction and interfaces - Course project

Project Overview

Multimodal social interaction with the Furhat head in a COVID-19 triage scenario

Responsibilities: UX research, Dialogue design
Keywords: Multimodal Interaction, Social Interaction, Human-machine Interaction, Conversational robot
Duration: 2 months
Belongs: DT2140HT211 - Multimodal interaction and interfaces



The sudden outbreak of the novel coronavirus epidemic brought tremendous changes to social life, and the ensuing problems affected many industries, with a particularly pronounced impact on the healthcare service sector. During the epidemic, cross-infection between doctors and patients and strained medical resources became the main barriers for patients seeking medical care. In the immediate wake of an outbreak, it is especially difficult to consult a medical specialist.

Introducing healthcare technologies is one important way to address these problems. Artificial intelligence and intelligent medicine have become central to healthcare technology in recent years, pushing medical digitalization toward more ambitious goals. Human-computer interaction is closely tied to healthcare, and conversational AI such as chatbots is increasingly prevalent; chatbots are among the most advanced and promising applications in human-computer interaction. They have good prospects in outpatient consultation and triage scenarios, and some medical institutions have begun related research and deployment. Chatbots reduce the workload of nurses and guides while meeting patients' consultation needs, help keep consultation areas and rooms quiet and orderly, and can effectively supplement immediate medication advice and personal clinical care during a pandemic.

Competitive Analysis

Some medical chatbots for the COVID-19 context have already been established. In December 2021, the Rwandan startup Digital Umuganda developed the "RBC-Mbaza" chatbot to provide virus updates and information on current restrictions and regulations. Official reports illustrate the impact of such information initiatives: the "RBC-Mbaza" chatbot plays a key role in Rwanda's effort to manage the COVID-19 pandemic and has been used by over half a million people.


In March 2020, the World Health Organization (WHO) launched a Health Alert in seven languages with its partners WhatsApp and Facebook, aiming to bring COVID-19 facts to billions. It uses Turn machine learning technology to power a messaging service that provides COVID-19 updates.


Likewise, the British Government has deployed a WhatsApp chatbot to let anyone subscribe to official advice and the latest news about the COVID-19 pandemic, in the hope of strengthening infection prevention efforts.


Supervision Intake

At the very beginning, we planned to build a doctor-assistant chatbot for general disease scenarios. During discussions with our supervisor, she pointed out some issues with our project: the original topic was too broad to implement within the limited time. She suggested we narrow our scope and work on a more specific use case. We followed her advice and adjusted our direction toward a smaller angle: consultation in the COVID-19 context. Her suggestions helped us a great deal and made the project more complete and focused.


Case Study in COVID-19 scenario


Since the pandemic spread so rapidly, COVID-19 testing has become very common in daily life. More and more testing centers have appeared in cities, and people with suspected infections can go to the nearest center for a test. However, there are many challenges:


At any time of day there is a long lineup in front of the testing center. The crowding increases the risk of cross-infection. Testing-center staff spend their day scheduling appointments and answering patients' routine questions; repeating the same actions and words is neither necessary nor productive.

What's worse, due to time limitations, it is hard for healthcare staff to give undivided attention to every patient and promote preventive measures.

There are also limitations to the "one-off" testing kits that allow people to test at home: users face the risk of false negatives because the kit cannot adapt to individual contexts, something that can only be addressed through consultation.

Last but not least, many people are fearful of having face-to-face testing or consultation.


The limitations of text-based chatbots

There are some undeniable limitations to text-based chatbots. Firstly, since they rely on a single modality, they cannot understand human context beyond the text itself. This is a large gap that can easily frustrate users. Physicians have also raised concerns about text-based medical consultation, citing problems with self-diagnosis and with patients not understanding the diagnosis.

Secondly, much research points out that the biggest challenge for chatbots is creating smooth interaction with humans. With only one modality, accessibility and fluidity become issues for users. In a medical context, related information and news contain a great deal of text, so a text-based chatbot demands cognitive effort from users and causes cognitive strain.

Research on multimodal chatbots has shown that intelligibility can be improved through modality combinations like semantic context, clear speech, and visual input.

Thirdly, a single modality makes text-based chatbots hard to personalize, so they display no emotion, which Stone et al. point out is a major challenge for the field of AI. A text-based chatbot cannot respond to a user's low mood and can never establish a connection with users, which is crucial for medical consultations.



Modality Design

The biggest challenge was making complex decisions and interactions simple, given the project's time limitations. We had to narrow our vision, reuse the standardized APIs offered by the Furhat SDK, and focus on a single use case. Combining multiple modalities to perform an action has to make sense for the interaction to be natural and efficient; multimodal systems developed in the past did not have the technological and hardware advances we have access to today. We combined two modalities, speech and gesture, to form a complete semantic interpretation by synchronizing the signals and data from the Furhat's different inputs.

Speech, spoken-language interpretation, gesture recognition, and gestural-language interpretation are all integrated into the application bridge that serves as the launch point, which we access via the Furhat API's application invocation.


An important part of the information architecture is the mix of modalities: to enhance robustness and naturalness, it made most sense to combine auditory input and gesture recognition.


We combined the gesture, gaze, and speech modalities with LED feedback, using Furhat's built-in LED, to enhance the user experience. The combination strengthens the robustness of the multimodal interface and makes it more intuitive. These modalities let the virtual assistant react to gestures, such as a smile to wake the robot and start the assessment, and ask relevant questions naturally and efficiently.
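The synchronization of the gesture and speech channels can be sketched in plain Kotlin. This is only an illustration of the fusion idea, not the project's actual code; the event and class names here are hypothetical:

```kotlin
// Hypothetical sketch of fusing a smile-detection event with the speech
// channel: the robot only starts attending to speech once a smile is seen.
sealed class InputEvent {
    object SmileDetected : InputEvent()
    data class SpeechHeard(val text: String) : InputEvent()
}

class MultimodalGate {
    private var awake = false
    val transcript = mutableListOf<String>()

    // Returns a short description of the action taken for each event.
    fun handle(event: InputEvent): String = when (event) {
        is InputEvent.SmileDetected -> {
            awake = true
            "wake"                        // the smile wakes the robot and starts the assessment
        }
        is InputEvent.SpeechHeard ->
            if (awake) { transcript.add(event.text); "heard:${event.text}" }
            else "ignored"                // speech before wake-up is discarded
    }
}

fun main() {
    val gate = MultimodalGate()
    println(gate.handle(InputEvent.SpeechHeard("hello")))        // ignored: not awake yet
    println(gate.handle(InputEvent.SmileDetected))               // wake
    println(gate.handle(InputEvent.SpeechHeard("I have a cough")))
}
```

The same gating pattern generalizes: any modality can arm or disambiguate another, which is what makes the combined interface more robust than a single channel.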

Dialogue Design

The dialogue design includes four parts (Figure 5). First, users smile at Furhat to wake it up and check in. Second, Furhat gives a short self-introduction and starts a general symptom check. If users have severe symptoms, Furhat calls a doctor immediately. If users have common or uncommon symptoms, Furhat runs a severity check to see whether the probability of infection is high. If it is high, Furhat contacts a doctor; otherwise, people with a low probability of infection, and people with no symptoms, are given preventive advice.
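The branching just described can be summarized as a small decision function. This is an illustrative sketch of the triage outcome, not the deployed skill; the enum and function names are ours:

```kotlin
// Illustrative triage outcome for the dialogue design: severe symptoms call a
// doctor immediately; common/uncommon symptoms go through a severity check;
// a high infection probability also calls a doctor; everyone else gets
// preventive advice.
enum class Symptoms { NONE, COMMON, UNCOMMON, SEVERE }
enum class Action { CALL_DOCTOR, GIVE_PREVENTIVE_ADVICE }

fun triage(symptoms: Symptoms, highInfectionRisk: Boolean): Action = when {
    symptoms == Symptoms.SEVERE -> Action.CALL_DOCTOR
    symptoms == Symptoms.NONE -> Action.GIVE_PREVENTIVE_ADVICE
    highInfectionRisk -> Action.CALL_DOCTOR          // outcome of the severity check
    else -> Action.GIVE_PREVENTIVE_ADVICE
}
```

Keeping the outcome logic in one pure function like this makes the dialogue flow easy to test independently of the robot hardware.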




The code was developed based on the speech lab and on our research into how to produce expressions with Furhat. We followed the Furhat documentation both to create the skill and to run it. The most important class is the interaction class, where all interactions between the person and Furhat are defined.
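The shape of that interaction class can be sketched as a state machine in plain Kotlin. The deployed skill uses Furhat's flow DSL instead; the state names, prompts, and interpreted input strings below are illustrative only:

```kotlin
// Plain-Kotlin sketch of the interaction as a state machine. Each state has a
// prompt (what the robot says on entry) and a transition on the user's
// already-interpreted input.
enum class DialogState { IDLE, SYMPTOM_CHECK, SEVERITY_CHECK, CALL_DOCTOR, ADVICE }

// What the robot would say on entering each state (wording is ours).
fun prompt(state: DialogState): String = when (state) {
    DialogState.IDLE -> ""                           // silently waiting for a smile
    DialogState.SYMPTOM_CHECK -> "Hi, I am your COVID-19 assistant. Do you have any symptoms?"
    DialogState.SEVERITY_CHECK -> "Let me check how serious this might be."
    DialogState.CALL_DOCTOR -> "I will call a doctor for you right away."
    DialogState.ADVICE -> "Please keep your distance and wash your hands often."
}

// Transition function mirroring the four-part dialogue design.
fun next(state: DialogState, input: String): DialogState = when (state) {
    DialogState.IDLE ->
        if (input == "smile") DialogState.SYMPTOM_CHECK else DialogState.IDLE
    DialogState.SYMPTOM_CHECK -> when (input) {
        "severe" -> DialogState.CALL_DOCTOR
        "none" -> DialogState.ADVICE
        else -> DialogState.SEVERITY_CHECK           // common or uncommon symptoms
    }
    DialogState.SEVERITY_CHECK ->
        if (input == "high") DialogState.CALL_DOCTOR else DialogState.ADVICE
    DialogState.CALL_DOCTOR, DialogState.ADVICE -> state  // terminal states
}
```

In the real skill the transitions are driven by Furhat's speech recognition and gesture events rather than strings, but the control structure is the same.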


In this project, we used the Furhat robot head to study and test multimodal interaction for COVID-19 medical consultation. Furhat has an animated face projected onto a three-dimensional physical mask. It uses up-to-date animation models that produce articulatory movements synchronized with the output speech, and it can perform highly accurate, realistic facial gestures. In addition, Furhat has a 3-DOF neck that allows head-pose control and eliminates the Mona Lisa gaze effect. We implemented the project with our laptops and a physical Furhat: we connected to Furhat over the network and deployed the skill on it.


Technical level

At the start of the project we had many ideas about what we wanted to achieve, but there were many limitations at the technical level, as we had to familiarize ourselves with the SDK and the Kotlin programming language. Our knowledge of the subject has grown enormously since we started, and we are really happy with the result, though there is still room for improvement. The knowledge and understanding we acquired through this project equip us to explore further and take the project even farther. The most complex part was not the task itself but the knowledge required before solving it. We spent much of our time researching the API calls and the SDK, as well as related research on the gesture and auditory modalities.

We made use of the course lab sessions and the time at the beginning of the course to explore the direction of the final project. The basic code we built on top of was written and debugged during the lab, which allowed us to focus on understanding the SDK. Due to time limitations, we concentrated on the core while using the existing API for repetitive calls. One thing we could have focused on more is user testing: due to COVID-19, we were only able to perform three usability tests, which covered a limited set of flows.

Scientific level

Our work is directly related to the theory of HCI and human communication, as it replaces a common interaction between two humans, with the objective of reducing the spread of COVID-19 and shortening triage times in COVID-19 assessments.

For the solution, our project breaks long text apart and rebuilds it as quick Q&A, which largely resolves the problem of patients not understanding the diagnosis, a problem caused by single-modality, text-based chatbots. The combination of auditory and visual channels improves intelligibility and gives Furhat a personalized, emotional quality.

As we implemented only a few modalities, namely speech, facial reactions, and LED color, the last not being as clear to users as we expected because the colors did not indicate the severity of the responses, there are several limitations compared with a real human-like interaction. However, for the purposes of triage alone, Furhat can be very handy, as it mimics the basic conversation an agent would have with a potential COVID-19 patient. This can meet physicians' needs to use chatbots for simple tasks in their clinics, such as scheduling appointments and providing medical information.


We invited three volunteers to evaluate our project, all with prior experience using text-based chatbots. Before the evaluation, we asked them to pretend they had one or more COVID-19 symptoms and taught them the check-in method. They then spoke to Furhat one by one. When all of them had finished, they were invited to separate rooms for a 15-minute interview comparing the text-based chatbot with the multimodal Furhat, and they filled in a rating scale covering four dimensions: novelty, usability, flexibility, and human-likeness.


According to the user interviews, volunteers gave more positive feedback on Furhat; they thought communication with Furhat was more effective than with a text-based chatbot. It took users more time to find a specific answer from a text-based chatbot, and they grew impatient more easily. They were more willing to talk with Furhat because it looks novel and interesting. One volunteer emphasized that he was impressed that Furhat has facial expressions, which made him want to talk more with Furhat to see its responses.


One volunteer said: "I don’t like communication taking place behind screens. Compared with the traditional text-based chatbot, I’m more willing to talk with Furhat face to face. But it could be better if it has a large corpus and performs more facial expressions naturally.”

Volunteers liked this kind of combination and gave high scores on the novelty scale, but they also said the current functions are quite limited. Speech and facial gestures are not combined naturally, and the speech pace is fixed. Moreover, the dialogues are pre-defined, which limits flexibility. The LED modality is hard to interpret, and the red light sometimes gave them a sense of unease.


When we asked them for suggestions, most answers focused on improving the dialogue design and gesture design. Two volunteers suggested that the LED modality could be improved by conveying meaningful information. One volunteer thought we could add a card-reading modality so that Furhat could read the user's medical records from an ID card and provide specific suggestions.

What's next?

Our solution is not state of the art, and we believe there is room for improvement. The task we tried to solve could add value and reduce stress at hospitals and in public spaces. One enhancement that could add value as future work is connecting the Furhat skill to a more reliable source of information. Today we are seeing new virus variants and constant updates to restrictions and symptoms; connecting to an external API to automatically retrieve up-to-date COVID-19 information would allow the skill to tap into different, specific databases based on user needs.


What have I learned?

I found that the combination of auditory and visual modalities improves the usability and intelligibility of the spoken output; the two modalities compensate for each other. The auditory modality greatly improves communication efficiency through short Q&A: compared with a text-based chatbot, users don't need to read long text and unrelated content. It also improves accessibility, since people who are not tech-savvy can easily use it, which is friendly to vulnerable groups such as the elderly. Visual modalities like the LED and facial gestures support learnability, and the smile check-in provides enjoyment for users, especially in a tense medical-consultation atmosphere.
