Introducing the PAVAI Prototypes: an LLM-based implementation of a Voice User Interface (VUI) private talking assistant.

Minyang Chen
8 min read · Jan 2, 2024

Previously, I wrote a post on why offline voice assistants are the future. Developing a private, offline voice-based AI assistant has been on a wish list of mine for some time now. I have named this project the PAVAI Prototype, which includes a web app called PAVAI.VoiceQ for the UI and a hands-free interaction app called PAVAI.Talkie.

At present, I am cleaning up the experimental-phase code, simplifying installation steps, and conducting tests before I can publish it on GitHub. The scope of phase-1 development is to create a voice user interface for the web UI and hands-free interaction, then connect it with all the neural networks. I believe it would be advantageous to offer a brief overview of the project at this stage.

Enabling users to speak to the system as they would speak to another person provides a more intuitive and user-friendly interface, particularly for users who are not comfortable with technology or who prefer a more conversational interface. Moreover, a cloned version of your own voice carries more human emotion, and has more influence on judgment, than plain synthesized reading of text.

PAVAI VoiceQ — the AI models represented as a solar system (AI-generated image).

Motivation

The motivation for creating the PAVAI prototypes is to provide a more natural, intuitive, private, and convenient way for users to interact with technology, offline or online, while also improving accessibility and personalization, expanding user knowledge and creativity, and continuously integrating with existing systems and the latest AI technologies in speech recognition, text-to-speech, and large language models.

Goals

My simple goal is to develop a Voice User Interface droid system similar to C-3PO from the Star Wars films: programmed for etiquette and protocol, and proficient in multiple languages and domains of knowledge. Hopefully it will be more advanced than today's voice assistants such as Amazon Alexa, Google Assistant, Siri, and Cortana.

System Design Principles

  • Voice-based natural interaction
  • Runs in offline mode or a private environment
  • Provides real-time transcription, interpretation, and responses
  • Built with open-source AI models, including LLM, speech-to-text, and text-to-speech
  • Supports all-in-one or scalable distributed deployment modes
  • Focuses on privacy and user protection
  • Continuous knowledge integration
  • Keeps it simple, efficient, and intuitive

Essential Building Blocks and Technologies

  1. Voice Activity Detection (VAD) is an essential component of VUIs, as it enables the system to respond to user input in a timely and efficient manner while minimizing the risk of false activations or missed commands. By accurately detecting when a user is speaking, VAD algorithms help create a more natural and intuitive user experience, allowing users to interact with the VUI hands-free and eyes-free.
  2. Automatic Speech Recognition (ASR) neural network: this component converts spoken language into written text that the VUI can understand and process. ASR technology uses advanced neural network models to identify and transcribe spoken words, even in noisy environments or with different accents.
  3. The Large Language Model (LLM) handles both Natural Language Understanding (NLU) and dialogue management. Once the spoken words have been transcribed, the NLU component analyzes the text to determine the user's intent and extract any relevant entities or parameters; the LLM is applied to understand the meaning and context of the user's request. Tracking chat history helps control the flow of the conversation between the user and the VUI, allowing the LLM to determine the appropriate response based on the user's intent and the conversational context, and even to be proactive. The current design choice is an LLM with multilingual support.
  4. Text-to-Speech (TTS): this component converts the VUI's written responses into spoken words that the user can hear and understand. TTS technology uses synthesized speech to generate natural-sounding voices that convey the appropriate tone and emotion. I experimented with various TTS frameworks that support offline operation and dynamic voice loading.
  5. Custom data commands: this component provides quick chat commands to transcribe YouTube videos and ingest them into the chatbot, or to translate non-English YouTube audio into English text. This helps quickly enrich the background and context of a question.
  6. Voice-to-image generation based on Stable Diffusion XL, for extra creativity and fun.
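
To give a feel for the idea behind VAD (this is not PAVAI's actual detector; production systems use trained models such as Silero VAD or WebRTC VAD), a minimal energy-based detector can be sketched in a few lines of plain Python:

```python
# Minimal energy-based voice activity detector (illustrative sketch only;
# real VUIs use trained models such as Silero VAD or WebRTC VAD).

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def detect_speech(frames, threshold=0.01, hangover=2):
    """Mark each frame as speech (True) or non-speech (False).

    `hangover` keeps the detector 'on' for a few frames after the
    energy drops, so short pauses inside a sentence are not cut off.
    """
    flags, active = [], 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            active = hangover
        elif active > 0:
            active -= 1
        flags.append(active > 0)
    return flags

# Example: silence, one loud burst, then silence again.
quiet = [0.001] * 160
loud = [0.5] * 160
print(detect_speech([quiet, loud, quiet, quiet, quiet]))
# → [False, True, True, False, False]
```

The hangover counter is what makes the detector usable in a conversational setting: without it, every brief pause would immediately end the utterance.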
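
The building blocks above can be wired into a single conversational loop. The sketch below shows only the shape of that pipeline; the four component functions are hypothetical stand-ins, not PAVAI's actual API:

```python
# Shape of a VUI pipeline: VAD -> ASR -> LLM -> TTS.
# The vad/asr/llm/tts callables are hypothetical stand-ins.

def run_turn(audio, vad, asr, llm, tts, history):
    """Run one conversational turn through the pipeline."""
    if not vad(audio):                 # 1. voice activity detection
        return None                    #    silence: nothing to do
    text = asr(audio)                  # 2. speech-to-text
    history.append({"role": "user", "content": text})
    reply = llm(history)               # 3. LLM sees full chat history for context
    history.append({"role": "assistant", "content": reply})
    return tts(reply)                  # 4. text-to-speech

# Tiny stub components, just to show the data flow.
history = []
out = run_turn(
    audio=b"...",
    vad=lambda a: True,
    asr=lambda a: "hello there",
    llm=lambda h: f"echo: {h[-1]['content']}",
    tts=lambda t: t.upper(),
    history=history,
)
print(out)  # → ECHO: HELLO THERE
```

Passing the whole `history` list to the LLM step is what lets the model resolve follow-up questions against earlier turns, as described in point 3 above.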

PAVAI Prototypes

The PAVAI prototypes consist of two voice-interaction applications that cover different use cases and tasks.

PAVAI VoiceQ (Prototype-1)

The UI component of the VUI handles user input and responses, giving the user full control over voice or text input, typically through a microphone and speaker, with results shown on a display. The UI is designed to be minimal yet intuitive and easy to use, providing clear feedback to the user and enabling interaction through natural voice or quick-access commands.

Figure-1: PAVAI.VoiceQ — Components

Applicable Use Cases:

Staying in control of voice input, transcribing YouTube videos, voice-to-image generation, multi-modal chatting, and chatting with unstructured data.

Current Features

  • Voice activity detection
  • Automatic multilingual speech recognition and text-to-speech generation
  • Grammar-correction model that fixes synthesized transcripts without changing their context
  • Voice-to-image generation
  • Multi-modal chatbot supporting unstructured audio or text file uploads
  • Custom commands for YouTube transcription, summarization, translation, and web search
  • Automatic history summarization when the chat history hits the context maximum
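
The automatic history summarization feature can be sketched as a simple token-budget check: when the history grows past the model's context limit, older turns are collapsed into a summary. This is an illustrative sketch, and `summarize` stands in for an LLM summarization call:

```python
# Illustrative sketch of automatic chat-history summarization.
# `summarize` is a hypothetical stand-in for an LLM summarization call.

def estimate_tokens(messages):
    """Crude token estimate: roughly 1 token per 4 characters of content."""
    return sum(len(m["content"]) for m in messages) // 4

def compact_history(messages, max_tokens, summarize, keep_last=2):
    """Replace older turns with one summary message when over budget."""
    if estimate_tokens(messages) <= max_tokens:
        return messages                       # under budget: keep everything
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent                 # summary + most recent turns

msgs = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "what next?"},
]
compacted = compact_history(msgs, max_tokens=10,
                            summarize=lambda old: f"(summary of {len(old)} turns)")
```

Keeping the last few turns verbatim while summarizing the rest preserves immediate conversational context at a fraction of the token cost.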

Screenshots :

Figure-2: PAVAI.VoiceQ — input voice file (English)
Figure-3: PAVAI.VoiceQ — input voice (English)
Figure-4: PAVAI.VoiceQ — input voice (French)
Figure-5: PAVAI.VoiceQ — input voice (Cantonese)
Figure-6: PAVAI.VoiceQ — input voice (Arabic)
Figure-7: PAVAI.VoiceQ — input voice (Mandarin)
Figure-8: PAVAI.VoiceQ — voice to image generation
Figure-9: PAVAI.VoiceQ — chat custom youtube command

PAVAI Talkie (Prototype-2)

A hands-free, eyes-free interaction terminal application: the VUI enables users to interact with technology using their voice, freeing up their hands and eyes for other tasks.

Figure-1: PAVAI.Talkie — Components

Use Cases:

Hands-free, eyes-free Q&A, conversation, or task assistance.

Current Features

  • Voice activity detection
  • Wake-up commands
  • Keeps track of complete sentences
  • Minimizes speaker talkback (the spoken response echoing back into the microphone as input)
  • Automatic history summarization when the chat history hits the context maximum
  • Custom cloned AI voices for text-to-speech
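
The wake-up command and talkback-suppression features above amount to simple gating logic around the microphone. The sketch below is illustrative; the wake phrases and class are made up for the example, not taken from PAVAI's implementation:

```python
# Illustrative gating logic for a hands-free terminal:
#  - only react after a wake word is heard,
#  - ignore the microphone while the assistant itself is speaking,
#    so the spoken response is not re-captured as user input (talkback).

WAKE_WORDS = ("hey pavai", "ok pavai")   # illustrative wake phrases

class TalkieGate:
    def __init__(self):
        self.awake = False
        self.speaking = False   # set True while TTS audio is playing

    def accept(self, transcript):
        """Decide whether a transcript should reach the assistant."""
        if self.speaking:       # suppress talkback/echo
            return False
        text = transcript.lower().strip()
        if not self.awake:
            self.awake = any(text.startswith(w) for w in WAKE_WORDS)
            return self.awake   # only wake-word utterances get through
        return True             # once awake, pass everything along

gate = TalkieGate()
print(gate.accept("what time is it"))            # → False (not awake yet)
print(gate.accept("hey pavai, what time is it")) # → True  (wake word heard)
```

In a real system the `speaking` flag would be toggled by the TTS playback callbacks, and the gate would also reset `awake` after a period of silence.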

Overall, these components work together to enable users to interact with technology using their voice, making it more accessible and user-friendly.

Screenshots

Figure-2: PAVAI.Talkie — startup system check
Figure-3: PAVAI.Talkie — sample conversation in action

Key Characteristics

PAVAI has several unique characteristics that distinguish it from other types of user interfaces. Here are the key ones:

  1. Hands-free and eyes-free, meaning that users can interact with the system without needing to use their hands or look at a screen. This makes PAVAI particularly useful in situations where users are busy or on the move.
  2. Respecting natural language input enables users to speak to the system as they would speak to another person in their own language.
  3. Multimodal input allows users to combine voice commands with other input methods, which can provide a more flexible and intuitive user experience, particularly in complex or multi-step tasks. For example, ingesting a transcribed YouTube text to enrich the context of your question.
  4. Personalization to the user’s preferences and needs, providing tailored responses and recommendations based on the user’s history and context. This can help to create a more engaging and relevant user experience. Personalization can include voice cloning or tasks assistants.
  5. Always-on and accessible, providing users with instant access to information and services whenever they need it. This can be particularly useful when users need quick assistance looking up information or setting reminders for recurring tasks.
  6. Emotional connection with users, using natural language and tone of voice to convey empathy, humor, or other emotional states. This can help to create a more engaging and memorable user experience.

By leveraging these unique characteristics, PAVAI can be intuitive, user-friendly, and provide real value to users.

Limitation and Challenges

However, it’s important to keep in mind that PAVAI also has some limitations and challenges, as it requires a deep understanding of the unique characteristics of voice as a user interface. Here are some of the key challenges that I found:

  1. Ambiguity: Speech is inherently ambiguous, and users may use different words or phrases to describe the same thing. The LLM must handle this ambiguity and provide accurate responses even when the user's input is unclear or imprecise. Language localization can also result in misinterpretation.
  2. Accuracy means being able to accurately recognize and transcribe spoken language, even in noisy environments or with different accents and speech patterns. This is challenging, as speech recognition technology is not yet perfect and can sometimes produce errors or inaccuracies. One way to mitigate this is a voice activity detection model that clearly differentiates noise from user speech.
  3. Context means understanding the context in which the user is speaking, including the user's intent, the current state of the conversation, and any relevant entities or parameters. This requires sophisticated natural language understanding (NLU) technology that can analyze and interpret the user's input in real-time.
  4. Conversational flow must be managed between the user and the system, providing clear and concise responses that guide the user towards their goal. This requires careful consideration of the user flow and interaction model, as well as the use of clear and natural language. One workaround is to fine-tune the model on previous user conversation data to craft a personalized model of the user.
  5. User expectations are always high; users can easily become frustrated or disappointed if the system does not meet their needs or expectations.
  6. Privacy and security: the system must be mindful of privacy and security concerns and must take steps to ensure that user data is protected and secure.

By addressing these challenges and designing with the unique characteristics of voice as a user interface in mind, designers can create effective VUIs that provide real value to users.

Summary

Voice user interfaces (VUIs) are not a novel concept, as they allow users to communicate with devices in the same way they would with another person, making the interaction process easier and more intuitive. VUIs are considerably faster than typing to input information but are noticeably slower than reading or viewing output information from a computer system.

The PAVAI prototypes are still in the earlier stages of development and are far from being a clone of the C-3PO droid system. Nevertheless, the process of creating something new, continuing research, and innovating is an enjoyable aspect of this project. As always, I welcome your thoughts or suggestions on solving challenges related to implementing effective voice user interface interactive applications.

project code: https://github.com/PavAI-Research/pavai-c3po

Thanks again for reading… have a nice day! You deserve all the best: may all your wishes come true in 2024!

HAPPY NEW YEAR!!!


Minyang Chen

Enthusiastic about AI, Cloud, Big Data, and Software Engineering. Sharing insights from my own experiences.