Part-1: Pavai.Vocei(C3P0) re-imagines C-3PO’s TranLang III communication module using AI in 2024

Minyang Chen
10 min read · Mar 6, 2024

--

Backstory

“I am C-3PO, human-cyborg relations.” That is C-3PO’s self-introduction in the Star Wars movies, where he often comes across as nothing more than comic relief. Sometimes people may even be irritated by C-3PO’s statements that make no sense, because he seems to be nothing more than a robotic translator.

However, according to the originator, C-3PO really stands for the Commercial Crew and Cargo (3 C’s) Program Office, a droid for assisting communication.

C-3PO appearances in the Star Wars movies

Upon revisiting the films, it becomes evident that C-3PO serves a more profound role than simply being a bumbling translator sidekick. In fact, he functions as a vital support operative, whose primary objective is to preserve the unity of the group and facilitate its members’ interactions in an effective manner.

C-3PO essentially assumes the covert role of a “silent mentor” by meticulously observing those around him and utilising psychological techniques to deliver precisely timed comments that delicately sway their emotions, often at the cost of his own dignity.

Through these strategic interventions, C-3PO successfully navigates the group towards its intended goals while maintaining seamless communication among its varied personalities. He accomplishes the objective at the expense of making fun of himself. Is this true AGI?

The Goal

The short answer is to have some fun re-imagining the future of robotics and droids with a sci-fi prototype.

The longer answer is that C-3PO was played by a real person in the movies (kudos to Anthony Daniels), yet it was such forward-thinking imagination for its time.

Fascinated by C-3PO’s technical capabilities, I have always wondered how to implement C-3PO with the technologies on hand today.

C-3PO’s language capabilities have been one of my primary sources of inspiration, along with voice assistants such as Amazon Alexa, Google Assistant, Apple’s Siri, and Microsoft’s Cortana, as well as the latest developments in AI. I believe there are opportunities to refine and update the original C-3PO implementation.

My research and experiments will concentrate on several AI areas: how to automatically convert voice to text, how to generate human-like voice synthesis from text, how to perform speech-to-speech translation in real time, plus memorization, storage, and thinking skillsets.

These AI capabilities have made significant advancements and hold great potential for numerous real-world applications.

More importantly, computing costs have dropped dramatically in the last few years. With the exponential growth in consumer-grade PC computing capabilities, we can now run all of these models on a single home PC. Building your own private assistant at home or in the office is a true reality.

C-3PO Key Capabilities

In the Star Wars universe, C-3PO was designed as a protocol droid, equipped to aid in matters of etiquette, cultural norms, and language translation. This makes integration with human interaction much easier and gives a sense of hybrid intelligence where human and machine work together.

With the ability to communicate in over six million forms of communication, C-3PO serves as a robotic diplomat and translator across the vast and varied cultures of Lucas’ imagined galaxy. He is prone to anxiety; he talks a great deal; he makes mistakes; and he is loyal to his friends. See below:

  • Built-in TranLang III communication module
  • Real-time automatic speech recognition (ASR) in offline mode
  • Real-time text-to-speech synthesis with human-like voices
  • Real-time speech-to-speech translation
  • Self-awareness of the risks and dangers of the world (call it anxiety, or risk management?)

Ready? Let’s discuss the implementation approach taken to realize C-3PO’s capabilities by exploring open-source AI technologies in 2024, which can run all-in-one locally on a single home PC or distributed across a private network.

Challenges

Building C-3PO’s capabilities is a big undertaking because it involves many AI and ML models, and many challenges:

  • Voice transcription accuracy in a noisy environment, where certain words may be treated as noise
  • Visual capabilities, where the droid learns about risks in the physical world
  • Memory storage and processing in humans is quite different from robots
  • Generative AI (LLMs) is quite restricted because it is limited to text semantics only, not other modalities yet

To limit the scope of the experiment, I will exclude visual understanding for now.
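As a concrete illustration of the noisy-environment challenge above, here is a toy energy-based voice-activity gate. Real systems use trained VAD models; the threshold and function names here are purely illustrative and not taken from the project:

```python
import math

# Toy energy-based voice-activity gate: frames whose RMS energy falls
# below a threshold are treated as background noise and dropped before
# they ever reach the ASR model. The 0.05 threshold is arbitrary.

def rms(frame: list[float]) -> float:
    # Root-mean-square energy of one audio frame (samples in [-1, 1]).
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame: list[float], threshold: float = 0.05) -> bool:
    return rms(frame) > threshold
```

A trained model would also look at spectral shape, not just energy, which is why loud noise can still fool a gate this simple.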

Experimental Build

To make things easier to read and explain, I have broken the writing down into three parts, as follows:

Part-1: Pavai.Vocei (C3P0)
- Concept and design for user interaction
- Automatic detection of voice activity and speech recognition
- A stylish voice synthesizer modeled after human-like qualities, using reference audio
- Real-time speech-to-speech multilingual translation
- Speech-to-image generation aimed at realism
- Analysis and protection of PII (Personally Identifiable Information) data in inputs and outputs
- A grammatically correct synthesizer that does not change the context
- Integration of an LLM (large language model) with various personas and tones.

Part-2: Pavai.Talkie (C3P0)
- Concept and design for a hands-free system
- Voice activity detection and automatic voice recognition
- Communication dialogs and multiple voices
- Startup self-diagnostics with an integrated system voice (Jane)
- User-selectable favorite reference voice (Mark).

Part-3: Pavai.Workspace
- Concept and design focused on work and productivity
- Enhance chatbot discussions with web data such as web searches, news, YouTube videos, and links
- Make lengthy texts easier to read with a long-text summarizer
- Dispatch web searches to agents or automated processes
- Use a speech-to-speech translator to overcome language barriers during meetings.

PART-1: Pavai.Vocei

This application provides a user-friendly front end for tasks that require a visual interface.

Voice Communication

Inter-human communication

Humans use voice for two main types of inter-human communication: speech and singing. Both consist of strings of carefully ordered and timed speech sounds, or phonemes. There are two main types of phonemes: voiced and unvoiced sounds. Speech is also composed of five elements: diction, detail, imagery, syntax, and tone.

Human-to-machine communication

Human voice input needs to be transcribed into text and then translated into machine language for processing; the result is then converted back into text and finally into a synthesized human voice.

Oral Communication Challenges

Communication by spoken language imposes additional challenges due to language barriers and cultural differences. While there are only 195 countries in the world, there are officially 7,139 known languages according to the Ethnologue guide, and the top 10 most spoken first languages in the world are Mandarin Chinese, Spanish, English, Hindi, Portuguese, Bengali, Russian, Japanese, Cantonese Chinese, and Vietnamese.

Concept and Design

Exploring the use of dynamic AI models to implement C-3PO’s capabilities, I crafted a blueprint for the system and its components, and put Vocei and Talkie in a single code repository.

See overview diagram below:

C3PO Architecture and Design Diagram

Since spoken words represent the primary means of communication, converting them into written text via speech-to-text (also called automatic speech recognition) is necessary before feeding the transcription into appropriate models for processing. Subsequently, the output must be transformed back into written text and finally rendered into voice using a voice synthesizer — all executed in real time.
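The loop described above can be sketched as a chain of three stages. The stubs below merely stand in for real ASR, LLM, and TTS models; none of the function names come from the project’s actual API:

```python
# Minimal sketch of the speech round-trip: audio -> text -> response -> audio.
# Each stub stands in for a real model (ASR, LLM/translator, TTS).

def transcribe(audio: bytes) -> str:
    # A real ASR model would decode a waveform here; this stub
    # pretends the "audio" bytes already carry the spoken text.
    return audio.decode("utf-8")

def respond(text: str) -> str:
    # Placeholder for the LLM or translation step.
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    # A real TTS model would render a waveform here.
    return text.encode("utf-8")

def voice_round_trip(audio: bytes) -> bytes:
    # The whole chain must complete fast enough to feel real-time.
    return synthesize(respond(transcribe(audio)))
```

The interesting engineering is in keeping the end-to-end latency of this chain low enough for conversation.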

Pavai.Vocei (C3P0)

With a user interface, we can more carefully record voice prompt inputs step by step, and test and validate inputs and outputs, so we can build confidence that the system’s behavior is predictable.

The primary functionality to be implemented in Vocei (C3P0) is the TranLang III communication module, where users can perform various types of tasks using voice as the primary input, transcribed into instructions and converted into another spoken language in real time.

What can I do in Vocei (C3PO)?

  1. You can use your own voice as a prompt to the LLM by simply recording the question you want to ask, then get back a human-like voice response
  2. You can see an alert when PII data is detected in the input or output
  3. You can enable strong data protection by enabling anonymization
  4. You can select the AI assistant’s persona, tone, and voice reference
  5. You can use speech-to-text to generate an image
  6. You can enable grammar correction without changing the context
  7. You can select a favorite reference voice to respond to your questions
  8. You can communicate seamlessly with real-time speech-to-speech translation
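As a rough illustration of the PII alerting and anonymization features (items 2 and 3), here is a toy regex-based detector. The article does not name the analyzer the app actually uses, and real PII detection needs far more than two patterns; this only shows the shape of the feature:

```python
import re

# Toy PII detector/anonymizer: flags emails and US-style phone numbers.
# Real deployments use a dedicated PII analyzer with many entity types.

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def detect_pii(text: str) -> list[str]:
    # Return the kinds of PII found, e.g. ["email"].
    return [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]

def anonymize(text: str) -> str:
    # Replace each match with a placeholder like <EMAIL> or <PHONE>.
    for kind, pat in PII_PATTERNS.items():
        text = pat.sub(f"<{kind.upper()}>", text)
    return text
```

In the app, `detect_pii` would drive the warning banner and `anonymize` the optional strong-protection mode.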

A picture is worth a thousand words; here are some screenshots.

Vocei — Voice Input for user query

Note: PII detection warning is shown at the bottom.

real-time speech to speech translator
Voice prompt generated image — C3PO work at office
Voice prompt generated image — Year of Dragon 2024

Quick Start Guide (locally)

As you can see, many models are integrated into this solution, so the first-time setup will involve downloading a lot of models to your local machine. This step may take some time because some of the models are big, so be patient.

I created a first_time_setup.sh step to ensure files are downloaded before running the app. Also, on startup the app performs self-diagnostics on resources and functionality, then generates a report at the end. Of course, it speaks the status out loud with a voice, just like how a droid should work :-)
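The one-time download idea can be sketched as an idempotent fetch loop: skip any model file that already exists locally, so re-running the script costs nothing. The model name and URL below are placeholders, not the project’s actual files:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder manifest; the real script downloads many large model files.
MODELS = {
    "asr.bin": "https://example.com/models/asr.bin",
}

def ensure_models(target_dir: str = "models") -> list[str]:
    """Download any missing model files; return the names fetched."""
    out = Path(target_dir)
    out.mkdir(exist_ok=True)
    downloaded = []
    for name, url in MODELS.items():
        dest = out / name
        if not dest.exists():          # one-time only: skip existing files
            urlretrieve(url, str(dest))  # may take a while for large files
            downloaded.append(name)
    return downloaded
```

Because the check is by file existence, an interrupted run can simply be restarted.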

Step-1. Setup the prerequisites (Python and Poetry)

Step-2. Clone the repository code
$ git clone https://github.com/PavAI-Research/pavai-c3po.git

Step-3. $poetry shell
Step-4. $poetry install

Step-5. Install llama-cpp-python optimized for your hardware (defaults to CPU)

Example for an Nvidia GPU card:
$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" poetry run pip install llama-cpp-python==0.2.27 --force-reinstall --no-cache-dir

see here for more details: https://github.com/abetlen/llama-cpp-python

Step-6. $ bash first_time_downloads.sh  ## download all the models locally (ONE TIME ONLY)

Step-7. Run the Vocei application (choose one):
$ ./vocie_gradioapp.sh
or
$ ./vocie_fastapp.sh  ## seems to run faster than Gradio

To view the UI: http://localhost:7860

Note: a system voice will speak the instructions aloud too.

Special Notes:

Please ensure the llama-cpp installation is correct; if you find the system is slow, it could be running on the CPU only and not using the GPU.

So far, the setup has only been validated on a Linux system with an Nvidia GPU.

Deployment Options and Considerations

The entire voice application can run on a single PC with all models operating locally, or in distributed mode with large language model (LLM) providers such as Ollama or llama-cpp-python’s OpenAI-compatible server. Other frameworks comparable to the OpenAI API, such as vLLM, should work as well, although they have not been tested yet.
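To give a flavor of what “OpenAI-compatible” means here, this sketch builds a request payload for the standard `/v1/chat/completions` endpoint that llama-cpp-python’s server, Ollama’s OpenAI shim, and vLLM all expose. The base URL and model name below are placeholders, not the project’s configuration:

```python
import json

# Build (url, body) for an OpenAI-compatible chat completion request.
# No network call is made here; any HTTP client can POST the result
# with a Content-Type of application/json.

def chat_request(base_url: str, model: str, user_text: str) -> tuple[str, bytes]:
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }).encode("utf-8")
    return url, body
```

Because every listed backend speaks this same wire format, switching providers is mostly a matter of changing the base URL.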

The local LLM defaults to the Zephyr quantized 4-bit model due to its multilingual capabilities. However, this can be replaced with any preferred model.

The speech-to-speech translation model utilizes a larger model for improved accuracy, with a size ranging from 18 to 30 GB. Ideally, the total VRAM requirement to run all of these models is 16 GB or more. However, it is possible to move some models to the CPU, which should allow operation in low-VRAM configurations, although further optimization is needed.
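One way to think about the low-VRAM option is a simple placement heuristic: fill the GPU with the largest models first and spill the rest to the CPU. The model names and sizes below are illustrative only, not the project’s actual figures:

```python
# Greedy device-placement heuristic for a fixed VRAM budget:
# biggest models claim GPU memory first, the rest fall back to CPU.

def place_models(models: dict[str, float], vram_gb: float) -> dict[str, str]:
    placement = {}
    free = vram_gb
    for name, size in sorted(models.items(), key=lambda kv: -kv[1]):
        if size <= free:
            placement[name] = "gpu"
            free -= size
        else:
            placement[name] = "cpu"  # spill: slower, but it still runs
    return placement
```

A real scheduler would also weigh how latency-sensitive each model is, since a slow TTS step hurts more than a slow one-time load.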

The reference voices used for the human-like voice synthesizer are those of Star Wars fan-favorite characters, used for demonstration purposes. These can be replaced with custom voices by recording 15 to 30 seconds of speech in a specific tone, placing it in the reference audio folder, and updating the configuration file.

Voice Synthesizer toward human-like

It’s important to discuss the real-time implementation of voice generation. It can convey emotions like happiness, anger, and surprise, as humans do, and sound less robotic. For example, we pause when speaking long sentences and adjust our pitch as we speak.

The implementation here attempts to simulate these effects for a more engaging experience when talking to the system. However, one side effect of generating this in real time is increased hardware requirements, but I think it is still worth it for the human-touch experience.
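One cheap piece of the pacing effect can be approximated outside the model: split long text at punctuation so the synthesizer gets natural pause points between chunks. Real prosody control lives inside the TTS model itself; this is only a toy pre-processing step:

```python
import re

# Split text at sentence/clause punctuation so each chunk can be
# synthesized separately, with a short pause inserted between chunks.

def pace_text(text: str) -> list[str]:
    # Lookbehind keeps the punctuation attached to its chunk.
    chunks = re.split(r"(?<=[.!?,;])\s+", text.strip())
    return [c for c in chunks if c]
```

Chunking also helps latency: the first chunk can start playing while later chunks are still being synthesized.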

Speech-to-Speech Translation

Real-time voice translation is one of the most useful advancements in today’s AI world, in my opinion. This is because we live on a planet with many spoken languages, and it is not difficult to come across someone who does not speak your language, especially when traveling for leisure or business meetings. Therefore, having the ability to do real-time voice translation can create a seamless communication experience. Furthermore, running this all on your local PC ensures privacy and safety.

What are the challenges and performance considerations?

When implementing voice generation in real-time, it is important to consider the ability to convey emotions such as happiness, anger, and surprise, as this can make the speech sound more human-like.

For example, humans naturally pause during long sentences and adjust their pitch while speaking. The implementation attempt here is able to simulate these effects for a more engaging experience when using the system. However, one drawback of generating this in real-time is the increased hardware requirements. Despite this, I believe it is still worth it for the more human-like touch it provides.

What is next?

If you like my adventure and are excited to learn more, explore the code here and have some fun with it.

Github Repo: https://github.com/PavAI-Research/pavai-c3po.git

Friendly reminder:

Please keep in mind this is an experimental alpha release with limited testing on Linux and Nvidia GPUs. If you have instructions for a tested setup on another system, please let me know so I can update the instructions.

Suggestions and help to make this a reality are welcome.

“May the AI force be with you”

Still interested? Continue reading the other parts.

Part-1: Pavai.Vocei (C3P0)

Part-2: Pavai.Talkie (C3P0)

Part-3: Pavai.Workspace

Credits

Thanks to all the incredible open-source community projects in the AI development ecosystem that made this dream work possible.

--


Minyang Chen

Enthusiastic in AI, Cloud, Big Data and Software Engineering. Sharing insights from my own experiences.