Three weeks ago, I travelled to Sydney for a few days. One of my best mates was moving to Ireland, so it was worth the trip just to see him off. Whilst up there, I also caught up with my friend John Hatem, who was in my graduate program cohort at Hewlett Packard Enterprise (HPE) and has recently become one of their local government account managers. He invited me to work for the day in the HPE Sydney office, where, coincidentally, there was a sponsored lunch. It turns out I had walked in on “Blitz Day”, in which the whole sales team goes through customer sheets and makes as many calls as they can. They made a total of 110 calls that day.
The average conversion rate for “cold” calling, or the “call success rate”, is around 16%. For most enterprise sales reps, generating at least 20 qualified leads a week will ensure there is enough in the pipeline when it comes time to close out a quarter. At a 16% success rate, that means a sales representative needs to make about 125 calls a week (20 ÷ 0.16), or 25 calls a day, just to fulfil their lead quota: roughly one every 20 minutes on a standard working day.
This doesn’t leave terribly much time to engage in the primary reason organisations have a salesforce: to sell. To engage with audiences rather than just find them. To navigate every step of a long sales cycle, to truly understand a customer’s needs and requirements, to engage with them on a personal level. You know, human interaction.
Observing the Blitz Day got me thinking: with voice replication tools and ChatGPT wrappers readily available as inferencing services, what was stopping a script from replicating “John” and making multiple calls at once? This script (we’ll call it a “bot”) wouldn’t need to engage in long conversation. All a bot would need to do is organise a follow-up call or meeting if there happened to be interest in a product or service. I figured I could make a web app that imports call sheets, makes a couple of quick GPT or Google API calls to build further context about the customer and/or their industry, places an outbound call, generates responses using GPT in a voice that sounds like the rep, containerises each call instance and then… well… makes 10,000 cold calls in 10 minutes.
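To make that concrete, here is a minimal sketch of the text-and-voice side of one call turn. It assumes the OpenAI chat completions and ElevenLabs text-to-speech REST endpoints; the keys, the voice ID and the entire telephony layer are placeholders, and this is illustrative only, not the actual hatem.ai code.

```python
# A rough sketch of one bot turn: generate a reply as "John" with GPT, then
# synthesise it in his cloned voice with ElevenLabs. All keys and IDs are
# placeholders; the telephony side (dialling, streaming audio) is omitted.
import requests

OPENAI_API_KEY = "sk-..."    # placeholder
ELEVEN_API_KEY = "..."       # placeholder
VOICE_ID = "john-voice-id"   # hypothetical ID of the cloned voice

def generate_reply(personality_prompt: str, history: list) -> str:
    """Ask GPT for the bot's next line, conditioned on the Personality Prompt."""
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "system", "content": personality_prompt}] + history,
        },
        timeout=30,
    )
    return resp.json()["choices"][0]["message"]["content"]

def synthesise(text: str) -> bytes:
    """Render the reply as audio in the cloned voice."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVEN_API_KEY},
        json={"text": text},
        timeout=30,
    )
    return resp.content  # audio bytes, ready to play down the line
```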
I ran this idea past my dear friend John, as I have always done, except this time he caught me off guard. In a jokingly angry tone, he said: “you always tell me you could do these things and then don’t. Why not just fucking do it?!”
He was right. Why don’t I? After all, indulging in these kinds of projects was why I was taking this time off work.
The next day, I asked John to send over two voice messages to train a voice model to sound like him.
After just a few days, I had a working prototype. It sounded near-identical to his voice, albeit a little robotic in its pronunciation. In its responses, it displayed all the delightful randomness we’ve seen of GPT over the years (asking about dietary requirements for a meeting? Considerate, but redundant). Some initial prompt engineering would be required to ensure it didn’t go off on tangents. For the most part, however, the theory had been proven. We had a working call bot.
As usual, my brain went wild with all the possible future features. Spreadsheet importing to autofill contacts and make search engine queries to better engineer the Personality Prompt for the bot. A LinkedIn finder and scraper. A profile could be built on each person, providing data far more useful than any CRM out there. Hell, we could pull data from a CRM to throw into the mix. I realised it could be an incredibly powerful cold-calling sales tool even without the auto-calling aspect. Mass auto-calling and containerised deployment could come next, with a dashboard that monitors the stage of every call and gives the user complete control over how many are made simultaneously. It is, however, easy to get swept up in exciting could-be V3 features. It was time to build the first step towards a V1.
Over the next few weeks, I worked the prototype script into a more sophisticated and stable Flask-based web app that generates faster responses. I wrote a setup screen that helps someone easily train their own voice model in minutes, and made a nice dashboard UI in which both parties’ speech is represented in text as animated iPhone-style messages.
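As a sketch of the shape the app took (the route name and the stubbed reply function below are hypothetical, not the real hatem.ai endpoints):

```python
# Minimal sketch of the dashboard's server side: the front end posts the
# caller's latest words and renders the returned reply as a message bubble.
# The route and the stubbed generate_reply() are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)
conversation = []  # running transcript of the call

def generate_reply(history):
    """Stub standing in for the GPT call sketched earlier."""
    return "No worries at all. Would a quick call on Thursday suit?"

@app.route("/reply", methods=["POST"])
def reply():
    user_text = request.json["text"]
    conversation.append({"role": "user", "content": user_text})
    bot_text = generate_reply(conversation)
    conversation.append({"role": "assistant", "content": bot_text})
    return jsonify({"text": bot_text})  # dashboard renders this as a bubble
```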

Within the Settings pane, a user could easily make or retrain their own voice model, simply by reading a few pieces of text. The “Personality Prompt” would dictate how the call was to be handled and navigated by the bot.
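Under the hood, that setup step maps fairly directly onto ElevenLabs’ voice-add endpoint. A sketch, with the key and sample file paths as placeholders:

```python
# Sketch of the Settings pane's backend step: upload the recorded samples to
# create (or recreate) a cloned voice. Key and sample paths are placeholders.
import requests

ELEVEN_API_KEY = "..."  # placeholder

def clone_voice(name: str, sample_paths: list) -> str:
    """Build a voice model from a few short recordings; returns its voice ID."""
    files = [("files", open(path, "rb")) for path in sample_paths]
    resp = requests.post(
        "https://api.elevenlabs.io/v1/voices/add",
        headers={"xi-api-key": ELEVEN_API_KEY},
        data={"name": name},
        files=files,
        timeout=60,
    )
    return resp.json()["voice_id"]  # used for every later text-to-speech call
```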

Developing hatem.ai taught me a lot about, and gave me far more of an appreciation for, “Prompt Engineering”, which I’ve noticed several major companies are now hiring for to get the most relatable and useful answers out of large language models. I spent a collective six hours working the prompt back and forth, often asking the model itself how I could modify the Personality Prompt to best engage in the conversation. This process also required taking conversations I had with hatem.ai and getting John’s feedback on them (good ol’ fashioned manual reinforcement learning). His input often had the most influence on the changes I made to the Personality Prompt. This led me to realise that in order to make hatem.ai great, I would need to build and integrate a tool that could construct an “optimal” Personality Prompt based on variables given by the user and/or automatically found about the customer, as discussed above. Better yet, it could draw on a body of conversational text, either from previous phone calls or written by experienced salespeople based on what they know. The more specific conversational data available, be it for enterprise sales or literally any other use case for hatem.ai, the better the Personality Prompt.
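A first pass at that tool could be as simple as templating the known variables into a system prompt. The field names and example values below are hypothetical, purely to show the idea:

```python
# Naive first pass at automated Personality Prompt construction: template the
# variables the user supplies (or that are scraped about the customer) into a
# system prompt. All field names and example values are hypothetical.
def build_personality_prompt(rep_name, company, product, customer_notes, goal):
    return (
        f"You are {rep_name}, a sales representative at {company}, "
        f"calling about {product}. "
        f"What you know about the customer: {customer_notes}. "
        f"Your only goal is to {goal}. "
        "Keep replies short and conversational, as on a phone call. "
        "Do not go off on tangents; if the customer is not interested, "
        "thank them politely and end the call."
    )

prompt = build_personality_prompt(
    rep_name="John",
    company="HPE",
    product="managed infrastructure services",
    customer_notes="local council IT manager, storage contract up for renewal",
    goal="book a 15-minute follow-up call if there is any interest",
)
```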
The roadmap for hatem.ai was starting to take shape. Currently, the average delay between a user finishing their input and hatem.ai verbally responding is about 3-5 seconds, a good 80%+ of which is spent generating the vocal response from ElevenLabs. A few V1 features would need to be polished, including making the responses faster and the speech recognition more stable. Using the Whisper API instead of Google’s recogniser (via the SpeechRecognition library) should yield faster and more accurate transcription of user input to the bot. Some amazing work by Georgi Gerganov (whisper.cpp and llama.cpp) could make the bot’s responses even faster by running local instances of LLaMA and Whisper on my own server; I will play with them just to prove the concept. Another set of V1 features must include summarising calls (an easy final prompt back to GPT), importing contacts and some basic metrics tracking. V2 could see invites automatically sent out with the call summaries, based on a rep’s calendar availability (Outlook and Google Calendar APIs first), but that would mostly depend on the V1 feedback.
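The hosted endpoints make both the Whisper swap and the call summaries nearly one-liners. A sketch, with the key and file path as placeholders (the summary reuses the same chat-completions pattern as the reply generation earlier):

```python
# Sketch of two planned V1 features: transcribing caller audio with the hosted
# Whisper API, and the "easy final prompt back to GPT" that summarises a call.
# The API key and file path are placeholders.
import requests

OPENAI_API_KEY = "sk-..."  # placeholder

def transcribe(wav_path: str) -> str:
    """Transcribe one chunk of caller audio with the Whisper API."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            files={"file": f},
            data={"model": "whisper-1"},
            timeout=30,
        )
    return resp.json()["text"]

def summarise_call(transcript: str) -> str:
    """One final prompt back to GPT to condense the call into a summary."""
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{
                "role": "user",
                "content": "Summarise this sales call in three bullet points, "
                           "including any agreed next steps:\n" + transcript,
            }],
        },
        timeout=30,
    )
    return resp.json()["choices"][0]["message"]["content"]
```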

By this stage, I’m thinking: the applicable possibilities are endless. Charities. Sales. Level 1 post-sales support. Verbal conversations in the car about any desired topic. Even just simple, verbal company. It made me realise that there is very little that does not stand to change as a result of the tools now available to us all. What stone tools once were to man, large pre-trained models are today to us avid technologists. We can make literally anything with them. It’s a very exciting time for building great things, and I’m thrilled to be in it. In making hatem.ai, I aimed to make something that reflected this excitement.
That said, whilst developing this app, I found myself deeply pondering the negative repercussions of what an auto-calling or conversational bot such as this could do. I even thought about not posting my progress on this app and quietly adding it to my project portfolio. I asked several people, including mentors and friends, what they thought of this technology being made available to people. Most of them referred to the movie Her. Others just commented that this was “coming anyway”. My curiosity about others’ thoughts, and my hesitation around this app, came from visiting my grandmother for lunch just a day after I had written the first prototype. During that lunch, she was called twice by two different charities she already contributed to, asking for further donations. I got a call from a gym I hadn’t used since 2016, asking if I had any interest in coming back.
It suddenly dawned on me that with not too much further development, hatem.ai could mean we all get hundreds, even thousands, of calls a day. No number on a leaked data list would be safe. The current scam calls would become far less robotic: they would not just sound human, but interact like humans too.
It is for this reason I made the decision not to make the code open source, however straightforward it might be for some to replicate.
Should hatem.ai ever be commercially developed and marketed, it will have to be done with serious precision and ethical caution, treading the fine line between useful enterprise software and making customers sceptical every time they pick up the phone.
As long as I live in a world where I’m not called during my family lunch, I’m happy.