Justin Uberti: The Future of Voice Interaction
June 12, 2024 · 49:02


Justin Uberti, creator of WebRTC and now Founder of Fixie.ai, shares insights into the development of AI.town, a platform for engaging with AI personalities through voice, and the potential impact of conversational AI on various industries.

Please support this podcast on Patreon! http://www.patreon.com/aiinsideshow

INTERVIEW TOPICS

- Introduction to Justin's background (WebRTC, Hangouts Video, Duo, Stadia)

- The shift towards conversational AI and voice interactions

- Fixie.ai and AI.town - enabling voice conversations with AI characters

- Transitioning from text-based to voice-based AI interactions, potential use cases

- Creating AI characters, enabling role-play conversations

- Ethical considerations and voice cloning technology

- The nature of human conversation (filler words, turn-taking protocols)

- Incorporating human conversational quirks into AI speech

- V1 vs. V2 voice technology (speech recognition → text → speech vs. direct speech-to-speech)

- Open-source speech AI model Ultravox.ai, leaderboard for fastest AI models (thefastest.ai)


[00:00:00] Everybody knows backup and disaster recovery are huge concerns for IT professionals, but what if your current solution isn't providing the protection you think it is? On June 20th, Cove Data Protection from N-able is partnering with Microsoft to show you

[00:00:14] the ways traditional backup and disaster recovery are leaving you vulnerable to ransomware and how a more modern cloud-first approach can close those gaps. This webinar will change the way you see disaster recovery so don't miss it. Visit gocove.com slash Microsoft to register.

[00:00:33] This is AI Inside Episode 21 for Wednesday, June 12th, 2024: The Future of Voice Interaction with Justin Uberti. This episode of AI Inside is made possible by our wonderful patrons at patreon.com slash aiinsideshow.

[00:00:49] If you like what you hear head on over and support us directly and thank you for making independent podcasting possible. Hello and welcome to another episode of AI Inside, the show where we take a look at the AI hiding inside of everything.

[00:01:09] Sometimes the AI that we're talking to, so, you know, the AI hiding inside the voice on the other end of the line. We're going to talk a little bit about that in today's show. I'm Jason Howell, joined as always by my co-host Jeff Jarvis. Hello sir.

[00:01:21] Hello, and actually right now Jason is somewhere in Italy enjoying wonderful food, I hope. I have been enjoying wonderful food, actually. So, okay, if it's 11 a.m. Pacific, what time is it in Italy? I could be sleeping right now.

[00:01:37] Yeah it's the clock is not my friend at this point. The time zones are so completely different but I am enjoying the first part of my trip so thank you for mentioning that Jeff. This week in the next two episodes no live show if that's your preference

[00:01:52] but we will have episodes we have pre-records with some really interesting people doing really interesting things in the world of AI as is the case today but before we get started just want to throw a big thank you to those of you

[00:02:05] who support us on Patreon, because that support drives the success of this show. Patreon.com slash aiinsideshow. Brett, and one of our newest patrons, Charles Gilagli. Gilagli, I'm sorry if I mispronounced it, but I think you're awesome nonetheless, and thank you for supporting us.

[00:02:24] Patreon.com slash AI inside show like I say every time we couldn't do the show without you. Alright so the last handful of weeks I mean this has been a topic that's come up

[00:02:37] on the show many times regardless of the last handful of weeks but I feel in the last you know the previous month or so we've seen a lot of examples that have really highlighted or a lot of news that really highlighted examples

[00:02:49] of how AI is becoming much more conversant from a voice communication perspective. You know, we see a lot of these multimodal systems like Project Astra that Google showed off at Google I/O, ChatGPT's GPT-4o, you know, all

[00:03:06] these things seem to really lean into voice conversation with AI as a pretty critical point of the experience and so we've you know certainly tackled that topic on the show many times over the last handful of months

[00:03:21] and we have the opportunity to bring on someone who is working closely with a lot of this technology. Not only that, he has a pretty cool resume, and I've been following Justin's work for quite a while. Justin Uberti is the co-founder

[00:03:34] and CTO of Fixie. Justin, it's really great to have you here; thank you for joining us today. No, it should be fun. Yeah, this is gonna be great. I just want to kind of set the stage a

[00:03:44] little bit because I started following you I think on Twitter possibly even Google Plus because that was what Google did back then when you were at Google I know right when Google thought everything needs to have Google Plus tendrils in it.

[00:04:01] You founded WebRTC while at Google which is a really big accomplishment you drove the development of the products like Hangouts Video, Duo, Stadia so yeah I've been following your work and reading you on Twitter

[00:04:17] and in all the other places for quite a while, so it's a real treat to get you here today; appreciate your time. Thanks. And I should also mention Clubhouse: you were at Clubhouse for a while

[00:04:28] before co-founding Fixie.ai and becoming the CTO of the company as well so that's what we're going to talk about today. So given all of that your work at Fixie has you focused on a number

[00:04:43] of things they are related but I think the thing that really caught my attention was the voice enabled AI aspect of what you're doing we'll talk a little bit about AI.town in a moment really feels like this

[00:04:56] moment in artificial intelligence we're on the cusp of some very interesting advancements in voice AI things are getting a lot more human when you when we can see these examples of talking to AI with our voice

[00:05:11] it used to be like oh yeah you can do that but there's evidence that you know it's still a machine and I would say that there's still evidence but there's a lot less evidence of that now so you're keeping yourself

[00:05:24] at the cutting edge through your career and you're doing this now tell us a little bit about how your previous experience Google and everything prior and after has led you to the work that you're doing with Fixie and with AI.town.

[00:05:42] Yeah thanks for the introduction and I think that's actually a pretty good setup of I've been sort of very like just interested in this area for a long time of the notion that you know hey something's different about conversing with audio and video then just over text and

[00:06:01] I worked on AOL Instant Messenger, and it was the first thing I did when I joined the industry, and I thought that was magical, the fact that you could just send a message and it would instantly appear somewhere else

[00:06:12] across the world, like, immediately. But over time, the notion that we could then add audio and video to that mix, and just the additional sort of weight that that media carried... I think what we're seeing right now

[00:06:28] with apps like Instagram and TikTok, they're all going kind of video-centric, because it's just a much more salient medium. There's just something a little bit more that hits a little different, as we liked to say at Clubhouse, about things like voice that are different

[00:06:45] than just reading. So chat rooms have been around for a long time, and at Clubhouse we saw just the impact of people being able to sit around and almost shoot the bull, to just talk, and just the value that provided of just hearing a familiar voice

[00:07:06] and just hearing people laugh as you talk and these sort of things and I think that's just something that's not present in a text-only communication. So describe what you're doing at Fixie. So you know our thesis then is sort of the same that

[00:07:23] when we look at ChatGPT, an amazing accomplishment, you know, the ability to type in really almost anything and get out a very human-sounding response, but it still sort of feels in many ways like you have a new command line

[00:07:39] and you type in text, you get out text. And when I talk to people who are sort of outside the AI bubble, outside the industry, about whether they're using ChatGPT, oftentimes the answer I get back is, I don't know what to type.

[00:07:51] So common oh yeah I've heard that so many times and actually I felt that too when I was first stepping into AI it was like I know this is powerful but I have no idea what I'm going to use it for

[00:08:02] and all it took was for me to continue to almost almost force myself to turn to it first for things that would bubble up and then it started become really clear to me. Yeah right but I think that it's just a little bit alien

[00:08:18] I mean first of all talking to this AI but just the overall interaction mechanism it's different than the way most people prefer to interact and I think that's the thing that we saw such on duo, on Google Meet that when people want to have

[00:08:33] an important conversation when convenience isn't top of mind the modality they want is either in person or voice and video and you know just the impact of the pandemic and seeing you know the use of meet and similar tools to show like the value

[00:08:49] of live synchronous communication. And so it's no surprise then to think, when we talk to AIs, shouldn't we expect the same sort of thing? Shouldn't we expect, you know, the modality that humans are born with, the ability to speak; everybody can speak,

[00:09:05] almost and it's just like a much richer way of interacting there's not just the textual content of what you're saying but all these you know things of tone of voice and like things that are sort of you're familiar you know what is your voice sound like

[00:09:21] can you be known just from the first word you say, who it is, because of that unique timbre that your voice has? And, you know, in the speech domain we call these paralinguistic cues, and I think it just leads to a thing where the voice just

[00:09:37] almost like it's processed at a different level in your mind it sort of hits a little differently and it's much more familiar it's much more engaging and I think that is a cusp of this you know basically time when AIs now

[00:09:49] will have the ability to really interact you know using speech and I think eventually embodiment and vision and so at Fixie you know we're basically creating you know some of the tools to kind of really empower this sort of technology for AIs to really interact

[00:10:05] you know just as naturally with voice as talking to another human and we're trying to figure out you know can AIs tell jokes can they be funny can they be witty can they be zany and AI Town is a place where

[00:10:29] you can kind of interact with these AIs and have these experiences over voice. You know, pardon me for a plug for a moment, I wrote a book called The Gutenberg Parenthesis,

[00:10:29] which is really about the age of print and text as an exception and before society was conversational and then it became dominated by text and I think we're returning to a conversational society now and it's in fits and starts when radio came

[00:10:57] newspapers insisted that the ear was not a good way to learn things, that it had to be through the eye, because of course newspapers made text for the eye. And so I think we're coming, you know, kind of full

[00:10:57] bore but we're out of practice and we don't know how to do this I think as a group and so I wonder whether there are do you think people have to relearn how to have conversations whether it's with themselves or machines

[00:11:14] I think that it's still people have a sense of you know when they interact with their friends they interact with like their family they talk and so you know but to talk to a computer like I think that's a little bit of a chasm to be crossed

[00:11:30] but I think it's not a big one you know we were starting fixie you know I heard from people who basically said Justin I don't think people want to talk to their computer and I said to them look you're talking to your computer two three hours a day

[00:11:46] and they'd say, well, right, but they're saying, you know, I'm not talking to my computer, I'm talking to the other person on the other end of the call, on my Zoom or Teams call or whatever; I'm not talking to my computer.

[00:11:58] and I said well but imagine on the other side it's an AI it's the same sort of thing you could say you're not talking to the computer you're talking to the you know it is having you're having a conversation

[00:12:10] with you're having a conversation with this AI or this intelligence or this thing that's helping you or coaching you or serving as a companion and you know you're going to think about that the same way that

[00:12:22] you do a zoom call in a couple years I think it's just people can they can talk much faster than they can type and there's just much more information communicated than than in like typing out a text message

[00:12:38] yeah what what kind of comes to mind for me as we're talking about all of this in the context of video conferencing you know apps which actually you've had a lot of experience with online video conferencing with with Hangouts video and duo and everything

[00:12:54] is that what I'm starting to see now with AI in chat apps is here's an extra group and we we are humans chatting with humans here's an AI that's also part of the group that brings that extra intelligence

[00:13:10] that that appears like another human in the group that you can integrate and pull in and it almost seems like with voice kind of hitting the point to where we're getting to this place where it really does sound human and is really great at at parsing

[00:13:26] the conversation and potentially being an assistant there to that we're not too far from that AI voice being a part of our conversation like our actual conversations on a group in a group sense online and being essentially another collaborator right there that we can

[00:13:46] use our voice to communicate with I don't think we're very far from that it maybe it's already happening and I just haven't seen it I think that you know absolutely these things are happening and I would say that the you know the voice

[00:13:58] is actually probably coming along a little faster than the assistant part and the point that I'm trying to make here is that we often expect our assistants to go take actions on our behalf

[00:14:10] and I think one of the things that current AI struggles with is the lack of a sort of internal review process or whatever or thinking about you know is this the right thing to be doing and

[00:14:22] as a result, I think we're uncomfortable asking AIs to go take actions on our behalf, especially actions that are irreversible, like sending an email, or even scheduling things or things like that, because

[00:14:34] you always want to review it to make sure, oh, did the AI make a mistake or not. But one thing the AI doesn't tend to make mistakes in is dialogue. These AIs are large language models, and the language part, this is what they're

[00:14:50] trained to do, to really talk and sound like they're a human, because they've been trained on these enormous volumes of text. And so they're very good at conversation, and conversation also has this nice property where it's naturally self-healing: humans mishear things all

[00:15:06] the time in conversation I like that yeah right and like you know the conversation doesn't stop doesn't lead to like the wrong thing happening you say say what or is that true or I didn't catch that or oh I thought

[00:15:18] you meant like there's all these natural ways where a conversation if it starts to go off the rails can be brought back on just by the fact it's a two way exchange of information so let me ask the obvious question which you know I'm going to ask about

[00:15:30] anthropomorphization, especially after GPT-4o and the hubbub around Scarlett Johansson, but really more about the question, and this is from the stochastic parrots paper too, of people being fooled into believing that they're talking to a human. So the goal:

[00:15:54] what's the ethical structure that you want to create around this technology that you're creating I mean our view is that you know this sort of wave is coming but I think that the right way to experiment with it and understand it is in a low stakes environment

[00:16:14] where it's primarily for you know chit chat and entertainment purposes and that's kind of what we've done with our site AI.town you know we're basically putting this environment where you can talk with a number of different AI personalities and you know the conversation

[00:16:30] is the key aspect you know these AIs they'll actually have their own lives they'll make social media posts you can text them you can have a voice call them and you know there we've got a number

[00:16:42] of really great reactions from people interacting with this sort of stuff. You know, in terms of anthropomorphization, do we worry about people ascribing sentience or consciousness to these AIs? Well, no, because it's clearly set up here in an

[00:17:02] entertainment sort of structure, and, you know, I think that's the right way to kind of edge into this technology and see where things go. But you're at the forefront, and I'm not trying to lead you here, honestly, but I think that because you are

[00:17:18] a leader in this and a pioneer in this I'm not one who engages in moral panic I'm not one who gets all worried about it but I am fascinated by and I also think it's too early in many ways to set standards

[00:17:34] because we don't know what the stuff can do. Nonetheless, you're at the early edge of this and you have an opportunity, I think, to define it. I'm hesitating to do this, but I'll do it: what's a bad use of this technology? I think there's some pretty obvious

[00:17:54] immediate harms that I want to point out of you know cloning people's voices and using them to defraud others. I think that's something that is going to be you know something that people will be very conscious across the industry

[00:18:06] and some of the leading sort of speech providers are already doing a lot to try to prevent this watermarking the voices insisting on consent and even forms of consent where you have to make a video and hold up an ID. I think there are real challenges

[00:18:22] around this. I do think we have some legal structures to deal with this but to be honest I think there was a point in time where people sort of felt you could trust something that was like published text

[00:18:34] and say oh if this text I showed it to someone I can believe that that's what this person actually said. That's no longer the case we're also getting to the case where you can't really believe that a photo

[00:18:46] is necessarily a photo of a real event that occurred with generative AI and there are some techniques for watermarking photos and everything but I think that's an area where again most people probably don't believe photos have quite the same

[00:18:58] salience anymore and I think we're probably getting to the same area with voice because the voice cloning is certainly possible. You can't say authoritatively that a recording that sort of sounds like person X is definitely person X and so even stepping away from

[00:19:18] what should we think about people talking to AI rather than other humans, which one could say is really just rediscovering the art of conversation. I think there are some more obvious potential harms, and working in the space, I have thought about this stuff very carefully

[00:19:38] in terms of how might some of these things be used and have put some defenses in already against that. Back to your point of print sorry I'm going to nerd out from my print stuff but when print started

[00:19:50] with movable type, it was not trusted because there was no provenance, and anybody could make a pamphlet, like anybody can make a tweet, and we created institutions and publishing structures to verify authenticity. So I think that's the opportunity here: where do you

[00:20:10] get your AI from? Where do you get your voice from? What's its provenance? What does it know? What does it do? Who brought it to you? Are going to be important questions and those are human questions and those are opportunities.

[00:20:22] Alright we've got a whole lot more coming up but we do need to take a really quick break. Everybody knows backup and disaster recovery are huge concerns for IT professionals but what if your current solution isn't providing

[00:20:36] the protection you think it is? On June 20th, Cove Data Protection from N-able is partnering with Microsoft to show you the ways traditional backup and disaster recovery are leaving you vulnerable to ransomware and how a more modern cloud-first approach can close those gaps. This webinar will change

[00:20:52] the way you see disaster recovery, so don't miss it. Visit gocove.com slash Microsoft to register. The secret to visibly firmer, summer-ready skin is here: Osea's number one best-selling Undaria Algae Body Oil, clinically proven to instantly improve skin

[00:21:08] elasticity and transform dull, dry skin to silky soft and unbelievably glowing. Rich yet never greasy, Undaria Algae Body Oil is formulated with sustainably sourced seaweed to help replenish the skin's moisture barrier and seven nourishing active botanical

[00:21:24] oils for results you can see and feel all over. The best part? Its signature scent: a blend of freshly squeezed grapefruit, cypress, and mango mandarin transports you to sun-kissed summer days. This all-natural scent is unforgettable. Everything Osea makes is clean,

[00:21:40] vegan, cruelty free and climate neutral certified so you never have to choose between your values and your best skin. Get healthy glowing skin for summer with clean vegan skincare from Osea. Get 10% off your first order site wide with code GLOW at Oseamalibu.com Oseamalibu.com code GLOW.

[00:22:05] Now you've mentioned the work that you're doing with AI town which I kind of showed it off. It's got a very colorful approach to it. It's almost like having, well it is, it's having virtual or actual spoken conversations. I guess you can text with some of these characters

[00:22:21] as well. I'm curious to know with these characters and you can actually go in there and you can create your own characters and kind of create a back story for them. It's a really cool kind of way to do this like you said in a low stakes environment

[00:22:37] somewhere where you can play around with the technology and in my playing around with it, it kind of reminded me of not that I have a specific example to draw from when I was younger when I was a kid playing with technology or games,

[00:22:53] much more rudimentary games but there was a little bit of like an oh this is fun, this is interesting, this is unexpected with all the different voices and the characters and everything. What are some of the surprising things that you've seen from users as they begin to

[00:23:13] interact with some of these characters on the site? We first got a taste of this when the first version of AI.town was a holiday-themed thing that we created called HiSanta.ai, and this was our sort of idea of,

[00:23:29] let's take some of the tech we've been building around making really life-like voice interactions with AI and kind of use it to power Santa Claus and his friends. And we sort of put it out and thinking like

[00:23:41] this will be an interesting way to see if the models work, the service works, when you push something like this out. And our demand was way past what we expected. We ended up spending a lot of Christmas time kind of

[00:23:53] just keeping things running, turning up servers and things like that, and talking to our various providers to make sure the stuff didn't fall over. And it sort of tapped into a couple of realizations, one being that people really enjoyed having some of these conversations with fictional characters.

[00:24:13] I think when you think about what are people using it for, there are many people through print, books, tv, movies like there's fandoms, there's characters that people are really excited about and the opportunity to interact with them directly, ask questions about

[00:24:33] their canons and stuff like that is actually something there's real demand for. And it turns out Santa Claus is a really popular fictional character, and so you have kids talking to... did I just spoil the plot there?

[00:24:49] There's kids talking to Santa Claus but then there's adults talking to Santa Claus and this bad Santa character you have up here, a lot of conversations. One of our most popular characters where bad Santa is like

[00:25:01] you can read the description there and you'll find me in the alley behind your local run down mall and the question of can these characters be, we call them townies here in town, can these townies be funny? Can they be witty and a little bit

[00:25:17] like shoot off the cuff remarks and bad Santa definitely does. He can have some pretty awesome roasts, you can see just some of the things there yes exactly. And people have long conversations with him and it's just the kind of thing where we wanted to

[00:25:37] find some things where people could have a wholesome interesting conversation with something to say about personalities and try to find out what are the places where you'd find actual resonance with what people are interested in talking about and it turns out

[00:25:49] that these fictional characters are actually a pretty rich source of enjoyment and interesting conversations. What do you have to do to create the fictional character? How deep do you go into the description to make it work as a character?

[00:26:05] We have a setup flow we've optimized several times over; you can actually do it all through voice, naturally, and it asks you a few questions and kind of helps build out the backstory. But we find that one thing that LLMs

[00:26:17] are really good at is role play. You say you are XYZ and you were born here and this is this and you're interested in this thing and you don't like this other thing and the LLM will just sort of run with it and we talk about hallucinations

[00:26:37] and LLMs as like a bad thing but when you're in this sort of fictional character thing, hallucination is a good thing because it's like oh what's my favorite food? What is my favorite food? Oh my favorite food.

[00:26:51] I love steak tacos or whatever and they'll go and keep going and just at any time just keep inventing new things to sort of fill out their personality or keep the conversation going and the LLM mindful that it has to sort of stay in character

[00:27:09] and not forget things that are already said in the conversation and it makes for a pretty interesting chit chat. Yeah, that seems to point out the fact that what you said about hallucinations, I think hallucinations related to AI it's very common to hear that be a negative connotation

[00:27:27] but it really depends on the context, right? If what we're looking for are facts then hallucination is actually a very bad thing. If what we're looking for is creativity, a hallucination could be an amazing thing because you don't know what it's going to give you

[00:27:42] and to a certain degree it mirrors what we as humans do in a certain sense when we're being creative, we're creating something out of nothing and that isn't entirely always defined by a specific piece of information

[00:27:57] or anything that's directing us to do it. We just put a line there because it seemed like it would be a good idea or whatever. Yeah, the Guardian had a story a week ago or two weeks ago arguing that AI could cure the downward spiral of human loneliness.

[00:28:15] Do you have that highfalutin goal or is it more just fun? I mean, that's a pretty lofty bar there. Yeah, it is. But to come back to the comment that was made earlier of having lost the art of conversation through our phones

[00:28:39] and text messages, I think there could be something to be said for that. One thing we've seen a use case for AI town is folks who speak English as a second language and might not be fully comfortable having a dialogue with somebody over English

[00:28:57] but you talk in a non-judgmental setting with one of these characters and you can even ask them to point out mistakes if I make any mistakes. People have written in to tell us this on our Discord. That is fascinating. That actually makes a ton of sense

[00:29:18] because my only example of this was a couple of decades ago, when I knew I was going to live in Montreal and wanted to learn French. My girlfriend at the time and I ended up getting some sort of CD-ROM, which tells you when this was.

[00:29:33] But it was some sort of CD-ROM that I used to try and learn French and everything and they have had their own little exercises on how to speak and speak it out and oh, you're wrong, you're right or whatever. It wasn't doing any sort of voice recognition

[00:29:50] obviously at that time. I think it's a great tool for these systems because they're very good at understanding the words that we're saying. They're very good at translation and oh, that's such a great tool. And you don't have a stage fright talking to them?

[00:30:08] Well yeah, you don't have to worry about being judged. That judgment piece isn't there because you know going in, hopefully you know going in that you're not actually talking to a humanized pickle. It's an AI. The pickle piece is great. It's one of our zaniest characters.

[00:30:25] But I think that not being judged is an important part. Yeah, I think that if we think about the loneliness thing I think that could be an important part of it. People are afraid to expose their real self sometimes and I think that having a confidant

[00:30:38] where you could be your real self without the thought of being judged seems like just a thing that would naturally be a net positive and help build confidence, that sort of thing. Is that a business, or is it a demonstration of the power that you're creating?

[00:30:54] It's a business in a few different ways. There are a number of partners we're talking to who have seen this technology and you say, wow, imagine that for our character XYZ we'd love to be able to enable that

[00:31:09] or we'd love to be able to put this technology inside this device or this other thing without getting into the exact details there's definitely a lot of interest in just this conversational scenario and you can imagine just a lot of potential things

[00:31:28] even just around talking to Santa Claus and everything where we forget sometimes in the industry that this technology, oh, we'd build it it still seems very magical to a normal amount of people. Yeah, right. Yes, like my wheels are turning.

[00:31:48] Suddenly, a Jack in the Box can come along and say, hey, you can go to our site, you can actually have a conversation with Jack, and I'm sure that's not happening already. That's right on the other side of what corporations

[00:32:02] are gonna think about doing because they want customers they want people to engage with their brand in that way this technology really encapsulates that. What are some of the more kind of challenging aspects of working with speech to speech models right now

[00:32:20] that maybe you didn't foresee at this point in time? Oh, man. Yeah, we could go on for a long time about this. I think this is one of the reasons that I think people have been maybe a bit reluctant to fully dive in on speech

[00:32:35] there are a lot of challenges because the human ear is just such a good discriminator in terms of knowing what sounds right what sounds wrong and you have one of the things that sounds wrong is when it takes the AI a long time to respond to you.

[00:32:51] Yes, as we can see. Yes, exactly. And so within the Duo and Google Meet sort of world, we had a very strict set of standards: we aimed for 250 milliseconds between the time when you stop speaking

[00:33:09] and the person on the other side of the call would hear it. And so this like very, very sub second latency was something that we sort of designed for the system WebRTC like a million things went into that to make it more performant

[00:33:23] and really try to keep this natural low latency sort of thing built into the whole protocol because it turns out that humans when they have conversational turns they alternate turns very, very quickly and it depends on the language being spoken but sort of for English

[00:33:41] the typical time between when someone stops speaking and when the next person starts taking their turn is only about 200 milliseconds. So if you let your latency get too high, two things can happen. One is you can start having talk-over,

[00:33:55] where somebody thinks that the person is done speaking but they're really not, and they start talking and the other person hasn't yet ceded the floor. Or you can have things where there's a long delay, and in voice, a long delay humans interpret as

[00:34:15] there's something about what I said that isn't quite right, isn't sitting right with the listener and maybe it's because they're thinking about it or maybe they're thinking about how to deliver a response that's not going to be a hard message but there's a very clear thing

[00:34:31] and this is documented in literature and once it goes past about 600 milliseconds people start ascribing that that delay was intentional because there's some sort of additional thinking going on that's being used to figure out how to say the thing they want to say

[00:34:47] and so anyway, there's a lot more detail I could go into there but I think the key thing is that the latency stuff isn't just like a nice to have it makes everything just a little snappier but it also changes like the overall semantic sort of feel

[00:34:59] of the conversation. So a lot of what we have built in terms of oh, here's HTTP and all these sort of things that are meant for delivering web pages those things are not what we use for delivering voice

[00:35:12] and so we use WebRTC and this sort of technology because it's very focused on low latency. And I think the whole AI ecosystem is going to go through a bit of a forklift right now, where it switches out some of the existing

[00:35:24] ways of doing things and switches in things that are more like WebRTC for getting to this low-delay, quick-response kind of conversational interaction. Well, I had a boss once... Oh, sorry. Sorry, we just demonstrated it.
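To make the timing figures Justin cites concrete, here is a rough, illustrative sketch of how a voice agent might classify the gap before it responds. The constants and function names are assumptions for illustration only, not Fixie's or WebRTC's actual code; they simply encode the roughly 200 ms typical turn gap, the 250 ms response target, and the roughly 600 ms "intentional pause" threshold discussed above.

```python
# Illustrative sketch only: classify the silence gap after a user stops
# speaking, using the rough timing figures discussed in the conversation.
# The constants and names are assumptions, not Fixie's or WebRTC's code.

TYPICAL_TURN_GAP_MS = 200    # typical human turn-taking gap for English
TARGET_RESPONSE_MS = 250     # latency target cited for live calls
INTENTIONAL_DELAY_MS = 600   # past this, listeners read the pause as deliberate

def classify_response_gap(gap_ms: float) -> str:
    """Label how a given response delay is likely to be perceived."""
    if gap_ms <= TARGET_RESPONSE_MS:
        return "natural: within normal turn-taking timing"
    if gap_ms <= INTENTIONAL_DELAY_MS:
        return "noticeable: slower than a typical human turn"
    return "intentional-seeming: pause reads as hesitation or extra thought"

if __name__ == "__main__":
    for gap in (150, 400, 900):
        print(f"{gap} ms -> {classify_response_gap(gap)}")
```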

[00:35:38] Yeah, I had a boss once who was very high latency, and I had to warn people coming into meetings: don't finish the sentence for him. He'll finish it, just give it a minute. So what strikes me is that you must be doing

[00:35:50] a lot of reading research and research on the nature of human conversation. Yeah, right. Are there any kind of interesting insights that you can use to help people understand what's going on in the world? Oh, yeah. I mean, a couple of things I just mentioned there,

[00:36:07] but even the words like um or uh, people sort of think of those, oh, that's lazy speech. Those are things that when you don't know what to say, you insert them into your speech and really you should try to avoid using them at all.

[00:36:20] Well, it turns out that those things are really just part of a protocol and that really you're sort of working to see like should I, am I keeping the floor open or not? I think that's really what it comes down to is the other person should they speak.

[00:36:36] And by using um, um is just a quick way of holding the floor for a little bit where uh, it's holding the floor for a little bit longer basically indicating that a response is going to be coming. Right. Yeah, and so like this little you know,

[00:36:50] signaling happens almost unconsciously, but everyone knows how to do it. So if someone says um, they shouldn't start talking right away, even though the time is now starting to accrue in the silence since they stopped talking. And so that's one interesting thing,

[00:37:05] and the other thing is that the sort of utterance, huh? Is itself like a special part of the protocol that turns out that huh is like the fastest thing that you can actually voice and so if someone says something to you

[00:37:21] and you have hard time understanding what's being said uh and you spent you know hundreds of milliseconds processing that but you still haven't quite figured out what's going on then rather than sort of like lose your time on the floor, people will say huh, you know,

[00:37:36] it's just a way of expressing, I didn't understand what you just said, almost, in protocol terms, what we would call a negative acknowledgement, or a NACK. And then people understand that the speaker will then come back

[00:37:49] and maybe rephrase their terms or whatever. There are all these little fascinating protocol-like things that are already part of conversation that, I think, unless you really start looking at it in detail, are easy to miss.
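As a rough aside, the turn-holding signals Justin describes map onto a tiny protocol. The sketch below is illustrative only, with assumed names and behaviors drawn from the discussion: um holds the floor briefly, uh holds it a bit longer, and huh acts like a negative acknowledgement asking the speaker to rephrase.

```python
# Illustrative sketch of the conversational "protocol" described above.
# Signal meanings here are assumptions drawn from the discussion, not data.

from enum import Enum

class Signal(Enum):
    UM = "um"    # holds the floor for a short moment
    UH = "uh"    # holds the floor a bit longer; a fuller response is coming
    HUH = "huh"  # negative acknowledgement (NACK): please repeat or rephrase

def listener_action(signal: Signal) -> str:
    """What the other party should do when they hear each signal."""
    if signal is Signal.UM:
        return "wait briefly; the speaker still holds the floor"
    if signal is Signal.UH:
        return "keep waiting; the speaker is formulating a longer response"
    return "treat as a NACK: restate or rephrase the last utterance"

if __name__ == "__main__":
    for s in Signal:
        print(f"'{s.value}' -> {listener_action(s)}")
```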

[00:38:04] That's fascinating to me, especially because, for the work that I'm doing in podcasting on an independent basis, I'm always looking for ways to improve my productivity and shave off some time, and of course AI-generated transcripts have been a huge boon for me.

[00:38:19] What I notice going through the transcripts is that those ums, those uhs, those huhs carry a very different meaning when you actually see them printed out in the transcript versus when you hear them. When you hear it,

[00:38:32] like you said, it makes sense, it connects two things, it keeps the conversation going. When you read it, man, it just looks like that person didn't know what they were talking about. You know, that's a completely different meaning.

[00:38:45] I remember one of the first interviews I did on On the Media, a show done out of the NPR station WNYC in New York, and as we're walking into the studio the producer said, I want to warn you, we edit Bob a lot.

[00:39:00] Why are you telling me this? I didn't know what was going on. So what he did was he would restart sentences all the time, knowing that they were gonna edit. And not only that, but they did a fascinating episode

[00:39:10] you might be interested in finding, I can try to find it for you, where they do take out everybody's ums and uhs; they smarten you up. And there's kind of a journalistic question there as to whether or not

[00:39:22] you're getting a true picture of someone, or whether it's just a common courtesy, that if you did a transcript of someone you would probably delete the uhs, because unless you're trying to make them look stupid, it doesn't add anything. It's really interesting how we

[00:39:36] imbue some sense of someone's intelligence and other things in how they speak. Yeah, I mean, speaking on the spot can be quite challenging; putting together a complex paragraph, there's a lot of processing required, and you sort of start your utterance

[00:39:56] before you've even really fully figured out what it is you're going to say. And so you see a quote in a newspaper that's had all the uhs, ums, filler words taken out, and now to do the same for an actual

[00:40:11] video or audio clip, it's doing the same sort of thing. But I think, to your point, there's a bit more heavy lifting going on there, because it's kind of starting to distort, you know, what was this person actually like

[00:40:23] in reality, what was the reality of the thing versus what was the sort of condensed product that the reporter had a chance to smooth out. So do you add in those human quirks? Yeah, I was going to ask that too,

[00:40:37] but I listen to posh Brits, they say sort of, sort of, sort of a lot; we Americans say like or you know a lot; there are certain kinds of pauses. Are you building that into the output? So this is probably a good time to riff on

[00:40:55] the sort of V1 versus V2 in voice technology, in that, in many ways, what OpenAI demonstrated was what I consider to be the V2 of voice technology. Previously, OpenAI had their own sort of voice mode for ChatGPT,

[00:41:16] and the V1 is sort of where you have a speech recognition stage that turns speech into text, then the LLM stage, which is similar to what you're already seeing with ChatGPT, and then finally a text-to-speech stage, which takes the output

[00:41:31] of the language model and turns it into speech. Now, in this world, everything kind of gets converted to text, run through the AI, and then emitted back as speech, and there's really no ums or uhs in there, because the language model doesn't think in that context;

[00:41:49] it's never really been trained to, you probably could fine-tune it to do so, but it doesn't think it's part of a voice conversation, and so these things don't occur naturally. In the V2 world, what we have is a model that consumes speech from the outside

[00:42:04] and emits speech on the other end; it's never forced into text in the middle like in the V1 setup. And in this mode, the training to customize the model and make it fully speech-to-speech enabled is almost all

[00:42:19] speech, and these artifacts of conversation, these filler words, are part of the training data, and I expect they will actually become part of the output, because, like I said, to participate in this conversational protocol you need to be able to do these things.
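As a rough illustration of the two architectures Justin describes, here is a hedged sketch in Python. Every function in it is a hypothetical placeholder standing in for a real ASR, LLM, TTS, or speech-to-speech model; it is not Fixie's or OpenAI's actual API, just the shape of the V1 cascade versus the V2 direct path.

```python
# Sketch of the "V1" cascaded pipeline versus the "V2" speech-to-speech
# approach described above. All functions are hypothetical placeholders,
# not real APIs; the point is only the shape of the data flow.

def transcribe(audio: bytes) -> str:
    """Speech recognition stage (placeholder)."""
    return "placeholder transcript"

def generate_reply(text: str) -> str:
    """LLM stage: text in, text out (placeholder)."""
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage (placeholder)."""
    return text.encode("utf-8")

def speech_to_speech_model(audio: bytes) -> bytes:
    """Hypothetical end-to-end model: speech in, speech out (placeholder)."""
    return b"synthesized audio reply"

def v1_pipeline(audio_in: bytes) -> bytes:
    # V1: everything is forced through text in the middle, so filler words,
    # tone, and timing are lost between the stages.
    return synthesize(generate_reply(transcribe(audio_in)))

def v2_pipeline(audio_in: bytes) -> bytes:
    # V2: one model consumes speech and emits speech directly, so
    # conversational artifacts like "um" can survive training and output.
    return speech_to_speech_model(audio_in)

if __name__ == "__main__":
    fake_audio = b"\x00\x01\x02"
    print(v1_pipeline(fake_audio))
    print(v2_pipeline(fake_audio))
```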

[00:42:34] Although, you know, maybe you could say that the AI doesn't need to stop to think in the same way, I think there will always be scenarios, and there are very technical reasons why, where an AI may be slow at a certain point in generating,

[00:42:49] whether it has to retrieve information or something else, and it can use the protocol I've just talked about to buy itself time, a hmm, or a let me think, and it'll just seem very natural when it all comes together. Yeah. And at the same time, I can see

[00:43:07] many people recognizing this is an AI voice that's throwing in those connective pieces and criticizing it for it, saying, well, it doesn't need to do that, we need to do that because we're human and we're not going to be able to avoid doing that.

[00:43:22] At the same time, maybe if we allow ourselves to kind of move beyond that, maybe that ends up making the conversation feel more like something we're used to. It's a weird line there; I think you can't please everybody as far as that's concerned.

[00:43:37] well we're now in this new sort of v2 world I think and I think we'll be figuring out exactly how should we map these things from human conversation into how ai should interact I think it's going to be fascinating we're doing our own modeling

[00:43:52] I think it will be great to see how GPT-4o works out with the voice mode. But yes, I think that there's a not-too-distant future where you could close your eyes and wonder, am I talking to a person or talking to an AI,

[00:44:07] and I think that will actually be extremely enabling and potentially help technologies like ChatGPT really cross the chasm and appeal to a much broader set of people, because all you have to do is talk to it. Yeah, just fascinating work. It is fascinating, it really is,

[00:44:25] and let us note that this comes right about the same time that ICQ is being shut down, right? So it's really a next generation of human chat, right? We're done with you, ICQ, we've moved on. Yeah, we don't need

[00:44:40] that anymore. It will be missed. That was a very interesting one, one of the first startups. Those guys, I knew them, and they were basically running like six servers in a closet as their original data center,

[00:44:55] and they made a lot of tradeoffs to kind of make that work, but yeah, really an incredible achievement. Yeah, no kidding, it made a lasting imprint on everything, on communication on the web. And thanks for hanging out with us a little bit

[00:45:10] and telling us a little bit about Fixie and AI.town. Fixie.ai is the site. I know you do more than just AI.town and the technology behind that, so maybe just real quick, before we say goodbye, tell us a little bit about

[00:45:25] a project or two that you're working on outside of AI.town with Fixie. So there are two things we're doing. One is Ultravox.ai; this is our speech AI model. We've just open-sourced this

[00:45:40] and released it, and we're building a community around it. In some ways it could be seen as the open-source complement to GPT-4o, in that we're building on some of the work that Meta has done, using Llama 3 and building sort of a front end for this

[00:45:55] multimodal extension of Llama 3 that can consume speech, and it can be used in things like AI.town to get that sort of really fast, human type of interaction where it can understand speech natively. And so we've started up a sort of open-source project

[00:46:10] around this, and a lot of interested people are very interested to see open source have the same sort of abilities that proprietary models like GPT-4o have. The other thing is, we care a lot about speed, and so we've kept a leaderboard

[00:46:25] of models at thefastest.ai. This is our way of keeping track of who's doing an amazing job at making LLMs super fast so that they can work in these low-latency situations we just talked about for voice. There are a number of familiar names there,

[00:46:43] but we're working closely with some of these partners on powering AI.town, and we hope to see these numbers continue to go down. We're also going to add a voice category to this as well, so we're going to be able to see who's doing a great job

[00:46:58] on multimodal voice so we care about speed we care about voice and that goes into everything that is what we do that's amazing you're doing great work you continue to do great work I will continue to follow you of course Justin

[00:47:13] Uberti. Fixie.ai is where you can find the links to the few other projects that Justin just mentioned, and then of course AI.town if you want to have some conversations with their characters. You can, you know, it's like

[00:47:25] you have a phone conversation with them; it's really cool, it's worth checking out. Justin, thank you for doing the work that you do, and thanks for talking to us about it. We appreciate your time and appreciate meeting you today. Jason, Jeff, it's great to be here.

[00:47:37] yeah I appreciate it best of luck and we'll talk with you soon alright and Jeff that is it for this week's episode GutenbergParenthesis.com for everything Jeff has going on right now I can show it here real quick just so that people know that

[00:47:52] it's a site that exists, with discount codes, Magazine, and then of course the big kahuna, The Gutenberg Parenthesis. Thank you, Jeff. Next week we actually talk with someone who, you know, Nikita Roy from the Newsroom Robots podcast, she's going to join

[00:48:13] us and I give you three guesses what that conversation will be about that's right newsroom and AI that's what we do right here sometimes it'll be great to have Nikita on AI inside normally records live every Wednesday at 11 a.m. Pacific 2 p.m. Eastern

[00:48:28] here on the Techsploder YouTube channel at youtube.com slash at techsploder, but I am still away on vacation, so instead you can find a YouTube premiere for this episode, which has already passed, and then the next two episodes. So just go to that channel

[00:48:43] at the same time and you will see a premiere that you can do live chat with people as it plays for you we will publish it of course to our podcast feed later in the day we can also you can support us on

[00:48:55] our Patreon at patreon.com slash aiinsideshow, where you get a whole bunch of awesome perks, and you also have the opportunity to be an executive producer of the show, which gets your name called out at the end, like Doctor Do Maricini and WPVM 103.7

[00:49:13] in Asheville North Carolina one of these days I'm gonna like find a way to listen to that on the radio to see because now I'm super curious thank you for your support y'all and thank you just go to AI inside dot show for

[00:49:25] everything that we have talked about today it's all published there everything you need to know thanks so much for watching and listening we'll see you next week on another episode of AI Inside bye everybody