AI Positive
January 24, 2024 · 1:01:40


On the premiere episode of the AI Inside podcast, hosts Jeff Jarvis and Jason Howell discuss AI and copyright with the Common Crawl Foundation's Rich Skrenta: news outlets are limiting access to content they publish publicly, undermining the integrity of Common Crawl's archive of the internet. In recent years the archive has been used as AI training data for LLMs, and restricting what goes into it has a dramatic impact on the quality of the data that survives.

INTERVIEW

  • Introduction and background on AI Inside podcast

  • Discussion of the recent AI oversight Senate hearing Jeff testified at

  • Introduction of guest Rich Skrenta from Common Crawl Foundation

  • Overview of Common Crawl and its goals to archive the open web

  • Discussion of how Common Crawl data is used to train AI models

  • News publishers wanting content removed from Common Crawl

  • Debate around copyright, fair use, and AI’s “right to read”

  • Mechanics of how Common Crawl works and what it archives

  • Concerns about restricting AI access to data for training

  • Risk of regulatory capture and only big companies being able to use AI

  • Discussion of recent court ruling related to web scraping

  • Hopes for Common Crawl's growth and evolution

NEWS BITES

  • Interesting device announcement from CES - Rabbit R1 with Perplexity AI integration

  • Study on actual risk of AI automating jobs away in the near future


This is AI Inside, Episode One recorded Wednesday, January 24, 2024. AI positive. This episode of AI Inside is made possible by our wonderful patrons at patreon.com slash AI Inside show. If you like what you hear, hop on over and support us directly. 

And thank you for making independent podcasting possible. Hello, everybody, and welcome to AI Inside, the premiere episode of this show, where we take a look at the AI that's hiding underneath the surface. I mean, AI is everywhere. 

It just kind of makes sense. AI is inside everything. And that's what this show is all about. I'm Jason Howell, one of two regular hosts on this show. I'm thrilled to be here. Also thrilled to be joined by my friend, Jeff Jarvis. How you doing, Jeff? Hey, Jason, it is so good to be on the real show. I know. 

No more workshopping this one. We're now public and live, and I'm delighted. Totally. A little backstory for people who don't know: not too long ago, I was working for the This Week in Tech podcast network, twit.tv. And for Club TWiT, which is their subscription-based platform, Jeff and I were working behind the scenes for probably a good four to five months on this show. We were kind of creating it, playing around with ideas, had some really fantastic interviews with people.

And obviously, I'm no longer at TWiT, but they gave us the blessing to do this show independently. So that's why we are here doing this. And we kind of benefit because we did all that work to create it, and now we can just kind of hit the ground running. And thank you, Leo and Lisa, for letting this happen. And we're delighted to be running now.

Yeah. Real quick before we bring our guest on, Jeff: I can't imagine many people watching or listening right now don't know who you are, but for those who don't, I know you've got a huge list of all the things that you've done, but give people kind of the CliffsNotes version of who you are and your interest in AI. I'm just an old media guy in all senses of the word old: old newspaper, magazine person, teacher. I am just, air quotes, retiring from CUNY's Craig Newmark Graduate School of Journalism, and I'll be, I think, working somewhere else, working on education. And then I'm also writing books. I have out now, should plug them, The Gutenberg Parenthesis and Magazine. And then later this year, I have a book about the Internet called The Web We Weave. So that's me.

Heck. And my interest in AI is the same one that Jason and I have said all along. We are not experts. We want to learn, and we'll learn together with each other and with you as we have smart guests on and deal with the news and figure out what this phenomenon is and its impact. 100 percent. And I am just absolutely thrilled to be able to hang out with you each and every week, Jeff.

Same here. I'm super honored to be in this position, to be able to learn about artificial intelligence at a moment that feels like a real inflection point in technology. I mean, the time is ripe right now for following this technology and seeing where it develops, because it has the potential to really impact so many things in our lives.

It already has, and it really feels very early. So with that being said, I think our first guest ever on this episode is, you know, the perfect addition, the perfect inclusion for the very beginning here. Welcome to the show today, Rich Skrenta, who is the executive director of Common Crawl. How you doing, Rich? I'm great. Thanks, Jason.

Thanks, Jeff, for having me on the inaugural episode of your new show. Yeah. Rich and I go way back to when Rich founded Topix online.

That's right. We both believe in dialogue and community and conversation online. Topix was a great pioneer at the time.

Yeah. And now, Rich, we're going to talk a lot about the work that you and your team do with the Common Crawl Foundation. But I guess give us a little bit of setup before we get there. And I know, Jeff, we want to kind of set up an event that, you know, you were involved with a couple of weeks ago. You were in DC at the Senate.

We'll get there in a second. But real quick, for people who are new to Rich and the work of the Common Crawl Foundation, tell us a little bit about your role there, what the Common Crawl Foundation is all about, kind of when it started, all that kind of stuff. Well, I came to Common Crawl last year. I'm executive director of the Common Crawl Foundation. Common Crawl has actually been around since 2007.

It was started 17 years ago by Gil Elbaz. Gil was a founder of a startup called Applied Semantics, which was one of Google's first acquisitions. And when Gil left Google in 2007, he had this idea that web crawls were a really important resource for the world, but that they were locked away in the walled gardens of the big companies like Google and Microsoft. And he thought that they should be more of an open resource, like open source. So he started personally funding a project to crawl the internet and make it available for free to researchers or to anyone who wanted to use the data.

Maybe you wanted to build your own alternative search engine, you know, try to compete with Google, or just do research. So for 17 years, it's been crawling the internet. And over this time, it's become really a tremendous resource. It's cited in over 10,000 academic papers. And it's a huge archive. It's nearly 10 petabytes, 250 billion pages. And recently it's become the primary training data set behind nearly every LLM.

It turns out AIs just have a voracious appetite for training tokens, and Common Crawl is a great source of training tokens, since it basically is a copy of the internet. Fascinating. And it's fascinating to see the initial purpose, the initial idea this was created around, and then over time, how technology evolves in ways that we don't always predict or expect. And here we are.

It's being relied upon by so many LLMs, like you said; we are going to get into all that. Thank you for setting that up. That's perfect. Before we get into the meat of the interview, Jeff: like I said just a few minutes ago, you were in Washington, DC, at the Senate hearing on the oversight of AI and the future of journalism. This was just a couple of weeks ago. And I know you've talked about it here and there a little bit on This Week in Google and everything, but maybe we start with that, because I think that's a really big reason why Rich is on the episode in the first place. You gave some testimony, you mentioned the Common Crawl Foundation in your testimony. Talk a little bit about, like, how did you get this invite? First of all, how did this happen for you?

It came out of nowhere. Basically, I was the beard, because what they wanted to have was a hearing where they had lobbyists and legislators nodding to each other and saying how wonderful they are and all the laws they want to pass together. And so they couldn't call a hearing unless they had somebody there who wasn't one of them, who didn't necessarily agree.

Not that I got called on much, but I was there to kind of give cover to the idea that it was a hearing. And I thought it would be important to give a little bit of background, because I think it's good for the conversation with Rich. When I went into it, Rich and I had just happened to reconnect. He reached out to me and I learned more about Common Crawl.

I was fascinated by it and mentioned that in the testimony. But what came out from the big media companies... On the one side of me, I had the lead lobbyist for the trade association for the combined newspaper and magazine industries. And on the other side of me was the NAB, the National Association of Broadcasters.

And then next to him was the CEO of Condé Nast, Roger Lynch. And what they were saying is kind of frightening, because beyond the issues that came up otherwise about deepfakes and Section 230 and other things, what they were saying was that they think they have a right to be paid for any use of their content by AI. And they separated out, as Roger Lynch did, training data and output, which is very common nomenclature now: there's data that's needed to train the large language model how to speak and how to parse words.

That the words "white" and "house" are often together, for example. Right. And that's training data so that it can do whatever it then does. But then on the output side, if I ask it to write a song in the style of Bruce Springsteen and it quotes Bruce Springsteen, well, then should it have the rights to Bruce Springsteen's lyrics or not? That's the debate that's going to occur in court. But what the publishers were saying was, we should be paid for all of that, and it should be licensed from us for all of that.

And they shouldn't even be able to learn from us unless they pay for it. And I said, whoa, journalists read each other's stories all the time and learn from it and use their information in that. So should the machine not have the same right as a journalist? 

And I got in trouble for that: people said machines don't read, they don't learn. OK, should the companies that make the machines have the same right? Should Google or Wikipedia or anyone have the same right as WCBS radio, or a TV company that does the same thing?

And the position of the media people is, no, we deserve this right. Well, if that happens, it reduces fair use to the nub, to nothing. Because what it says is that, no, there is no fair use. 

I would think that training a model is fair use, because it's a transformative use, because it's learning from something. And the whole reason for copyright is so we can all learn from each other. But they're saying, no, shrink fair use. There was a very interesting part at the end where I said that if you shrink fair use, there's no copyright. And they said if you expand fair use, there's no copyright.

So that's kind of the crux of where we are now. And as I talk to Rich now, to bring him in: I didn't realize that, for example, the New York Times has said, take us off of Common Crawl. And most of these companies, OpenAI among them, have, just like robots.txt, provided the means to not be crawled.

OK. But if we get to the point where only three things on the Internet are crawled and used to train these models, we know what's going to result. There was a story just today that said that most news organizations have closed the door to crawling. But meanwhile, the far-right media organizations love it, because now they have a greater scale of influence in these things. So there's huge issues going around about this. And Rich finds himself in the middle of it.

Caught in the middle. Common Crawl was creating this incredible resource to democratize the entire web, to save the text for scholars to use and study. Now it's being used to train these models. These models are controversial. Media companies are up in arms, thinking they have a bag of gold waiting for them at the other end of a suit. Look at the New York Times suing OpenAI, now cutting off negotiations. So that's the atmosphere in which we arrive.

And Rich is on the hot seat. And how does it feel? I mean, what's interesting about that, and I want to hear what you have to say, what I think is really just fascinating, is what I was saying just a few minutes ago: Common Crawl was created with a certain purpose in mind, to really be of service to people by cataloging the Internet. And in essence, it's been relied upon in these other ways that weren't necessarily predicted in advance. And now that's placing some pretty intense pressure on what you all do.

And I would say what you have done comes from a very pure and understandable direction. How does it feel? Like, to a certain degree, does it feel like the rug has been pulled out a little bit? And it's like, OK, now we've got to recalibrate. We've got to reiterate why we exist in the first place. Well, I do think publishers are being hasty when they robot-ban us, or if they send us letters and say, hey, we want you to take us out of your historical archive.

Because they're mad at LLMs. You know, I think it's a shame, because we see ourselves, in our role, as being archivists. And we are collecting this material for many purposes, not just for LLMs, but for people doing research, sociological research, studies of the web, creating data sets for machine translation, creating alternative search indices, and LLMs. And so, you know, if you're mad at LLMs and you say, well, I want to be out of Common Crawl, you're also denying your content to researchers that are using it for myriad other purposes.

So that's a shame. You're also removing it from what's essentially a time capsule, an archive that's going to be used by people in the future, which is also a shame. I thought, Jeff, your comments in your congressional testimony that copyright was never intended to apply to news were extremely interesting. You know, the idea that if a news organization wants to erase their stories from an archive like Common Crawl, they're literally erasing a record of what's happened. And that's bad for civic society.

It's not good for, you know, democracy. Rich, explain a little bit, if you would, just the mechanics of Common Crawl. Number one, it only does text. So it doesn't do images.

It doesn't have child porn, that kind of stuff. But number two, do you go behind paywalls? Do you do only the free web? How much of the web are you able to do? Yeah, so each cycle, which is about every two months, Common Crawl goes out and takes a sample of the web.

And, you know, we crawl about five percent. And it's somewhat random. And Common Crawl's robot, CCBot, goes to great lengths to crawl politely. And it's only taking content that has been published on the open public Internet. So it respects robots.txt, the robots exclusion protocol, which is a means whereby webmasters can say, hey, we don't want robots to crawl us.

That's been in use since 1994. CCBot doesn't accept cookies. It doesn't log into websites.

It doesn't bypass pay gates. It only takes information that webmasters have put on the web so that people can see it. And it will have a per-site budget, and it'll take a certain number of pages from each site, archive them, and then move on. And we add that to the archive. And so each cycle, that's about an additional three to five billion pages that we're adding to the total. And so in the end, do you...

Do you add up to 100 percent at some point? Well, the web is basically infinite. So each cycle, we re-crawl about 50 percent of the pages that we've seen before, in order to see if they've changed. And then 50 percent of the pages are brand new, URLs that we know about but that have never been visited before.
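For a sense of what "respecting robots.txt" means in practice, here is a minimal sketch using Python's standard urllib.robotparser; the domain is a placeholder, and CCBot is the user agent Common Crawl's crawler identifies itself as.

```python
# Minimal sketch of the robots exclusion protocol in action, using only
# the Python standard library. example.com is a placeholder domain.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A polite crawler asks permission for its user agent before each fetch.
# A webmaster blocks Common Crawl with:  User-agent: CCBot / Disallow: /
print(rp.can_fetch("CCBot", "https://www.example.com/some-article.html"))
```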

So when the New York Times asked to take down the history that you had, that was stuff that had been available for free on the open web. Correct. Yeah. 

And to make a point about that, I think it's somewhat ironic. You know, I've been working in search for, well, decades at this point. And I remember when Google came out; I was actually using it way back when it was google.stanford.edu. And there was debate about, you know, is there a business model that could support search engines? And I think it took until about 2004 before people had worked out that business model. And I find it ironic that news publishers pay millions of dollars to the SEO industry in order to get into search indexes like Google, but now they're very hastily opting out of what is really, in my view, search 2.0: LLMs.

It takes us months to do crawls, and it takes the LLM companies hundreds of days to train models, which is why you see, with some of the products out there like ChatGPT, you know, the index, its awareness of the web, ends in April of 2021. So to pull your content out, you know, it could be quite some time, if ever, before they're able to get it back in.
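As an aside, anyone can check what a given crawl captured for a site through Common Crawl's public CDX index at index.commoncrawl.org. A rough sketch, assuming the third-party requests package and a plausible crawl ID (the current list of crawl IDs is published on that site):

```python
# Rough sketch: query Common Crawl's CDX index API for captures of a domain.
import json
import requests  # third-party: pip install requests

CRAWL_ID = "CC-MAIN-2024-10"  # assumed example; substitute a current crawl ID
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": "example.com/*", "output": "json", "limit": 5},
    timeout=30,
)
for line in resp.text.strip().splitlines():
    record = json.loads(line)  # one JSON object per captured page
    print(record["timestamp"], record["status"], record["url"])
```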

And I think that they should be maybe more interested in exploring the potential of having their content in these indices and figuring out new ways to gain attribution, you know, to expand the distribution of their ideas and maybe find new ways to monetize their content. Had you met any resistance similar to this before? I realize you've been with the organization a little over a year, so you haven't been there since the beginning. But had Common Crawl seen any resistance similar to this prior to where we are right now, with these LLMs using the data set for creating their AI systems and everything like that? Or is this a relatively new thing, that the news organizations have really clued in on the data set and cared about it? No, it's brand new. It just started last year, in fact. And in fact, our chairman, Gil Elbaz, said that in 15 years of operation, Common Crawl hadn't taken a single byte of its data out of the archive.

So this is quite a new phenomenon. And, you know, if you're a data hoarder or data preservationist... we sort of consider it to be a library. It's sort of personally distressing to have to go into the library and take books off the shelves and, you know, throw them away or put them in the basement or something.

Make them unavailable to people. So what's the fate of our public web, its present and its past? How's the picture going to change? I think that's up for debate; that's being considered right now by legislators.

And I think it's in peril. You know, depending on legislation, depending on what websites do, how they adapt their terms of use, their stance towards robots, it's possible this could be the end of the web. You know, the web was created, right, by Sir Tim Berners-Lee in 1989, with this idea that the information should be free, right?

If we could all access, you know, the world's information, that would be a really good thing for humanity. And Common Crawl, you know, is the largest copy of the web that exists, by orders of magnitude. And it's multilingual. It spans over 230 languages.

So every country, every language across the planet, over almost two decades now. So to sort of close the lid on that and say, well, you know, we want to stop archiving this, or maybe even destroy parts of it... I think, you know, Common Crawl is a national treasure, or a global treasure even. And it would be a tremendous shame to be short-sighted and lose the value of this resource. Yeah, archival is such an important thing, especially when we're talking about something as pivotal and important as the technology of the Internet has become. I mean, there's obviously no question about how important it is to nearly every facet of life. And so to lose some sense of history, some piece of history around that.

Everything that's happening right now feels so fresh in some ways. It feels like things are happening so quickly that I think it becomes easy for people who are taking a look at their services, or at the journalism they're putting online, to be fearful of what could happen. And so as a result, they jump to the conclusion that they need to retract. And what is the long-term damage, the ramification, of that kind of immediate decision to go in the opposite direction before truly understanding? Do you think this is a matter of not understanding?

Not understanding what changes like this mean? Or do you think, you know, the people who are making these decisions truly believe that they have a solid understanding of the ramifications of this? Well, I think it's new technology. I think people are kind of freaked out by it, because the capabilities of LLMs are surprising, you know, to the researchers that invented them and to all of us. And publishers too, maybe. You know, I think maybe they're still mad about Google in 2004 and what it did to the economics of the news industry. But having said that, I don't think that the right response is to pull all the content down.

And I don't think that's good for the world. I think that, you know, I support the right to read on behalf of AI agents, the right to learn. You know, you and I are allowed to walk into a library and read books, copyrighted books, newspapers, magazines, and we can learn from them. We can form mental models and then use them in our work, in our life. We can apply them. That doesn't give us the right to plagiarize, you know, or engage in hate speech or libel.

You know, if we're doing a report for school, we have to cite our sources and such. We have to obey the laws in whatever place we happen to be operating in. But that's at the point of application. You know, if we're mad at an AI agent, like a new LLM, the way to correct its behavior is not to take training material out of the library.

You know, it's not to burn books. It's to teach the LLM how to behave properly. And if we want these new agents, the LLMs, to be best aligned with us, to be most useful to humans, they really need to have access to the broadest set of training data. If we want them to know what names look like, they need to see a lot of names. If we want them to know what songs are, they need to look at songs or listen to songs.

If we want them to understand the structure of language, they should read a lot of books. And, you know, that's just how it is. And if we deny them access to that material, it's going to hold back the development of this really important technology. Yeah, because the development of the technology doesn't end when these things go away. The cat's already out of the bag; it continues to go in whatever direction it's allowed to. And to your point earlier, Jeff, you know, the news organizations are pulling out of the data sets, except certain sites, certain news organizations from the far right, keeping theirs in.

What does that do? That creates a very biased data set at the end of the day. And what we really hope for as users of these things is we want something that tells the truth, you know, as much as absolutely possible. 

That is even-handed, that isn't too slanted in one direction or the other. And all of that desire as a user requires openness in order to get a truly balanced input of data. We should be doing the exact opposite. We should be looking, because the web is a biased database. It is, as publishing's history is, a biased database of those who've had the privilege to publish. Yeah, that's true. Yeah. And so we need to look at what's missing there rather than making more things missing. I should mention, too, that when I used the metaphor in my testimony of saying that the machine has a right to read and learn and use information, somebody got mad at me because that was anthropomorphization and the machine can't read or learn.

I know, I know. But if I had said in the testimony, does Google have the same right as WCBS, as I said earlier, to read and learn, then people would have concentrated on Google, and on not liking Google. And we're trying to get this to the level of principles and precedents. And so I should mention that when Rich and I first talked, he reached out to me.

I can't even remember exactly why, Rich, whether it was this whole discussion. But by the time we finished our conversation, we agreed that we needed to do something, get something together. So we've been talking about, the plans are still nascent, getting together an event where we bring in AI people, media companies, journalists, researchers, law experts, policy people to discuss the peril to the open Internet, because that's what this really is. Rich, what do you hope we might accomplish? What's the message we want to get across?

Yeah, that's exactly right. I mean, I think we have to defend the open Internet. And I think we need to have a thoughtful discussion about these topics. I mean, you know, content creators are upset about this technology. And I think many of their concerns are valid, right? They're concerned about attribution. They're concerned about how are they going to get compensated for their content? These are reasonable questions. And we should have a thoughtful debate.

And, you know, ideally one with more light than heat in it. And so, yeah, Jeff and I have been talking about, you know, who can we get in the room and really hopefully generate some ideas and thoughtful discussion about this stuff. So, Rich, I'm curious too. It's not as if you're new to all of this. You worked at IBM Watson, you've been a software engineer for a long time. You've been around startups.

And I imagine lately you've had contact with these new AI companies. What do you think of them? What do you think of the work they're doing? 

What do you think of the LLMs? What do you think is next? Just kind of how do you assess the state of that industry, that brand new industry? 

I think it's tremendously exciting. I mean, I'm AI positive. You know, people ask me, you know, hey, Rich, are you an AI doomer or are you AI positive? I think it's far more likely that AIs are going to be just this incredible tool for humanity, you know, drug discovery, right? I think it's way more likely they're going to cure cancer than that they're going to turn into murderous terminators. 

And far more likely they're going to figure out how to deflect an asteroid than, you know, decide to kill us all. And especially, you know, like, why do I think that? Well, if they've read all these books, right, then they know what we know. They're going to share our values. They're going to share our hopes and dreams. And so they're going to have built-in alignment with us. And, you know, the folks at the AI companies are putting a lot of effort into making sure that these systems behave responsibly.

I mean, that's a huge component of LLM training, you know, with the red-teaming exercises that they do to make sure that these systems behave responsibly. And sometimes people complain that they've gone too far and, you know, the agent's almost being too nice. But they're proceeding cautiously, which I think is wise. But some of the folks we're talking to are just doing astonishing things with this technology. You know, Poolside is doing English-to-code prompting, which is just marvelous. This idea that, you know, finally, after 50 years in the industry, English has now become a programming language blows me away.

Right. This has been the holy grail for so long. And the fact that it's actually become a reality now, you know, in 2023, 2024. 

You know, I'm just floored that this technology has finally come to fruition. Let me ask you about some alternatives. For the news industry, one thing that I've thought of is that rather than cutting off, why don't they join together to create an API for news and sell keys to it, so that if a large language model that doesn't have fresh data since 2021 needs fresh data, it has a place to go out and get it, serve it to a user, and pay for it. Are there other models that you could imagine for what the relationship should be between publishers and media and the AI companies?

This can't be a bunch of legalese at the bottom of your website in the terms of use, because computers can't really read that or understand it. If you have to pay a lawyer to read that... we can't have a lawyer read 200 billion web pages.

It's infeasible. It needs to be a machine-readable license, or something that can be processed in bulk. The New York Times, they're a huge organization, and they have the resources to go to folks and charge them money. But the long tail of publishers don't have the resources that the New York Times has. So I think having some organization or some system that little publishers could plug into, in order to say, here is the way I want my content to be used, I think that would be really helpful.
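No such standard exists yet; purely as illustration, a machine-readable policy of the kind Rich describes might look something like this. The format and field names here are invented for the sketch, not an actual specification.

```python
# Hypothetical sketch only: an invented machine-readable usage policy
# that a bulk processor could evaluate without a lawyer.
import json

policy_text = """
{
    "publisher": "example-news.com",
    "archival": "allow",
    "search-indexing": "allow",
    "ai-training": "license-required",
    "licensing-contact": "rights@example-news.com"
}
"""

policy = json.loads(policy_text)
if policy["ai-training"] == "license-required":
    # Route the site to a licensing workflow instead of reading legalese.
    print("Contact", policy["licensing-contact"], "before training on this site")
```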

There was a fascinating case that I just saw today where Meta lost a case against Bright Data. A federal judge ruled that Bright Data was scraping public information, crawling public information off of Facebook. Facebook was objecting, saying this violated their terms of service. And the judge said, well, they weren't a customer. They didn't sign in.

They weren't then subject to your terms of service. And I want to get to the precedent of that and what you think that might mean for this discussion. But just as an aside, I linked to that story this morning a couple of times.

When I went back to it later, it now had a new pop-up that said I had to agree to Courthouse News Service's terms of service before I could read the story. Yeah, I got that when I hit the page. So there are lawyers at a place called Courthouse News who I guess are reading this and figuring it out. But that's saying that what is otherwise public now gets a new kind of wall around it. Now we don't just have a paywall around it. We now have a TOS wall around it.

And it's going to be just an ongoing war. You're not a lawyer and don't play one on TV, Rich, but do you think this Meta case has any impact on the discussion at large? I haven't looked at the particular Meta case.

I know you sent that to me this morning, but I haven't had a chance to look at it. But I do think it's interesting because we did have a publisher come to us and they sent us an angrily worded letter. They said, well, you must have logged into our website and you must have signed up for it. You must have accepted our terms of use because otherwise there's no way you could have gotten this content into your index. 

And I just kind of sighed and thought, well, geez, it's a shame that they don't understand how their CMS works. But publishers typically will put content on the open web for a period of time, whether it's 24 hours or seven days or 30 days, so that Google will come by and index it. And then they might pull it back behind their pay gate after that point, so that when you click on it, you get the wall and you have to sign up and pay money to see the article. But they dangle it as bait in order to get the clicks. And during that time, CCBot will often come by and say, oh, here's a free article, and then make a copy of it.

And it has no idea that in the future it's not going to be available anymore. So, you know, it's sort of already the case that this happens at scale on the publishing web. That's fascinating. I love how you explain that, because it makes a lot of sense. I mean, it makes a lot of sense how confusing that could be for someone who works for an organization but doesn't quite understand exactly how it's working behind the scenes. You're doing it above board, and yet what they see is, no, that's content behind a paywall.

How the heck did you get access to this? At the end of the day, what do you want? Like, are you hopeful that organizations like these publications come around? Or, where we're at right now, is the direction you see this heading one where that fear prevails and this really does break the internet the way you were talking about earlier?

I am hopeful. Well, I'm going to try and say this in a positive way, but maybe with a little bit of an edge to it. I think that publishers will come to regret asserting their right to be forgotten. I think these are hasty decisions, to opt out of these indices, to opt out of our archive, to opt out of a chance to learn how to explore this new technology over the next few years. It will consign them to darkness, to the dustbin of the internet.

It's not a forward-thinking model for them to embrace. We were talking internally, and some folks within Common Crawl said, well, it's a shame if these big publishers with all this great content want to take this stuff out of the archive. I said, that's true, but I guarantee you I can go find another N billion pages on the internet to replace them.

We're taking like 5% of the internet each pass. There's plenty more I can go get to make up the gap. Our current archive is 10 petabytes, but we'd love to 10X it if we could. I'm sure there'll be plenty more content out there that we can use to fill the gaps. Do you think it was a mistake, or bad precedent, for OpenAI to do deals with Axel Springer and the Associated Press to pay for content? Doesn't that set up this:

Okay, how much is mine? I can't comment on their business decisions. It seems obvious that they would go that route. I'm concerned that if the New York Times wants to go charge folks like OpenAI, great, but what about the open source models? If somebody at Stanford is making an open source model, they don't have the money to go pay for this premium content.

Does that mean that the premium content won't be in the open source models, that it's only going to be in the premium models from the big companies that can afford to pay these license fees? That seems like a big problem to me. Yeah, I think there's a huge risk here of regulatory capture. 

And the big guys being the only ones able to do anything, because of liability, because of paying for content, because of regulation. When I was at the World Economic Forum event, the AI Governance Summit, a few months ago, there was much discussion of open source models. And there's one school of thought that says it shouldn't be open source because then the guardrails can be foiled. I think that's ridiculous.

Most people do think that's ridiculous. And Yann LeCun, for example, at Meta says that if we don't have open source, we won't have accountability. We'll only have a few big companies doing this. And that's the end, not just of the open web, but of the nascent web.

We don't know what can be built. Yeah, that's right. And I completely agree. 

So it's pretty depressing, all in all. But I think this is so complicated and head-scratching, and it involves these companies that some people just don't like, because of their technology, or they're big, or they're crazy, or whatever. It's like a rebellion. It is. It's about our web and our speech and our knowledge.

And that's the level at which we have to have this conversation. It's not evil AI company steals our knowledge and screws it up. It's that we're all going to be using these tools.

And if these tools are ill-taught, to use the anthropomorphism, it's going to affect us all. They're going to be lesser. You said before, Rich, that you think the LLMs will be the new search engine. Given that they aren't good at facts and tend to be stale,

When do you think that day comes? When is it a new search? Well, LLMs, I mean, they are an index, you know, and they are an index of the web and these other datasets. And, you know, the more popular the concept, the more accurate they are. So they do know some facts, but as you get to the periphery of the topic, you know, the less accurate they become. But that's true for a regular search index, too. 

Right? You know, there's lots of topics you can search for on Google where you have to wade through a lot of inaccurate results. Right, true. You know, so you're using your human brain to filter the results and decide what's accurate. And you kind of have to do that with the LLM as well. You know, it's maybe a poorly paid intern, and it comes back with a quick answer.

But, you know, you want to check its work maybe before you run off and go in front of a judge with it or something silly. But it's definitely a search index. And it comes back with a quick, concise, and summarized answer, basically from the entire corpus, which I just think is a really cool trick. And, you know, I'm AI positive.

I think these things... for me, it's not inappropriate to use the word "learn" as a synonym for "index." And I've got my little analogy: a robot walks into a library; it should be allowed to read the books, just like you or I should. And we're not far away from that, where robots will be walking around, and if they see something like a sign, a poster, a book, will they have to avert their eyes? Or if they walk into a store and there's music playing in the background, will they have to turn off their ears so they don't inadvertently hear copyrighted music? It's absurd to think that they would have to do that. And we might be there already, right? There's, like, a zillion Teslas driving around on roads with cameras picking up stuff. They're seeing things like billboards; they're ingesting this content. It's theoretically in all kinds of data sets already. Ring cameras, things like this. There's just tons of content being archived all over the place. And, you know, you could train models against all this stuff, which I think is really cool.

But the number of agents that are going to be operating, you know, walking around or flying around or whatnot, is going to vastly increase in the near future. And I think it's in our interest to allow these things to learn as much as they can, if we want them to be helpful. I think that's a really great way of illustrating the predicament. And I think at the core is just this idea of what it means to be human and to have these rights as a human, and then to also have a machine or a computer or an AI or whatever performing similar tasks, and yet looking at it through a different lens: it's okay here, but it's not okay over there. Well, as you say that, Jason, it's also that the computer can be an agent for the human.

And so limiting the machine limits our use of it. Let me ask one more question, Rich. I'm curious: Common Crawl operated to a great extent kind of under the radar. People who were savvy knew about it, but I'll bet some of the publishers who were in it didn't know about it. Other people didn't know about it. Now they're finding out a lot more.

And on the positive side, I would imagine that attention to a good cause might bring funding and attention and new projects. In your dreams, if you're not too distracted by all the current fights, what can Common Crawl become? What do you want it to be? Yeah, we do have, I would say, an ambitious set of ways we would like to improve the index. Specifically, we've had many requests to push into underrepresented languages.

This is a big topic within the EU. You know, the big languages are well supported. I think Common Crawl is about 46% English. You know, we've got lots of German and lots of Spanish, but if you go into the smaller EU languages like Catalan, we have some, but do we have the right proportion?

I don't know. We should measure it, and then we should explore. And things like, say, you know, there's like 500 regional languages in India, and I'm sure we're not doing the best job we possibly could there. We had a request to crawl more deeply within Indonesia.
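Measuring that is tractable in principle; here is a rough sketch of counting language proportions over a sample of extracted page text, assuming the third-party langdetect package. The sample strings are placeholders.

```python
# Rough sketch: estimate language proportions in a sample of page text.
from collections import Counter
from langdetect import detect  # third-party: pip install langdetect

extracted_pages = [
    "The quick brown fox jumps over the lazy dog.",
    "El rápido zorro marrón salta sobre el perro perezoso.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
]

counts = Counter(detect(text) for text in extracted_pages)
total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {n / total:.0%}")  # e.g. en: 33%, es: 33%, de: 33%
```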

So pushing more deeply into these other underrepresented markets is a big focus. We want to increase the quality of the crawl. We have a bunch of duplicate pages within the crawl that are sort of like fuzzy duplicates. 

So they're not byte-for-byte duplicates, but if you go and extract the useful text from the page and then compare it, you know, it turns out that many of the pages that we've collected are the same. And so we could actually increase the token density of the crawl, even within the existing footprint, by removing those. And then we'd like to make the crawl bigger. You know, we're crawling three to five billion pages each cycle. I would like to 10x that.
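To make that concrete, here is a minimal sketch of one common way to catch fuzzy duplicates: comparing word shingles of the extracted text with Jaccard similarity. This is illustrative only; the episode doesn't describe Common Crawl's actual deduplication pipeline.

```python
# Minimal sketch of fuzzy-duplicate detection: byte-for-byte hashes miss
# these, but shingle overlap on the extracted text catches near-duplicates.
def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles in the extracted text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "Breaking news today: the council voted to approve the new budget plan."
page_b = "Breaking news today: the council voted to approve the new budget plan. Subscribe now!"

# Near-duplicates score close to 1.0 and can be dropped to raise token density.
print(jaccard(shingles(page_a), shingles(page_b)))
```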

There's just more of the web out there that we could be getting each cycle. How much were you crawling five years ago? Let's say, if you had to pick a number, how much has that increased over time? I think it's been in that range, like one to three, three to five billion. I don't have the historical stats. But the total archive is about 250 billion pages at this point.

There's a whole bunch of charts and graphs and stats on our website that you can go through. Well, Rich, this has been fascinating. I'm so, so happy that we could actually have you on the first episode of this show. I think the work that you're doing, as you've illustrated, is really important. I am an archivist at heart. So, you know, anyone that's actually paying close attention to the archival of things that are incredibly important, like the internet, like the data out there... Obviously there are things to figure out over time as the data gets used in different ways that maybe we didn't see before.

But that doesn't make it any less important. And I just really appreciate that you and the folks at Common Crawl Foundation are doing the work that you're doing. Thank you so much for coming on and talking all about it. We really appreciate you. Thank you very much, Jason. And thank you, Jeff. 

Thank you, Rich. Yeah. And before we let you go, I don't know if this is for Jeff or Rich, but you talked a little bit earlier about a possible event. If people want to be sure that they don't miss information around this, because it sounds like it's still in the planning...

Is there a place that you would want to point them? Just would that be like Twitter or Mastodon or anything like that? Yeah, we'll put it out on the Common Crawl Twitter. And I'm sure, you know, Jeff will tweet it out as well. Of course. 

Tweet and Mastodon and threads and all the things. Don't forget Blue Sky. And Blue Sky, that's right. 

Don't forget Blue Sky. It gets forgotten. Rich, thank you again. Thank you, Rich. It's been wonderful talking with you and really nice to meet you. 

And we'll have you back in the future. Thank you again. Awesome. Thanks a lot. My pleasure. All right. Keep up the great work. All right. 

And that was that, our first interview. Awesome stuff. Relevant news even. Yeah, absolutely.

Bring in the heat. So when we were testing this show in the club, we kind of had a little news area ahead of the interview. And then we thought, you know what, that's maybe a little awkward, because then we're asking the guest to kind of hang tight, or, you know, maybe they don't have a lot to say about certain things. But we do want to have a little bit of a moment in the show to be able to highlight at least a couple of things that we've been following and tracking in the world of AI.

So why don't we do that now, Jeff? And I think, by the way, some weeks we may not have a guest and we may have so much news going on that that's what we'll discuss and explore. So we're figuring this out as we go. Yeah, totally. I mean, this is a work in progress.

This is clay that we are molding, building the ship as we fly it, all the things. But I did find a story, and I know maybe it's a little stale, but maybe not stale, as it actually has some current newsiness to it. It was announced at CES, this Rabbit R1 device designed by Teenage Engineering. And I am, like, super interested in this. Here, I'll pull up a little article and show it off. And I'm sure you've taken a look and done some thinking around this, but it's this $199 device that is meant to be an illustration or an idea of what a dedicated AI device might be. We've seen the Humane AI Pin. I think right now there is a lot of head scratching and figuring out as far as what would an AI hardware device, or an AI-specific assistant that isn't our phone, look like.

What would that actually be? And the Rabbit R1 seems to be doing something right. I mean, they had 50,000 pre-orders in five days. It is only $199.

So the price of entry is pretty low. But you know, having that design aesthetic from Teenage Engineering gives it kind of a cool retro factor. It's got this little scroll-wheel navigation, a 360-degree flippable camera, this giant, almost walkie-talkie-like button on the side. So you can push to talk to launch what they call rabbits, which are scripts.

And I don't know. I'm interested. I pre-ordered, so I'm going to have one. You did? I did.

I did pre-order, although it's probably not going to get to me until sometime in June. The story in Fast Company was great, because it says the design drove this, because the design is fascinating. And the story has close-ups of the various elements, of the big button, of the round thing, of how that operates. And I'm glad for that.

Because I don't think we've seen. Yeah, we all have our Android phones. You and I have Android or iPhones. 

Those of you who are apostates to the Android cause. But let's be honest, we all have been bored by the design steps that were made in recent years. A phone is a phone now. So part of the interest of this device is not just the AI, because it's kind of unclear how really useful it's going to be. How is it going to really relate to the applications? Are we going to trust it to do things for us? All those are questions, but it looks cool.

It looks fun. I remember holding an iPad for the very first time, and I called it hand candy. Right. It was just neat to play with, and so on. So I'm excited about seeing a device, even if it does turn out to be fairly useless, and it's only 200 bucks, try to rethink the aesthetic design, the physical design. The fact that AI motivates it is kind of fun. Yeah.

And actually they had some news about this within the last few days, well, this was just five days ago: Perplexity AI's LLM is going to be driving this; they've got a deal with Perplexity. So essentially, the hardware is $200. If you were to buy a year's worth of Pro access for Perplexity, that's $200. So basically, if you buy this, you get a year's worth of Perplexity Pro included. So, depending on how you look at it, it kind of pays for itself. Perplexity is an LLM that I really need to dive into, because I haven't used it.

And this is kind of my launch pad to do so. I've used Claude a lot. I've used ChatGPT, of course, but have not used Perplexity. But apparently it has no knowledge cutoff. So the information... you know, we were talking earlier in the interview about how much of a challenge that is, and supposedly Perplexity's system doesn't have that knowledge cutoff. So it's very up to date, which makes sense why Rabbit might want this in their hardware AI device, so that if you're launching, what do they call them, the rabbits, or whatever those agents are, the information that you're actually getting back is up to date. I think that's so essential on these things. And we'll see. I'll be eager to see how it works.

But you don't really need a device to do this. But how do these agents work in terms of taking over an application, you know, making the proverbial dinner reservation or booking the plane flight or buying something? Yeah. The key thing, and I heard this a lot at the World Economic Forum that I attended, is that the stage we're in now is generative. The next stage is autonomous. But when are we going to trust the thing to be autonomous? And by the way, if it messes up, whose problem is that? Same with Elon Musk and his cars, right?

When do we cross autonomy with AI? It's going to be a very interesting thing to follow. Very interesting. Also very interesting is the study that you threw in here. Tell us a little bit about it. So, out of TechCrunch: MIT's CSAIL, the Computer Science and Artificial Intelligence Laboratory, put out a study about what jobs in fact will be stolen.

And there's been a lot of talk about this, obviously. OpenAI in one breath says, you're all going to have more time to hang out with Grandma, because you're not going to have a job, because the machine is going to do it all for you. And they put out a study listing all the many jobs that are going to be affected. And you hear now in political circles that this is one of the things we have to watch out for, how this is going to replace jobs. On the other hand, what CSAIL comes along and says is that it's not going to be as bad as we thought, because it's not going to be economically feasible to replace some of these tasks. What they focus on, which surprised me, was visual monitoring of things. That an AI can look at something and say, yes, that lemon is good, or no, that lemon has a blemish.

And how does that happen now? Or a baker, as they said. So they estimated that for some proportion of these tasks, it's going to be more efficient and economical to still have humans do them than have the machine do them.

That we're not near that point, once you had to install the machine, and figure out the machine, versus somebody who's trained to do something. A baker spends 6% of their time checking food quality. That's something that could be automated, says the study. And a bakery employing five bakers making $48,000 a year could save maybe $14,000 were it to automate the food-quality-check portion of their jobs.

But the study says a bare-bones, from-scratch AI system up to that task would cost $165,000 to deploy and $122,000 per year to maintain. Oh boy, yeah. So it's a little bit worrying to hear that we're still the cheap labor. And that's why you hold on to the meat bags at the end of the day. Exactly.

Put the meat bags to work. But it says that we find that only 20% of the wages being paid to humans for doing vision tasks would be economically attractive to automate with AI. Humans are still the better economic choice for doing these parts of these jobs.
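For what it's worth, the baker figures as quoted hang together as back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope check of the figures as quoted from the study.
bakers, wage, time_share = 5, 48_000, 0.06
labor_saved = bakers * wage * time_share  # 5 * 48,000 * 6% = $14,400/year

deploy_cost = 165_000   # one-time cost of the AI system, as quoted
running_cost = 122_000  # per-year maintenance, as quoted

print(f"Wages automated away: ${labor_saved:,.0f}/year")   # ~$14,000
print(f"AI system: ${deploy_cost:,} up front + ${running_cost:,}/year")
# The system costs far more per year than it saves, so the humans stay.
```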

So I think the answer in this, as with so much with AI, is we don't know. People rush to these presumptions that, oh my God, it's going to get rid of 80% of jobs. I think Goldman Sachs had some high number here about how many jobs, it was 25% of the entire labor market in the next few years that could be automated. Take all of that with a grain of salt the size of Utah.

And I think don't jump to conclusions about what these impacts are. Yes, we want to watch. Yes, we need to be careful. Yes, we need to watch all this. 

I'm not an accelerationist who says let it all happen with no guardrails. But don't presume the worst, because then you're going to try to cut off the opportunity as well. Well, yeah, it's kind of similar to the conversation we were having with Rich about the data set. It's like, you know, the rush to panic and to remove the data.

What does that actually do long term to the benefit, the positive, the upside of that data existing in the first place? And so it sounds like what this report is saying is the job replacement thing, whatever you want to call it, might happen in some ways, just not as quickly as people might expect. Like, there might be a time somewhere down the line, you know, years and years down the line or wherever that is, where the costs do come down potentially, or where replacing human workers with autonomous workers becomes much easier.

But we're nowhere near that at this point. And actually, in reading this, what was top of mind for me was: in preparation for this very show and launching this show, I used generative AI tools to come up with baseline ideas for imagery and stuff like that. And then I hired a graphic designer, actually a friend of mine who is a very talented graphic designer said he'd do it for me, and I passed information on to him. And he was the one that actually created this. Like, could I have relied upon the AI to do these things?

And, you know, that's another graphic designer that doesn't have the work in their lap. And I'm sure people do that. Yeah, I suppose I could have.

But at the end of the day, I'm really happy I didn't because the quality that I got on the other side from the human attention to it was just much greater. And I don't know, maybe that changes down the line, but I do not believe that we are there yet. I like Rich's word hasty. I think we just need not to be hasty. Yes. 

And see what things can do. And we're still in charge. They haven't taken over yet. And we'll be back another week.

Yes, exactly. And we will be back another week here at AI Inside. I want to give another quick thanks to our guest, Rich Skrenta from Common Crawl. It was so great having him with us to talk about the Common Crawl data set, the importance of archiving, the ways that it's being used, and the unexpected nature of how it gets used. It's just really fascinating.

So, commoncrawl.org to check out more of their work. Jeff, this was a lot of fun. We should do this again next week. Great fun, my friend. Yeah, we will. And we're gonna.

We're gonna. That's right. Each and every week. As we were preparing the show, we were likening it to feeding the beast. It's like, all right, we've created the beast. Now it's time to feed it.

So we will feed it each and every week. Jeff, what do you want to leave people with? This is your opportunity to plug anything you've got. My plug is just GutenbergParenthesis.com, where you can find and buy both my current books, The Gutenberg Parenthesis and also Magazine, an elegy to the form, just at a time when Sports Illustrated is dying and Time magazine is laying off and more magazines are shrinking all over. It's time to pay tribute to them, which I do.

So go there. There's discount codes and I'll be grateful for anybody who gets it and gives me any reaction. Excellent. 

Well, you are a fantastic, fascinating person and I'm so honored to be able to do the show each and every week. Same back, my friend. Thank you. 

It's so great to hang out with you. As for me, you know, I'll just say AI inside dot show. I mean, I realize you're watching the show right now. So maybe you already know that, but we do have a web page that has all the information that you might need to subscribe to this show. You know, straight RSS, all the podcatchers, every episode, audio and video will be published there. 

So just a few minutes of boring stuff here at the end of episode one just to kind of set the stage. But we do record this show live every Wednesday. So the release date for this podcast will be Wednesdays. 

Our normal record time is 11 a.m. Pacific, 2 p.m. Eastern, and we'll stream the podcast live when we do that. Just search for Yellow Gold Studios on YouTube or on Twitch and you will find it. But next week's a little different. We have one of those situations where we have to change the time. So next week we are recording live Tuesday, January 30th, at 1 p.m. Pacific, 4 p.m. Eastern. And this might happen; when we were setting up for the show, it was like, yes, we've got the standing time.

Yes, live is important. But at the end of the day, when we're inviting certain guests on and everything like that, sometimes we have to, you know, attend to their schedule requirements, or we might have schedule requirements, and we can be nimble. So we're going to do that. We will communicate the times of the live recording on our socials in advance, so you can find us on socials or, you know, search for Yellow Gold Studios.

If you go to the YouTube channel for Yellow Gold Studios, I will be creating the live event well in advance of the actual next episode, so you can go there and follow it from there. So just keep that in mind. But subscribe at AI Inside dot show. You can support us directly via Patreon; we do have a Patreon there, patreon.com slash AI Inside show. And then just do a search on all the major socials; you'll probably find us if you search for AI Inside show or Jeff Jarvis or Jason Howell. And you can also actually send us an email if you like.

There is a place on the website to contact us, or you can send an email to contact at AI Inside dot show. All right, I think that's about it. We've reached the end of this episode of AI Inside. Thank you so much for joining us for our inaugural episode, and thank you again, Jeff, for being here each and every week. Thank you, boss. Yay. I'm so happy it was a success, and thank you, Rich. And we will see y'all next week on AI Inside. Bye, everybody. Thank you.