[00:00:00] A unique dynamic occurs when we're talking about generative AI. With LLMs, image generation, and video generation, companies and their models are scraping the Internet for millions of works that belong to other people: articles, conversations, film content, artwork, tutorials, all human-created content meant to provide value in exchange for people's attention, in the hope of monetizing that attention and the reputation gained from the content being valuable in the first place. But now, with generative AI, we can get the use of, and essentially access to, all of the value in that content, in those art styles and those answers, without ever seeing attribution for where it came from, and without the proceeds being shared with its creator or with the person who researched and found that answer. It can now be recreated, mimicked, or even pulled back out of these generative models verbatim.
[00:01:16] What does that mean for AI? What does this mean for copyright and its sustainability in this new landscape, this new gray area in a changing Internet? And is the only solution the violent hand of the state? Or is there a technological or cultural solution to this technological problem? And it can be argued that OpenAI has become the ultimate middleman. It never even sends its users to the content or the people that made its product valuable to begin with. So is OpenAI a breakthrough or is it just a different kind of scam?
[00:01:57] Let's unpack it in today's episode of AI Unchained.
[00:02:10] So I have been wanting to explore something for a little while, and I've been thinking about OpenAI from another perspective. There are all sorts of conversations about the ethics of plagiarism and copying, and about what kind of data they're using, as well as a whole range of lawsuits against OpenAI for using proprietary data. Now, many of you who've been listening to me for a long time know that I don't really believe in intellectual property, at least not in the state-enforced sense. I believe all you have the right to do is protect that information the same way I protect my passwords. But there is something really interesting about what OpenAI is doing, how they are using the information, and the fact that they're not open at all.
[00:03:11] And so I wanted to dig into a lot of the conversations around plagiarism and the ethics of copying other people's information to train a model, what I believe that entails, and why I think we can see OpenAI as a scam. Not in the sense that they don't deliver a product, or that they're lying about what it can do, but in that they have positioned themselves to have their cake and eat it too: using a whole bunch of other people's proprietary information, tons of stuff they did not have permission to use, and they keep doing it. There was a recent controversy about the voice they used for GPT-4o. OpenAI really loves the take-anything-you-want-for-free side of the idea of an open Internet, open source software, and weak intellectual property, as long as they don't have to participate in the give side of that equation at all. And that's what I want to talk about in this episode: the scam of OpenAI. For those of you who don't know, this is AI Unchained. Welcome. I am Guy Swann, your host, and this is where we explore the concepts behind AI, and how we can get to a decentralized, self-hosted future where we are in charge, where we are in control of these tools, and not just at the whim of some giant centralized server somewhere, as has increasingly been the case. A huge thank you to Swan Bitcoin, the best place to buy Bitcoin easily and literally set your life up on a Bitcoin standard. Whether you're just buying a couple of dollars' worth of Bitcoin on a normal recurring savings plan, or you want to put your business or your retirement on Bitcoin and go the full monty onto a Bitcoin standard, Swan Bitcoin is a fantastic place to do all of it. Check them out at swanbitcoin.com/guy. And of course, you've got to keep that Bitcoin safe, and you've got to hold your own keys. Do not leave it with Swan, do not leave it with an exchange, do not leave it anywhere. Hold it on your ColdCard hardware wallet, and you can get a discount with code Bitcoin Audible. The details are right in the show notes. All right, so let's get into this. There's a recent AP article getting into the slew of lawsuits being brought against Microsoft and OpenAI for their use of proprietary data, in which the companies argue that this all falls under the fair use doctrine, an argument I believe they have even won with, or at least made headway with, in one case. But I want to read the introduction of this article in particular, and if you want to read the pieces I'm referencing, the links will be in the show notes. It starts off: A group of eight US newspapers is suing ChatGPT maker OpenAI and Microsoft, alleging that the technology companies have been purloining millions of copyrighted news articles without permission or payment to train their artificial intelligence chatbots. The New York Daily News, Chicago Tribune, Denver Post and other papers filed the lawsuit Tuesday in a New York federal court.
[00:06:53] "We've spent billions of dollars gathering information and reporting news at our publications, and we can't allow OpenAI and Microsoft to expand the big tech playbook of stealing our work to build their own businesses at our expense," said a written statement from Frank Pine, executive editor for MediaNews Group and Tribune Publishing.
[00:07:15] Now, one of the earlier lawsuits, and I may have actually talked about this on the show, maybe in a conversation with Tvetsky or something. But I remember an argument, I believe it was from the New York Times, that you could essentially prompt out the entire article, and this was shown as proof that OpenAI was using their articles. Basically 90, 92 percent, something like that, almost the exact same article the New York Times had actually published, if you had a prompt that would lead ChatGPT to pull out those weights. And my original thinking was that this was kind of a silly argument, or that the claim made no sense. Because if you were trying to get the entire article out, and trying to say you don't have to go to the New York Times to read the article, well, that's ridiculous: you already have to know what's in the article in order to construct a prompt to get that article back out, if you're actually trying to recreate it from ChatGPT.
[00:08:40] But I had a very shallow understanding of the situation, and more specifically a shallow understanding of what the complaint was, and probably of the article too, or whatever it was I read. It may even have been a tweet thread; I probably hadn't dug into it at length, or maybe I just didn't pay close attention. But the claim was more that the weights, essentially the quality of the ChatGPT output, were built off of the New York Times' articles. Now, I come from the position that there's no such thing as intellectual property: nobody can copy something and have stolen it from you. If I steal your car, I've taken your use of the car; if I quote unquote steal, if I copy, something that you own, some data, I haven't taken it from you. In fact, you wouldn't even know unless I publicly announced or showed that I had copied it. And I think OpenAI has a legitimate argument for the fair use doctrine, in that they are modifying the work in such a way that the end result isn't the same thing. You could say that's the nature of the LLM, right? And all of this is perfectly accessible on the web. In a sense, we are all pseudo-LLMs, in some ironic, really abstract, unspecific way. I read a bunch of articles and then I build a model for how I think about a certain situation. Take OpenAI: I read articles for them, against them, asking whether the ethics are bad, and I start to compare that to rationalizations or logic I've used in previous circumstances, to what I know about economics, what's sustainable, what makes sense in this situation. So I'm building a model of this event, a model of what their whole business is, from
[00:10:59] very shallow information, really. I'm just building it from stuff that I read on the Internet. Most people aren't really aware of how vapid most opinions are, and how little information the typical person really has about a situation. So my writing is going to be influenced by a bunch of New York Times articles that I've read, and you can make the case that the LLM isn't really a whole lot different. They publish it so that people can read it. So is the LLM reading it bad? Is that different somehow? Well, here's the thing. The thing that has come to make me change my opinion on this element of how LLMs work and what OpenAI is doing is that, as LLMs have continued to progress and we've learned more about them, the one thing I've learned more than anything else is that high-quality data is actually the most important piece of the puzzle. It's not about just getting a ton of data; it's about getting good writing, about getting good answers to questions. That is how you train a good LLM.
[00:12:20] And then going back to the idea of fair use. Let's say I make a YouTube video and I just pull in a bunch of stuff: I mix somebody else's song, or I use their footage, or something like that. If I am using content that is not mine at length in the creation process, and very little, or none at all, is content I have actually created, well, then one thing I can't do is commercialize it. YouTube even gives me a warning that says it's fine to use this stuff, but we cannot run ads on this piece, because you have used too much of this song, or too much content pulled from somebody else's copyrighted material. And I think there's a very strong ethic in there. I don't really have a problem with someone copying a movie online, or copying a book; but copying that movie and then selling it in place of its creator, I think, is very unethical. Whether or not that justifies state intervention is a very different question. And whether or not we justify government violence in a situation like that, I think there is every market incentive to protect information in that way, to make sure the creator is actually getting paid, and not some scammer or duplicator profiting from content someone else created. And I think the incentive is there for the audience too. If somebody makes a fantastic film, what kind of idiot in an audience wants to fund the bootlegger on the corner for their favorite movie, rather than the person who created it? You may very well discourage them from ever making more movies, or simply remove their ability to do so, because they cannot make a return. This is one of the reasons why I think a decentralized economy is so important, because a huge part of the problem in today's corporate environment is that the creators don't get the money anyway. The people who are actually boots on the ground, making the film or making the art, are often three steps removed from the capital that comes in
[00:15:10] through the actual production. It's far more likely that the producers and lawyers and record labels, these are the people who really make the money, and they are not creators, they are administrators. This is why I think the gig economy, crowdfunding, and things like social media where you can actually tip and pay people directly matter so much.
[00:15:39] This is something that I really love about Lightning too. For anybody who's not super familiar with what's going on in the bitcoin space and with the Lightning protocol, there's an element called lightning prisms, where a Lightning payment can automatically be split, by specific percentages, among the different people participating in the production or distribution of that content. A great example: Jack Spirko of The Survival Podcast has been doing this for a long time, and he did one of these for me when I came on his show, which I didn't even realize; I think he told me about it right around then, and it was the first time I'd ever really seen it in practice. The Podcast Index lets you put your payment split into your RSS feed, so it's basically built into the software automatically. When somebody zaps an episode that I'm in, some percentage gets set aside for me. Maybe it was 40 percent, maybe 10; I didn't really look at the percentage, I don't remember. But let's say it's 10 percent, just for the sake of simple math: if somebody zaps him 100 sats on Fountain, or through Podcast Index, or in one of these Lightning-enabled podcasting services, 90 of those sats go to him and 10 of them go to me. And it was crazy, because I would go into Fountain and just see this extra pool of sats coming in, and then notice it was from Jack Spirko's podcast. What's crazy is that you can build networks of distribution and data sharing with that split built in. So you could actually leverage something like BitTorrent without the downside of someone distributing your content for free, or charging people for your content as if they were the ones who created it, delivering it to an audience and getting paid in your stead, so the creator doesn't get paid.
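Just to make the arithmetic concrete, here is a minimal sketch in Python of how a prism-style split could be computed. The recipient names and the 90/10 percentages are hypothetical, and real setups declare the split in the podcast's RSS value tag and settle it at the Lightning payment layer rather than in application code like this.

```python
# Minimal sketch of a lightning "prism" style payment split.
# The recipients and percentages are hypothetical examples.

def split_payment(amount_sats: int, splits: dict[str, int]) -> dict[str, int]:
    """Divide a zap among recipients by percentage shares totaling 100."""
    assert sum(splits.values()) == 100, "splits must total 100 percent"
    payouts = {name: amount_sats * pct // 100 for name, pct in splits.items()}
    # Integer division can strand a few sats; hand the remainder to the
    # first recipient so every sat is accounted for.
    first = next(iter(payouts))
    payouts[first] += amount_sats - sum(payouts.values())
    return payouts

# A 100-sat zap with the 90/10 host/guest split described above.
print(split_payment(100, {"host": 90, "guest": 10}))
# -> {'host': 90, 'guest': 10}
```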
But if you built a network where basically anyone could, quote unquote, be a seeder, kind of like BitTorrent, then anyone seeding the content is enabling that content to be available and saving you hosting fees, which might be $1,000 or $2,000 a month, who knows. When you're dealing with an enormous amount of bandwidth, especially for something like a high-definition movie, you're talking crazy costs. That's why platforms charge such insane fees. I mean, Audible charges 60 percent. Sixty percent. Audible makes more money off of my audiobooks than I do: Audible takes that 60 percent, and then the author, the person who wrote the book, and me, the person who narrated the audiobook, split what is left. So the author makes 20 percent, I make 20 percent, and Audible makes 60 percent. They did not make any of the stuff; they own the network and the hosting. They are administrators for the content that is created. Now, if you could have a decentralized version of this, a peer-to-peer network, what's crazy is that we could have an automatic 10 percent split with whoever was hosting the data. They could install the exact same software, download the audiobook, even better if they're one of the first people to download it, and when somebody else comes along and buys it, they buy it, quote unquote, from the network. Let's say I'm hosting the audiobook, a book I wrote and did the audiobook for, and you come online and buy it from me. Because I am, quote unquote, the host in this situation, I get paid the whole amount: 100 percent of the Lightning payment. You are now a seeder, because you have downloaded it and you're listening to it. But I want to incentivize you, A, to share it out, and B, to continue to host and seed my book so other people can get it. Because if 10,000 people come online at the same time trying to download it from me alone, nobody gets the audiobook; it would take days to download even a 100-megabyte file. So if somebody comes online, connects to you to download it, and pays for the audiobook, it could be built into the software with a keysend split: because they downloaded it from you, you get 10 percent of the sats and 90 percent goes to me, since I'm the one who created the content and you have assisted in hosting it. You have also taken a bigger risk. Nobody knows if the audiobook is any good; maybe I wrote a really crappy book. You took a risk in buying it first and then hosting that information, not even knowing if anybody wanted it. You couldn't read any reviews; you took a risk on me as a creator. So I'm perfectly happy giving you 10 percent, both for making my content more available on the network, and for taking a risk on something whose quality you had no idea about. Now, there's an explicit reason to incentivize these things, as in the sketch below.
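Here is a hedged sketch of what that payout logic might look like in such a client. To be clear, no such network exists yet; the flat 10 percent seeder share, the function names, and the keysend-style payout list are all assumptions for illustration.

```python
# Hypothetical sketch of a seeder-rewarding purchase in a peer-to-peer
# content network. Not a real protocol; it only illustrates the incentive:
# the creator gets most of the payment, and the peer who actually served
# the download gets a cut for providing hosting and bandwidth.
from dataclasses import dataclass

SEEDER_SHARE_PCT = 10  # assumed flat cut for whoever served the file

@dataclass
class Purchase:
    price_sats: int
    creator_pubkey: str  # the author/narrator being paid
    seeder_pubkey: str   # the peer the buyer downloaded from

def payout_instructions(p: Purchase) -> list[tuple[str, int]]:
    """Return (pubkey, sats) pairs for a keysend-style multi-payout."""
    if p.seeder_pubkey == p.creator_pubkey:
        # Buying directly from the creator: they keep 100 percent.
        return [(p.creator_pubkey, p.price_sats)]
    seeder_cut = p.price_sats * SEEDER_SHARE_PCT // 100
    return [
        (p.creator_pubkey, p.price_sats - seeder_cut),
        (p.seeder_pubkey, seeder_cut),
    ]

# A buyer pays 1,000 sats for an audiobook served by a third-party seeder.
print(payout_instructions(Purchase(1000, "creator_key", "seeder_key")))
# -> [('creator_key', 900), ('seeder_key', 100)]
```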
And if this software and this network had tons of creators, think about the difficulty of leaving it to go get content somewhere else, off a different platform, when this one is easy and you can actually get paid 10 percent, versus 100 percent of a tiny bootleg network where the creators aren't going to go, because their content would literally be, quote unquote, stolen from them, taken illegitimately. I think all the incentives are perfectly aligned to create something like that on the free market, and that it would be very likely. I think DRM would happen perfectly naturally, because people want to be rewarded for their work, and more importantly, the audience wants to reward the people who make the stuff they like. I think people really discount that. Sure, people like to get stuff for free, but they also like the fact that the stuff they enjoy exists. As someone who in the past was a prolific pirate of material, I'm also one of the people who has purchased more media and more content than anybody else I know. So I was not only someone who got a lot of stuff for free; very often the stuff I got for free, I already owned. It just sucked having it in DVD or Blu-ray form, because I wanted it on Plex. I wanted it
[00:22:52] basically under my control. I wanted to feel like I owned it. And it was very difficult. I remember the first time I ever bought something on the iTunes video store, back when it was still iTunes, and it was for the Apple TV, one of the very early, crappy versions.
[00:23:14] What was it? It was the first season of Archer, I think, when it came out. I don't know. It was a season of something I really enjoyed. Who cares?
[00:23:26] And I remember thinking, oh, this is great, now I've finally got an easy system for this. Then I wanted to move it to my Western Digital media player, basically a hard drive with a media interface on it, and I realized I couldn't actually get it off of iTunes. Then I tried to watch it on the PlayStation. I went through all of my media players, all the devices I typically watch media on, and I had access to it on none of them. I could literally only watch it on my Apple TV, which was being a pain in my butt at the time. I hated it. Now I love it; the Apple TV is the only one that's actually decent, the Amazon sticks are all slow, and Roku, I hate it. But regardless, I despised the total lack of control I had over the content, and I ended up pirating it anyway, just to get it in the form I wanted, so I could move the file around, do whatever I wanted with it, and watch it on any platform or device I had access to. Now all of the devices have essentially caught up, so you can do that if you purchase or subscribe to the content, but you can't watch it offline. So to this day, stuff that I have access to, stuff that I have purchased, I end up pirating, because I want it under my control. I don't feel like I own any of it. And funny enough, the only things I can quote unquote own are things I technically got illegitimately; hilariously, it's stuff I've already purchased in some other form. So the technological environment really alters the dynamic of these things. And LLMs are a very interesting new element in this whole thing, because it's another gray area on top of the gray areas we already have. So where does that leave OpenAI's defense? The New York Times, the Chicago Tribune, all of these publications spend an unbelievable amount of money collecting this information, paying writers, paying for investigations and research, and then you can get it all through an LLM by just typing in a prompt. And I don't necessarily mean getting the exact same article, as in the example originally used; my original thought was that you can't really read the exact article, because you already have to know what's in it to recreate it. The point is that you can pull the information out without going to the article. You can pull out the context, or the potential answer to a question, without ever reading the article or going to the creator, the producer, who actually made that information and that quote unquote answer available.
[00:26:28] It's very much like using a middleman, where the middleman doesn't actually connect back to the original source.
[00:26:36] The more I think about it, the more the situation looks extremely analogous to someone copying a movie and then selling it to you. Again, pirating is giving it away for free, just copying it among other people; so this is slightly different, because the pirate is not trying to profit illegitimately by defrauding a customer. And that's what I think this is. If I'm passing off this movie as mine when it isn't, or selling it to you under the guise that I'm selling a legitimate copy, there is an implicit statement, an implicit claim, that you are paying the content creator for the content you are receiving. So a middleman who doesn't send any funds back to the creator for the purchase or use of that material is essentially defrauding the customer of that unspoken agreement.
[00:27:40] This episode is brought to you by CoinKite, the makers of the ColdCard hardware wallet, including their new version, the ColdCard Q. I have multiple: I think three of the ColdCard Mark 4s, or maybe two Mark 4s and a Mark 3. I've had the ColdCard for a long time, and I love this wallet. If you want to keep your Bitcoin safe, and know it's safe, you want a Bitcoin-only device that is just no-nonsense and has all of the advanced, edge-case security features you could want. If you're going to do a more complicated setup, or you want to protect against multiple different threats, you want a brick-me PIN, a fake PIN that destroys the device if somebody punches it in, so that you can always recover from backup. And say you want to activate NFC on the device and tap-to-pay from your phone, like I do. I absolutely love this setup. I plug it in, it says ready to sign, I sign a transaction, tap it to my phone, hit send, then tap it back to my phone to send the transaction, and boom, I've moved my bitcoin. This is how I do my business; it all runs off my ColdCard. I get paid in bitcoin, and I pay my employees who help me run this show in bitcoin. And there's the combination of that ease of use with the security of knowing that if my phone is stolen and somebody gets access to the device, my keys are fine, my bitcoin is fine, they can't do anything with it. If my phone is destroyed in a flood, or I drop it off a building, I restore my wallet and my ColdCard can still sign for everything. I've lost nothing. There is nothing like the security of knowing your bitcoin is safe and accessible when you need it. Check out the ColdCard, the TapSigner, the OpenDime, the SatsCards; there are so many fantastic bitcoin security
devices at coinkite.com. The link and details, plus discount code Bitcoin Audible, are right in the show notes.
[00:29:45] So the thing about OpenAI is that they're essentially making the fair use argument, but then they are charging for it. Their LLM is of high quality specifically because of the proprietary data they trained it on, and then they turn around and absolve people of the need to actually go to those sources, or to learn from the New York Times themselves, in order to summarize or recreate their writing. Think about it: I could punch in a whole bunch of different data points and sources for pieces of information or important research about a topic, and then say, can you write me a short article in the tone, and with the professionalism, of the New York Times? And I can do that; it will literally sound like a New York Times article. The same goes for a popular author, maybe even a book author like George R.R. Martin, who is also suing them, because it has clearly been trained on his books. You can get it to write in the wording, the terminology, the vocabulary, and the style of George R.R. Martin if you want. Now again, in a general sense, I don't think that's necessarily wrong, except that they are charging for it, and they refuse to release the weights that they trained with other people's proprietary data. That is what gets me. And what makes it even worse is that they are OpenAI.
[00:31:35] They pontificate, they spout bullshit all the time about making AI available to everyone, open and accessible and safe, and yet they've done the exact opposite of being open. Even worse, on top of being closed source, charging people, and so blatantly sucking up everyone's data and using it to be profitable, on top of all of that, they closed-source the model. They are extremely secretive and closed off about letting anybody have access to their model, so that they keep the biggest lead and stay ahead of everyone else. So the analogy that, in my mind, helps picture exactly what is happening: I take a whole bunch of content from other people on YouTube to make my YouTube video, and my video is unique in some way, it really draws a lot of customers and people to watch it, but none of the content is actually new; what's new is simply the way I put it together. So I argue that under fair use I have the right to do this, and in a general sense I do not disagree; I think that's perfectly acceptable. Except that then I put it behind a paywall, start attacking other people, and do everything I can to make sure nobody can download my video and do the exact same thing with it. So I'm okay with fair use when it benefits me, but I'm going to crack down, I'm going to make sure it's on lockdown, so that nobody can fair-use what I made from the stuff I took from somebody else.
[00:33:32] That's why I can't help but frame it as kind of scammy. They're trying to have their cake and eat it too, and they constantly spout a bunch of word-salad nonsense about open source and open access when they couldn't be anything further from it. We don't know the weights; we don't even have GPT-3 or GPT-3.5, the original models. And so many others have done better. Meta has been so great about releasing their models; Llama 3 is fantastic. Grok, from xAI and Elon Musk, has been completely open sourced as a 300-billion-parameter model, or a version of the model. OpenAI is one of the only ones that literally refuses to open their weights, to make them available,
[00:34:24] to show how they are doing everything and what is being used to train their models, and to let people host it themselves. And granted, this is true of a number of other companies as well. Anthropic's Claude 3, or whatever, isn't open; Gemini and the Google models aren't open. Apple just released a bunch of micro models, a bunch of small models that each do a very specific task, but they seem to be more research focused; I don't think they're the end-result models that will go into Apple's products. So it's not as if everybody else is open and OpenAI is the only one that isn't. But at least Google and Apple and Anthropic aren't talking about opening everything up and making it available for everybody in the way OpenAI has. OpenAI was built around the idea of open access, of open-sourcing AI. They started with open source models, then they went closed source and just haven't come back, all while acting like it was going to be temporary: oh, we just have to make sure it's safe. And I want to note that I don't have a problem with closed source. A lot of people think it shouldn't be allowed, or that they should get everything for free, and I don't think that. In fact, with the quality of ChatGPT, I don't even mind paying my subscription; it's a small price to pay for something very valuable. But I also don't really use it for any of those proprietary-ish purposes. I'm not going to have ChatGPT write me an article; it feels stupid, like cheating on a paper, pointless. If I'm going to publish writing, it's going to be my writing. Even when I use ChatGPT or an LLM to come up with a description for my podcast, I do not use what it says. I use it because sometimes I forget what I talked about on an episode; I do five episodes a week, and especially if I'm a week ahead on publishing, by the time an episode goes out I've forgotten what the hell I talked about. In fact, what was I just talking about a little while ago? The idea of the lightning prisms. I guarantee you, if I hadn't brought it up right now, by tomorrow I would have forgotten that was ever a tangent I went down. So it's really helpful in that context: it can read the transcript of my episode and give me a rundown of the things I talked about. But it says everything in a really weird way, like "in this episode with Guy Swann, we explore," and it just sounds like the lowest common denominator of content on the Internet, super markety. And I just don't like that, so I don't use that wording.
[00:37:32] But it is very useful for pulling that information out. I always try to make it my own, though; I feel it's disingenuous not to, and also kind of useless. If that data ends up being used to train a model to write podcast descriptions, and all of my podcast descriptions are just generated by AI, we've got the equivalent of AI inbreeding: it's just being retrained on shit it spit out. It needs human judgment, that human touch, in order to actually improve; otherwise we're just degrading it, altering the weights it already has with nothing but the output of those very same weights. That's also why, as I said early on, talking about making tutorials, something I'd kind of forgotten about until literally just now: when you're using ChatGPT to help you troubleshoot some sort of problem on a piece of software, after you figure it out, you can basically just say, all right ChatGPT, can you write me up a tutorial, a step-by-step process of what I did to solve this problem, written as a tutorial so I can publish it online, and if somebody else has this problem, they can find it. That's an extremely useful and clever way, in my opinion, to use this productively, in a very valuable way. But one thing I think is really important, especially in the areas where it's genuinely writing and explaining, as opposed to just giving the steps: if it's purely instructional, I don't really care what it sounds like. It's just instructions, like a recipe; you don't need it to say teaspoon the way you say teaspoon. There's no style to a bullet list, and in the same way, the instructions themselves aren't the part that matters here. But all of the leading of the reader, the telling of the story of the problem, I think you should make your own. It only takes about five minutes; 80 percent of it is essentially written, so change the other 20 percent to make it something you would actually like to read, something that doesn't sound gimmicky or stupid. I think that's our obligation: bump the quality up a little, make it more human and less machine.
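For anyone who wants to try that tutorial workflow, here is a rough sketch using the OpenAI Python SDK. The model name, the file name, and the prompt wording are placeholder assumptions; any chat interface that can see your troubleshooting history works the same way.

```python
# Hedged sketch: turning a finished troubleshooting chat into a
# step-by-step tutorial draft. Model name, file name, and prompt
# wording are illustrative assumptions, not a prescribed setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The saved conversation in which the problem was actually solved.
troubleshooting_log = open("chat_transcript.txt").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You turn troubleshooting conversations into clear, "
                    "step-by-step tutorials that others can follow."},
        {"role": "user",
         "content": "Write a tutorial of the steps that solved this "
                    "problem, so I can publish it online:\n\n"
                    + troubleshooting_log},
    ],
)

print(response.choices[0].message.content)  # a draft to edit and humanize
```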
Going back to OpenAI specifically, and to them having their cake and eating it too: using proprietary data to train their model, and then closing it off so no one else can access it or do anything with it, is not the only case. I think that's pretty hypocritical on its own, but they could almost play dumb there: oh, we just wanted a really good model, we didn't really think about it. Except that they keep doing this. Another recent legal fight is actually with Scarlett Johansson, over the use, or the mimicking, of her voice. OpenAI actually reached out to her about voicing GPT-4o, their newest model that you speak to, because Scarlett Johansson was the voice of the AI in the movie Her, opposite Joaquin Phoenix. If you haven't watched that movie, it's actually really good and I highly recommend it: Joaquin Phoenix basically falls in love with, has a relationship with, the AI, and Scarlett Johansson is that AI. Well, because of that, OpenAI was like, obviously, how cool would it be if we had Scarlett Johansson do the voice of our AI, the one you can interact with, that shows emotion and is very expressive and human-like? Well, she refused. But her tonality, her phrases, the way she talks in the movie is very iconic. It's very friendly, very inviting and warm; you could even say it's kind of generic, in the sense that anyone would want their AI to sound like that. But lo and behold, when the demo comes out, the female AI voice, referred to as Sky, sounds a lot like Scarlett Johansson, and specifically a lot like the AI from the movie Her.
[00:42:31] Without even thinking about it, I 100 percent got those vibes. It's the same sort of deep warmth in the female voice, the same kind of mood and style of talking. It's extremely similar. Now, I would not say they illegitimately used her voice, but I think they were actively trying to: they were like, oh, this is a great template, a great way to think about it, so we're going to try to mimic the voice from the movie Her. So now she's taking legal action, basically saying they just did what they had asked her for anyway, even though she refused. Now, part of me thinks it's a bit of a stretch; I don't listen to it and go, oh, that's Scarlett Johansson. But I wonder if they found something more when they looked into it, if maybe it literally is a modification of her voice in some way, with the tone changed just enough that it doesn't quite sound like her, even though it might actually have been trained on it. I don't know. I would find it surprising if there were no further connection, nothing more, maybe even the words and phrases, the way it speaks being almost exactly like the movie, because I doubt she would bring a legal claim if it were simply, oh, it kind of sounds like her. But maybe that's all it is. Either way, it may very well be another example of them trying to have their cake and eat it too. Now, just so you can get an idea and judge for yourself, I've got a clip of both Sky from the original demo, and then Scarlett Johansson from the movie trailer, each from when they are first initialized, so you can hear the voice itself, the tone, the style, everything.
[00:44:23] So here is the clip from the demo, when he first initiates his conversation with GPT-4o. Mark, I'm doing great. Thanks for asking. How about you? Oh, you're doing a live demo right now? That's awesome. All right, and then here is from the trailer of Her. This is when Joaquin Phoenix initializes the AI computer. Hello. I'm here.
[00:44:51] Hi. Hi. I'm Samantha.
[00:44:55] Good morning, Theodore. Morning. You have a meeting in five minutes. You want to try getting out of bed?
[00:45:02] Now, obviously, those aren't the exact same voice, but it's pretty similar. I mean, it sounds like it could be out of the movie. If you had told me that first clip was actual dialogue from the movie, then aside from the content, the "oh, you're doing a demo of GPT-4o" part, I wouldn't really have questioned it. It had all of the same associations; it sounded exactly like something from the movie. But obviously it's not the exact same voice, and I don't think they built a model entirely from her voice. Or if they did, let's say they actually used a sampling of 30 seconds' worth of really good content from her, from when the computer is first introduced. I think the first scene they have together is a couple of minutes of back-and-forth conversation, with Joaquin just kind of fascinated that he's having a conversation with a computer. They could have done that and then just mildly tweaked the audio output so that it didn't have exactly the tonality, exactly the makeup, of Scarlett Johansson's voice. Now, whether there's enough evidence or enough similarity, as if there aren't two people who have similar voices or talk the same way, whether or not this is lawsuit-worthy, is an interesting question. But again, it just feels like OpenAI loves to use everybody else's stuff and not let anyone use theirs. Of course, if you go to their platform, give them your email, let them spy on you and connect to everything that you do, and let them read all of the information off your device, well, sure, they'll quote unquote let you use it. But it won't be released for you to tweak or fine-tune and host somewhere else, even though they're taking a ton of proprietary information and using it to train their model. Now, there's another big implication with AI that I've increasingly been seeing, especially with Google and search results when I've been looking for stuff. I've noticed something interesting about the way I interact with the web now, and how AI has kind of become that middleman between me and an article, between me and the actual researcher, the source that an answer is being pulled from. But I think it's worthy of its own episode, because it's a very unique thing, and something I had not anticipated going in: it changes the monetization model of the Internet. It really alters, in a huge way, how we even interact with things, how we traverse the web, or the need to traverse the web at all. The web is built on attention, on advertising: you go to a website, and they sell advertising because you're there to look at it. If I can get that information without going to the website, what does that mean? What does that mean for the entire way we fund the Internet today?
[00:48:24] But I think there's a very interesting conversation there, not only on the idea of the web itself, but on what the solutions are, and how Lightning and Bitcoin fit into this, as well as the Pear stack and the things we've been talking about on the Pear Report.
[00:48:40] And I just think there's a lot to unpack there in the implications, so I want to give it its due time. Now, with the number of lawsuits that have come out, OpenAI is going to spend a lot of time fighting in court.
[00:48:58] And I wonder: if they just open sourced it, if they made the model and the weights available and released the content they trained it on, would they get away with the fair use claim? They may get away with it anyway, but having dug into this more, I'm not 100 percent sure I would actually support that anymore. In separate lawsuits, AlterNet, The Intercept, and Raw Story have all filed in New York against OpenAI for the same thing: using their copyrighted data to train ChatGPT. And another one is actually about Microsoft's Copilot, which is being accused of not listing the source information, so when people look up or find some sort of an answer, it doesn't even say who it's from. I assume they probably do that now, but there must have been cases; I mean, if there's a lawsuit, I assume somebody has shown, or thinks they can show, that they were used as a source for answers and Microsoft isn't actually referencing them. And the same goes for code snippets, actually licensed blocks of code that get regurgitated by a lot of these LLMs, Copilot being another big one, again without attribution to the actual people who wrote the code. You know, when I use stock footage from Envato Elements, or copyright-free music, you attribute the person who actually made the content. But not only is the LLM not attributing; where does the main weight even come from when it's generating? And let's not even limit this to LLMs, let's take any generated content: how much data has been trained on for, say, Sora and the video generators, or AnimateDiff and Stable Diffusion? Granted, Stable Diffusion and AnimateDiff are entirely open source, so I think they do fall under fair use, because they're not charging for it, they're not closing it off behind a walled garden, forcing you onto their servers and harvesting your data to make their money, while taking away the potential monetization of the content they actually generated those weights from. I think stock footage is actually a really great example of this. When you have a model like Sora that can generate any footage you want, believable enough and of good enough quality, and training those weights requires tons of high-quality, well-described, well-tagged stock footage, you literally kill the stock footage business. I mean, I think stock footage, stock imagery, stock video, that entire market is basically carving its own name into the gravestone right now; it is going to be extremely dated. Maybe there will still be a market specifically for buying or accessing that footage for training models, but I think most people will simply generate a lot of that content. Take Envato Elements, one of the services I've used quite a bit for that sort of thing: very often I can't find anything good, or it takes a really, really long time. And it's funny, they have already integrated AI, so you can do generative AI for a stock image specifically.
So I think they're well aware that this is threatening their entire business model, and maybe their new model is to train an AI that people will pay for in order to generate this stuff. But if they have paid for all of this footage and built a model so that other people can download from them or generate with it, and then OpenAI uses all of that same stuff to create their model and charges people to use theirs, well, what's the value in Envato Elements, in actually paying for any of that footage? Why wouldn't they just do the same thing? And how long does that go on, with everybody just copying everybody else's stuff, until we've reached this weird point, going back to the article about the enshittification of the Internet, where everything is just AI generated and we're actually starving for the very content these models need? We don't have a way to monetize, a way to provide that economic feedback mechanism, for the things that actually produce the stuff needed to train these models in the first place: actual footage from the real world.
[00:54:24] Basically, the incentives around creating content on the web, and the ability to reward people so that creating that content is economically sustainable, are really getting messed up. There's something really interesting going on with the economic incentives here, and I think it's worth paying very close attention to. At the very least, it seems like something is unraveling here.
[00:54:51] It's a little bit different than what I may have expected. Some of it, yes, but I had not considered what the next episode of the show is likely to be about: the situation with the web, not actually needing to go to the website anymore, and what that could mean. One of the analogies that I've used, which I thought was a great way to think about LLMs or generative AI in general, is that it's like a compression algorithm: an extremely lossy compression algorithm that can't really pull out the entire original product, the entire original picture it was trained on, but it pulls the weights of the patterns from those pictures, such that you can recreate a picture of a cat that's extremely similar, or that has the specific realism of somebody else's picture of a cat that actually existed. But that's the thing. What that means is that I can get a picture of a cat without ever going to the content page, the website, or the social media of the person who literally takes pictures of cats.
[00:56:10] And if that can't be funded, if that can't be rewarded, socially, monetarily, whatever it is, if that cannot be made economically sustainable, then what happens? Do we have pictures of cats, or do we just have a million AI-generated cats? I don't know.
[00:56:28] It's an interesting twist on where this is going. But basically, all of these different companies and models are facing this. Even Stable Diffusion: Stability AI was in court with Getty Images, and I don't even know what the outcome of that is. I haven't really looked into it since; I just have it saved as, this is going on. Maybe it's still going on.
[00:56:51] Midjourney has been hit with a legal case too, and it's interesting to unpack that, because they scraped the web for a bunch of images, let's say from, what's the one, DeviantArt, I believe, where there's a ton of different artists and you can find the artists by name. One of the interesting things about the generative AI stuff is that I've actually learned the names of some artists, which I can't think of off the top of my head, but I have them saved for prompts, listed with their styles, because sometimes I do want to generate something that has a certain look. There's actually a Stable Diffusion cheat sheet, literally a giant list of all these different artists' names, their styles, and what it looks like if you use them in a prompt. Maybe I'll host it somewhere so you guys can get it, or I'll find the link to it; I think I got it off of GitHub somewhere. So I'm generating things that are done in the style of somebody else, because they used that data to train, something like the sketch below. But of course, they're not attributing the author. They're not linking back to their content.
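To illustrate just how trivial this is, here is a minimal sketch with the open source diffusers library. The checkpoint ID is the standard Stable Diffusion 1.5 release, and the artist name in the prompt is a placeholder for any name off one of those cheat sheets.

```python
# Minimal sketch: generating an image "in the style of" a named artist
# with open source Stable Diffusion. The artist name is a placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a lighthouse on a cliff at sunset, in the style of <artist name>"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("lighthouse.png")
```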
And the artists aren't getting paid anything. Granted, Stability doesn't even charge; it's an open source model. So it is an interesting conversation, and I'm curious what you think about the ethics of that sort of thing, of literally, quote unquote, copying. I could paint a picture that looked like somebody's artistic style with the specific purpose of making it look like theirs, and I wouldn't say that's not allowed. They can't just claim that because it looks a certain way, that means they own it. Again, I don't believe in intellectual property, especially not in the sense of a style. If somebody likes that style, well, you can argue every movie style is a copy of a previously successful movie style, and that's how art evolves to begin with. Somebody creates something new, something unique. Then other people copy that type of editing or that type of cut, or doing a Tarantino and putting the end of the movie at the very beginning and then having the whole story play out, so you see what the context was, and it changes everything about what you saw at the start of the film. These were unique things from unique artists at one point in time, and then they just evolved. Through osmosis, they get soaked into the context of the art form itself; they get adopted and they push art forward. Artistic creation is about sharing, about being dynamic, about taking something that you know and love and trying to make it your own or adding some new twist to it. And when it comes to generative AI, obviously this is not policeable if you can just generate millions of pictures based on one artist's style. In fact, there's another one.
[01:00:18] In fact, where's Pinokio? I think it's open right now. Hold on a second. Yeah, Pinokio. Where is it?
[01:00:25] So there's a tool called ZeST, Zero-Shot Material Transfer. Okay, that's not the one.
[01:00:34] That's a good example, though: you can take a material, or a type of backdrop, and transfer it to any object. In the example image they have, there's an apple on a table; you take an image of something made out of copper and say, transfer the material, and now suddenly you have a copper apple. It will generate that. But you can do the exact same thing with a style, okay? It's called InstantStyle. You can take an image, a painting, someone's artwork, and use it as reference, and it will immediately take that style and imprint it on any other image. If you want it to look like a Picasso painting, or a Van Gogh, you could just make any of your family portraits look like a Van Gogh painting. That clearly cannot be policed; it's so easy to do. This is very much like the fight against piracy: it's going to be a never-ending game of whack-a-mole, especially with something like Stable Diffusion and these models that are out there and aren't closed source, they're just open. And there are a billion LoRAs. On Civitai, they've got LoRAs for a bunch of different celebrities, most of which, no doubt, are probably just used to generate porn, which is so funny and also weird in the context of celebrity nude shots online. It's always been a thing: oh, they faked it, they photoshopped their face onto something else. But now you can literally just generate as many as you want with AI, and people can train LoRAs themselves. In fact, I believe there's another model; hold on a second, I think I have this one. Yeah, of course I do: FaceFusion is the one I used with Salma Hayek, because I was working on a Salma Hayek meme and trying to do the video. You can actually live-change your face to somebody else's face, and you only need one image of that person as reference. This is not policeable. So it's going to get really weird, because celebrities now have to deal with an essentially infinite number of images of them that aren't actually of them. There are hundreds of thousands, millions of models out there, and anybody can train them on their own computer with one image of a person's face, or just 100 or 200 images of a celebrity, and good lord, what could you more easily find 200 different images of?
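All of these tools, InstantStyle and the one-image face pipelines alike, work from a reference image. Here is a hedged sketch of the style version using the IP-Adapter support in the diffusers library, which is the general technique InstantStyle builds on; the checkpoint and adapter names are the publicly published ones, but treat the exact arguments as assumptions rather than a definitive recipe.

```python
# Hedged sketch: imprinting the style of one reference image onto a new
# generation, the general reference-image technique behind tools like
# InstantStyle. Checkpoint/adapter names and settings are assumptions.
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# IP-Adapter lets the pipeline condition on a reference image.
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models",
    weight_name="ip-adapter_sd15.bin",
)
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference style applies

style_reference = load_image("starry_night.jpg")  # any painting you like
image = pipe(
    prompt="a family portrait",
    ip_adapter_image=style_reference,
    num_inference_steps=30,
).images[0]
image.save("portrait_in_style.png")
```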
This is the new technological landscape; there is no putting the genie back in the bottle. So where are the limits? Where are the walls? What are the guardrails? Not political guardrails, not "what regulation should we pass," I mean in the sense of the market. What will the market determine in that implicit, unspoken agreement between the content creators, the people who create that content and put themselves out there, and the people who consume it? And how do the platforms and the technology that deliver it change? Very similar to the context I talked about earlier, of splitting payments with whoever is hosting the content and letting it out in a distributed, peer-to-peer fashion. That platform does not exist yet, but theoretically it can. I think we need to build technological solutions to technological problems. When you have a new element on the content generation side of technology, you need a requisite, parallel development in the content distribution technology, the networking technology, and the monetization technology around it, in order to actually meet it in the real world. And that's the only real solution. So the question is, what does that look like in the real world? How do we build that solution, and how do we do it with as little beating each other over the head, putting people in prison, and stealing massive amounts of money from each other as possible? How do we do it in a cooperative fashion? How do we push forward, rather than hitting each other as we try to resist change as much as we possibly can? Well, we'll go ahead and close this episode out, because I don't want to get too far into something else that I think will take a long time to unpack. But I want to leave you with a report from Copyleaks. There's an article from Axios that I went down a rabbit hole on, and it's just a couple of months old now. Copyleaks ran a series of tests against a bunch of different authors, the New York Times, a bunch of different research content, running a ton of different prompts through ChatGPT and then looking at a similarity score against the content it was pulled from. And it depended heavily on the subject. Computer science was actually the highest: literally 100 percent, where everything ChatGPT produced across a thousand outputs related to computer science was so similar to something specific that was used as an input that it was considered essentially plagiarism of that thing. But this changed hugely based on what was being output: theater had the lowest similarity scores at 0.9 percent, humanities subjects were 2.8 percent, English language was 5.4 percent, et cetera. So depending on the subject in question, the degree or percentage of plagiarism of some other explicit, copyrighted, or paid-for published work varied quite a bit.
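Copyleaks doesn't publish its exact methodology, so as a purely hypothetical illustration of what a similarity score measures, here is the simplest possible version: the fraction of word n-grams two texts share. Real detectors are far more sophisticated than this.

```python
# Hypothetical illustration of a text similarity score via n-gram
# overlap (Jaccard similarity). Real plagiarism detectors like
# Copyleaks use more sophisticated, proprietary methods.

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(output: str, source: str) -> float:
    """Fraction of shared word trigrams between two texts (0.0 to 1.0)."""
    a, b = ngrams(output), ngrams(source)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

source = "the quick brown fox jumps over the lazy dog near the river bank"
output = "the quick brown fox jumps over a sleepy dog near the river bank"
print(f"similarity: {similarity(output, source):.0%}")  # heavy overlap
```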
[01:07:15] But ultimately, across the board, for everything they tested, the average was 60 percent. For any and all questions on all the topics covered in the study, the overall similarity score of ChatGPT's output to some explicit content from somewhere else, that someone else had written, was 60 percent. And obviously those people are not being attributed; they aren't getting a share of the money everybody pays to use the LLM. And of course, neither they nor anyone else can host this LLM and run it themselves, because OpenAI has closed it down. They've taken everyone else's content, created this model based off of it, and then determined that they own the model, nobody else can have it, and you have to pay them, or log in and connect to their silo, in order to get it. So what do you think of that? Is that wrong? Is this just the new environment, and we just have to deal with it? Am I just being a statist, or am I not being hard enough on them? I'm really, really curious about your thoughts, because this is new ground. If you went back 10 years, I never would have thought this would even be part of the conversation. I never would have thought you could just punch something into a computer and generate stuff. It's a really, really wild world that we are living in.
[01:08:56] So I don't know. Those are my thoughts, and I am curious about yours. So hit me up, tag me on social, on Nostr; I have my pubkey in the show notes, as well as my Twitter handle, TheGuySwann, and I'll catch you on the next episode. A huge thank you to Swan Bitcoin, the best place to buy Bitcoin, and to CoinKite, the makers of the ColdCard hardware wallet to keep your Bitcoin safe, for supporting the show. I am Guy Swann. This is AI Unchained. And until next time, everybody: take it easy. Guys.
[01:09:35] Overprotecting intellectual property is as harmful as underprotecting it. Creativity is impossible without a rich public domain. Nothing today, likely nothing since we tamed fire, is genuinely new. Culture, like science and technology, grows by accretion, each new creator building on the works of those who came before. Overprotection stifles the very creative forces it's supposed to nurture.
[01:10:05] Alex Kozinski.