Black Cats & Generative A.I. – How LLMs are looked at from a Copyright perspective
25 October 2023
Developers and users of Generative A.I. ("Gen A.I.") can find themselves having to navigate copyright challenges. Doing so requires some understanding of both how a Gen A.I. model manipulates copyright works as part of its training and development, and how copyright law might apply to such use.
So what exactly happens to content when it is used to develop Gen A.I., and how might this be analysed from the Singapore law perspective? This article explores one way in which the law might apply.
This is our second article published on Gen A.I. and copyright. The first ("Did Singapore Solve Copyright Issues in the Training of AI Models?", dated 25 August 2023) can be found here.
By Jeffrey Lim
Introduction
More than a few IP and technology enthusiasts, thinkers, advocates, influencers and readers are watching the proceedings before the US District Court for the Northern District of California, San Francisco Division, between Open AI and its opposite parties (Case No. 3:23-cv-03223-AMO) (the "US Action"). In the US Action, various parties have brought a class action lawsuit against Open AI and its affiliates ("Open AI") to challenge the alleged use of their copyright works in training Open AI's large language model ("LLM"), ChatGPT. (A news report summarising the claim in July 2023 can be found at: https://www.latimes.com/entertainment-arts/books/story/2023-07-01/mona-awad-paul-tremblay-sue-openai-claiming-copyright-infringement-chatgpt.)
As at the time of this article, the matter is still proceeding, and it will be interesting to see how the legal issues under US copyright law are resolved.
Counsel for Open AI has filed a motion to dismiss the claim, and in doing so has brought to the fore the key battlegrounds under copyright law, including:
- (1) whether copyright is indeed engaged when LLMs are being trained, given that copyright protects expression and not ideas – i.e. whether the process of training an LLM merely involves the extraction of ideas (concepts contained in the content) from expression (the particular works / content itself); and
- (2) whether the defence to copyright infringement on the basis of fair use would apply to the development of LLMs through the use of content.
If Open AI is correct on (1), then copyright is not engaged at all, and claimants trying to stop the development of LLMs will need to look beyond copyright to bring a claim for compensation or stop orders against the use of LLMs. If Open AI is correct on (2), copyright might be engaged but the practical effect is the same. This article will look at (1) more closely before touching on (2).
Does the process of developing an LLM involve the "extraction of ideas"? Should any of the steps taken be properly characterised as merely "extracting ideas"? What is "extracting" anyway?
"Under the hood"
What happens "under the hood" in the development of an LLM? Not being developers of LLMs ourselves, we can only point to secondary sources for descriptive statements. One useful reference point for us is the article "Generative AI exists because of the transformer", published by the Financial Times on 12 September 2023 (accessible at https://ig.ft.com/generative-ai/), which we recommend as a most engaging read.
The article describes the creation of text-based LLMs and describes what appears to be a process of adaptation and modification, which can be broadly summarised (at the risk of over-simplification) as follows:
- Words in the target text are translated into a "language" that the LLM can "understand" – this includes breaking up text into basic units called "tokens", which may comprise one or more words, that can be encoded;
- Then there is embedding – a process by which the proximity of words to each other is given values (i.e. vectoring or adding values), thereby constructing a way to map the relational distribution of words to each other and allowing patterns of arrangements / statistical relationships of word orders and occurrences in sequences of text to be observed;
- Transformers are then used to process the sequences en masse, and therefore engage larger corpuses of text. This allows greater sampling / mapping of text to establish / observe relationships between words, thereby allowing the LLMs, through statistical models, to analyse the likelihood of the appearance and arrangements of words within a specified context.
This allows predictions of words in sequences and allows LLMs to create coherent expression. In this way, an LLM can generate text with a semblance of creative intelligence.
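As a rough illustration only, the tokenising and embedding steps described above can be mimicked in a few lines of Python. This is our own toy sketch and not how ChatGPT or any production LLM is actually built: the sample sentences, the whitespace tokeniser and the neighbour-counting "embedding" are all assumptions made for the example, whereas real systems learn sub-word tokens and embedding values during training.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample text standing in for a training corpus.
corpus = [
    "the black cat hissed at the dog",
    "the black kitten hissed at the dog",
    "the wizard cursed the black cat",
]

def tokenize(sentence):
    """Toy tokeniser: lower-case the text and split on whitespace."""
    return sentence.lower().split()

# Give each distinct token an integer ID (the "encoding" step).
vocab = {}
for sentence in corpus:
    for token in tokenize(sentence):
        vocab.setdefault(token, len(vocab))

def cooccurrence_vector(word, window=1):
    """Count the tokens that appear within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    # Lay the counts out in vocabulary order to get a fixed-length vector.
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity: closer to 1.0 means the words keep similar company."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

cat = cooccurrence_vector("cat")
kitten = cooccurrence_vector("kitten")
wizard = cooccurrence_vector("wizard")
print("cat vs kitten:", cosine(cat, kitten))
print("cat vs wizard:", cosine(cat, wizard))
```

In this toy corpus "cat" and "kitten" sit next to the same neighbours, so their vectors are more alike than those of "cat" and "wizard". What the vectors record is the company each word keeps, not the sentences themselves.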
Ideas vs expressions – consider a black cat
Let's take a moment to think that through.
Ideas can be expressed in many ways, and the line between the idea and its expression can be notoriously difficult to draw. Say I put, in text, a description of a black cat. It might look like this (and no, we're not anywhere in Edgar Allan Poe's league, so bear with us):
"Sable and matted fur, carpeting a sleek 4 limbed pint-sized feral beast. Green eyes intensely staring above an ivory tooth and fleshy pink tongued hiss. A serpentine tail coiled as if to strike."
Here, the "idea" is an impressive, edgy black feline hissing at you. And indeed, the explanatory phrase
"an impressive, edgy black feline hissing at you"
is another expression of the same idea.
At one level, both convey the same idea, but the first expression is clearly, and qualitatively, different from the second.
Now the idea of protecting expressions is that an author can have a monopoly over the expression (either the first or second version) and that you can bring an action for copyright infringement when someone takes that sequence of words, and replicates it. The author cannot (theoretically at least) bring an action if the only similarity is the mere idea of the black hissing cat.
So far so good.
But let's consider a cat which has a wound on its forehead. Let's add the detail that the wound was inflicted by an evil wizard when the wizard tried to murder the cat's mother – consciously portraying an evil cat version of Voldemort attempting to kill a young Harry Potter-esque kitten.
Now we are adding detail that isn't generic to all hissing cats, and we're beginning to steer into the question of whether we are appropriating Ms J K Rowling's work, which (i) is a form of expression too, and (ii) contains articulation and detail developed through the intellectual labour of another author. So, is the taking of evil cat Voldemort and the poor kitten a taking of expression, and not just of an idea?
You could make arguments either way, but there is certainly a lot of ink spilled in judicial pronouncements that would support the argument that it is.
A comparison – Translation / adaptation VS model development
Now let's dangerously oversimplify and fictionalise for the sake of a thought experiment.
We could say, hypothetically, that a developer of a hypothetical LLM essentially undertakes the following steps:
- Step One – Copies the text from publicly accessible (note: we did not say publicly available) sources.
- Step Two – Converts the text (each word in each sentence of each text) into tokens (note: this could be seen as the "translation" process).
- Step Three – Feeds the LLM with data to enable it to identify statistical relationships between words as contained in different contexts (note: this is transferring the "translated" text in step 2 and studying not the "ideas" but the likelihood that certain expressions could arise in different sequences).
- Step Four – Extracts the statistical model and allows the LLM to use it independently of the original documents, so as to create words that in turn can be assembled as created text to deliver ideas.
The so-called "hallucinations" of Gen A.I. tell us that the expressions-to-idea loop doesn't necessarily always work as intended. But, voila, you have a working LLM.
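The four hypothetical steps above can be sketched in toy form. This is an illustration under loud assumptions, not a description of any real LLM: the two sample sentences are invented, the tokeniser simply splits on spaces, and the "statistical model" is a bare table of which word follows which (a bigram count) rather than a transformer. The names `follows`, `model` and `generate` are ours.

```python
import random
from collections import Counter, defaultdict

# Step One: a hypothetical stand-in for text copied from some source.
documents = [
    "the black cat hissed at the wizard",
    "the black cat stared at the kitten",
]

# Step Two: convert the text into tokens (toy whitespace tokeniser).
tokenized = [doc.split() for doc in documents]

# Step Three: record how often each token is followed by each other token.
follows = defaultdict(Counter)
for tokens in tokenized:
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1

# Step Four: keep only the statistics and discard the source text.
model = {word: dict(counts) for word, counts in follows.items()}
del documents, tokenized, follows

def generate(model, start, max_tokens=8, seed=0):
    """Produce text by repeatedly sampling a likely next word from the model."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_tokens and out[-1] in model:
        choices = model[out[-1]]
        next_word = rng.choices(list(choices), weights=list(choices.values()))[0]
        out.append(next_word)
    return " ".join(out)

print(generate(model, "the"))
```

Depending on the seed, this toy model may produce a word sequence that appears in neither source sentence, or it may reproduce part of a source sentence word for word. Both outcomes come from the same table of statistics, which is one reason the characterisation question is not straightforward.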
Now let's compare this with a more ancient process that existed before Gen A.I. and LLMs:
- Step One – A translator buys a copy of a book in English.
- Step Two – The translator converts each word into Chinese. And anyone who has struggled with multiple languages would know that not all words will necessarily translate directly, so some work is needed to select the right Chinese words.
- Step Three – Grammatical re-arrangement and, to some extent, re-expression and rephrasing are carried out.
- Step Four – We train students to read the translated texts with the English text, and they begin to understand how to translate for themselves. There may come a day when that student will write his / her own bilingual text.
Voila, you now have a human who is, at least at one level, textually bi-lingual. (My Chinese teachers would be horrified to hear this over-simplification of this learning journey).
Is the comparison valid? If not, in what way is it different? What would the differences tell us?
Because, simply put, Steps 1 to 3 in the second example (the translation of an English text by a translator into Chinese) are indeed covered by copyright. Translation is in fact a thriving business and a source of copyright licensing, as the sale of translated copies of bestsellers attests.
Step 4 of the second example is another thing. I do not pay royalties to the writer of every Chinese textbook or translator from whom I've learnt my Chinese (as shabby as it is).
Questions, questions, questions
Does the fact that steps 1 to 3 of the first example (the fictional LLM developer) resemble steps 1 to 3 of the translator mean that the LLM developer had to infringe copyright along the way to achieving its step 4?
And even if so, can we say that this is why a defence of fair use would be needed in order to defend the developer of an LLM from a claim of copyright infringement? In other words, if steps 1 to 3 are infringement, could one argue that they are part of a continuous (and contiguous) process of training a model, and so the initial use of the content in steps 1 to 3 ultimately comes under the wider umbrella of acts covered by a defence of fair use?
Put another way – would parsing the steps into the 4 phases be inappropriate because we need to see the entire operation as a whole, taking into account the intended commercial use case / purpose of the LLM and its propensity to create works competing with the original content? Would we need to consider if there is appropriation / substantive "copying" whenever "new" text is "generated" by the Gen A.I.?
Or is a broad interpretation not appropriate, and should we in fact be rigorous in parsing the steps because that is how copyright legislation is designed – i.e. because it is a bundle of rights, each separated into activities such as copying, modifying / adapting, etc., which would apply to each expression?
Conclusion
If it sounds like the analysis is getting metaphysical, it is because we are trying to tackle new technologies with legal frameworks and analytical tools created by a corpus of copyright law that existed well before Gen A.I. was a gleam in the eye of the model developer.
What is clear is that, even through the antiquated lens of copyright laws, there are plausible arguments on either side.
For now, we must await the pronouncements of judiciaries or the action of Parliaments / legislatures the world over to give us clarity.
In the meantime, perhaps we can continue contemplating the mysteries of black cats and their hissing.