A.I. & Copyright – Did Singapore’s Copyright Act 2021 solve copyright problems in the Training of A.I.?
25 August 2023
When Artificial Intelligence models are trained against data, the potential for copyright infringement exists under the laws of some, if not most, countries. Did Singapore's Copyright Act 2021 solve this problem? This article explores what Singapore's changes to its laws mean.
By Jeffrey Lim
Introduction
"I come to bury Caesar, not to praise him." So said Mark Antony in Shakespeare's Julius Caesar. Thus, this article comes to raise questions, not answer them. So, on that 'promising' start, we ask: Did Singapore's Copyright Act 2021 ("CA 2021") solve copyright questions in respect of training and developing A.I. models?
In some jurisdictions, lawsuits by content owners against A.I. developers are ongoing, with the potential for more legal action by rights owners to follow (see, for example, "Legal Challenges Surround OpenAI: A Closer Look at the Lawsuits" by K L Krithika for Analytics India Magazine, 21 August 2023). A common aspect of these complaints is that A.I. developers have infringed copyright by making copies of various works to train or fine-tune their A.I. solutions.
But does this training, without express permission from content rights owners, constitute infringement? Are there legal defences available? And what if Singapore has amended its copyright laws to answer such questions? Singapore's law may well present a legislative model for other countries to consider and also position Singapore as an attractive hub for A.I. developers, to say the least.
Scraping data from the Internet to build A.I.
First, some background. The rapid advent and democratisation of generative A.I. ("GAI") has thrown a greater spotlight on the reality that the development and fine-tuning of A.I. solutions is a data-intensive project. The more data there is, and the better the quality of that data, the more robust and accurate the A.I. solution.
The Internet is a treasure trove of data – good and bad, admittedly – but much of it is "available" in the sense that you can extract it. Even if you exclude the content that is behind firewalls, paywalls or gatekeeping features designed to keep the public out, there is still a lot of potentially good quality data.
This data includes text, photographs and video. And these may well be protected by copyright. Indeed, it does not take much to secure copyright protection. In most countries, it generally comes into existence automatically, with little more needed than having a citizen of a Berne Convention member country (which means nearly every country in the world) create an original (i.e. not copied) work in a material form (which can include digital form).
Doodle a stick man with frizzy hair that you imagined on a piece of paper, and you have a copyrighted work. Indeed, copyright can be found in bland factual reports, structures of data tables, and functional drawings. Not every work needs to be Shakespeare to qualify for copyright protection.
Collecting data
Scraping, or whatever other name one might give to harvesting data from the Internet, essentially involves creating a first copy of the content, which is needed in order to process it and get at the data within. There will be other steps to prepare that data for use in training, for sure, but that first "raw copy" is a copy.
Unless you are looking at solutions like federated learning (which is not applicable to all models and could present other legal issues if not used in the right setting), making that first copy is a necessary step.
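To make the "raw copy" point concrete, here is a minimal sketch in Python of what a harvesting step does at its simplest. It is an illustration only, not a description of any particular developer's pipeline. The names `FAKE_WEB`, `fetch`, `harvest` and `raw_store` are hypothetical, and the web request is simulated with an in-memory table so that the example runs without network access.

```python
import hashlib

# Hypothetical stand-in for the public web: URL -> page bytes.
# A real scraper would issue an HTTP request here instead.
FAKE_WEB = {
    "https://example.com/post": b"<p>A stick man with frizzy hair.</p>",
}


def fetch(url: str) -> bytes:
    """Return the bytes served at `url` (simulated, no network access)."""
    return FAKE_WEB[url]


def harvest(url: str, raw_store: dict) -> str:
    """Fetch a page and keep an unmodified "raw copy" of it.

    Returns the SHA-256 digest under which the copy is stored.
    """
    content = fetch(url)
    digest = hashlib.sha256(content).hexdigest()
    raw_store[digest] = bytes(content)  # the first copy, byte for byte
    return digest


raw_store = {}
key = harvest("https://example.com/post", raw_store)

# The stored bytes are identical to the source content.
print(raw_store[key] == FAKE_WEB["https://example.com/post"])
```

Whatever happens to the data afterwards, the store at this point holds a byte-for-byte reproduction of the content, which is why the act of copying is the natural starting point for the copyright analysis.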
If the owner of the copyright in that content did not give you permission to make that copy for the purpose of training an A.I. model, the question of copyright infringement needs to be addressed.
This is true even in situations where you subscribed to a database – if that owner of the rights in that database limited your rights to use that data for only certain purposes, and if A.I. model development is not part of those licensed purposes, you must ask whether copyright issues arise.
Think of it this way – if the database owner discovers that its users have a new use for its data, the owner would want to monetise it.
Enter Singapore's "Computational Data Analysis" provisions under the CA 2021
Singapore, ever forward-looking and practical in its legislative approach, took a stab at, and a step forward in, addressing this issue in the CA 2021. Its provisions relevant to the training of A.I. models were passed into law around a year before widely accessible GAI took the world by storm.
Specifically, Division 8 and sections 243 and 244 of the CA 2021 provide that it is a permitted use to "make a copy" of a work or a recording of a protected performance for the purpose of "computational data analysis", which:
"… includes
(a) using a computer program to identify, extract and analyse information or data from the work or recording; and
(b) using the work or recording as an example of a type of information or data to improve the functioning of a computer program in relation to that type of information or data"
The provisions appear, at first glance, to clear a path in the copyright forest. If so, it should be a boon to A.I. developers, including the developers of GAI solutions.
As a short aside, the provision is only targeted at copyright. No other issues are tackled, and so matters connected with A.I. development, A.I. adoption or the impact of A.I. on stakeholders are left for another commentary, to say nothing of privacy, governance, emergent A.I. regulation, ethics, and liability in deployment. Indeed, just recently, privacy authorities from a number of countries have joined together to remind everyone of the potential for breaching privacy laws when scraping data (see "Global expectations of social media platforms and other sites to safeguard against unlawful data scraping", Australian Government, Office of the Australian Information Commissioner, 24 August 2023).
"Copying" is permitted under the CA 2021 – but is that all that is needed?
Returning to copyright, we consider the parameters and ask: What are the specific acts that sections 243 and 244 enable?
We observe that:
- the only permitted acts are "to make a copy", "storing", "retaining" and to "communicate" the work; and
- all references to "computational data analysis" itself are descriptive of the purposes for which the acts mentioned above are undertaken.
Whilst it is easy to over-generalise about what the practical development of A.I. models involves, we venture to observe (being cautious that there may be nuances and exceptions) that the steps generally undertaken include processing the raw data into a form that can be used to train an A.I. model, and then using that processed data to execute the training.
Consider the following equivalent of a "stick figure" doodle of the training process (and like any stick figure, this is overly simplistic, but this is to focus on a legal point):
What is happening in the steps in the "red box"? Something more than mere "copying"?
We will cover further analysis in another article, but for this one, the clear question in copyright is whether such processing to extract the data points is re-rendering the raw data into a form that can be used for analysis (e.g., comparable, even, to translation), or whether it is merely extracting ideas from expression?
The answer may even be both, neither or something in between, but it is worth noting that:
- on the "only-translating" side of the analysis, if the conversion of the copyright work into a format for training amounts to a form of adaptation, then those steps are a separate copyright-protected act, and "adaptation" is not an act (and a verb) that appears to be permitted under the "computational data analysis" provisions;
- on the "only-extracting-ideas-from-expression" side of the analysis, there is an open question as to whether copyright is even engaged at all; and
- if we land somewhere in between, would we have a "transformation" of the work such as to engage the fair use defence? That, in turn, is a segue to comparative reviews on "transformative use" under US copyright law, and even to discussions on fair learning (see Mark A. Lemley & Bryan Casey, "Fair Learning", Texas Law Review (forthcoming 2021), draft available at SSRN.com).
More Questions…
Further detailed technical and legal queries also arise.
How do we treat and analyse, in copyright terms, operations designed to extract data such as data cleaning, normalisation and feature extraction? Would such operations involve creating "adaptations" of the original work?
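For readers less familiar with these operations, the following is a deliberately simplistic Python sketch of what cleaning, normalisation and feature extraction might look like for a short piece of text. The function names and the sample sentence are hypothetical, and real pipelines are far more elaborate. The point is only to show the intermediate forms a work passes through, not to answer how the law should characterise them.

```python
import html
import re
from collections import Counter


def clean(raw_html: str) -> str:
    """Data cleaning: drop markup and decode HTML entities."""
    without_tags = re.sub(r"<[^>]+>", " ", raw_html)
    return html.unescape(without_tags)


def normalise(text: str) -> str:
    """Normalisation: lower-case and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_features(text: str) -> Counter:
    """Feature extraction: count word occurrences (a "bag of words")."""
    return Counter(re.findall(r"[a-z']+", text))


raw = "<p>Not every work needs to be   Shakespeare &amp; not every doodle is art.</p>"
cleaned = clean(raw)
normalised = normalise(cleaned)
features = extract_features(normalised)

print(normalised)
print(features.most_common(2))
```

In this sketch the normalised text still tracks the original expression closely, whereas the word counts discard word order altogether. Where along that spectrum a given operation sits is one way of framing the "adaptation" question.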
Were the acts "to make a copy", "storing", "retaining" and to "communicate" collectively intended to cover (by necessary implication) all the copyright-protected acts undertaken as part of "computational data analysis", or was it a deliberate choice to limit the permitted acts under sections 243 and 244 of the CA 2021 in this way?
If the provisions only get us partly, but not fully, past the post for all steps undertaken as part of computational data analysis, then do the provisions have any practical use for developers?
Conclusion
It is likely that each case of A.I. model development would need to be assessed, and the legal analysis thought through, before one can greenlight the acts of "computational data analysis" in processing, or even start on harvesting raw copies of data. This will involve parsing the actual steps taken and then undertaking a thorough review.
Without proper analysis, litigation could well be foreseeable, and this could lead to a situation where, as Mark Antony says elsewhere in the same play, "Cry 'Havoc!', and let slip the dogs of war."