Artificial intelligence (AI) and machine learning (ML): Legal framework for the use of training data
Protection of training data, dataset and corpus
There is currently no specific legal code for artificial intelligence. In order to assess whether the use of training data to train an artificial intelligence in the context of machine learning may constitute a legal infringement, it is necessary to take a look at various laws and legal areas. In particular, the following should be considered:
The training data may be protected by copyright, in which case its use may infringe the German Copyright Act (UrhG). In addition to claims for damages, the rightsholder may then have claims for destruction and may even obtain an interim court order, issued without a prior hearing, to have your business premises searched.
The training data may contain personal data, in which case the GDPR and any other applicable data protection regulations must be observed during processing.
The training data may be subject to a contractual confidentiality agreement, possibly with contractual penalty clauses, so that training the AI or subsequently distributing the trained AI may give rise to claims for damages and may trigger contractual penalties.
Training data and copyright
Copyright protection
Copyright can be affected in various ways, depending on the nature of the training data. Depending on whether the training data consists of text, images, music, videos, software, databases or another type of protected subject matter, different sections of the German Copyright Act (UrhG) must be observed.
For example, the collection of data via screen scraping, web scraping or web crawlers may infringe the rights of third parties as database producers. See also our separate article on the permissibility of screen scraping / web scraping and web crawlers.
If, for example, texts from third-party websites are to be used to train your own AI, this may infringe the copyright of the respective authors as well as a database right of the website operator, which may have arisen from the operator's investment in selecting and systematically compiling the individual contents. If, on the other hand, images are to be used for machine learning, the copyright (or related rights) of the respective photographer may have to be taken into account, since photographers have generally not granted a general license (or general right of use) for machine learning purposes.
Exceptions to copyright protection
However, the Copyright Act also provides for limitations. This means that third-party content, although protected by copyright, may in certain cases be used for your own purposes without the permission of the rightsholder. Such copyright limitations exist for various special areas.
With regard to a third-party database, for example, insubstantial parts may in principle be used, but they may not be extracted and evaluated repeatedly and systematically (see details in our article on the permissibility of screen scraping / web scraping and web crawlers).
A new copyright limitation for "text and data mining" (TDM) was introduced only in March. The relevant provision is Section 60d UrhG. It expressly permits the automated analysis of a large number of works as source material, including systematically and with the express aim of creating a corpus (i.e. a dataset). However, this provision expressly applies only to scientific research, so its practical use is very limited. Section 60c UrhG also contains a relevant provision, but likewise only for scientific research.
Conversely, the existence of Sections 60c and 60d UrhG indicates that text and data mining and the creation of a corpus for commercial purposes are legally questionable and, to be permissible, must fit strictly within one of the other available limitations.
There are currently hopes that a forthcoming major copyright reform will bring broader, blanket permissions for the commercial sector.
Irrespective of whether a statutory limitation can be relied on, care can also be taken to use only third-party content whose license terms contractually permit its use for machine learning. For example, content offered under a license from the "Creative Commons" license family is worth examining more closely.
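In practice, such a license-based selection can be applied when the corpus is assembled. The following is a minimal sketch only, assuming that each collected document carries a license identifier. The `Document` structure, the field names and the entries in `CLEARED_LICENSES` are hypothetical placeholders; which licenses actually permit use for machine learning has to be determined by a legal review of the respective license text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical allow-list: identifiers of licenses that your own legal review
# has cleared for machine learning. These entries are placeholders only.
CLEARED_LICENSES: Set[str] = {"CC0-1.0", "CC-BY-4.0"}


@dataclass(frozen=True)
class Document:
    url: str
    license_id: Optional[str]  # None means no license information was found
    text: str


def filter_by_license(
    docs: Iterable[Document],
    cleared: Set[str] = CLEARED_LICENSES,
) -> Tuple[List[Document], List[Document]]:
    """Split documents into (usable, rejected) based on the allow-list.

    Documents without license information are rejected, because the absence
    of a license does not imply permission.
    """
    usable: List[Document] = []
    rejected: List[Document] = []
    for doc in docs:
        if doc.license_id is not None and doc.license_id in cleared:
            usable.append(doc)
        else:
            rejected.append(doc)
    return usable, rejected
```

Keeping the rejected documents in a separate list, rather than silently discarding them, makes it possible to document later which content was excluded and why.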
Training data and data protection
If the training data contains personal data, data protection law (usually in the form of the GDPR) must be observed. The training data may then only be used if there is a legal basis. One conceivable option is to anonymize the data first. However, the process of generating anonymized data from personal data may itself already require a legal basis under the GDPR. In addition, anonymized data is often no longer suitable as training data, because machine learning requires the personal reference or at least the ability to link certain parts of the data with each other.
On the topic of artificial intelligence and compliance with data protection law, see our detailed article on the subject.
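The tension between removing the personal reference and keeping records linkable can be illustrated with a short sketch. It replaces a direct identifier with a keyed hash, so that records belonging to the same person still share a token. Note that this is pseudonymization, not anonymization: pseudonymized data remains personal data under the GDPR, so a legal basis is still required. The field name `customer_id` and the function name are assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Dict, List


def pseudonymize(
    records: List[Dict[str, object]],
    secret_key: bytes,
    id_field: str = "customer_id",  # hypothetical name of the identifying field
) -> List[Dict[str, object]]:
    """Replace the identifier in each record with a keyed hash (HMAC-SHA256).

    The same identifier always maps to the same token, so records stay
    linkable. Anyone holding the key can recompute the mapping, which is
    why the result is pseudonymized rather than anonymized.
    """
    result = []
    for rec in records:
        token = hmac.new(
            secret_key, str(rec[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        cleaned = {k: v for k, v in rec.items() if k != id_field}
        cleaned["pseudonym"] = token
        result.append(cleaned)
    return result
```

Whether the remaining fields still allow individuals to be identified depends on the data set and is not addressed by this sketch.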
Training data and non-disclosure agreement
Regardless of the legal regulations, the use of data for the purpose of training an AI may be inadmissible due to contractual provisions.
If, for example, a supplier company discovers that it holds an interesting treasure trove of data and wants to use it to train an artificial intelligence, non-disclosure agreements (NDAs) with suppliers or other partners may need to be observed. Such NDAs may prohibit using the data for the company's own purposes and passing the data on, and may provide for severe contractual penalties. Machine learning can therefore already violate an NDA, especially when a fully trained AI is passed on to third parties. It should be noted that reconstructing the source data (or parts of it) from the fully trained AI is certainly conceivable, although the details depend very much on the AI in question. The blanket assumption that only a fully trained AI is passed on, and that the source data is therefore not disclosed to third parties, is thus not always reliable.
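One simple way to get a first indication of whether a trained AI reproduces confidential source data is to compare its output with the protected text and look for long verbatim passages. The following is only a rough sketch under that assumption. The function name is hypothetical, the comparison is purely word-based, and a negative result does not prove that the source data cannot be reconstructed.

```python
from difflib import SequenceMatcher
from typing import List


def longest_shared_run(model_output: str, confidential_text: str) -> List[str]:
    """Return the longest run of consecutive words found in both texts.

    Comparison is case-insensitive and splits on whitespace only.
    """
    out_words = model_output.lower().split()
    src_words = confidential_text.lower().split()
    # autojunk=False keeps frequent words in the comparison for long texts
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(out_words), 0, len(src_words))
    return out_words[match.a : match.a + match.size]
```

A long shared run suggests that the output should be reviewed before the trained AI is passed on; what length is critical depends on the data and the agreement in question.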
Conclusion
Much of the source data that is attractive for training an artificial intelligence in the context of machine learning is subject to legal constraints. Merely training a fully trained artificial intelligence, or passing it on, can therefore result in legal violations which, in addition to injunctive relief, can lead to claims for damages and even to interim court proceedings without a hearing in which documents are secured with the involvement of bailiffs.
However, if the applicable legal requirements are observed, machine learning is permissible without legal risk.