Data Management Policy for LAWPOL

1. General description of administrative data of RI

The administrative data processed in LAWPOL includes information about the users and their access roles in the research infrastructure (RI). For the basic tools, this information consists only of IP addresses. For the advanced tools and tailored datasets, users are required to register, so the information additionally includes usernames, email addresses, names, invoicing addresses, and the name of the research group or organisation the user is affiliated with.

LAWPOL collects this data in three ways. Firstly, the IP addresses of all users are collected to track the utilization rate of the RI. This data is not disclosed to third parties. Secondly, to use services that require registration, an ethical permission, or a purchase, the user must consent to the registration of their personal data. This kind of personal data will not be disclosed to third parties and is to be stored encrypted. Thirdly, research usage is tracked by requiring users to inform the consortium about publications for which the RI has been used. Data concerning research usage can be disclosed to third parties.

As the principal owner of the LAWPOL infrastructure, the University of Turku attends to the internal responsibilities regarding administrative data. Access to the administrative data is restricted to selected personnel of the LAWPOL consortium. The Privacy Notice of the LAWPOL infrastructure is available at https://lawpol.fi/en/tietosuojailmoitus-lawpol/.

2. General description of research data managed within RI

The research infrastructure will comprise legislative and political materials collected from existing digital sources or replicated from hard copies. The digital sources include the open application programming interfaces (APIs) of the Parliament, the Government Project Register Hankeikkuna and the Finnish Social Science Data Archive.

The data the RI pools together is originally in plain text, or in PDF or image format, and is typically accompanied by structured metadata. These primary types of data are shown in Table 1.

	Plain text	PDF/image	Metadata
Preparatory materials	x	x	x
Government bills	x	x	x
Committee materials	x	x	x
Parliamentary debates	x		x
Policy documents in which legislative modifications are envisioned	x		x
National legislation	x		x
International treaties and statutes	x		x
Rulings of both national and international courts	x		x
Open access articles	x		x
Political agendas, manifestos	x		x
Information about the governments, political parties and MPs			x
Official translations (Swedish, Sámi, English)	x		x

Table 1. Types of data

The RI produces and stores lemmatized, machine-readable versions of the source texts in a text format and links them together using metadata. In addition, the RI stores vector representations of these texts, called embeddings, for the advanced AI tools. As the data is pooled, it is linked using existing identification numbers for documents, organisations or MPs when possible. Some data has to be linked by other means, e.g. the names of the political parties.
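The linking strategy described above (shared identifiers first, names as a fallback) can be illustrated with a minimal Python sketch. It is not the LAWPOL implementation: the field names `party_id`, `party_name` and `name`, and the function names, are hypothetical and chosen only for illustration.

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Collapse whitespace and ignore case so that name variants compare equal."""
    collapsed = " ".join(name.split())
    return unicodedata.normalize("NFKC", collapsed).casefold()


def link_documents(documents, parties):
    """Attach a party record to each document.

    The shared identifier is preferred; the normalised party name is used
    only when no identifier match exists. All field names are hypothetical.
    """
    by_id = {p["party_id"]: p for p in parties}
    by_name = {normalise_name(p["name"]): p for p in parties}

    linked = []
    for doc in documents:
        party = by_id.get(doc.get("party_id"))
        method = "identifier"
        if party is None:
            party = by_name.get(normalise_name(doc.get("party_name", "")))
            method = "name" if party is not None else "unlinked"
        # Record how the link was made so that name-based links can be reviewed.
        linked.append({**doc, "party": party, "link_method": method})
    return linked


# Hypothetical example records.
parties = [{"party_id": "P1", "name": "Example Party"}]
documents = [
    {"doc_id": "DOC-1", "party_id": "P1"},
    {"doc_id": "DOC-2", "party_name": "  example   PARTY "},
    {"doc_id": "DOC-3", "party_name": "Unknown"},
]
linked = link_documents(documents, parties)
```

Recording the link method alongside each record keeps the less reliable name-based matches distinguishable from identifier-based ones.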

The user can access the text-based data through the user interface, where it is offered both as plain text and as original copies in PDF or image formats. For further interoperability and reusability, data is also offered in CSV and spreadsheet formats. The RI also produces visual aids, such as graphs of the quantity and content of different types of documents, and timelines of political processes. The graphs will be downloadable in a standard image format, with a mention of the origins of the image. New types of documents added during the development of LAWPOL are expected to closely match the existing documents so that the same data formats can be used.

3. Ethical and legal compliance of personal or sensitive research data

The data produced by and contained in LAWPOL will be managed according to the Data management policy of the University of Turku (https://www.utu.fi/en/research/open-science/open-data). The infrastructure also has access to data management support at the University of Turku. The data produced will be made openly and freely accessible insofar as that is possible under the General Data Protection Regulation (GDPR). The data includes personal data such as names, titles and background organisations of the people involved in the political and legislative procedures. All personal data has previously been published as part of the original public documentation. The data concerning the implementation of law will include publicly available anonymised court decisions, which, while possibly containing some sensitive information, do not contain personal data and cannot as such be associated with the persons involved in the court cases. This data has been previously published in publicly available databases containing case law. Transparency and openness of administration and legislative procedure are regarded as key elements in ensuring democracy and the rule of law. The processing is necessary for the performance of a task carried out in the public interest (Article 6(1)(e) GDPR).

All members of the consortium commit to adhering to the previously mentioned data management policy when collecting new personal or sensitive data into the data repository. After the data has been pooled into the repository, UTU, as the principal owner of the RI, will be responsible for ensuring that the data in the repository is managed securely and according to the policy.

All users of the RI will be required to adhere to the Terms of use of LAWPOL specified in the LAWPOL portal, which specify the need to follow the Guidelines of the Finnish Advisory Board on Research Integrity. In addition, access to some sensitive research data requires a research and/or ethical permit. The executive board of LAWPOL will require proof of this permit before allowing access to such data.

4. Agreements on research data rights

LAWPOL contains data sourced by other parties, data created by the RI, and data produced by the users of the RI. The public source data used by LAWPOL is owned by the primary party responsible for its publication, for example the Prime Minister’s Office, other ministries, and the Parliament of Finland. The data is accessed via open application programming interfaces or replicated from public electronic or hard copies. The licenses in the APIs grant permission for reusing and republishing the data. The source of the data will be clearly stated to fully comply with the FAIR principles, and to allow the end-user to access the original copy, e.g. from the archives.

The primary owner of LAWPOL is the University of Turku; hence the data produced by the RI is also primarily owned by the University of Turku. Ownership will be contracted with the consortium members in due course, according to the university guidelines. All contracts will honor the goal of maintaining an open access infrastructure with as few restrictions of use as possible.

The ownership of the data produced by users with the help of the RI will depend on the contribution of the user. For example, datasets directly downloaded from LAWPOL are owned by the consortium similarly to data produced by the RI. However, when this data is enriched, the ownership of these additions will belong to the user who contributed them. Regardless of how ownership of this data is defined, neither the RI nor its owners will make any claims to the produced data other than those detailed in the Terms of use, namely mentioning the RI in publications.

The Terms of use of LAWPOL, which will define the ownership and user rights of the data produced with the RI, will be published in the LAWPOL portal. All persons have a right to use the data currently provided by the RI. Whenever the source data’s licensing scheme permits, the data the RI produces will be licensed under the CC BY-NC 4.0 license, meaning that only commercial use is restricted. All other use is allowed if LAWPOL is appropriately cited. The license does not restrict the licensing of data produced by further use of LAWPOL data. The RI recommends the use of open licenses wherever possible.

The source code of the infrastructure is currently owned by the University of Turku. When the development has progressed further, the source code will be released under an appropriate license such as the GNU GPLv3 or the MIT license.

The Terms of use of the RI will include instructions to mention in publications that the infrastructure was used in gathering or analyzing the research materials. All researchers using the infrastructure are required to inform the developers about all publications that the infrastructure has been used for. An example citation based on Data Citation Roadmap for Finland will be provided in the instructions.

5. Documentation and metadata

The infrastructure produces data matrices on the documents it contains, with an option to download metadata in a standard CSV or spreadsheet format. In addition to the already existing metadata associated with the source data, some linguistic data is provided, including the prevalence of search words in the materials included in the infrastructure, as well as the most prevalent words in each individual document. The materials that can be produced with the help of LAWPOL in the future include datasets tailored by the user, with the help of the digital workbench, for specific research on any legislative or political materials. The various forms and formats of data production, combined with unique identifiers matching the official identifiers, enable users to combine the data produced by LAWPOL with information from different sources. The generated datasets will include a README file specifying e.g. the origins of the data and the conditions of its use.
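The two kinds of linguistic data mentioned above, the prevalence of a search word across documents and the most prevalent words within a document, can be sketched with the Python standard library. This is only an illustrative sketch: the function names and the simple regular-expression tokeniser are assumptions, and unlike the RI itself the sketch does not lemmatize the text.

```python
import re
from collections import Counter

# Simplistic tokeniser (an assumption); the RI works on lemmatized text instead.
TOKEN = re.compile(r"\w+")


def tokenise(text):
    """Split text into case-insensitive word tokens."""
    return [token.casefold() for token in TOKEN.findall(text)]


def most_prevalent_words(text, n=3):
    """Return the n most frequent tokens of one document with their counts."""
    return Counter(tokenise(text)).most_common(n)


def search_word_prevalence(documents, word):
    """Map each document identifier to the number of occurrences of `word`."""
    target = word.casefold()
    return {doc_id: tokenise(text).count(target) for doc_id, text in documents.items()}


# Hypothetical documents keyed by made-up identifiers.
documents = {"DOC-1": "Law law bill", "DOC-2": "Bill"}
prevalence = search_word_prevalence(documents, "LAW")
top_word = most_prevalent_words(documents["DOC-1"], n=1)
```

Keying the results by document identifier keeps them joinable with the metadata matrices described in this section.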

Data quality is controlled by the structure of LAWPOL. It is a one-way pipeline, where a user can only edit the data downloaded onto their own device, never affecting the data provided to other users. The data provided by the infrastructure typically comes directly from the APIs of the parliament and government, so any human error in the information is also inherited from these databases. The infrastructure will maintain a web form for user feedback, which will allow the consortium to fix faulty data spotted by users. The developers are also in communication with the administrators of the government databases, so that feedback on faulty data may be shared and data fixed beyond LAWPOL itself. Some minor problems affecting the quality of data are to be expected when new sources of data are included in the RI. For example, the machine reading tools used may make mistakes when converting documents into a usable format. A notice of this possibility is included at the top of each document and a link to the original document is always provided. The consortium tests the infrastructure comprehensively, but user feedback is also vital for spotting such errors.

Furthermore, the consortium acknowledges that some data is lacking due to non-publication by the responsible ministries. These constraints are described in detail in the user guide section of the infrastructure, and shorter flags on possibly missing information are provided in the relevant parts of the infrastructure. In the future, the comprehensiveness of the data will be ensured by systematically reviewing missing data, acquiring it through means other than the APIs, including digitizing data from the National Archives, and incorporating it into the infrastructure.

For enhancing findability, all documents have a unique identifier and associated metadata that describe the data. The unique identifier, for example the number of the government bill, matches the official identifier used in the source of the data. The identifier is also visible through the user interface, enabling the user to access the same data later with ease.

To maximize reusability and compatibility with other services, LAWPOL will generate the structured metadata according to the same structures and codebooks that the originators of the data use, whenever feasible. Both the default metadata matrices and the user-tailored datasets will adhere to relevant community standards and use already existing identifiers and codebooks for the data, making the data easier to reuse. The linguistic data is formatted according to suitable metadata standards for language resources; the standard tentatively selected for this is the Open Language Archives Community (OLAC) metadata set.

In addition to detailed instructions covering the usage of LAWPOL, some specifically data-centered guidance is provided. Users will be able to access detailed descriptions of all metadata used by the RI in the catalogue of the University of Turku, to which the user interface will provide a link. LAWPOL also provides the user with the data management guidelines assembled by the University of Turku.

6. Access control, backup, storage and disposal of administrative and research data

The data LAWPOL uses and provides for the users is stored on the University of Turku’s own virtual servers. The IT services of UTU are responsible for the access control and monitoring of the services. The Data Security Description of the University of Turku describes the rules and policies the university’s IT services abide by (https://www.utu.fi/data-security-description). The development work of LAWPOL will follow good programming practices and the principles of secure software development, as well as security-by-default principles.

For security, the servers can only be reached from the university network domain. The connections to both the server and the database on it are protected with strong passwords and secure transfer protocols. To provide robust service with rapid responses, the data and processing load is distributed across as many servers as necessary. The source code for the LAWPOL software is stored on the university’s own version control server and will be made open source once a stable version has been reached. The network infrastructure is configured and maintained on-site by university staff, both physically and logically. Storage facilities, servers and other services are produced on-premises. The data center of the IT services is spread across three separate facilities on and near the campus. All servers are backed up daily. Backups can be restored within approximately 2 hours upon request during business hours, and any of the last 14 copies can be restored. The number of copies may be increased as necessary and upon request.
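The retention rule stated above (daily backups, with the last 14 copies available for restoring) can be illustrated with a short Python sketch. It does not describe how the university’s IT services actually implement backups; the function names are hypothetical and only the figure of 14 copies comes from the text.

```python
from datetime import date, timedelta

RETAINED_COPIES = 14  # number of restorable copies stated in the policy text


def prune_backups(backup_dates, keep=RETAINED_COPIES):
    """Split backup dates into (kept, discarded), keeping the newest `keep` copies."""
    ordered = sorted(set(backup_dates), reverse=True)
    return ordered[:keep], ordered[keep:]


def restorable(backup_dates, wanted, keep=RETAINED_COPIES):
    """Check whether the backup taken on `wanted` is still among the kept copies."""
    kept, _ = prune_backups(backup_dates, keep)
    return wanted in kept


# Hypothetical run of twenty consecutive daily backups.
daily_backups = [date(2024, 1, 1) + timedelta(days=i) for i in range(20)]
kept, discarded = prune_backups(daily_backups)
```

Raising `keep` corresponds to the option of increasing the number of copies upon request.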

The existing infrastructure does not provide any data storage services for its users. If registered users are later allowed to import their own data into the portal, the use of that data will be restricted to the original user, and the data will be removed once the user no longer needs it. The data in the LAWPOL data repository is stored permanently, so there are no plans to remove it. Should the RI come to the end of its lifespan, the long-term preservation of data will be ensured by depositing the text corpus accumulated in the LAWPOL system in the Language Bank of Finland for archiving and further research use.

The infrastructure is designed to last over time and update automatically, converting and lemmatizing text and creating vector embeddings as it updates. Hence, the amount of data is cumulative and depends, for example, on the amount of legislation drafted by governments, the number of rulings given by the courts, and the number of political agendas published. The current amount of data in LAWPOL is 27 GB for the political data and 28 GB for the legislative data. The amount of data expected to accrue before 2028 is described in Table 2 below.

	Preparatory works and policy documents	Political data	Legislation	Court rulings	Research articles	Total
Database	430 GB	2 TB	1 GB	1 TB	5 GB	~3.42 TB
Search engine	140 GB	650 GB	0.3 GB	325 GB	1.6 GB	~1.09 TB

Table 2. Amount of data
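The totals in Table 2 can be cross-checked with a few lines of Python. The sketch assumes that 1 TB is counted as 1024 GB and that the legislation figure for the database is 1 GB; both are assumptions, as the document does not state its unit convention.

```python
GB_PER_TB = 1024  # assumption: binary convention, not stated in the document

# Per-category estimates from Table 2, converted to GB.
database_gb = {
    "Preparatory works and policy documents": 430,
    "Political data": 2 * GB_PER_TB,
    "Legislation": 1,  # assumed to mean 1 GB
    "Court rulings": 1 * GB_PER_TB,
    "Research articles": 5,
}
search_engine_gb = {
    "Preparatory works and policy documents": 140,
    "Political data": 650,
    "Legislation": 0.3,
    "Court rulings": 325,
    "Research articles": 1.6,
}


def total_tb(sizes_gb):
    """Sum the per-category sizes (in GB) and convert the result to TB."""
    return sum(sizes_gb.values()) / GB_PER_TB


database_total = total_tb(database_gb)
search_engine_total = total_tb(search_engine_gb)
```

Under these assumptions both sums land within a hundredth of a terabyte of the approximate totals given in the table.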

7. Opening research data and/or metadata

Most of the data provided by the infrastructure is public record, and thus, in principle, it may be published on any platform. Due to the resource intensiveness of certain features of LAWPOL and the possible sensitivity of combined data, the services offered by LAWPOL, despite being generally accessible to all users without registration and fees, may also entail one or several of the following requirements: 1) registration by the user, 2) payment of a user fee, and 3) applying for a research permit from the data owner. Data management recommendations, intended particularly for research use, will be included in the LAWPOL portal.

The license of the RI, as well as the licenses of its source data, require naming the owner of the data when using them. The license the RI recommends to its users is the same open license that the RI itself uses for its data. For research users, LAWPOL recommends depositing the datasets produced with the help of the infrastructure in the Language Bank of Finland for long-term preservation.

All data that can be downloaded from the RI is uniquely identified with persistent official document numbers or other similar identifiers. When possible, these identifiers are directly human-readable for the convenience of the user.

The University of Turku will issue its own Digital Object Identifiers (DOIs) for descriptions of its own research data published in the University’s Data Catalog (https://data.utu.fi/catalog). The data contained in LAWPOL can also be described in this catalog and, as an infrastructure of the university, LAWPOL could use such a DOI in the future. However, the datasets produced via LAWPOL by users outside the university will not have this option.

The source code of the system will be opened for wider use. A final version of the source code will be archived in a public general repository after the end of the project.
