Data Management Policy for LAWPOL

1. General description of administrative data of RI

The administrative data processed in LAWPOL includes information about users and their access roles in the RI. For the basic tools, this information consists of IP addresses, which are collected to track the utilization rate.

This data is collected by LAWPOL in three ways. First, the IP addresses of all users are collected to track the utilization rate of the RI. These addresses are anonymized, and the original data is deleted within two weeks of collection. This data is disclosed to third parties only when a demonstration of usage is required, namely in applications for funding. Second, to use services that require registration, an ethical permission, or a purchase, the user must consent to the registration of personal data such as the user’s e-mail address and invoicing address. This kind of personal data will not be disclosed to third parties and will be stored encrypted. Third, research usage is tracked by requiring users to inform the consortium about publications for which the RI has been used.
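The policy does not state how the IP addresses are anonymized. Purely as an illustration, one common approach is to zero the host part of each address before it is stored. The sketch below assumes this truncation approach, with assumed prefix lengths of /24 for IPv4 and /48 for IPv6. The function name anonymize_ip is hypothetical and does not describe the actual LAWPOL implementation.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Return the address with its host part zeroed (illustrative only)."""
    ip = ipaddress.ip_address(address)
    # Assumed prefix lengths: keep the first 24 bits of an IPv4 address
    # and the first 48 bits of an IPv6 address.
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets ip_network accept an address with host bits set
    # and masks those bits away.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


# Addresses from the reserved documentation ranges, not real users.
print(anonymize_ip("192.0.2.57"))
print(anonymize_ip("2001:db8:abcd:12:34::1"))
```

Truncated addresses still allow visits to be counted per network, which is enough for tracking the utilization rate, while no longer pointing to a single device.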

As the principal owner of the LAWPOL infrastructure, the University of Turku carries the internal responsibilities regarding administrative data. The Privacy Notice of the LAWPOL infrastructure is available at https://lawpol.fi/tietosuojailmoitus-lawpol/ (in Finnish).

2. General description of research data managed within RI

The research infrastructure will comprise legislative and political materials that are either collected from existing digital sources, including but not limited to the open application interfaces of the Parliament, the Government Project Register Hankeikkuna and the Finnish Social Science Data Archive, or digitized from existing physical documents.

The data can be divided into the categories shown in Table 1.

Document type                                                                Plain text   PDF
Preparatory materials                                                        x            x
Government bills                                                             x            x
Committee materials                                                          x            x
Parliamentary debates                                                        x
Policy documents in which legislative modifications are envisioned           x            x
National legislation                                                         x
International treaties and statutes                                          x            x
Rulings of both national and international courts                            x
Open access articles                                                         x
Political agendas, manifestos, and information about the political parties   x            x
Translations for some of these documents (Swedish, Sámi, English)            x

All data that the RI pools together is originally in text or PDF format. The RI produces and stores lemmatized and machine-readable versions of the source texts in a text format and links them together using metadata. In addition, the RI stores embeddings, that is, vector representations of these texts, for the advanced AI tools. The user can access the text-based data through the user interface, and the data will be offered as plain text as well as in PDF, CSV, and spreadsheet formats for further interoperability and reusability. The RI also produces visual aids, such as graphs on the quantity and content of different types of documents and timelines of political processes. The graphs will be downloadable in a standard image format, with a mention of the origin of the image. New types of documents added during the development of LAWPOL are expected to closely match the existing documents, so that the same data formats can be used.

The infrastructure produces data matrices on all the documents mentioned above, with an option to download metadata in a standard CSV or spreadsheet format. In addition to the existing metadata associated with the source data, some linguistic data is provided, including the prevalence of search words in the materials included in the infrastructure and the most prevalent words in each individual document. In the future, the materials that can be produced with the help of LAWPOL will include datasets that users tailor with the digital workbench for specific research on any legislative or political materials. The various forms and formats of data production, combined with unique identifiers that match the official identifiers, enable users to combine the data produced by LAWPOL with information from other sources.
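As an illustration of how matching identifiers make such combinations possible, the sketch below joins a downloaded metadata file with an external dataset on a shared identifier column. It is a minimal example under stated assumptions: the column names, the sample rows, and the function join_on_identifier are all hypothetical and do not describe the actual LAWPOL export format.

```python
import csv
import io

# Hypothetical sample of a metadata export; column names are assumptions.
LAWPOL_CSV = """identifier,document_type,title
HE 1/2022 vp,government_bill,Example bill A
HE 2/2022 vp,government_bill,Example bill B
"""

# Hypothetical external dataset keyed by the same official identifier.
EXTERNAL_CSV = """identifier,committee
HE 1/2022 vp,Example committee
"""


def join_on_identifier(left_csv, right_csv, key="identifier"):
    """Left-join two CSV texts on a shared identifier column."""
    right = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        # Rows without a match in the external data are kept unchanged.
        merged.append({**row, **right.get(row[key], {})})
    return merged


rows = join_on_identifier(LAWPOL_CSV, EXTERNAL_CSV)
for row in rows:
    print(row)
```

Because the join key is the official document identifier rather than an internal ID, the same approach would work with any source that uses the official numbering.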

The infrastructure is designed to last over time and to update automatically every 24 hours, converting and lemmatizing texts as it updates. Hence, the amount of data is cumulative and depends, for example, on the amount of legislation drafted by governments, the number of rulings given by the courts, and the number of political agendas published. The current amount of legislative data in LAWRADAR is 26.5 GB, which corresponds to a little over 55,000 documents, converted and lemmatized. The amount of data expected to accrue between 2022 and 2028 is described in the table below.



Data category                            Database    Search engine
Preparatory works and policy documents   430 GB      140 GB
Political data                           2 TB        650 GB
Legislation                              1 GB        0.3 GB
Court rulings                            1 TB        325 GB
Research articles                        5 GB        1.6 GB
Total                                    ~3.42 TB    ~1.09 TB
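The totals in the table above can be cross-checked by summing the individual categories. The short sketch below does this under the assumption that 1 TB is counted as 1024 GB; the policy does not state which convention was used, and the variable names are illustrative.

```python
GB_PER_TB = 1024  # assumption: binary terabytes (1 TB = 1024 GB)

# Values from the table, converted to GB, in the order: preparatory works
# and policy documents, political data, legislation, court rulings,
# research articles.
sizes_gb = {
    "Database": [430, 2 * GB_PER_TB, 1, 1 * GB_PER_TB, 5],
    "Search engine": [140, 650, 0.3, 325, 1.6],
}

totals_tb = {name: sum(values) / GB_PER_TB for name, values in sizes_gb.items()}

for name, total in totals_tb.items():
    print(f"{name}: {total:.2f} TB")
```

Under this assumption, both sums land within about 0.01 TB of the totals stated in the table.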

Data quality is controlled by the structure of LAWPOL. It is a one-way pipeline, in which a user can only edit data downloaded onto their own device and can never affect the data provided to other users. The data provided by the infrastructure typically comes directly from the APIs of the parliament and government, so any human error in the information is inherited from these databases. The infrastructure will maintain a channel for user feedback, which will allow the consortium to fix faulty data spotted by users. The developers are also in communication with the administrators of the government databases, so that feedback on faulty data can be shared and the data fixed beyond LAWPOL itself.

Some bugs affecting data are to be expected, especially when new sources of data are included in the RI. For example, the machine reading tools may make mistakes when converting documents into a usable format. A notice of this possibility is included at the top of each document, and a link to the original document is always provided. The consortium tests the infrastructure comprehensively, but user feedback is also vital for spotting such bugs.

Furthermore, the consortium acknowledges that some data is lacking because the responsible ministries have not published it. These constraints are described in detail in the user guide section of the infrastructure, and shorter flags on possibly missing information are provided in the relevant parts of the infrastructure. In the future, the comprehensiveness of the data will be ensured by systematically reviewing missing data, acquiring it through means other than the APIs, including digitizing data from the National Archives, and incorporating it into the infrastructure.

3. Ethical and legal compliance of personal or sensitive research data

The data produced by and contained in LAWPOL will be managed according to the Data management policy of the University of Turku (https://www.utu.fi/en/research/open-science/research-data-and-data-policy). The infrastructure also has access to data management support at the University of Turku. The data produced will be made openly and freely accessible insofar as that is possible under the General Data Protection Regulation (GDPR). The legislative data includes personal data such as names, titles and background organisations of the people involved in the legislative procedures. All of this personal data has previously been published as part of the original public documentation. The data concerning the implementation of law will include publicly available anonymised court decisions, which, while possibly containing some sensitive information, do not contain personal data and therefore cannot be associated with the persons involved in the court cases. This data has previously been published in publicly available case law databases. Transparency and openness of administration and legislative procedure are regarded as key elements in ensuring democracy and the rule of law. The processing is necessary for the performance of a task carried out in the public interest (Article 6(1)(e) GDPR). A data privacy policy will be included in the LAWPOL portal, and all users are required to adhere to its terms (for the data privacy policy of the existing LAWRADAR infrastructure in Finnish, please see https://lakitutka.fi/tietosuoja).

4. Agreements on research data rights

The public source data used by LAWPOL is owned by the party primarily responsible for its publication, for example the Prime Minister’s Office, other ministries, and the Parliament of Finland. The data is accessed via open application programming interfaces, and the licenses grant permission to reuse and republish the data. The source of the data will be clearly stated to fully comply with the FAIR principles.

The primary owner of LAWPOL is the University of Turku; hence the data produced by the RI is also primarily owned by the University of Turku. Ownership will be agreed with the consortium members by contract in due course, according to the university guidelines. All contracts will honor the goal of maintaining an open access infrastructure with as few restrictions on use as possible.

The ownership of the data produced by users with the help of the RI will depend on the contribution of the user. For example, datasets downloaded directly from LAWPOL are owned by the consortium in the same way as data produced by the RI. However, when this data is enriched, the ownership of the additions will belong to the user who contributed them. Regardless of how ownership of this data is defined, neither the RI nor its owners will make any claims on the produced data other than those detailed in the Terms of use, namely mentioning the RI in publications.

The Terms of use of LAWPOL, which will define the ownership and user rights of the data produced with the RI, will be published in the LAWPOL portal. Everyone has the right to use the data currently provided by the RI. Whenever the source data’s licensing scheme permits, the data the RI produces will be licensed under the CC BY-NC 4.0 license, meaning that only commercial use is restricted. All other use is allowed provided that LAWPOL is appropriately cited. The license does not restrict the licensing of data produced by further use of LAWPOL data. The RI recommends the use of open licenses wherever possible.

The source code of the infrastructure is currently owned by the University of Turku. Once the infrastructure moves from beta testing to the final version, the source code will be released under an appropriate license such as the GNU GPLv3 or the MIT license.

5. Documentation and metadata

To enhance findability, all documents have a unique identifier and associated metadata that describe the data. The unique identifier, for example the number of the government bill, matches the official identifier used in the source of the data. The identifier is also visible in the user interface, enabling the user to access the same data later with ease.

To maximize reusability and compatibility with other services, LAWPOL will generate structured metadata according to the same structures and codebooks that the originators of the data use, whenever this is feasible. Both the default metadata matrices and the user-tailored datasets will adhere to relevant community standards and use existing identifiers and codebooks for the data, which makes the data easier to reuse. The linguistic data is formatted according to suitable metadata standards for language resources; the standard tentatively selected for this is the Open Language Archives Community (OLAC) metadata set.

In addition to detailed instructions covering the usage of LAWPOL, some specifically data-centered guidance is provided. Users will be able to access detailed descriptions of all metadata used by the RI in the catalogue of the University of Turku, to which the user interface will provide a link. LAWPOL also provides users with the data management guidelines assembled by the University of Turku.

6. Access control, backup, storage and disposal of administrative and research data

The data that LAWPOL uses and provides to users is stored on the University of Turku’s own virtual servers. For security, the servers can only be reached from the university network domain. Connections to both the server and the database on it are protected with strong passwords and secure transfer protocols. To provide a robust service with rapid responses, the data and processing load is distributed across as many servers as necessary. The source code of the LAWPOL software is stored on the university’s own version control server and will be published once a stable version has been reached. The existing infrastructure does not provide any data storage services for its users. If registered users are later allowed to import their own data into the portal, use of that data will be restricted to the original user, and the data will be removed once the user no longer needs it.

All servers are backed up daily. Upon request during business hours, a backup can be restored within approximately 2 hours from any of the last 14 copies. The number of copies may be increased as necessary and upon request. The network infrastructure is configured and maintained on-site by university staff, both physically and logically. Storage facilities, servers and other services are produced on-premises. The data center of IT services is distributed across three separate facilities on and near the campus.

The Data Security Description of University of Turku describes the rules and policies the university’s IT services abide by (https://www.utu.fi/data-security-description). The development work of LAWPOL will follow good programming practices and the principles of secure software development, as well as the so-called security by default principles.

If the RI reaches the end of its lifespan, the long-term preservation of the data will be ensured by depositing the text corpus accumulated in the LAWPOL system in the Language Bank of Finland for archiving and further research use.

The source code of the system will be opened for wider use by publishing the code in the Zenodo digital library at certain intervals already during the project. A final version of the source code will be archived in a public general repository after the end of the project.

7. Opening research data and/or metadata

Most of the data provided by the infrastructure is public record, and such data may in principle be published on any platform. The services offered by LAWPOL are generally accessible to all users without registration or fees. However, due to the resource intensiveness of certain features and the possible sensitivity of combined data, some services may entail one or more of the following requirements: 1) registration by the user, 2) payment of a user fee, and 3) applying for a research permit from the data owner. Data management recommendations, intended particularly for research use, will be included in the LAWPOL portal.

The Terms of use of the RI will include instructions to mention that the infrastructure was used in gathering or analyzing research materials. The RI will have its own DOI, which is to be included when citing data produced by the RI. All researchers using the infrastructure are required to inform the developers about all publications for which the infrastructure has been used. An example citation based on the Data Citation Roadmap for Finland will be provided in the instructions. The license of the RI, as well as the licenses of its source data, require naming the owner of the data when using it. The RI recommends that its users adopt the same open license that the RI itself uses for its data. For research users, LAWPOL recommends depositing the datasets produced with the help of the infrastructure in the Language Bank of Finland for long-term preservation.

All data that can be downloaded from the RI is uniquely identified with persistent official document numbers or other similar identifiers. When possible, these identifiers are directly human-readable for the convenience of the user.

The University of Turku will issue its own Digital Object Identifier (DOI) for descriptions of its own research data that have been published in the University’s Data Catalog (https://data.utu.fi/catalog). The data contained in LAWPOL can also be described in this catalog, and as a university infrastructure, LAWPOL could use such a DOI in the future. However, datasets produced via LAWPOL by users outside the university will not have this option.
