earchiving in action - european commission

eArchiving in Action Workshop on 25, 27, 28 January 2021 European Commission, DG Cnect Interactive Technologies, Digital for Culture and Education Unit Rehana Schwinninger-Ladak, Head of Unit, <[email protected]> Adelina Dinu <[email protected]> Fulgencio Sanmartín <[email protected]>

eArchiving in Action

Workshop on 25, 27, 28 January 2021

European Commission, DG Cnect

Interactive Technologies, Digital for Culture and Education Unit

Rehana Schwinninger-Ladak, Head of Unit, <[email protected]>

Adelina Dinu <[email protected]>

Fulgencio Sanmartín <[email protected]>

eArchiving in Action: Data Producers

eArchiving Workshop – 25th January 2021

DIGIT: Directorate-General for Informatics

DG Connect: Directorate-General for Communications Networks, Content and Technology

E-ARK Consortium

Agenda

• Demonstration

• Use Case

• Questions & Answers

• Panel Discussion

• Final Questions & Answers

Database preservation toolkit

Luís Faria

KEEP SOLUTIONS

https://www.introducingporto.com

Databases

The information that supports institutions and businesses is usually centralised in databases.

This information is of great value and needs to be preserved for decades for strategic and legal reasons.

The systems that hold this information are usually complex, with many software components supporting the business logic and the submission and presentation interfaces.

The information is usually laid out in a structure optimised for the database and the original business objectives (i.e. not organised in a user-friendly way).

The problem with preserving databases

• Every vendor has their own data types and export formats

• Information exported from one vendor’s system rarely works on another’s

• Sometimes an export does not even work on a different version of the same product

• Hence the need for a vendor-agnostic format based on standards

Preservation format criteria

• Ubiquity

• Stability

• Complexity

• Support

• Ease of identification and validation

• Interoperability

• Disclosure

• Intellectual Property Rights

• Viability

• Documentation quality

• Metadata support

• Re-usability

https://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf

SIARD: Software Independent Archiving of Relational Databases

• Database preservation format

• Based on international standards

• For database data, structure and behaviour

• Swiss national standard eCH-0165

• Now managed by DILCIS board and the EU eArchiving building block

https://dilcis.eu/content-types/siard

https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eArchiving
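
Because a SIARD archive is a ZIP container holding XML files, its contents can be inspected with general-purpose tools. The sketch below is an illustration only: it builds a tiny stand-in archive in memory, using the folder layout described in the SIARD 2 specification (`header/metadata.xml` plus `content/schemaN/tableN/tableN.xml`), and then lists the table files. The file contents and the helper names are hypothetical; a real archive should be checked against the specification itself.

```python
import io
import zipfile


def build_demo_archive() -> bytes:
    """Create a tiny in-memory ZIP that mimics the SIARD folder layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("header/metadata.xml", "<siardArchive/>")
        archive.writestr("content/schema0/table0/table0.xml", "<table/>")
        archive.writestr("content/schema0/table1/table1.xml", "<table/>")
    return buffer.getvalue()


def list_table_files(data: bytes) -> list[str]:
    """Return the table content files found in a SIARD-like ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    if "header/metadata.xml" not in names:
        raise ValueError("not a SIARD-like archive: header/metadata.xml missing")
    return sorted(
        name for name in names
        if name.startswith("content/") and name.endswith(".xml")
    )


if __name__ == "__main__":
    for name in list_table_files(build_demo_archive()):
        print(name)
```

No vendor software is needed to read the container, which is the point of a software-independent format.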

https://database-preservation.com

DBPTK: Database Preservation Toolkit

Set of tools to store relational databases in a standard archival format.

• DBPTK Desktop: Desktop application to save a database to a preservation format, validate it, and browse and search the content

• DBPTK Enterprise: Web application to browse and search the content of multiple large preserved databases

• DBPTK Developer: A command-line tool and development library for automation and system integration

DBPTK Desktop: Basic features

DBPTK Desktop features

Available for Windows and macOS. Also available on Linux.

DBPTK Desktop features

SIARD creation

Export database to a preservation format

• Connect to a local or remote database and save all content into a preservation format like SIARD

• Test connection will diagnose most common problems and provide you with helpful hints to solve them

Supported DBMS:

• Microsoft Access

• Microsoft SQL Server

• MySQL / MariaDB

• Oracle

• PostgreSQL

• Progress OpenEdge

• Sybase
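
The idea of a connection test that turns low-level failures into hints can be sketched in a few lines. This is not DBPTK's implementation, only a minimal illustration of the approach: the function name, the hint texts and the port table are assumptions, and the listed ports are merely the usual defaults for those systems.

```python
import socket

# Commonly used default ports; real deployments may differ.
DEFAULT_PORTS = {
    "mysql": 3306,
    "postgresql": 5432,
    "sqlserver": 1433,
    "oracle": 1521,
}


def diagnose_connection(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a plain TCP connection and return a human-readable hint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok: the server accepts TCP connections"
    except socket.gaierror:
        return "hint: host name could not be resolved, check the spelling or DNS"
    except socket.timeout:
        return "hint: connection timed out, a firewall may be blocking the port"
    except ConnectionRefusedError:
        return "hint: connection refused, check the port and that the DBMS is running"
    except OSError as error:
        return f"hint: network error ({error})"


if __name__ == "__main__":
    print(diagnose_connection("localhost", DEFAULT_PORTS["postgresql"]))
```

A TCP check only covers reachability; wrong credentials or a missing database would need a real driver-level login attempt.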

DBPTK Desktop features

Migration report

Detailed report of migration changes and losses

• All export and selection parameters are presented.

• All mappings of column data types to standard types are recorded.

• All compromises are documented.
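
A report of this kind can be pictured as a list of per-column mapping records. The sketch below is a simplified illustration, not the toolkit's actual mapping table: the entries in `TYPE_MAP`, the fallback rule and the note texts are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Illustrative mapping from vendor types to SQL standard types.
# The second field notes any compromise; these entries are assumptions.
TYPE_MAP = {
    "VARCHAR2": ("CHARACTER VARYING", None),
    "TINYINT": ("SMALLINT", "wider integer type than the source"),
    "MEDIUMTEXT": ("CHARACTER LARGE OBJECT", None),
    "DATETIME": ("TIMESTAMP", None),
}


@dataclass
class MappingEntry:
    column: str
    source_type: str
    standard_type: str
    note: str


def build_report(columns: dict[str, str]) -> list[MappingEntry]:
    """Map each column type and record what happened, including compromises."""
    report = []
    for column, source_type in columns.items():
        if source_type in TYPE_MAP:
            standard_type, compromise = TYPE_MAP[source_type]
            note = compromise or "direct mapping"
        else:
            # Unknown types fall back to text so the value is still kept.
            standard_type = "CHARACTER VARYING"
            note = "unknown type, stored as text"
        report.append(MappingEntry(column, source_type, standard_type, note))
    return report


if __name__ == "__main__":
    for entry in build_report({"id": "TINYINT", "shape": "GEOMETRY"}):
        print(entry)
```

Recording the note next to every column is what lets a later reader judge whether a value in the archive is exactly what the source system held.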

DBPTK Desktop features

Edit SIARD metadata

Enrich archived database with descriptions

● Add descriptions to the database, tables and columns to make the contents easier to understand.
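
Since the descriptions live in an XML metadata file, enriching them amounts to setting a `description` element. The following is a rough sketch on a deliberately simplified stand-in document: real SIARD metadata uses an XML namespace and many more elements, and the function name and sample values here are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a SIARD metadata.xml (real files use an XML namespace).
METADATA = """
<siardArchive>
  <schemas><schema><name>public</name>
    <tables><table><name>person</name>
      <columns><column><name>birth</name></column></columns>
    </table></tables>
  </schema></schemas>
</siardArchive>
"""


def describe_column(root, table_name, column_name, text):
    """Add or replace the description of one column; return True if found."""
    for table in root.iter("table"):
        if table.findtext("name") != table_name:
            continue
        for column in table.iter("column"):
            if column.findtext("name") == column_name:
                description = column.find("description")
                if description is None:
                    description = ET.SubElement(column, "description")
                description.text = text
                return True
    return False


root = ET.fromstring(METADATA)
describe_column(root, "person", "birth", "Date of birth (ISO 8601)")
print(ET.tostring(root, encoding="unicode"))
```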

DBPTK Desktop features

SIARD validation

Validate archived database

● Validate a SIARD file against the specification, plus many additional checks for a thorough validation.

DBPTK Desktop features

Search records

Browse and search database content

● Google-like search on the database content.

● Drill down on specific tables and do advanced search for specific fields to find exactly what you are looking for.

DBPTK Desktop features

Auto-update

Automatic check for updates

● Stay up to date with an automatic update check on startup and installation of new versions.

DBPTK Enterprise: Basic features

DBPTK Enterprise features

Enterprise architecture

For large institutions with many databases and users

● A web application that can be horizontally scaled to support many very large databases being accessed by many users.

DBPTK Enterprise features

Manage multiple databases

Single system, multiple databases

● Search through the databases, manage their status, enrich their metadata, validate them, make them ready for users to search.

DBPTK Enterprise features

Data transformation

Transform content to answer useful questions

● De-normalisation and table and column hiding, to simplify browsing and allow anonymisation of content

DBPTK Enterprise features

Data transformation (aka denormalisation)

person

Name      Birth        City name       Mayor   Country name
Mary      1986-03-28   Payne Springs   Mary    United States
Phillip                Rosenhayn               United States
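
Denormalisation of this kind can be illustrated with a join that follows the foreign keys and materialises one flat table. The sketch below uses SQLite purely for convenience and is not how DBPTK Enterprise does it internally; the table layout, the column names and the `NULL` birth date for Phillip are assumptions, and the mayor relation is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT,
                       country_id INTEGER REFERENCES country(id));
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birth TEXT,
                         city_id INTEGER REFERENCES city(id));
    INSERT INTO country VALUES (1, 'United States');
    INSERT INTO city VALUES (1, 'Payne Springs', 1), (2, 'Rosenhayn', 1);
    INSERT INTO person VALUES (1, 'Mary', '1986-03-28', 1),
                              (2, 'Phillip', NULL, 2);
""")

# Materialise a flat, browsable table by following the foreign keys.
conn.execute("""
    CREATE TABLE person_flat AS
    SELECT p.name AS name, p.birth AS birth,
           c.name AS city_name, k.name AS country_name
    FROM person p
    JOIN city c ON c.id = p.city_id
    JOIN country k ON k.id = c.country_id
""")

rows = conn.execute("SELECT * FROM person_flat ORDER BY name").fetchall()
for row in rows:
    print(row)
```

Leaving a column out of the `SELECT` list has the same effect as column hiding: the flat table that users browse simply never contains it.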

DBPTK Enterprise features

Single sign-on

Support for multiple protocols

● LDAP, Active Directory, Database, SAML, ADFS, OAuth2, OpenID, Google, Facebook, Twitter, FIDO U2F, YubiKey, Google Authenticator, Authy, etc.

● Supports internal authorisation definition or configurable external authorisation

DBPTK Enterprise features

Browse and search

Allow users to access database content on the Web

● Allow them to search prepared, user-friendly and anonymised database content

DBPTK Enterprise features

Export features

Export data into tabular data

● Allow users to save search results in Microsoft Excel or another spreadsheet format for easy analytics and diagrams

DBPTK Enterprise features

Activity log

Audit every access

● Who has done what, when and from where.

● Requirement for ISO 16363 certification.
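
An audit trail that answers "who, what, when and from where" can be modelled as one structured record per access. The sketch below is a generic illustration rather than the product's log format: the field names, the JSON-lines encoding and the sample values are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    who: str         # authenticated user
    what: str        # action performed
    when: str        # UTC timestamp in ISO 8601
    from_where: str  # client address


def record_access(log: list, user: str, action: str, address: str) -> AuditEntry:
    """Append one access record to the log and return it."""
    entry = AuditEntry(
        who=user,
        what=action,
        when=datetime.now(timezone.utc).isoformat(),
        from_where=address,
    )
    log.append(json.dumps(asdict(entry)))
    return entry


log: list[str] = []
record_access(log, "archivist01", "search table=person", "192.0.2.10")
print(log[0])
```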

DBPTK Enterprise & Desktop

Multiple languages supported

Interface translated into: English, German, Estonian, Czech, Portuguese

Search stemming and stopwords support for: English, Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian, Italian, Latvian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Thai, Turkish, Japanese (using morphological analysis), CJK bigram (Chinese, Japanese, and Korean languages)

DBPTK Developer: Basic features

DBPTK Developer features

Command line interface

Automation of periodic preservation tasks

● Command line interface allows easy automation of periodic tasks like saving a database to a preservation format, validating, and editing metadata.

DBPTK Developer features

Systems integration

Java library

● Library that allows production systems to use database preservation features directly.

DBPTK Developer features

Open source

For custom development

● Code base that allows custom development of new features or specialised support for new or legacy database systems.

And many more features

For archiving databases:

• SSH Tunnel

• Selection of tables and columns

• Selection and materialisation of views

• Custom views

• External files (files stored outside the DB)

• External files via SSH tunnel

• Automated quality assurance

• Save LOBs outside SIARD file

• Migrate from SIARD to SIARD

• Migrate from SIARD to live DBMS

• Convert Oracle geodata

For accessing archived databases:

• Configure visible tables

• Configure visible columns

• Set column name, description and order

• Binary columns advanced options

• REST API

• Load on access and auto-unload

How can DBPTK be useful for data creators?

• To archive and provide access to:

○ Legacy databases

○ Legacy information systems that are supported by databases

○ Production databases or systems (snapshots or incremental)

• To restore archived databases into modern database management systems

• To alleviate the load on production systems

Contact us

© European Union, 2017. All rights reserved. Certain parts are licensed under conditions to the EU. Reproduction is authorized provided the source is acknowledged.

[email protected]

More information at: https://database-preservation.com

See full webinar (#6) on https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eArchiving+webinar+Series+2020

Video at https://youtu.be/D-MZS1vloWc?t=1973

Use Case

Norwegian National Health Archive

Source: Woodcon

Stephen Mackey, Piql

Hanne Mari Hindklev, Norwegian Health Archives

National Health Archive Project

• A Digital Preservation System built to preserve both digitised and electronically created patient archives in perpetuity

• The HARI (Health Archive Register Index) keeps track of the journals and metadata in the Archive

• The standard for digitised material is as close to the electronically created material as possible

• A separate standard for a “submission index” that populates both the metadata in the register and in the SIP

• An EHR validation tool, created to check and analyse the structure of an EHR extract before the EPJ-SIP is made

● Virtualised onsite environment

● Automated deployment (Ansible, Ansible Tower)

● Test, QA and Production environments

● Expected throughput

○ 350,000 journals per year

○ 450 GB per day
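
To get a feel for these figures, they can be converted into average rates. The arithmetic below is only a back-of-the-envelope sketch: it assumes intake is spread evenly over every day of the year and around the clock, uses decimal units (1 GB = 1000 MB), and treats the two figures independently, since the slide does not say how they relate.

```python
JOURNALS_PER_YEAR = 350_000
GB_PER_DAY = 450

journals_per_day = JOURNALS_PER_YEAR / 365
# Decimal units assumed: 1 GB = 1000 MB.
mb_per_second = GB_PER_DAY * 1000 / (24 * 60 * 60)
mbit_per_second = mb_per_second * 8

print(f"{journals_per_day:.0f} journals per day on average")    # 959
print(f"{mb_per_second:.1f} MB/s sustained "                    # 5.2 MB/s
      f"({mbit_per_second:.1f} Mbit/s)")                        # 41.7 Mbit/s
```

Real ingest is likely to be burstier than this average, so peak capacity would need to be higher.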

Use Cases for a Health Archive

The Norwegian regulation envisions two possible use cases for the archive once built:

o provide records to next of kin in compliance with open information regulation

o harvest the vast amount of historical healthcare-related data within the archive for medical research

There is no limit on the age of the records that hospitals present to the NHA, so they consist of both physical and electronic patient records.

NHA EPJARK and DPJARK

• Norwegian standards for extraction of Patient Medical Records from source EHR/EMR systems or digitisation of Journals

• Legislation defines the use cases for the archive

• The standards define the metadata (patient personal and clinical) that should be included in archival packages

• The standards present a taxonomy for archiving of Patient Medical Records (Case, Sub-case, Documents, File)

But,

• The standards used are not based on international medical (excluding ICD) or archiving metadata standards (e.g. METS, PREMIS, FHIR)

EPJ Submission Case Taxonomy

• Single Document

• Multiple Cases, Documents

• Multiple Cases, Sub-cases, Documents
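
The Case, Sub-case, Document and File levels form a simple tree, which can be sketched with nested data classes. This is an illustrative model only, not the EPJARK schema: the class and field names and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    name: str


@dataclass
class Document:
    title: str
    files: list[File] = field(default_factory=list)


@dataclass
class Case:
    title: str
    documents: list[Document] = field(default_factory=list)
    sub_cases: list["Case"] = field(default_factory=list)


def count_files(case: Case) -> int:
    """Count files in a case, including those in nested sub-cases."""
    own = sum(len(doc.files) for doc in case.documents)
    return own + sum(count_files(sub) for sub in case.sub_cases)


# Simplest submission: a single document holding one file.
simple = Case("Journal", documents=[Document("Scan", [File("scan.pdf")])])

# More complex submission: a case with its own document and a sub-case.
surgery = Case(
    "Surgery",
    documents=[Document("Report", [File("a.pdf"), File("b.pdf")])],
)
complex_case = Case(
    "Admission",
    documents=[Document("Referral", [File("referral.xml")])],
    sub_cases=[surgery],
)

print(count_files(simple), count_files(complex_case))  # 1 3
```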

NHA Lessons Learnt

• There are two big suppliers of EHR systems in Norway, and both want to be part of defining the workflow

• A project has started to test both extracting EHRs and transferring the data

• EPJARK and the associated AVLXML standard need to be understood by the vendors of the EHR systems

• The EHR validation tool is important to avoid going into a loop

• Limits on the EHR extract need to be defined (maximum size in GB, number of patients, etc.)

eHealth1 Content Information Type Specification (CITS)

Defined in the CEF Telecom Call for Proposals 2019 as “… specifications for eHealth will be developed by the Activity. One specification will be based upon the Norwegian eHealth archives transfer format of patient journals (from provider EMR systems to a central health archive).”

https://www.shutterstock.com/

eHealth1 Specification – Summary

• Builds on the Common Specification (CSIP) and package specifications (SIP, DIP, AIP) structures

• Uses NHA use cases as foundation

• Submission agreements are mandatory

• Extractions in Case/Sub-case/Document/File structure (from simple to complex) based on EPJ specification

• Makes allowance for encapsulated bitstreams (such as DICOM)

• Can be used in digitisation programs or for born digital extractions

• The specification does not consider extraction from centralised EHR systems or submission via CDAs, but this is a possible future enhancement

eHealth1 Specification - Metadata

• Extensible descriptive metadata model

• Builds on the Common Specification (CSIP) through use of METS (Metadata Encoding and Transmission Standard) and PREMIS (Preservation Metadata Implementation Strategies)

• Patient-centric: recommends use of the FHIR Patient resource

• Extensible clinical metadata: recommends use of FHIR resources such as Condition, AllergyIntolerance, Procedure, etc.
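
For readers unfamiliar with FHIR, the sketch below shows what a minimal Patient resource and a Condition that refers to it look like as JSON. All identifiers and values are invented for illustration, and how the eHealth1 specification embeds or references such resources inside a package is not shown here.

```python
import json

# All identifiers and values below are made up for illustration.
patient = {
    "resourceType": "Patient",
    "id": "example-patient",
    "name": [{"family": "Nordmann", "given": ["Kari"]}],
    "gender": "female",
    "birthDate": "1950-01-31",
}

condition = {
    "resourceType": "Condition",
    "id": "example-condition",
    # Links the clinical metadata back to the patient it describes.
    "subject": {"reference": f"Patient/{patient['id']}"},
    "code": {
        "coding": [
            {
                "system": "http://hl7.org/fhir/sid/icd-10",
                "code": "I10",
                "display": "Essential (primary) hypertension",
            }
        ]
    },
}

print(json.dumps([patient, condition], indent=2))
```

Keeping the clinical resources separate and linking them through `subject` is what makes the model patient-centric and extensible: further resource types can be added without changing the Patient record.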

eHealth1 - Next Steps

• Software development

• eHealth1 SIP Creator tool (November 21)

• Pilot implementation of an eHealth archiving solution based on piql/NHA and E-ARK software

Questions & Answers

Break 12:15 – 12:30

Panel discussion

Moderator: Carlota Bustelo, Gabinete Umbus SL

Pont Royal seen from Quai Voltaire, Christoffer Wilhelm Eckersberg, Statens Museum for Kunst

The European directive on open data and FAIR principles: Impact on long-term preservation of government and research data

The Directive on open data and the re-use of public sector information provides a common legal framework for a European market for government-held data (public sector information). Although it focuses on public sector information, its transposition into national law also interlinks with research data and the adoption of FAIR principles. This panel brings together government officials and research communities to debate the impact of this directive on long-term data preservation across the EU member states.

Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information

- Whereas statement #59

Member States should also facilitate the long-term availability for re-use of public sector information, in accordance with the applicable preservation policies

- Article 9. Practical Arrangements

Member States shall also encourage public sector bodies to make practical arrangements facilitating the preservation of documents available for re-use

After discussion, consultations and adoption as principles in 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data

https://www.panosc.eu/data/fair-principles/

Speakers

José Borbinha, INESC-ID, Lisbon University

Joy Davidson, Digital Curation Centre and University of Glasgow

Igor Kuzma, Statistical Office of the Republic of Slovenia

Andreas Rauber, Technical University, Vienna

Daniele Rizzi, European Commission – Unit G1: Data Policy and Innovation

Questions

1. In your opinion, what is the relevance of digital preservation for the re-use of open data?

2. What should be the role of archives and other memory institutions in the preservation and re-use of open data?

3. In your experience, is there a need for different projects to converge on common standards? In the projects you are involved in, what steps should be taken to achieve this?

4. How can the eArchiving Building Block support the implementation of the Open Data Directive?

Final Questions & Answers