Technical Overview

This page provides an overview of the technical solution used to build and present the I.Sicily digital corpus. The project was designed with sustainability, modularity, and openness as guiding principles.

Technologies and Processes

Data Standards

The core dataset for the project relies on the EpiDoc TEI-XML standard. This is used to encode all information about the inscriptions and the inscribed objects, as well as the actual text itself.
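
As a rough illustration of what this encoding makes possible, the sketch below reads a few fields from a minimal EpiDoc-style record using only the Python standard library. The sample XML is invented for this page and is far simpler than a real I.Sicily file; the function name and the choice of fields are assumptions, not the project's own code.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample for illustration only; not an actual I.Sicily record.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample funerary inscription</title></titleStmt>
      <sourceDesc>
        <msDesc>
          <physDesc>
            <objectDesc>
              <supportDesc>
                <support><material>marble</material></support>
              </supportDesc>
            </objectDesc>
          </physDesc>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="edition" xml:lang="la">
        <ab><lb n="1"/>Dis Manibus</ab>
      </div>
    </body>
  </text>
</TEI>"""


def summarise(xml_text: str) -> dict:
    """Return the title, support material, language and plain text of one record."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    material = root.findtext(".//tei:support/tei:material", default="", namespaces=TEI_NS)
    edition = root.find(".//tei:div[@type='edition']", TEI_NS)
    if edition is None:
        return {"title": title, "material": material, "language": None, "text": ""}
    # Flatten the edition to plain text and collapse runs of whitespace.
    text = " ".join("".join(edition.itertext()).split())
    return {
        "title": title,
        "material": material,
        "language": edition.get(XML_LANG),
        "text": text,
    }
```

Calling `summarise(SAMPLE)` returns a dictionary holding the object metadata (title, material) alongside the language and text of the edition, which is the combination of information the standard is used for here.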

Data is standardised and made potentially interoperable by the use of recognised vocabularies:

Development

The project solution is built using a monorepo structure with two main components:

  • An ETL (Extract, Transform, Load) process for handling and enriching XML files
  • A static website built with SvelteKit configured as a Static Site Generator (SSG)

The build produces plain HTML, CSS, and JavaScript files that can be served by any standard web server (such as Nginx or Apache). The site therefore does not require Node.js or any other application server to be running.

Key frontend dependencies include:

Data Model

The project processes EpiDoc TEI XML files, enriches the original input corpus, and presents it on a static website. Individual editions are published as HTML pages, can be searched and filtered, and are freely available for download.

High resolution images, where available, are presented via a IIIF server.
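
For readers unfamiliar with IIIF, the sketch below shows how an image request is typically composed under the IIIF Image API URI pattern (`{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). The server address and image identifier are hypothetical, and this page does not state which version of the Image API the project's server implements, so the default values are assumptions.

```python
from urllib.parse import quote


def iiif_image_url(
    base: str,
    identifier: str,
    region: str = "full",
    size: str = "max",
    rotation: str = "0",
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Compose an IIIF Image API request URL from its path segments."""
    # Reserved characters in the identifier (such as "/") must be percent-encoded.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
url = iiif_image_url("https://iiif.example.org/iiif/3/", "inscription-0001.tif", size="800,")
```

Here `size="800,"` requests a derivative 800 pixels wide with the height scaled proportionally, so a page can load a modest preview and request larger regions only when a reader zooms in.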

Annotation Layers

The corpus integrates multiple annotation layers, each feeding metadata into the inscription pages, faceted search, and other dynamic pages:

Linguistic Annotation

Linguistic annotation has been undertaken on a Greek and Latin subset of the corpus, alongside tokenisation and lemmatisation of the complete corpus.
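
A minimal sketch of how token and lemma pairs could be read back out of an edition, assuming that tokens are encoded as TEI `<w>` elements carrying a `lemma` attribute. This page does not describe the actual encoding used by the project, so both the sample and the function name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment for illustration only; assumes <w lemma="..."> markup.
SAMPLE_AB = """<ab xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="deus">Dis</w> <w lemma="manes">Manibus</w>
</ab>"""


def tokens_with_lemmas(xml_text: str) -> list:
    """Return (token, lemma) pairs for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for w in root.iter(f"{TEI}w"):
        # itertext() also gathers text nested in child markup inside the token.
        token = "".join(w.itertext()).strip()
        pairs.append((token, w.get("lemma")))
    return pairs
```

Pairs of this kind are the sort of metadata that can feed a lemma-based search or a word index.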

Palaeographic Annotation

Palaeographic annotation is conducted through a dedicated digital palaeographic environment, which enables the assignment of letter typologies to linked images and texts. The environment uses Web Annotations that bind graphs as they appear on inscription images with their occurrences in the EpiDoc edition and the formal description of their structure.
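
To make the binding concrete, the sketch below builds one annotation following the W3C Web Annotation Data Model: a region of an image and a location in the edition are the two targets, and a textual description is the body. All URLs, the XPath, and the description are hypothetical, and the project's own annotations may be structured differently.

```python
import json

ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"
MEDIA_FRAGS = "http://www.w3.org/TR/media-frags/"


def graph_annotation(anno_id, image_url, xywh, edition_url, xpath, description):
    """Build a Web Annotation linking an image region, an edition node and a description."""
    x, y, w, h = xywh
    return {
        "@context": ANNO_CONTEXT,
        "id": anno_id,
        "type": "Annotation",
        "motivation": "describing",
        # Body: the formal description of the letter form (hypothetical wording).
        "body": {"type": "TextualBody", "value": description, "format": "text/plain"},
        "target": [
            {
                # Target 1: the rectangular region of the image showing the graph.
                "type": "SpecificResource",
                "source": image_url,
                "selector": {
                    "type": "FragmentSelector",
                    "conformsTo": MEDIA_FRAGS,
                    "value": f"xywh={x},{y},{w},{h}",
                },
            },
            {
                # Target 2: the matching occurrence in the EpiDoc edition.
                "type": "SpecificResource",
                "source": edition_url,
                "selector": {"type": "XPathSelector", "value": xpath},
            },
        ],
    }


# Hypothetical identifiers, for illustration only.
anno = graph_annotation(
    "https://example.org/annotations/1",
    "https://iiif.example.org/iiif/3/inscription-0001.tif/full/max/0/default.jpg",
    (120, 80, 40, 55),
    "https://example.org/editions/inscription-0001.xml",
    "//div[@type='edition']//g[1]",
    "alpha with broken crossbar",
)
anno_json = json.dumps(anno, indent=2)
```

Because the result is plain JSON-LD, annotations of this shape can be stored as files next to the editions and read by any tool that understands the Web Annotation model.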

Petrographic Annotation

Petrographic analysis has been undertaken on hundreds of items across the corpus. The dataset includes raw and processed data from (geo)chemical and minero-petrographic analyses on epigraphic supports, supporting the identification of rock types and their provenance.

Workflows

The following diagram illustrates the corpus building workflow:

Architecture

The project uses a monorepo structure with the following components:

Directory          Description
packages/etl/      ETL package for processing XML
frontend/          Static site generator web application
data/raw/          Git submodule for the EpiDoc files
data/processed/    Output data generated by the ETL process
xslt/epidoc/       Git submodule for XSLT stylesheets
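
The flow from data/raw/ to data/processed/ can be pictured with the following minimal sketch. It is not the project's ETL package: the function names, the JSON output format, and the derived `word_count` field are assumptions chosen only to show the extract, transform, and load steps in order.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def transform(xml_path: Path) -> dict:
    """Extract a few fields from one EpiDoc file and add a derived value."""
    root = ET.parse(xml_path).getroot()
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=NS)
    edition = root.find(".//tei:div[@type='edition']", NS)
    text = " ".join("".join(edition.itertext()).split()) if edition is not None else ""
    return {
        "id": xml_path.stem,
        "title": title.strip(),
        "text": text,
        # Hypothetical enrichment: a value computed from the source, not stored in it.
        "word_count": len(text.split()),
    }


def run_etl(raw_dir: Path, processed_dir: Path) -> list:
    """Read every XML file in raw_dir and write one JSON record per file."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(raw_dir.glob("*.xml")):
        record = transform(xml_path)
        out_path = processed_dir / f"{xml_path.stem}.json"
        out_path.write_text(
            json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

A call such as `run_etl(Path("data/raw"), Path("data/processed"))` would leave the source files untouched and regenerate the output directory, which keeps the raw corpus and the derived data cleanly separated.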

Deployment

The site is available both as a staging instance for testing and as a public production site served via GitHub Pages.

The granular, standards-compliant data files (DTS, EpiDoc, Web Annotations, etc.) are versioned on and served from public code repositories, independently of the software. This requirement for portability and sustainability informed all design and development work.

Design Process

The design of the site was informed by initial discussions around the information architecture and by a review of static mock-ups. A usability workshop with a group of representative prospective users identified bugs as well as user interface and user experience refinements, which were prioritised collaboratively and integrated into the current interface.

Community Value

By integrating different layers of annotation in the same publication, the site brings benefits to multiple communities:

  • Researchers: Multidisciplinary researchers involved in the project are empowered to carry out data quality checks and analysis on the integrated corpus
  • Museums: Repositories where the actual inscribed objects are held gain visibility for, and enrichment of, their collections within the same integrated digital space
  • DH Teams: Other collaborative Digital Humanities teams may adopt similar solutions for online cultural heritage corpora that rely on open standards and open architectures

Source Code

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (CROSSREADS, grant agreement no. 885040).