Technical Overview

This page provides an overview of the technical solution used to build and present the I.Sicily digital corpus. The project was designed with sustainability, modularity, and openness as guiding principles.

Technologies and Processes

Data Standards

The core dataset for the project relies on the EpiDoc TEI-XML standard. This is used to encode all information about the inscriptions and the inscribed objects, as well as the actual text itself.

Data is standardised and made potentially interoperable by the use of recognised vocabularies:

Development

The project solution is built using a monorepo structure with two main components:

  • An ETL (Extract, Transform, Load) process for handling and enriching XML files
  • A static website built with SvelteKit configured as a Static Site Generator (SSG)

The site generates plain HTML, CSS, and JavaScript files that can be served by any standard web server (like Nginx or Apache). This means it does not require a Node.js or any other application server to be running.

Key frontend dependencies include:

Data Workflows and Models

The project workflow processes EpiDoc TEI XML files, enriches the original input corpus, and presents it on a static website. Individual editions are published as HTML pages but can also be searched and filtered, as well as being freely available for download.

High resolution images, where available, are presented via a IIIF server.

Annotation Layers

The corpus integrates multiple annotation layers, each feeding metadata into the inscription pages, faceted search, and other dynamic pages:

Linguistic Annotation

Linguistic annotation has been undertaken on a Greek and Latin subset of the corpus, alongside tokenisation and lemmatisation of the complete corpus.

Palaeographic Annotation

Palaeographic annotation is conducted through a dedicated digital palaeographic environment with its own web application and interface (Annotator) also served and documented via a dedicated Github repository. The environment lets the user define the palaeographic structure of allographs and then create thousands of Web Annotations (Web Annotation Data Model) that bind graphs as they appear on inscription images (using IIIF Image API region format) with their occurrences in the EpiDoc edition (fetched from a Distributed Text Services (DTS) collection) and the formal description of their structure. A faceted search interface allows the researcher to filter graph annotations by their structural patterns and define a high-level typology of allographs. The list of letter types identified for each annotated inscription are then added to the TEI file and so visible on the corpus website with links back to the palaeography environment. These types are also searchable from the ‘Lettering’ filter in the list of facet options, thus integrating them with the rest of the exploration possibilities within the entire corpus, even though only a selection of inscriptions underwent palaeographic analysis.

Petrographic Annotation

Petrographic analysis has been undertaken on hundreds of items across the corpus. The dataset includes raw and processed data from (geo)chemical and minero-petrographic analyses on epigraphic supports, supporting the identification of rock types and their provenance. Geology-specific vocabularies (The BGS Rock Classification Scheme) are used as reference to update the XML files with material and material provenance data. Materials description has been augmented via a dedicated RSE-supported workflow which aggregates the multi-analytical data collected by the project’s material scientist and supports (pre)processing and analysis of different data by streamlining repetitive tasks and simplifying data interpretation, eventually leading to the identification of the rocks where the texts (in Latin, Greek and other languages) are inscribed, and therefore their provenance. This research workflow is summarised in technical documentation (ADD diagram below), research dissemination (e.g. Coccato et al., 2025a; Coccato et al., 2025b; Ciula et. al., 2025; Ciula, 2025) and the actual code used to implement the petrographic research environment.

Flowchart showing the petrographic analysis workflow. An Inscribed Object yields a Sample and Digital Microscopy images. The Sample is divided into Powders and Fragments. Powders undergo pXRF spectral analysis and XRD diffractometry. Fragments are examined via Digital and Optical Microscopy. All four analytical results feed into a decision point: is the material Marble? If no, the workflow proceeds directly to Analysis and Interpretation. If yes, Isotopic Analysis and Maximum Grain Size comparisons are performed before reaching the final Analysis and Interpretation stage.

Figure 1: Petrographic analysis workflow, from sample preparation through multi-analytical characterisation to rock identification and provenance determination.

Workflows

Flowchart showing the Corpus Building workflow. EpiDoc and XSLT submodules feed into an ETL Process, which also fetches citations from the Zotero API. The ETL Process transforms XML via Saxon-JS to produce processed HTML fragments, and extracts data into JSON files. Both the processed HTML and JSON files, along with Markdown files, feed into the Static Site Generator, which also uses the JSON files as a search index. The generator produces static HTML pages that serve Inscriptions, Search and Filters, Interactive Visualisations, and Open Data Endpoints.

Figure 2: Corpus Building workflow, from EpiDoc source files through ETL processing and static site generation to the published website outputs.

Architecture

The project uses a monorepo structure with the following components:

DirectoryDescription
packages/etl/ETL package for processing XML
frontend/Static site generator web application
data/raw/Git submodule for the EpiDoc files
data/processed/Output data generated by the ETL process
xslt/epidoc/Git submodule for XSLT stylesheets

Deployment

The site is available both as a staging instance for testing, and a public production site served via GitHub Pages.

The set of granular standard-compliant data files (DTS, EpiDoc, Web Annotations, etc.) versioned on and served from public code repositories is independent from the software. This portability and sustainability requirement was a consideration in all design and development work.

Design Process

The design of the site was informed by initial discussions around the information architecture and review of static mock-ups. A usability workshop with a group of representative prospective users identified bugs as well as user interface and user experience refinements which have been prioritised collaboratively and integrated into the current interface.

Community Value

By integrating different layers of annotations in the same publication, the site converges benefits from multiple communities:

  • Researchers: Multidisciplinary researchers involved in the project are empowered to conduct data quality and analysis on the integrated corpus
  • Museums: Repositories where the actual inscription objects are held gain visibility and enrichment to their collections under the same integrated digital space
  • DH Teams: Other collaborative Digital Humanities teams may adopt similar solutions for cultural heritage online corpora that rely on open standards and open architectures

Source Code

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (CROSSREADS, grant agreement no. 885040). A previous instance of I.Sicily on which the requirements of the current solution is based was created and developed by Jonathan Prag, with the technical support of James Cummings and James Chartrand of OpenSky Solutions (see Prag, 2021 and Prag and Chartrand, 2019).