Technical Overview
This page provides an overview of the technical solution used to build and present the I.Sicily digital corpus. The project was designed with sustainability, modularity, and openness as guiding principles.
Technologies and Processes
Data Standards
The core dataset for the project relies on the EpiDoc TEI-XML standard, which is used to encode all information about the inscriptions and the inscribed objects, as well as the texts themselves.
Data is standardised and made potentially interoperable by the use of recognised vocabularies:
- Pleiades gazetteer of ancient places
- EAGLE epigraphic vocabularies
- FAIR Epigraphy vocabularies
- BGS Rock Classification Scheme for petrographic data
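In practice, these vocabularies appear as URIs attached to TEI elements, so any consumer of the files can resolve a label to a shared concept. The following sketch shows how such references might be read programmatically; the TEI fragment is illustrative only, and the Pleiades ID and BGS URI are placeholder values, not real I.Sicily data.

```python
# Read vocabulary URIs from an illustrative EpiDoc-style TEI fragment.
# Element names follow the TEI namespace (http://www.tei-c.org/ns/1.0);
# the URIs and labels below are placeholders.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

fragment = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <msDesc>
          <msIdentifier>
            <settlement ref="https://pleiades.stoa.org/places/462503">Syracuse</settlement>
          </msIdentifier>
          <physDesc>
            <objectDesc>
              <supportDesc>
                <support>
                  <material ref="http://example.org/bgs/limestone">limestone</material>
                </support>
              </supportDesc>
            </objectDesc>
          </physDesc>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""

root = ET.fromstring(fragment)
place = root.find(".//tei:settlement", TEI_NS)
material = root.find(".//tei:material", TEI_NS)
print(place.get("ref"), place.text)        # gazetteer URI + label
print(material.get("ref"), material.text)  # rock-classification URI + label
```

Because each value carries both a human-readable label and a machine-resolvable URI, two corpora that use the same vocabularies can be aligned without string matching.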
Development
The project solution is built using a monorepo structure with two main components:
- An ETL (Extract, Transform, Load) process for handling and enriching XML files
- A static website built with SvelteKit configured as a Static Site Generator (SSG)
The site generates plain HTML, CSS, and JavaScript files that can be served by any standard web server (such as Nginx or Apache); it does not require Node.js or any other application server to be running.
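The core of the ETL step can be pictured as follows: walk a directory of EpiDoc XML files and emit a single JSON index that the static site can load for listing and faceted search. This is a minimal sketch, not the project's actual code; the field names and directory layout are assumptions.

```python
# Sketch of an ETL pass: EpiDoc XML files in -> one JSON search index out.
# Field names ("id", "title", "material") are illustrative assumptions.
import json
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_record(path: Path) -> dict:
    """Pull a few metadata fields from one EpiDoc file."""
    root = ET.parse(path).getroot()
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    material = root.findtext(".//tei:material", default="", namespaces=TEI_NS)
    return {"id": path.stem, "title": title, "material": material}

def build_index(source_dir: Path, out_file: Path) -> int:
    """Write a JSON index of all XML files; return the record count."""
    records = [extract_record(p) for p in sorted(source_dir.glob("*.xml"))]
    out_file.write_text(json.dumps(records, ensure_ascii=False, indent=2))
    return len(records)
```

Because the output is a static JSON file, it can be committed alongside the generated HTML and served without any server-side query layer.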
Key frontend dependencies include:
- bits-ui for User Interface (UI) components
- itemsjs for faceted search
- mdsvex for markdown content
- unovis for data visualisations
- OpenSeadragon for IIIF image viewing
- MapLibre GL for interactive maps
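Of these, itemsjs does the faceted search entirely in the browser, over the pre-built JSON index. The underlying idea, sketched here in Python purely for illustration (the site itself uses itemsjs in JavaScript), is to filter records by the active facet selections and count the facet values among the matches:

```python
# Conceptual sketch of client-side faceted search, not itemsjs itself.
# Facet field names ("material", "language") are illustrative.
from collections import Counter

FACET_FIELDS = ["material", "language"]

def facet_search(records, filters):
    """Return records matching all active filters, plus facet counts."""
    matches = [
        r for r in records
        if all(r.get(field) in allowed for field, allowed in filters.items())
    ]
    facets = {field: Counter(r.get(field) for r in matches)
              for field in FACET_FIELDS}
    return matches, facets
```

The facet counts are recomputed on every filter change, which is cheap enough for a corpus-sized dataset and avoids any server round-trip.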
Data Workflows and Models
The project workflow processes EpiDoc TEI XML files, enriches the original input corpus, and presents it on a static website. Individual editions are published as HTML pages, can be searched and filtered, and are freely available for download.
High-resolution images, where available, are presented via a IIIF server.
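A IIIF image server is addressed through a fixed URL syntax, which is why any IIIF-aware viewer (such as OpenSeadragon) can request full images or arbitrary crops from it. A minimal sketch of the IIIF Image API URL pattern, with a placeholder server address:

```python
# Build a IIIF Image API request URL:
#   {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
# The base URL in the test is a placeholder, not the project's server.
def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"
```

For example, `region="100,200,300,400"` requests a 300x400-pixel crop starting at pixel (100, 200), which is also how letter-level regions are addressed in the palaeographic annotations described below.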
Annotation Layers
The corpus integrates multiple annotation layers, each feeding metadata into the inscription pages, faceted search, and other dynamic pages:
Linguistic Annotation
Linguistic annotation has been undertaken on a Greek and Latin subset of the corpus, alongside tokenisation and lemmatisation of the complete corpus.
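The two corpus-wide steps named above can be illustrated with a toy example: tokenisation splits an edition's text into word forms, and lemmatisation maps each form to a dictionary headword. The lemma table below is a tiny stand-in for the real linguistic resources, and the regex-based tokeniser is a deliberate simplification.

```python
# Toy tokenisation + lemmatisation. The lemma table is a placeholder
# for the project's actual linguistic resources.
import re

LEMMATA = {"dis": "deus", "manibus": "manes", "sacrum": "sacer"}

def tokenise(text: str) -> list[str]:
    """Split text into lowercase word forms."""
    return re.findall(r"\w+", text.lower())

def lemmatise(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each token with its lemma (falling back to the token itself)."""
    return [(t, LEMMATA.get(t, t)) for t in tokens]
```

Lemmatised tokens are what make lexical search useful across inflected languages such as Greek and Latin, since a query for a headword can match all of its attested forms.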
Palaeographic Annotation
Palaeographic annotation is conducted through a dedicated digital palaeographic environment with its own web application and interface (Annotator), served and documented via a dedicated GitHub repository. The environment lets the user define the palaeographic structure of allographs and then create thousands of Web Annotations (following the Web Annotation Data Model) that bind graphs as they appear on inscription images (using the IIIF Image API region format) with their occurrences in the EpiDoc edition (fetched from a Distributed Text Services (DTS) collection) and with the formal description of their structure. A faceted search interface allows the researcher to filter graph annotations by their structural patterns and to define a high-level typology of allographs. The list of letter types identified for each annotated inscription is then added to the TEI file, making the types visible on the corpus website with links back to the palaeography environment. These types are also searchable from the ‘Lettering’ filter in the list of facet options, integrating them with the rest of the exploration possibilities across the entire corpus, even though only a selection of inscriptions underwent palaeographic analysis.
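The binding described above can be sketched as a single JSON-LD annotation: an image region (IIIF fragment selector) as the target, with the text occurrence and the allograph description as bodies. All URIs below are placeholders, and the exact body structure used by the project's Annotator may differ; this only illustrates the general Web Annotation Data Model pattern.

```python
# Sketch of a Web Annotation binding a letter region on a IIIF image to
# its occurrence in the edition and to an allograph description.
# All URIs are placeholders.
def graph_annotation(image_id, xywh, dts_passage_url, allograph_uri):
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": [
            {"type": "SpecificResource", "source": dts_passage_url},
            {"type": "SpecificResource", "source": allograph_uri},
        ],
        "target": {
            "type": "SpecificResource",
            "source": image_id,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={xywh}",
            },
        },
    }
```

Because each annotation is standalone JSON-LD, thousands of them can be stored, versioned, and queried independently of the web application that produced them.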
Petrographic Annotation
Petrographic analysis has been undertaken on hundreds of items across the corpus. The dataset includes raw and processed data from (geo)chemical and minero-petrographic analyses on epigraphic supports, supporting the identification of rock types and their provenance. Geology-specific vocabularies (the BGS Rock Classification Scheme) are used as a reference to update the XML files with material and material-provenance data. The materials description has been augmented via a dedicated RSE-supported workflow that aggregates the multi-analytical data collected by the project’s material scientist. The workflow supports (pre)processing and analysis of the different data by streamlining repetitive tasks and simplifying data interpretation, ultimately identifying the rocks on which the texts (in Latin, Greek and other languages) are inscribed, and therefore their provenance. This research workflow is summarised in the technical documentation (see Figure 1 below), in research dissemination (e.g. Coccato et al., 2025a; Coccato et al., 2025b; Ciula et al., 2025; Ciula, 2025), and in the code used to implement the petrographic research environment.
Figure 1: Petrographic analysis workflow, from sample preparation through multi-analytical characterisation to rock identification and provenance determination.
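The classification step at the end of that workflow can be caricatured as follows. The thresholds and categories below are entirely invented for illustration; the real workflow combines multi-analytical measurements and uses the BGS Rock Classification Scheme, not a two-branch rule.

```python
# Toy sketch of the final classification step only: map simplified
# observations to a rock label. Thresholds and labels are invented.
def classify_rock(observations: dict) -> str:
    carbonate = observations.get("carbonate_pct", 0)
    if carbonate > 90 and observations.get("texture") == "crystalline":
        return "marble"
    if carbonate > 50:
        return "limestone"
    return "unclassified"
```

The point of the sketch is the shape of the step, not the geology: heterogeneous analytical results are reduced to a controlled-vocabulary label that can be written back into the TEI files.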
Workflows
Figure 2: Corpus Building workflow, from EpiDoc source files through ETL processing and static site generation to the published website outputs.
Architecture
The project uses a monorepo structure with the following components:
| Directory | Description |
|---|---|
| packages/etl/ | ETL package for processing XML |
| frontend/ | Static site generator web application |
| data/raw/ | Git submodule for the EpiDoc files |
| data/processed/ | Output data generated by the ETL process |
| xslt/epidoc/ | Git submodule for XSLT stylesheets |
Deployment
The site is available as both a staging instance for testing and a public production site served via GitHub Pages.
The set of granular, standards-compliant data files (DTS, EpiDoc, Web Annotations, etc.), versioned on and served from public code repositories, is independent of the software. This portability and sustainability requirement was a consideration in all design and development work.
Design Process
The design of the site was informed by initial discussions around the information architecture and review of static mock-ups. A usability workshop with a group of representative prospective users identified bugs as well as user interface and user experience refinements which have been prioritised collaboratively and integrated into the current interface.
Community Value
By integrating different layers of annotations in the same publication, the site converges benefits from multiple communities:
- Researchers: Multidisciplinary researchers involved in the project are empowered to conduct data-quality checks and analysis on the integrated corpus
- Museums: The repositories where the inscribed objects are held gain visibility for, and enrichment of, their collections within the same integrated digital space
- DH Teams: Other collaborative Digital Humanities teams may adopt similar solutions for cultural heritage online corpora that rely on open standards and open architectures
Source Code
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (CROSSREADS, grant agreement no. 885040). A previous instance of I.Sicily, on which the requirements of the current solution are based, was created and developed by Jonathan Prag, with the technical support of James Cummings and James Chartrand of OpenSky Solutions (see Prag, 2021 and Prag and Chartrand, 2019).