Technical Overview
This page provides an overview of the technical solution used to build and present the I.Sicily digital corpus. The project was designed with sustainability, modularity, and openness as guiding principles.
Technologies and Processes
Data Standards
The core dataset for the project relies on the EpiDoc TEI-XML standard. This is used to encode all information about the inscriptions and the inscribed objects, as well as the actual text itself.
Data is standardised and made potentially interoperable by the use of recognised vocabularies:
- Pleiades gazetteer of ancient places
- EAGLE epigraphic vocabularies
- FAIR Epigraphy vocabularies
- BGS Rock Classification Scheme for petrographic data
Development
The project solution is built using a monorepo structure with two main components:
- An ETL (Extract, Transform, Load) process for handling and enriching XML files
- A static website built with SvelteKit configured as a Static Site Generator (SSG)
The build produces plain HTML, CSS, and JavaScript files that can be served by any standard web server (such as Nginx or Apache). No Node.js process or other application server needs to be running in production.
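As a rough illustration of what static generation involves in SvelteKit, the fragment below shows the documented route option that asks SvelteKit to prerender pages at build time. It is usually paired with the `@sveltejs/adapter-static` adapter in `svelte.config.js`. The file location is an assumption, and the project's actual configuration is not shown on this page.

```typescript
// src/routes/+layout.ts (assumed location: the root layout of the frontend)
// Setting `prerender` in the root layout asks SvelteKit to render every
// route to a static HTML file at build time instead of on each request.
export const prerender = true;
```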
Key frontend dependencies include:
- bits-ui for UI components
- itemsjs for faceted search
- mdsvex for markdown content
- unovis for data visualisations
- OpenSeadragon for IIIF image viewing
- MapLibre GL for interactive maps
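To give a sense of what faceted search does with the corpus metadata, here is a minimal hand-rolled sketch of facet filtering and counting. It does not use or reproduce the itemsjs API. The record shape, field names, and sample values are hypothetical.

```typescript
// Hypothetical record shape: the real field names in the search index may differ.
interface InscriptionRecord {
  id: string;
  language: string;
  material: string;
}

type FacetField = "language" | "material";
type FacetSelection = Partial<Record<FacetField, string>>;

const FACET_FIELDS: FacetField[] = ["language", "material"];

// Keep only the records that match every facet value the user has selected.
function filterByFacets(
  records: InscriptionRecord[],
  selected: FacetSelection,
): InscriptionRecord[] {
  return records.filter((record) =>
    FACET_FIELDS.every(
      (field) => selected[field] === undefined || record[field] === selected[field],
    ),
  );
}

// Count how many records carry each value of one facet field.
function countFacet(
  records: InscriptionRecord[],
  field: FacetField,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const record of records) {
    const value = record[field];
    counts[value] = (counts[value] ?? 0) + 1;
  }
  return counts;
}

// Illustrative records only, not taken from the corpus.
const sampleRecords: InscriptionRecord[] = [
  { id: "example-1", language: "Latin", material: "marble" },
  { id: "example-2", language: "Greek", material: "limestone" },
  { id: "example-3", language: "Latin", material: "limestone" },
];

const latinOnly = filterByFacets(sampleRecords, { language: "Latin" });
const materialCounts = countFacet(latinOnly, "material");
```

Because the whole index is filtered in the browser, this style of search needs no server-side component, which fits a statically generated site.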
Data Model
The project processes EpiDoc TEI XML files, enriches the original input corpus, and presents it on a static website. Individual editions are published as HTML pages, can be searched and filtered, and are freely available for download.
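The sketch below illustrates the general idea of turning an EpiDoc file into a record that a static site can consume. It is not the project's ETL code. The sample XML is a heavily abbreviated, made-up fragment, the output fields are hypothetical, and regular expressions are used only to keep the example dependency-free (a real pipeline would use an XML parser).

```typescript
// Made-up, heavily abbreviated EpiDoc fragment (not taken from the corpus).
const sampleEpiDoc = `
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example funerary inscription</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="edition" xml:lang="la"><ab>Dis Manibus</ab></div>
    </body>
  </text>
</TEI>`;

// Hypothetical shape of one processed record.
interface ProcessedRecord {
  title: string;
  language: string;
  text: string;
}

// Return the first capture group of the pattern, or "" when nothing matches.
function firstMatch(xml: string, pattern: RegExp): string {
  const match = pattern.exec(xml);
  return match?.[1]?.trim() ?? "";
}

// Extract a few fields from the XML and reshape them into a flat record.
function transform(xml: string): ProcessedRecord {
  return {
    title: firstMatch(xml, /<title>([^<]*)<\/title>/),
    language: firstMatch(xml, /<div type="edition" xml:lang="([^"]*)"/),
    text: firstMatch(xml, /<ab>([^<]*)<\/ab>/),
  };
}

// "Load" step: serialise the record so the site build can read it as JSON.
const processedRecord = transform(sampleEpiDoc);
const processedJson = JSON.stringify(processedRecord);
```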
High-resolution images, where available, are served from a IIIF server.
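For readers unfamiliar with IIIF, the sketch below builds an image request URL following the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. It assumes Image API 3.0 syntax (for example `max` as the size keyword). The server URL and identifier are placeholders, not the project's actual endpoints.

```typescript
// Parameters of a IIIF Image API request; defaults request the full image.
interface IiifImageRequest {
  baseUrl: string; // server and prefix, e.g. a placeholder like https://iiif.example.org/iiif/3
  identifier: string; // image identifier on that server
  region?: string; // "full" or "x,y,w,h"
  size?: string; // "max" or a width such as "400,"
  rotation?: string; // degrees clockwise, e.g. "0"
  quality?: string; // "default", "color", "gray", or "bitonal"
  format?: string; // e.g. "jpg" or "png"
}

// Assemble the request URL from its path segments.
function iiifImageUrl({
  baseUrl,
  identifier,
  region = "full",
  size = "max",
  rotation = "0",
  quality = "default",
  format = "jpg",
}: IiifImageRequest): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop any trailing slashes
  const id = encodeURIComponent(identifier); // identifiers must be URI-encoded
  return `${base}/${id}/${region}/${size}/${rotation}/${quality}.${format}`;
}

// Placeholder server and identifier, for illustration only.
const fullImageUrl = iiifImageUrl({
  baseUrl: "https://iiif.example.org/iiif/3/",
  identifier: "example-image",
});
const thumbnailUrl = iiifImageUrl({
  baseUrl: "https://iiif.example.org/iiif/3",
  identifier: "example-image",
  size: "400,",
});
```

A viewer such as OpenSeadragon issues many requests of this kind for individual tiles, which is what allows deep zooming without downloading the whole image.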
Annotation Layers
The corpus integrates multiple annotation layers, each feeding metadata into the inscription pages, faceted search, and other dynamic pages:
Linguistic Annotation
Linguistic annotation has been undertaken on a Greek and Latin subset of the corpus, alongside tokenisation and lemmatisation of the complete corpus.
Palaeographic Annotation
Palaeographic annotation is conducted through a dedicated digital palaeographic environment, which enables the assignment of letter typologies to linked images and texts. The environment uses Web Annotations that bind graphs as they appear on inscription images with their occurrences in the EpiDoc edition and the formal description of their structure.
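As an illustration of how a single annotation can tie an image region to a place in the edition, the sketch below builds an object using terms from the W3C Web Annotation Data Model (`FragmentSelector`, `XPathSelector`, `TextualBody`). The input shape, URLs, XPath, and letter description are hypothetical, and the project's actual annotation structure may differ.

```typescript
// Selector and target types, limited to the two selector kinds used here.
interface Selector {
  type: "FragmentSelector" | "XPathSelector";
  value: string;
  conformsTo?: string;
}

interface SpecificResource {
  source: string;
  selector: Selector;
}

interface GraphAnnotation {
  "@context": string;
  id: string;
  type: "Annotation";
  body: { type: "TextualBody"; value: string; purpose: "classifying" };
  target: SpecificResource[];
}

// Hypothetical input describing one letter (graph) on an image and in the text.
interface GraphInput {
  annotationId: string;
  imageUrl: string;
  region: { x: number; y: number; w: number; h: number };
  editionUrl: string;
  xpath: string;
  letterType: string;
}

// Build one annotation with two targets: the image region and the text node.
function buildGraphAnnotation(input: GraphInput): GraphAnnotation {
  const { x, y, w, h } = input.region;
  return {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    id: input.annotationId,
    type: "Annotation",
    body: { type: "TextualBody", value: input.letterType, purpose: "classifying" },
    target: [
      {
        source: input.imageUrl,
        selector: {
          type: "FragmentSelector",
          conformsTo: "http://www.w3.org/TR/media-frags/",
          value: `xywh=${x},${y},${w},${h}`, // pixel region on the image
        },
      },
      {
        source: input.editionUrl,
        selector: { type: "XPathSelector", value: input.xpath },
      },
    ],
  };
}

// Placeholder values, for illustration only.
const graphAnnotation = buildGraphAnnotation({
  annotationId: "https://example.org/annotations/1",
  imageUrl: "https://example.org/images/example-image",
  region: { x: 120, y: 80, w: 40, h: 55 },
  editionUrl: "https://example.org/editions/example.xml",
  xpath: "//div[@type='edition']/ab/g[1]",
  letterType: "example letter form",
});
```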
Petrographic Annotation
Petrographic analysis has been undertaken on hundreds of items across the corpus. The dataset includes raw and processed data from (geo)chemical and minero-petrographic analyses on epigraphic supports, supporting the identification of rock types and their provenance.
Workflows
The following diagram illustrates the corpus building workflow:
Architecture
The monorepo is organised into the following directories:
| Directory | Description |
|---|---|
| packages/etl/ | ETL package for processing XML |
| frontend/ | Static site generator web application |
| data/raw/ | Git submodule for the EpiDoc files |
| data/processed/ | Output data generated by the ETL process |
| xslt/epidoc/ | Git submodule for XSLT stylesheets |
Deployment
The site is deployed both as a staging instance for testing and as a public production site served via GitHub Pages.
The granular, standards-compliant data files (DTS, EpiDoc, Web Annotations, etc.) are versioned in, and served from, public code repositories, and they are independent of the software. This requirement for portability and sustainability informed all design and development work.
Design Process
The design of the site was informed by initial discussions about the information architecture and by a review of static mock-ups. A usability workshop with a group of representative prospective users identified bugs as well as user interface and user experience refinements, which were prioritised collaboratively and integrated into the current interface.
Community Value
By integrating different layers of annotation in a single publication, the site brings together benefits for multiple communities:
- Researchers: Multidisciplinary researchers involved in the project can carry out data quality checks and analysis on the integrated corpus
- Museums: Repositories that hold the inscribed objects gain visibility for, and enrichment of, their collections within the same integrated digital space
- DH Teams: Other collaborative Digital Humanities teams may adopt similar solutions for cultural heritage online corpora that rely on open standards and open architectures
Source Code
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (CROSSREADS, grant agreement no. 885040).