Palaeohispanic languages

Unearthing Ancient Languages: How AI and Data Curation Are Revolutionizing Palaeohispanic Studies

Explore how AI and carefully curated datasets are transforming the study of ancient Palaeohispanic languages, offering new insights into their decipherment and structure.

ARSA Technology Team

16 Apr 2026 • 5 min read

Unlocking the Secrets of Ancient Iberian Tongues with AI

Before the Roman Empire extended its reach across the Iberian Peninsula in the 3rd Century B.C., a diverse array of indigenous languages thrived. These ancient tongues, collectively known as Palaeohispanic languages, hold invaluable clues about early European history and linguistics. Despite centuries of dedicated scholarship, many remain largely undeciphered, presenting profound challenges to linguists. However, a recent academic paper, "Curation of a Palaeohispanic Dataset for Machine Learning" by Gonzalo Martínez-Fernández and his colleagues from the Universidad de Sevilla, introduces a transformative approach: leveraging modern Artificial Intelligence (AI) and Machine Learning (ML) techniques to break new ground in this complex field.

The study highlights a critical hurdle: while significant knowledge exists, the data—comprising thousands of inscriptions and texts—is often unstructured and unsuitable for computational analysis. This necessitates a robust data curation effort to prepare these precious historical records for the powerful algorithms of today. By transforming scattered linguistic information into a machine-friendly format, the researchers pave the way for AI to contribute significantly to the decipherment and understanding of these enigmatic languages.

The Diverse Linguistic Landscape of Ancient Iberia

The Iberian Peninsula was a melting pot of languages prior to Romanization. The primary Palaeohispanic languages include Iberian (a language isolate, potentially related to Basque), Celtiberian (a Celtic language), Lusitanian (Indo-European), Vasconic, and the South-Western language (Tartessian), whose linguistic affiliation is still debated. While distinct, Celtiberian, Iberian, and the South-Western languages often shared similar writing systems known as semi-syllabaries. Unlike alphabets where each symbol represents a single sound, semi-syllabaries use some graphemes for individual phonemes and others for syllables (typically an occlusive consonant plus a vowel). These scripts were not native innovations but evolved from Phoenician writing, demonstrating ancient cross-cultural influences.

These languages are classified as "corpora languages," meaning our entire understanding of them comes exclusively from surviving written records. This limitation creates a dual challenge: first, deciphering the scripts themselves to accurately transcribe texts, and second, understanding the languages' morphology (word structure) and semantics (meaning). The seminal work of Gómez-Moreno in the early 20th century, which deciphered the Iberian Levantine script, provided the initial breakthrough. However, even today, the degree of decipherment varies greatly among the languages, with many gaps remaining. For instance, while Iberian boasts approximately 2250 inscriptions, Celtiberian has only around 200, and the South-Western language roughly 100, underscoring the scarcity of data for computational linguistics, as highlighted by the researchers.

Bridging the Gap: From Historical Texts to Machine-Readable Data

Traditionally, studies of Palaeohispanic languages have relied heavily on historical linguistics, with computational methods playing a minimal role. The potential of advanced analytical tools, particularly those in Natural Language Processing (NLP), remains largely untapped due to the format of existing data. Resources like the Hesperia Data Bank, while invaluable for their extensive collection of Iberian and Celtiberian data, are not structured for direct computational input. Their attributes are often provided as plain strings, sometimes with inconsistencies, making them unsuitable for Machine Learning algorithms that require clean, standardized datasets.

This is precisely where the innovation of the presented paper lies. The authors embarked on a meticulous process of collecting and transforming these diverse corpora into a unified, machine-friendly format. The outcome is a structured CSV file comprising 1751 instances and 36 feature columns, ready to be fed into computational models. This data curation effort is a foundational step, enabling sophisticated analyses that were previously impossible. For organizations facing similar challenges with large, unstructured datasets—whether historical archives, industrial sensor readings, or complex customer feedback—developing custom AI solutions for data ingestion and normalization is often the key to unlocking hidden value.

AI and the Decipherment Challenge: New Frontiers in Palaeohispanic Studies

With a properly structured dataset, Machine Learning offers unprecedented opportunities to explore Palaeohispanic languages. One significant challenge in these ancient texts is scriptio continua, where words are not separated by spaces. While a native speaker would implicitly understand where words begin and end, it's a non-trivial task for researchers. NLP techniques can be trained to perform word segmentation, much like they process modern languages like Japanese that also employ continuous writing. Beyond segmentation, AI can assist in morphological analysis, differentiating between fusional languages (like Celtiberian, where morphemes merge) and agglutinative languages (like Iberian, where morphemes are distinct and "glued" together, similar to Basque).

Furthermore, the dataset enables semantic studies through techniques like cognate detection, which identifies words sharing a common origin, even across different languages. Approaches inspired by low-resource translation models can also be adapted to infer meanings from limited contexts. AI can even help in developing part-of-speech detectors, providing syntactic insights into these extinct languages. This computational approach promises to accelerate decipherment, reveal nuanced linguistic structures, and deepen our understanding of these ancient civilizations. For example, ARSA Technology applies principles of advanced data processing and pattern recognition, akin to those used in AI Video Analytics, to extract actionable intelligence from complex, real-world data, demonstrating how AI transforms raw input into meaningful insights.

The Power of Structured Data: A Pathway for Computational Linguistics

The newly curated dataset represents a significant leap forward. By providing a standardized, clean, and accessible resource, it dramatically lowers the barrier for computational linguists and AI researchers to engage with Palaeohispanic studies. This shift from purely linguistic methodologies to an interdisciplinary approach blending linguistics, computer science, and artificial intelligence promises to unlock centuries-old mysteries faster and with greater accuracy. The potential extends beyond academic decipherment, hinting at future applications in digital humanities, cultural heritage preservation, and even advanced AI API development for specialized linguistic tasks.

The implications for enterprise are also profound. Many businesses grapple with legacy data, fragmented information, or specialized datasets that are not optimized for modern analytical tools. The methodology applied here—identifying unstructured, valuable data, meticulously curating it, and transforming it into a machine-readable format—is a blueprint for digital transformation across various sectors. Whether it's historical customer records, complex engineering logs, or proprietary domain-specific knowledge, the principle remains the same: structured data is the foundation for intelligent decision-making and automated insights.

Conclusion: A New Era for Ancient Language Research

The pioneering work of Martínez-Fernández, Quesada-Moreno, Riscos-Núñez, and Salguero-Lamillar marks an exciting turning point for the study of Palaeohispanic languages. By meticulously curating a machine-learning-ready dataset, they have provided the essential infrastructure for AI to illuminate the complex structures and meanings of these ancient tongues. This project underscores the power of interdisciplinary collaboration and the critical role of data preparation in unlocking historical, cultural, and scientific insights. As AI continues to evolve, its application to "low-resource" languages will undoubtedly bring us closer to understanding the human story in its entirety.

For enterprises looking to transform their own unique data challenges into strategic advantages with cutting-edge AI and IoT solutions, contact ARSA today to discuss how tailored engineering intelligence can deliver measurable impact.