Ancient text corpora

Ancient text corpora comprise the entire body of texts surviving from ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage.

Chinese, Latin, and Greek are examples of ancient languages with significant text corpora, although much of these corpora is known to us via transmission (frequently via medieval manuscript copies) rather than in original form. These texts, both transmitted and original, provide valuable insights into the history and culture of different regions of the world, and have been studied by scholars for centuries. Other ancient texts, particularly stone inscriptions and papyrus scrolls, have been published following archaeological research, notably the cuneiform corpus of c. 10 million words and the c. 5 million words in ancient Egyptian.

Through advances in technology and digitization, ancient text corpora are more accessible than ever before. Tools such as the Perseus Digital Library and the Digital Corpus of Sanskrit^[1] have made it easier for researchers to access and analyze these texts.

Quantifying the corpora

Two types of ancient texts are known to modern scholars – those that have only survived in younger manuscripts, but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan tradition), and those known from original inscriptions, papyri and other manuscripts.^[2]

Counting the words in each corpus presents significant methodological challenges. In principle, every occurrence of a word in a text is counted separately, but where a literary text has come down in parallel transmission, only a single witness is taken into account. Just as the Book of the Dead and the Coffin Texts are included only once in the figure given for Egyptian, each Greek and Latin literary work is counted according to one manuscript only. By contrast, tomb inscriptions, royal inscriptions or economic documents that share a more or less identical form are not treated as mere "parallel tradition", so each document is counted. Attached prepositions are counted as separate words. The definite article in Hebrew, Aramaic and Greek is left out of the count, since it has no equivalent in most of the other languages and its frequency would significantly affect the comparability of the numbers.^[2]
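The counting rules above can be illustrated with a minimal Python sketch. It assumes the texts are already tokenized with attached prepositions split off; the `work_id` labels, the sample tokens and the `EXCLUDED_ARTICLES` set are hypothetical and serve only to show the logic, not to reproduce Peust's actual procedure.

```python
# Hypothetical definite-article tokens to leave out of the count.
EXCLUDED_ARTICLES = {"ha-"}


def count_corpus_words(witnesses, excluded=EXCLUDED_ARTICLES):
    """Count word tokens, keeping one witness per literary work.

    `witnesses` is an iterable of (work_id, tokens) pairs. Manuscripts of
    the same literary work share a work_id, and only the first one seen is
    counted. Formulaic but independent documents have distinct work_ids,
    so each of them is counted.
    """
    seen_works = set()
    total = 0
    for work_id, tokens in witnesses:
        if work_id in seen_works:
            continue  # parallel transmission: already counted once
        seen_works.add(work_id)
        total += sum(1 for token in tokens if token not in excluded)
    return total


# Toy data: placeholder tokens, not real text.
witnesses = [
    ("literary_work", ["w1", "w2", "w3"]),
    ("literary_work", ["w1", "w2", "w3"]),  # second manuscript, skipped
    ("receipt_1", ["x", "y"]),
    ("receipt_2", ["x", "y"]),              # same formula, separate document
    ("inscription", ["ha-", "z"]),          # article token is not counted
]
total_words = count_corpus_words(witnesses)
print(total_words)  # 8
```

Keying on a work identifier rather than comparing text content keeps the distinction the paragraph draws: near-identical economic documents remain separate entries, while additional manuscripts of one work do not inflate the total.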

Languages with known size estimates

Script	Language	Dates used	Number of texts prior to 300 AD	Words prior to 300 AD (archaeological)	Words prior to 300 AD (transmission)	Words prior to 300 AD (total)	Ref.
Egyptian hieroglyphs / Hieratic	Egyptian			5,000,000	none	5,000,000	^[3]^[4]
Demotic	Egyptian			1,000,000	none	1,000,000	^[5]
Greek (Ancient Greek literature, New Testament, Church Fathers, etc.)						57,000,000	^[6]^[7]
Latin						10,000,000	^[8]^[7]
Cuneiform	Akkadian		144,000^[9]	9,900,000^[9]	none	9,900,000	^[10]
	Sumerian		102,300^[11]	3,076,000^[11]	none	3,076,000	^[12]
	Hurrian			12,500	none	12,500	^[13]
	Urartian		400	10,000	none	10,000
	Hittite			700,000	none	700,000	^[14]
	Hattic			500	none	500	^[15]
	Cuneiform Luwian			3,000	none	3,000	^[16]
	Elamite		2,087	100,000	none	100,000	^[17]
	Proto-Elamite		1,435	20,000	none	20,000	^[18]
	Eblaite		16,000	300,000	none	300,000	^[19]
	Amorite		7,000	11,600	none	11,600	^[20]
	Ugaritic			40,000	none	40,000	^[21]
	Old Persian			7,000	100,000	107,000	^[22]
Canaanite and Aramaic	Ancient Hebrew (inc. Hebrew Bible)			35,000	265,000	300,000	^[23]^[24]
	Aramaic (ancient, imperial, biblical, Hasmonean, Nabataean, Palmyrenean)					100,000	^[25]
	Phoenician/Punic		10,000		68^[26]		^[27]^[28] ^[29]
Old South Arabian			10,500	112,500	none	112,500	^[30]^[31]
Etruscan				25,000		25,000	^[32]^[33]
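
As a worked check on one row of the table, the Ancient Hebrew total can be retraced from the figures that Clines and Peust give in the notes. The sketch below is only an illustration of that arithmetic; the variable and function names are introduced here for convenience and are not taken from either source.

```python
# Figures quoted from Clines (1993) and Peust (2000) in the notes.
BIBLE_WORDS = 305_500        # words in the Hebrew Bible
BIBLE_ALEPH = 61_883         # Aleph-initial word occurrences in the Hebrew Bible
PUBLISHED_TOTAL = 353_396    # Clines' published total for Ancient Hebrew
DEFINITE_ARTICLES = 34_622   # instances of ha-, deducted by Peust
CONJUNCTION_AND = 62_760     # instances of "and", which Peust says is arguable


def extrapolate_size(aleph_occurrences: int) -> int:
    """Scale an Aleph-word count up to a full corpus size (Clines' method)."""
    return round(aleph_occurrences / BIBLE_ALEPH * BIBLE_WORDS)


ben_sira = extrapolate_size(1_422)
qumran = extrapolate_size(7_768)
without_article = PUBLISHED_TOTAL - DEFINITE_ARTICLES
without_article_or_and = without_article - CONJUNCTION_AND

print(ben_sira, qumran)        # 7020 38349
print(without_article)         # 318774
print(without_article_or_and)  # 256014
```

The two adjusted figures bracket the rounded value of 300,000 words given in the table.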

South Asian

Sanskrit (Vedic Sanskrit and Classical Sanskrit)
Indus script (3,800 items, c.20,000 characters)^[34]
Brahmi script
Old Tamil
Early Indian epigraphy and Indian epic poetry
Kharosthi^[35]
Pali literature^[36]
List of historic Indian texts

Mesoamerican

East Asian

Old Chinese
Chinese classics
- The pre-Qin corpus: a collection of ancient Chinese texts written before the Qin dynasty (221 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
- The pre-Han corpus: a collection of ancient Chinese texts written before the Han dynasty (202 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
- See the Chinese Text Project
- Chinese bronze inscriptions, Oracle bone script, Seal script, Clerical script

Central Iranian languages

Prior to 300 AD, the Central Iranian languages are attested mainly in Sassanid stone inscriptions in the two closely related idioms Middle Persian and Parthian (written in Pahlavi scripts and Inscriptional Parthian respectively).^[37] The corpus of Middle Persian (mostly 3rd century, but also 4th/5th centuries) amounts to about 5,000 words and the corpus of Parthian (3rd century) to about 3,000 words. To what extent some of the Manichaean Middle Persian literary texts may date back to the 3rd century is difficult to estimate; Mani is said to have personally written the Shabuhragan,^[38] totaling about 5,000 words. Taken together, Middle Persian and Parthian come to over 10,000 words.^[39]

Proto-Sinaitic

The Proto-Sinaitic corpus comprises no more than about 400 letters (the number of words is unknown, since the script has not been fully interpreted).^[40] The approximately contemporaneous Proto-Canaanite inscriptions are probably of similar extent.^[41]

Anatolian

Cuneiform Luwian,^[42] approx. 3,000 words^[43]
The Palaic language,^[44] a few hundred words^[43]
Hieroglyphic Luwian^[45]^[43]
The Lycian alphabet (the best attested Anatolian successor language written in alphabetic script),^[46] about 5,000 words^[43]
The Lydian alphabet,^[47] 109 inscriptions comprising about 1,500 words^[43]
The Phrygian alphabet: the tomb inscriptions from the 2nd and 3rd centuries AD^[48] (approx. 1,000 words) and the so-called "Old Phrygian" inscriptions^[49] (fewer than 300 words)^[43]
The Carian alphabets,^[50] whose texts, mainly from Egypt, contain around 600 words^[43]

Old Italic

The Umbrian language,^[51] attested essentially by the sacrificial instructions of the Iguvine Tables, with 5,000 words^[52]
The Oscan language,^[51] with 2,000 words^[52]
The Messapic language,^[53] with probably a good 1,000 words (the estimate is difficult because most texts in this poorly understood language do not use word separators)^[52]
The Venetic language,^[54] a few hundred words^[52]
The Faliscan language,^[55] a few hundred words^[52]
Cisalpine Celtic inscriptions, amounting to approximately 2,000 words, to which are added a number of glosses by classical authors^[56]^[52]

Iberia

Iberian scripts, more rarely written in Greek or Latin script, approx. 2500 words^[52]^[57]
Celtiberian script, which covers Celtic-language texts from Spain written in Iberian script and also in Latin script (approx. 1,000 words)^[52]^[57]
Southwest Paleohispanic script, 78 inscriptions, a few hundred words^[52]^[57]
Lusitanian language, three monuments in Latin script, approx. 60 words^[52]^[57]

Germanic Northern Europe

Runic inscriptions dated before the 4th century number about 30 items, which contain no more than 50 words in total^[58]^[59]

Africa

Geʽez script: comparatively few inscriptions with a total of around 1,000 words before 300 AD.^[60] Following Christianization in the 4th century, more extensive texts are known.^[59]
Libyco-Berber alphabet: over 1,000 inscriptions from the Maghreb,^[61] which are dated to Roman times. Most texts do not use a word separator; Peust estimates that the total number of words could be around 5,000^[59]
Meroitic script (Ancient Nubian): about 900 texts are known, which Peust estimates may contain approximately 10,000 words, albeit with uncertainty from the fact that the word separator is not used consistently in the Meroitic script.^[62]^[59]

Aegean

The Cretan Linear A inscriptions, which have not yet been deciphered,^[63] comprise about 2,500 texts containing a total of around 20,000 characters. The total number of words can hardly be determined; Peust tentatively put it in the same order of magnitude as in Meroitic.^[59]
In addition to the Linear A texts, there are also inscriptions in Cretan hieroglyphs totaling a few hundred characters^[64] and texts written in the Greek alphabet, but not in Greek, with a few dozen words^[65]^[59]
The Cypriot syllabary was used in the first millennium BC, mostly to record Greek texts.^[66] The relevant texts comprise around 100 to 200 words.^[59]

Micro corpora

There are a significant number of ancient micro-corpus languages. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources hand down an enormous number of foreign-language glosses, the reliability of which is not always certain.^[59]

Preservation and curation

Preserving and maintaining ancient text corpora presents several challenges, including physical conservation, translation, and digitization. Many ancient texts have been lost over time, and those that survive may be damaged or fragmented. Translating ancient languages and scripts requires specialized expertise, and digitizing texts can be time-consuming and resource-intensive.

Corpus linguistics

The field of corpus linguistics studies language as expressed in text corpora. This includes the analysis of word frequency, collocations, grammar, and semantics. Ancient text corpora provide a valuable resource for corpus linguistics research, enabling scholars to explore the evolution of language and culture over time.
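
Word frequency and simple collocation counts, two of the analyses mentioned above, can be sketched in a few lines of Python. The token list is a toy example chosen for illustration and is not drawn from any of the corpora discussed; real studies would typically add lemmatization and statistical association measures on top of raw counts.

```python
from collections import Counter


def word_frequencies(tokens):
    """Return how often each word form occurs."""
    return Counter(tokens)


def adjacent_pairs(tokens):
    """Count adjacent word pairs, a crude stand-in for collocation analysis."""
    return Counter(zip(tokens, tokens[1:]))


# Toy token sequence (illustrative only).
tokens = ["rex", "magnus", "rex", "bonus", "rex", "magnus"]

frequencies = word_frequencies(tokens)
pairs = adjacent_pairs(tokens)

print(frequencies.most_common(2))  # [('rex', 3), ('magnus', 2)]
print(pairs[("rex", "magnus")])    # 2
```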

References

^ "Digital Corpus of Sanskrit (DCS) - Online Sanskrit dictionary and annotated corpus". www.sanskrit-linguistics.org. Archived from the original on 2023-06-03. Retrieved 2023-06-03.
^ ^a ^b Peust 2000, pp. 252–253.
^ Carsten Peust, "Über ägyptische Lexikographie. 1: Zum Ptolemaic Lexikon von Penelope Wilson; 2: Versuch eines quantitativen Vergleichs der Textkorpora antiker Sprachen", in Lingua Aegyptia 7, 2000: 245–260 (archived 2023-04-16 at the Wayback Machine): "Nach einer von W. F. Reineke in S. Grunert & L. Hafemann (Hrsgg.), Textcorpus und Wörterbuch (Probleme der Ägyptologie 14), Leiden 1999, S. xiii veröffentlichten Schätzung W. Schenkels beträgt die Zahl der in allen heute bekannten ägyptischen (d.h. hieroglyphischen und hieratischen) Texten enthaltenen Wortformen annähernd 5 Millionen und tendiert, wenn man die Fälle von Mehrfachüberlieferung u.a. des Totenbuchs und der Sargtexte separat zählt, gegen 10 Millionen; das Berliner Zettelarchiv des Wörterbuchs der ägyptischen Sprache von A. Erman & H. Grapow (Wb), das seinerzeit Vollständigkeit anstrebte, umfasst "nur" 1,7 Millionen (nach anderen Angaben: 1,5 Millionen) Zettel." (p. 246)
^ W. Schenkel (1995). "Die Lexikographie des Altägyptisch-Koptischen". The lexicography of the Ancient Near Eastern languages (PDF). Verona: Essedue. p. 197. ISBN 88-85697-43-7. OCLC 34816015. Archived (PDF) from the original on 2023-04-30. Retrieved 2023-05-05.
^ The Chicago Demotic Dictionary, which excerpted the Demotic texts published between 1955 and 1979, with the exception of the definite and indefinite articles and the suffix pronouns (J.H. Johnson in S. Grunert & L. Hafemann, Textcorpus und Wörterbuch, Leiden 1999, p. 243), has produced more than 200,000 slips (R.K. Ritner in S.P. Vleeming, Aspects of Demotic Lexicography, Leuven 1987, p. 145). Peust states that this represents only a fraction of the total number of known Demotic texts, which could be in the region of a million words.
^ The vocabulary of Greek is completely recorded in the Thesaurus Linguae Graecae, which was previously published on CD-ROM by the University of California, Irvine. The currently available version E, which includes in their entirety all texts up to 600 AD and a selection of Byzantine texts from the following epoch, comes to a total of around 76 million text words. In 1985, when most authors up to 400 AD had been entered (2,700 authors), but only 200 authors from the period after 400, the database contained 57 million words (L. Berkowitz & K.A. Squitier, Thesaurus Linguae Graecae, Canon of Greek Authors and Works, New York 1986, p. xii f.). Even though the word count of the texts up to AD 300 must be somewhat smaller, and the numerous instances of the Greek definite article should also be deducted, Greek can still clearly be declared the best preserved ancient language within the framework of these specifications. This corpus also contains writings by the Church Fathers written in Greek.
^ ^a ^b Dee, James H. (2002). "The First Downloadable Word-Frequency Database for Classical and Medieval Latin". The Classical Journal. 98 (1). The Classical Association of the Middle West and South: 59–67. ISSN 0009-8353. JSTOR 3298278. Archived from the original on 2023-05-05. Retrieved 2023-05-05. All those frequency counts are drawn from a much wider variety of subjects and styles than exist for classical or medieval Latin, and because the volume of printed and spoken matter in any modern language is staggeringly huge, their authors take great pains to select "representative" corpora, seeking statistically meaningful data. Things are quite different in Latin, where there is, for the classical period, a surviving mass of literature estimated at no more than 9,000,000 words, whereas the corpus of classical Greek literature is usually estimated at "only" ten times that much.
^ Peust states that the Thesaurus Linguae Latinae has 9 to 10 million words. The Thesaurus Linguae Latinae covers all texts up to AD 150 completely and the later texts up to AD 600 selectively (W. Ehlers, Der Thesaurus Linguae Latinae, in Antike und Abendland 14, 1968, 172ff.) with nine (according to other sources: ten) million slips. Even if the word count of Latin up to the year 300 AD has not yet been precisely determined, it should be clear that it surpasses that of Egyptian but is below that of Greek. However, the rule of thumb often heard among classical philologists, that there are ten times as many Greek as Latin texts, is exaggerated, at least if one takes the time limit of 300 AD as a basis.
^ ^a ^b Streck 2010, p. 54.
^ Peust estimated > 10,000,000, stating that estimating the disparately published corpus of this language is extremely difficult. As with the Thesaurus Linguae Latinae and the Egyptian Dictionary, the Chicago Assyrian Dictionary ("Assyrian" here means Akkadian as a whole, not just its Assyrian dialect) began with a systematic excerpting of texts onto slips. When 1,249,000 slips had been reached in 1948, the systematic excerpting was stopped and the index was thereafter expanded only by selected slips, so that by 1964 around 1,500,000 to 1,750,000 entries had accumulated (I.J. Gelb in CAD Vol. A 7, Chicago 1964, p. xvi). In Peust's estimation, however, this is only a fraction of the entire Akkadian material. The Assyriologist S.N. Kramer is said to have estimated that there were 500,000 cuneiform tablets (most cuneiform texts are in Akkadian). Also of interest is the estimate by A.L. Oppenheim, Ancient Mesopotamia, 2nd ed., Chicago 1977, p. 17f., that Ashurbanipal's private library found in Nineveh contained 1200 to 1500 tablets, the number of lines of which "would probably reach, if not exceed in bulk, even the size of the Mahabharata with its 190,000 verses". Far less than half of this library has survived and been published, but on the other hand there are of course many more texts than those in Ashurbanipal's library. Instead of giving a specific number, Peust concludes by conjecturing that the corpus of Akkadian may be of the same order of magnitude as that of Latin.
^ ^a ^b Streck 2010, p. 53.
^ The Sumerian corpus is difficult to estimate. According to W. Sallaberger & A. Westenholz, Mesopotamien: Akkade-Zeit und Ur III-Zeit, Annäherungen 3, Freiburg/Schw. 1999 (OBO 160/3), p. 128, just under 40,000 administrative and legal documents from the Ur III period have been published to date, each of which may contain an average of 20 to 30 words, which amounts to a million word forms. However, the total volume will probably remain below that of the Egyptian texts.
^ The most heavily attested of these languages is Hurrian, which is not related to Hittite. The previously published volumes (1, 2, 4, 5, 7, 9) of the corpus of Hurrian language monuments published by V. Haas contain a good 10,000 Hurrian words; to these is added the letter of the Mitanni king Tušratta to Amenophis III (ed. J. Friedrich, Kleinasiatische Sprachdenkmäler, Berlin 1932, p. 8 ff.) with approx. [...] Urartian (ed. G.A. Melikišvili, Urartskije klinoobraznyje nadpisi, Moskva 1960, about 10,000 words), which can be considered a successor language of Hurrian.
^ In 1980, 600,000 words, representing 90% of the text material, had been excerpted by the CHD. Among the cuneiform languages, Hittite, known for the most part through texts from Boghazkoi, the old capital of the Hittite Empire, is the third best attested. The files of the Chicago Hittite Dictionary contained over 600,000 index cards in 1980, by which time "over 90 percent" of the texts published up to that point had been fully excerpted (H.G. Güterbock & H.A. Hoffner, The Hittite Dictionary, Vol. 3/1 [L], Chicago 1980, p. xv). The text material from Boghazkoi also attests several regional languages from the Hittite sphere of influence, mostly in the form of bilinguals with Hittite of religious content or as foreign-language passages interspersed in Hittite ritual texts.
^ Another Boghazkoi language that has so far hardly been understood is Hattic. The total mass of the known Hattic text material should not exceed a few thousand words (the texts edited in J. Klinger, Untersuchungen zur Rekonstruktion der hattischen Kultschicht, Wiesbaden 1996, contain about 500 words); the number will be easier to determine as soon as the announced compilation of texts in the Hattic language by H. Otten & Ch. Rüster (StBo 37) has been published.
^ There are a few hundred "glossal wedge words" interspersed in Hittite texts, see Güterbock, H. G. (1956). "Notes on Luwian Studies (A propos B. Rosenkranz' Book". Orientalia. 25 (2). GBPress- Gregorian Biblical Press: 120ff. ISSN 0030-5367. JSTOR 43581480. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ 2,087 Persepolis tablets, royal inscriptions etc. This is the number published by Richard Hallock in his publication Persepolis Fortification Tablets. Numerous other tablets are still unpublished. See Matthew Stolper (archived 2007-02-14 at the Wayback Machine): "there were as many as 15,000 to 30,000 or more tablets and fragments. Most (thousands of tablets and tens of thousands of fragments) were in the elamite language in cuneiform script." Elamite, used in Persia, is a little-known but comparatively well-attested cuneiform language. A count of the most important publications gives the following: R.T. Hallock, Persepolis Fortification Tablets, Chicago 1969, 2,087 clay tablets with about 50,000 words (besides that, many tablets have remained unpublished); F.W. König, Die elamischen Königsinschriften [sc. pre-Achaemenid], Graz 1965, approx. 10,000 words; F.H. Weissbach, Die Keilinschriften der Achaemeniden, Leipzig 1911, approx. 5,000 words; G.C. Cameron, Persepolis Treasury Tablets, Chicago 1948, about 5,000 words. With all the Elamite material published so far, one will get close to 100,000 words.
^ According to Englund 1998, p. 22 n. 8: 1,450 published texts from Susa and 146 unpublished texts. This does not include the 1,435 known tablets from the third millennium written in the only partially deciphered Proto-Elamite script (ed. P. Meriggi, La scrittura proto-elamica, 3 vols., Roma 1971/4), which contain around 20,000 characters, most of which can be read logographically.
^ 2,436 complete tablets and 13,947 fragments of different sizes. Figures after I. Samir; see appendix 1 below to the National Museum of Syria, Idlib, for a breakdown of the fragments by size. Peust assumed 1,800 tablets and 15,000 fragments. The oldest known Semitic language is Eblaite from the third millennium, discovered only in the 1970s, of which 1,800 intact, often quite extensive clay tablets were found along with around 15,000 fragments (G. Pettinato, Catalogo dei testi cuneiformi di Tell Mardikh-Ebla, Napoli 1979, p. xvi f. (ASIN B004HZDL5K)). The 3,000 or so pieces published to date (G. Conti, Index of Eblaitic texts published or cited, Firenze 1992) should contain around 300,000 words.
^ 7,000 personal names and 90 secure loanwords, according to Streck, M.P. (2000). Das amurritische Onomastikon der altbabylonischen Zeit. Alter Orient und Altes Testament (in German). Ugarit-Verlag. p. 135. ISBN 978-3-927120-87-7. Archived from the original on 2023-05-14. Retrieved 2023-05-14.
^ The Ugaritic cuneiform alphabetic texts collected in M. Dietrich et al., The Cuneiform Alphabetic Texts, 2nd ed., Münster 1995, comprise about 40,000 words. There is also an insignificant number of Ugaritic glosses in syllabic cuneiform.
^ The Old Persian cuneiform inscriptions (ed. R.G. Kent, Old Persian, New Haven 1953) contain about 7,000 words; the linguistically closely related Avesta texts (ed. K.F. Geldner, Avesta, 3 vols., Stuttgart 1896), which have survived only in later manuscripts, contain around 100,000 words.
^ Peust follows the estimates of Clines (see following reference) that the total number of Ancient Hebrew words is 353,396, from a summation of the Hebrew Bible's 305,500 words, plus 38,349 words from the Dead Sea Scrolls, 7,020 words from the apocryphal Book of Sirach and 2,528 words from the inscriptions. Peust then subtracts the 34,622 instances of the definite article ה־ ha-, and notes that one can also argue about the 62,760 cases of ו "and", which is grammaticalized in a special way in Hebrew, concluding that, adjusting for these figures, the corpus of Ancient Hebrew is of the order of 300,000 words.
^ Clines, D.J.A. (1993). The Dictionary of Classical Hebrew: Aleph. Sheffield Academic Press. p. 28. ISBN 978-1-905048-75-5. Archived from the original on 2023-05-14. Retrieved 2023-05-14. [Table: Biblical: BHS 305500; Non-biblical: Ben Sira 7020, Qumran 38349, Inscr 2528; Total 353396] The foregoing statistics of the size of the various corpora of Hebrew texts have been derived in the following way. From the totals in the table, Words Beginning with Aleph in Order of Frequency, it can be seen that we have identified 61,883 occurrences of words in the Hebrew Bible (Biblia hebraica stuttgartensia) beginning with Aleph. Knowing that there are some 305,500 words in the Hebrew Bible (the figure comes from Francis I. Andersen and A. Dean Forbes, The Vocabulary of the Old Testament [Rome: Pontifical Biblical Institute, 1989], p. 23), we can assume that, roughly speaking, the 1,422 occurrences of Aleph words in ben Sira imply a text of c. 7,020 words (i.e. 1422, divided by 61883 and multiplied by 305500). Similarly, the total of 7,768 occurrences in the Qumran and related materials implies a corpus of c. 38,300 words (in the non-biblical texts already published, that is).
^
Peust writes that the corpus of Aramaic is fragmented into numerous dialects:
- Old Aramaic inscriptions from the first half of the first millennium BC (Kanaanäische und Aramäische Inschriften) with about 4000 words
- The primary Imperial Aramaic documents are from Egypt (Textbook of Aramaic Documents from Ancient Egypt; the first three volumes contain approx. 20,000 words), but it is also preserved in numerous other inscriptions and documents. Imperial Aramaic also includes the Aramaic text of Papyrus Amherst 63, written in Egyptian-Demotic script, which must contain around 3000 words.
- The Aramaic passages of the Old Testament ("Biblical Aramaic", particularly the Book of Daniel chapters 2-7 and the Book of Ezra 1:2–4, 4:8–16, 4:17–22, 5:7–17, 6:3–5, 6:6–12, 7:12–26) are closely related to Imperial Aramaic with a volume of a good 5000 words.
- Hasmonean, which is found above all in Apocrypha and Targum in the Dead Sea Scrolls, but is also attested in the Judean documents (Beyer, Klaus (1984). Die aramäischen Texte vom Toten Meer (in German). Vandenhoeck & Ruprecht. pp. 157–318. ISBN 978-3-525-53571-4.); 15,000 words in total.
- Nabataean and Palmyrene are each attested in about 1,000 grave and votive inscriptions.
Peust concludes that the total Aramaic corpus available up to this time is probably not much less than 100,000 words, and notes that from about 300 AD the Aramaic text corpus increases in leaps and bounds, since several major literary languages are now developing (Syriac, Mandean, Galilean, Samaritan).
^ Gray, Louis H. (1923). "The Punic Passages in the "Poenulus" of Plautus". The American Journal of Semitic Languages and Literatures. 39 (2). University of Chicago Press: 73–88. doi:10.1086/369974. ISSN 1062-0516. JSTOR 528483. S2CID 170454820. Archived from the original on 2023-05-05. Retrieved 2023-05-05.
^ Peust states that the corpus of Phoenician including Punic per Kanaanäische und Aramäische Inschriften (KAI) amounts to around 10,000 words, but this contradicts other estimates of 10,000 texts in the whole corpus. KAI contains a selection of texts rather than a complete corpus.
^ Doak, Brian R. (2019-08-26). The Oxford Handbook of the Phoenician and Punic Mediterranean. Oxford University Press. p. 223. ISBN 978-0-19-049934-1. Most estimates place it at around ten thousand texts. Texts that are either formulaic or extremely short constitute the vast majority of the evidence.
^ Lehmann, Reinhard G. (2013). "Wilhelm Gesenius and the Rise of Phoenician Philology". Biblische Exegese und hebräische Lexikographie. pp. 209–266. doi:10.1515/9783110267044.209. ISBN 978-3-11-026612-2. Quote: "Nearly two hundred years later the repertory of Phoenician-Punic epigraphy counts about 10.000 inscriptions from throughout the Mediterranean and its environs."
^ The corpus of the Old South Arabian languages has been published in scattered publications and is difficult to survey. The old compilations in the Corpus Inscriptionum Semiticarum (and RES) contain around 3,000 texts with over 50,000 words, although a (small) part of these texts dates from after 300 AD. Thus, a stock of well over 100,000 words can now be assumed. The Old South Arabian texts are mainly in Sabaean, but also in other languages such as Minaean, Qatabanian and Hadramautic, although the attribution of some shorter monuments remains uncertain. P. Stein stated in 2007 that there were 10,500 inscriptions, whereas Peust gave 8,000 inscriptions in 2000. According to Stein, the texts are divided as follows: Sabaean: 5,300 texts; Qatabanian: 2,000; Minaean: 1,200; Haḍramite: 1,500; other/uncertain: 500. The corpus will be further increased by the inscribed wooden sticks, which are being published bit by bit. P. Stein in 2007 also estimated the number of words at 112,500, versus Peust's estimate of 100,000 words. According to Stein, the words break down as follows: Sabaean: 85,000 words; Qatabanian: 11,000; Minaean: 11,000; Haḍramite: 5,000; other: 500.
^ Ryckmans, J.; Müller, W.W.; Allāh, Y.M.A. (1994). Textes du Yemen Antique. Inscrits sur bois. Institut Orientaliste Louvain: Publications de l'Institut Orientaliste de Louvain (in French). Université Catholique de Louvain, Institut Orientaliste. ISBN 978-2-87723-104-6. Archived from the original on 2023-05-05. Retrieved 2023-05-05. Mais les données principales sont fournies par plus de 8000 inscriptions monumentales , au texte soigneusement gravé dans la pierre ou coulé dans le bronze
^ Quantitatively the best attested language from ancient Italy, after Latin, is Etruscan. The corpus compiled by Helmut Rix comprises several thousand texts, almost all very short, with a total of around 25,000 words.
^ Rix, Helmut (1991). Etruskische Texte : editio minor (in German). Tübingen: G. Narr. ISBN 3-8233-4240-1. OCLC 25336064.
^ Rao, Rajesh P. N.; Yadav, Nisha; Vahia, Mayank N.; Joglekar, Hrishikesh; Adhikari, R.; Mahadevan, Iravatham (18 August 2009). "A Markov model of the Indus script". Proceedings of the National Academy of Sciences. 106 (33): 13685–13690. Bibcode:2009PNAS..10613685R. doi:10.1073/pnas.0906237106. ISSN 0027-8424. PMC 2721819. PMID 19666571.
^ Die Lexikographie der Gāndhārī-Sprache (archived 2023-04-08 at the Wayback Machine), Akademie Aktuell Jahrgang 2013 - Ausgabe Nr. 44, 44-47: "Seit dem Erscheinen von Baileys Artikel ist die Materialgrundlage für die Gāndhārī-Lexikographie durch umfangreiche Neufunde von Handschriften – aber auch Inschriften, Verwaltungsdokumenten und Münzen – um ein Vielfaches angewachsen: Der von uns erstellte Catalog of Gāndhārī Texts (http://gandhari.org/catalog) verzeichnet derzeit 77 umfangreiche Schriftrollen, 330 Handschriftenfragmente, 834 Inschriften, 792 Niya-Dokumente und 335 unterschiedliche Münzlegenden mit einem geschätzten Textbestand von insgesamt 120.000 Wortbelegen." (catalog link archived 2013-12-05 at the Wayback Machine)
^ Kingsbury, P. (2002). The Chronology of the Pali Canon: The Case of the Aorists. University of Pennsylvania. ISBN 978-0-493-92911-8. Retrieved 2023-05-03. The early Buddhist canon written in Pali comprises some 4 million words of text written across several centuries in early India. As such, it is of interest not only to scholars of Buddhism but also linguists and historians for the insight it gives into the social, linguistic, and religious culture of the time.
^ Gignoux, P. (1972). Corpus Inscriptionum Iranicarum: Glossaire des inscriptions pehlevies et parthes (in French). School of Oriental and African Studies. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ MacKenzie, D. N., and Mani. “Mani’s ‘Šābuhragān.’” Bulletin of the School of Oriental and African Studies, University of London 42, no. 3 (1979): 500–534. http://www.jstor.org/stable/615572 Archived 2023-05-05 at the Wayback Machine.; and “Mani’s ‘Šābuhragān’--II.” Bulletin of the School of Oriental and African Studies, University of London 43, no. 2 (1980): 288–310. http://www.jstor.org/stable/616043 Archived 2022-10-08 at the Wayback Machine and Hutter, M. (1992). Manis Kosmogonische Šābuhragān-Texte: Edition, Kommentar und literaturgeschichtliche Einordnung der manichäisch-mittelpersischen Handschriften M 98/99 I und M 7980-7984. Studies in Oriental religions (in German). Otto Harrassowitz. ISBN 978-3-447-03227-8. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Peust 2000, pp. 257–258.
^ Sass, Benjamin (1988). The genesis of the alphabet and its development in the second millennium B.C. Wiesbaden: In Kommission bei O. Harrassowitz. ISBN 3-447-02860-2. OCLC 21033775.
^ Peust 2000, p. 257.
^ Starke, Frank (1985). Die keilschrift-luwischen Texte in Umschrift (in German). Wiesbaden: O. Harrassowitz. ISBN 3-447-02349-X. OCLC 12170509.
^ ^a ^b ^c ^d ^e ^f ^g Peust 2000, p. 255.
^ Carruba, O. (1970). Das Palaische. Studien zu den Boğazköy-Texten; hrsg. von der Kommission für den Alten Orient der Akademie der Wissenschaften und der Literatur, Heft 10 (in German). Harrassowitz. ISBN 978-3-447-01283-6. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ The relevant corpus of hieroglyphic Luwian inscriptions was published by H. Cambel after Peust's article
^ Tituli Asiae Minoris: Tituli Lyciae lingua lycia conscripti. R.M. Rohrer. 1901. and Neumann, G. (1979). Neufunde lykischer Inschriften seit 1901. Denkschriften (Österreichische Akademie der Wissenschaften. Philosophisch-Historische Klasse) (in German). Verlag der Österreichischen Akademie der Wissenschaften. ISBN 978-3-7001-0283-0. Archived from the original on 2023-05-13. Retrieved 2023-05-01.
^ Roberto Gusmani (1980–1986). Lydisches Wörterbuch. Mit grammatischer Skizze und Inschriftensammlung (in German). Ergänzungsband 1-3, Heidelberg. and Gusmani, Roberto (1964). Lydisches Wörterbuch (in German). C. Winter. OCLC 582362214.
^ Haas, O. (1966). Die phrygischen Sprachdenkmäler. Académie bulgare des sciences linguistiques balkanique (in German). Académie bulgare des sciences. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Brixhe, C.; Lejeune, M. (1984). Corpus des inscriptions paléo-phrygiennes (in French). Editions Recherche sur les civilisations. ISBN 978-2-86538-089-3. Retrieved 2023-05-01.
^ Lajara, I.J.A.; Neumann, G. (1993). Studia carica: investigaciones sobre la escritura y lengua carias (in Spanish). PPU. ISBN 978-84-477-0236-7. Archived from the original on 2023-05-13. Retrieved 2023-05-01.
^ Vetter, E. (1953). Handbuch der italischen Dialekte. 1. Reihe: Lehr und Handbücher (in German). C. Winter. ISBN 978-3-8253-5952-2. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ ^a ^b ^c ^d ^e ^f ^g ^h ^i ^j Peust 2000, p. 258.
^ O. Haas, Messapische Studien, Heidelberg 1962 and C. Santoro, Nuovi studi messapici, 2 vols., Lecce 1982/3 and Supplement 1984
^ Lejeune, M. (1974). Manuel de la langue vénète. Indogermanische Bibliothek / Lehr- und Handbücher (in French). Winter. ISBN 978-3-533-02353-1.
^ Giacomelli, G. (1963). La lingua falisca. Biblioteca di "Studi etruschi" (in Italian). L.S. Olschki.
^ Whatmough, Joshua (1969). Dialects of Ancient Gaul. Cambridge: HUP. ISBN 978-0-674-86413-9. OCLC 935283757.
^ ^a ^b ^c ^d Untermann, J. (1975). Monumenta linguarum Hispanicarum (in German). Ludwig Reichert Verlag. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Krause, W.; Jankuhn, H. (1966). Die Runeninschriften im älteren Futhark. Abhandlungen der Akademie der Wissenschaften in Göttingen, Philologisch-Historische Klasse (in German). Vandenhoeck u. Ruprecht. Archived from the original on 2023-05-01. Retrieved 2023-05-01. and M. Stoklund, Neue Runenfunde in Illerup and Vimose, in Germania 64, 1986, 75ff
^ ^a ^b ^c ^d ^e ^f ^g ^h Peust 2000, p. 259.
^ Bernand, E.; Drewes, A.J.; Schneider, R. (1991). Recueil des inscriptions de l'Ethiopie des périodes pré-axoumite et axoumite. Publication of the De Goeje Fund (in French). Académie des inscriptions et belles-lettres. ISBN 978-3-447-11316-8. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Chabot, J.B. (1940). Recueil des inscriptions libyques. Gouvernement général de l'Algérie (in French). Imprimerie nationale. and Galand, L. (1966). Inscriptions antiques du Maroc. Etudes d'Antiquités africaines (in French). Editions du Centre national de la recherche scientifique. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Török, L. (1997). The Kingdom of Kush: Handbook of the Napatan-Meroitic Civilization. Handbook of Oriental Studies / 1: Der Nahe und der Mittlere Osten. Brill. p. 64. ISBN 978-90-04-10448-8.
^ Godart, L.; Olivier, J.P. (1976–85). Recueil des inscriptions en linéaire A (in French). P. Geuthner. ISBN 978-2-86958-470-9. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Poursat, J.C.; Godart, L.; Olivier, J.P. (1978). Le Quartier Mu: Introduction générale. Ecriture hiéroglyphique crétoise / par Louis Godart et Jean-Pierre Olivier. 1. Fouilles Exécutées à Mallia (in French). P. Geuthner. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Duhoux, Y. (1982). L'étéocrétois: les textes, la langue (in French). J.C. Gieben. ISBN 978-90-70265-05-2. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
^ Masson, O. (1961). Les inscriptions chypriotes syllabiques: recueil critique et commenté. École française d'Athènes: Études chypriotes (in French). E. de Boccard. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

Bibliography

Peust, Carsten (2000). "Über ägyptische Lexikographie. 1: Zum Ptolemaic Lexikon von Penelope Wilson; 2: Versuch eines quantitativen Vergleichs der Textkorpora antiker Sprachen". Lingua Aegyptia 7 (PDF). pp. 245–260.
Streck, Michael P. (2010). "Großes Fach Altorientalistik. Der Umfang des keilschriftlichen Textkorpus". Mitteilungen der Deutschen Orientgesellschaft 142 (PDF). pp. 35–58.

[1] "Digital Corpus of Sanskrit (DCS) - Online Sanskrit dictionary and annotated corpus". www.sanskrit-linguistics.org. Archived from the original on 2023-06-03. Retrieved 2023-06-03.

[FOOTNOTEPeust2000252–253-2] Peust 2000, pp. 252–253.

[3] Carsten Peust, "Über ägyptische Lexikographie. 1: Zum Ptolemaic Lexikon von Penelope Wilson; 2: Versuch eines quantitativen Vergleichs der Textkorpora antiker Sprachen Archived 2023-04-16 at the Wayback Machine", in Lingua Aegyptia 7, 2000: 245-260: "Nach einer von W. F. Reineke in S. Grunert & L. Hafemann (Hrsgg.), Textcorpus und Wörterbuch (Probleme der Ägyptologie 14), Leiden 1999, S. xiii veröffentlichten Schätzung W. Schenkels beträgt die Zahl der in allen heute bekannten ägyptischen (d.h. hieroglyphischen und hieratischen) Texten enthaltenen Wortformen annähernd 5 Millionen und tendiert, wenn man die Fälle von Mehrfachüberlieferung u.a. des Totenbuchs und der Sargtexte separat zählt, gegen 10 Millionen; das Berliner Zettelarchiv des Wörterbuchs der ägyptischen Sprache von A. Erman & H. Grapow (Wb), das seinerzeit Vollständigkeit anstrebte, umfasst "nur" 1,7 Millionen (nach anderen Angaben: 1,5 Millionen) Zettel." (p. 246)

[4] W. Schenkel (1995). "Die Lexikographie des Altägyptisch-Koptischen". The lexicography of the Ancient Near Eastern languages (PDF). Verona: Essedue. p. 197. ISBN 88-85697-43-7. OCLC 34816015. Archived (PDF) from the original on 2023-04-30. Retrieved 2023-05-05.

[5] The Chicago Demotic Dictionary, which excerpted the Demotic texts published between 1955 and 1979, with the exception of the definite and indefinite articles and the suffix pronouns (J.H. Johnson in S. Grunert & L. Hafemann, Textcorpus und Wörterbuch, Leiden 1999, p. 243), has produced more than 200,000 slips (R.K. Ritner in S.P. Vleeming, Aspects of Demotic Lexicography, Leuven 1987, p. 145). Peust states that this represents only a fraction of the total number of known Demotic texts, which could be in the region of a million words.

[6] The vocabulary of Greek is completely recorded in the Thesaurus Linguae Graecae, which was previously published on CD-ROM by the University of California, Irvine. The currently available version E, which includes in their entirety all texts up to 600 AD and a selection of Byzantine texts from the following period, comes to a total of around 76 million text words. In 1985, when most authors up to 400 AD (2,700 authors) but only 200 authors from the period after 400 had been entered, the database contained 57 million words (L. Berkowitz & K. A. Squitier, Thesaurus Linguae Graecae, Canon of Greek Authors and Works, New York 1986, p. xii f.). Even if the word count of the texts up to 300 AD must be somewhat smaller, and even after deducting the numerous instances of the Greek definite article, Greek is clearly the best preserved ancient language under the criteria used here. This corpus also contains writings by the Church Fathers written in Greek.

[Dee_2002_pp._59–67-7] Dee, James H. (2002). "The First Downloadable Word-Frequency Database for Classical and Medieval Latin". The Classical Journal. 98 (1). The Classical Association of the Middle West and South: 59–67. ISSN 0009-8353. JSTOR 3298278. Archived from the original on 2023-05-05. Retrieved 2023-05-05. All those frequency counts are drawn from a much wider variety of subjects and styles than exist for classical or medieval Latin, and because the volume of printed and spoken matter in any modern language is staggeringly huge, their authors take great pains to select "representative" corpora, seeking statistically meaningful data. Things are quite different in Latin, where there is, for the classical period, a surviving mass of literature estimated at no more than 9,000,000 words, whereas the corpus of classical Greek literature is usually estimated at "only" ten times that much.

[8] Peust states that the Thesaurus Linguae Latinae has 9 to 10 million words. The Thesaurus Linguae Latinae covers all texts up to AD 150 completely and the later texts up to AD 600 selectively (W. Ehlers, Der Thesaurus Linguae Latinae, in Antike und Abendland 14, 1968, 172ff.), with nine (according to other sources: ten) million slips. Even if the size of the Latin corpus up to 300 AD has not yet been precisely determined, it should be clear that it surpasses that of Egyptian but is below that of Greek. However, the rule of thumb often heard among classical philologists, that there are ten times as many Greek as Latin texts, is exaggerated, at least if one takes the time limit of 300 AD as a basis.

[FOOTNOTEStreck201054-9] Streck 2010, p. 54.

[10] Peust estimated > 10,000,000, stating that estimating the disparately published corpus of this language is extremely difficult. As with the Thesaurus Linguae Latinae and the Egyptian Dictionary, the Chicago Assyrian Dictionary (“Assyrian” means Akkadian as a whole, not just its Assyrian dialect) began by systematically excerpting texts onto slips. When 1,249,000 slips had been reached in 1948, the systematic excerpting was stopped and the index was expanded only with selected slips, so that by 1964 around 1,500,000 to 1,750,000 entries had accumulated (I.J. Gelb in CAD Vol. A 7, Chicago 1964, p. xvi). In Peust's estimation, however, this is probably only a fraction of the entire Akkadian material. The Assyriologist S.N. Kramer is said to have estimated that there were 500,000 cuneiform tablets (most cuneiform texts are in Akkadian). Also of interest is the estimate by A.L. Oppenheim, Ancient Mesopotamia, Chicago, 2nd ed. 1977, p. 17f., that Ashurbanipal's private library found in Nineveh contained 1,200 to 1,500 tablets, the number of lines of which "would probably reach, if not exceed in bulk, even the size of the Mahabharata with its 190,000 verses". Far less than half of this library has survived and been published, but on the other hand there are of course many more texts than those in Ashurbanipal's library. Instead of giving a specific number, Peust concludes by conjecturing that the corpus of Akkadian may be of the order of Latin.

[FOOTNOTEStreck201053-11] Streck 2010, p. 53.

[12] The Sumerian corpus is difficult to estimate. According to W. Sallaberger & A. Westenholz, Mesopotamien: Akkade-Zeit und Ur III-Zeit, Annäherungen 3, Freiburg/Schw. 1999 (OBO 160/3), p. 128, just under 40,000 administrative and legal documents from the Ur III period have been published to date, each of which may contain an average of 20 to 30 words, which amounts to about a million word forms. However, the total volume will probably remain below that of the Egyptian texts.

[13] The most heavily attested of these languages is Hurrian, which is not related to Hittite. The previously published volumes (1, 2, 4, 5, 7, 9) of the corpus of Hurrian language monuments published by V. Haas contain a good 10,000 Hurrian words; [...] of the Mitanni king Tusratta to Amenophis III (ed. J. Friedrich, Kleinasiatische Sprachdenkmäler, Berlin 1932, p. 8 ff.) with approx. [...] Urartian (ed. G.A. Melikisvili, Urartskije klinoobraznyje nadpisi, Moskva 1960, about 10,000 words), which can be considered a successor language of Hurrian.

[14] As of 1980, 600,000 words, or 90% of the text material, had been excerpted by the CHD. Among the cuneiform languages, Hittite, known for the most part through texts from Boghazkoi, the old capital of the Hittite Empire, is the third best attested. The card files of the Chicago Hittite Dictionary contained over 600,000 index cards in 1980, by which time "over 90 percent" of the texts published up to that point had been fully excerpted (H.G. Güterbock & H.A. Hoffner, The Hittite Dictionary, Vol. 3/1 [L], Chicago 1980, p. xv). The text material from Boghazkoi also attests several regional languages from the Hittite sphere of influence, mostly in the form of Hittite–foreign-language bilinguals of religious content or in the form of foreign-language passages interspersed in Hittite ritual texts.

[15] Another Boghazkoi language that has so far hardly been understood is Hattian. The total mass of the known Hattic text material should not exceed a few thousand words (the texts edited in J. Klinger, Untersuchungen zur Rekonstruktion der hattischen Kultschicht, Wiesbaden 1996 contain about 500 words); the number will be easier to determine as soon as the announced compilation of texts in the Hattic language by H. Otten & Ch. Rüster (StBo 37) has been published.

[16] There are a few hundred "gloss-wedge words" interspersed in Hittite texts, see Güterbock, H. G. (1956). "Notes on Luwian Studies (A propos B. Rosenkranz' Book)". Orientalia. 25 (2). GBPress- Gregorian Biblical Press: 120ff. ISSN 0030-5367. JSTOR 43581480. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[17] 2,087 Persepolis tablets, royal inscriptions, etc. This is the number published by Richard Hallock in his Persepolis Fortification Tablets. Numerous other tablets are still unpublished. See Matthew Stolper Archived 2007-02-14 at the Wayback Machine: "there were as many as 15,000 to 30,000 or more tablets and fragments. Most (thousands of tablets and tens of thousands of fragments) were in the elamite language in cuneiform script." Elamite, used in Persia, is a little-known but comparatively well-attested cuneiform language. A count of the most important publications gives the following: R.T. Hallock, Persepolis Fortification Tablets, Chicago 1969, 2,087 clay tablets with about 50,000 words (besides that, many tablets have remained unpublished); F.W. König, Die [sc. pre-Achaemenid] elamischen Königsinschriften, Graz 1965, approx. 10,000 words; F.H. Weissbach, Die Keilinschriften der Achaemeniden, Leipzig 1911, approx. 5,000 words; G.C. Cameron, Persepolis Treasury Tablets, Chicago 1948, about 5,000 words. With all the Elamite material published so far, one will get close to 100,000 words.

[18] After Englund 1998, p. 22, note 8: 1,450 published texts from Susa and 146 unpublished texts. This does not include the 1,435 known tablets from the third millennium written in the only partially deciphered Proto-Elamite script (ed. P. Meriggi, La scrittura proto-elamica, 3 vols., Roma 1971/4), which contain around 20,000 characters, most of which can be read logographically.

[19] 2,436 complete tablets, 13,947 fragments of different sizes. After I. Samir, see appendix 1 below to the National Museum of Syria, Idlib, for a breakdown of the fragments by size. Peust assumed 1,800 tablets and 15,000 fragments. The oldest known Semitic language is Eblaite from the third millennium, discovered only in the 1970s, of which 1,800 intact, often quite extensive clay tablets were found along with around 15,000 fragments (G. Pettinato, Catalogo dei testi cuneiformi di Tell Mardikh-Ebla, Napoli 1979, p. xvi f. (ASIN B004HZDL5K)). The 3,000 or so pieces published to date (G. Conti, Index of Eblaitic texts published or cited, Firenze 1992) should contain around 300,000 words.

[20] 7,000 personal names and 90 secure loanwords, according to Streck, M.P. (2000). Das amurritische Onomastikon der altbabylonischen Zeit. Alter Orient und Altes Testament (in German). Ugarit-Verlag. p. 135. ISBN 978-3-927120-87-7. Archived from the original on 2023-05-14. Retrieved 2023-05-14.

[21] The Ugaritic alphabetic cuneiform texts collected in M. Dietrich et al., The Cuneiform Alphabetic Texts, Münster, 2nd ed. 1995, comprise about 40,000 words. There is also an insignificant number of Ugaritic glosses in syllabic cuneiform.

[22] The Old Persian cuneiform inscriptions (ed. R.G. Kent, Old Persian, New Haven 1953) contain about 7,000 words; the linguistically closely related Avesta texts (ed. K.F. Geldner, Avesta, 3 vols., Stuttgart 1896), which have survived only in later manuscripts, contain around 100,000 words.

[23] Peust follows the estimates of Clines (see following reference) that the total number of Ancient Hebrew words is 353,396, from a summation of the Hebrew Bible's 305,500 words, plus 38,349 words from the Dead Sea Scrolls, 7,020 words from the apocryphal Book of Sirach and 2,528 words from the inscriptions. Peust then subtracts the 34,622 instances of the definite article ה־ ha-, and notes that one can also argue about the 62,760 cases of ו "and", which is grammaticalized in a special way in Hebrew, concluding that, adjusting for these figures, the corpus of Ancient Hebrew is of the order of 300,000 words.

[24] Clines, D.J.A. (1993). The Dictionary of Classical Hebrew: Aleph. Sheffield Academic Press. p. 28. ISBN 978-1-905048-75-5. Archived from the original on 2023-05-14. Retrieved 2023-05-14. [Table: Biblical: BHS 305,500; Non-biblical: Ben Sira 7,020, Qumran 38,349, Inscr 2,528; Total 353,396] The foregoing statistics of the size of the various corpora of Hebrew texts have been derived in the following way. From the totals in the table, Words Beginning with Aleph in Order of Frequency, it can be seen that we have identified 61,883 occurrences of words in the Hebrew Bible (Biblia hebraica stuttgartensia) beginning with Aleph. Knowing that there are some 305,500 words in the Hebrew Bible (the figure comes from Francis I. Andersen and A. Dean Forbes, The Vocabulary of the Old Testament [Rome: Pontifical Biblical Institute, 1989], p. 23), we can assume that, roughly speaking, the 1,422 occurrences of Aleph words in ben Sira imply a text of c. 7,020 words (i.e. 1,422, divided by 61,883 and multiplied by 305,500). Similarly, the total of 7,768 occurrences in the Qumran and related materials implies a corpus of c. 38,300 words (in the non-biblical texts already published, that is).
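
The proportional estimate described in the Clines note above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch: the figures are those quoted in the note (taking 61,883 as the count of Aleph-word occurrences in the Hebrew Bible), while the function and variable names are hypothetical and not taken from any source.

```python
# Figures as quoted in the note above (Clines 1993, p. 28).
BIBLE_WORDS = 305_500   # total words in the Hebrew Bible (Andersen & Forbes)
BIBLE_ALEPH = 61_883    # occurrences of Aleph-initial words in the Hebrew Bible


def estimate_corpus_size(aleph_occurrences: int) -> int:
    """Scale a count of Aleph-initial words up to an estimated corpus size.

    Assumes Aleph-initial words make up the same share of the text
    as they do in the Hebrew Bible (hypothetical helper, for illustration).
    """
    return round(aleph_occurrences * BIBLE_WORDS / BIBLE_ALEPH)


ben_sira = estimate_corpus_size(1_422)   # Aleph words counted in Ben Sira
qumran = estimate_corpus_size(7_768)     # Aleph words counted in Qumran material

print(ben_sira, qumran)
```

Under these assumptions the two results round to the values given in Clines' table for Ben Sira and Qumran.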

[25] Peust writes that the corpus of Aramaic is fragmented into numerous dialects:
Old Aramaic inscriptions from the first half of the first millennium BC (Kanaanäische und Aramäische Inschriften) with about 4000 words

The primary Imperial Aramaic documents are from Egypt (Textbook of Aramaic Documents from Ancient Egypt; the first three volumes contain approx. 20,000 words), but it is also preserved in numerous other inscriptions and documents. Imperial Aramaic also includes the Aramaic text of Papyrus Amherst 63, written in Egyptian-Demotic script, which must contain around 3000 words.

The Aramaic passages of the Old Testament ("Biblical Aramaic", particularly the Book of Daniel chapters 2-7 and the Book of Ezra 1:2–4, 4:8–16, 4:17–22, 5:7–17, 6:3–5, 6:6–12, 7:12–26) are closely related to Imperial Aramaic with a volume of a good 5000 words.

Hasmonean, which is found above all in Apocrypha and Targum in the Dead Sea Scrolls, but is also attested in the Judean documents (Beyer, Klaus (1984). Die aramäischen Texte vom Toten Meer (in German). Vandenhoeck & Ruprecht. pp. 157–318. ISBN 978-3-525-53571-4.); 15,000 words in total.

Nabataean and Palmyrene are each attested in about 1,000 grave and votive inscriptions.
Peust concludes that the total Aramaic corpus available up to this time is probably not much less than 100,000 words, and notes that from about 300 AD the Aramaic text corpus increases in leaps and bounds, since several major literary languages are now developing (Syriac, Mandean, Galilean, Samaritan).

[Gray_1923_pp._73–88-26] Gray, Louis H. (1923). "The Punic Passages in the "Poenulus" of Plautus". The American Journal of Semitic Languages and Literatures. 39 (2). University of Chicago Press: 73–88. doi:10.1086/369974. ISSN 1062-0516. JSTOR 528483. S2CID 170454820. Archived from the original on 2023-05-05. Retrieved 2023-05-05.

[27] Peust states that the corpus of Phoenician including Punic per Kanaanäische und Aramäische Inschriften (KAI) amounts to around 10,000 words, but this contradicts other estimates of 10,000 texts in the whole corpus. KAI contains a selection of texts rather than a complete corpus.

[Doak2019-28] Doak, Brian R. (2019-08-26). The Oxford Handbook of the Phoenician and Punic Mediterranean. Oxford University Press. p. 223. ISBN 978-0-19-049934-1. Most estimates place it at around ten thousand texts. Texts that are either formulaic or extremely short constitute the vast majority of the evidence.

[29] Lehmann, Reinhard G. (2013). "Wilhelm Gesenius and the Rise of Phoenician Philology". Biblische Exegese und hebräische Lexikographie. pp. 209–266. doi:10.1515/9783110267044.209. ISBN 978-3-11-026612-2. Quote: "Nearly two hundred years later the repertory of Phoenician-Punic epigraphy counts about 10.000 inscriptions from throughout the Mediterranean and its environs."

[30] The corpus of the Old South Arabian languages has been published in scattered publications and is difficult to survey. The old compilations in the Corpus Inscriptionum Semiticarum (and RES) contain around 3,000 texts with over 50,000 words, although a (small) part of these texts dates from after 300 AD. Thus, a stock of well over 100,000 words can now be assumed. The Old South Arabian texts are mainly in Sabaean, but also in other languages such as Minaean, Qatabanian and Hadramitic, although the attribution of some shorter monuments remains uncertain. P. Stein stated in 2007 that there were 10,500 inscriptions, whereas Peust gave 8,000 inscriptions in 2000. According to Stein, the texts are divided as follows: Sabaean: 5,300 texts; Qataban: 2,000; Minaean: 1,200; Haḍramite: 1,500; other/uncertain: 500. The corpus will be further increased by the inscribed wooden sticks, which are being published bit by bit. P. Stein in 2007 also estimated the number of words at 112,500, versus Peust's estimate of 100,000 words. According to Stein, the words break down as follows: Sabaean: 85,000 words; Qataban: 11,000; Minaean: 11,000; Haḍramite: 5,000; other: 500.
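
As a consistency check, the per-language figures attributed to Stein in the note above can be summed and compared with the stated totals. This is a minimal sketch using only the numbers quoted in the note; the dictionary layout and variable names are illustrative assumptions.

```python
# Per-language figures attributed to Stein (2007) in the note above.
stein_texts = {
    "Sabaean": 5_300,
    "Qataban": 2_000,
    "Minaean": 1_200,
    "Hadramite": 1_500,
    "other/uncertain": 500,
}
stein_words = {
    "Sabaean": 85_000,
    "Qataban": 11_000,
    "Minaean": 11_000,
    "Hadramite": 5_000,
    "other": 500,
}

total_texts = sum(stein_texts.values())
total_words = sum(stein_words.values())

# Average length of an inscription implied by the two totals.
words_per_text = total_words / total_texts

print(total_texts, total_words, round(words_per_text, 1))
```

The sums match the totals of 10,500 inscriptions and 112,500 words given in the note, which implies an average of roughly ten to eleven words per inscription.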

[31] Ryckmans, J.; Müller, W.W.; Allāh, Y.M.A. (1994). Textes du Yemen Antique. Inscrits sur bois. Institut Orientaliste Louvain: Publications de l'Institut Orientaliste de Louvain (in French). Université Catholique de Louvain, Institut Orientaliste. ISBN 978-2-87723-104-6. Archived from the original on 2023-05-05. Retrieved 2023-05-05. Mais les données principales sont fournies par plus de 8000 inscriptions monumentales, au texte soigneusement gravé dans la pierre ou coulé dans le bronze

[32] After Latin, the quantitatively best attested language from ancient Italy is Etruscan. The corpus compiled by Helmut Rix comprises several thousand texts, almost all very short, with a total of around 25,000 words.

[33] Rix, Helmut (1991). Etruskische Texte : editio minor (in German). Tübingen: G. Narr. ISBN 3-8233-4240-1. OCLC 25336064.

[34] Rao, Rajesh P. N.; Yadav, Nisha; Vahia, Mayank N.; Joglekar, Hrishikesh; Adhikari, R.; Mahadevan, Iravatham (18 August 2009). "A Markov model of the Indus script". Proceedings of the National Academy of Sciences. 106 (33): 13685–13690. Bibcode:2009PNAS..10613685R. doi:10.1073/pnas.0906237106. ISSN 0027-8424. PMC 2721819. PMID 19666571.

[35] Die Lexikographie der Gandharī-Sprache Archived 2023-04-08 at the Wayback Machine, Akademie Aktuell Jahrgang 2013 - Ausgabe Nr. 44, 44-47: "Seit dem Erscheinen von Baileys Artikel ist die Materialgrundlage für die Gāndhārī-Lexikographie durch umfangreiche Neufunde von Handschriften – aber auch Inschriften, Verwaltungsdokumenten und Münzen – um ein Vielfaches angewachsen: Der von uns erstellte Catalog of Gāndhārī Texts (http://gandhari.org/ Archived 2013-12-05 at the Wayback Machine catalog) verzeichnet derzeit 77 umfangreiche Schriftrollen, 330 Handschriftenfragmente, 834 Inschriften, 792 Niya-Dokumente und 335 unterschiedliche Münzlegenden mit einem geschätzten Textbestand von insgesamt 120.000 Wortbelegen."

[Kingsbury-36] Kingsbury, P. (2002). The Chronology of the Pali Canon: The Case of the Aorists. University of Pennsylvania. ISBN 978-0-493-92911-8. Retrieved 2023-05-03. The early Buddhist canon written in Pali comprises some 4 million words of text written across several centuries in early India. As such, it is of interest not only to scholars of Buddhism but also linguists and historians for the insight it gives into the social, linguistic, and religious culture of the time.

[37] Gignoux, P. (1972). Corpus Inscriptionum Iranicarum: Glossaire des inscriptions pehlevies et parthes (in French). School of Oriental and African Studies. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[38] MacKenzie, D. N., and Mani. “Mani’s ‘Šābuhragān.’” Bulletin of the School of Oriental and African Studies, University of London 42, no. 3 (1979): 500–534. http://www.jstor.org/stable/615572 Archived 2023-05-05 at the Wayback Machine.; and “Mani’s ‘Šābuhragān’--II.” Bulletin of the School of Oriental and African Studies, University of London 43, no. 2 (1980): 288–310. http://www.jstor.org/stable/616043 Archived 2022-10-08 at the Wayback Machine and Hutter, M. (1992). Manis Kosmogonische Šābuhragān-Texte: Edition, Kommentar und literaturgeschichtliche Einordnung der manichäisch-mittelpersischen Handschriften M 98/99 I und M 7980-7984. Studies in Oriental religions (in German). Otto Harrassowitz. ISBN 978-3-447-03227-8. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[FOOTNOTEPeust2000257–258-39] Peust 2000, pp. 257–258.

[40] Sass, Benjamin (1988). The genesis of the alphabet and its development in the second millen̄ium B.C. Wiesbaden: In Kommission bei O. Harrassowitz. ISBN 3-447-02860-2. OCLC 21033775.

[FOOTNOTEPeust2000257-41] Peust 2000, pp. 257.

[42] Starke, Frank (1985). Die keilschrift-luwischen Texte in Umschrift (in German). Wiesbaden: O. Harrassowitz. ISBN 3-447-02349-X. OCLC 12170509.

[FOOTNOTEPeust2000255-43] ^ ^a ^b ^c ^d ^e ^f ^g Peust 2000, pp. 255.

[44] Carruba, O. (1970). Das Palaische. Studien zu den Bogazkoy-Texten; hrsg. von der Kommission fur den Alten Orient der Akademie der Wissenschaften und der Literatur, Heft 10 (in German). Harrassowitz. ISBN 978-3-447-01283-6. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[45] The relevant corpus of hieroglyphic Luwian inscriptions was published by H. Cambel after Peust's article

[46] Tituli Asiae Minoris: Tituli Lyciae lingua lycia conscripti. R.M. Rohrer. 1901. and Neumann, G. (1979). Neufunde lykischer Inschriften seit 1901. Denkschriften (Österreichische Akademie der Wissenschaften. Philosophisch-Historische Klasse) (in German). Verlag der Österreichischen Akademie der Wissenschaften. ISBN 978-3-7001-0283-0. Archived from the original on 2023-05-13. Retrieved 2023-05-01.

[47] Roberto Gusmani (1980–1986). Lydisches Wörterbuch. Mit grammatischer Skizze und Inschriftensammlung (in German). Ergänzungsband 1-3, Heidelberg. and Gusmani, Roberto (1964). Lydisches Wörterbuch (in German). C. Winter. OCLC 582362214.

[48] Haas, O. (1966). Die phrygischen Sprachdenkmäler. Académie bulgare des sciences linguistiques balkanique (in German). Académie bulgare des sciences. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[49] Brixhe, C.; Lejeune, M. (1984). Corpus des inscriptions paléo-phrygiennes (in French). Editions Recherche sur les civilisations. ISBN 978-2-86538-089-3. Retrieved 2023-05-01.

[50] Lajara, I.J.A.; Neumann, G. (1993). Studia carica: investigaciones sobre la escritura y lengua carias (in Spanish). PPU. ISBN 978-84-477-0236-7. Archived from the original on 2023-05-13. Retrieved 2023-05-01.

[51] Vetter, E. (1953). Handbuch der italischen Dialekte. 1. Reihe: Lehr und Handbücher (in German). C. Winter. ISBN 978-3-8253-5952-2. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[FOOTNOTEPeust2000258-52] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j Peust 2000, pp. 258.

[53] O. Haas, Messapische Studien, Heidelberg 1962 and C. Santoro, Nuovi studi messapici, 2 vols., Lecce 1982/3 and Supplement 1984

[54] Lejeune, M. (1974). Manuel de la langue vénète. Indogermanische Bibliothek / Lehr- und Handbücher (in French). Winter. ISBN 978-3-533-02353-1.

[55] Giacomelli, G. (1963). La lingua falisca. Biblioteca di "Studi etruschi" (in Italian). L.S. Olschki.

[56] Whatmough, Joshua (1969). Dialects of Ancient Gaul. Cambridge: HUP. ISBN 978-0-674-86413-9. OCLC 935283757.

[Untermann-57] Untermann, J. (1975). Monumenta linguarum Hispanicarum (in German). Ludwig Reichert Verlag. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[58] Krause, W.; Jankuhn, H. (1966). Die Runeninschriften im älteren Futhark. Abhandlungen der Akademie der Wissenschaften in Göttingen, Philologisch-Historische Klasse (in German). Vandenhoeck u. Ruprecht. Archived from the original on 2023-05-01. Retrieved 2023-05-01. and M. Stoklund, Neue Runenfunde in Illerup and Vimose, in Germania 64, 1986, 75ff

[FOOTNOTEPeust2000259-59] ^ ^a ^b ^c ^d ^e ^f ^g ^h Peust 2000, pp. 259.

[60] Bernand, E.; Drewes, A.J.; Schneider, R. (1991). Recueil des inscriptions de l'Ethiopie des périodes pré-axoumite et axoumite. Publication of the De Goeje Fund (in French). Académie des inscriptions et belles-lettres. ISBN 978-3-447-11316-8. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[61] Chabot, J.B. (1940). Recueil des inscriptions libyques. Gouvernement général de l'Algérie (in French). Imprimerie nationale. and Galand, L. (1966). Inscriptions antiques du Maroc. Etudes d'Antiquités africaines (in French). Editions du Centre national de la recherche scientifique. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[62] Török, L. (1997). The Kingdom of Kush: Handbook of the Napatan-Meroitic Civilization. Handbook of Oriental Studies / 1: Der Nahe und der Mittlere Osten. Brill. p. 64. ISBN 978-90-04-10448-8.

[63] Godart, L.; Olivier, J.P. (1976–85). Recueil des inscriptions en linéaire A (in French). P. Geuthner. ISBN 978-2-86958-470-9. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[64] Poursat, J.C.; Godart, L.; Olivier, J.P. (1978). Le Quartier Mu: Introduction générale. Ecriture hiéroglyphique crétoise / par Louis Godart et Jean-Pierre Olivier. 1. Fouilles Exécutées à Mallia (in French). P. Geuthner. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[65] Duhoux, Y. (1982). L'étéocrétois: les textes, la langue (in French). J.C. Gieben. ISBN 978-90-70265-05-2. Archived from the original on 2023-05-01. Retrieved 2023-05-01.

[66] Masson, O. (1961). Les inscriptions chypriotes syllabiques: recueil critique et commenté. École français d'Athènes: Études chypriotes (in French). E. de Boccard. Archived from the original on 2023-05-01. Retrieved 2023-05-01.
