Albanian National Corpus
This is the main page of the Albanian National Corpus (ANC), which contains more than 230 million words.
The Corpus is designed for people interested in various issues related to the Albanian language, and provides reference data for both professional linguists and those who show an interest in Albanian and its history. The material collected in the Corpus can be used for studies of the Albanian lexicon and grammar, as well as for the investigation of language changes that took place in Albanian over previous centuries.
From 2011 to the present, the Albanian National Corpus has included primarily original prose texts (fiction, non-fiction, press) representing the Albanian literary language, as well as translated texts and Albanian poetry. The project aims at the widest possible representation of not only modern written texts of various genres, but also texts created in different periods of the history of the Albanian language, as well as colloquial and dialectal texts representing non-literary forms of modern Albanian. In 2019, the Corpus of early Albanian texts was opened.
The texts of ANC have user-friendly metadata descriptions and linguistic annotation consisting of tags assigned to words.
Annotation
- automatic morphological annotation (lemmatization, part of speech, all grammatical categories, English translation of lemmata), 93.9% of words analyzed (only tokens that do not contain digits or characters from other alphabets are taken into account)
- etymological and derivational markup (annotation of inherited words and loanwords, annotation of derived words)
- metatextual annotation
- partial disambiguation (automatic disambiguation based on CG-3 rules)
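The coverage figure in the list above counts only tokens without digits or characters from other alphabets. A minimal sketch of how such a figure could be computed is given below. The character set, the `is_countable` and `coverage` helpers, and the `analyzed` set are assumptions made for illustration, not the Corpus's actual implementation.

```python
import re

# Assumption: a countable token consists only of Latin letters plus ë and ç,
# optionally joined by a hyphen or an apostrophe. The real filter may differ.
ALBANIAN_TOKEN = re.compile(r"[a-zëç]+(?:['’-][a-zëç]+)*", re.IGNORECASE)


def is_countable(token: str) -> bool:
    """Return True if the token has no digits or foreign-alphabet characters."""
    return ALBANIAN_TOKEN.fullmatch(token) is not None


def coverage(tokens, analyzed) -> float:
    """Percentage of countable tokens that received at least one analysis.

    `analyzed` is a hypothetical set of tokens recognized by the analyzer.
    """
    countable = [t for t in tokens if is_countable(t)]
    if not countable:
        return 0.0
    hits = sum(1 for t in countable if t in analyzed)
    return 100.0 * hits / len(countable)
```

Tokens such as `2011` or a word in Cyrillic script are excluded before the percentage is computed, so they lower neither the numerator nor the denominator.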
Metadata
- title of the text
- author or title of the newspaper
- translator (in the case of translated texts)
- creation year (exact date of issue in the case of newspapers)
- genre / type of text
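For readers who process corpus output programmatically, the metadata fields listed above could be modeled as a simple record. This is only a sketch: the class name `TextMetadata` and all field names are hypothetical and do not reflect the Corpus's internal format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextMetadata:
    """Hypothetical record mirroring the metadata fields of a corpus text."""
    title: str
    author_or_newspaper: str          # author, or newspaper title for press
    created: str                      # year, or exact issue date for newspapers
    genre: str                        # genre / type of text
    translator: Optional[str] = None  # filled in only for translated texts

    def is_translation(self) -> bool:
        """A text counts as translated when a translator is recorded."""
        return self.translator is not None


# Placeholder values, not a real corpus entry.
example = TextMetadata(
    title="Example title",
    author_or_newspaper="Example Author",
    created="2005",
    genre="fiction",
)
```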
About corpora
Currently, two corpora are available: the Corpus of modern literary Albanian (“Main corpus”) and the Corpus of early Albanian texts. They differ in the kinds of texts they contain and in how these texts are presented, but have mostly identical annotation and search capabilities. The main characteristics of the corpora are as follows:
Main corpus
The Main corpus contains 232 million words and consists of the following subcorpora:
| | Press | Fiction | Nonfiction | Poetry |
|---|---|---|---|---|
| Size | 222.5 million words | 4.5 million words | 4.6 million words | 0.3 million words |
| Texts | | authentic and translated fiction texts: novels, novellas, short stories, plays, books for children (Albania, Kosovo, North Macedonia, 1960–2010s) | | Albanian poetry of the first half of the 20th century (works of Asdreni, A.Z. Çajupi, Migjeni, F. Noli, L. Poradeci); poetry of Albania, Kosovo and North Macedonia, 1950–2010s |
| Language variety | In most cases, standard written Albanian or close to it. Interviews and quotations may contain examples of dialectal speech. | In most cases, standard written Albanian or close to it. The speech of the characters may contain examples of dialectal speech or dialect stylization. | In most cases, standard written Albanian or close to it. Texts created before the 1950s may use earlier variants of the literary language. | In most cases, standard written Albanian or close to it. Poetic texts created before the 1950s may use earlier variants of the literary language. |
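As a quick sanity check, the subcorpus sizes given above can be summed to confirm that they match the stated total of about 232 million words. The snippet below is only an illustrative calculation; the dictionary and function names are made up for this example.

```python
# Sizes in millions of words, copied from the Main corpus description.
SUBCORPORA_MILLION_WORDS = {
    "Press": 222.5,
    "Fiction": 4.5,
    "Nonfiction": 4.6,
    "Poetry": 0.3,
}


def total_million_words(sizes: dict) -> float:
    """Sum the subcorpus sizes, rounded to one decimal place."""
    return round(sum(sizes.values()), 1)


total = total_million_words(SUBCORPORA_MILLION_WORDS)
print(total)         # 231.9
print(round(total))  # 232, the figure quoted for the Main corpus
```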
Corpus of early Albanian texts
This Corpus is intended to include the main monuments of Albanian writing: the works by Gj. Buzuk, L. Matranga, P. Budi, F. Bardhi, P. Bogdani, P. Mazreku, the Kanun of Arbëria, etc.
To date, the text of the “Christian Doctrine” by Luka Matranga (1592) has been included in the Corpus of early Albanian texts. Matranga’s book is the second largest text written in the Albanian language, and the first in the Tosk dialect.
The text used in the Corpus is based on the so-called manuscript A. We used the transcription by M. de Vaan, which is based on M. Sciambra, La “Dottrina cristiana” albanese di Luca Matranga (1964), and is available at TITUS. We are grateful to M. de Vaan for providing us with the textual version of the transcription. During text processing, the transcription was checked against other editions of the book, including the phototypic ones.
Publications
In our publications you may find more information about the Albanian National Corpus, its contents, and development:
Morozova, M. 2012. Shënime për standardin morfologjik të Korpusit nacional të shqipes [Notes on the morphological standard of the Albanian National Corpus]. In: Rugova, B. ed. Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 13–27 gusht 2012. Vëll. 31/1. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, pp. 153–156.
Arkhangelskij, T., Daniel, M., Morozova, M. and Rusakov, A. 2012. Korpusi i gjuhës shqipe: drejtimet kryesore të punës [Albanian Corpus: Main directions of work]. In: Ismajli, R. ed. Shqipja dhe gjuhët e Ballkanit. Albanian and Balkan Languages. Konferencë e mbajtur më 10–11 dhjetor 2011 në Prishtinë. Prishtinë: ASHAK, pp. 635–642.
Rusakov, A. and Morozova, M. 2012. Korpusi i gjuhës shqipe: problemet dhe rezultatet [Albanian Corpus: Problems and results]. In: Rugova, B. ed. Studime për nder të Rexhep Ismajlit me rastin e 65-vjetorit të lindjes. Prishtinë: Koha, pp. 639–649.
Morozova, M., Rusakov, A. and Domosiletskaya, M. 2013. Albanskaya imennaya morfologiya v korpusnom predstavlenii: Natsionalnyj korpus albanskogo yazyka [The Albanian noun morphology in the morphological model developed for the Albanian National Corpus]. In: Girfanova, A. Kh., Domosiletskaya, M. V., Zhugra, A. V., Kazansky, N. N., Rusakov, A. Yu., Sukhachev, N. L. eds. Albanskaya filologiya, balkanistika, problemy yazykoznaniya. K 100-letiyu so dnya rozhdeniya chlena-korrespondenta RAN Agnii Vasiliyevny Desnitskoj. Saint Petersburg: Nauka, pp. 120–130.
Morozova, M. and Rusakov, A. 2014. Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi [Electronic corpus of Albanian: Development, contents and use]. In: Rugova, B. ed. Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 19–30.08.2013. Vëll. 33/1. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, pp. 85–96.
Morozova, M. and Rusakov, A. 2015. Albanian National Corpus: Composition, Text Processing and Corpus-Oriented Grammar Development. In: Demiraj, B. ed. Sprache und Kultur der Albaner. Zeitliche und räumliche Dimensionen. Akten der 5. Deutsch-albanischen kulturwissenschaftlichen Tagung (5.–8. Juni 2014, Buçimas bei Pogradec, Albanien). Wiesbaden: Harrassowitz Verlag, pp. 270–308. (Albanische Forschungen, 37).
Morozova, M., Arkhangelskiy, T., Daniel, M. and Rusakov, A. 2016. Albanskij natsionalnyj korpus: osnovnye napravleniya raboty [Albanian National Corpus: Main directions of work]. Acta Linguistica Petropolitana. Trudy Instituta lingvisticheskikh issledovanij RAN. Vol. XII, Part 3, 169–189.
Morozova, M. and Rusakov, A. 2020. The early Albanian texts in an annotated language corpus: An attempt of processing and analysis. In: Demiraj, B. ed. Altalbanische Schriftkultur - aus der Perspektive der historischen Lexikographie und der Philologie der Gegenwart - Akten der 6. deutsch-albanischen kulturwissenschaftlichen Tagung (27. September 2019, Buçimas bei Pogradec, Albanien). Wiesbaden: Harrassowitz Verlag, pp. 91–102. (Albanische Forschungen, 44).
Morozova, M. and Rusakov, A. 2023. Struktura fjalëformuese e leksikut të shqipes (rreth anotimit të Korpusit elektronik të gjuhës shqipe) [Derivational structure of Albanian lexicon (towards the annotation of the Albanian corpus)]. In: Ismajli, R. ed. Shqipja standarde sot: konferencë shkencore me rastin e 50-vjetorit të Kongresit të Drejtshkrimit. Prishtinë: Akademia e Shkencave dhe e Arteve e Kosovës, pp. 233–242.
Rusakov, A. and Morozova, M. 2025. Albanian classical poetry in the Albanian National Corpus: Between linguistics and philology. In: Omari, A., Sinani, Sh., Hala, L. eds. Studime albanologjike në indoeuropianistike, filologji dhe gjuhësi kontakti. Vëllim në nderim të prof. Bardhyl Demirajt. Tirana: Akademia e shkencave e Shqipërisë, pp. 182–197.
How to cite the Corpus
If you use information or examples from the Albanian National Corpus in your research, please cite it as follows:
Morozova, Maria, Alexander Rusakov & Timofey Arkhangelskiy. Albanian National Corpus. URL: albanian.web-corpora.net (accessed on .)
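The citation above leaves the access date to be filled in by the person citing the Corpus. A small helper such as the following could generate the full string. The function name `format_citation` and the ISO date format are assumptions for illustration; the Corpus does not prescribe a particular date format.

```python
from datetime import date

# Citation text copied from the recommended form; only the date is variable.
CITATION_TEMPLATE = (
    "Morozova, Maria, Alexander Rusakov & Timofey Arkhangelskiy. "
    "Albanian National Corpus. URL: albanian.web-corpora.net "
    "(accessed on {accessed}.)"
)


def format_citation(accessed: date) -> str:
    """Fill the access date into the citation (ISO format is an assumption)."""
    return CITATION_TEMPLATE.format(accessed=accessed.isoformat())


# Example: cite the Corpus as accessed today.
print(format_citation(date.today()))
```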
Authors
The first edition of the Corpus was developed in 2011 through the joint efforts of linguists from Saint Petersburg (Institute for Linguistic Studies of the Russian Academy of Sciences) and Moscow (the School of Linguistics at HSE). This version used the search engine of the Eastern Armenian National Corpus (EANC).
The main ideas for the development of the corpus representation of Albanian grammar, the metatext and grammatical annotation belong to Maria Morozova and Alexander Rusakov. Timofey Arkhangelskiy and Mikhail Daniel took part in the discussion of these issues.
The current version of ANC, opened in 2016, was created by Maria Morozova, Alexander Rusakov and Timofey Arkhangelskiy. It uses the morphological analyzer and the tsakorpus platform developed by Timofey Arkhangelskiy.
The following people were involved in the development of the Corpus at different stages:
- Maria Morozova, Saint Petersburg (corpus representation of Albanian grammar, ideas for annotation, grammatical description of the inflection for the parser, compilation of the wordlist for the parser, metatext markup, etymological markup, derivational markup, disambiguation rules, text collection and processing, corpus webpage)
- Alexander Rusakov, Saint Petersburg — Tirana (corpus representation of Albanian grammar, ideas for annotation, collection and processing of texts)
- Timofey Arkhangelskiy, Moscow — Hamburg — Freising (development of the parser, the corpus platform, and the user interface, technical support)
- Marina Domosiletskaya, Saint Petersburg (compilation of the wordlist for the parser — nouns)
- Anna Konovalenko, Saint Petersburg (compilation of the wordlist for the parser — adverbs)
- Anastasia Sidko, Saint Petersburg (compilation of the wordlist for the parser — nouns, adjectives, verbs)
- Daria Alekseeva, Saint Petersburg (collection and processing of texts)
- Elizabeth Atakova, Saint Petersburg (collection and processing of texts)
- Varvara Diveeva, Saint Petersburg (collection and processing of texts)
- Maxim Makartsev, Moscow — Oldenburg (providing texts for the corpus)
- Besim Kabashi, Munich (providing texts for the corpus)
- Qerim Ondozi, Prishtina (collection and processing of texts)
The first version of the Albanian National Corpus was created with the financial support of the “Corpus linguistics” Program of the Presidium of the Russian Academy of Sciences. The Corpus team is grateful to the Academy of Sciences of Albania for its interest in the project, assistance, and willingness to cooperate. We also thank the publishing house “Onufri” (Tirana) for its help in the selection of texts.
The current version of the Corpus is hosted by the Institute for Linguistic Studies of the Russian Academy of Sciences.
The Corpus is currently maintained and developed by:
Maria Morozova (Institute for Linguistic Studies of the Russian Academy of Sciences), morozovamaria86@gmail.com
Alexander Rusakov (Institute of Linguistics and Literature of the Academy of Sciences of Albania / Institute for Linguistic Studies of the Russian Academy of Sciences), ayurusakov@gmail.com
Timofey Arkhangelskiy (Universität Hamburg), timarkh@gmail.com