Albanian National Corpus


Welcome to the start page of the Albanian National Corpus, which includes the Main corpus (from the end of the 19th century to the present) and the Corpus of early Albanian texts.


This is the main page of the Albanian National Corpus (ANC), which contains more than 230 million words.

The Corpus is designed for anyone interested in the Albanian language and can serve as a reference source both for professional linguists and for those with a general interest in Albanian and its history. The material collected in the Corpus can be used to study the Albanian lexicon and grammar, as well as the language changes that took place in Albanian in previous centuries.

Since 2011, the Albanian National Corpus has included primarily original prose texts (fiction, non-fiction, press) representing the Albanian literary language, as well as translated texts and Albanian poetry. The project aims to represent as widely as possible not only modern written texts of various genres, but also texts created in different periods of the history of Albanian, as well as colloquial and dialectal texts representing non-literary forms of the modern language. The Corpus of early Albanian texts was opened in 2019.

The texts of ANC have user-friendly metadata descriptions and linguistic annotation consisting of tags assigned to words.

Annotation

  • automatic morphological annotation (lemmatization, part of speech, all grammatical categories, English translation of lemmata); 93.9% of words are analyzed (only tokens that do not contain digits or characters from other alphabets are taken into account)
  • etymological and derivational markup (annotation of inherited words and loanwords, annotation of derived words)
  • metatextual annotation
  • partial disambiguation (automatic disambiguation based on CG-3 rules)
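
The coverage figure in the first item depends on which tokens are counted. The sketch below illustrates that kind of calculation under stated assumptions; it is not the Corpus's own code. The token pattern (basic Latin letters plus ç and ë, with apostrophes and hyphens allowed), the `(wordform, analyses)` data layout, the function names, and the sample data are all hypothetical.

```python
import re

# Assumption: a token counts if it consists only of basic Latin letters,
# ç, ë, apostrophes, or hyphens. The Corpus's actual filter may differ.
ALBANIAN_TOKEN = re.compile(r"^[a-zçë'’-]+$")


def is_countable(token: str) -> bool:
    """Return True if the token has no digits or foreign-alphabet characters."""
    return bool(ALBANIAN_TOKEN.match(token.lower()))


def coverage(tokens) -> float:
    """Share of countable tokens that received at least one analysis.

    `tokens` is an iterable of (wordform, analyses) pairs, where
    `analyses` is a list that is empty for unanalyzed tokens.
    """
    counted = 0
    analyzed = 0
    for wordform, analyses in tokens:
        if not is_countable(wordform):
            continue  # digits or other alphabets: excluded from the count
        counted += 1
        if analyses:
            analyzed += 1
    return analyzed / counted if counted else 0.0


# Hypothetical sample: the analysis strings are placeholders.
sample = [
    ("shtëpia", ["NOUN,f,sg,nom,def"]),
    ("është", ["V,3,sg,pres,ind"]),
    ("2023", []),      # contains digits, not counted
    ("Москва", []),    # Cyrillic, not counted
    ("xyzzy", []),     # counted, but unanalyzed
    ("e", ["ART"]),
]

print(f"{coverage(sample):.1%}")  # 3 of 4 countable tokens analyzed: 75.0%
```

Excluding tokens with digits or foreign characters keeps the percentage focused on Albanian word forms that the analyzer could reasonably be expected to handle.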

Metadata

  • title of the text
  • author or title of the newspaper
  • translator (in the case of translated texts)
  • creation year (exact date of issue in the case of newspapers)
  • genre / type of text

About corpora

Currently, two corpora are available: the Corpus of modern literary Albanian (“Main corpus”) and the Corpus of early Albanian texts. They differ in the kinds of texts they contain and in how these texts are presented, but have mostly identical annotation and search capabilities. The main characteristics of the corpora are as follows:

Main corpus

The Main corpus contains 232 million words and consists of the following subcorpora:

Press

Size: 222.5 million words

Texts: modern newspapers in Albanian (online):

  • Agjencia Telegrafike Shqiptare (Albanian Telegraphic Agency) (Albania, 2016–2017): 4.4 mln words
  • Gazeta Shqiptare (Albania, 2006–2007): 4.4 mln words
  • Panorama (Albania, 2014–2017): 8.7 mln words
  • Gazeta Telegraf (Albania, 2012–2023): 96 mln words
  • Koha Jonë (Albania, 2021–2022): 8 mln words
  • Zëri (Kosovo, 2013, 2019–2023): 45.3 mln words
  • Koha.mk (North Macedonia, 2011–2013, 2019–2023): 60.3 mln words

Language variety: in most cases, standard written Albanian or close to it. Examples of dialectal speech may occur in interviews and quotations.

Fiction

Size: 4.5 million words

Texts: authentic and translated fiction texts: novels, novellas, short stories, plays, books for children (Albania, Kosovo, North Macedonia, 1960–2010s)

Language variety: in most cases, standard written Albanian or close to it. Examples of dialectal speech or dialect stylization may occur in the speech of the characters.

Nonfiction

Size: 4.6 million words

Texts: authentic and translated nonfiction texts:

  • essays, interviews, epistolary texts, memoirs, diaries, notes: essays and correspondence of L. Poradeci (1910–1940s); modern essays, diaries and memoirs (Albania, Kosovo, North Macedonia, 1950–2010s): 1.3 mln words
  • religious texts: the Bible (Old Testament, translated by the Albanian Bible Society; New Testament, translated by V. Dervishi), the Quran: 0.8 mln words
  • scientific texts and textbooks (humanities and social sciences; Albania, Kosovo, 2000–2010s): 2.1 mln words
  • official texts (legislative and diplomatic; Albania, Kosovo, 1990–2000s): 0.4 mln words

Language variety: in most cases, standard written Albanian or close to it. Texts created before the 1950s may use earlier variants of the literary language.

Poetry

Size: 0.3 million words

Texts: Albanian poetry of the first half of the 20th century (works of Asdreni, A.Z. Çajupi, Migjeni, F. Noli, L. Poradeci); poetry of Albania, Kosovo and North Macedonia, 1950–2010s

Language variety: in most cases, standard written Albanian or close to it. Poetic texts created before the 1950s may use earlier variants of the literary language.

Corpus of early Albanian texts

This Corpus is intended to include the main monuments of Albanian writing: the works by Gj. Buzuk, L. Matranga, P. Budi, F. Bardhi, P. Bogdani, P. Mazreku, the Kanun of Arbëria, etc.

To date, the text of the “Christian Doctrine” by Luka Matranga (1592) has been included in the Corpus of early Albanian texts. Matranga’s book is the second largest text written in the Albanian language, and the first in the Tosk dialect.

The text used in the Corpus is based on the so-called manuscript A. We used the transcription of M. de Vaan, based on M. Sciambra. La “Dottrina cristiana” albanese di Luca Matranga (1964) and available at TITUS. We are grateful to M. de Vaan for providing us with the textual version of the transcription. During text processing, the transcription was checked against other editions of the book, including phototypic ones.

Publications

In our publications you may find more information about the Albanian National Corpus, its contents, and development:

Morozova, M. 2012. Shënime për standardin morfologjik të Korpusit nacional të shqipes [Notes on the morphological standard of the Albanian National Corpus]. In: Rugova, B. ed. Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 13–27 gusht 2012. Vëll. 31/1. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, pp. 153–156.

Arkhangelskij, T., Daniel, M., Morozova, M. and Rusakov, A. 2012. Korpusi i gjuhës shqipe: drejtimet kryesore të punës [Albanian Corpus: Main directions of work]. In: Ismajli, R. ed. Shqipja dhe gjuhët e Ballkanit. Albanian and Balkan Languages. Konferencë e mbajtur më 10–11 dhjetor 2011 në Prishtinë. Prishtinë: ASHAK, pp. 635–642.

Rusakov, A. and Morozova, M. 2012. Korpusi i gjuhës shqipe: problemet dhe rezultatet [Albanian Corpus: Problems and results]. In: Rugova, B. ed. Studime për nder të Rexhep Ismajlit me rastin e 65-vjetorit të lindjes. Prishtinë: Koha, pp. 639–649.

Morozova, M., Rusakov, A. and Domosiletskaya, M. 2013. Albanskaya imennaya morfologiya v korpusnom predstavlenii: Natsionalnyj korpus albanskogo yazyka [The Albanian noun morphology in the morphological model developed for the Albanian National Corpus]. In: Girfanova, A. Kh., Domosiletskaya, M. V., Zhugra, A. V., Kazansky, N. N., Rusakov, A. Yu., Sukhachev, N. L. eds. Albanskaya filologiya, balkanistika, problemy yazykoznaniya. K 100-letiyu so dnya rozhdeniya chlena-korrespondenta RAN Agnii Vasiliyevny Desnitskoj. Saint Petersburg: Nauka, pp. 120–130.

Morozova, M. and Rusakov, A. 2014. Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi [Electronic corpus of Albanian: Development, contents and use]. In: Rugova, B. ed. Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 19–30.08.2013. Vëll. 33/1. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, pp. 85–96.

Morozova, M. and Rusakov, A. 2015. Albanian National Corpus: Composition, Text Processing and Corpus-Oriented Grammar Development. In: Demiraj, B. ed. Sprache und Kultur der Albaner. Zeitliche und räumliche Dimensionen. Akten der 5. Deutsch-albanischen kulturwissenschaftlichen Tagung (5.–8. Juni 2014, Buçimas bei Pogradec, Albanien). Wiesbaden: Harrassowitz Verlag, pp. 270–308. (Albanische Forschungen, 37).

Morozova, M., Arkhangelskiy, T., Daniel, M. and Rusakov, A. 2016. Albanskij natsionalnyj korpus: osnovnye napravleniya raboty [Albanian National Corpus: Main directions of work]. Acta Linguistica Petropolitana. Trudy Instituta lingvisticheskikh issledovanij RAN. Vol. XII, Part 3, 169–189.

Morozova, M. and Rusakov, A. 2020. The early Albanian texts in an annotated language corpus: An attempt of processing and analysis. In: Demiraj, B. ed. Altalbanische Schriftkultur - aus der Perspektive der historischen Lexikographie und der Philologie der Gegenwart - Akten der 6. deutsch-albanischen kulturwissenschaftlichen Tagung (27. September 2019, Buçimas bei Pogradec, Albanien). Wiesbaden: Harrassowitz Verlag, pp. 91–102. (Albanische Forschungen, 44).

Morozova, M. and Rusakov, A. 2023. Struktura fjalëformuese e leksikut të shqipes (rreth anotimit të Korpusit elektronik të gjuhës shqipe) [Derivational structure of Albanian lexicon (towards the annotation of the Albanian corpus)]. In: Ismajli, R. ed. Shqipja standarde sot: konferencë shkencore me rastin e 50-vjetorit të Kongresit të Drejtshkrimit. Prishtinë: Akademia e Shkencave dhe e Arteve e Kosovës, pp. 233–242.

Rusakov, A. and Morozova, M. 2025. Albanian classical poetry in the Albanian National Corpus: Between linguistics and philology. In: Omari, A., Sinani, Sh., Hala, L. eds. Studime albanologjike në indoeuropianistike, filologji dhe gjuhësi kontakti. Vëllim në nderim të prof. Bardhyl Demirajt. Tirana: Akademia e shkencave e Shqipërisë, pp. 182–197.

How to cite the Corpus

If you use information and examples from the Albanian National Corpus in your research, please cite it as follows:

Morozova, Maria, Alexander Rusakov & Timofey Arkhangelskiy. Albanian National Corpus. URL: albanian.web-corpora.net (accessed on .)

Grammatical tags

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of grammatical tags used in the Albanian corpora. Both corpora use the same set of tags.

  • A — adjective
  • ADV — adverb
  • ART — article
  • CLIT_PRO — pronominal clitic
  • CONJ — conjunction
  • FORM — formative suffix
  • INTJ — interjection
  • NOUN — noun
  • NUM — numeral
  • PART — particle
  • PR — preposition
  • PRO — pronoun
  • V — verb
  • VB_PART — verbal particle
  • 1 — 1st person
  • 2 — 2nd person
  • 3 — 3rd person
  • abl — ablative
  • abl2 — ablative 2
  • acc — accusative
  • acc_1pl — accusative clitic 1st person plural
  • acc_1sg — accusative clitic 1st person singular
  • acc_2pl — accusative clitic 2nd person plural
  • acc_2sg — accusative clitic 2nd person singular
  • acc_3pl — accusative clitic 3rd person plural
  • acc_3sg — accusative clitic 3rd person singular
  • act — active voice
  • adm — admirative mood
  • anim — animate noun
  • aor — aorist
  • cont — contact position of article
  • dat_1pl — dative clitic 1st person plural
  • dat_1sg — dative clitic 1st person singular
  • dat_2pl — dative clitic 2nd person plural
  • dat_2sg — dative clitic 2nd person singular
  • dat_3pl — dative clitic 3rd person plural
  • dat_3sg — dative clitic 3rd person singular
  • def — definite form
  • dist — distant position of article
  • f — feminine gender
  • gen_dat — genitive-dative
  • Gheg — Gheg lemma or word
  • imp — imperative mood
  • inanim — inanimate noun
  • ind — indicative mood
  • indef — indefinite form
  • ipf — imperfect
  • loc — locative
  • m — masculine gender
  • med — medial verb
  • mf — ambigenous noun
  • n — neuter gender
  • nom — nominative
  • nonst — non-standard lemma or word
  • opt — optative mood
  • pass — non-active voice
  • pl — plural
  • pres — present tense
  • ptcp — participle
  • sbjv — subjunctive mood
  • sg — singular
  • unmkd — unmarked case
  • vi — intransitive verb
  • vt — transitive verb
  • with_abl — preposition governs ablative
  • with_acc — preposition governs accusative
  • with_loc — preposition governs locative
  • with_nom — preposition governs nominative
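
As an illustration of how these tags can be read together, the sketch below expands a tag string into human-readable glosses. It rests on assumptions: the comma-separated tag string, the grouping of tags into categories such as "case" or "mood", and the names `TAGS_BY_CATEGORY` and `describe` are hypothetical and do not describe the Corpus's export format. Only a subset of the tags listed above is covered; the glosses are copied from the list.

```python
# Hypothetical grouping of a subset of the tags listed above.
TAGS_BY_CATEGORY = {
    "part of speech": {
        "A": "adjective", "ADV": "adverb", "ART": "article",
        "CONJ": "conjunction", "NOUN": "noun", "NUM": "numeral",
        "PR": "preposition", "PRO": "pronoun", "V": "verb",
    },
    "person": {"1": "1st person", "2": "2nd person", "3": "3rd person"},
    "number": {"sg": "singular", "pl": "plural"},
    "gender": {"m": "masculine gender", "f": "feminine gender",
               "n": "neuter gender"},
    "case": {"nom": "nominative", "acc": "accusative",
             "gen_dat": "genitive-dative", "abl": "ablative",
             "loc": "locative"},
    "definiteness": {"def": "definite form", "indef": "indefinite form"},
    "mood": {"ind": "indicative mood", "sbjv": "subjunctive mood",
             "imp": "imperative mood", "opt": "optative mood",
             "adm": "admirative mood"},
    "tense": {"pres": "present tense", "ipf": "imperfect", "aor": "aorist"},
    "voice": {"act": "active voice", "pass": "non-active voice"},
}

# Flat lookup: tag -> (category, gloss).
TAG_INDEX = {
    tag: (category, gloss)
    for category, tags in TAGS_BY_CATEGORY.items()
    for tag, gloss in tags.items()
}


def describe(tag_string: str, sep: str = ",") -> dict:
    """Map each tag in a separator-delimited string to its gloss."""
    result = {}
    for raw in tag_string.split(sep):
        tag = raw.strip()
        if tag not in TAG_INDEX:
            raise ValueError(f"tag not covered by this sketch: {tag!r}")
        category, gloss = TAG_INDEX[tag]
        result[category] = gloss
    return result


# Hypothetical analysis string for a definite feminine noun form.
for category, gloss in describe("NOUN,f,sg,nom,def").items():
    print(f"{category}: {gloss}")
```

Tags outside the subset raise a `ValueError` rather than being silently dropped, so gaps in the table are easy to spot.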

Etymological and derivational tags

The following tags denote the origin of most analyzed Albanian words, which can be either inherited ("alb") or borrowed from various languages, whether long ago or fairly recently ("new"). Derivatives of both borrowed and inherited words are marked as "deriv".

  • alb — Albanian
  • Agreek — Ancient Greek
  • arum — Aromanian
  • borr — borrowed
  • en — English
  • fr — French
  • germ — German
  • greek — Byzantine and Modern Greek
  • hung — Hungarian
  • inter — international
  • it — Italian
  • lat — Latin
  • rom — Romance
  • rum — Romanian
  • rus — Russian
  • slav — Slavic
  • turk — Turkish
  • unknwn — unknown
  • new — new
  • deriv — derived
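
The sketch below shows one way these tags could be combined to profile a word list by origin. It is an assumption-laden illustration: how tags co-occur on a single lemma (for example "borr" together with a language tag), the `classify` function, and the placeholder lemmas are hypothetical and not taken from the Corpus.

```python
from collections import Counter

# Source-language tags and glosses, copied from the list above.
SOURCE_LANGUAGES = {
    "Agreek": "Ancient Greek", "arum": "Aromanian", "en": "English",
    "fr": "French", "germ": "German", "greek": "Byzantine and Modern Greek",
    "hung": "Hungarian", "inter": "international", "it": "Italian",
    "lat": "Latin", "rom": "Romance", "rum": "Romanian", "rus": "Russian",
    "slav": "Slavic", "turk": "Turkish",
}


def classify(tags) -> dict:
    """Summarize the etymological tags of one lemma.

    Assumption: a lemma carries a set of tags, e.g. {"borr", "turk"}.
    """
    tags = set(tags)
    sources = sorted(SOURCE_LANGUAGES[t] for t in tags if t in SOURCE_LANGUAGES)
    if "alb" in tags:
        origin = "inherited"
    elif sources or "borr" in tags:
        origin = "borrowed"
    else:
        origin = "unknown"
    return {
        "origin": origin,
        "sources": sources,
        "recent": "new" in tags,      # fairly recent borrowing
        "derived": "deriv" in tags,   # derivative of another word
    }


# Placeholder lemmas with hypothetical tag sets (not real corpus data).
lemmas = {
    "lemma_a": ["alb"],
    "lemma_b": ["borr", "turk"],
    "lemma_c": ["borr", "en", "new"],
    "lemma_d": ["alb", "deriv"],
    "lemma_e": ["unknwn"],
}

profile = Counter(classify(tags)["origin"] for tags in lemmas.values())
for origin, count in profile.items():
    print(origin, count)
```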

Authors

The first edition of the Corpus was developed in 2011 through the joint efforts of a team of linguists from Saint Petersburg (Institute for Linguistic Studies of the Russian Academy of Sciences) and Moscow (the School of Linguistics at HSE). This version used the search engine of the Eastern Armenian National Corpus (EANC).

The main ideas for the development of the corpus representation of Albanian grammar, the metatext and grammatical annotation belong to Maria Morozova and Alexander Rusakov. Timofey Arkhangelskiy and Mikhail Daniel took part in the discussion of these issues.

The current version of ANC was created by Maria Morozova, Alexander Rusakov and Timofey Arkhangelskiy and opened in 2016. It uses the morphological analyzer and the tsakorpus platform developed by Timofey Arkhangelskiy.

The following people were involved in the development of the Corpus at different stages:

  • Maria Morozova, Saint Petersburg (corpus representation of Albanian grammar, ideas for annotation, grammatical description of the inflection for the parser, compilation of the wordlist for the parser, metatext markup, etymological markup, derivational markup, disambiguation rules, text collection and processing, corpus webpage)
  • Alexander Rusakov, Saint Petersburg — Tirana (corpus representation of Albanian grammar, ideas for annotation, collection and processing of texts)
  • Timofey Arkhangelskiy, Moscow — Hamburg — Freising (development of the parser, the corpus platform, and the user interface, technical support)
  • Marina Domosiletskaya, Saint Petersburg (compilation of the wordlist for the parser — nouns)
  • Anna Konovalenko, Saint Petersburg (compilation of the wordlist for the parser — adverbs)
  • Anastasia Sidko, Saint Petersburg (compilation of the wordlist for the parser — nouns, adjectives, verbs)
  • Daria Alekseeva, Saint Petersburg (collection and processing of texts)
  • Elizabeth Atakova, Saint Petersburg (collection and processing of texts)
  • Varvara Diveeva, Saint Petersburg (collection and processing of texts)
  • Maxim Makartsev, Moscow — Oldenburg (providing texts for the corpus)
  • Besim Kabashi, Munich (providing texts for the corpus)
  • Qerim Ondozi, Prishtina (collection and processing of texts)

The first version of the Albanian National Corpus was created with the financial support of the “Corpus linguistics” Program by the Presidium of the Russian Academy of Sciences. The Corpus team is grateful to the Academy of Sciences of Albania for their interest in the project, assistance, and willingness to cooperate. We also thank the publishing house “Onufri” (Tirana) for their help in the selection of texts.

The current version of the Corpus is hosted by the Institute for Linguistic Studies of the Russian Academy of Sciences.

Now the Corpus is maintained and developed by:

  • Maria Morozova (Institute for Linguistic Studies of the Russian Academy of Sciences), morozovamaria86@gmail.com
  • Alexander Rusakov (Institute of Linguistics and Literature of the Academy of Sciences of Albania / Institute for Linguistic Studies of the Russian Academy of Sciences), ayurusakov@gmail.com
  • Timofey Arkhangelskiy (Universität Hamburg), timarkh@gmail.com

Contacts

If you have questions, would like to propose collaboration, or noticed an error in the corpus, please contact us. You can also use the Albanian morphological analyzer and the tsakorpus corpus platform, which are open source and freely available.
