Albanian National Corpus
This is the main page of the Albanian National Corpus (ANC) website. The Corpus currently contains 31.12 million words.
The Corpus is designed for people interested in various issues related to the Albanian language. It can provide reference data both for professional linguists and for those who are interested in Albanian and its history because of their occupation or out of mere curiosity. The material collected in the Corpus can be used for studies of the Albanian lexicon and grammar, as well as for investigating the language changes that took place in Albanian over previous centuries.
From 2012 to the present, the Albanian National Corpus has included primarily original prose texts (fiction, non-fiction, press) representing the Albanian literary language since the 1970s, as well as translated texts and Albanian poetry. The project aims at the widest possible coverage, including not only modern written texts of various genres but also texts created in different periods of the history of Albanian, as well as colloquial and dialectal texts representing non-literary forms of the modern language. The Corpus of early Albanian texts was opened in September 2019 and continues to develop.
The texts of ANC have user-friendly metadata descriptions and morphological annotation consisting of tags assigned to words.
Annotation
- automatic morphological annotation (lemmatization, part of speech, all inflectional features); 93.9% of words are analyzed (only tokens that do not contain digits or characters from other alphabets are taken into account)
- no disambiguation at the current stage
- glossing
- English translation of lemmata
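The coverage figure and the absence of disambiguation can be illustrated with a small sketch. It is only an approximation under stated assumptions: the token filter (basic Latin letters plus ç and ë, with optional hyphens or apostrophes), the field names `lemma`, `pos` and `trans_en`, and the tag labels are hypothetical and do not reproduce the Corpus's actual rules or tagset.

```python
import re

# Assumption: a token "counts" if it consists only of basic Latin letters plus
# ç and ë, optionally joined by hyphens or apostrophes. The Corpus's actual
# filtering rule may differ.
COUNTABLE = re.compile(r"[a-zçë]+(?:['’-][a-zçë]+)*", re.IGNORECASE)


def is_countable(wordform: str) -> bool:
    """True if the token has no digits and no characters from other alphabets."""
    return COUNTABLE.fullmatch(wordform) is not None


def coverage(tokens) -> float:
    """Share of countable tokens that received at least one analysis."""
    countable = [analyses for wordform, analyses in tokens if is_countable(wordform)]
    if not countable:
        return 0.0
    return sum(1 for analyses in countable if analyses) / len(countable)


# Hypothetical tokens; tag names such as "N" or "PTCL" are illustrative only.
sample = [
    ("shtëpi", [{"lemma": "shtëpi", "pos": "N", "trans_en": "house"}]),
    # Without disambiguation, an ambiguous form keeps all of its analyses.
    ("të", [{"lemma": "të", "pos": "PTCL"}, {"lemma": "të", "pos": "ART"}]),
    ("xyzq", []),      # countable, but no analysis found
    ("2019", []),      # contains digits: excluded from the statistic
    ("Москва", []),    # other alphabet: excluded from the statistic
]

print(f"{coverage(sample):.1%}")
```

In this sketch, tokens with digits or non-Latin characters are dropped before the ratio is computed, so they neither raise nor lower the share of analyzed words.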
Metadata
- title of the text
- author or title of the newspaper
- creation year (exact date in the case of newspapers)
- genre / type of text
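As a rough illustration of how these metadata fields could be used to build a subcorpus, the sketch below models one text description and filters a list of texts by genre and year. The class name `TextMeta`, its field names, and the sample records are assumptions made for the example; they are not the Corpus's actual data format or content.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TextMeta:
    title: str
    author: str                        # author, or newspaper title for press texts
    year: int                          # creation year
    genre: str                         # genre / type of text
    issue_date: Optional[date] = None  # exact date, newspapers only


def select(texts: Iterable[TextMeta], genre: Optional[str] = None,
           year_from: Optional[int] = None,
           year_to: Optional[int] = None) -> List[TextMeta]:
    """Keep the texts matching the given genre and inclusive year range."""
    result = []
    for t in texts:
        if genre is not None and t.genre != genre:
            continue
        if year_from is not None and t.year < year_from:
            continue
        if year_to is not None and t.year > year_to:
            continue
        result.append(t)
    return result


# Placeholder records, not actual Corpus entries.
texts = [
    TextMeta("Sample novel", "Author A", 1985, "fiction"),
    TextMeta("Sample article", "Newspaper B", 2011, "press", date(2011, 3, 14)),
    TextMeta("Sample essay", "Author C", 2003, "nonfiction"),
]

print([t.title for t in select(texts, year_from=2000)])
```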
About corpora
Currently, two corpora are available: the Corpus of modern literary Albanian («Main corpus») and the Corpus of early Albanian texts. They differ in the kinds of texts they contain and in how these texts are presented, but have mostly identical annotation and search capabilities. Here are the main characteristics of the corpora:
Main corpus
The Main corpus contains 31.12 million words and consists of the following subcorpora:
| | Press | Fiction | Nonfiction | Poetry |
|---|---|---|---|---|
| Size | 23.4 million words (75.2% of the Main Corpus) | 3.2 million words (10.3% of the Main Corpus) | 4.3 million words (13.8% of the Main Corpus) | 0.2 million words (0.7% of the Main Corpus) |
| Texts | | authentic and translated fiction texts: novels, novellas, short stories, plays, books for children (Albania, Kosovo, North Macedonia, 1960–2010s) | | Albanian poetry of the first half of the 20th century (works of Asdreni, A.Z. Çajupi, Migjeni, F. Noli, L. Poradeci); poetry of Albania, Kosovo and North Macedonia, 1950–2010s |
| Language variety | In most cases, standard written Albanian or close to it. Interviews and quotations may contain examples of dialectal speech. | In most cases, standard written Albanian or close to it. The speech of the characters may contain examples of dialectal speech or dialect stylization. | In most cases, standard written Albanian or close to it. Texts created before the 1950s may use earlier variants of the literary language. | In most cases, standard written Albanian or close to it. Poetic texts created before the 1950s may use earlier variants of the literary language. |
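The subcorpus shares can be checked against the stated sizes with a few lines of arithmetic. This is only a sanity check on the rounded figures given above; the variable names are illustrative.

```python
# Sizes in millions of words, as stated for the Main corpus.
sizes = {"Press": 23.4, "Fiction": 3.2, "Nonfiction": 4.3, "Poetry": 0.2}
TOTAL = 31.12  # stated size of the Main corpus, in millions of words

# Share of each subcorpus in percent, rounded to one decimal place.
shares = {name: round(100 * size / TOTAL, 1) for name, size in sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Because the sizes are themselves rounded to 0.1 million words, a recomputed share can differ from the published one in the last digit. This affects Poetry in particular, the smallest subcorpus, where rounding the size has the largest relative effect.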
Corpus of early Albanian texts
This Corpus is intended to include the main monuments of Albanian writing: the works by Gj. Buzuk, L. Matranga, P. Budi, F. Bardhi, P. Bogdani, P. Mazreku, the Kanun of Arbëria, etc.
To date, the text of the «Christian Doctrine» by Luka Matranga (1592) has been included in the Corpus of early Albanian texts. Matranga’s book is the second largest text written in the Albanian language, and the first in the Tosk dialect.
The text used in the Corpus is based on the so-called manuscript A. We used the transcription by M. de Vaan, which is based on M. Sciambra, La «Dottrina cristiana» albanese di Luca Matranga (1964), and is available at TITUS. We are grateful to M. de Vaan for providing us with the text version of the transcription. During text processing, the transcription was checked against other editions of the book, including phototype ones.
Publications
In our publications you may find more information about the Albanian National Corpus, its contents, and development:
Morozova, Maria. Shënime për standardin morfologjik të Korpusit nacional të shqipes [Notes on the morphological standard of the Albanian National Corpus] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 13–27 gusht 2012 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2012. 31/1. F. 153–156.
Arkhangelskij, Timofej, Mikhail Daniel, Maria Morozova, Aleksandër Rusakov. Korpusi i gjuhës shqipe: drejtimet kryesore të punës [Albanian Corpus: Main directions of work] // Shqipja dhe gjuhët e Ballkanit. Albanian and Balkan Languages. Konferencë e mbajtur më 10–11 dhjetor 2011 në Prishtinë / red. R. Ismajli. Prishtinë: ASHAK, 2012. F. 635–642.
Rusakov, Aleksandër, Maria Morozova. Korpusi i gjuhës shqipe: problemet dhe rezultatet [Albanian Corpus: Problems and results] // Studime për nder të Rexhep Ismajlit me rastin e 65-vjetorit të lindjes / red. B. Rugova. Prishtinë: Koha, 2012. F. 639–649.
Morozova, Maria, Alexander Rusakov, Marina Domosiletskaya. Albanskaya imennaya morfologiya v korpusnom predstavlenii: Natsionalnyj korpus albanskogo yazyka [The Albanian noun morphology in the morphological model developed for the Albanian National Corpus] // Albanskaya filologiya, balkanistika, problemy yazykoznaniya. K 100-letiyu so dnya rozhdeniya chlena-korrespondenta RAN Agnii Vasiliyevny Desnitskoj / red. A. Kh. Girfanova, M. V. Domosiletskaya, A. V. Zhugra, N. N. Kazansky, A. Yu. Rusakov, N. L. Sukhachev. SPb: Nauka, 2013. S. 120–130.
Morozova, Maria, Alexander Rusakov. Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi [Electronic corpus of Albanian: Development, contents and use] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 19–30.08.2013 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2014. 33/1. F. 85–96.
Morozova, Maria, Alexander Rusakov. Albanian National Corpus: Composition, Text Processing and Corpus-Oriented Grammar Development // Sprache und Kultur der Albaner. Zeitliche und räumliche Dimensionen. Akten der 5. Deutsch-albanischen kulturwissenschaftlichen Tagung (5.–8. Juni 2014, Buçimas bei Pogradec, Albanien) / Hrsg. von B. Demiraj. Wiesbaden: Harrassowitz Verlag, 2015. (Albanische Forschungen, 37). S. 270–308.
Morozova, Maria, Timofey Arkhangelsky, Mikhail Daniel, Alexander Rusakov. Albanskij natsionalnyj korpus: osnovnye napravleniya raboty [Albanian National Corpus: Main directions of work] // Acta Linguistica Petropolitana. Trudy Instituta lingvisticheskikh issledovanij RAN / otv. red. N. N. Kazansky. SPb: Nauka, 2016. T. XII, Ch. 3. S. 169–189.
How to cite the Corpus
If you use information and examples from the Albanian National Corpus in your research, please cite it as follows:
Maria Morozova, Alexander Rusakov, Timofey Arkhangelskiy. Albanian National Corpus. (Available online at: albanian.web-corpora.net, accessed on [date].)
Authors
The Corpus was developed through the joint efforts of linguists from Saint Petersburg (Institute for Linguistic Studies of the Russian Academy of Sciences) and Moscow (the School of Linguistics at HSE).
The previous version used the search engine of the Eastern Armenian National Corpus (EANC). The current version of ANC uses the morphological analyzer and the tsakorpus platform developed by Timofey Arkhangelsky.
The main ideas for the development of the corpus representation of Albanian grammar, the metatext and grammatical annotation belong to Maria Morozova and Alexander Rusakov. Timofey Arkhangelsky and Mikhail Daniel took part in the discussion of these issues.
The following people were involved in the development of the Corpus:
- Maria Morozova, Saint Petersburg (corpus representation of Albanian grammar, grammatical annotation, metatext markup, corpus wordlist, text collection and processing)
- Alexander Rusakov, Saint Petersburg (corpus representation of Albanian grammar, collection and processing of texts)
- Timofey Arkhangelsky, Moscow — Hamburg (development of the morphological analyzer, the corpus platform, and the user interface, technical support)
- Marina Domosiletskaya, Saint Petersburg (compilation of the corpus wordlist — nouns)
- Anna Konovalenko, Saint Petersburg (compilation of the corpus wordlist — adverbs)
- Anastasia Sidko, Saint Petersburg (compilation of the corpus wordlist — nouns, adjectives, verbs)
- Daria Alekseeva, Saint Petersburg (collection and processing of texts)
- Elizabeth Atakova, Saint Petersburg (collection and processing of texts)
- Varvara Diveeva, Saint Petersburg (collection and processing of texts)
- Maxim Makartsev, Moscow — Oldenburg (providing texts for the corpus)
- Besim Kabashi, Munich (providing texts for the corpus)
- Qerim Ondozi, Prishtina (collection and processing of texts)
The Corpus team is grateful to the publishing house «Onufri» (Tirana) for its help in selecting texts. The first version of the Albanian National Corpus was created with the financial support of the «Corpus linguistics» Program of the Presidium of the Russian Academy of Sciences. The current version of the Corpus is hosted by the School of Linguistics at HSE.
The Corpus is currently maintained and developed by:
Maria Morozova (ILS RAS, SPbSU), morozovamaria86@gmail.com
Alexander Rusakov (ILS RAS, SPbSU), ayurusakov@gmail.com
Timofey Arkhangelsky (Universität Hamburg), timarkh@gmail.com