Version 4 (modified by hales, 9 years ago)

D3.3a: A Tigrinya corpus, sized 3 million words

Building the Tigrinya web corpus

We used the following steps to build a large Tigrinya web corpus. First, following the Corpus Factory method [1], bigrams of Tigrinya words from the Crúbadán database [2] were used to query the Bing search engine for documents in Tigrinya. The URLs of the 9,034 documents found by the search engine were then used as starting points for the web crawler SpiderLing [3].
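As a rough illustration of the query step, the sketch below turns word bigrams into quoted phrase queries and URL-encodes them. The function names and the search endpoint are hypothetical placeholders; this is not the Corpus Factory code and not a documented Bing API call.

```python
from urllib.parse import quote_plus


def bigram_queries(bigrams):
    """Turn (word1, word2) pairs into quoted phrase queries."""
    return ['"{} {}"'.format(first, second) for first, second in bigrams]


def query_url(base_url, phrase):
    """Build a GET URL for a phrase query.

    base_url is a hypothetical search endpoint used for illustration only.
    """
    return base_url + "?q=" + quote_plus(phrase)


if __name__ == "__main__":
    # Placeholder tokens; a real run would use bigrams from the checked list.
    for phrase in bigram_queries([("word1", "word2"), ("word3", "word4")]):
        print(query_url("https://search.example/search", phrase))
```

The URLs returned for such queries would then be collected and handed to the crawler as seeds.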

The following language models and resources were created:

Character trigram model for language detection. The model was trained on 538 KB of manually checked text in the Ge'ez script, taken from documents found by the search engine.
Byte trigram model for character encoding detection. The model was trained on web pages obtained by the Corpus Factory method.
Wordlist of the 349 most frequent Tigrinya words, taken from the manually checked word bigrams from the Crúbadán database. The boilerplate removal tool jusText [4] uses it to check the language of running text.
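To make the idea of a character trigram model concrete, here is a minimal sketch of trigram-based language identification. It is an assumption-laden illustration, not the model used in the crawl: the class name, the add-one smoothing, and the average log-probability score are all choices made for this example.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return all overlapping character trigrams of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramModel:
    """Character trigram counts with add-one smoothing (illustrative only)."""

    def __init__(self, training_text):
        self.counts = Counter(char_trigrams(training_text))
        self.total = sum(self.counts.values())
        # One extra slot stands in for all unseen trigrams.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, text):
        """Average smoothed log-probability per trigram of text."""
        grams = char_trigrams(text)
        if not grams:
            return float("-inf")
        denom = self.total + self.vocab
        score = sum(math.log((self.counts[g] + 1) / denom) for g in grams)
        return score / len(grams)


def best_language(models, text):
    """Pick the name of the model that scores text highest."""
    return max(models, key=lambda name: models[name].avg_log_prob(text))
```

In use, one model would be trained per candidate language, and a paragraph would be kept as Tigrinya only if the Tigrinya model scores it highest. A byte trigram model for encoding detection could follow the same pattern over `bytes` instead of `str`.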

The crawler was set to harvest web domains in the national top-level domains of Ethiopia, Eritrea, Somalia, and Djibouti (et, er, so, dj) and in general TLDs (com, org, net, gov, info, edu). In the process, 42 GB of HTTP responses were gathered. jusText removed HTML tags and boilerplate paragraphs from the raw data. Tigrinya texts were then separated out using the character trigram language model.
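The TLD restriction can be pictured as a simple scope check on each URL. The sketch below is a hypothetical helper written for illustration; it does not reproduce SpiderLing's configuration or code.

```python
from urllib.parse import urlsplit

# TLDs named in the text: four national TLDs plus six general ones.
ALLOWED_TLDS = {"et", "er", "so", "dj", "com", "org", "net", "gov", "info", "edu"}


def in_crawl_scope(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname
    if not host:
        return False
    # Take the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in allowed
```

A crawler would apply such a check before queueing a newly discovered link, so that links leading outside the listed TLDs are dropped.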

Duplicate and near-duplicate paragraphs were identified and removed using the onion tool [4]. The final corpus is 39 MB in size and contains 2.5 million tokens. It is called tiWaC16 (Tigrinya 'Web as Corpus' corpus, year 2016).
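Paragraph-level near-duplicate removal can be sketched with word n-gram overlap: a paragraph is dropped when too many of its n-grams have already been seen. This is a simplified illustration and not onion's implementation; the n-gram length of 5 and the threshold of 0.5 are assumptions chosen for the example.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams; short inputs yield a single tuple."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is <= threshold."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph.split(), n)
        if not grams:
            continue  # skip empty paragraphs
        duplicated = len(grams & seen) / len(grams)
        if duplicated <= threshold:
            kept.append(paragraph)
        # Record n-grams of every paragraph, kept or not (a design choice).
        seen |= grams
    return kept
```

With these settings, an exact repeat and a paragraph that differs from an earlier one only in its last word would both be removed, while unrelated paragraphs are kept.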

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count	1,907
Paragraph count	28,552
Sentence count	139,357
Token count	2,531,443
Ge'ez script lexicon size	225,132
Sera script lexicon size	220,935

Document count by top-level domain, the most frequent web domains, and the second-level domain size distribution:

Top-level domain	Documents
org	1,023
com	699
net	55

Web domain	Documents
*.blogspot.com	349
*.jw.org	307
tewahdo.org	174
harnnet.org	116
eritreantewahdo.org	97
mekaleh-eritra.org	78
mahberemariamisrael.com	76
asmarino.com	76
fnoteatnatiewos.com	46
erena.org	41
forumeritrea.org	38
dehnet.org	32

Second-level domain size	Domains
At least 1000 documents	0
At least 500 documents	0
At least 100 documents	4
At least 50 documents	8
At least 10 documents	28
At least 5 documents	42
At least 1 document	129
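The size distribution is a cumulative count: for each threshold, the number of domains with at least that many documents. The sketch below recomputes it from the twelve domains listed above; the function name is hypothetical. Because only the most frequent domains are listed, only the rows for 50 documents and above can be reproduced this way.

```python
def size_distribution(domain_counts, thresholds=(1000, 500, 100, 50, 10, 5, 1)):
    """Count domains holding at least t documents, for each threshold t."""
    return {
        t: sum(1 for count in domain_counts.values() if count >= t)
        for t in thresholds
    }


# Document counts of the twelve most frequent web domains from the table.
TOP_DOMAINS = {
    "*.blogspot.com": 349, "*.jw.org": 307, "tewahdo.org": 174,
    "harnnet.org": 116, "eritreantewahdo.org": 97, "mekaleh-eritra.org": 78,
    "mahberemariamisrael.com": 76, "asmarino.com": 76,
    "fnoteatnatiewos.com": 46, "erena.org": 41, "forumeritrea.org": 38,
    "dehnet.org": 32,
}

if __name__ == "__main__":
    for threshold, domains in size_distribution(TOP_DOMAINS).items():
        print("At least {} documents: {}".format(threshold, domains))
```

For the listed domains this gives 4 domains with at least 100 documents and 8 with at least 50, matching the table; the lower rows need the full per-domain counts, which are not given here.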

Since the corpus is small, the variety of domains is also limited. Political, religious, and blog sites account for a significant share of the corpus sources.

Corpus query interface

The corpus has been indexed by the corpus manager and query system Sketch Engine [5]. It can be searched at http://corpora.fi.muni.cz/habit/.

References

[1] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
[2] -- Scannell, Kevin P. "The Crúbadán Project: Corpus building for under-resourced languages." In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, vol. 4, pp. 5-15. 2007.
[3] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
[4] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Doctoral dissertation, Masaryk University, Faculty of Informatics (2011).
[5] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
