Changes between Version 5 and Version 6 of CorporaAndCorpusBuilding

Timestamp: Feb 1, 2016, 2:56:41 PM
Author: xsuchom2
Comment: Resources Available at the Beginning of the Project

CorporaAndCorpusBuilding

-              v5
+              v6
 Web domains rich in text documents are worth analysing for the structure of their content, since this can increase the amount of harvested data. For example, a website built with a content management system might offer a site map containing the URLs of all documents within the site, or the site might assign sequential numbers to all of its documents. In such cases, one can develop a script tailored to downloading from that particular website, achieving higher efficiency than a general web crawler. After the web crawl is done, an analysis will be carried out to identify web domains that allow such a semi-automated approach to obtaining data. That will lead to larger corpora for languages with a scarce presence on the internet.
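The two site-specific strategies mentioned above (reading a site map, and enumerating sequentially numbered documents) could be sketched roughly as follows. This is only an illustration under stated assumptions: the host `news.example.org`, the URL template and the helper names `urls_from_sitemap` and `urls_from_id_range` are hypothetical, and the site map is assumed to follow the sitemaps.org XML format. Actual downloading, politeness delays and robots.txt handling are left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org-style site maps (assumed format).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the <loc> values listed in a sitemaps.org-style site map."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_from_id_range(template, first_id, last_id):
    """Return candidate URLs for documents numbered first_id..last_id (inclusive)."""
    return [template.format(id=n) for n in range(first_id, last_id + 1)]


# Hypothetical site map content; a real script would fetch it from the site.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://news.example.org/article/101</loc></url>"
    "<url><loc>http://news.example.org/article/102</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
    # Hypothetical URL pattern for a site that numbers its documents.
    for url in urls_from_id_range("http://news.example.org/article/{id}", 1, 3):
        print(url)
```

Either list of URLs can then be handed to an ordinary downloader. Candidate URLs generated from an ID range may not all exist, so a real script would need to tolerate missing pages.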
+= Resources Available at the Beginning of the Project =
+== Corpora Built by Text Laboratory, University of Oslo ==
+* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic Amharic WIC corpus], 200 thousand tokens. News from Walta Information Center, manually tagged.
+* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=oromo Oromo spoken corpus], 7,500 tokens in 1,205 utterances.
+== Corpora & Wordlists at crubadan.org ==
+* Amharic: http://crubadan.org/languages/am
+* Oromo: http://crubadan.org/languages/om
+* Somali: http://crubadan.org/languages/so
+* Tigrinya: http://crubadan.org/languages/ti
+== Wikipedia Articles ==
+* Amharic: https://am.wikipedia.org/wiki/ (~13,000 articles)
+* Oromo: https://om.wikipedia.org/wiki/ (~680 articles)
+* Somali: https://so.wikipedia.org/wiki/ (~3,700 articles)
+* Tigrinya: https://ti.wikipedia.org/wiki/ (~160 articles)