Changes between Version 5 and Version 6 of CorporaAndCorpusBuilding
- Timestamp:
- Feb 1, 2016, 2:56:41 PM (9 years ago)
CorporaAndCorpusBuilding
Web domains rich in text documents are worth analysing for the structure of their content, since that might increase the amount of harvested data. For example, a website built with a content management system might offer a site map containing the URLs of all documents within the site. Or there can be a sequence of numbers assigned to all documents on the site. In such cases, one can develop a script tailored to downloading from that particular website, reaching a higher efficiency than a general web crawler. An analysis will be carried out to identify web domains allowing such a semi-automated approach to obtaining data after the web crawl is done. That will lead to larger corpora for languages with a scarce presence on the internet.

= Resources Available at the Beginning of the Project =
== Corpora Built by Text Laboratory, University of Oslo ==
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic Amharic WIC corpus], 200 thousand tokens. News from Walta Information Center, manually tagged.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=oromo Oromo spoken corpus], 7,500 tokens in 1,205 utterances.

== Corpora & Wordlists at crubadan.org ==
 * Amharic: http://crubadan.org/languages/am
 * Oromo: http://crubadan.org/languages/om
 * Somali: http://crubadan.org/languages/so
 * Tigrinya: http://crubadan.org/languages/ti

== Wikipedia articles ==
 * Amharic: https://am.wikipedia.org/wiki/ (~13,000 articles)
 * Oromo: https://om.wikipedia.org/wiki/ (~680 articles)
 * Somali: https://so.wikipedia.org/wiki/ (~3,700 articles)
 * Tigrinya: https://ti.wikipedia.org/wiki/ (~160 articles)
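
As an illustration of the site-specific downloading described earlier on this page (site maps and numbered document sequences), the following is a minimal sketch of how candidate URLs could be collected before fetching. It is not the project's actual script: the function names `urls_from_sitemap` and `urls_from_id_range`, the `news.example.org` domain and the URL template are hypothetical, and the site map is assumed to follow the standard sitemaps.org XML format.

```python
import xml.etree.ElementTree as ET

# Namespace used by site maps in the sitemaps.org format (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a site map XML document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_from_id_range(template, first_id, last_id):
    """Build URLs for a site that numbers its documents sequentially.

    `template` is a hypothetical pattern containing an `{id}` placeholder.
    """
    return [template.format(id=i) for i in range(first_id, last_id + 1)]


# Hypothetical site map content, inlined so the sketch runs offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://news.example.org/article/1</loc></url>
  <url><loc>http://news.example.org/article/2</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(SAMPLE_SITEMAP)
numbered_urls = urls_from_id_range("http://news.example.org/article/{id}", 3, 5)
```

The resulting URL lists would then be handed to an ordinary downloader. Because each site is handled on its own, the URL pattern and ID range have to be worked out per domain, which is why the approach is only semi-automated.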