Changes between Version 5 and Version 6 of CorporaAndCorpusBuilding


Timestamp: Feb 1, 2016, 2:56:41 PM
Author: xsuchom2
Comment: Resources Available at the Beginning of the Project

For web domains rich in text documents, it is worth analysing the structure of their content, since doing so may increase the amount of harvested data. For example, a website built with a content management system might offer a site map containing the URLs of all documents on the site, or the documents might be identified by sequential numbers. In such cases, one can develop a script tailored to downloading from that particular site, which is more efficient than a general web crawler. After the web crawl is done, an analysis will be carried out to identify web domains that allow this semi-automated approach to obtaining data. That will lead to larger corpora for languages with a scarce presence on the internet.
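
The two cases mentioned above (a site map listing all documents, and documents numbered sequentially) could be handled roughly as follows. This is only a minimal sketch: it assumes the site map follows the standard sitemaps.org XML format, and the function names, the host `news.example` and the URL template are hypothetical placeholders, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by standard sitemaps.org site maps (assumed format).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL found in a site map given as an XML string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_from_id_range(template: str, first_id: int, last_id: int) -> List[str]:
    """Build URLs for documents numbered first_id..last_id (inclusive).

    `template` is a hypothetical URL pattern containing an `{id}` placeholder.
    """
    return [template.format(id=i) for i in range(first_id, last_id + 1)]


# Hypothetical site map content, inlined so the sketch runs without network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://news.example/article/1</loc></url>
  <url><loc>http://news.example/article/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
    for url in urls_from_id_range("http://news.example/article/{id}", 3, 5):
        print(url)
```

The resulting URL list would then be passed to whatever downloader is used for the site. Fetching the site map itself, rate limiting and error handling are deliberately left out of the sketch.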

= Resources Available at the Beginning of the Project =
== Corpora Built by Text Laboratory, University of Oslo ==
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic Amharic WIC corpus], 200 thousand tokens. News from Walta Information Center, manually tagged.
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=oromo Oromo spoken corpus], 7,500 tokens in 1,205 utterances.

== Corpora & Wordlists at crubadan.org ==
* Amharic: http://crubadan.org/languages/am
* Oromo: http://crubadan.org/languages/om
* Somali: http://crubadan.org/languages/so
* Tigrinya: http://crubadan.org/languages/ti

== Wikipedia articles ==
* Amharic: https://am.wikipedia.org/wiki/ (~13,000 articles)
* Oromo: https://om.wikipedia.org/wiki/ (~680 articles)
* Somali: https://so.wikipedia.org/wiki/ (~3,700 articles)
* Tigrinya: https://ti.wikipedia.org/wiki/ (~160 articles)