Interim Results of the HaBiT project

Outputs

The second version of the HaBiT system prototype

The prototype is accessible at http://corpora.fi.muni.cz/habit

The system includes selected corpus processing tools and the following HaBiT corpora:

Amharic WIC corpus (News from Walta Information Center), manually tagged.

Amharic web corpus. Crawled by SpiderLing in August 2013 and October 2015. Encoded in UTF-8, cleaned, de-duplicated. Automatically tagged by TreeTagger trained on the Amharic WIC corpus.

Oromo spoken corpus containing 1205 utterances. Built by Text Laboratory, University of Oslo.

Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.

Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.

Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.

Publications

D - conference paper, J - journal paper, R - software

  • D - Vít Baisa, Jane Bradbury, Silvie Cinková, Ismaïl El Maarouf, Adam Kilgarriff, Octavian Popescu. SemEval-2015 Task 15: A CPA dictionary-entry-building task. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, Colorado: Association for Computational Linguistics, 2015. pp. 315-324, 10 pp. ISBN 978-1-941643-40-2. https://is.muni.cz/publication/1308719
  • D - Adam Kilgarriff, Vít Baisa, Miloš Jakubíček, Pavel Rychlý. Longest-commonest Match. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana: Trojina, Institute for Applied Slovene Studies, 2015. pp. 397-404, 8 pp. ISBN 978-961-93594-3-3. https://is.muni.cz/publication/1308616
  • D - Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář, Vít Baisa. Interactive Visualizations of Corpus Data in Sketch Engine. In Gintaré Grigonyté, Simon Clematide, Andrius Utka, Martin Volk. Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015. pp. 17-22, 6 pp. ISBN 978-91-7519-035-8. https://is.muni.cz/publication/1299713
  • D - Adam Rambousek, Aleš Horák. DEBWrite: Free Customizable Web-based Dictionary Writing System. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., 2015. pp. 443-451, 9 pp. ISBN 978-961-93594-3-3. https://is.muni.cz/publication/1308365
  • D - Vít Baisa, Ondřej Herman, Miloš Jakubíček. Towards Automatic Finding of Word Sense Changes in Time. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015. pp. 33-41, 9 pp. ISBN 978-80-263-0974-1. https://is.muni.cz/publication/1318600
  • D - Zuzana Nevěřilová. Annotation of Multi-Word Expressions in Czech Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015. pp. 103-112, 10 pp. ISBN 978-80-263-0974-1. https://is.muni.cz/publication/1320593
  • D - Marek Medveď, Vít Baisa, Aleš Horák. Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods. In Constantin Orasan and Rohit Gupta. Proceedings of The Workshop on Natural Language Processing for Translation Memories (NLP4TM). Bulgaria: INCOMA Ltd. Shoumen, 2015. pp. 31-35, 5 pp. ISBN 978-954-452-032-8. https://is.muni.cz/publication/1311833
  • D - Negation Scope Detection for Twitter Sentiment Analysis - Johan Reitan, Jørgen Faret, Björn Gambäck, Lars Bungum. The 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA) at the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal. September 2015, pp. 99–108, Association for Computational Linguistics. http://aclweb.org/anthology/W15-2914
  • D - Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages - Anupam Jamatia, Björn Gambäck, Amitava Das. The 10th Conference on Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria. September 2015, pp. 239–248. http://aclweb.org/anthology/R15-1033
  • D - Multi-Domain Adapted Machine Translation Using Unsupervised Text Clustering - Lars Bungum, Björn Gambäck. Modeling and Using Context: 9th International and Interdisciplinary Conference, CONTEXT 2015, Larnaca, Cyprus, November 2-6, 2015. Proceedings. Editors: Henning Christiansen, Isidora Stojanovic, George A. Papadopoulos. Springer Verlag, Lecture Notes in Computer Science Volume 9405, pp. 201-213. http://link.springer.com/chapter/10.1007/978-3-319-25591-0_15
  • D - Self-Organizing Maps for Classification of a Multi-Labeled Corpus - Lars Bungum, Björn Gambäck. The 12th International Conference on Natural Language Processing (ICON), Trivandrum, Kerala, India.
  • D - Sentence Boundary Detection for Social Media Text - Dwijen Rudrapal, Anupam Jamatia, Kunal Chakma, Amitava Das, Björn Gambäck. The 12th International Conference on Natural Language Processing (ICON), Trivandrum, Kerala, India.

Deliverables
