
Interim Results of the HaBiT project

Peer-review Board

  • board members:
    • Tomaž Erjavec (chairman)
    • Václav Matoušek (previous HaBiT opponent)
    • Diana McCarthy
  • each member has to sign a Confidentiality Declaration
  • Skype meeting (initiated by MU) on Tuesday, June 27, 2017, 11:00 CEST (GMT+2)
  • the meeting result is summarized in Agreed minutes (empty template Agreed_minutes.doc, editable form in Google Doc)

Outputs

The final version of HaBiT system

The HaBiT system is accessible at http://corpora.fi.muni.cz/habit

The system includes selected corpus processing tools and the following HaBiT corpora:

Amharic WIC corpus (News from Walta Information Center), manually tagged.

Amharic Web corpus. Crawled by SpiderLing in August 2013, October 2015 and January 2016. Cleaned, de-duplicated. Tagged by TreeTagger trained on Amharic WIC.
Corpus deliverable/technical report

Oromo spoken corpus containing 1205 utterances. Built by Text Laboratory, University of Oslo.

Oromo Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.
Corpus deliverable/technical report

Somali Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.
Corpus deliverable/technical report

Tigrinya Web corpus crawled by SpiderLing in January 2016. Cleaned, de-duplicated.
Corpus deliverable/technical report

Czech-Norwegian parallel corpus from subtitles, OpenSubtitles2016 subcorpus of OPUS2, filtered for Czech and Norwegian.
Corpus deliverable/technical report

Publications

D - conference paper, J - journal paper, R - software

2014-2015:

  1. D - Vít Baisa, Jane Bradbury, Silvie Cinková, Ismaïl El Maarouf, Adam Kilgarriff, Octavian Popescu. SemEval-2015 Task 15: A CPA dictionary-entry-building task. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, Colorado: Association for Computational Linguistics, 2015. s. 315-324, 10 s. ISBN 978-1-941643-40-2. https://is.muni.cz/publication/1308719
  2. D - Adam Kilgarriff, Vít Baisa, Miloš Jakubíček, Pavel Rychlý. Longest-commonest Match. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S.. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana: Trojina, Institute for Applied Slovene Studies, 2015. s. 397-404, 8 s. ISBN 978-961-93594-3-3. https://is.muni.cz/publication/1308616
  3. D - Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář, Vít Baisa. Interactive Visualizations of Corpus Data in Sketch Engine. In Gintaré Grigonyté, Simon Clematide, Andrius Utka, Martin Volk. Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015. s. 17-22, 6 s. ISBN 978-91-7519-035-8. https://is.muni.cz/publication/1299713
  4. D - Adam Rambousek, Aleš Horák. DEBWrite: Free Customizable Web-based Dictionary Writing System. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S.. Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd., 2015. s. 443-451, 9 s. ISBN 978-961-93594-3-3. https://is.muni.cz/publication/1308365
  5. D - Vít Baisa, Ondřej Herman, Miloš Jakubíček. Towards Automatic Finding of Word Sense Changes in Time. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015. s. 33-41, 9 s. ISBN 978-80-263-0974-1. https://is.muni.cz/publication/1318600
  6. D - Zuzana Nevěřilová. Annotation of Multi-Word Expressions in Czech Texts. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Ninth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2015. s. 103-112, 10 s. ISBN 978-80-263-0974-1. https://is.muni.cz/publication/1320593
  7. D - Marek Medveď, Vít Baisa, Aleš Horák. Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods. In Constantin Orasan and Rohit Gupta. Proceedings of The Workshop on Natural Language Processing for Translation Memories (NLP4TM). Bulgaria: INCOMA Ltd. Shoumen, 2015. s. 31-35, 5 s. ISBN 978-954-452-032-8. https://is.muni.cz/publication/1311833
  8. D - Negation Scope Detection for Twitter Sentiment Analysis - Johan Reitan, Jørgen Faret, Björn Gambäck, Lars Bungum. The 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA) at the 2015 Conference on Empirical Methods on Natural Language Processing (EMNLP), Lisbon, Portugal. September 2015, pp. 99–108, Association for Computational Linguistics. http://aclweb.org/anthology/W15-2914
  9. D - Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages - Anupam Jamatia, Björn Gambäck, Amitava Das. The 10th Conference on Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria. September 2015, pp. 239–248. http://aclweb.org/anthology/R15-1033
  10. D - Multi-Domain Adapted Machine Translation Using Unsupervised Text Clustering - Lars Bungum, Björn Gambäck. Modeling and Using Context: 9th International and Interdisciplinary Conference, CONTEXT 2015, Larnaca, Cyprus, November 2-6, 2015. Proceedings. Editors: Henning Christiansen, Isidora Stojanovic, George A. Papadopoulos. Springer Verlag, Lecture Notes in Computer Science Volume 9405, pp. 201-213. http://link.springer.com/chapter/10.1007/978-3-319-25591-0_15
  11. D - Self-Organizing Maps for Classification of a Multi-Labeled Corpus - Lars Bungum, Björn Gambäck. The 12th International Conference on Natural Language Processing (ICON), Trivandrum, Kerala, India. December 2015.
  12. D - Sentence Boundary Detection for Social Media Text - Dwijen Rudrapal, Anupam Jamatia, Kunal Chakma, Amitava Das, Björn Gambäck. The 12th International Conference on Natural Language Processing (ICON), Trivandrum, Kerala, India. December 2015.

2016-2017:

  1. D - Collecting and Annotating Indian Social Media Code-Mixed Corpora - Anupam Jamatia, Björn Gambäck, Amitava Das. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Konya, Turkey. April 2016.
  2. D - Comparing the Level of Code-Switching in Corpora - Björn Gambäck, Amitava Das. The 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. May 2016. http://www.lrec-conf.org/proceedings/lrec2016/summaries/669.html
  3. D - NTNUSentEval at SemEval-2016 Task 4: Combining General Classifiers for Fast Twitter Sentiment Analysis - Brage Ekroll Jahren, Valerij Fredriksen, Björn Gambäck, Lars Bungum. 10th International Workshop on Semantic Evaluation (SemEval) at the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'2016), San Diego, California. June 2016, pp. 103–108. http://aclweb.org/anthology/S/S16/S16-1014.pdf
  4. D - Linguistic Domains and Adaptable Companionable Agents - Björn Gambäck, Lars Bungum. 1st International Workshop on Domain Adaptation for Dialog Agents at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Riva del Garda, Italy. September 2016.
  5. D - Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet - Utpal Kumar Sikdar, Björn Gambäck. The 2nd Workshop on Computational Approaches to Code Switching at the 2016 Conference on Empirical Methods on Natural Language Processing (EMNLP), Austin, Texas. November 2016. http://www.aclweb.org/anthology/W/W16/W16-5817.pdf
  6. D - Feature-Rich Twitter Named Entity Extraction and Classification - Utpal Kumar Sikdar, Björn Gambäck. The 2nd Workshop on Noisy User-generated Text (W-NUT) at the 26th International Conference on Computational Linguistics (COLING), Osaka, Japan. December 2016. http://www.aclweb.org/anthology/W/W16/W16-3922.pdf
  7. D - Twitter Named Entity Extraction and Linking Using Differential Evolution - Utpal Kumar Sikdar, Björn Gambäck. The 13th International Conference on Natural Language Processing (ICON), Varanasi, Uttar Pradesh, India. December 2016, pp. 198-207.
  8. D - Pavel Rychlý, Vít Suchomel. Annotated Amharic Corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer International Publishing, 2016. s. 295-302, 8 s. ISBN 978-3-319-45509-9. https://is.muni.cz/publication/1353390
  9. D - Zuzana Nevěřilová. Annotation of Czech Texts with Language Mixing. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer International Publishing, 2016. s. 279-286, 8 s. ISBN 978-3-319-45509-9. doi:10.1007/978-3-319-45510-5_32. https://is.muni.cz/publication/1358121
  10. D - Marek Medveď, Aleš Horák. AQA: Automatic Question Answering System for Czech. In Sojka Petr, Horák Aleš, Kopeček Ivan, Pala Karel. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer International Publishing, 2016. s. 270-278, 9 s. ISBN 978-3-319-45510-5. doi:10.1007/978-3-319-45510-5_31. https://is.muni.cz/publication/1353405
  11. D - Vít Baisa. Czech Grammar Agreement Dataset for Evaluation of Language Models. In RASLAN 2016 Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016. s. 63-67, 5 s. ISBN 978-80-263-1095-2. https://is.muni.cz/publication/1362555
  12. D - Ondřej Herman, Vít Suchomel, Vít Baisa, Pavel Rychlý. DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model. In Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). Osaka: The COLING 2016 Organizing Committee, 2016. s. 114-118, 5 s. ISBN 978-4-87974-716-7. https://is.muni.cz/publication/1366107
  13. D - Marek Medveď, Vojtěch Kovář, Miloš Jakubíček. English-French Document Alignment Based on Keywords and Statistical Translation. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. Berlin: Association for Computational Linguistics, 2016. s. 728-732, 5 s. ISBN 978-1-945626-10-4. https://is.muni.cz/publication/1352922
  14. D - Vít Baisa, Jan Michelfeit, Marek Medveď, Miloš Jakubíček. European Union Language Resources in Sketch Engine. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. s. 2799-2803, 5 s. ISBN 978-2-9517408-9-1. https://is.muni.cz/publication/1346032
  15. D - Vojtěch Kovář. Evaluating Natural Language Processing Tasks with Low Inter-Annotator Agreement: The Case of Corpus Applications. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. s. 127-134, 8 s. ISBN 978-80-263-1095-2. https://is.muni.cz/publication/1365039
  16. D - Vojtěch Kovář, Jakub Machura, Kristýna Zemková, Michal Rott. Evaluation and Improvements in Punctuation Detection for Czech. In Petr Sojka; Aleš Horák; Ivan Kopeček; Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings. Switzerland: Springer, 2016. s. 287-294, 8 s. ISBN 978-3-319-45509-9. https://is.muni.cz/publication/1358120
  17. D - Vojtěch Kovář, Monika Močiariková, Pavel Rychlý. Finding Definitions in Large Corpora with Sketch Engine. In Nicoletta Calzolari (Conference Chair) et al.. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. s. 391-394, 4 s. ISBN 978-2-9517408-9-1. https://is.muni.cz/publication/1360550
  18. D - Silvie Cinková, Ema Krejčová, Anna Vernerová, Vít Baisa. Graded and Word-Sense-Disambiguation Decisions in Corpus Pattern Analysis: a Pilot Study. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. s. 848-854, 7 s. ISBN 978-2-9517408-9-1. https://is.muni.cz/publication/1346038
  19. D - Miloš Jakubíček, Pavel Šmerk. Large Scale Keyword Extraction using a Finite State Backend. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. s. 143-146, 4 s. ISBN 978-80-263-1095-2. https://is.muni.cz/publication/1365139
  20. J - Aleš Horák, Adam Rambousek. Lexicographic Tools to Build New Encyclopaedia of the Czech Language. The Prague Bulletin of Mathematical Linguistics, Prague (Czech Republic): Charles University, 2016, vol. 2016, no. 106, s. 205-213. ISSN 0032-6585. https://is.muni.cz/publication/1353279
  21. D - Vít Baisa, Sara Može, Irene Renau. Multilingual CPA: Linking Verb Patterns across Languages. In Tinatin Margalitadze, George Meladze. Proceedings of the XVII EURALEX International congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016. s. 410-417, 8 s. ISBN 978-9941-13-542-2. https://is.muni.cz/publication/1352903
  22. D - Vojtěch Kovář, Miloš Jakubíček, Aleš Horák. On Evaluation of Natural Language Processing Tasks: Is Gold Standard Evaluation Methodology a Good Solution? In Jaap van den Herik and Joaquim Filipe. Proceedings of the 8th International Conference on Agents and Artificial Intelligence. Rome: SCITEPRESS, 2016. s. 540-545, 6 s. ISBN 978-989-758-172-4. https://is.muni.cz/publication/1322854
  23. D - Valentina Apresjan, Vít Baisa, Olga Buivolova, Olga Kultepina. RuSkELL: Online Language Learning Tool for Russian Language. In Tinatin Margalitadze, George Meladze. Proceedings of the XVII EURALEX International congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016. s. 292-299, 8 s. ISBN 978-9941-13-542-2 https://is.muni.cz/publication/1352900
  24. J - Vojtěch Kovář, Vít Baisa, Miloš Jakubíček. Sketch Engine for Bilingual Lexicography. International Journal of Lexicography, Oxford: Oxford University Press, 2016, vol. 29, no. 2, s. 1-14. ISSN 0950-3846. https://is.muni.cz/publication/1349930
  25. D - Vít Baisa, Silvie Cinková, Ema Krejčová, Anna Vernerová. VPS-GradeUp: Graded Decisions on Usage Patterns. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. s. 823-827, 5 s. ISBN 978-2-9517408-9-1. https://is.muni.cz/publication/1347072
  26. R - Vít Suchomel, Pavel Rychlý. Set of Ethiopian Web Corpora (software). 2016. https://is.muni.cz/publication/1381970
  27. D - GAMBÄCK, Björn and Utpal Kumar SIKDAR. Named Entity Recognition for Amharic Using Deep Learning. In Paul Cunningham and Miriam Cunningham (Eds). IST-Africa 2017 Conference Proceedings. IIMC International Information Management Corporation, Windhoek, Namibia, June 2017. ISBN: 978-1-905824-56-4.
  28. D - GAMBÄCK, Björn and Utpal Kumar SIKDAR. Using Convolutional Neural Networks to Classify Hate-Speech. To appear in the 1st Workshop on Abusive Language Online to be held at the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, August 2017.
  29. J - KUMAR, Upendra, Aishwarya N. REGANTI, Tushar MAHESHWARI, Tanmoy CHAKROBORTY, Björn GAMBÄCK and Amitava DAS. Inducing Personalities and Values from Language Use in Social Network Communities. To appear in Information Systems Frontiers, Special Issue on Mining Human Psycholinguistic Behaviour from Social Media. ISSN 1387-3326. Springer.
  30. D - MAHESHWARI, Tushar, Aishwarya N. REGANTI, Samiksha GUPTA, Anupam JAMATIA, Upendra KUMAR, Björn GAMBÄCK and Amitava DAS. A Societal Sentiment Analysis: Predicting the Values and Ethics of Individuals by Analysing Social Media Content. In Mirella Lapata, Phil Blunsom and Alexander Koller (eds.): Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 731–741, Valencia, Spain, April, 2017. ISBN 978-1-945626-34-0. ACL.
  31. D - RÆDER, Johan G. Cyrus M. and Björn GAMBÄCK. Sarcasm Annotation and Detection in Tweets. In Alexander Gelbukh (ed.): The 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017), Budapest, Hungary, May 2017, Springer Lecture Notes in Computer Science.
  32. D - SIKDAR, Utpal Kumar and Björn GAMBÄCK. Named Entity Recognition for Amharic Using Stack-Based Deep Learning. In Alexander Gelbukh (ed.): The 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017), Budapest, Hungary, May 2017, Springer Lecture Notes in Computer Science.
  33. D - STEINSKOG, Asbjørn Ottesen, Jonas Foyn THERKELSEN and Björn GAMBÄCK. Twitter Topic Modeling by Tweet Aggregation. In Jörg Tiedemann (ed.): Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 77-86, Göteborg, Sweden, May 2017. ISBN 978-91-7685-601-7. NEALT.
  34. R - Pavel Rychlý. Corpus Annotation Tool (software). 2017. https://is.muni.cz/publication/1381994
  35. R - Karel Pala, Aleš Horák, Pavel Rychlý, Vít Suchomel, Vít Baisa, Miloš Jakubíček, Vojtěch Kovář, Zuzana Nevěřilová, Adam Rambousek, Björn Gambäck, Utpal Sikdar, Lars Bungum. HaBiT system (software). 2017. https://is.muni.cz/publication/1381969

Reports

Deliverables


Last modified on Jun 27, 2017, 11:53:29 AM
