Changes between Version 3 and Version 4 of SpiderLingImprovement


Timestamp: Jan 17, 2017, 11:59:13 PM
Author: xsuchom2
Comment: SpiderLing crawler documentation and source

  • SpiderLingImprovement

    v3  v4
    34  34  - Better performance (more pages downloaded per second) and lower resource usage (approx. 25 % less memory consumed), achieved by better spreading of domains in the crawling queue, switching from Python to !PyPy (which compiles frequently executed code to machine code at run time instead of only interpreting it), rewriting the chunked HTTP response and URL handling methods, and generally improving the code.
    35  35
        36  == !SpiderLing crawler documentation and source ==
        37  http://corpus.tools/wiki/SpiderLing
        38
    36  39  == References ==
    37  40   - [1] Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.