Changes between Version 3 and Version 4 of SpiderLingImprovement
- Timestamp: Jan 17, 2017, 11:59:13 PM
v3  v4
34  34  - Better performance (more pages downloaded per second) and lower resource usage (approx. 25 % less memory consumed), achieved by spreading domains more evenly in the crawling queue, switching from Python to !PyPy (hot code is compiled just in time rather than interpreted throughout execution), rewriting the chunked HTTP response and URL handling methods, and generally improving the code.
35  35
    36  == !SpiderLing crawler documentation and source ==
    37  http://corpus.tools/wiki/SpiderLing
    38
36  39  == References ==
37  40  - [1] Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.