A proposal on a proactive crawling approach with analysis of state-of-the-art web crawling algorithms

Chul-Won Na; Byung-Won On

Chul-Won Na, Byung-Won On, Journal of Internet Computing and Services, Vol. 20, No. 3, pp. 43-59, Jun. 2019

DOI: 10.7472/jksii.2019.20.3.43

Keywords: Web Page, Web Crawling Algorithms, Web Crawler

Abstract

Today, with the spread of smartphones and the development of social networking services, the volume of stored structured and unstructured big data has grown exponentially. If we analyze these data well, we can obtain useful information for making predictions about the future. Before big data can be analyzed, large amounts of data must first be collected. The web is the repository where most of these data are stored. However, because the volume is so large, data containing unneeded information are as abundant as data containing useful information. It has therefore become important to collect data efficiently, filtering out data with unnecessary information and collecting only data with useful information. Web crawlers cannot download all pages because of constraints such as network bandwidth, operational time, and data storage. This is why a crawler should avoid visiting the many pages that are not relevant to its goal and download only the important pages as early as possible. This paper seeks to help resolve these issues. First, we introduce basic web crawling algorithms. For each algorithm, we describe its time complexity and its pros and cons, and we compare and analyze the algorithms. Next, we introduce state-of-the-art web crawling algorithms that improve on the shortcomings of the basic ones. In addition, recent research trends show that web crawling algorithms with special purposes, such as collecting sentiment words, are being actively studied. As one study of such special-purpose algorithms, we introduce a sentiment-aware web crawling technique, which is a proactive web crawling technique. The results showed that the larger the data are, the higher the performance is and the more space is saved.
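The trade-off described in the abstract, a limited page budget spent on either every page in discovery order or the most promising pages first, can be illustrated with a small sketch. The abstract does not name the specific algorithms the paper covers, so the code below assumes two commonly used baselines, breadth-first and best-first (priority-queue) crawling. The link graph `LINKS`, the relevance values in `SCORE`, and both function names are hypothetical and are not taken from the paper.

```python
import heapq
from collections import deque

# Hypothetical toy web: page -> outgoing links (not data from the paper).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": ["f"],
    "f": [],
}

# Hypothetical relevance scores; a real crawler would have to estimate these.
SCORE = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.3, "e": 0.8, "f": 0.7}


def bfs_crawl(seed, budget):
    """Visit pages in breadth-first order until the page budget is spent."""
    visited, seen, frontier = [], {seed}, deque([seed])
    while frontier and len(visited) < budget:
        page = frontier.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


def best_first_crawl(seed, budget):
    """Visit the highest-scoring discovered page first until the budget is spent."""
    visited, seen = [], {seed}
    frontier = [(-SCORE[seed], seed)]  # heapq is a min-heap, so scores are negated
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE.get(link, 0.0), link))
    return visited


if __name__ == "__main__":
    for crawl in (bfs_crawl, best_first_crawl):
        pages = crawl("seed", budget=4)
        print(crawl.__name__, pages, sum(SCORE[p] for p in pages))
```

With the same budget of four pages, the best-first variant in this toy graph spends its downloads on the higher-scoring branch, while the breadth-first variant visits pages strictly in the order they were discovered. How relevance is estimated in practice is the hard part, and this sketch simply assumes the scores are given.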

Cite this article

[APA Style]

Na, C. & On, B. (2019). A proposal on a proactive crawling approach with analysis of state-of-the-art web crawling algorithms. Journal of Internet Computing and Services, 20(3), 43-59. DOI: 10.7472/jksii.2019.20.3.43.

[IEEE Style]

C. Na and B. On, "A proposal on a proactive crawling approach with analysis of state-of-the-art web crawling algorithms," Journal of Internet Computing and Services, vol. 20, no. 3, pp. 43-59, 2019. DOI: 10.7472/jksii.2019.20.3.43.

[ACM Style]

Chul-Won Na and Byung-Won On. 2019. A proposal on a proactive crawling approach with analysis of state-of-the-art web crawling algorithms. Journal of Internet Computing and Services, 20, 3, (2019), 43-59. DOI: 10.7472/jksii.2019.20.3.43.