SIGIR'98 poster: Experiments of Collecting WWW Information using Distributed WWW Robots


Hayato YAMANA
Computer Science Div., Electrotechnical Laboratory, Tsukuba, Ibaraki 305-8568 Japan.

Kent TAMURA
Tokyo Research Laboratory, IBM Corporation, Yamato, Kanagawa 242 Japan.

Hiroyuki KAWANO
Graduate School of Informatics, Kyoto University, Kyoto 606-8501 Japan.

Satoshi KAMEI
Graduate School of Engineering, Kyoto University, Kyoto 606-8501 Japan.

Masanori HARADA
Graduate School of Arts and Sciences, The University of Tokyo, Meguro, Tokyo 153 Japan.

Hideki NISHIMURA
Software Research Laboratories, Corporate R&D Group, Sharp Corporation, Tenri, Osaka 632 Japan.

Isao ASAI
College of Engineering, Osaka Prefecture University, Sakai, Osaka 599-8531 Japan.

Hiroyuki KUSUMOTO
Faculty of Environmental Information, Keio University, Fujisawa, Kanagawa 252-0816 Japan.

Yoichi SHINODA
School of Information Science, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa 923-12 Japan.

Yoichi MURAOKA
Department of Information & Computer Science, School of Science & Engineering, Waseda University, Shinjuku, Tokyo 169-8555 Japan.


Abstract

The World Wide Web (hereafter, the web) is a large distributed digital information space. It is the most popular Internet service and has become indispensable. Because the web itself provides no protocol for searching its documents, the documents held on web servers must first be collected into a database that can then be searched.

In this paper, we propose distributed WWW robots that collect web documents quickly. Our final goal is to collect all of the web documents in Japan within one day. Currently, eight distributed WWW robots are running in Japan; their system code is written mostly in Java, with some C. We have already found 13,320 domains under the jp domain.

The experimental results show a speedup of 5.8 to 9.7 times when four distributed WWW robots are placed at different sites, compared with using only one WWW robot. We also expect a speedup of at most about (2.8 x n) times when n WWW robots are used to collect the web documents.
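The arithmetic behind these figures can be checked with a short sketch. It is only an illustration of the numbers quoted above, not part of the robot system: the function names `expected_max_speedup` and `per_robot_speedup` are hypothetical, and the factor 2.8 is taken directly from the estimate stated in the abstract.

```python
# Illustrative arithmetic only; the names below are hypothetical and do not
# come from the robot system described in the abstract.

# Factor from the abstract's estimate: at most about 2.8 x n for n robots.
MAX_SPEEDUP_FACTOR = 2.8


def expected_max_speedup(n_robots: int) -> float:
    """Return the abstract's upper-bound speedup estimate for n robots."""
    if n_robots < 1:
        raise ValueError("n_robots must be at least 1")
    return MAX_SPEEDUP_FACTOR * n_robots


def per_robot_speedup(measured_speedup: float, n_robots: int) -> float:
    """Return the measured speedup divided evenly across the robots."""
    if n_robots < 1:
        raise ValueError("n_robots must be at least 1")
    return measured_speedup / n_robots


if __name__ == "__main__":
    n = 4  # number of robots in the reported experiment
    bound = expected_max_speedup(n)
    for measured in (5.8, 9.7):  # reported speedup range for four robots
        share = per_robot_speedup(measured, n)
        print(f"measured {measured}: {share:.2f} per robot, bound {bound:.1f}")
```

For four robots the estimate gives an upper bound of about 11.2, so the measured range of 5.8 to 9.7 lies below it.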


SIGIR'98
24-28 August 1998
Melbourne, Australia.
sigir98@cs.mu.oz.au.