Web-Site Boundary Detection
by A. Alshukri, F. Coenen, M. Zito
Abstract
Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task.
In this paper a web-page clustering approach to boundary detection is suggested. The principal issue is feature selection, hampered by the observation that there is no clear understanding of what a web-site is.
This paper proposes a definition of a web-site, founded on the principle of user intention, directed at the boundary detection problem; and then reports on a sequence of experiments, using a number of clustering techniques, and a wide range of features and combinations of features to identify web-site boundaries.
The preliminary results reported seem to indicate that, in general, a combination of features produces the most appropriate result.
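As a rough illustration of the clustering idea in the abstract, the sketch below groups pages by a combination of simple features and treats the cluster containing a seed page as the web-site. Everything in it is an assumption made for illustration: the features (host, URL path and title tokens), the use of 2-means clustering, the helper names and the example URLs are hypothetical, and none of it is taken from the paper's actual feature set or algorithms.

```python
from urllib.parse import urlparse


def page_tokens(url, title):
    """Combine hypothetical features: host parts, URL path segments, title words."""
    parsed = urlparse(url)
    tokens = set(parsed.netloc.lower().split("."))
    tokens |= {seg for seg in parsed.path.lower().split("/") if seg}
    tokens |= set(title.lower().split())
    return tokens


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids (deterministic)."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def detect_boundary(pages, seed_index=0):
    """Return the URLs clustered together with the seed page (assumed in-site)."""
    token_sets = [page_tokens(url, title) for url, title in pages]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[1.0 if t in ts else 0.0 for t in vocabulary] for ts in token_sets]
    seed = vectors[seed_index]
    # Second initial centroid: the page least similar to the seed.
    farthest = max(vectors, key=lambda v: sq_dist(v, seed))
    labels = kmeans(vectors, [seed, farthest])
    return [url for (url, _), lab in zip(pages, labels) if lab == labels[seed_index]]


# Hypothetical crawl: three pages of one site and two unrelated pages.
pages = [
    ("http://example.org/~alice/index.html", "Alice home"),
    ("http://example.org/~alice/papers.html", "Alice papers"),
    ("http://example.org/~alice/teaching.html", "Alice teaching"),
    ("http://news.example.com/sport/today.html", "Sport news"),
    ("http://news.example.com/sport/scores.html", "Sport scores"),
]
site = detect_boundary(pages, seed_index=0)
```

Here `site` holds the pages grouped with the seed page. Swapping in other features (for example link structure or page content) only requires changing `page_tokens`, which is one way to compare single features against combinations.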
Reference
Web-Site Boundary Detection (A. Alshukri, F. Coenen, M. Zito), In Proceedings of the 10th Industrial Conference on Data Mining (ICDM’10). 12-14 July, Berlin, Germany, Springer, 2010.
Bibtex Entry
@inproceedings{Alshukri2010,
author = {Alshukri, A. and Coenen, F. and Zito, M.},
title = {Web-Site Boundary Detection},
booktitle = {Proceedings of the 10th Industrial Conference on Data Mining (ICDM'10). 12-14 July, Berlin, Germany},
year = {2010},
address = {Berlin, Germany},
pages = {529--543},
publisher = {Springer},
isbn = {978-3-642-14399-1},
series = {Lecture Notes in Computer Science},
}
For further details see the full paper.