Web-Site Boundary Detection

by A. Alshukri, F. Coenen, M. Zito
Abstract:
Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task. In this paper a web-page clustering approach to boundary detection is suggested. The principal issue is feature selection, hampered by the observation that there is no clear understanding of what a web-site is. This paper proposes a definition of a web-site, founded on the principle of user intention, directed at the boundary detection problem; and then reports on a sequence of experiments, using a number of clustering techniques, and a wide range of features and combinations of features to identify web-site boundaries. The preliminary results reported seem to indicate that, in general, a combination of features produces the most appropriate result.
Reference:
Web-Site Boundary Detection (A. Alshukri, F. Coenen, M. Zito), In Proceedings of the 10th Industrial Conference on Data Mining (ICDM’10). 12-14 July, Berlin, Germany, Springer, 2010.
Bibtex Entry:
@inproceedings{Alshukri2010,
	author = {Alshukri, A. and Coenen, F. and Zito, M.},
	title = {Web-Site Boundary Detection},
	abstract = {Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task. In this paper a web-page clustering approach to boundary detection is suggested. The principal issue is feature selection, hampered by the observation that there is no clear understanding of what a web-site is. This paper proposes a definition of a web-site, founded on the principle of user intention, directed at the boundary detection problem; and then reports on a sequence of experiments, using a number of clustering techniques, and a wide range of features and combinations of features to identify web-site boundaries. The preliminary results reported seem to indicate that, in general, a combination of features produces the most appropriate result.},
	booktitle = {Proceedings of the 10th Industrial Conference on Data Mining (ICDM'10). 12-14 July, Berlin, Germany},
	year = {2010},
	address = {Berlin, Germany},
	pages = {529--543},
	publisher = {Springer},
	isbn = {978-3-642-14399-1},
	series = {Lecture Notes in Computer Science},	
	url = {http://link.springer.com/chapter/10.1007/978-3-642-23199-5_31},
	url = {http://www.csc.liv.ac.uk/~frans/PostScriptFiles/icdm2010alshukri.pdf},	
	url = {http://cgi.csc.liv.ac.uk/~ash/Publications/Papers/Alshukri2010-ICDM_Web-Site_Boundary_Detection.pdf},
	url = {/pubs/MLDM-ICDM/Alshukri2010-ICDM_Web-Site_Boundary_Detection.pdf},
}