alshukri2012 - Website boundary detection via machine learning

Website boundary detection via machine learning

by A. Alshukri


Abstract

This thesis describes research undertaken in the field of web data mining. More specifically, this research is directed at investigating solutions to the Website Boundary Detection (WBD) problem. WBD, which remains an open problem, is the task of identifying the collection of all web pages that are part of a single website. Potential solutions to WBD can be beneficial with respect to tasks such as archiving web content and the automated construction of web directories.

A prerequisite for any WBD approach is a definition of a website. This thesis commences with a discussion of previous definitions of a website, and subsequently proposes a definition which is used with respect to the WBD solution approaches presented later in this thesis.

The WBD problem may be addressed in either the static or the dynamic context. Both are considered in this thesis. Static approaches require all web page data to be available a priori in order to make a decision on what pages are within a website boundary, whereas dynamic approaches make decisions on portions of the web data and incrementally build a representation of the pages within a website boundary. There are three main approaches to the WBD problem presented in this thesis; the first two are static approaches, and the final one is a dynamic approach.

The first static approach presented in this thesis concentrates on the types of features that can be used to represent web pages. This approach presents a practical solution to the WBD problem by applying clustering algorithms to various combinations of features. Further analysis investigates the “best” combination of features to be used in terms of WBD performance.

The second static approach investigates graph partitioning techniques based on the structural properties of the web graph in order to produce WBD solutions. Two variations of the approach are considered: a hierarchical graph partitioning technique, and a method based on minimum cuts of flow networks.
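As an illustration of the minimum-cut idea only, the following is a minimal sketch and not the thesis's own algorithm. It assumes hyperlinks are treated as undirected edges of unit capacity, that one seed page is known to be inside the website, and that one page is known to be outside it. The page names and the function name `site_side_of_min_cut` are hypothetical. The pages still reachable from the seed after a maximum flow has been pushed form the seed's side of a minimum cut, which is taken here as the website boundary.

```python
from collections import defaultdict, deque


def site_side_of_min_cut(links, seed, outside):
    """Return the pages on the seed's side of a minimum seed/outside cut.

    links: iterable of (page_a, page_b) hyperlinks, treated as undirected
    edges with unit capacity (an assumption made for this sketch).
    Uses Edmonds-Karp: repeated BFS augmenting paths on a residual graph.
    """
    residual = defaultdict(lambda: defaultdict(int))
    for a, b in links:
        residual[a][b] += 1
        residual[b][a] += 1

    def reachable_with_parents():
        # BFS over edges that still have residual capacity.
        parents = {seed: None}
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v, capacity in residual[u].items():
                if capacity > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_with_parents()
        if outside not in parents:
            # No augmenting path is left: the reachable set is the cut side.
            return set(parents)
        # Find the bottleneck capacity along the path seed -> outside.
        bottleneck = float("inf")
        v = outside
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push flow along the path and update reverse capacities.
        v = outside
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u


# Hypothetical toy web graph: three densely linked department pages and
# a single link leading out towards external pages.
links = [
    ("home", "about"), ("home", "staff"), ("about", "staff"),
    ("staff", "blog"),
    ("blog", "news_portal"), ("blog", "ext_forum"),
    ("ext_forum", "news_portal"),
]
boundary = site_side_of_min_cut(links, "home", "news_portal")
print(sorted(boundary))  # ['about', 'home', 'staff']
```

In this toy graph the cheapest cut severs the single `staff`/`blog` link, so the three densely interlinked pages are kept together as one website.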

The final approach presented in this research considers the dynamic context. The proposed dynamic approach uses both structural properties and various feature representations of web pages in order to incrementally build a website boundary as the pages of the web graph are traversed.

The evaluation of the approaches presented in this thesis was conducted using web graphs from four academic departments hosted by the University of Liverpool. Both the static and dynamic approaches produce appropriate WBD solutions; however, the reported evaluation suggests that the dynamic approach to resolving the WBD problem offers additional benefits over a static approach due to the lower resource cost of gathering and processing typically smaller amounts of web data.

Reference

Website boundary detection via machine learning (A. Alshukri), PhD thesis, Department of Computer Science, School of Electrical Engineering, Electronics and Computer Science, University of Liverpool, 2012.

Bibtex Entry

@phdthesis{Alshukri2012,
	author = {Alshukri, A.},
	title = {Website boundary detection via machine learning},
	school = {Department of Computer Science, School of Electrical Engineering, Electronics and Computer Science, University of Liverpool},
	year = {2012},
	type ={},
	address = {University of Liverpool},
	month = {August},
}

For further details see the full paper.
