Website boundary detection via machine learning
[Started: September 2008]
This thesis describes research in the field of web data mining. More specifically, the research investigates solutions to the Website Boundary Detection (WBD) problem: identifying the collection of all web pages that are part of a single website. WBD remains an open problem. Solutions to it can benefit tasks such as archiving web content and the automated construction of web directories.
Thesis: [pdf] [html]
Most of the software written for this research is in Java. Some Linux shell scripting and an extensive number of open source libraries were also used. If you would like further details, please contact me.
If you use these datasets, please cite this paper [bibtex].
Mining the Information Architecture of the WWW using Automated Website Boundary Detection, In Journal of Web Intelligence, IOS Press, 2017.
A Dynamic Approach To The Website Boundary Detection Problem Using Random Walks, In Proceedings of the Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference (WIC'14), 11–14 August 2014, Warsaw, Poland, IEEE Computer Society, 2014. (slides)
Website boundary detection via machine learning, PhD thesis, Department of Computer Science, School of Electrical Engineering, Electronics and Computer Science, University of Liverpool, 2012.
Web-Site Boundary Detection Using Incremental Random Walk Clustering, In Proceedings of the 31st SGAI International Conference (SGAI'11), 13-15th December, Cambridge, England UK, Springer, 2011. (slides)
Incremental Web-Site Boundary Detection Using Random Walks, In Proceedings of the 7th International Conference on Machine Learning and Data Mining (MLDM'11), 30th August-3rd September, New York, USA, Springer, 2011. (slides)
Web-Site Boundary Detection, In Proceedings of the 10th Industrial Conference on Data Mining (ICDM'10), 12-14 July, Berlin, Germany, Springer, 2010. (slides)