Bayesian techniques for website boundary detection

This work involved developing a detection system that discovers new knowledge about the structural relationships between web resources, using web crawling and modelling based on Bayesian classifiers. The discovered information is expressed as a machine-readable map of resources.
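As a rough illustration of the idea, the sketch below trains a small naive Bayes classifier to decide whether a crawled URL falls inside or outside a website's boundary, then emits the result as a JSON map. It is not the system built in this work: the features (host match and path tokens), the labels "inside" and "outside", the example.org URLs and every function and class name are assumptions made for the example.

```python
import json
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def url_features(url, seed_url):
    """Turn a URL into a bag of simple features (hypothetical feature set)."""
    u, s = urlparse(url), urlparse(seed_url)
    feats = ["same_host" if u.netloc == s.netloc else "other_host"]
    # Path tokens, e.g. "/~jones/pubs.html" -> jones, pubs, html
    feats += ["tok=" + t for t in re.split(r"[^a-z0-9]+", u.path.lower()) if t]
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, feats):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feat_counts[label][f] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)


# Hand-labelled training pages (made-up data for illustration only).
seed = "http://example.org/~jones/"
training = [
    ("http://example.org/~jones/pubs.html", "inside"),
    ("http://example.org/~jones/teaching/", "inside"),
    ("http://example.org/news/2010/", "outside"),
    ("http://other.example.com/about.html", "outside"),
]
model = NaiveBayes().fit(
    [url_features(url, seed) for url, _ in training],
    [label for _, label in training],
)

# Classify newly crawled URLs and express the result as a machine-readable map.
candidates = ["http://example.org/~jones/cv.html", "http://other.example.com/news/"]
site_map = {url: model.predict(url_features(url, seed)) for url in candidates}
print(json.dumps(site_map, indent=2))
```

In a real crawler the features would more likely come from page content and link structure than from URL strings alone, and a library classifier could replace the hand-written one.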

Technologies used in this work:

Java, Python
JSON, XML, HDFS
Weka, pandas, scikit-learn, NumPy, Matplotlib, statsmodels
MATLAB, SPSS, RapidMiner
Linux shell scripting
HTML DOM, regular expressions, Beautiful Soup (Python library)

The skills I developed on this project:

Web crawling strategies
Classification algorithms
Robust web scraping systems
Development of HTML parsing systems
Data parsing algorithms

This work was conducted as part of my advanced computer science research.