Bayesian techniques for website boundary detection

This work involved developing a detection system that discovers new knowledge about the structural relationships between web resources, using web crawling and modelling based on Bayesian classifiers. The discovered information is expressed as a machine-readable map of resources.
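As a rough illustration of the idea, and not the project's actual code, the sketch below trains a small naive Bayes classifier on URL tokens and writes its predictions out as a JSON map. The feature scheme, the `inside`/`outside` labels, the example URLs and the names `url_features`, `NaiveBayes` and `build_map` are all assumptions made for this example. A real system would use features taken from crawled pages rather than from URLs alone.

```python
import json
import math
import re
from collections import Counter
from urllib.parse import urlparse


def url_features(url, seed_host):
    """Hypothetical features: path tokens plus a host-match flag."""
    parts = urlparse(url)
    tokens = [t for t in re.split(r"[^a-z0-9]+", parts.path.lower()) if t]
    feats = ["path:" + t for t in tokens]
    feats.append("same_host" if parts.netloc == seed_host else "other_host")
    return feats


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = {}
        self.vocab = set()

    def fit(self, samples, labels):
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            self.feature_counts.setdefault(label, Counter()).update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(n / total)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Assumed toy training data: URLs labelled relative to one seed website.
SEED_HOST = "example.org"
training = [
    ("http://example.org/research/papers", "inside"),
    ("http://example.org/research/people", "inside"),
    ("http://example.org/about", "inside"),
    ("http://ads.example.net/banner/click", "outside"),
    ("http://social.example.com/share/post", "outside"),
    ("http://example.net/partner/news", "outside"),
]
model = NaiveBayes().fit(
    [url_features(url, SEED_HOST) for url, _ in training],
    [label for _, label in training],
)


def build_map(urls):
    """Return a machine-readable map of URL -> predicted boundary label."""
    return {url: model.predict(url_features(url, SEED_HOST)) for url in urls}


site_map = build_map([
    "http://example.org/research/projects",
    "http://ads.example.net/banner/view",
])
print(json.dumps(site_map, indent=2))
```

With this toy data the first URL is labelled `inside` and the second `outside`. Emitting the result as JSON keeps the map easy to consume by other tools.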

The set of technologies used as part of this work:

  • Java, Python
  • JSON, XML, HDFS
  • Weka, Pandas, Scikit-Learn, NumPy, Matplotlib, Statsmodels
  • MATLAB, SPSS, RapidMiner
  • Linux Shell scripting
  • HTML DOM, regex, Python Beautiful Soup library

The skills I developed on this project:

  • Web crawling strategies
  • Classification algorithms
  • Robust web scraping systems
  • Development of HTML parsing systems
  • Data parsing algorithms

This work was conducted as part of my advanced computer science research.