Bayesian techniques for website boundary detection

This work involved developing a detection system that discovers new knowledge about the structural relationships between web resources, using web crawling and modelling based on Bayesian classifiers. The discovered information is expressed as a machine-readable map of resources.
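
To give a flavour of the approach, here is a minimal sketch of a naive Bayes classifier that labels crawled URLs as inside or outside a website's boundary. It is an illustration only, not the project's actual code: the `url_features` helper, the `NaiveBayes` class, the two features, the `inside`/`outside` labels and the `example.org` training URLs are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse


def url_features(url, seed_host):
    """Turn a URL into categorical feature tokens (hypothetical feature choice)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    # First path segment, or a placeholder when the URL points at the root.
    segment = parts.path.strip("/").split("/")[0] or "<root>"
    return ["host_match=" + str(host == seed_host), "segment=" + segment]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, samples, labels):
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def log_scores(self, features):
        """Return log prior + sum of log likelihoods for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Made-up training data: URLs seen during a crawl seeded at example.org.
seed = "example.org"
training = [
    ("http://example.org/docs/intro", "inside"),
    ("http://example.org/docs/api", "inside"),
    ("http://example.org/about", "inside"),
    ("http://ads.tracker.net/click", "outside"),
    ("http://other.com/docs/intro", "outside"),
    ("http://social.net/share", "outside"),
]

model = NaiveBayes()
model.fit([url_features(url, seed) for url, _ in training],
          [label for _, label in training])

for url in ("http://example.org/docs/faq", "http://ads.tracker.net/banner"):
    print(url, model.predict(url_features(url, seed)))
```

In a sketch like this, the predicted labels for each crawled resource could then be written out as a JSON or XML map of the site. A real system would need far richer features (link structure, page content, DOM layout) than the two used here.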

The set of technologies used as part of this work:

  • Java, Python
  • JSON, XML, HDFS
  • Weka, pandas, scikit-learn, NumPy, Matplotlib, statsmodels
  • MATLAB, SPSS, RapidMiner
  • Linux shell scripting
  • HTML DOM, regular expressions, Python Beautiful Soup library

The skills I developed on this project:

  • Web crawling strategies
  • Classification algorithms
  • Robust web scraping systems
  • Development of HTML parsing systems
  • Data parsing algorithms

This work was conducted as part of my advanced computer science research.
