Bayesian techniques for website boundary detection
This work involved developing a detection system that discovers new knowledge about the structural relationships between web resources, using web crawling and modelling based on Bayesian classifiers. The discovered information is expressed as a machine-readable map of resources.
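As a rough illustration of the general idea (not the project's actual code), the sketch below trains a small Bernoulli naive Bayes classifier to decide whether a crawled URL falls inside the boundary of a website identified by a seed page. Everything in it is an assumption made for the example: the three URL features, the `page_features` helper, the `BernoulliNaiveBayes` class, the `inside`/`outside` labels, and the `example.org` URLs. It uses only the Python standard library so it stays self-contained; a real system would more likely rely on a library such as Weka or scikit-learn and on richer features.

```python
import math
from collections import defaultdict
from urllib.parse import urlparse


def page_features(seed_url, candidate_url):
    """Binary features comparing a candidate URL with the seed URL (hypothetical)."""
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    seed_dir = seed.path.rsplit("/", 1)[0]  # directory part of the seed path
    return {
        "same_host": seed.netloc == cand.netloc,
        "shares_path_prefix": cand.path.startswith(seed_dir + "/"),
        "has_query": bool(cand.query),
    }


class BernoulliNaiveBayes:
    """Naive Bayes over boolean features with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.features = set()

    def fit(self, samples, labels):
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            for name, value in feats.items():
                self.features.add(name)
                if value:
                    self.feature_counts[label][name] += 1
        return self

    def log_posteriors(self, feats):
        """Unnormalised log P(class) + sum of log P(feature | class)."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for name in self.features:
                p_true = (self.feature_counts[label][name] + self.alpha) / (
                    n + 2 * self.alpha
                )
                score += math.log(p_true if feats.get(name) else 1.0 - p_true)
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_posteriors(feats)
        return max(scores, key=scores.get)


# Hand-labelled toy data (illustrative only, not real crawl results).
seed = "http://example.org/~alice/index.html"
training = [
    ("http://example.org/~alice/pubs.html", "inside"),
    ("http://example.org/~alice/teaching/cs101.html", "inside"),
    ("http://example.org/~alice/cv.html", "inside"),
    ("http://example.org/~bob/index.html", "outside"),
    ("http://example.org/search?q=alice", "outside"),
    ("http://other.example.com/news.html", "outside"),
]

model = BernoulliNaiveBayes().fit(
    [page_features(seed, url) for url, _ in training],
    [label for _, label in training],
)


def boundary_map(seed_url, candidate_urls):
    """Map each candidate URL to its predicted label, as a plain dict."""
    return {url: model.predict(page_features(seed_url, url)) for url in candidate_urls}
```

The dictionary returned by `boundary_map` could be serialised with `json.dumps` to give a simple machine-readable map of which resources are predicted to belong to the site.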
Technologies used in this work:
- Java, Python
- JSON, XML, HDFS
- Weka, pandas, scikit-learn, NumPy, Matplotlib, statsmodels
- MATLAB, SPSS, RapidMiner
- Linux Shell scripting
- HTML DOM, regular expressions, Beautiful Soup (Python library)
The skills I developed on this project:
- Web crawling strategies
- Classification algorithms
- Robust web scraping systems
- Development of HTML parsing systems
- Data parsing algorithms
This work was conducted as part of my advanced computer science research.