If we can build machines that understand the information on the web in a more intuitive way, they could be useful for improving or automating applications such as:
- Website archiving
- Digital preservation
- Web directory generation
- Website map generation
- Web spam detection
This research looked at extracting information from the architecture of the web using machine learning techniques. The main motivation behind this work was to look into the difference between:
- how a human understands information on the web
- how a machine understands the same information
Humans vs machines
To illustrate how the same information is interpreted differently by a human and a machine, consider a news article published on the web.
The intention is to produce, and indeed consume, the content of the article as a single coherent entity. The article may well span multiple pages and include images or videos. A multi-page layout splits the content across several pages, and each resource may be hosted on a different service, for example YouTube for the video.
These characteristics are barely worth mentioning to the human consuming the article. With very little web experience, a user can read the text and move between pages by clicking the page numbers or the next-page link. They can play the video or look at the images.
For a machine, this is a different matter.
How does a machine see this content? A machine can only read the web page, which contains a whole host of links and text content, along with a variety of embedded images and videos. Some of these may come from the news vendor. Some may belong to related articles from the same vendor or from partner organisations. Some content may even be provided by third-party advertising companies.
There are a number of ways a machine can interpret this information, including using:
- the link structure
- host IP or server name of the content
- textual content
- the linked images
- consistent styling of the pages
None of these can, on its own, identify with any certainty all the elements that correspond to a single article.
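As a rough illustration of the signals listed above, the following Python sketch pulls the title, outgoing links, embedded images and third-party hosts out of a single page using only the standard library. The class name `PageSignals`, the function `extract_signals` and the example URLs are hypothetical and chosen for illustration; they are not taken from the research.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageSignals(HTMLParser):
    """Collects a few machine-readable signals from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # absolute URLs found in <a href="...">
        self.images = []   # absolute URLs found in <img src="...">
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_signals(base_url, html):
    """Return title, links, images and the hosts that differ from the page host."""
    parser = PageSignals(base_url)
    parser.feed(html)
    parser.close()
    page_host = urlparse(base_url).netloc
    all_hosts = {urlparse(u).netloc for u in parser.links + parser.images}
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
        "external_hosts": sorted(all_hosts - {page_host}),
    }


# Hypothetical page: one "next page" link, one advert link, one CDN image.
html = """<html><head><title>Storm hits coast - Page 1</title></head>
<body><a href="/news/storm?page=2">Next</a>
<a href="https://ads.example.net/click">Ad</a>
<img src="https://cdn.example.org/storm.jpg"></body></html>"""

signals = extract_signals("https://news.example.com/news/storm", html)
print(signals)
```

Even in this tiny example, the host alone cannot tell the advert link apart from the article's own image, which is served from a different host. This is the ambiguity described above.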
What is the problem?
The human-perceived structure (or architecture) exists in a layer above the encoded structure of the web. The process whereby humans derive this architecture is founded on relationships and/or similarities between features encoded in the web, such as the topic or subject of web pages, layout, navigation menus, imagery and so on. Although humans are able to interpret the web in this way, it is a difficult task for a machine.
How can we solve this problem?
The research paper proposed and evaluated a number of techniques to solve this problem. The technique that worked best in the context of this research was a web crawling method with a random element (Metropolis-Hastings Random Walk, MHRW), combined with an Incremental KMeans algorithm (IKM) that grouped similar pages over time as the web crawler visited them. The title of a web page was found to be a good representation of the relationship between pages.
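To make the idea more concrete, here is a minimal Python sketch of a random-walk crawl that clusters pages by title as it goes. It is not the implementation from the paper. It makes several assumptions: the web is replaced by a small hypothetical in-memory graph treated as undirected, the walk uses the textbook Metropolis-Hastings acceptance rule min(1, deg(current) / deg(proposal)), and the clustering step is a simple threshold-based online variant standing in for the Incremental KMeans algorithm. The threshold of 0.5 and all names (`mhrw`, `IncrementalClusters`, the page ids and titles) are illustrative.

```python
import math
import random
from collections import Counter


def mhrw(graph, start, steps, rng):
    """Yield `steps` pages from a Metropolis-Hastings random walk.

    A neighbour is proposed uniformly at random and accepted with
    probability min(1, deg(current) / deg(proposal)); otherwise the
    walk stays on the current page for this step.
    """
    current = start
    for _ in range(steps):
        yield current
        proposal = rng.choice(graph[current])
        accept = min(1.0, len(graph[current]) / len(graph[proposal]))
        if rng.random() < accept:
            current = proposal


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusters:
    """Assumed online clustering: join the nearest cluster or open a new one."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centroids = []  # summed term counts per cluster
        self.members = []    # set of page ids per cluster

    def add(self, page, title):
        vec = Counter(title.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            self.centroids.append(vec)
            self.members.append({page})
            return len(self.centroids) - 1
        self.centroids[best].update(vec)  # summed counts keep the mean direction
        self.members[best].add(page)
        return best


# Hypothetical site: two two-page articles linked from one hub page.
graph = {
    "a1": ["a2", "hub"], "a2": ["a1", "hub"],
    "b1": ["b2", "hub"], "b2": ["b1", "hub"],
    "hub": ["a1", "a2", "b1", "b2"],
}
titles = {
    "a1": "Storm hits coast page 1", "a2": "Storm hits coast page 2",
    "b1": "Election results live page 1", "b2": "Election results live page 2",
    "hub": "Latest headlines",
}

clusters = IncrementalClusters(threshold=0.5)
seen = set()
for page in mhrw(graph, "hub", 200, random.Random(0)):
    if page not in seen:  # cluster each page once, on first visit
        seen.add(page)
        clusters.add(page, titles[page])
print(clusters.members)
```

In this toy setting the titles of pages from the same article share most of their words, so they fall into the same cluster, while pages from different articles share only generic words such as "page" and stay apart.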
This work was published as part of my research into the structure and architecture analysis of the web. This post discusses an article that was published in the Journal of Web Intelligence. See publication reference for more information.