Web mining

Web mining is the application of data mining techniques to discover patterns from the World Wide Web. As the name suggests, this is information gathered by mining the web. It uses automated tools to discover and extract data from servers and Web documents, and it lets organizations access both structured and unstructured information from browser activities, server logs, website and link structure, page content and other sources.
The goal of Web structure mining is to generate a structural summary of Web sites and Web pages. Technically, Web content mining mainly focuses on the structure within a document, while Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web structure mining categorizes Web pages and generates information such as the similarity and relationship between different Web sites.
Web structure mining can also take another direction: discovering the structure of the Web document itself. This type of structure mining can be used to reveal the structure of Web pages, which is useful for navigation and makes it possible to compare and integrate Web page schemas. It also facilitates introducing database techniques for accessing information in Web pages by providing a reference schema.

Web mining types

Web mining can be divided into three different types – Web usage mining, Web content mining and Web structure mining.

Web usage mining

Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications.
Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site.
Web usage mining itself can be classified further depending on the kind of usage data considered:

Web server data: The user logs are collected by the Web server. Typical data includes IP address, page reference and access time.
Application server data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
Application level data: New kinds of events can be defined in an application, and logging can be turned on for them thus generating histories of these specially defined events. Many end applications require a combination of one or more of the techniques applied in the categories above.
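
As an illustration of the Web server data described above, the following minimal sketch extracts the IP address, page reference and access time from server log lines. It assumes the logs follow the widely used Common Log Format; the pattern, the function names and the sample lines are illustrative assumptions, not part of any particular tool.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_log_line(line):
    """Return a dict with ip, time, method, page and status, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def page_counts(lines):
    """Count how often each page is requested, skipping unparseable lines."""
    records = (parse_log_line(line) for line in lines)
    return Counter(record["page"] for record in records if record)


# Hypothetical sample data using documentation-reserved IP addresses.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512',
    '192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 304 0',
]

first_record = parse_log_line(sample_log[0])
counts = page_counts(sample_log)
```

Counting page requests is only the simplest usage pattern; real usage mining would typically group records into user sessions before looking for patterns.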

Related studies are concerned with two areas: constraint-based data mining algorithms applied in Web usage mining, and software tools developed for it. Costa and Seco demonstrated that web log mining can be used to extract semantic information about the user and a given community.

Pros

Web usage mining has many advantages, which make this technology attractive to corporations and government agencies. It has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies are using this technology to classify threats and fight against terrorism. The predictive capability of mining applications can benefit society by identifying criminal activities. Companies can establish better customer relationships by understanding the needs of the customer better and reacting to those needs faster. Companies can find, attract and retain customers; they can save on production costs by using the acquired insight into customer requirements. They can increase profitability by target pricing based on the profiles created. They can even identify customers who might defect to a competitor and try to retain them with promotional offers, thus reducing the risk of losing customers.
More benefits of web usage mining, particularly in the area of personalization, are outlined in specific frameworks such as the probabilistic latent semantic analysis model, which offer additional features to the user behavior and access pattern. This is because the process provides the user with more relevant content through collaborative recommendation. These models also demonstrate a capability in web usage mining technology to address problems associated with traditional techniques such as biases and questions regarding validity since the data and patterns obtained are not subjective and do not degrade over time. There are also elements unique to web usage mining that can show the technology's benefits and these include the way semantic knowledge is applied when interpreting, analyzing, and reasoning about usage patterns during the mining phase.

Cons

Web usage mining by itself does not create issues, but this technology when used on data of personal nature might cause concerns. The most criticized ethical issue involving web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without the individual's knowledge or consent. The obtained data will be analyzed, made anonymous, then clustered to form anonymous profiles. These applications de-individualize users by judging them by their mouse clicks rather than by identifying information. De-individualization in general can be defined as a tendency of judging and treating people on the basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that the companies collecting the data for a specific purpose might use the data for totally different purposes, and this essentially violates the user's interests.
The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies which buy the data are obliged to make it anonymous, and these companies are considered authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release will result in serious lawsuits, but there is no law preventing them from trading the data.
Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These practices might violate anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of algorithms with such attributes. This process could result in denial of a service or a privilege to an individual based on their race, religion or sexual orientation. This situation can be avoided if the data mining company maintains high ethical standards. The collected data is made anonymous so that the obtained data and the obtained patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy; however, the application can infer additional information by combining two separate pieces of data from the user.

Web structure mining

Web structure mining uses graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.
Mining the document structure: analysis of the tree-like structure of pages to describe HTML or XML tag usage.
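
As a small illustration of the second kind, the sketch below walks the tags of an HTML page and summarizes tag usage. It relies on Python's standard html.parser module; the class name TagUsage and the sample page are assumptions made for this example only.

```python
from collections import Counter
from html.parser import HTMLParser


class TagUsage(HTMLParser):
    """Collect how often each HTML start tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        self.counts[tag] += 1


# Hypothetical sample page.
sample_page = (
    "<html><body>"
    "<h1>Title</h1>"
    "<p>First paragraph with a <a href='/a'>link</a>.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

usage = TagUsage()
usage.feed(sample_page)
tag_counts = usage.counts
```

Tag counts are a very coarse structural summary; comparing or integrating page schemas would need the full tree rather than a flat count.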

Web structure mining terminology:

Web graph: a directed graph representing the web.
Node: a web page in the graph.
Edge: a hyperlink between two pages.
In degree: the number of links pointing to a particular node.
Out degree: the number of links originating from a particular node.
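
The terms above can be made concrete with a small sketch. It assumes the web graph is given as a list of (source, target) page pairs; the function name and the page names are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree


# Hypothetical web graph: each pair is a hyperlink from the first page to the second.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
in_degree, out_degree = degrees(edges)
```

Here page C has in degree 2 (links from A and B) and page A has out degree 2 (links to B and C).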

Techniques of web structure mining:

PageRank: this algorithm is used by Google to rank search results. It is named after Google co-founder Larry Page. The rank of a page is decided by the number and rank of the pages linking to it.
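
The idea behind PageRank can be sketched with a simplified power iteration. This is not Google's production algorithm: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the sample graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: C is linked from A, B and D; D has no incoming links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

In this example C ends up with the highest rank because three pages link to it, and A ranks above B because its single in-link comes from the highly ranked C.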

Web content mining

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery and organization of Web content difficult. Search and indexing tools of the Internet and the World Wide Web such as Lycos, Alta Vista, WebCrawler, Aliweb, MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information nor categorize, filter, or interpret documents. These factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information.
Web content mining is differentiated from two points of view: the information retrieval view and the database view. Research on unstructured and semi-structured data has been summarized from the information retrieval view. The summary shows that most studies represent unstructured text as a bag of words, which is based on statistics about single words in isolation, and take the single words found in the training corpus as features. For semi-structured data, all the works use the HTML structures inside the documents, and some also use the hyperlink structure between documents, for document representation. In the database view, in order to achieve better information management and querying on the web, mining tries to infer the structure of a web site so that the site can be transformed into a database.
There are several ways to represent documents; the vector space model is typically used. The documents constitute the whole vector space. This representation does not capture the importance of words in a document. To resolve this, tf-idf (term frequency times inverse document frequency) weighting is introduced.
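
A minimal sketch of tf-idf weighting follows. It assumes one common variant, in which the term frequency is the relative frequency of the term in the document and the inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term; other variants exist. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document; documents are lists of tokens."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tokenized documents.
docs = [["web", "mining", "web"], ["web", "data"], ["web", "graph", "graph"]]
weights = tf_idf(docs)
```

In this variant a term that occurs in every document, such as "web" here, gets weight 0, while a term concentrated in one document, such as "graph", gets a high weight in that document.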
Feature selection can be implemented by scanning the documents multiple times. A feature subset should be extracted under the condition that the classification result is barely affected. The general approach is to construct an evaluation function to score the features. Information gain, cross entropy, mutual information, and odds ratio are commonly used as evaluation functions.
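
As an example of such an evaluation function, the sketch below computes the information gain of a term with respect to the document categories, treating the presence or absence of the term as a binary feature. The function names, the sample documents and the category labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(documents, labels, term):
    """Reduction in label entropy from knowing whether each document contains the term."""
    with_term = [label for doc, label in zip(documents, labels) if term in doc]
    without_term = [label for doc, label in zip(documents, labels) if term not in doc]
    total = len(labels)
    conditional = sum(
        (len(part) / total) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional


# Hypothetical documents (as sets of terms) with their categories.
documents = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]

gain_goal = information_gain(documents, labels, "goal")
gain_match = information_gain(documents, labels, "match")
```

Here "goal" separates the two categories perfectly and receives the maximum gain of 1 bit, while "match" occurs equally in both categories and receives a gain of 0, so it would be a candidate for removal from the feature set.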
The classifiers and pattern analysis methods of text data mining are very similar to traditional data mining techniques. The usual evaluation measures are classification accuracy, precision, recall and information score.
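
Three of these measures can be sketched as follows for a single category of interest. The function name, the labels and the convention of returning 0.0 when a ratio is undefined are assumptions made for this example.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Return (accuracy, precision, recall) with respect to the category `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Assumption: report 0.0 when a ratio is undefined (empty denominator).
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall


# Hypothetical classifier output for five documents.
truth = ["sport", "sport", "politics", "politics", "sport"]
predicted = ["sport", "politics", "politics", "sport", "sport"]
accuracy, precision, recall = evaluate(truth, predicted, positive="sport")
```

Precision is the share of documents assigned to the category that really belong to it, while recall is the share of documents belonging to the category that the classifier found.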
Web mining is an important component of the content pipeline for web portals. It is used in data confirmation and validity verification, data integrity and building taxonomies, content management, content generation and opinion mining.
Web mining can complement the retrieval of structured data transmitted with open protocols like OAI-PMH: an example is the aggregation of works from academic publications, which are mined to identify open access versions through a mix of open source and open data methods by academic databases like Unpaywall.

Web content mining in foreign languages

Chinese

The character encoding of Chinese text is more complicated than that of English. GB, Big5 and HZ are common Chinese character encodings in web documents. Before text mining, one needs to identify the encoding of the HTML documents and convert it into an internal encoding, then use other data mining techniques to find useful knowledge and patterns.
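
The conversion step can be sketched as follows. This example assumes the page declares its encoding in an HTML meta tag and uses Unicode (Python's str type) as the internal encoding; the pattern, the function name and the UTF-8 fallback are assumptions, and real pages may need more robust detection.

```python
import re

# Assumed pattern for a charset declaration near the top of an HTML document.
CHARSET_PATTERN = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes into a Unicode string using the declared charset."""
    match = CHARSET_PATTERN.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: fall back and mark undecodable bytes.
        return raw.decode(default, errors="replace")


# Hypothetical Big5-encoded page containing two Chinese characters.
big5_page = '<meta charset="big5"><p>\u4e2d\u6587</p>'.encode("big5")
text = decode_html(big5_page)
```

Python's standard codecs include gb2312, big5 and hz, so the same function covers the three encodings named above once the declaration has been read.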

Resources

Books

Zdravko Markov, Daniel T. Larose , Wiley, 2007
Jesus Mena, "Data Mining Your Website", Digital Press, 1999
Soumen Chakrabarti, "Mining the Web: Analysis of Hypertext and Semi Structured Data", Morgan Kaufmann, 2002
Bing Liu, , Springer, 2007
Advances in Web Mining and Web Usage Analysis 2005 - revised papers from 7th workshop on Knowledge Discovery on the Web, Olfa Nasraoui, Osmar Zaiane, Myra Spiliopoulou, Bamshad Mobasher, Philip Yu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, LNAI 4198, 2006
Web Mining and Web Usage Analysis 2004 - revised papers from 6th workshop on Knowledge Discovery on the Web, Bamshad Mobasher, Olfa Nasraoui, Bing Liu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, 2006
Mike Thelwall, , 2004, Academic Press

Bibliographic references

Baraglia, R. Silvestri, F. , In Communications of the ACM 50: 63-67
Cooley, R., Mobasher, B. and Srivastava, J. “Web Mining: Information and Pattern Discovery on the World Wide Web” In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence
Cooley, R., Mobasher, B. and Srivastava, J. “”, Journal of Knowledge and Information System, Vol.1, Issue. 1, pp. 5–32, 1999
Costa, RP and Seco, N. , 11th Ibero-American Conference on Artificial Intelligence, 2008 October.
Kohavi, R., Mason, L. and Zheng, Z. “” Machine Learning, Vol 57, pp. 83–113
Lillian Clark, I-Hsien Ting, Chris Kimble, Peter Wright, Daniel Kudenko Journal of Information Research, Vol. 11 No. 2, January 2006
Eirinaki, M., Vazirgiannis, M. "", ACM Transactions on Internet Technology, Vol.3, No.1, February 2003
Mobasher, B., Cooley, R. and Srivastava, J. “” Communications of the ACM, Vol. 43, No.8, pp. 142–151
Mobasher, B., Dai, H., Luo, T. and Nakagawa, M. “” In Proceedings of WIDM 2001, Atlanta, GA, USA, pp. 9–15
Nasraoui O., Petenes C., , in Proc. of WebKDD 2003 – KDD Workshop on Web mining as a Premise to Effective and Intelligent Web Applications, Washington DC, August 2003, p. 37
Nasraoui O., Frigui H., Joshi A., and Krishnapuram R., , Proceedings of the Eighth International Fuzzy Systems Association Congress, Hsinchu, Taiwan, August 1999
Nasraoui O., Invited chapter in “Encyclopedia of Data Mining and Data Warehousing”, J. Wang, Ed, Idea Group, 2005
Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos C. D. “Web usage mining as a tool for personalization: a survey”, User modelling and user adapted interaction journal, Vol.13, Issue 4, pp. 311–372
I-Hsien Ting, Chris Kimble, Daniel Kudenko ""
I-Hsien Ting, Chris Kimble, Daniel Kudenko
Weichbroth, P., Owoc, M., Pleszkun, M. ""
