Novelty Mining


Introduction

The rapid growth of worldwide corporations produces a continuously increasing volume of data. These corporations often create community spaces, such as business blogs, to share information with the rest of the world. This also gives competitors an opportunity to find out what a company is doing by reading through the information it posts. However, new posts often repeat content that already appeared in older ones. “The current available search engine, like Google, can not tell whether a newly posted article contains fresh content or not, as compared to all the previous posted articles” (Tsai and Kwee, 2011). As a result, people who look for fresh content with a search engine usually come across many posts with old or already known content before they find the fresh ones. For corporate decision makers, spending time reading old or known information is undesirable. Business decisions are time sensitive: if decision makers receive information late, they are likely to miss the opportunity to outperform their competitors. With the huge amount of information added to the internet each day, the need to locate fresh and relevant content continues to grow. A novelty mining system can help by filtering out known information and identifying the fresh content for its users. In this article we first take a brief look at how novelty mining works and then look at some business applications.

 

Novelty mining

A novelty mining system examines the content of posts and decides whether it is relevant to the user’s preferences and whether it is new. The system works in three steps: preprocessing, categorization, and novelty mining. When the system first receives a document, it builds a model using various machine learning algorithms. The system then performs categorization and decides whether the document is relevant based on its relevance score. Normally, in a text mining system, the text is categorized into predefined groups. In Tsai and Kwee’s paper, articles are categorized by business area, such as finance and strategy, and this step is done manually. Lastly, if the document is considered relevant to the user’s preferences, it is passed to the novelty mining step, where it is compared with the historical data in the database to check whether its content is novel.
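The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tsai and Kwee’s implementation: the stop-word list, the keyword-based relevance score (a stand-in for the manual categorization), the Jaccard token overlap used as the novelty check, and both threshold values are hypothetical choices made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "its"}


def preprocess(text):
    """Step 1: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def relevance_score(tokens, topic_keywords):
    """Step 2: fraction of the topic's keywords that appear in the document."""
    if not topic_keywords:
        return 0.0
    return len(tokens & topic_keywords) / len(topic_keywords)


def is_novel(tokens, history, max_overlap=0.5):
    """Step 3: novel if no stored document shares too many tokens (Jaccard overlap)."""
    for old in history:
        union = tokens | old
        if union and len(tokens & old) / len(union) > max_overlap:
            return False
    return True


def process(text, topic_keywords, history, min_relevance=0.3):
    """Run the three steps and return 'irrelevant', 'redundant', or 'novel'."""
    tokens = preprocess(text)
    if relevance_score(tokens, topic_keywords) < min_relevance:
        return "irrelevant"
    if not is_novel(tokens, history):
        return "redundant"
    history.append(tokens)  # keep novel documents for later comparisons
    return "novel"
```

With a hypothetical finance keyword set such as {"profit", "revenue", "earnings"}, a first post about quarterly revenue and profit is reported as novel, the same post submitted again is reported as redundant, and a post about an office picnic is dropped as irrelevant before the novelty check runs.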

 

Methodologies

There are two standard techniques for novelty mining: one-to-one comparison and all-to-one comparison. In one-to-one comparison, the current sentence is compared with the sentences before it one at a time, and each comparison generates a redundancy score that measures how much the current sentence repeats the earlier one. When the maximum redundancy score reaches a set threshold, the sentence is considered redundant. In all-to-one comparison, the current sentence is compared with a pool of all the sentences before it, and a single redundancy score is generated for the decision.
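The difference between the two comparisons can be made concrete with a short Python sketch. The redundancy metric is not specified here, so the sketch assumes cosine similarity over simple word counts; the function names and the 0.8 threshold are likewise hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def one_to_one_redundancy(current, history):
    """Compare the current sentence with each earlier sentence; keep the highest score."""
    cur = bag_of_words(current)
    return max((cosine(cur, bag_of_words(h)) for h in history), default=0.0)


def all_to_one_redundancy(current, history):
    """Compare the current sentence with one pooled vector of all earlier sentences."""
    pool = Counter()
    for h in history:
        pool.update(bag_of_words(h))
    return cosine(bag_of_words(current), pool)


def is_novel(score, threshold=0.8):
    """Assumed decision rule: a sentence is novel while its redundancy stays below the threshold."""
    return score < threshold
```

In this sketch the one-to-one score is driven by the single most similar earlier sentence, while the all-to-one score reflects overlap with the history as a whole, so a sentence that borrows a little from many earlier sentences can score differently under the two methods.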

 

Text mining and novelty mining

According to Dr Diane McDonald, text mining has four stages. In the enhanced information retrieval stage, the system searches for keywords and retrieves possibly relevant documents. In the linguistic analysis and entity recognition stage, the text is turned “into a form that allows a computer to extract structured data” (McDonald, 2012). Then the information extraction stage may retrieve items such as chemical names and formulas. Lastly, the data mining stage finds meaningful patterns across the texts. These stages are very similar to those of novelty mining, but ordinary text mining cannot identify whether the information is new.

 

Benefits of text and novelty mining

Some of the more visible benefits of text mining systems are:

   Creating suggestions for customers of online stores such as Amazon.

   Monitoring public opinions.

   Measuring customer preference.

Novelty mining is essentially an improved form of text mining, so it offers the same benefits. However, it focuses more on finding new information that is not yet in the database. As mentioned before, most corporations have community spaces where they provide information to their customers or receive opinions from them. This includes the announcements and posts that their associates make on blogs or in other social spaces. Business decision makers can take advantage of this to gain better insight into their competitors. However, too many posts are added to the internet each day; it is nearly impossible to read them all, not to mention that many of them may be repeated. A novelty mining system can filter out the content that has already been collected and give decision makers more timely and useful information.

 

Conclusion

Even though a novelty mining system can save decision makers a lot of time, Tsai and Kwee’s model still requires people to categorize the web sites. This may take up a lot of human resources, which increases the cost of using the system. It also needs a huge database to hold the historical data used for comparison. Overall, the system and the idea may be beneficial for businesses, but the cost may be too high in countries with high wages.

 

References

Chen, X., & Wu, Y. (2005). Web Mining for Business Intelligence: Discovering Novel Association Rules from Competitors’ Websites. Idea Group Publishing.

McDonald, D. (2012). Value and benefits of text mining. Jisc.

Tsai, F., & Kwee, A. (2011). Database optimization for novelty mining of business blogs. Expert Systems with Applications, 38(9), 11040-11047.