An Effective Tokenization Algorithm for Information Retrieval Systems

Authors

Vikram Singh and Balwinder Saini, National Institute of Technology - Kurukshetra, India

Abstract

On the web, the amount of operational data has grown exponentially over the past few decades, and the expectations of data users have changed in proportion. Users expect deeper, more exact, and more detailed results. Retrieval of relevant results is always affected by the way documents are stored and indexed. Various techniques have been designed to index documents, and indexing is performed on the tokens identified within the documents. The tokenization process is chiefly responsible for identifying the tokens and their counts. In this paper, we propose an effective tokenization approach based on a training vector, and the results show the efficiency and effectiveness of the proposed algorithm. Tokenization of a given set of documents helps satisfy the user's information need more precisely and sharply reduces the search space; it is regarded as a part of information retrieval. Tokenization involves pre-processing the documents and generating their respective tokens; on the basis of these tokens, probabilistic IR computes its scores and yields a reduced search space. The number of tokens generated is the parameter used for result analysis.
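As a rough illustration of what identifying tokens and their counts can look like, the following Python sketch lowercases a document, splits it into alphanumeric runs, drops stop words, and counts the remaining tokens. It is a generic pre-processing example under stated assumptions, not the training-vector algorithm proposed in the paper; the names `STOP_WORDS`, `tokenize`, and `token_counts` and the stop-word list itself are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; not taken from the paper.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric runs, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def token_counts(text):
    """Return a mapping from each token to its number of occurrences."""
    return Counter(tokenize(text))


doc = "Tokenization of the documents is a part of information retrieval."
tokens = tokenize(doc)
counts = token_counts(doc)
print(len(tokens), dict(counts))
```

Here the number of tokens, `len(tokens)`, plays the role of the quantity the abstract names as its analysis parameter; a real system would likely add stemming and a fuller stop-word list.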

Keywords

Information Retrieval (IR), Indexing/Ranking, Stemming, Tokenization.

CS&IT Conference Proceedings
