Raveena.S, Nandini.V
This paper addresses the problem of near-duplicate documents and proposes a new method to detect them in a large document collection. The method consists of three steps: feature selection, similarity measurement, and discriminant derivation. Feature selection pre-processes the input document, calculates the weight of each term, and selects the most heavily weighted terms as the features of that document. The similarity measure computes the degree of similarity between two documents. Discriminant derivation uses an SVM classifier to learn a discriminant function from the document set through supervised learning. The resulting discriminant function decides, based on the similarity degree, whether a document is a near duplicate or not. This document-level feature selection gives better, more efficient results than sentence-level feature selection.
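The three steps can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the weighting scheme or the similarity measure, so TF-IDF weights and cosine similarity are assumed here. A fixed threshold stands in for the discriminant function that the paper learns with an SVM. All function names and the `top_k` and `threshold` parameters are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Pre-processing (assumed): lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_features(docs, top_k=5):
    """Step 1: weight each term and keep the top_k heaviest terms per document.

    TF-IDF is an assumed weighting scheme; the abstract only says that
    heavily weighted terms are selected as features.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    features = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {
            term: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        features.append(dict(top))
    return features


def cosine_similarity(a, b):
    """Step 2: similarity degree between two feature dicts (assumed: cosine)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(similarity, threshold=0.8):
    """Step 3 stand-in: a fixed threshold replaces the SVM-learned discriminant."""
    return similarity >= threshold
```

In the method described above, the decision boundary would be learned from labeled document pairs rather than fixed by hand. The threshold here only shows where that learned function plugs into the pipeline.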