Automated Solution for Normalization of Duplicate Records from Multiple Data Sources

IJARCSSE

Title Automated Solution for Normalization of Duplicate Records from Multiple Data Sources
 
Creator Sri, K. Jaya
Rao, K. Ramachandra
 
Description There has been exponential growth of data over the last decade in both the public and private domains. The aim of this work is to identify duplicate records that represent the same real-world entity using a mechanism that requires no training data: an unsupervised method in which no manual labeling is needed. Detecting records from multiple data sources that are approximate duplicates is an important task. Querying multiple data sources inevitably returns duplicates: when information is retrieved from different sources, duplicates arise from differing format specifications, data entry errors, abbreviations, and differences in record schemas, and unintentional duplication among records drawn from millions of records in other sources can hardly be avoided. Current duplicate-detection techniques are supervised and require training data, which makes them inapplicable to the real-time scenario in which the records to match are query results generated dynamically online. I present Dynamic Duplicate Detection: given a query, the algorithm effectively identifies duplicates among the query result records from multiple data sources. The proposed algorithm starts from the non-duplicate set and uses a weighted component similarity summing (WCSS) classifier together with an OSVM classifier to iteratively identify duplicates in the query results. In addition to these two classifiers, which come from the Unsupervised Duplicate Detection (UDD) algorithm, a third classifier, a blocking classifier, is used to help detect duplicate records. Experiments on a data set verify the effectiveness of the algorithm in detecting duplicates. (An illustrative sketch of the iterative loop appears below the record.)
 
Publisher International Journal of Advanced Research in Computer Science and Software Engineering
 
Date 2019-10-03
 
Type info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
Peer-reviewed Article
 
Format application/pdf
 
Identifier http://ijarcsse.com/index.php/ijarcsse/article/view/1068
 
Source International Journal of Advanced Research in Computer Science and Software Engineering; Vol 9, No 9 (2019): September 2019; 7-11
ISSN 2277-128X; 2277-6451
DOI 10.23956/ijarcsse.v9i9
 
Language eng
 
Relation http://ijarcsse.com/index.php/ijarcsse/article/view/1068/622
 
Rights Copyright (c) 2019 International Journal of Advanced Research in Computer Science and Software Engineering
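
To make the iterative scheme in the abstract concrete, the following is a minimal Python sketch of a UDD-style loop: field weights are recomputed from the current non-duplicate set, and a weighted component similarity sum classifies the remaining candidate pairs. The record schema (FIELDS), the character-level similarity measure, and both thresholds are illustrative assumptions, not the authors' implementation; the OSVM and blocking classifiers are omitted for brevity.

    from difflib import SequenceMatcher

    FIELDS = ["title", "author", "year"]  # hypothetical record schema

    def field_sim(a, b):
        # Character-level similarity in [0, 1] between two field values.
        return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

    def pair_sims(r1, r2):
        # Per-field similarity vector for a pair of records (dicts).
        return [field_sim(r1[f], r2[f]) for f in FIELDS]

    def wcss_weights(non_dup_pairs):
        # Weight each field by its average dissimilarity over the known
        # non-duplicates: a field that reliably differs on non-duplicates
        # is a strong duplicate signal when it does match.
        if not non_dup_pairs:
            return [1.0 / len(FIELDS)] * len(FIELDS)  # uniform fallback
        totals = [0.0] * len(FIELDS)
        for r1, r2 in non_dup_pairs:
            for i, s in enumerate(pair_sims(r1, r2)):
                totals[i] += 1.0 - s
        norm = sum(totals) or 1.0
        return [t / norm for t in totals]

    def detect_duplicates(potential_pairs, non_dup_pairs,
                          dup_thresh=0.85, nondup_thresh=0.30):
        # Iteratively identify duplicates, starting from the non-duplicate
        # set: recompute field weights, move high-scoring pairs to the
        # duplicate set and low-scoring pairs to the non-duplicate set,
        # and stop when a pass finds no new duplicates.
        duplicates, non_dups = [], list(non_dup_pairs)
        remaining = list(potential_pairs)
        while True:
            weights = wcss_weights(non_dups)
            def score(pair):
                return sum(w * s for w, s in zip(weights, pair_sims(*pair)))
            found = [p for p in remaining if score(p) >= dup_thresh]
            if not found:
                return duplicates
            duplicates += found
            non_dups += [p for p in remaining if score(p) <= nondup_thresh]
            remaining = [p for p in remaining
                         if nondup_thresh < score(p) < dup_thresh]

For example, with hypothetical query results from two sources, pairs drawn from within one source can seed the non-duplicate set while cross-source pairs are the candidates to classify:

    src_a = [{"title": "Data Cleaning", "author": "J. Smith", "year": "2018"},
             {"title": "Query Routing", "author": "A. Lee", "year": "2017"}]
    src_b = [{"title": "Data Cleaning!", "author": "J. Smith", "year": "2018"}]
    candidates = [(a, b) for a in src_a for b in src_b]
    non_dup = [(src_a[0], src_a[1])]
    print(detect_duplicates(candidates, non_dup))  # flags the near-identical pair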
 
