Tuesday, June 23, 2009

How To Remove Duplicate Records In A Dataset?



Lets create a dataset-m arks

/*reading dataset from external file in temporary libary*/
data marks;
infile 'c:\users\bobby\desktop\marks.txt';
input id gender$ marks_m1 marks_m2;
/*sending output*/
proc print;
run;

sas output of the dataset marks is—

id gender marks_m1 marks_m2

1 m 50 50
1 m 50 50
2 f 60 65
2 f 70 70
3 f 80 70
3 f 90 70
3 f 80 80
4 m 85 60
5 m 90 60
5 m 90 90
5 m 70 100
6 f 75 80
6 f 75 80
7 m 80 80
8 m 90 70
9 m 90 70
9 m 90 80
10 f 60 70
10 f 90 80

As you can see that there are duplicate records in the dataset.few records are identical (all values same) while others with id but other values different.let us see how we can eliminate them…



/*eliminating duplicate records*/
/*using nodupkey options*/
proc sort data=marks out=no_dup nodupkey ;
by id;
run;
/*using dupkey option of proc sort deleted duplicate observations/records with same key
variable/by variable*/

proc print data=no_dup noobs;
run;


output of no_dup dataset—



id gender marks_m1 marks_m2

1 m 50 50
2 f 60 65
3 f 80 70
4 m 85 60
5 m 90 60
6 f 75 80
7 m 80 80
8 m 90 70
9 m 90 70
10 f 60 70

We have used the nodupkey option of proc sort which deleted records with same value of key variable(id in this case).

Now ,there is other option to eliminate duplicate rows ..
The ‘noduprecs’ option ….this option is useful when we want to delete identical records.

proc sort data=marks out=no_dup noduprecs ;
by id;
run;


id gender marks_m1 marks_m2

1 m 50 50
2 f 60 65
2 f 70 70
3 f 80 70
3 f 90 70
3 f 80 80
4 m 85 60
5 m 90 60
5 m 90 90
5 m 70 100
6 f 75 80
7 m 80 80
8 m 90 70
9 m 90 70
9 m 90 80
10 f 60 70
10 f 90 80



/*in the output we see that there are still observations/records with duplicate keyword
,this is where noduprecs differs from nodupkey.
nodupkey eliminates records where it detects repeated by varialbe value whereas noduprecs
deletes only identical records i.e where all values are exactly the same*/


*warning when using nodeprecs*/
/*though noduprecs deletes records with identical values but still there are cases when it
deviates from its normal working...let us see in the following example--*/..

the given dataset is…


id height weight

1 123 45
1 124 56
1 123 45
1 125 68

Two records are identical.to remove second record we use nodeprecs option..the output is…

Obs id height weight

1 1 123 45
2 1 124 56
3 1 123 45
4 1 125 68

/*here in the output you can see the duplicate still exists,this is where you need to be cautious,
thing to remember is that noduprecs deletes only successive duplicates .*/
one way to solve this situation is to consider more than one variable in by statement to make sure that identical records are successive.

Send link to --

TopBlogDir.blogspot.com button Education Blog Directory Academic,  Learning & Educational Blogs - Blog Catalog Blog Directory SEO Court Directory myblog Visit blogadda.com to discover Indian blogs Add to Technorati Favorites Subscribe with Bloglines DigNow.net

Join My Community at MyBloglog!

Free Web Directory - Add Your Link
The Little Web Directory
Monster Directory A List Sites