Extract and count number of Duplicate rows


 
# 1  
Old 03-08-2013

Hi All,

I need to extract duplicate rows from a file, write these bad records into another file, and get a count of these bad records.
I have this command:
Code:
awk '
{s[$0]++}
END {
  for(i in s) {
    if(s[i]>1) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>>${TMP_BAD_DATA_DUPE_RECS}

but this doesn't solve my problem.
Code:
Input:
A
  A
  A
  B
  B
  C
Code:
Desired Output:
A
  A
  B
Count of bad records=3
But when I run my script I get output as:
A
B
Count of bad records=2, which is not correct.
As always, any help is appreciated.
# 2  
Old 03-08-2013
I hope that this is what you want:
Code:
awk '
{s[$0]++}
END {
  for(i in s) {
    for(j=1; j<s[i]; j++) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

# 3  
Old 03-08-2013
Yes, I tested it and it's working.
Thanks very much for the code. Can you please explain what it does, the for loop specifically?

Thanks again for the help!
# 4  
Old 03-08-2013
Code:
awk '
{s[$0]++}                  # populate an array: the keys are the distinct values in the file (A B C)
END {                      # and each element is the count of that key, e.g. if i=A --> s[i]=3
  for(i in s) {            # for each distinct value i in s
    for(j=1; j<s[i]; j++) {  # s[i] is the count of element i, so this loop
      print i              # prints element i exactly s[i]-1 times
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}
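Building on the loop above, here is a minimal sketch (input.txt stands in for ${TMP_DUPE_RECS}) that also emits the bad-record count in the same pass: since each key is printed s[i]-1 times, incrementing a counter alongside the print gives the total.

```shell
# Sketch: print each duplicated line s[i]-1 times and keep a
# running total, so "Count of bad records" falls out for free.
awk '
{s[$0]++}
END {
  bad = 0
  for (i in s)
    for (j = 1; j < s[i]; j++) {
      print i
      bad++
    }
  print "Count of bad records=" bad > "/dev/stderr"
}' input.txt
```

Writing the count to stderr keeps the bad-record file clean when stdout is redirected.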

# 5  
Old 03-08-2013
I don't see the need for the END clause for this problem. Doesn't:
Code:
awk 'c[$0]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

produce the same output?
While reading records, if a record has been seen before, it is printed immediately.
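The post-increment is the key: a quick sketch on the sample input from post #1 (without the leading whitespace) shows which occurrences fire the pattern.

```shell
# c[$0]++ evaluates to the old count: 0 (false) on the first
# occurrence of a line, non-zero (true) on every later one,
# so only the repeats are printed, in input order.
printf 'A\nA\nA\nB\nB\nC\n' | awk 'c[$0]++'
# prints:
# A
# A
# B
```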

But, looking at it again, this is the same as the script you initially provided that you said was not working.
If what you want is the input lines that are not duplicated that would be:
Code:
awk 'c[$0]++==0{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:
Code:
A
  A
  B
  C

which is not what was originally requested.

If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try:
Code:
awk 'c[$1]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:
Code:
  A
  A
  B

but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.

Last edited by Don Cragun; 03-08-2013 at 03:36 PM.. Reason: Noticed that output doesn't match original request...
# 6  
Old 03-09-2013
Sounds like you want to know:

1) identity of duplicated (bad) rows.
2) count of duplicated (bad) rows.

What about the much simpler:
Code:
$ uniq -c temp.x | grep -v " 1 "
      3 A
      2 B

If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
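That further step can be sketched with awk instead of grep (temp.x is the sample file name from above; the sort is added for safety, since uniq -c only counts adjacent duplicates):

```shell
# Count grouped duplicates, keep only lines seen more than once,
# and print count-1 (the number of extra copies) next to each line.
sort temp.x | uniq -c | awk '$1 > 1 {print $1 - 1, $2}'
```

Summing the first column of that output then gives the "Count of bad records" from the original request.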