awk function to remove lines that contain contents of another file

09-19-2017

Hi,

I'd be grateful for your help with the following. I have a file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:

Code:

ID     Col1    Col2  Col3....
a        4         2       8
b        5         6       1
c        8         4       1
d        3         5       9
e        8         5       2

I'd like to remove all the lines where, say, "b" and "d" appear in the first (ID) column. The output that I want is:

Code:

ID     Col1    Col2  Col3....
a        4         2       8
c        8         4       1
e        8         5       2

In reality, there are about 100,000 lines that I want to remove.
I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.

I am using grep at the moment, and while it works, it is proving painfully slow.

Code:

grep -v -f referencefile.txt file.txt

Is there a way of using awk (or anything else for that matter) to speed up the process?

Many thanks.

AB

aberg

09-19-2017

This can require a lot of memory, depending on how much you have in reference.txt.
Here is a simple awk version. It can be rewritten more compactly, but then it becomes hard to read for non-awkers.
We have posters who do that, which is fine as long as you can follow what they show you.

Code:

# code assumes that the reference.txt file has field #1 from inputfile

awk ' FILENAME=="reference.txt" {! arr[$0]++; next}  # create an array of values
         # parentheses so that ! negates the whole "in" test in every awk
         FILENAME=="inputfile" { if(!($1 in arr)) {print $0}; next} ' reference.txt inputfile > outputfile
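
For anyone who wants to try the idea on the sample data from the question first, here is a minimal sketch. The scratch directory and the variable name workdir are only for illustration, and the columns are separated by single spaces for brevity. The membership test is written with explicit parentheses, !($1 in arr), because some awk implementations may bind ! more tightly than in.

```shell
# Sketch: rebuild the sample data from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > inputfile <<'EOF'
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
EOF

# The IDs to remove, one per line.
printf '%s\n' b d > reference.txt

# While reading reference.txt: remember each line as an array key.
# While reading inputfile: print only lines whose first field is not a key.
awk 'FILENAME == "reference.txt" { arr[$0]++; next }
     FILENAME == "inputfile" { if (!($1 in arr)) print $0 }' reference.txt inputfile > outputfile

cat outputfile
```

The header line survives because "ID" is not listed in reference.txt.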

jim mcnamara

09-19-2017

Thanks Jim - that works. Much appreciated.

A.B.

aberg

09-19-2017

It would be interesting to know what performance gain you see. Can you time both approaches and post the results?
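
One possible way to run such a comparison is sketched below. Everything in it is an assumption for illustration: the data is synthetic, the sizes are kept small so it finishes quickly, and date +%s only gives whole-second resolution. The IDs are fixed width so that grep, which matches a pattern anywhere in the line, removes exactly the same lines as the awk lookup on the first field.

```shell
# Sketch only: synthetic data, sizes kept small so it runs quickly.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 5000 data lines with fixed-width IDs, plus a header line.
awk 'BEGIN {
    print "ID Col1 Col2 Col3"
    for (i = 1; i <= 5000; i++)
        printf "id%06d %d %d %d\n", i, i % 7, i % 11, i % 13
}' > file.txt

# Every 5th ID goes into the reference file (1000 IDs to remove).
awk 'BEGIN { for (i = 5; i <= 5000; i += 5) printf "id%06d\n", i }' > referencefile.txt

# Approach 1: the grep command from the first post.
start=$(date +%s)
grep -v -f referencefile.txt file.txt > out.grep
end=$(date +%s)
echo "grep: $((end - start)) s"

# Approach 2: awk lookup on the first field.
start=$(date +%s)
awk 'FILENAME == "referencefile.txt" { arr[$1]; next }
     !($1 in arr)' referencefile.txt file.txt > out.awk
end=$(date +%s)
echo "awk:  $((end - start)) s"

# Both outputs should be identical for this data.
cmp out.grep out.awk && echo "outputs match"
```

To get closer to the real case, raise the loop limits towards half a million lines and 100,000 IDs. Where the shell provides it, prefixing each command with time gives finer resolution than date +%s.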

RudiC

09-20-2017

I do not understand the ! and ++ in {! arr[$0]++; next}.
Replace it with {arr[$1]; next}. Not storing a value in the array saves some memory! Using $1 strips surrounding whitespace, which makes sense if there is invisible trailing space (and IDs with embedded spaces wouldn't work anyway when later compared with $1). The next jumps to the next input line, so there is no need to check FILENAME again. {print $0} is the default action when there is only a condition.

Code:

awk ' FILENAME=="reference.txt" {arr[$1]; next}  # create an array without values 
        !($1 in arr)' reference.txt inputfile > outputfile
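
The point about trailing space can be seen in a small sketch. The data, the scratch directory and the output file names out_whole_line and out_first_field are made up for illustration; the reference file deliberately has a space after "b".

```shell
# Sketch: a reference file with an invisible trailing space after "b".
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'ID Col1' 'a 4' 'b 5' 'c 8' 'd 3' > inputfile
printf '%s\n' 'b ' 'd' > reference.txt

# Keyed on $0: the key is "b " including the space, so the line "b 5" is kept.
awk 'FILENAME == "reference.txt" { arr[$0]; next }
     !($1 in arr)' reference.txt inputfile > out_whole_line

# Keyed on $1: field splitting drops the space, so "b" is removed as intended.
awk 'FILENAME == "reference.txt" { arr[$1]; next }
     !($1 in arr)' reference.txt inputfile > out_first_field

echo "keyed on \$0:"; cat out_whole_line
echo "keyed on \$1:"; cat out_first_field
```

With the whole line as the key only "d" is removed, while with the first field as the key both "b" and "d" are removed.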

MadeInGermany
