Compute average ignoring outliers of different segments within a dat file using awk
I have data files that look like this, say data.txt
As you can see the data has various segments based on column 1. I use the following code to compute the mean of each segment and output the value of column 1 for that segment and the mean of the values of column 2 and some other things just so I can check am doing the right thing.
Unfortunately as you can see, my data has outliers in these segments. I need to remove these outliers before I compute the mean so that they don't mess up my results. I am using awk to process my data.
This is what I have been able to do so far, if I get one segment to a file say temp.txt I am able to use the following code to remove the outlier in that segment
But I need to able to do this within the code that computes the average so that my mean value excludes this outlier.
Any assistance will be highly appreciated.
Malandisa
Last edited by Scott; 09-18-2014 at 03:40 PM..
Reason: Moved from Programming forum
Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.
This can be done with associative arrays in awk i.e. if you are familiar with them and if you're not then I'd suggest reading up on them...
---------- Post updated at 01:12 PM ---------- Previous update was at 12:22 PM ----------
Quote:
Originally Posted by malandisa
Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.
Here is how you'd go about eliminating outliers from your data in order to compute the mean...
Thank you for your suggestion, but interestingly it suggestion for a small file, when I run this on a big file it complains about attempted division by 0. Let me attach a large file and you see what I am talking about. However I am very grateful for your response, it gives me a starting point.
Shamrock please help me learn something. What does this line
Code:
val[$1] = (val[$1] ? val[$1] "," $2 : $2)
do exactly in this code! Am sure once I understand this, I would be able to see where the problem could be
---------- Post updated at 05:58 AM ---------- Previous update was at 03:04 AM ----------
Okay I understand that this creates a string with the values in column 2 separated by a comma. so that it is latter split
---------- Post updated at 06:06 AM ---------- Previous update was at 05:58 AM ----------
and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
Code:
sum[i] -= val[i]
sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.
and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
Code:
sum[i] -= val[i]
sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.
Good catch...it should be sum[i] -= a[k] and I did it run the modified code on "temp.txt" and it gave me no such errors...
The awk below executes and is close (producing the first 4 columns in desired). However, when I add the sum of $7, I get nothing returned. Basically, I am trying to combine all the matching $4 in f1 and output them with the average of $7 in each match. Thank you :).
f1
... (2 Replies)
Im looking for a way to average the values in field 14 (when field 2 is equal to 2016) and fields 3 and 4 (when field 2 is equal to 2017).
Any help is appreciated.
001001 2016 33.22 38.19 48.07 51.75 59.77 67.68 70.86 72.21 66.92 53.67 42.31 40.15
001001 2017 ... (10 Replies)
Hi,
I have a file which looks like this:
FID IID MISS_PHENO N_MISS N_GENO F_MISS
12AB43131 12AB43131 N 17774 906341 0.01961
65HJ87451 65HJ87451 N 10149 906341 0.0112
43JJ21345 43JJ21345 N 2826 906341 0.003118I would... (11 Replies)
Hi All,
I am trying to run awk command on .DAT file and it is not working. The same command is working on .txt file:
Contents of the file ZZ_55555555_444444_ZZZZZZ_7777777_888_99.DAT:
HEADER|ZZ_55555555_444444_ZZZZZZ_7777777_888_99.DAT... (10 Replies)
Hi All,
I need the modification for the below mentioned code (found in one more post https://www.unix.com/shell-programming-scripting/27161-script-generate-average-values.html) to find the average values for all the columns(but for a specific rows) and print the averages side by side.
I have... (4 Replies)
Heya there,
A small selection of my data is shown below.
DATE TIME FRAC_DAYS_SINCE_JAN1
2011-06-25 08:03:20.000 175.33564815
2011-06-25 08:03:25.000 175.33570602
2011-06-25 ... (4 Replies)
Hi I want to use awk to print avg and st deviation but it does not go into a file for column 1 only.
I can do average and # of records but i cannot get st deviation.
awk '{sum+=$1} END { print "Average = ",sum/NR}'
thanks (1 Reply)
Hi there,
I need to split one huge file into separate files if the condition is fulfilled according to that the position between 97 and 98 matches with “IT” at the segment MAS. There is no delimiter file is fix-width with varous line length.
Could you please help me how I do split the file... (1 Reply)