Perl to adjust coordinates based on repeat string

08-18-2018


In the file below I am trying to count the given repeats of A, T, C, or G in each sequence. Each sequence follows its > header line, and a run of repeats can wrap from one line onto the next. For example, the first line ends in a T and the next line starts with 3 more. I think the Perl below would work, but I also want to report the position range of each repeat using the range= field in the header, where the first number is the leftmost position (in the first sequence, the leading aaa) and the second number is the rightmost (in the first sequence, the trailing taa). Using 4T as an example, the result is shown in the example output.

The t{4} is the repeat, and it can change: if I am after 7g, that would be g{7}. Lowercase letters in the sequence are counted along with capital letters if they satisfy the criteria. Matches should be reported in the order they are seen, as that is important to know. For example, 4t occurs at chr2:166911127-166911130; even though there are 6 t in that stretch, only the first 4 satisfy the criteria and are counted. An example is shown under "output for two sequences". Thank you.

file

Code:

>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT
TTTGAATCACTTATAAAAATTATTGCAAGGGGATTCTGTTTAGAAGATTT
TACTTTCCTTCGGGATCCATGGAACTGGCTCGATTTCACTGTCATTACAT
TTGCgtaagtgccttttttgaaactttaa
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttgtgtgtgaactccctattacagGTACGTCACAGAGTTTGTGGACCTG
GGCAATGTCTCGGCATTGAGAACATTCAGAGTTCTCCGAGCATTGAAGAC

example output

Code:

TTTT chr2:166911173-166911176

description

Code:

The first T is 50 bases in, so 50 is added to 166911123 to give the new value after the ":". The last T is 53 bases in, so 53 is added to 166911123 to give the new value after the "-".
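The arithmetic in the description can be sketched in Perl as follows. This is only an illustration of the convention stated in the post (the 1-based offset is added directly to the range start); the sub name `offset_to_coords` is hypothetical and not part of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: add the 1-based offsets of a match to the range start,
# following the convention described in the post.
sub offset_to_coords {
    my ($range_start, $first, $last) = @_;
    return ($range_start + $first, $range_start + $last);
}

my ($from, $to) = offset_to_coords(166911123, 50, 53);
print "TTTT chr2:$from-$to\n";   # prints "TTTT chr2:166911173-166911176"
```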

perl

Code:

perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(t{4})/gi' file
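The `-076` switch sets the input record separator to ">" (octal 076), so each read returns one whole FASTA record. A minimal sketch of the same idea as a script, assuming no header contains a ">" character; the sub name `read_fasta` and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Split FASTA text into [header, sequence] pairs using ">" as record separator.
sub read_fasta {
    my ($fh) = @_;
    local $/ = '>';                     # same effect as the -076 switch
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;                     # removes the trailing ">"
        next unless length $rec;        # the first read is empty
        my ($header, @seq_lines) = split /\n/, $rec;
        push @records, [ $header, join('', @seq_lines) ];
    }
    return @records;
}

# Made-up sample data, read from an in-memory filehandle.
my $text = ">rec1 range=chr1:100-107\nacgt\nTTTT\n>rec2 range=chr1:200-203\nggcc\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
for my $r (read_fasta($fh)) {
    print "$r->[0] => $r->[1]\n";
}
```

Joining the sequence lines before matching is what lets a repeat that wraps across a line break be found as one run.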

output for two sequences

Code:

tttt chr2:166911127-166911130
TTTT chr2:166911173-166911176

Last edited by cmccabe; 08-18-2018 at 01:33 PM.. Reason: fixed format

cmccabe


08-20-2018


In case it is useful:

Code:

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
while(<>) {
    chomp;
    if(/^>/){
        range_match(\@fasta) if @fasta;
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
# Process the final record (skipped if the input was empty).
range_match(\@fasta) if @fasta;

sub range_match {
    my $fref = shift;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

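    # Match four consecutive t/T; /g resumes the search after each match.
    while($seq =~ /[tT]{4}/g) {
        my $first = $-[0] + 1;    # 1-based position of the first matched base
        my $last = $+[0];         # 1-based position of the last matched base
        printf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
    }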
}

Code:

$ ./read_fasta.pl fasta.file
tttt range=chr2:166911127-166911130
tttt range=chr2:166911142-166911145
TTTT range=chr2:166911173-166911176
TTTT range=chr2:166911221-166911224
tttt range=chr2:166911287-166911290
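For readers unfamiliar with the match-offset arrays behind `$first` and `$last`: after a successful match, `$-[0]` holds the 0-based offset where the match starts and `$+[0]` the offset just past its end. A small sketch on a made-up string (not taken from the thread's data); the sub name `find_runs` is hypothetical:

```perl
use strict;
use warnings;

# Return [matched text, 1-based start, 1-based end] for each run of 4 t/T.
sub find_runs {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /[tT]{4}/g) {
        # $-[0] is the 0-based start; $+[0] is one past the 0-based end,
        # which equals the 1-based position of the last matched character.
        push @hits, [ substr($seq, $-[0], $+[0] - $-[0]), $-[0] + 1, $+[0] ];
    }
    return @hits;
}

printf "%s %d-%d\n", @$_ for find_runs('aaattttttgg');   # prints "tttt 4-7"
```

Only one match is reported for the six t's because the search resumes after the first four, leaving just two.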

Aia

08-20-2018


Thank you @Aia, your Perl is much more complete than mine. Can the header be printed with each sequence, and if a sequence contains no repeat, can Nothing Detected be printed instead? Thank you very much.

Code:

>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttt range=chr2:166911127-166911130
tttt range=chr2:166911142-166911145
TTTT range=chr2:166911173-166911176
TTTT range=chr2:166911221-166911224
tttt range=chr2:166911287-166911290
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
Nothing Detected

cmccabe


08-20-2018


Quote:

Originally Posted by cmccabe

[...] Can the header be included with each sequence and if there is no repeat in that sequence Nothing Detected results? [...]

Two changes, then: make range_match() return its matches, and add a display() subroutine.

Code:

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
my @m_results = ();

while(<>) {
    chomp;
    if(/^>/){
        if (@fasta) {
            @m_results = range_match(\@fasta);
            display();
        }
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
# Process the final record (skipped if the input was empty).
if (@fasta) {
    @m_results = range_match(\@fasta);
    display();
}

sub range_match {
    my $fref = shift;
    my @matches;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

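    # Match four consecutive t/T; /g resumes the search after each match.
    while($seq =~ /[tT]{4}/g) {
        my $first = $-[0] + 1;    # 1-based position of the first matched base
        my $last = $+[0];         # 1-based position of the last matched base
        push @matches, sprintf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
    }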
    return @matches;
}

sub display {
    print "$fasta[0]\n";
    print @m_results ? @m_results : "Nothing Detected\n";
    @m_results = ();
}

Aia

08-20-2018


The line shown at the end of this post, specifically the [tT]{4}, is what I change by hand to capture stretches of aA, tT, cC, or gG of length 6 to 25. Is there a way to loop through the possibilities below, with each one changing automatically? It is always these letters; only the length of the repeat changes. For example, [aA]{8} would capture 8 a/A and [aA]{18} would capture 18 a/A. Thank you very much for your help, I really appreciate it.

That is:

Code:

[aA] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25}
[tT] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25}
[cC] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25}
[gG] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25}

while($seq =~ /[tT]{4}/g)
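One possible alternative to testing every letter and length separately is a single pattern with a backreference. Note that this is a different technique from changing the quantifier by hand: it reports each run once, at its full length (capped at 25), rather than once per tested length. The sub name `homopolymer_runs` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Find homopolymer runs of 6 to 25 identical bases (case-insensitive).
# Returns [run text, 1-based start, 1-based end] per run.
sub homopolymer_runs {
    my ($seq) = @_;
    my @runs;
    # ([acgt]) captures one base; \1{5,24} requires 5 to 24 more of the same.
    while ($seq =~ /([acgt])\1{5,24}/gi) {
        push @runs, [ substr($seq, $-[0], $+[0] - $-[0]), $-[0] + 1, $+[0] ];
    }
    return @runs;
}

# Made-up sequence: eight a's at positions 5-12, six T's at positions 15-20.
for my $run (homopolymer_runs('ccgtaaaaaaaaggTTTTTTc')) {
    printf "%s %d-%d\n", @$run;
}
```

A run longer than 25 bases would be reported as a 25-base match followed by whatever remains, so the upper bound may need adjusting for such data.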

cmccabe


08-20-2018


Maybe

Code:

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
my @m_results = ();
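my @regexs;
my @rows = ('a', 't', 'c', 'g');   # bases to search for
my @columns = (6 .. 25);           # repeat lengths to search for

# Build one pattern per base/length pair: "a{6}", "a{7}", ... "g{25}".
for my $r (@rows) {
    for my $c (@columns) {
        push @regexs, "$r\{$c\}";
    }
}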

while(<>) {
    chomp;
    if(/^>/){
        if (@fasta) {
            @m_results = range_match(\@fasta);
            display();
        }
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
# Process the final record (skipped if the input was empty).
if (@fasta) {
    @m_results = range_match(\@fasta);
    display();
}

sub range_match {
    my $fref = shift;
    my @matches;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

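    # Try every base/length pattern in turn; /i matches both cases.
    for my $regex (@regexs) {
        while($seq =~ /$regex/ig) {
            my $first = $-[0] + 1;    # 1-based position of the first matched base
            my $last = $+[0];         # 1-based position of the last matched base
            push @matches, sprintf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
        }
    }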
    return @matches;
}

sub display {
    print "$fasta[0]\n";
    print @m_results ? @m_results : "Nothing Detected\n";
    @m_results = ();
}

Aia

08-21-2018


Thank you very much, it works great. Very nice.

------ Post updated at 02:17 PM ------

Looking into the data, there are a couple of modifications that I am unsure how to incorporate. The stretches are detected, and the loop is amazing and very helpful. However, when strand=- the math is slightly different and the repeat letters are reversed. Hopefully the examples help to show this: if strand=-, the rightmost bases of the sequence correspond to the start of the range (the first number after the :), and the reported repeat is the complement of the letters in the file:

Code:

that is:
A or a is T or t
T or t is A or a
C or c is G or g
G or g is C or c
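The letter swap listed above can be sketched with Perl's `tr///` operator, which maps each character in the first list to the character at the same position in the second. The sub name `complement` is hypothetical. It complements without reversing, which makes no difference for a run of one repeated letter.

```perl
use strict;
use warnings;

# Complement each base, preserving case (A<->T, C<->G).
sub complement {
    my ($seq) = @_;
    (my $comp = $seq) =~ tr/ATCGatcg/TAGCtagc/;
    return $comp;
}

print complement('TTTTTT'), "\n";   # prints "AAAAAA"
print complement('ttAAcg'), "\n";   # prints "aaTTgc"
```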

Thank you very much


input

Code:

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct
>hg19_ncbiRefSeq_SCN1A range=chr2:166895908-166896131 5'pad=25 3'pad=25 strand=- repeatMasking=none
tagattaattttgttttgatcttagGTTTTCACTGGGATCTTTACAGCAG
AAATGTTTCTGAAAATTATTGCCATGGATCCTTACTATTATTTCCAAGAA
GGCTGGAATATCTTTGACGGTTTTATTGTGACGCTTAGCCTGGTAGAACT
TGGACTCGCCAATGTGGAAGGATTATCTGTTCTCCGTTCATTTCGATTGg
taaaaaaaaaaaaaaaaagcacca
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551
>hg19_ncbiRefSeq_SLC2A1 range=chr1:43393251-43393504 5'pad=25 3'pad=25 strand=- repeatMasking=none
No Homopolymers Detected

desired output

Code:

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437
aaaaaa chr2:166905467-166905472
>hg19_ncbiRefSeq_SCN1A range=chr2:166895908-166896131 5'pad=25 3'pad=25 strand=- repeatMasking=none
ttttttttttttttt range=chr2:166896110-166896124
tttttttttttttttt range=chr2:166896110-166896125
ttttttttttttttttt range=chr2:166896110-166896126
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550
>hg19_ncbiRefSeq_SLC2A1 range=chr1:43393251-43393504 5'pad=25 3'pad=25 strand=- repeatMasking=none
Nothing Detected

If the > line has strand=+ then everything is good except that the calculated range, such as range=chr2:166170446-166170451, is off by 1. An example of the current output, followed by the desired output:

Code:

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551

Code:

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550
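The off-by-one for strand=+ can be sketched as follows, assuming the range start is the coordinate of the first base in the sequence (which is what the desired output above implies). The sub name `plus_strand_coords` is hypothetical, and the positions 36 and 41 are inferred by subtracting the range start from the current output, since the sequence for this record is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper for strand=+ records.
# $first and $last are 1-based positions within the sequence.
sub plus_strand_coords {
    my ($range_start, $first, $last) = @_;
    # Position 1 is the range start itself, so subtract 1 after adding.
    return ($range_start + $first - 1, $range_start + $last - 1);
}

my ($from, $to) = plus_strand_coords(166168510, 36, 41);
print "TTTTTT range=chr2:$from-$to\n";   # prints "TTTTTT range=chr2:166168545-166168550"
```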

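If the > line has strand=- then the math is slightly different, in that the first number is really the end position (the sequence as shown runs from the end of the range back to the start). An example input, followed by the desired output: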

Code:

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

Code:

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437
aaaaaa chr2:166905467-166905472
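A minimal sketch of the strand=- case, under the assumption that the displayed sequence runs from the range end back to the range start. The sub name `minus_strand_coords` is hypothetical, and the pattern is fixed at six t/T purely for illustration. With the sequence from the example, it reproduces the two desired lines.

```perl
use strict;
use warnings;

# Hypothetical helper for strand=- records: positions count back from the
# range end. $first and $last are 1-based positions within the sequence.
sub minus_strand_coords {
    my ($range_end, $first, $last) = @_;
    return ($range_end - $last + 1, $range_end - $first + 1);
}

my $seq = join '',
    'catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT',
    'TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata',
    'atttgtgtgtatct';

my @lines;
while ($seq =~ /[tT]{6}/g) {
    my ($from, $to) = minus_strand_coords(166905484, $-[0] + 1, $+[0]);
    # Report the complement of the matched letters, preserving case.
    (my $run = substr($seq, $-[0], $+[0] - $-[0])) =~ tr/ATCGatcg/TAGCtagc/;
    push @lines, [ $from, "$run chr2:$from-$to" ];
}

# Sort by genomic start so the output ascends along the chromosome.
print "$_->[1]\n" for sort { $a->[0] <=> $b->[0] } @lines;
```

The sort is needed because matches found left to right in a strand=- sequence come out in descending genomic order.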

If the result is Nothing Detected, the line stays as is. Thank you again.

Last edited by cmccabe; 08-21-2018 at 05:05 PM.. Reason: added details

cmccabe

