awk to change value in field according to another
Post 303025705 by bakunin on Saturday 10th of November 2018 03:55:36 AM
Quote:
Originally Posted by Don Cragun
may be classified as "intron",
may be classified as "exon"
may be classified as "splicing"
It certainly helps if one understands what this is all about, and since I happen to have a biological researcher at home who explained it to me, here it is (errors and omissions are due to my limited understanding; I was told this is already the kindergarten version of what is really going on):

"exon", short for "expressed region", is a unit of a gene which codes for (part of) something like a protein. Think of a "gene" as a text describing something; an "exon" would then be one complete sentence of this text. When DNA is read (so that what it codes for is actually produced) it is copied to "RNA" pieces. This process includes RNA-splicing*), and the resulting pieces typically contain several whole exons.

"intron", short for "intragenic region", is a (more or less meaningless) part of the DNA between the exons. Think of it as some sort of punctuation and whitespace in the text. It is removed during RNA-splicing so that only the exons make it into the final RNA.

*) RNA-splicing: the process of producing RNA from DNA works in several steps. First a complete DNA piece is copied, including the introns. Then the real RNA is made from that copy by omitting the introns and leaving only the exons. This second step is, in fact, the "splicing".
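
To make the two steps concrete, here is a toy sketch in Python. It is my own illustration, not something from the thread: the segment list `GENE`, its made-up sequences and the helper names `transcribe` and `splice` are all assumptions, and real transcription and splicing are far more involved than a string replacement and a filter.

```python
# Toy model of a gene: an ordered list of (kind, sequence) segments.
# The sequences are invented for illustration only.
GENE = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGT"),
    ("exon", "TTTGAA"),
    ("intron", "GTCCAG"),
    ("exon", "TGA"),
]


def transcribe(segments):
    """Step 1: copy the whole piece, introns included (T becomes U in RNA)."""
    return [(kind, seq.replace("T", "U")) for kind, seq in segments]


def splice(pre_mrna):
    """Step 2: drop the introns and join the remaining exons."""
    return "".join(seq for kind, seq in pre_mrna if kind == "exon")


pre_mrna = transcribe(GENE)
mrna = splice(pre_mrna)
print(mrna)  # AUGGCCUUUGAAUGA
```

The intermediate `pre_mrna` still has all five segments; only after `splice` are the two introns gone.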

In the human genome about 1% is exons (so this in fact holds the protein-coding information), and about 25% is introns. The rest is intergenic (that is: between genes; it does not code for proteins, although parts of it help regulate the genes).
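
For a feeling of the proportions, a quick back-of-the-envelope calculation in Python. The two percentages are the ones given above; the genome size of roughly 3.1 billion base pairs is my own assumed round figure, so treat the absolute numbers as rough orders of magnitude only.

```python
# Assumed approximate size of the human genome in base pairs (round figure).
GENOME_BP = 3_100_000_000

# Percentages as stated in the post; everything else is intergenic.
shares = {"exons": 1, "introns": 25}
shares["intergenic"] = 100 - sum(shares.values())

for region, percent in shares.items():
    approx_bp = GENOME_BP * percent // 100
    print(f"{region:<10} {percent:>3}%  ~{approx_bp:,} bp")
```

So roughly three quarters of the sequence lies between genes.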

Thanks to my wife.

bakunin

Last edited by bakunin; 11-10-2018 at 05:05 AM.

BP_GENBANK2GFF3(1p)					User Contributed Perl Documentation				       BP_GENBANK2GFF3(1p)

NAME
       genbank2gff3.pl -- Genbank->gbrowse-friendly GFF3

SYNOPSIS
       genbank2gff3.pl [options] filename(s)

       # process a directory containing GenBank flatfiles
       perl genbank2gff3.pl --dir path_to_files --zip

       # process a single file, ignore explicit exons and introns
       perl genbank2gff3.pl --filter exon --filter intron file.gbk.gz

       # process a list of files
       perl genbank2gff3.pl *gbk.gz

       # process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
       curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
       | perl genbank2gff3.pl -noCDS -in stdin -out stdout \
       | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

       Options:
           --noinfer     -r  don't infer exon/mRNA subfeatures
           --conf        -i  path to the curation configuration file that contains user preferences
                             for Genbank entries (must be YAML format)
                             (if --manual is passed without --ini, user will be prompted to
                             create the file if any manual input is saved)
           --sofile      -l  path to the so.obo file to use for feature type mapping
                             (--sofile live will download the latest online revision)
           --manual      -m  when trying to guess the proper SO term, if more than
                             one option matches the primary tag, the converter will
                             wait for user input to choose the correct one
                             (only works with --sofile)
           --dir         -d  path to a list of genbank flatfiles
           --outdir      -o  location to write GFF files (can be 'stdout' or '-' for pipe)
           --zip         -z  compress GFF3 output files with gzip
           --summary     -s  print a summary of the features in each contig
           --filter      -x  genbank feature type(s) to ignore
           --split       -y  split output to separate GFF and fasta files for
                             each genbank record
           --nolump      -n  separate file for each reference sequence
                             (default is to lump all records together into one
                             output file for each input file)
           --ethresh     -e  error threshold for unflattener
                             set this high (>2) to ignore all unflattener errors
           --[no]CDS     -c  Keep CDS-exons, or convert to alternate gene-RNA-protein-exon
                             model. --CDS is default. Use --CDS to keep default GFF gene model,
                             use --noCDS to convert to g-r-p-e.
           --format      -f  Input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
                             (GenBank is default)
           --GFF_VERSION     3 is default, 2 and 2.5 and other Bio::Tools::GFF versions available
           --quiet           don't talk about what is being processed
           --typesource      SO sequence type for source (e.g. chromosome; region; contig)
           --help        -h  display this message

DESCRIPTION
       This script uses Bio::SeqFeature::Tools::Unflattener and Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene containment hierarchies mapped for optimal display in gbrowse.

       The input files are assumed to be gzipped GenBank flatfiles for refseq contigs. The files may contain multiple GenBank records. Either a single file or an entire directory can be processed. By default, the DNA sequence is embedded in the GFF but it can be saved into separate fasta file with the --split(-y) option.

       If an input file contains multiple records, the default behaviour is to dump all GFF and sequence to a file of the same name (with .gff appended). Using the 'nolump' option will create a separate file for each genbank record. Using the 'split' option will create separate GFF and Fasta files for each genbank record.

   Notes
       'split' and 'nolump' produce many files

       In cases where the input files contain many GenBank records (for example, the chromosome files for the mouse genome build), a very large number of output files will be produced if the 'split' or 'nolump' options are selected. If you do have lists of files > 6000, use the --long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to load the gff and/or fasta files.

       Designed for RefSeq

       This script is designed for RefSeq genomic sequence entries. It may work for third party annotations but this has not been tested. But see below, Uniprot/Swissprot works, EMBL and possibly EMBL/Ensembl if you don't mind some gene model unflattener errors (dgg).

       G-R-P-E Gene Model

       Don Gilbert worked this over with needs to produce GFF3 suited to loading to GMOD Chado databases. Most of the changes I believe are suited for general use. One main chado-specific addition is the --[no]cds2protein flag

       My favorite GFF is to set the above as ON by default (disable with --nocds2prot). For general use it probably should be OFF, enabled with --cds2prot.

       This writes GFF with an alternate, but useful Gene model, instead of the consensus model for GFF3

           [ gene > mRNA > (exon,CDS,UTR) ]

       This alternate is

           gene > mRNA > polypeptide > exon

       means the only feature with dna bases is the exon. The others specify only location ranges on a genome. Exon of course is a child of mRNA and protein/peptide.

       The protein/polypeptide feature is an important one, having all the annotations of the GenBank CDS feature, protein ID, translation, GO terms, Dbxrefs to other proteins.

       UTRs, introns, CDS-exons are all inferred from the primary exon bases inside/outside appropriate higher feature ranges. Other special gene model features remain the same.

       Several other improvements and bugfixes, minor but useful are included

       * IO pipes now work:
           curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...

       * GenBank main record fields are added to source feature, e.g. organism, date, and the sourcetype, commonly chromosome for genomes, is used.

       * Gene Model handling for ncRNA, pseudogenes are added.

       * GFF header is cleaner, more informative. --GFF_VERSION flag allows choice of v2 as well as default v3

       * GFF ##FASTA inclusion is improved, and CDS translation sequence is moved to FASTA records.

       * FT -> GFF attribute mapping is improved.

       * --format choice of SeqIO input formats (GenBank default). Uniprot/Swissprot and EMBL work and produce useful GFF.

       * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions and more flexible usage.

TODO
       Are these additions desired?

       * filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY

       * handle Entrezgene, other non-sequence SeqIO structures (really should change those parsers to produce consistent annotation tags).

   Related bugfixes/tests
       These items from Bioperl mail were tested (sample data generating errors), and found corrected:

         From: Ed Green <green <at> eva.mpg.de>
         Subject: genbank2gff3.pl on new human RefSeq
         Date: 2006-03-13 21:22:26 GMT
           -- unspecified errors (sample data works now).

         From: Eric Just <e-just <at> northwestern.edu>
         Subject: bp_genbank2gff3.pl
         Date: 2007-01-26 17:08:49 GMT
           -- bug fixed in genbank2gff3 for multi-record handling

       This error is for a /trans_splice gene that is hard to handle, and unflattner/genbank2 doesn't

         From: Chad Matsalla <chad <at> dieselwurks.com>
         Subject: genbank2gff3.PLS and the unflatenner - Inconsistent order?
         Date: 2005-07-15 19:51:48 GMT

AUTHOR
       Sheldon McKay (mckays@cshl.edu)

       Copyright (c) 2004 Cold Spring Harbor Laboratory.

   AUTHOR of hacks for GFF2Chado loading
       Don Gilbert (gilbertd@indiana.edu)

perl v5.14.2                                 2012-03-02                          BP_GENBANK2GFF3(1p)