psi-cd-hit-2d-g1(1) [debian man page]

PSI-CD-HIT-2D-G1.PL(1)						   User Commands					    PSI-CD-HIT-2D-G1.PL(1)

NAME
       psi-cd-hit-2d-g1.pl - runs an algorithm similar to CD-HIT, but uses
       BLAST to calculate similarities, in db1 or db2 format

DESCRIPTION
       Usage: psi-cd-hit-2d [Options]

   Options
       -i        in_dbname, required
       -o        out_dbname, required
       -c        clustering threshold (sequence identity), default 0.3
       -ce       clustering threshold (blast expect), default -1, meaning
                 that by default the expect threshold is not used; with a
                 positive value, the program clusters sequences if their
                 similarity meets either the identity threshold or the
                 expect threshold
       -L        coverage of shorter sequence (aligned / full), default 0.0
       -M        coverage of longer sequence (aligned / full), default 0.0
       -R        (1/0) use psi-blast profile? default 0
                 perform psi-blast / pdb-blast type search
       -G        (1/0) use global identity? default 1
                 sequence identity is calculated as total identical residues
                 of local alignments / length of shorter sequence
                 if you prefer to use -G 0, it is suggested that you also
                 use -L, such as -L 0.8, to prevent very short matches
       -d        length of description line in the .clstr file, default 30
                 if set to 0, it takes the fasta defline and stops at the
                 first space
       -l        length_of_throw_away_sequences, default 10
       -p        profile search parameters, default
                 "-a 2 -d nr80 -j 3 -F F -e 0.001 -b 500 -v 500"
       -bfdb     profile database, default nr80
       -s        blast search parameters, default
                 "-F F -e 0.000001 -b 100000 -v 100000"
       -be       blast expect cutoff, default 0.000001
       -b        filename of a list of hosts, to run this program in
                 parallel with ssh calls; you need to provide a list of
                 hosts
       -pbs      number of jobs to send each time through the PBS queuing
                 system; you cannot use both ssh and pbs at the same time
       -k        (1/0) keep blast raw output file, default 1
       -rs       steps between saves of the restart file and clustering
                 output, default 5000; each time 5000 sequences have been
                 processed, the program writes a restart file and the
                 current clustering information
       -restart  restart file; read in a restart file
                 if the program crashed, stopped, or was terminated, you can
                 restart it by adding the option "-restart sth.restart"
       -rf       steps between reformatting of the blast database, default
                 200,000; once the program has clustered 200,000 sequences,
                 it removes them from the sequence pool and reformats the
                 blast db to save time
       -local    directory of local blast db; when run in parallel with ssh
                 (not pbs), I can copy blast dbs to local drives on each
                 node to save blast db reading time, BUT IT MAY NOT BE
                 FASTER
       -J        job, job_file; executes specific jobs such as parsing blast
                 output only
                 DON'T use it, it is only used by this program itself
       -single   file of ids that you know are singletons, so I won't run
                 them as queries
       -i2       second input database
       -blastn   run blastn, default 0
       -lo       how much longer a sequence in db2 may be than a sequence in
                 db1 within a cluster, default 0, meaning that a sequence in
                 db2 should be <= the sequences in db1 in a cluster

       ==============================
       by Weizhong Li, liwz@sdsc.edu
       ==============================

       If you find cd-hit useful, please kindly cite:

       "Clustering of highly homologous sequences to reduce the size of
       large protein database", Weizhong Li, Lukasz Jaroszewski & Adam
       Godzik, Bioinformatics, (2001) 17:282-283

       "Cd-hit: a fast program for clustering and comparing large sets of
       protein or nucleotide sequences", Weizhong Li & Adam Godzik,
       Bioinformatics, (2006) 22:1658-1659

psi-cd-hit-2d-g1.pl 4.6-2012-04-25	   April 2012	    PSI-CD-HIT-2D-G1.PL(1)
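
The -c, -ce, -L, -M and -G 1 descriptions above can be read as a small decision rule. The following is a minimal Python sketch of that reading, not the script's actual Perl code. The Hit record and the function names are hypothetical, and two points are assumptions that the text does not spell out: the coverage cutoffs are treated as necessary conditions, and all comparisons are inclusive.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of the BLAST local alignments between two sequences."""
    identical: int      # total identical residues over the local alignments
    aligned_short: int  # aligned residues on the shorter sequence
    aligned_long: int   # aligned residues on the longer sequence
    len_short: int      # full length of the shorter sequence
    len_long: int       # full length of the longer sequence
    expect: float       # BLAST expect value of the hit


def global_identity(hit: Hit) -> float:
    # -G 1: total identical residues of local alignments / length of shorter seq
    return hit.identical / hit.len_short


def meets_thresholds(hit: Hit, c: float = 0.3, ce: float = -1.0,
                     cov_short: float = 0.0, cov_long: float = 0.0) -> bool:
    # -L and -M: coverage (aligned / full) of the shorter and longer sequence.
    # Assumption: both cutoffs must hold before identity or expect is considered.
    if hit.aligned_short / hit.len_short < cov_short:
        return False
    if hit.aligned_long / hit.len_long < cov_long:
        return False
    by_identity = global_identity(hit) >= c
    # -ce is ignored unless it is positive (the default is -1).
    by_expect = ce > 0 and hit.expect <= ce
    return by_identity or by_expect


# Hypothetical hit: 45 identical residues, shorter sequence of length 100.
example = Hit(identical=45, aligned_short=90, aligned_long=90,
              len_short=100, len_long=200, expect=1e-20)
print(global_identity(example), meets_thresholds(example))
```

With the default -ce of -1 only the identity threshold can accept a pair; a positive -ce adds the expect value as an alternative route into a cluster.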

CD-HIT-2D-PARA.PL(1)						   User Commands					      CD-HIT-2D-PARA.PL(1)

NAME
       cd-hit-2d-para.pl - divide a big clustering job into pieces to run
       cd-hit-2d or cd-hit-est-2d jobs

SYNOPSIS
       cd-hit-2d-para.pl options

DESCRIPTION
       This script divides a big clustering job into pieces and submits the
       jobs to remote computers over a network to run them in parallel.
       After all the jobs have finished, the script merges the clustering
       results as if you had run a single cd-hit-2d or cd-hit-est-2d.

       You can also use it to divide big jobs on a single computer if your
       computer does not have enough RAM (with the --L option).

       Requirements:

       1  When running this script over a network, the directory where you
          run the script and the input files must be available on all the
          remote hosts under an identical path.
       2  If you choose "ssh" to submit jobs, you must have passwordless ssh
          to every remote host; see the ssh manual for how to set up
          passwordless ssh.
       3  I suggest using a queuing system instead of ssh; I currently
          support PBS and SGE.
       4  cd-hit-2d, cd-hit-est-2d, cd-hit-div and cd-hit-div.pl must be in
          the same directory as this script.

   Options
       -i        input filename for 1st db in fasta format, required
       -i2       input filename for 2nd db in fasta format, required
       -o        output filename, required
       --P       program, "cd-hit-2d" or "cd-hit-est-2d", default
                 "cd-hit-2d"
       --B       filename of list of hosts, required unless the --Q or --L
                 option is supplied
       --L       number of cpus on local computer, default 0
                 when you are not running it over a cluster, you can use
                 this option to divide a big clustering job into small
                 pieces; I suggest you just use "--L 1" unless you have
                 enough RAM for each cpu
       --S       number of segments to split 1st db into, default 2
       --S2      number of segments to split 2nd db into, default 8
       --Q       number of jobs to submit to the queuing system, default 0
                 by default, the program uses ssh mode to submit remote jobs
       --T       type of queuing system, "PBS" and "SGE" are supported,
                 default PBS
       --R       restart file, used after a run has crashed
       -h        print this help

       More cd-hit-2d/cd-hit-est-2d options can be specified on the command
       line.

       Questions, bugs: contact Weizhong Li at liwz@sdsc.edu

cd-hit-2d-para.pl 4.6-2012-04-25	   April 2012	      CD-HIT-2D-PARA.PL(1)
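
The rules in the option list above (three required file arguments, two allowed programs, and a host list that is needed only when neither local nor queue mode is chosen) can be checked before the script is launched. The following is a minimal Python sketch that assembles an argument list using only options named on this page. The helper name build_para_command, the file names, and the assumption that cd-hit-2d-para.pl is found on the PATH are all hypothetical; the sketch only builds the list and does not run anything.

```python
from typing import List, Optional


def build_para_command(db1: str, db2: str, out: str,
                       program: str = "cd-hit-2d",
                       hosts_file: Optional[str] = None,
                       local_cpus: int = 0,
                       queue_jobs: int = 0,
                       queue_type: str = "PBS",
                       seg1: int = 2,
                       seg2: int = 8) -> List[str]:
    """Assemble an argv list for cd-hit-2d-para.pl (hypothetical helper)."""
    if program not in ("cd-hit-2d", "cd-hit-est-2d"):
        raise ValueError('--P must be "cd-hit-2d" or "cd-hit-est-2d"')
    if queue_type not in ("PBS", "SGE"):
        raise ValueError('--T must be "PBS" or "SGE"')
    # --B is required unless --Q or --L is supplied.
    if hosts_file is None and local_cpus == 0 and queue_jobs == 0:
        raise ValueError("--B is required unless --Q or --L is supplied")

    cmd = ["cd-hit-2d-para.pl",          # assumed to be on the PATH
           "-i", db1, "-i2", db2, "-o", out,
           "--P", program,
           "--S", str(seg1), "--S2", str(seg2)]
    if hosts_file is not None:
        cmd += ["--B", hosts_file]
    if local_cpus:
        cmd += ["--L", str(local_cpus)]
    if queue_jobs:
        cmd += ["--Q", str(queue_jobs), "--T", queue_type]
    return cmd


# Single-machine run with the suggested "--L 1"; the file names are made up.
print(" ".join(build_para_command("db1.fasta", "db2.fasta", "db2.novel",
                                  local_cpus=1)))
```

Returning a list rather than a single string keeps each argument separate, so the result could be handed to subprocess.run without shell quoting concerns.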