I need to extract the title (text between <title> and </title>) of a set of HTML documents.
I've found a command that does the job of extracting the text, but it does not always work.
It works with the following example:
However, it does not work with a real example:
First off: get rid of the grep. One regex program per command line is enough, and grep can do nothing that sed couldn't do too, and do better.
The reason your command fails is that sed works line by line: once a new line is read, sed forgets (usually; we can overcome that) what it did on the previous line.
The following title would be extracted with your regex:
but the following would fail:
The reason is that sed would read the first line, notice that the search pattern (which requires the opening AND the closing tag to be present) is not found, and move on to the next line. On the next line the same is true, so the output is empty.
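To see the difference for yourself, here is a minimal sketch. The rule in it is only a stand-in for a typical one-line title regex (it is an assumption, not necessarily the exact command you used), and the sample HTML is made up:

```shell
# Hypothetical single-line rule: keep only what sits between the two tags.
rule='s/.*<title>\(.*\)<\/title>.*/\1/p'

# Opening and closing tag on the same line: the pattern matches.
one_line=$(printf '<head><title>My page</title></head>\n' | sed -n "$rule")
echo "one line:  [$one_line]"

# Tags on different lines: no single line contains both, so nothing is printed.
two_lines=$(printf '<head><title>My\npage</title></head>\n' | sed -n "$rule")
echo "two lines: [$two_lines]"
```

The first call prints the title, the second prints nothing between the brackets.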
Fortunately there is a device to make sed less forgetful: the line range.
When we write "s/x/y/" we imply that this rule is applied to every line. This is only the abbreviated form of a command that can also carry a starting and an ending line: "1,5 s/x/y/" would apply the rule only to lines 1-5. Try these with a test file to see the effect.
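If you have no test file at hand, a quick sketch with made-up input (six lines, each containing just "x") shows the effect of the line range:

```shell
# Six identical lines; only lines 1-5 fall inside the range, line 6 is left alone.
ranged=$(printf 'x\nx\nx\nx\nx\nx\n' | sed '1,5 s/x/y/')
echo "$ranged"
```

The first five lines come out as "y", the sixth stays "x".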
OK, using line numbers is a bit static, because usually we will not know on which line a certain rule has to be applied - at least not beforehand. But it is also possible to use additional regexes to define the first and the last line of the block where the rule will be applied:
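A minimal sketch of such a regex-delimited block, where START and END are placeholder markers I made up for the demonstration:

```shell
# The rule is applied only from the line matching START through the line matching END.
block=$(printf 'x\nSTART\nx\nx\nEND\nx\n' | sed '/START/,/END/ s/x/y/')
echo "$block"
```

Only the two "x" lines between the markers are changed; the ones before START and after END are untouched.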
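One caveat first: sed starts looking for the closing pattern only on the line after the one that opened the block, so a title that opens and closes on the same line does not end the block there and has to be handled separately. With that in mind, applying this to your problem, we could use "<title>" as the beginning and "</title>" as the end of the block in question and apply a rule to the whole block instead of only one line: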
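As a sketch, assuming a made-up document whose title is spread over two lines, the block can be selected like this:

```shell
# Sample input (an assumption): the title starts on line 2 and ends on line 3.
html='<html><head>
<meta charset="utf-8"><title>A title
spread over lines</title></head>
<body>text</body></html>'

# -n suppresses the default output; p prints only lines inside the block.
lines=$(printf '%s\n' "$html" | sed -n '/<title>/,/<\/title>/p')
echo "$lines"
```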
This will print only the lines from the opening to the closing tag. Now we have to "trim" this to get a nice output.
There are three possible types of lines:
1. Lines with a "<title>" in them. We want to delete everything up to "<title>" and display the rest.
2. Lines with a "</title>" in them. We want to keep everything up to "</title>" and dispose of the rest.
3. Lines in between. We want to keep them entirely.
OK, let's do it. One more thing: it is possible to group commands in sed, as in any programming language. The curly braces "{}" are used to group several commands into a single one:
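A minimal sketch of such a grouped command, again with a made-up document and assuming the opening and closing tags sit on different lines:

```shell
# Sample input (an assumption): a title spread over three lines.
html='<html><head>
<meta charset="utf-8"><title>A title
spread over
three lines</title></head>
<body>text</body></html>'

title=$(printf '%s\n' "$html" | sed -n '/<title>/,/<\/title>/ {
    # type 1: drop everything up to and including the opening tag
    s/.*<title>//
    # type 2: drop the closing tag and everything after it
    s/<\/title>.*//
    # all types: print whatever is left of the line
    p
}')
echo "$title"
```

This prints the three pieces of the title, one per line.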
You might notice that there is no action for the type-3 lines, but in fact there is: it's the "p", which prints all the resulting lines (or the parts that survived our trimming). The "-n" makes sure nothing is printed except what is explicitly requested.
I leave the task of concatenating the resulting lines to you as an exercise. If you still have trouble, feel free to ask again.