Problem when extracting the title of HTML doc

12-01-2008

Dear all.

I need to extract the title (text between <title> and </title>) of a set of HTML documents.
I've found a command that extracts the text, but it does not always work.

It works with the following example:

Code:

cat a.txt 
htmltext<title>This is a HTML title</title>blablalbla

Code:

grep title a.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
This is a HTML title

However, it does not work with a real example:

Code:

cat b.txt 
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page
</title> <link href...></link>

Code:

grep title b.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'

The last command does not return anything.

I would appreciate any comments or suggestions.

i007

12-01-2008

First off: dispose of the grep. One regex program per command line is enough, and grep does nothing here that sed could not do itself.

The reason your command fails is that sed works line by line: once a new line is read, sed forgets (usually; we can overcome that) what it did on the previous line.

The following title would be extracted with your regex:

Code:

<title>blah</title>

but the following would fail:

Code:

<title>blah
</title>

The reason is that sed would read the first line, notice that the search pattern (which requires the opening AND the closing tag to be there) is not found, and move on to the next line. On the next line the same is true, so the output is empty.
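To see this line-by-line behaviour for yourself, here is a small sketch. The file names and the temporary directory are made up for illustration, and the GNU-only parts of the original command (the "i" flag and "T;q") are dropped so that it should run with any sed:

```shell
# Two sample inputs in a throwaway directory (hypothetical file names).
tmpdir=$(mktemp -d)
printf '%s\n' '<title>blah</title>' > "$tmpdir/one_line.html"
printf '%s\n' '<title>blah' '</title>' > "$tmpdir/two_lines.html"

# The pattern needs both tags on the SAME line to match.
one=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$tmpdir/one_line.html")
two=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$tmpdir/two_lines.html")

echo "one line:  [$one]"   # the title is found
echo "two lines: [$two]"   # neither line matches, so nothing is captured

rm -r "$tmpdir"
```

The second result is empty because neither of the two lines contains both tags.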

Fortunately there is a device to make sed less forgetful: the line range.

When we write "s/x/y/" we imply that this rule is used on every line. Still, this is only the abbreviated form of a command, which would include a starting and an end line: "1,5 s/x/y/" would apply the rule only to lines 1-5. Try these with a test file to see the effect.
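A quick way to try this without creating a test file is to feed sed a few generated lines. This is only a sketch; the seven "x" lines are made-up sample input:

```shell
# Seven identical lines; the substitution is restricted to lines 1-5.
ranged=$(printf '%s\n' x x x x x x x | sed '1,5 s/x/y/')
echo "$ranged"
```

The first five lines come out as "y", while lines 6 and 7 are left untouched.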

OK, using line numbers is a bit static, because usually we will not know on which line a certain rule has to be applied - at least not beforehand. But it is also possible to use additional regexes to define the first and the last line of the block where the rule will be applied:

Code:

<regex1>,<regex2> <command>

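Applying this to your problem, we could use "<title>" as the beginning and "</title>" as the end of the block in question, and apply your rule to the whole block instead of only one line. One caveat: sed starts looking for the end pattern on the line after the one that opened the block, so a title that opens and closes on the same line keeps the block open until the next "</title>" or the end of the file.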

Code:

sed -n '/<title>/,/<\/title>/ p'

This will print only the lines from the opening to the closing tag. Now we have to "trim" this to get a nice output.
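As a sketch of what this prints, the following recreates b.txt from the question in a temporary directory. The third line (the body) is an assumption added here to show that lines after the closing tag are not printed:

```shell
tmpdir=$(mktemp -d)

# b.txt as posted (the elided href is kept as-is), plus one made-up extra line.
printf '%s\n' \
    '<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page' \
    '</title> <link href...></link>' \
    '<body>not part of the title</body>' > "$tmpdir/b.txt"

# Print only the lines from the opening tag to the closing tag.
block=$(sed -n '/<title>/,/<\/title>/ p' "$tmpdir/b.txt")
echo "$block"

rm -r "$tmpdir"
```

Only the first two lines are printed, still with the surrounding markup attached.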

There are three possible types of lines:

1. Lines with a "<title>" in them. We want to delete everything up to and including "<title>" and display the rest.

2. Lines with a "</title>" in them. We want to keep everything before "</title>" and dispose of the rest.

3. Lines in between. We want to keep them entirely.

OK, let's do it. One more thing: sed lets you group commands, as most programming languages do. The curly braces "{}" group several commands into a single one:

Code:

sed -n '/<title>/,/<\/title>/ {
            s/^.*<title>//
            s/<\/title>.*$//
            p
            }'

You might notice that there is no action for the type-3 lines, but in fact there is: it's the "p", which prints all the resulting lines (or the parts which survived our trimming). The "-n" option suppresses all output except what is explicitly requested.

I leave the task of concatenating the resulting lines to you as an exercise. If you still have trouble, feel free to ask again.
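For readers who want to check their answer, here is one possible way to do the concatenation, as a sketch rather than the only solution. It swaps the address range for sed's "N" command, which appends the next input line to the current one. That also covers a title that opens and closes on a single line, a case where a regex range stays open because sed only looks for the end pattern from the following line onwards. The function name extract_title is hypothetical, matching is case-sensitive, and only the first title in the file is printed:

```shell
# extract_title is a made-up helper name, not part of the original thread.
# It prints the first <title>...</title> of the file "$1" on a single line.
extract_title() {
    sed -n '
/<title>/ {
    # Keep appending lines until the closing tag is in the pattern space.
    :a
    /<\/title>/!{
        N
        ba
    }
    # Trim everything outside the tags, then join and tidy the rest.
    s/.*<title>//
    s/<\/title>.*//
    s/\n/ /g
    s/^ *//
    s/ *$//
    p
    q
}' "$1"
}

# Recreate the two sample files from the question in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'htmltext<title>This is a HTML title</title>blablalbla' > "$tmpdir/a.txt"
printf '%s\n' \
    '<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page' \
    '</title> <link href...></link>' > "$tmpdir/b.txt"

title_a=$(extract_title "$tmpdir/a.txt")
title_b=$(extract_title "$tmpdir/b.txt")
echo "$title_a"
echo "$title_b"

rm -r "$tmpdir"
```

If a file has an opening tag but never closes it, this sketch prints nothing for that file.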

I hope this helps.

bakunin

12-01-2008

It definitely works.
Thank you very much, bakunin, for your excellent explanation and for the fast reply.
I really appreciate your help.

i007
