Extracting a part of XML File

11-10-2008


Hi Guys,

I have a very large XML feed (2.7 MB) which crashes the server when it is parsed. To reduce the load on the server, I have a cron job running every 5 minutes. This job gets the file from the feed host and stores it on the local machine.

This does not solve the problem, as the whole file still gets loaded on the server. The file looks something like this:

<?xml version="1.0" standalone="no"?>
<IRXML CorpMasterID="">
<NewsReleases PubDate="20081104" PubTime="16:48:03">
<NewsCategory Category="">
<NewsRelease ReleaseID="" DLU="20081104 16:47:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="16:33:00">11/4/2008 4:33:00 PM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>
<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

I want to write a shell script that extracts only the part from
<NewsRelease> to </NewsRelease>.
Something like:

<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

There is one more problem: when the file is downloaded on Unix it contains no line breaks, so the complete file appears to be on one line.

Any help would be appreciated. Thanks,
Shridhar

shridhard


11-10-2008


Code:

sed -n '/<NewsRelease R/,/<\/NewsRelease>/p' xmldump >outputfile

wempy


11-10-2008


Regarding the end-of-line problem: what format is the file currently in, i.e. does it have LF, CR/LF or CR as its end-of-line marker?
Which tool to use depends on the format.
To go from DOS to Unix, use dos2unix, or open the file in vim and run :set fileformat=unix
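
One rough way to answer the format question is to count the CR and LF bytes in the file. This is only a sketch: `count_eol` is a made-up helper name, and the file name in the usage line is an assumption.

```shell
# count_eol: hypothetical helper; reports how many CR and LF bytes a file contains.
count_eol() {
    # tr -cd keeps only the named byte; wc -c then counts what is left
    cr=$(tr -cd '\r' < "$1" | wc -c)
    lf=$(tr -cd '\n' < "$1" | wc -c)
    # $((...)) strips the padding that some wc implementations add
    echo "CR=$((cr)) LF=$((lf))"
}
```

Usage would be something like `count_eol xmldump`. Equal CR and LF counts suggest CR/LF endings, CRs with no LFs suggest CR-only endings, and counts near zero for both suggest the file really is one long line with no end-of-line markers at all.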

wempy


11-11-2008


copying the complete file

Thanks for the reply.

There seems to be a problem with the command. It executes, but when I look at the output file, it is a complete copy of the XML feed.
I don't think there is a problem with the file format, because I do not see ^M in the file.
I think the problem could be the multiple occurrences of "NewsRelease" in the file.
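
Another possible cause, given that the downloaded file appears to be on one line: sed and awk work line by line, so a range match on a one-line file prints that whole line. A minimal sketch of a workaround, assuming the feed never has "><" inside text content (`split_tags` is a made-up name):

```shell
# split_tags: hypothetical helper; breaks the input between adjacent tags
# so that line-oriented tools see one tag (or one element) per line.
# Reads the files given as arguments, or stdin when there are none.
split_tags() {
    awk '{ gsub(/></, ">\n<"); print }' "$@"
}
```

For example, `split_tags xmldump > xmldump.lines` would produce a file that a sed or awk range expression can work on. Some older awk implementations limit line length, so this may need checking on the target system.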

Also, my requirement is that I need to copy the first 5 occurrences of <NewsRelease> ... </NewsRelease> from the XML feed to another file, as I need to parse the first 5 news releases to HTML using XSL.

Please let me know if this is possible.

Thanks again.
Shridhar

shridhard


11-12-2008


Hope this helps.

It prints only the first five blocks enclosed by the <NewsRelease ...> and </NewsRelease> tags.

Code:

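awk '/<NewsRelease[ >]/,/<\/NewsRelease>/{
# [ >] after the tag name keeps <NewsReleases> from starting a block
# print lines while fewer than five blocks have been completed
if(n<5)
	print
# a closing tag on this line ends the current block
if(index($0,"</NewsRelease>")!=0)
	n++
}' filename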
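
On a large feed it may also help to stop reading once the wanted blocks have been printed. A sketch of that idea, assuming each tag starts on its own line and the opening tag name is followed by a space or ">" (`first_releases` is a made-up name):

```shell
# first_releases: hypothetical helper; prints the first N <NewsRelease> blocks
# from the files given after N (or from stdin) and then stops reading.
first_releases() {
    limit=$1
    shift
    awk -v limit="$limit" '
        /<NewsRelease[ >]/ { inside = 1 }   # opening tag, but not <NewsReleases>
        inside             { print }        # copy every line of the current block
        inside && /<\/NewsRelease>/ {       # closing tag ends the block
            inside = 0
            if (++n >= limit) exit          # enough blocks: skip the rest of the input
        }
    ' "$@"
}
```

For example, `first_releases 5 filename > first5.txt`. The output contains only the release blocks without an enclosing root element, so it would still need a wrapper element before an XSL processor accepts it.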

summer_cherry


11-12-2008


Thanks got it almost working

Thanks for the reply, it worked... I have to add a few more things to make it work completely.

Warm Regards,
Shridhar

shridhard


11-12-2008


Quote:

Also, my requirement is that I need to copy the first 5 occurrences of <NewsRelease> ... </NewsRelease> from the XML feed to another file, as I need to parse the first 5 news releases to HTML using XSL.

Why not extract the first 5 releases using XSLT? For example:

Code:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml"/>

  <xsl:template match="/">
    <!-- Select NewsReleases directly: XSLT 1.0 built-in templates do not forward parameters -->
    <xsl:apply-templates select="//NewsReleases">
      <xsl:with-param name="mycount" select="5"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="NewsReleases">
    <xsl:param name="mycount" select="5"/>
    <xsl:element name="NewsReleases">
      <xsl:attribute name="PubDate">
        <xsl:value-of select="@PubDate"/>
      </xsl:attribute>
      <xsl:attribute name="PubTime">
        <xsl:value-of select="@PubTime"/>
      </xsl:attribute>
      <xsl:text>&#xA;</xsl:text>
      <!-- descendant:: numbers the releases across the whole subtree, not per parent element -->
      <xsl:for-each select="descendant::NewsRelease[position() &lt;= $mycount]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:text>&#xA;</xsl:text>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

This assumes that your irXML document is well-formed XML, which is not the case for the sample document you supplied.

fpmurphy
