HTML::HTML5::Sanity(3pm)				User Contributed Perl Documentation				  HTML::HTML5::Sanity(3pm)

NAME
HTML::HTML5::Sanity - make HTML5 DOM trees less insane

SYNOPSIS
    use HTML::HTML5::Parser;
    use HTML::HTML5::Sanity;

    my $parser    = HTML::HTML5::Parser->new;
    my $html5_dom = $parser->parse_file('http://example.com/');
    my $sane_dom  = fix_document($html5_dom);

DESCRIPTION
The Document Object Model (DOM) generated by HTML::HTML5::Parser meets the requirements of the HTML5 spec, but will probably catch a lot of people by surprise.

The main oddity is that elements and attributes which appear to be namespaced are not really namespaced. For example, the following element:

    <div xml:lang="fr">...</div>

looks like it should be parsed so that it has an attribute "lang" in the XML namespace. Not so. It will really be parsed as having the attribute "xml:lang" in the null namespace.

"fix_document($document)"
    $sane_dom = fix_document($html5_dom);

Returns a modified copy of the DOM, leaving the original DOM unmodified.

"fix_element($element_node, $new_document_node, \%namespaces)"
Don't use this. Not exported.

"fix_attribute($attribute_node, $new_element_node, \%namespaces)"
Don't use this. Not exported.

$HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES
    $HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES = 2;
    $sane_dom = fix_document($html5_dom);

If set to 1 (the default), the package will detect invalid values in @lang and @xml:lang, and remove the attribute if it is invalid. If set to 2, it will also attempt to canonicalise the value (e.g. 'EN_GB' will be converted to 'en-GB'). If set to 0, then the value of language attributes is not checked.

BUGS
Please report any bugs to <http://rt.cpan.org/>.

SEE ALSO
HTML::HTML5::Parser, XML::LibXML, Task::HTML5.

AUTHOR
Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENSE
Copyright (C) 2009-2011 by Toby Inkster

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.14.2 2011-12-08 HTML::HTML5::Sanity(3pm)
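
EXAMPLE (ILLUSTRATIVE)
The namespace repair described in DESCRIPTION can be checked with a short script. This is only a sketch under stated assumptions: it parses a hypothetical string with HTML::HTML5::Parser's parse_string instead of fetching a URL, and it inspects the result with XML::LibXML's getElementsByLocalName and getAttributeNS, neither of which is mentioned on this page. The variable names and the markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Sanity qw(fix_document);

# Namespace URI bound to the reserved "xml" prefix.
my $XML_NS = 'http://www.w3.org/XML/1998/namespace';

# Hypothetical input, parsed from a string instead of a URL.
my $parser    = HTML::HTML5::Parser->new;
my $html5_dom = $parser->parse_string(
    '<!DOCTYPE html><title>t</title><div xml:lang="fr">Bonjour</div>'
);

# fix_document returns a repaired copy of the document.
my $sane_dom = fix_document($html5_dom);

# Find the <div> in the repaired copy and read "lang" from the XML namespace.
my ($div) = $sane_dom->getElementsByLocalName('div');
my $lang  = $div->getAttributeNS($XML_NS, 'lang');

print defined $lang
    ? "lang is in the XML namespace: $lang\n"
    : "no namespaced lang attribute found\n";
```

If the repair works as the DESCRIPTION suggests, the namespaced lookup on the repaired copy should succeed, while $html5_dom itself is left as the parser produced it.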

Mojo::DOM(3pm)						User Contributed Perl Documentation					    Mojo::DOM(3pm)

NAME
Mojo::DOM - Minimalistic HTML5/XML DOM parser with CSS3 selectors

SYNOPSIS
    use Mojo::DOM;

    # Parse
    my $dom = Mojo::DOM->new('<div><p id="a">A</p><p id="b">B</p></div>');

    # Find
    my $b = $dom->at('#b');
    say $b->text;

    # Walk
    say $dom->div->p->[0]->text;
    say $dom->div->children('p')->first->{id};

    # Iterate
    $dom->find('p[id]')->each(sub { say shift->{id} });

    # Loop
    for my $e ($dom->find('p[id]')->each) {
      say $e->text;
    }

    # Modify
    $dom->div->p->[1]->append('<p id="c">C</p>');

    # Render
    say $dom;

DESCRIPTION
Mojo::DOM is a minimalistic and relaxed HTML5/XML DOM parser with CSS3 selector support. It will even try to interpret broken XML, so you should not use it for validation.

CASE SENSITIVITY
Mojo::DOM defaults to HTML5 semantics, which means all tags and attributes are lowercased and selectors need to be lowercase as well.

    my $dom = Mojo::DOM->new('<P ID="greeting">Hi!</P>');
    say $dom->at('p')->text;
    say $dom->p->{id};

If XML processing instructions are found, the parser will automatically switch into XML mode and everything becomes case sensitive.

    my $dom = Mojo::DOM->new('<?xml version="1.0"?><P ID="greeting">Hi!</P>');
    say $dom->at('P')->text;
    say $dom->P->{ID};

XML detection can also be disabled with the "xml" method.

    # Force XML semantics
    $dom->xml(1);

    # Force HTML5 semantics
    $dom->xml(0);

METHODS
Mojo::DOM implements the following methods.

"new"
    my $dom = Mojo::DOM->new;
    my $dom = Mojo::DOM->new('<foo bar="baz">test</foo>');

Construct a new Mojo::DOM object.

"all_text"
    my $trimmed   = $dom->all_text;
    my $untrimmed = $dom->all_text(0);

Extract all text content from DOM structure; smart whitespace trimming is enabled by default.

    # "foo bar baz"
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->all_text;

    # "foo barbaz "
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->all_text(0);

"append"
    $dom = $dom->append('<p>Hi!</p>');

Append to element.

    # "<div><h1>A</h1><h2>B</h2></div>"
    $dom->parse('<div><h1>A</h1></div>')->at('h1')->append('<h2>B</h2>');

"append_content"
    $dom = $dom->append_content('<p>Hi!</p>');

Append to element content.

    # "<div><h1>AB</h1></div>"
    $dom->parse('<div><h1>A</h1></div>')->at('h1')->append_content('B');

"at"
    my $result = $dom->at('html title');

Find a single element with CSS3 selectors. All selectors from Mojo::DOM::CSS are supported.

    # Find first element with "svg" namespace definition
    my $namespace = $dom->at('[xmlns:svg]')->{'xmlns:svg'};

"attrs"
    my $attrs = $dom->attrs;
    my $foo   = $dom->attrs('foo');
    $dom      = $dom->attrs({foo => 'bar'});
    $dom      = $dom->attrs(foo => 'bar');

Element attributes.

"charset"
    my $charset = $dom->charset;
    $dom        = $dom->charset('UTF-8');

Alias for "charset" in Mojo::DOM::HTML.

"children"
    my $collection = $dom->children;
    my $collection = $dom->children('div');

Return a Mojo::Collection object containing the children of this element, similar to "find".

    # Show type of random child element
    say $dom->children->shuffle->first->type;

"content_xml"
    my $xml = $dom->content_xml;

Render content of this element to XML.

    # "<b>test</b>"
    $dom->parse('<div><b>test</b></div>')->div->content_xml;

"find"
    my $collection = $dom->find('html title');

Find elements with CSS3 selectors and return a Mojo::Collection object. All selectors from Mojo::DOM::CSS are supported.

    # Find a specific element and extract information
    my $id = $dom->find('div')->[23]{id};

    # Extract information from multiple elements
    my @headers = $dom->find('h1, h2, h3')->map(sub { shift->text })->each;

"namespace"
    my $namespace = $dom->namespace;

Find element namespace.

    # Find namespace for an element with namespace prefix
    my $namespace = $dom->at('svg > svg:circle')->namespace;

    # Find namespace for an element that may or may not have a namespace prefix
    my $namespace = $dom->at('svg > circle')->namespace;

"parent"
    my $parent = $dom->parent;

Parent of element.

"parse"
    $dom = $dom->parse('<foo bar="baz">test</foo>');

Alias for "parse" in Mojo::DOM::HTML.

    # Parse UTF-8 encoded XML
    my $dom = Mojo::DOM->new->charset('UTF-8')->xml(1)->parse($xml);

"prepend"
    $dom = $dom->prepend('<p>Hi!</p>');

Prepend to element.

    # "<div><h1>A</h1><h2>B</h2></div>"
    $dom->parse('<div><h2>B</h2></div>')->at('h2')->prepend('<h1>A</h1>');

"prepend_content"
    $dom = $dom->prepend_content('<p>Hi!</p>');

Prepend to element content.

    # "<div><h2>AB</h2></div>"
    $dom->parse('<div><h2>B</h2></div>')->at('h2')->prepend_content('A');

"replace"
    $dom = $dom->replace('<div>test</div>');

Replace elements.

    # "<div><h2>B</h2></div>"
    $dom->parse('<div><h1>A</h1></div>')->at('h1')->replace('<h2>B</h2>');

"replace_content"
    $dom = $dom->replace_content('test');

Replace element content.

    # "<div><h1>B</h1></div>"
    $dom->parse('<div><h1>A</h1></div>')->at('h1')->replace_content('B');

"root"
    my $root = $dom->root;

Find root node.

"text"
    my $trimmed   = $dom->text;
    my $untrimmed = $dom->text(0);

Extract text content from element only (not including child elements); smart whitespace trimming is enabled by default.

    # "foo baz"
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->text;

    # "foo baz "
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->text(0);

"text_after"
    my $trimmed   = $dom->text_after;
    my $untrimmed = $dom->text_after(0);

Extract text content immediately following element; smart whitespace trimming is enabled by default.

    # "baz"
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->p->text_after;

    # "baz "
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->p->text_after(0);

"text_before"
    my $trimmed   = $dom->text_before;
    my $untrimmed = $dom->text_before(0);

Extract text content immediately preceding element; smart whitespace trimming is enabled by default.

    # "foo"
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->p->text_before;

    # "foo "
    $dom->parse("<div>foo <p>bar</p>baz </div>")->div->p->text_before(0);

"to_xml"
    my $xml = $dom->to_xml;

Render this element and its content to XML.

    # "<div><b>test</b></div>"
    $dom->parse('<div><b>test</b></div>')->div->to_xml;

"tree"
    my $tree = $dom->tree;
    $dom     = $dom->tree(['root', [qw(text lalala)]]);

Alias for "tree" in Mojo::DOM::HTML.

"type"
    my $type = $dom->type;
    $dom     = $dom->type('div');

Element type.

    # List types of child elements
    $dom->children->each(sub { say $_->type });

"xml"
    my $xml = $dom->xml;
    $dom    = $dom->xml(1);

Alias for "xml" in Mojo::DOM::HTML.

CHILD ELEMENTS
In addition to the methods above, many child elements are also automatically available as object methods, which return a Mojo::DOM or Mojo::Collection object, depending on the number of children.

    say $dom->p->text;
    say $dom->div->[23]->text;
    $dom->div->each(sub { say $_->text });

ELEMENT ATTRIBUTES
Direct hash reference access to element attributes is also possible.

    say $dom->{foo};
    say $dom->div->{id};

SEE ALSO
Mojolicious, Mojolicious::Guides, <http://mojolicio.us>.

perl v5.14.2 2012-09-05 Mojo::DOM(3pm)
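
EXAMPLE (ILLUSTRATIVE)
Several of the methods documented above can be combined in one short script. The following is a minimal sketch, not an excerpt from the Mojolicious distribution: the markup, the ids and the variable names are hypothetical, and it deliberately sticks to calls named on this page ("new", "at", "find", "text", "append", "prepend_content" and hash access to attributes), using print rather than say so that no extra feature pragma is assumed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup, chosen only for illustration.
my $dom = Mojo::DOM->new('<ul><li id="a">Alpha</li><li id="b">Beta</li></ul>');

# "at" returns the first matching element, "find" a collection of all matches.
my $first_text = $dom->at('li')->text;
my @ids        = $dom->find('li[id]')->map(sub { shift->{id} })->each;

# "append" adds a sibling after the element;
# "prepend_content" adds to the start of the element's own content.
$dom->at('#b')->append('<li id="c">Gamma</li>');
$dom->at('#a')->prepend_content('First: ');

# Collect the ids again to see the effect of the modification.
my @after = $dom->find('li')->map(sub { shift->{id} })->each;

print "first item: $first_text\n";
print "ids before: @ids\n";
print "ids after:  @after\n";
print "rendered:   $dom\n";
```

Because the elements returned by "at" share the underlying tree with $dom, the modifications made through them are visible when $dom is searched or rendered afterwards.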