.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . 
ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "HTML::TreeBuilder 3" .TH HTML::TreeBuilder 3 "2020-03-05" "perl v5.16.3" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. 
.if n .ad l
.nh
.SH "NAME"
HTML::TreeBuilder \- Parser that builds a HTML syntax tree
.SH "VERSION"
.IX Header "VERSION"
This document describes version 5.07 of HTML::TreeBuilder,
released August 31, 2017 as part of HTML-Tree.
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&  use HTML::TreeBuilder 5 \-weak; # Ensure weak references in use
\&
\&  foreach my $file_name (@ARGV) {
\&    my $tree = HTML::TreeBuilder\->new; # empty tree
\&    $tree\->parse_file($file_name);
\&    print "Hey, here\*(Aqs a dump of the parse tree of $file_name:\en";
\&    $tree\->dump; # a method we inherit from HTML::Element
\&    print "And here it is, bizarrely rerendered as HTML:\en",
\&      $tree\->as_HTML, "\en";
\&
\&    # Now that we\*(Aqre done with it, we must destroy it.
\&    # $tree = $tree\->delete; # Not required with weak references
\&  }
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
(This class is part of the HTML::Tree dist.)
.PP
This class is for \s-1HTML\s0 syntax trees that get built out of \s-1HTML\s0
source. The way to use it is to:
.PP
1. start a new (empty) HTML::TreeBuilder object,
.PP
2. then use one of the methods from HTML::Parser (presumably with
\&\f(CW\*(C`$tree\->parse_file($filename)\*(C'\fR for files, or with
\&\f(CW\*(C`$tree\->parse($document_content)\*(C'\fR and \f(CW\*(C`$tree\->eof\*(C'\fR if you've got
the content in a string) to parse the \s-1HTML\s0
document into the tree \f(CW$tree\fR.
.PP
(You can combine steps 1 and 2 with the \*(L"new_from_file\*(R" or
\&\*(L"new_from_content\*(R" methods.)
.PP
2b. call \f(CW\*(C`$root\->elementify()\*(C'\fR if you want.
.PP
3. do whatever you need to do with the syntax tree, presumably
involving traversing it looking for some bit of information in it,
.PP
4. previous versions of HTML::TreeBuilder required you to call
\&\f(CW\*(C`$tree\->delete()\*(C'\fR to erase the contents of the tree from memory
when you're done with the tree. This is not normally required anymore.
See \*(L"Weak References\*(R" in HTML::Element for details.
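.PP
The steps above can be sketched as follows for a document that is already
held in a string. This is only an illustration: the sample markup and the
variable names are invented for the example, and the traversal uses
\f(CW\*(C`look_down\*(C'\fR, \f(CW\*(C`as_text\*(C'\fR and \f(CW\*(C`attr\*(C'\fR, which are inherited from
HTML::Element and documented there rather than here.
.PP
.Vb 3
\&  use strict;
\&  use warnings;
\&  use HTML::TreeBuilder 5 \-weak;
\&
\&  # Illustrative input; any HTML string would do.
\&  my $html = \*(Aq<p>See <a href="/docs">the docs</a>.</p>\*(Aq;
\&
\&  # Steps 1 and 2 combined: build the tree straight from the string.
\&  my $tree = HTML::TreeBuilder\->new_from_content($html);
\&
\&  # Step 3: traverse the tree, here visiting every <a> element.
\&  foreach my $link ($tree\->look_down(_tag => \*(Aqa\*(Aq)) {
\&    printf "%s => %s\en", $link\->as_text, $link\->attr(\*(Aqhref\*(Aq);
\&  }
\&
\&  # Step 4: no explicit $tree\->delete, since \-weak was requested.
.Ve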
.SH "ATTRIBUTES"
.IX Header "ATTRIBUTES"
Most of the following attributes native to HTML::TreeBuilder control how
parsing takes place; they should be set \fIbefore\fR you try parsing into
the given object. You can set the attributes by passing a \s-1TRUE\s0 or
\&\s-1FALSE\s0 value as argument. E.g., \f(CW\*(C`$root\->implicit_tags\*(C'\fR returns
the current setting for the \f(CW\*(C`implicit_tags\*(C'\fR option,
\&\f(CW\*(C`$root\->implicit_tags(1)\*(C'\fR turns that option on,
and \f(CW\*(C`$root\->implicit_tags(0)\*(C'\fR turns it off.
.SS "implicit_tags"
.IX Subsection "implicit_tags"
Setting this attribute to true will instruct the parser to try to
deduce implicit elements and implicit end tags. If it is false you
get a parse tree that just reflects the text as it stands, which is
unlikely to be useful for anything but quick and dirty parsing.
(In fact, I'd be curious to hear from anyone who finds it useful to
have \f(CW\*(C`implicit_tags\*(C'\fR set to false.)
Default is true.
.PP
Implicit elements have the \*(L"implicit\*(R" attribute set (see HTML::Element).
.SS "implicit_body_p_tag"
.IX Subsection "implicit_body_p_tag"
This controls an aspect of implicit element behavior, if \f(CW\*(C`implicit_tags\*(C'\fR
is on: If a text element (\s-1PCDATA\s0) or a phrasal element (such as
\&\f(CW\*(C`<em>\*(C'\fR) is to be inserted under \f(CW\*(C`<body>\*(C'\fR, two things
can happen: if \f(CW\*(C`implicit_body_p_tag\*(C'\fR is true, it's placed under a new,
implicit \f(CW\*(C`<p>\*(C'\fR tag. (Past DTDs suggested this was the only
correct behavior, and this is how past versions of this module
behaved.) But if \f(CW\*(C`implicit_body_p_tag\*(C'\fR is false, nothing is implied
\&\*(-- the \s-1PCDATA\s0 or phrasal element is simply placed under
\&\f(CW\*(C`<body>\*(C'\fR. Default is false.
.SS "no_expand_entities"
.IX Subsection "no_expand_entities"
This attribute controls whether entities are decoded during the initial
parse of the source. Enable this if you don't want entities decoded to
their character value; for example, '&amp;' is decoded to '&' by default,
but is left unchanged if this is enabled.
Default is false (entities are decoded).
.SS "ignore_unknown"
.IX Subsection "ignore_unknown"
This attribute controls whether unknown tags should be represented as
elements in the parse tree, or whether they should be ignored.
Default is true (unknown tags are ignored).
.SS "ignore_text"
.IX Subsection "ignore_text"
Do not represent the text content of elements. This saves space if
all you want is to examine the structure of the document. Default is
false.
.SS "ignore_ignorable_whitespace"
.IX Subsection "ignore_ignorable_whitespace"
If set to true, TreeBuilder will try to avoid
creating ignorable whitespace text nodes in the tree. Default is
true. (In fact, I'd be interested in hearing if there's ever a case
where you need this off, or where leaving it on leads to incorrect
behavior.)
.SS "no_space_compacting"
.IX Subsection "no_space_compacting"
This determines whether TreeBuilder compacts all whitespace strings
in the document (well, outside of \s-1PRE\s0 or \s-1TEXTAREA\s0 elements), or
leaves them alone. Normally (default, value of 0), each string of
contiguous whitespace in the document is turned into a single space.
But that's not done if \f(CW\*(C`no_space_compacting\*(C'\fR is set to 1.
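.PP
A minimal sketch of the difference, assuming a small inline document (the
sample markup and the variable names are invented for the example, and
\f(CW\*(C`as_text\*(C'\fR is inherited from HTML::Element). As with the other
attributes, the option is set before anything is parsed into the tree:
.PP
.Vb 3
\&  use strict;
\&  use warnings;
\&  use HTML::TreeBuilder 5 \-weak;
\&
\&  # Illustrative input with runs of spaces and a newline in the text.
\&  my $html = "<p>one   two\en   three</p>";
\&
\&  # Default (0): runs of whitespace are compacted while parsing.
\&  my $compact = HTML::TreeBuilder\->new;
\&  $compact\->parse($html);
\&  $compact\->eof;
\&
\&  # no_space_compacting(1): whitespace strings are left alone.
\&  my $verbatim = HTML::TreeBuilder\->new;
\&  $verbatim\->no_space_compacting(1);   # set before parsing
\&  $verbatim\->parse($html);
\&  $verbatim\->eof;
\&
\&  # Print both so the text content can be compared side by side.
\&  print "[", $compact\->as_text, "]\en";
\&  print "[", $verbatim\->as_text, "]\en";
.Ve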
.PP
Setting \f(CW\*(C`no_space_compacting\*(C'\fR to 1 might be useful if you want
to read in a tree just to make some minor changes to it before writing
it back out.
.PP
This method is experimental. If you use it, be sure to report
any problems you might have with it.
.SS "p_strict"
.IX Subsection "p_strict"
If set to true (and it defaults to false), TreeBuilder will take a
narrower than normal view of what can be under a \f(CW\*(C`<p>\*(C'\fR element; if it sees
a non-phrasal element about to be inserted under a \f(CW\*(C`<p>\*(C'\fR, it will
close that \f(CW\*(C`<p>\*(C'\fR. Otherwise it will close \f(CW\*(C`<p>\*(C'\fR elements only for
other \f(CW\*(C`<p>\*(C'\fR's, headings, and \f(CW\*(C`<form>\*(C'\fR (although the latter may be
removed in future versions).
.PP
For example, when going through this snippet of code,
.PP
.Vb 2
\&  <p>stuff
\&