class DOMUtil
DOMUtil provides utility functions for working with DOM trees.
DOMUtil is designed to work with the DOM structure generated by Chris Winberry's node-htmlparser. DOMUtil doesn't have a strict dependency on node-htmlparser, but it expects DOM structures compatible with those generated by node-htmlparser. See the htmlparser documentation for more information about that format.
(Currently DOMUtil should work with any DOM format that supports the type, name, children, attribs and raw attributes as used in htmlparser, but that may change in future versions.)
Method names that start with _ are subject to change without notice. Other methods may be considered a part of the public API.
class DOMUtil
The DOMUtil constructor.
While the DOMUtil methods are essentially stateless (and hence thread-safe, i.e., calls to the same DOMUtil instance can be safely interleaved), DOMUtil is implemented as an instantiable class to allow for alternative configurations.
The constructor accepts an optional params map. Currently one parameter key is supported:
params.decode (optionally) specifies a function to use when converting HTML text nodes into "plain text" (in the to_text and inner_text functions). This function can be used, for example, to decode HTML entities into their text equivalents. By default no conversion is made; the text nodes are output in exactly the same format as they are found in the DOM.
  constructor:(params = {})->
    @decode = params?['decode'] ? (str)->str
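As a hedged illustration of the decode parameter, the JavaScript sketch below decodes a small subset of HTML entities. The name decodeEntities and the entity table are assumptions made for this example and are not part of DOMUtil; following the constructor above, such a function would be supplied as params.decode.

```javascript
// Hypothetical decoder for a small subset of HTML entities (not part of DOMUtil).
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeEntities(str) {
  // A single pass over the string, so "&amp;lt;" becomes "&lt;" rather than "<".
  return str.replace(/&(#x[0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Out-of-range code points are left as they were found.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown entity names are left untouched.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Assumed usage, following the constructor signature above:
//   const util = new DOMUtil({ decode: decodeEntities });
console.log(decodeEntities('Fish &amp; chips &lt;tasty&gt;'));
```

Because decode is applied to each text node separately, a decoder of this kind only needs to handle one string at a time.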
parse_html is a convenience function that parses a given HTML string into one or more DOM trees using the htmlparser library (if present).
If htmlparser is not available, an error will be passed to the callback function.
The html parameter must be a string containing one or more HTML/XML trees.
The options parameter is optional, and may contain a map of options to pass to htmlparser.
The callback parameter should contain a function with the signature callback(err,dom), where:
The err argument will be a non-null value if an error occurs during the parsing.
Otherwise the dom argument will contain a single DOM object (when there is a single root tag in the given html string) or an array of DOM objects (when there is more than one HTML/XML structure in the given html string).
  parse_html:(html,options,callback)->
The options parameter is optional, so swap options and callback if necessary.
    if typeof options is 'function' and typeof callback isnt 'function'
      [ options, callback ] = [ callback, options ]
If we haven't yet loaded htmlparser, do so now.
    unless @htmlparser?
      try
        @htmlparser = require 'htmlparser'
      catch err
        callback(err,null)
    if @htmlparser?
Now create a simple handler that invokes the given callback...
      handler = new @htmlparser.DefaultHandler (err,domset)->
        if err?
          callback(err,null)
        else if Array.isArray(domset) and domset.length <= 1
          callback(null,domset[0])
        else
          callback(null,domset)
...create the parser...
      parser = new @htmlparser.Parser(handler,options)
...and parse the HTML.
      parser.parseComplete(html)
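Two steps of parse_html can be sketched in plain JavaScript without htmlparser: the options/callback swap and the unwrapping of a single-root result. The helper names normalizeArgs and unwrapDomset are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helpers mirroring two steps of parse_html (names are illustrative).

// Allow parse_html(html, callback) as well as parse_html(html, options, callback).
function normalizeArgs(options, callback) {
  if (typeof options === 'function' && typeof callback !== 'function') {
    return { options: callback, callback: options };
  }
  return { options, callback };
}

// A result with at most one root is unwrapped; otherwise the array is kept.
function unwrapDomset(domset) {
  if (Array.isArray(domset) && domset.length <= 1) {
    return domset[0]; // undefined when the array is empty
  }
  return domset;
}

const done = (err, dom) => {};
console.log(normalizeArgs(done, undefined).callback === done);    // true
console.log(unwrapDomset([{ type: 'tag', name: 'p' }]).name);     // p
console.log(unwrapDomset([{ name: 'a' }, { name: 'b' }]).length); // 2
```

One consequence worth noting: callers that may receive either shape can pass the result through as_nodeset to get an array in every case.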
as_node returns nodeset[0] if nodeset is an array, nodeset otherwise.
  as_node: (nodeset)->
    if Array.isArray(nodeset)
      return nodeset[0]
    else
      return nodeset
as_nodeset returns node if node is an array, [ node ] if node is any other non-null value, and an empty array otherwise.
  as_nodeset: (node)->
    if Array.isArray(node)
      return node
    else if node?
      return [node]
    else
      return []
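The two normalizers are small enough to restate in JavaScript. This is a sketch of the same logic under camel-cased, illustrative names, not the exported API.

```javascript
// Illustrative JavaScript equivalents of as_node and as_nodeset.
function asNode(nodeset) {
  return Array.isArray(nodeset) ? nodeset[0] : nodeset;
}

function asNodeset(node) {
  if (Array.isArray(node)) return node;
  // `!= null` matches both null and undefined, like CoffeeScript's `?` operator.
  return node != null ? [node] : [];
}

const li = { type: 'tag', name: 'li' };
console.log(asNode([li]) === li);        // true
console.log(asNodeset(li).length);       // 1
console.log(asNodeset(undefined).length); // 0
```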
_kt returns true. It's the default filter for to_text.
  _kt: ()->true
to_text returns a concatenation of all text nodes found within the given DOM elt.
An optional filter parameter may contain a function with the signature filter(node) that returns true if the text found in or beneath the given node should be included in the concatenation or false if the text found at or below the given node should be excluded. (The filter is also passed the node_metadata and all_metadata arguments that walk_dom supplies to its visit callback.)
E.g., the function:
var skip_em = function(node) { return node.name != 'em' };
will cause to_text to exclude any text found within an <em> tag.
  to_text:(elt,filter = @_kt)->
    buffer = ''
    @walk_dom elt, visit:(node,node_metadata,all_metadata)=>
If node is acceptable to filter, then append any text, and visit its children.
      if(filter(node,node_metadata,all_metadata))
        buffer += @decode(node.raw) if node?.type is 'text' and node?.raw?
        return {'continue':true,'visit_children':true}
      else
If node is not acceptable to filter, then skip it and its children.
        return {'continue':true,'visit_children':false}
    return buffer
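To make the effect of a filter concrete, here is a simplified, self-contained stand-in for to_text applied to a hand-built DOM. The toText function is an assumption for illustration (it omits decode and the metadata arguments); only skip_em comes from the text above.

```javascript
// Simplified stand-in for to_text: concatenates text nodes and prunes any
// subtree whose root is rejected by the filter.
function toText(node, filter = () => true) {
  if (!filter(node)) return '';
  let text = node.type === 'text' && node.raw != null ? node.raw : '';
  for (const child of node.children || []) {
    text += toText(child, filter);
  }
  return text;
}

// Hand-built DOM for: <p>Hello <em>big</em> world</p>
const p = {
  type: 'tag', name: 'p',
  children: [
    { type: 'text', raw: 'Hello ' },
    { type: 'tag', name: 'em', children: [{ type: 'text', raw: 'big' }] },
    { type: 'text', raw: ' world' },
  ],
};

const skip_em = function (node) { return node.name !== 'em'; };

console.log(toText(p));          // Hello big world
console.log(toText(p, skip_em)); // Hello  world  (the text inside <em> is gone)
```

Note that the whitespace on either side of the skipped element is kept, so the filtered result contains two adjacent spaces.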
inner_text is an alias for to_text (which see).
  inner_text:(elt,filter)->@to_text(elt,filter)
to_html returns an HTML string representation of the given elt and its children (if any).
(Currently only text and tag node types are converted, but that may change in the future.)
  to_html:(elt)->
    buffer = ''
    @walk_dom elt, {
When visiting a node...
      visit:(node)->
        switch node.type
...concat the value of text nodes.
          when 'text'
            buffer += node.raw
...concat the name and attributes of tag nodes.
          when 'tag'
            buffer += "<#{node.name}"
            if node.attribs?
              for name,value of node.attribs
                buffer += " #{name}=\"#{value}\""
            buffer += ">"
        return true
After visiting a node (the after_visit callback)...
      after_visit:(node)->
        switch node.type
...concat the "end tag" for tag nodes.
          when 'tag'
            buffer += "</#{node.name}>"
        return true
    }
    return buffer
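The serialization rules can be checked against a small hand-built DOM. The toHtml function below is a simplified JavaScript restatement written for this example (recursive rather than callback-driven), so treat it as a sketch of the behaviour described above and not as the library code.

```javascript
// Simplified restatement of to_html for text and tag nodes.
function toHtml(nodes) {
  let buffer = '';
  for (const node of Array.isArray(nodes) ? nodes : [nodes]) {
    if (node.type === 'text') buffer += node.raw;
    if (node.type === 'tag') {
      buffer += `<${node.name}`;
      for (const [name, value] of Object.entries(node.attribs || {})) {
        buffer += ` ${name}="${value}"`; // values are emitted as-is, unescaped
      }
      buffer += '>';
    }
    // Children are visited for every node type, as in the walk above.
    buffer += toHtml(node.children || []);
    if (node.type === 'tag') buffer += `</${node.name}>`;
  }
  return buffer;
}

const link = {
  type: 'tag', name: 'a', attribs: { href: '/x' },
  children: [{ type: 'text', raw: 'link' }],
};
console.log(toHtml(link));                        // <a href="/x">link</a>
console.log(toHtml({ type: 'tag', name: 'br' })); // <br></br>
```

Two consequences follow from these rules: every tag receives an explicit end tag, including childless ones, and attribute values are written out without any escaping.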
inner_html returns an HTML string representation of the children (if any) of the given elt. If elt is not an array and has no children, null is returned.
(Otherwise it behaves just like to_html, which see.)
  inner_html:(elt)->
    buffer = null
If elt is an array, invoke to_html on the children of each element in the array.
    if Array.isArray(elt)
      buffer = ''
      for node in elt
        if node.children?
          buffer += @to_html(node.children)
Otherwise invoke to_html on the children of elt.
    else if elt?.children?
      buffer = @to_html elt.children
    return buffer
walk_dom performs a depth-first walk of the given DOM tree (or trees), invoking a specified "visit" function for each node.
The dom parameter is either a single DOM node or an array of DOM nodes.
The callbacks parameter is a map that contains (at minimum) an attribute named visit containing a function with the signature:
visit(node,node_metadata,all_metadata)
where:
node is the DOM node currently being visited,
node_metadata is a map containing parent, path, siblings and sib_index keys, and
all_metadata is an array of node_metadata values for each previously visited node, indexed by the value stored at node._stew_node_id.
The callbacks map may also contain an after_visit function, which is invoked with the same arguments once a node's children (if any) have been processed.
The callbacks.visit function should return a map containing continue and visit_children attributes. A missing attribute is treated as true.
When visit_children is true, the children of node (if any) will be visited next. When false, the node's children will be skipped, but processing will continue with node's siblings (or node's parent's siblings, etc.)
When continue is false, all subsequent processing will be aborted and the walk_dom method will exit as soon as possible.
If the value returned by visit is a boolean, that value will be used for both continue and visit_children.
If callbacks is a function (rather than a map) it will be used as the visit function.
  walk_dom:(dom,callbacks)->
Fiddle with the input parameters if needed.
    if typeof callbacks is 'function'
      callbacks = { visit:callbacks }
    nodes = @as_nodeset(dom)
Create a container for all the node metadata.
    dom_metadata = []
    for node, sib_index in nodes
Create the metadata for this node...
      node_metadata = { parent:null, path:[], siblings:nodes, sib_index: sib_index }
      node._stew_node_id = dom_metadata.length
...add it to the container...
      dom_metadata.push node_metadata
...visit the node...
      should_continue = @_unguarded_walk_dom(node,node_metadata,dom_metadata,callbacks)
...and exit if needed.
      if not should_continue
        break
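The control flow of the walk (skipping a subtree versus aborting entirely) can be sketched in a few lines of JavaScript. The walk and flags functions below are illustrative stand-ins written for this example: they follow the documented rules for the visit return value but omit the metadata arguments.

```javascript
// Interpret a callback's return value: a boolean applies to both flags,
// and a missing flag counts as true.
function flags(response) {
  if (typeof response === 'boolean') {
    return { cont: response, children: response };
  }
  const r = response || {};
  return { cont: r.continue !== false, children: r.visit_children !== false };
}

// Depth-first walk; returns false as soon as a callback asks to stop.
function walk(nodes, callbacks) {
  for (const node of nodes) {
    const { cont, children } = flags(callbacks.visit(node));
    if (!cont) return false;
    if (children && node.children && !walk(node.children, callbacks)) return false;
    if (callbacks.after_visit && !flags(callbacks.after_visit(node)).cont) return false;
  }
  return true;
}

const tree = [{ name: 'div', children: [
  { name: 'span', children: [{ name: 'x' }] },
  { name: 'em', children: [{ name: 'y' }] },
  { name: 'p' },
] }];

// Skip the children of em, but keep walking its siblings.
const pruned = [];
walk(tree, { visit: (n) => { pruned.push(n.name); return { visit_children: n.name !== 'em' }; } });
console.log(pruned.join(' ')); // div span x em p

// Abort the whole walk once em is reached.
const aborted = [];
const finished = walk(tree, { visit: (n) => { aborted.push(n.name); return n.name !== 'em'; } });
console.log(aborted.join(' '), finished); // div span x em false
```

In both runs the em node itself is still visited; the return value only decides what happens afterwards.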
_unguarded_walk_dom is the "inner" implementation of walk_dom. See walk_dom for more information.
node is the current DOM node to visit.
node_metadata is a map containing:
parent - the parent of this node, if any
path - an array of this node's ancestors (from "root" to parent)
siblings - an array of this node's parent's children
sib_index - the index of this node in the siblings array
dom_metadata is an array of node_metadata objects, indexed by node._stew_node_id. Only the already visited nodes are contained in this array.
callbacks is the map of callbacks passed to walk_dom, which see.
_unguarded_walk_dom will return true if processing should continue (typically with node's next sibling), or false if processing is complete and no more nodes should be visited.
  _unguarded_walk_dom:(node,node_metadata,dom_metadata,callbacks)->
Visit the current node.
    response = {'continue':true,'visit_children':true}
    if callbacks.visit?
      response = callbacks.visit(node,node_metadata,dom_metadata)
If processing should continue (a false response, or a continue value of false, stops the walk)...
    if response isnt false and (response is true or response?['continue'] is true or (not response?['continue']?))
...and this node's children should be processed...
      if node.children? and (response is true or response?['visit_children'] is true or (not response?['visit_children']?))
...create the path to this node's children...
        new_path = [].concat(node_metadata.path)
        new_path.push(node)
...and recursively visit each child in turn...
        for child,index in node.children
          new_node_metadata = { parent:node, path:new_path, siblings:node.children, sib_index: index }
          child._stew_node_id = dom_metadata.length
          dom_metadata.push new_node_metadata
          should_continue = @_unguarded_walk_dom(child,new_node_metadata,dom_metadata,callbacks)
...aborting further processing if needed.
          if not should_continue
            return false
...invoke the post-visit callback, if any...
      if callbacks['after_visit']?
        response = callbacks.after_visit(node,node_metadata,dom_metadata)
...aborting further processing if needed.
        return response isnt false and (response is true or response?['continue'] is true or (not response?['continue']?))
      else # no `after_visit` callback
        return true
    else # processing should not continue
      return false
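The node_metadata fields are easiest to understand by using them. The sketch below is a hypothetical, stripped-down walker (walkWithMetadata is not part of DOMUtil, and it omits dom_metadata and _stew_node_id) that builds the same parent, path, siblings and sib_index values and uses them to print a location for each node.

```javascript
// Hypothetical walker that builds per-node metadata in the shape described above.
function walkWithMetadata(nodes, visit, parent = null, path = []) {
  nodes.forEach((node, sib_index) => {
    visit(node, { parent, path, siblings: nodes, sib_index });
    if (node.children) {
      // The path handed to children is this node's path plus the node itself.
      walkWithMetadata(node.children, visit, node, path.concat([node]));
    }
  });
}

const tree = [{ name: 'ul', children: [{ name: 'li' }, { name: 'li' }] }];
const locations = [];
walkWithMetadata(tree, (node, meta) => {
  const ancestors = meta.path.map((n) => n.name);
  locations.push(ancestors.concat(`${node.name}[${meta.sib_index}]`).join('/'));
});
console.log(locations.join(', ')); // ul[0], ul/li[0], ul/li[1]
```

Root nodes have a null parent and an empty path, which matches the metadata that walk_dom creates for its top-level nodes.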
The DOMUtil class is exported under the name DOMUtil.
exports = exports ? this
exports.DOMUtil = DOMUtil