class DOMUtilDOMUtil provides utilty functions for working with DOM trees.
DOMUtil is designed to work with the DOM structure generated by Chris Winberry's node-htmlparser. DOMUtil doesn't have a strict dependency on node-htmlparser, but it expects DOM structures compatible with those generated by node-htmlparser. See the htmlparser documentation for more information about that format.
(Currently DOMUtil should work with any DOM format that supports the type,
name, children, attribs and raw attributes as used in htmlparser, but
that may change in future verions.)
Method names that start with _ are subject to change without notice. Other methods may be considered a part of the public API.
class DOMUtilThe DOMUtil constructor.
While the DOMUtil methods are essentially stateless (and hence thread-safe, i.e., calls to the same DOMUtil instance can be safely interleaved), DOMUtil is implemented as an instantiable class to allow for alternative configurations.
The constuctor accepts an optional params map. Currently one parameter key
is supported:
params.decode (optionally) specifies a function to use
when converting HTML text nodes into "plain text" (in the to_text and
inner_text functions). This method can be used, for example,
to decode HTML entities into their text equivalents.
By default no conversion is made, the text nodes are output in exactly
the same format as they are found in the DOM. constructor:(params = {})->
@decode = params?['decode'] ? (str)->strparse_html is a convenience function that parses a given HTML string
into one or more DOM trees using the htmlparser library (if present).
If htmlparser is not available, an error will be passed to the callback
function.
The html parameter must be a string containing one or more HTML/XML trees.
The options parameter is optional, and may contain a map of
options to pass to htmlparser.
The callback parameter should contain a function with the signature
callback(err,dom), where:
The err argument will be a non-null value if an error occurs
during the parsing.
Otherwise the dom argument will contain a single DOM object
(when there is a single root tag in the given html string) or
an array of DOM objects (when there is more than one HTML/XML
structure in the given html string).
parse_html:(html,options,callback)->The options parameter is optional, so swap options and callback if necessary.
if typeof options is 'function' and typeof callback isnt 'function'
[ options, callback ] = [ callback, options ]If we haven't yet loaded htmlparser, do so now.
unless @htmlparser?
try
@htmlparser = require 'htmlparser'
catch err
callback(err,null)
if @htmlparser?Now create a simple handler that invokes the given callback...
handler = new @htmlparser.DefaultHandler (err,domset)->
if err?
callback(err,null)
else if Array.isArray(domset) and domset.length <= 1
callback(null,domset[0])
else
callback(null,domset)...create the parser...
parser = new @htmlparser.Parser(handler,options)...and parse the HTML.
parser.parseComplete(html)as_node returns nodeset[0] if nodeset is an array, nodeset otherwise.
as_node: (nodeset)->
if Array.isArray(nodeset)
return nodeset[0]
else
return nodesetas_nodeset returns node if node is an array, [ node ] otherwise.
as_nodeset: (node)->
if Array.isArray(node)
return node
else if node?
return [node]
else
return []_kt returns true. It's the default filter for to_text.
_kt: ()->trueto_text returns a concatenation of all text nodes found
within the given DOM elt.
An optional filter parameter may contain a function with
the signature filter(node) that returns true if the
text found in or beneath the given node should be included
in the concatenation or false if the text found at or
below the given node should be excluded.
E.g., the function:
var skip_em = function(node) { return node.name != 'em' };
will cause to_text to exclude any text found within an
<em> tag.
to_text:(elt,filter = @_kt)->
buffer = ''
@walk_dom elt, visit:(node,node_metadata,all_metadata)=>If node is acceptable to filter, then append any text, and visit its children.
if(filter(node,node_metadata,all_metadata))
buffer += @decode(node.raw) if node?.type is 'text' and node?.raw?
return {'continue':true,'visit_children':true}
elseIf node is not acceptable to filter, then skip it and its children.
return {'continue':true,'visit_children':false}
return bufferinner_text is an alias for to_text (which see).
inner_text:(elt,filter)->@to_text(elt,filter)to_html returns an HTML string representation of
the given elt and its children (if any).
(Currently only text and tag node types are converted,
but that may change in the future.)
to_html:(elt)->
buffer = ''
@walk_dom elt, {When visiting a node...
visit:(node)->
switch node.type...concat the value of text nodes.
when 'text'
buffer += node.raw...concat the name and attributes of tag nodes.
when 'tag'
buffer += "<#{node.name}"
if node.attribs?
for name,value of node.attribs
buffer += " #{name}=\"#{value}\""
buffer += ">"
return trueafter_visiting a node...
after_visit:(node)->
switch node.type...concat the "end tag" for tag nodes.
when 'tag'
buffer += "</#{node.name}>"
return true
}
return bufferinner_html returns an HTML string representation of
the the children (if any) of the given elt.
(Otherwise it behaves just like to_html, which see.)
inner_html:(elt)->
buffer = nullIf elt is an array, invoke to_html on the children of each element of in the array.
if Array.isArray(elt)
buffer = ''
for node in elt
if node.children?
buffer += @to_html(node.children)Otherwise to_html on the childen of elt.
else if elt?.children?
buffer = @to_html elt.children
return bufferwalk_dom performs a depth-first walk of the given DOM tree (or trees), invoking a specified "visit" function for each node.
The dom parameter is either a single DOM node or an array of DOM nodes.
The callbacks parameter is a map that contains (at minimum) an
attribute named visit containing a function with the signature:
visit(node,node_metadata,all_metadata)
where:
node is the DOM node currently being visited,node_metadata is a map containing parent, path, siblings
and sib_index keys, andall_metadata is an array of node_metadata values
for each previously visited nodes, indexed by the value
stored at node._stew_node_id.The callbacks.visit function should return a map containing
continue and visit-children attributes.
When visit-children is true, the children of
node (if any) will be visited next. When false,
the node's children will be skipped, but processing
will continue with node's siblings (or node's
parent's, siblings, etc.)
When continue is false, all subsequent processing
will be aborted and the walk_dom method will exit
as soon as possible.
If the value returned by visit is a boolean, that
value will be used for both continue and visit-children.
If callbacks is a function (rather than a map) it be
used as the visit function.
walk_dom:(dom,callbacks)->Fiddle with the input parameters if needed.
if typeof callbacks is 'function'
callbacks = { visit:callbacks }
nodes = @as_nodeset(dom)Create a container for all the node metadata.
dom_metadata = []
for node, sib_index in nodesCreate the metadata for this node...
node_metadata = { parent:null, path:[], siblings:nodes, sib_index: sib_index }
node._stew_node_id = dom_metadata.length...add it to the container...
dom_metadata.push node_metadata...visit the node...
should_continue = @_unguarded_walk_dom(node,node_metadata,dom_metadata,callbacks)...and exit if needed.
if not should_continue
break_unguarded_walk_dom is the "inner" implementation of walk_dom.
See walk_dom for more information
node is the current DOM node to visit.node_metadata is a map containing:parent - the parent of this node, if anypath - an array of this node's ancestors (from "root" to parent)siblings - an array of this node's parent's childrensib_index - the index of this node in the siblings arraydom_metadata is an array of node_metadata objects, indexed by
node._stew_node_id. Only the already visited nodes are contained
in this array.callbacks is the map of callbacks passed to walk_dom, which see._unguarded_walk_dom will return true if processing should continue
(typically with node's next sibling), or false if processing is
complete an no more nodes should be visited.
_unguarded_walk_dom:(node,node_metadata,dom_metadata,callbacks)->Visit the current node.
response = {'continue':true,'visit_children':true}
if callbacks.visit?
response = callbacks.visit(node,node_metadata,dom_metadata)If processing should continue...
if response is true or response?['continue'] is true or (not response?['continue']?)...and this node's children should be processed...
if node.children? and (response is true or response?['visit_children'] is true or (not response?['visit_children']?))...create the path to this node's children...
new_path = [].concat(node_metadata.path)
new_path.push(node)...and recursively visit each child in turn...
for child,index in node.children
new_node_metadata = { parent:node, path:new_path, siblings:node.children, sib_index: index }
child._stew_node_id = dom_metadata.length
dom_metadata.push new_node_metadata
should_continue = @_unguarded_walk_dom(child,new_node_metadata,dom_metadata,callbacks)...aborting further processing if needed.
if not should_continue
return false...invoke the post-visit callback, if any...
if callbacks['after_visit']?
response = callbacks.after_visit(node,node_metadata,dom_metadata)...aborting further processing if needed.
return response is true or response?['continue'] is true or (not response?['continue']?)
else # no `after_visit` callback
return true
else # processing should not continue
return falseThe DOMUtil class is exported under the name DOMUtil.
exports = exports ? this
exports.DOMUtil = DOMUtil