Hacking Stew


Stew is a JavaScript library, implemented in CoffeeScript. It is primarily intended to be used in a Node.js environment (see note 1).

Both the (original) CoffeeScript and (generated) JavaScript files are included in the binary distribution, so clients can use whichever they prefer.

Stew's source code is hosted at github.com/rodw/stew. Any issues or pull-requests you'd like to submit are appreciated.

Stew is published under an MIT license.

This document provides information that is primarily of interest to those who want to make changes to Stew. Most clients (users) of the Stew library will be more interested in the README file.

How it Works

Stew is partitioned into three classes: DOMUtil, PredicateFactory and Stew.

Stew is the real driver behind the API, parsing CSS selector expressions and collecting matching nodes from the DOM tree.

PredicateFactory defines methods that implement individual CSS selection rules.

DOMUtil provides fairly generic utility methods for working with DOM structures.

We'll cover those bottom-up, from the most generic to the most specific.

DOMUtil

DOMUtil provides generic utilities for working with the DOM (Document Object Model) structure generated by node-htmlparser.

For our purposes, the most important of these utilities is the walk_dom method, which implements a depth-first walk of a given DOM tree. walk_dom will invoke the given visit (callback) method for every node in the DOM.

For example, to convert a DOM structure into text, we might create a visit method like this:

var buffer = "";
// Append the raw content of every text node to the buffer.
var visit = function(node,node_metadata,all_metadata) {
  if(node.type === 'text') {
    buffer = buffer + node.raw;
  }
  return true;
};

and invoke it like this:

domutil.walk_dom(dom,visit);
console.log("The text was");
console.log(buffer);

Stew uses DOMUtil.walk_dom to traverse the DOM tree.

(node_metadata and all_metadata contain metadata about the current node and about all previously visited nodes, respectively. For example, node_metadata.parent contains the parent of the current node and node_metadata.siblings contains an array of all of node_metadata.parent's children. See the comments in dom-util.coffee for more detail.)
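To make the idea of a depth-first walk with per-node metadata concrete, here is a minimal sketch. It is not Stew's actual walk_dom: the function name walkDomSketch is hypothetical, the all_metadata argument is omitted, and it assumes that a truthy return value from visit means "continue into this node's children". The node shape (type, name, raw, children) is assumed to follow node-htmlparser's output.

```javascript
// Hypothetical sketch of a depth-first DOM walk; NOT Stew's real walk_dom.
// Assumes node-htmlparser-style nodes: { type, name, raw, children }.
function walkDomSketch(nodes, visit, parent) {
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    // Metadata for the current node: its parent and the array it lives in.
    var nodeMetadata = { parent: parent || null, siblings: nodes };
    // Assumption: a truthy result means "descend into the children".
    var descend = visit(node, nodeMetadata);
    if (descend && node.children) {
      walkDomSketch(node.children, visit, node);
    }
  }
}

// A tiny hand-built DOM equivalent to <p>Hello, <b>world</b>!</p>
var sampleDom = [
  { type: 'tag', name: 'p', children: [
    { type: 'text', raw: 'Hello, ' },
    { type: 'tag', name: 'b', children: [ { type: 'text', raw: 'world' } ] },
    { type: 'text', raw: '!' }
  ] }
];

var collected = '';
walkDomSketch(sampleDom, function (node, nodeMetadata) {
  if (node.type === 'text') {
    collected += node.raw;
  }
  return true;
});
console.log(collected); // prints "Hello, world!"
```

Because children are visited before later siblings, the text comes out in document order.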

See the annotated source for more detail.

PredicateFactory

PredicateFactory generates predicate functions that test whether a given node matches a specific CSS selector.

For example, the "universal selector" (*) matches any and every "tag" node. Here's a predicate function that implements the * selector:

function universal_selector_predicate(node) {
  return node.type === 'tag';
}

Here's a predicate that implements a "tag" selector, selecting all tags with the type (name) foo:

function foo_tag_predicate(node) {
  return node.type === 'tag' && node.name === 'foo';
}

PredicateFactory methods generate functions like these (bound to particular input parameters such as tag or attribute names).

PredicateFactory includes generators for each of the core CSS selectors (tag, ID, class, attribute name and attribute value) as well as combinators such as "and" (no space), "or" (,), "descendant" (space), "direct descendant" (>), "adjacent sibling" (+).
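As a rough illustration of "generating" predicates bound to particular parameters, the sketch below uses closures. The function names (byTag, byClass, andAll, orAny) are hypothetical and are not PredicateFactory's actual method names; the attribs property is assumed to follow node-htmlparser's node structure.

```javascript
// Hypothetical predicate generators; names are illustrative only and are
// NOT PredicateFactory's real API.
function byTag(name) {
  return function (node) {
    return node.type === 'tag' && node.name === name;
  };
}

function byClass(className) {
  return function (node) {
    // Assumes attributes live in node.attribs, as in node-htmlparser output.
    var classes = (node.attribs && node.attribs['class']) || '';
    return node.type === 'tag' && classes.split(/\s+/).indexOf(className) !== -1;
  };
}

// "and" combinator: every predicate must accept the node.
function andAll() {
  var predicates = Array.prototype.slice.call(arguments);
  return function (node) {
    return predicates.every(function (p) { return p(node); });
  };
}

// "or" combinator: at least one predicate must accept the node.
function orAny() {
  var predicates = Array.prototype.slice.call(arguments);
  return function (node) {
    return predicates.some(function (p) { return p(node); });
  };
}

// Roughly what the selector "ul.links" might turn into.
var ulLinks = andAll(byTag('ul'), byClass('links'));
var listNode = { type: 'tag', name: 'ul', attribs: { 'class': 'nav links' } };
console.log(ulLinks(listNode)); // prints true
```

Each generator returns a plain function of a node, so combinators can nest freely, for example orAny(ulLinks, byTag('ol')).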

Stew uses these predicates to implement the CSS selection logic.

See the annotated source for more detail.

Stew

Stew is the main entry point for the overall library. Stew parses a String representation of a CSS Selector, generates the appropriate predicates (using PredicateFactory) and then processes the DOM tree (using DOMUtil) to select the matching nodes.

The CSS parsing is primarily accomplished via regular expressions. This is a multi-step process.

For example, let's assume a complicated CSS expression such as:

'div#main .sidebar ul.links li:first-child a[rel="author"][href]'
  1. The expression is split into individual selectors by _parse_selectors using _SPLIT_ON_WS_REGEXP. Naively, this is the same as splitting the expression on white-space characters, but we also need to take into account the use of spaces within "quoted strings" and /regular expressions/ and non-whitespace delimiters like , or +. In our example, we obtain these five tokens:

    [ 'div#main',  '.sidebar',  'ul.links',  'li:first-child', 'a[rel="author"][href]' ]
  2. Each of these tokens is then parsed into one or more CSS-specific selectors by _parse_selector using _CSS_SELECTOR_REGEXP (and where needed, _ATTRIBUTE_CLAUSE_REGEXP). For example, from the first token (div#main) we identify two individual predicates, one that implements "tag name is div" and another that implements "node id is main". These two predicates are then joined by an "and" predicate. Altogether, these five tokens are converted into predicates (something) like these:

    1. div#main becomes and( tag_name_is_div(), node_id_is_main() )
    2. .sidebar becomes class_name_is_sidebar()
    3. ul.links becomes and( tag_name_is_ul(), class_name_is_links() )
    4. li:first-child becomes and( tag_name_is_li(), tag_is_parents_first_child() )
    5. a[rel="author"][href] becomes and( tag_name_is_a(), rel_attr_is_author(), has_href_attr() )
  3. Back in _parse_selectors, these five predicates are joined into a "descendant selector" predicate, yielding a single predicate that returns true if and only if the current node matches the complete CSS expression.
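
The first step above (splitting on white-space while respecting quoted strings) can be sketched with a small character scanner. This is only an illustration of the problem that _SPLIT_ON_WS_REGEXP solves, not the regular expression Stew actually uses: the function name splitSelectorSketch is hypothetical, and the sketch ignores /regular expressions/, escaped quotes, and the , > + delimiters.

```javascript
// Hypothetical tokenizer: splits on white-space that is outside of quotes
// and outside of [attribute] brackets. NOT Stew's _SPLIT_ON_WS_REGEXP.
function splitSelectorSketch(expression) {
  var tokens = [];
  var current = '';
  var quote = null; // the quote character we are inside of, if any
  var depth = 0;    // how many [ ... ] brackets are currently open
  for (var i = 0; i < expression.length; i++) {
    var ch = expression.charAt(i);
    if (quote) {
      // Inside a quoted string: keep everything, watch for the closing quote.
      if (ch === quote) quote = null;
      current += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      current += ch;
    } else if (/\s/.test(ch) && depth === 0) {
      // Unquoted, unbracketed white-space ends the current token.
      if (current) tokens.push(current);
      current = '';
    } else {
      if (ch === '[') depth++;
      if (ch === ']') depth--;
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

var tokens = splitSelectorSketch(
  'div#main .sidebar ul.links li:first-child a[rel="author"][href]'
);
console.log(tokens.length); // prints 5
console.log(splitSelectorSketch('a[title="hello world"] b'));
// prints [ 'a[title="hello world"]', 'b' ]
```

The second call shows why a plain split on white-space is not enough: the space inside the quoted attribute value must stay within its token.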

CSS-Selector-implementing predicate in hand, Stew's select method then visits every node in the DOM tree, collecting each node that matches the predicate.
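
The following sketch shows one way a "descendant selector" predicate and a collecting walk could fit together. It is an assumption-laden illustration rather than Stew's implementation: the names descendantSketch, selectSketch and tag are hypothetical, the predicate receives an explicit array of ancestors (root first) instead of Stew's metadata objects, and nodes are assumed to be shaped like node-htmlparser output.

```javascript
// Hypothetical sketch; Stew's real select method and descendant predicate
// may work differently.
function descendantSketch(predicates) {
  // Matches when `node` satisfies the last predicate and its ancestors,
  // scanned from nearest to farthest, satisfy the remaining predicates
  // in right-to-left order.
  return function (node, ancestors) {
    var last = predicates.length - 1;
    if (!predicates[last](node)) return false;
    var next = last - 1;
    for (var i = ancestors.length - 1; i >= 0 && next >= 0; i--) {
      if (predicates[next](ancestors[i])) next--;
    }
    return next < 0;
  };
}

// Visit every node depth-first, collecting those accepted by the predicate.
function selectSketch(nodes, predicate, ancestors, found) {
  ancestors = ancestors || [];
  found = found || [];
  nodes.forEach(function (node) {
    if (predicate(node, ancestors)) found.push(node);
    if (node.children) {
      selectSketch(node.children, predicate, ancestors.concat([node]), found);
    }
  });
  return found;
}

// Illustrative tag-name predicate generator.
function tag(name) {
  return function (node) { return node.type === 'tag' && node.name === name; };
}

var tree = [
  { type: 'tag', name: 'ul', children: [
    { type: 'tag', name: 'li', children: [
      { type: 'tag', name: 'a', attribs: { href: '/in' } }
    ] }
  ] },
  { type: 'tag', name: 'a', attribs: { href: '/out' } }
];

// Roughly the selector "ul a": only the link nested inside the list matches.
var hits = selectSketch(tree, descendantSketch([tag('ul'), tag('a')]));
console.log(hits.map(function (n) { return n.attribs.href; })); // prints [ '/in' ]
```

Scanning ancestors nearest-first is sufficient for plain descendant chains; combinators such as > or + would need additional bookkeeping.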

See the annotated source for more detail.

Unit Tests

The ./test directory contains unit tests for each of these types. These tests can be executed by running

make test

or

npm test

The test-coverage report identifies the lines of code (see note 2) that are exercised by the test suite. This report can be generated by running:

make coverage

How you can help.

Your contributions, bug reports and pull-requests are greatly appreciated.

Areas that need work.

If you're looking for areas in which to contribute, here are a few ideas:

How to contribute.

We're happy to accept any help you can offer, but the following guidelines can help streamline the process for everyone.

Please Note: We'd rather have a contribution that doesn't follow these guidelines than no contribution at all. If you are confused or put-off by any of the above, your contribution is still welcome. Feel free to contribute or comment in whatever channel works for you.

Nuts and Bolts

Run-time Dependencies

Technically Stew doesn't have any run-time dependencies. No external libraries are required.

Practically speaking, Stew depends upon Chris Winberry's node-htmlparser. Stew assumes the structure of the DOM object passed to select and similar methods is compatible with that generated by node-htmlparser.

If node-htmlparser is available (via a require call) then some (optional) DOMUtil methods will make use of it.
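
An optional dependency of this kind is commonly handled by wrapping the require call in a try/catch. The sketch below shows that general pattern only; it is not taken from Stew's source, the helper name optionalRequire is hypothetical, and 'htmlparser' is assumed to be the npm package name for node-htmlparser.

```javascript
// Hypothetical helper illustrating an optional dependency; not Stew's code.
// Returns the module if it can be loaded, or null if it is not installed.
function optionalRequire(name) {
  try {
    return require(name);
  } catch (err) {
    // Only swallow "module not found"; re-throw genuine load errors.
    if (err && err.code === 'MODULE_NOT_FOUND') return null;
    throw err;
  }
}

// Assumption: node-htmlparser is published on npm as "htmlparser".
var htmlparser = optionalRequire('htmlparser');
if (htmlparser === null) {
  console.log('htmlparser not installed; parsing helpers are unavailable');
}
```

Code that needs the parser can then check for null before using it, while everything that only works on an existing DOM structure keeps running.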

Stew makes use of several libraries to support development, documentation and testing. These are enumerated in the package.json file.

Building and Testing

Downloading

You can clone Stew's Git repository via:

git clone git@github.com:rodw/stew.git

You can also download a ZIP archive of the latest source.

Installing

Once you have Stew cloned into a local working directory, you can use npm to install any build-time dependencies, as follows:

npm install

(This may take a few minutes, as some external libraries may need to be downloaded and natively compiled.)

Testing

Once installed, you can also run Stew's unit test suite using npm:

npm test

If everything is working properly, you should expect to see a message like 68 tests complete (633 ms) (although the specific numbers might be different, of course).

Compiling the CoffeeScript files into JavaScript

You can run

npm run-script compile

to generate JavaScript files from the CoffeeScript files in ./lib.

Using Make

If you have GNU Make installed, the best and easiest way to work with Stew's source code is using the provided makefile.

Installing and Testing

You can use:

make install

and:

make test

and:

make js

in place of the npm equivalents above, but the makefile can help you to do much more than that.

Generating Documentation

make markdown will generate Stew's HTML documentation from various Markdown files in the repository. Most of these files will be written to the ./docs directory. Note that the Makefile uses Pandoc to generate HTML from the Markdown sources, but in theory other Markdown processors could be used.

make docco will generate an annotated version of Stew's source code using the nifty Docco documentation generator. These files will be written to ./docs/docco/.

make docs will do both of these at once.

Test Coverage

make coverage will generate a report that shows which source code lines are touched (and not touched) by the test suite. This runs the same unit tests as make test, but uses JSCoverage to evaluate the test coverage. The coverage report is written to ./docs/coverage.html.

npm Packaging

make module will generate a package suitable for distribution via npm (into a directory called ./module).

make test-module-install will generate the ./module directory and then validate it by trying to install it into a temporary directory. You should expect to see It worked! as the last line of output.

Some other targets

make clean will remove various generated files.

make todo will display a list of "TODO" and related comments found in the source code.

make targets will list all available targets.


  1. Although it probably wouldn't be difficult to make Stew work in a browser context, we haven't had any need for that, and so we haven't (yet) attempted to do it. Drop us a note if this is something you'd like to see Stew support.

  2. The generated JavaScript code, not source CoffeeScript, for better or worse.