This document contains Java code fragments that illustrate the use of
XQL and the PDOM. They are provided to help you get started quickly.
The authoritative source of documentation is, of course, the API docs.
Helper functions, e.g. for printing XML or creating a DOM from an XML
file, are in the new de.gmd.ipsi.domutil package, described in the API docs.
A brief introduction to configuring the XQL/PDOM Servlet is also included.
This section illustrates the use of the XQL query language through the Java 1.1
API of the engine. It first shows how simple queries can be evaluated
against arbitrary DOMs, then how users can extend the XQL language to fit
their needs.
Querying the W3C-DOM
The engine offers three different interfaces to apply XQL queries to DOM
trees. You may
- request a result object containing the results and iterate over them,
- create a new Document containing the results,
- print the results into a stream as XML, using the XMLWriter utility class.
The query processor only uses the W3C-DOM interfaces and is expected
to work with any conformant implementation.
Variant 1 illustrates the use of XQLResult objects to
store the results of the query. One object may store the results of
different queries and thereby aggregate them.
Variant 2 shows how a user-provided, empty DOM can be populated directly
with the results of a query. The two DOMs do not have to be instances of the
same implementation, so this method can be used to copy results
between different vendors' DOMs.
Variant 3 demonstrates three different techniques. First, it introduces the
XQLQuery object, which is used to store preparsed queries.
Second, it shows how an empty Document is populated directly with the
results of a query. Third, it shows how the results of
different queries can be aggregated into a single result DOM Document.
This Document is written to an output stream using the XMLWriter
utility class, which takes care of the proper XML tagging.
import de.gmd.ipsi.xql.*;
import de.gmd.ipsi.domutil.*;
import org.w3c.dom.*;
import java.io.*;
// Input: A Document created by your favorite DOM implementation
Document doc = DOMUtil.createDocument();
//
// Variant 1: Using result objects to process query results
//
XQLResult r = XQL.execute("//paragraph", doc);
for (int i = 0; i < r.getLength(); i++) {
    Element para = (Element) r.getItem(i);
    // Do something useful...
}
// Remember that query results are not always element nodes
r = XQL.execute("//paragraph!text()", doc);
for (int i = 0; i < r.getLength(); i++) {
    String s = (String) r.getItem(i);
    // Do something useful...
}
//
// Variant 2: Creating a new DOM from query results
//
Document resultdoc = ...;
XQL.execute("//a $union$ //b", doc, resultdoc);
//
// Variant 3: Aggregate results and prettyprint them in XML
//
resultdoc = null; // Get rid of last result
resultdoc = DOMUtil.createDocument(); // Start with a fresh, empty result Document
Element root = resultdoc.createElement("myroot"); // Create a document root
resultdoc.appendChild(root);
XQL.execute( "//some/query", doc, root);
XQL.execute( "//another/query", doc, root);
XQL.execute( "//yet/another/one", doc, root);
XMLWriter out = new XMLWriter(System.out, "ISO-8859-1");
out.formatOutput(true); // prettyprinting wanted
out.write( resultdoc );
Customized Relationship Operators
The XQL class lets you register user-defined relationship operators,
e.g. comparisons. The example illustrates the implementation of the
$contains$ operator, which checks whether the right-hand-side argument
is a substring of the left-hand-side argument.
The new operator may be used in all queries submitted after its registration, e.g.
//article/title $contains$ "XML" would be possible.
import de.gmd.ipsi.xql.*;
XQL.addRelationship("$contains$",
    new XQLRelationship() {
        public boolean holdsBetween(Object l, Object r) {
            return XQL.text(l).indexOf(XQL.text(r)) >= 0;
        }
    });
Customized Methods
The XQL class lets you register user-defined methods,
e.g. arithmetic operations. The example illustrates the implementation of the
plus method, which returns the value of the selected object incremented
by the value given as the method's parameter.
The new method may be used in all queries submitted after its registration, e.g.
//number!plus(1) would be possible.
import de.gmd.ipsi.xql.*;
XQL.addNodeFunction("plus",
    new XQLNodeFunction() {
        public Object call(int refNodeIndex, XQLResult refNodeSet, Object[] args) {
            Object val = refNodeSet.getItem(refNodeIndex);
            if (args.length == 0) return val;
            return new Double(XQL.number(val) + XQL.number(args[0]));
        }
    });
Customized Collection Functions
The XQL class lets you register user-defined collection functions,
e.g. for aggregation. The example illustrates the implementation of the
ELEMENT function, which returns the set of elements that are children of
the current reference node/object and whose names consist only of uppercase letters.
If the current set of reference nodes is empty, or the reference object is not an
org.w3c.dom.Node, null is returned, which behaves exactly like returning an
empty result set.
The new collection function may be used in all queries submitted after
its registration, e.g.
//foo!ELEMENT() would be possible.
// 1. CollectionFunction: 'ELEMENT'
import de.gmd.ipsi.xql.*;
import org.w3c.dom.*;
XQL.addCollectionFunction("ELEMENT",
    new XQLCollectionFunction() {
        public XQLResult call(int refNodeIndex, XQLResult refNodeSet, Object[] args) {
            if (refNodeSet.getLength() == 0) // No reference node!
                return null;
            // Build a result from those children of the reference node that
            // are element nodes with an all-uppercase name
            XQLResult xres = new XQLResult();
            Object val = refNodeSet.getItem(refNodeIndex);
            if (!(val instanceof Node)) return null;
            Node n = (Node) val;
            Node c = n.getFirstChild();
            while (c != null) {
                String name = c.getNodeName();
                if ((c instanceof Element)
                        && name.equals(name.toUpperCase())) {
                    xres.add(c);
                }
                c = c.getNextSibling();
            }
            return xres;
        }
    });
The PDOM class lets you generate binary, indexed files containing a persistent
W3C-DOM. A PDOM file offers all DOM operations immediately,
without the cost of parsing XML or building an in-memory DOM representation first.
Combined with servlets and XQL, PDOM files offer an efficient way to serve XML
fragments from large documents. A PDOM file may be created from any XML file or
programmatically using W3C-DOM methods.
When creating PDOM files from XML files, SAX events are used
to communicate with the XML parser. Because the SAX API is event based, a full
representation of your XML file never has to be held in main memory. As a result, the
size of a PDOM file is limited only by disk space, not by main memory.
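The memory argument can be illustrated with a small sketch. It uses the SAX parser that ships with the JDK, not the xml4j2-based parser used by the PDOM, and the class name SaxElementCounter is made up for this illustration. The handler sees one event at a time, so the document is processed without a tree ever being built.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; uses the JDK's SAX parser, not the PDOM's parser.
class SaxElementCounter {

    /** Counts start-element events; no tree is ever built in memory. */
    static int countElements(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++; // one event per element, then the parser moves on
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        } catch (Exception e) { // parser configuration, SAX or I/O failure
            throw new IllegalStateException("parsing failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // root, a, b and c each produce one start-element event
        System.out.println(countElements("<root><a/><b><c/></b></root>")); // prints 4
    }
}
```

A PDOM builder would presumably write nodes to disk pages inside such callbacks instead of counting them; that part is not shown here.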
The de.gmd.ipsi.pdom.PDocument class implements org.w3c.dom.Document,
so the PDOM may be used anywhere a W3C-compliant DOM implementation is needed. As the PDOM
API supports all methods of the W3C-DOM, including updates and inserts, programmatic creation
and modification of PDOM files is possible.
Overview of the PDOM Features
Caching:
A PDOM file is organized in pages, each containing 128 DOM nodes of variable length.
When a PDOM Node is accessed by a W3C-DOM method, the containing page is loaded into a
main memory cache.
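A page cache of this kind can be sketched with the JDK's LinkedHashMap. This is only an illustration, not the PDOM's actual cache: the class name LruPageCache and its methods are hypothetical, while the default of 100 pages, the lower bound of 20 pages and the LRU policy are taken from the description of the cache in this section.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an LRU page cache; not the PDOM's actual cache class.
class LruPageCache {
    static final int MIN_PAGES = 20;      // lower bound named in the text
    static final int DEFAULT_PAGES = 100; // default size named in the text

    private int maxPages = DEFAULT_PAGES;

    // accessOrder = true keeps entries ordered from least to most recently used
    private final LinkedHashMap<Integer, byte[]> pages =
        new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxPages; // displace the LRU page when full
            }
        };

    /** Resizes the cache; requests below the minimum are raised to it. */
    void setMaxPages(int requestedPages) {
        maxPages = Math.max(MIN_PAGES, requestedPages);
        Iterator<Integer> it = pages.keySet().iterator();
        while (pages.size() > maxPages && it.hasNext()) {
            it.next();
            it.remove(); // drop least recently used pages first
        }
    }

    int getMaxPages() { return maxPages; }
    int size() { return pages.size(); }
    boolean contains(int pageNo) { return pages.containsKey(pageNo); }
    void put(int pageNo, byte[] page) { pages.put(pageNo, page); }
    byte[] get(int pageNo) { return pages.get(pageNo); } // counts as a use

    public static void main(String[] args) {
        LruPageCache cache = new LruPageCache();
        cache.setMaxPages(5);                 // too small, raised to 20
        for (int p = 0; p < 20; p++) cache.put(p, new byte[0]);
        cache.get(0);                         // page 0 becomes most recently used
        cache.put(20, new byte[0]);           // cache is full: page 1 is displaced
        System.out.println(cache.contains(0) + " " + cache.contains(1)); // prints "true false"
    }
}
```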
Starting with a default cache size of 100 pages (12,800 DOM nodes), the main
memory cache can be resized at any time. It will, however, never shrink below 20 pages
(2,560 DOM nodes). It is recommended to use the largest cache size your machine's main
memory can hold without swapping, as a larger cache improves overall PDOM performance.
The same cache is shared by all PDOM documents opened with the same instance of the PDOM
engine. The caching strategy used is "least recently used" (LRU).
Defragmentation:
When a node is programmatically inserted, updated or deleted by W3C-DOM methods, the page
containing the node is invalidated ("dirty page"). If a dirty page is displaced from the cache,
the modified page is appended to the end of the PDOM file. A PDOM file therefore grows during write
operations, as the file space occupied by invalidated pages is not removed or reused automatically.
Note, however, that just reading or querying a PDOM file will
never change the file size.
The PDOM file can be defragmented at any time by removing unused pages. During this operation
a temporary file containing only valid pages is created, and finally the fragmented
PDOM file is replaced with the unfragmented copy. The directory in which the temporary file
is created can be configured. The slack ratio, i.e. the wasted file space
divided by the physical file size, can be accessed by user applications. It is a
double between 0.0 and 1.0. It is up to the user
application to start a defragmentation, typically when the slack ratio grows beyond a tolerable
mark.
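The interplay of appended dirty pages, slack ratio and defragmentation can be modelled in a few lines. The sketch below is a toy model under simplifying assumptions (all pages have the same size, and only page numbers are stored); the class AppendOnlyPageFile and its methods are hypothetical and not part of the PDOM API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an append-only page file; hypothetical, not the PDOM file format.
class AppendOnlyPageFile {
    // page number stored in each file slot, in file order
    private final List<Integer> slots = new ArrayList<>();
    // page number -> slot that holds the currently valid copy
    private final Map<Integer, Integer> index = new HashMap<>();

    /** Writing a page always appends; an older copy becomes wasted space. */
    void writePage(int pageNo) {
        index.put(pageNo, slots.size());
        slots.add(pageNo);
    }

    int slotCount() { return slots.size(); }

    /** Wasted slots divided by all slots, between 0.0 and 1.0. */
    double slack() {
        if (slots.isEmpty()) return 0.0;
        return (slots.size() - index.size()) / (double) slots.size();
    }

    /** Copies only the valid pages, as the temporary file would. */
    void defragment() {
        List<Integer> valid = new ArrayList<>();
        Map<Integer, Integer> newIndex = new HashMap<>();
        for (int slot = 0; slot < slots.size(); slot++) {
            int pageNo = slots.get(slot);
            if (index.get(pageNo) == slot) { // this slot holds the valid copy
                newIndex.put(pageNo, valid.size());
                valid.add(pageNo);
            }
        }
        slots.clear();
        slots.addAll(valid);
        index.clear();
        index.putAll(newIndex);
    }

    public static void main(String[] args) {
        AppendOnlyPageFile file = new AppendOnlyPageFile();
        for (int p = 0; p < 4; p++) file.writePage(p); // 4 slots, all valid
        file.writePage(1);                 // dirty page 1 appended: 1 of 5 slots wasted
        System.out.println(file.slack()); // prints 0.2
        if (file.slack() > 0.1) file.defragment();
        System.out.println(file.slotCount() + " " + file.slack()); // prints "4 0.0"
    }
}
```

The threshold of 0.1 in the usage example is arbitrary; the text leaves the choice of a tolerable slack ratio to the application.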
Full garbage collection:
Defragmentation works on a per-page basis and does not free space occupied
by DOM nodes that have been deleted within pages. To free this space as well, a full garbage
collection is required. To avoid dangling object references, a garbage collection is only safe
if the PDOM file is not opened by another PDOM engine and no PDocument object is currently bound
to the PDOM file. This also applies to any child nodes of the PDocument that may still be in main
memory from previous operations. It is the duty of the user application to enforce these conditions;
otherwise you risk garbling the PDOM file. Full garbage collection includes defragmentation.
Commit points:
At any time, a user application performing update, delete or insert operations on a PDOM can decide
to commit the current state of the PDOM. The commit operation writes the main file index, normally
maintained in main memory, back to disk. If the user application crashes, e.g. because of a
"disk full" error, the PDOM will, when re-opened, be in the state it was in immediately before the
last successful commit operation. Great care was taken to ensure file consistency even after crashes.
There is, however, a minimal chance of corrupting a file if the user application dies during
a commit operation. Keep in mind that the PDOM does not try to be a fully fledged database.
Compression with gzip:
Optionally, a PDOM file can be compressed on the fly using the gzip algorithm. This results
in smaller files, usually about half the size of an uncompressed PDOM file. The tradeoff is
speed: with a compressed PDOM file, reading and writing pages usually takes about
20% longer. Compression is a one-time decision taken when the PDOM file is created; an existing
file cannot be compressed later. All operations that open PDOM files recognize compression
automatically and handle it transparently, so user applications never have to know or care about
compression when dealing with existing PDOM files.
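How compression can stay invisible to the caller is sketched below with the JDK's java.util.zip classes. The sketch makes assumptions that the text does not confirm: that data is compressed page by page, and that compressed data is recognized by the two gzip magic bytes. The class GzipPageDemo and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; how the PDOM really stores compressed pages is not documented here.
class GzipPageDemo {

    /** Compresses one page with gzip. */
    static byte[] compress(byte[] page) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
                gz.write(page);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Every gzip stream starts with the magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    /** Returns the page content, decompressing only if necessary. */
    static byte[] read(byte[] stored) {
        if (!looksGzipped(stored)) return stored;
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(stored))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) out.write(chunk, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) sb.append("<title>Hello</title>");
        byte[] raw = sb.toString().getBytes();
        byte[] stored = compress(raw);
        // The caller uses read() the same way for both representations.
        System.out.println(Arrays.equals(read(stored), raw)); // prints true
        System.out.println(Arrays.equals(read(raw), raw));    // prints true
    }
}
```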
Multithreaded access:
The same PDOM file can be read by multiple threads in parallel without problems.
Update operations block read and write operations of other threads, so all atomic operations
on a PDOM file are thread safe. However, composite update operations (e.g. reading a node, modifying
it and writing it back to the PDOM) suffer from the well-known transaction difficulties. To
ensure atomicity of complex updates, the application has to synchronize the critical block of code
on the PDocument object.
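The following sketch shows the pattern of locking a read-modify-write sequence on the document object. It uses the DOM implementation bundled with the JDK as a stand-in, because only the org.w3c.dom interfaces are needed; with a PDOM the lock object would be the PDocument. The class and method names are made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo; the JDK's DOM stands in for a PDocument.
class SynchronizedDomUpdate {

    /** Creates <counter>0</counter>. */
    static Document newCounterDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("counter");
            root.appendChild(doc.createTextNode("0"));
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Read, modify and write back as one critical section. */
    static void increment(Document doc) {
        synchronized (doc) {
            Node text = doc.getDocumentElement().getFirstChild();
            int value = Integer.parseInt(text.getNodeValue());
            text.setNodeValue(String.valueOf(value + 1));
        }
    }

    static int read(Document doc) {
        synchronized (doc) {
            return Integer.parseInt(
                doc.getDocumentElement().getFirstChild().getNodeValue());
        }
    }

    /** Runs several threads that all increment the same document. */
    static int runConcurrently(int threads, int incrementsPerThread) {
        Document doc = newCounterDocument();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) increment(doc);
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted", e);
            }
        }
        return read(doc);
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(4, 250)); // prints 1000
    }
}
```

Without the synchronized block in increment, two threads could read the same value and one increment would be lost, which is the transaction difficulty the text refers to.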
Creating a PDOM file
There are two ways to create a PDOM file: either by writing an in-memory DOM to disk
or by creating it from an XML InputStream.
Variant 1 demonstrates the creation of a PDOM file from an in-memory
instance of another DOM. Any W3C-DOM implementation can be used. The example
uses the gzip compression option to create a compressed PDOM file.
Variant 2 demonstrates the creation of a PDOM file from a plain XML file.
The built-in validating SAX parser, which extends xml4j2's
com.ibm.xml.parsers.SAXParser, is used. As we decide to use validation,
it is feasible to suppress ignorable whitespace. This avoids a lot of
unnecessary Text nodes holding only whitespace, resulting in
a smaller, faster PDOM file.
import de.gmd.ipsi.pdom.*;
import de.gmd.ipsi.domutil.*;
import org.w3c.dom.Document;
import java.io.FileInputStream;
//
// Variant 1: Writing an in-memory DOM Document to disk
//
// A Document created by your favorite DOM implementation
Document in_memory_doc = DOMUtil.createDocument();
PDOM.writeDOMFile(
    "mydoc.pdom",
    in_memory_doc,
    true // false = no gzip compression, true = create gzipped PDOM
);
//
// Variant 2: Create a PDOM by parsing an XML input stream
//
Document pdoc = new PDocument("mydoc.pdom");
DOMUtil.parseXML(
    new FileInputStream("valid_with_dtd.xml"),
    pdoc, // The Document's factory is used to create PDOM Nodes
    true, // Parse mode: true = validating, false = non-validating
    DOMUtil.SKIP_IGNORABLE_WHITESPACE // Whitespace treatment, see API docs
);
((PDocument) pdoc).commit(); // be sure to flush to disk
Querying a PDOM File with XQL
As a PDOM object implements the full W3C-DOM API, the XQL
engine can be used to query the persistent file. Special glue code is
included to automatically support the indexing algorithm used by the
XQL engine.
Example 1 demonstrates how to query a PDOM file with XQL. All
techniques from the examples in the XQL section may be applied. This
example uses an XMLWriter object to print the
query results to stdout.
Example 2 shows how the cache can be controlled at runtime.
import de.gmd.ipsi.pdom.*;
import de.gmd.ipsi.xql.*;
import de.gmd.ipsi.domutil.*;
import org.w3c.dom.Document;
//
// Example 1: Querying the PDOM
//
Document doc = new PDocument ("mydoc.pdom");
XQLQuery q = new XQLQuery( "//sometag" );
q.execute(doc, new XMLWriter(System.out));
//
// Example 2: Configuring the cache
//
// set cache size to 1000 node pages each containing 128 DOM nodes
PDOM.setCacheSize(1000);
// remove all node pages which are currently in cache
PDOM.clearCache();
Using the W3C-DOM API for Update Operations
This section contains a running example that demonstrates update
and insert operations using W3C-DOM methods.
Example 1 demonstrates how a PDOM is created from scratch and
filled programmatically with two nodes, a root element and some text.
Example 2 demonstrates how a PDOM is re-opened and traversed
using read-only methods. Note that the file is only closed after
the access, not committed, because there
are no changes that need to be committed.
Example 3 performs update, insert and delete operations.
Afterwards the PDOM file will have grown. This is detected with
the getSlack method, and a defragmentation is started. Finally,
all changes including the defragmentation are committed.
Example 4: Because we deleted an element in example 3, there
is still wasted space in the PDOM file. To get rid of it, we do a
full garbage collection. First, the doc object is set to null
so that the virtual machine can garbage collect it. In addition, the
cache is cleared to make sure we do not operate on in-memory data.
There are still dangling objects in scope, e.g. t and e from example 3.
If these objects are reused after the garbage collection has taken place,
the engine will probably crash and garble your file. Don't do this.
import de.gmd.ipsi.pdom.*;
import de.gmd.ipsi.domutil.*;
import org.w3c.dom.*;
//
// Example 1: Programmatic creation of Nodes
//
// Assuming "newdoc.pdom" does not exist,
// create a new PDOM file. The variable is typed as PDocument because
// commit(), close(), getSlack() and defragment() are not W3C-DOM methods.
PDocument doc = new PDocument("newdoc.pdom");
// Now insert Nodes
Element e = doc.createElement("root");
e.appendChild(doc.createTextNode("Hello World"));
doc.appendChild(e);
// Synchronize PDOM file and in-memory buffers
doc.commit();
// drop the object reference to be sure we really re-open
// the document.
doc = null;
//
// Example 2: Reading existing PDOM files
//
// re-open the document we just created
doc = new PDocument("newdoc.pdom");
// traverse the DOM tree
Node n = doc.getFirstChild();
while (n != null) {
    System.out.println(n.getNodeName());
    n = n.getNextSibling();
}
// Just close the PDOM file. As we did no update
// operations, no full commit is necessary.
doc.close();
//
// Example 3: Update exercises
//
// reopen the document once again
doc = new PDocument("newdoc.pdom");
// Get the TextNode containing "Hello World" and update it
Text t = (Text) doc.getFirstChild().getFirstChild();
t.setData("Back Again");
// Create and delete a node (e was already declared in example 1)
e = doc.createElement("garbage");
doc.getFirstChild().appendChild(e);
doc.getFirstChild().removeChild(e);
// a slack of 0.2 means 20% wasted file space
if (doc.getSlack() > 0.20) {
    doc.defragment();
} else {
    doc.commit();
}
//
// Example 4: Full Garbage Collection
//
// drop the object reference; no PDocument may be bound to the file
doc = null;
PDOM.clearCache();
// ... and do the garbage collection
PDOM.collectDOMFileGarbage("newdoc.pdom");
Using the XQL/PDOM Servlet
The XQL/PDOM Servlet gives access to collections of PDOM files via
plain HTTP connections. It interprets the PATH_INFO
part of the submitted URL to select the documents to query, and the
QUERY_STRING part to transport an XQL query. Results are
returned as text/xml, suited for display in XML-aware browsers
or for further processing.
The Servlet requires JSDK 2.0 or better and has been tested successfully
with Apache Jserv and W3C-Jigsaw.
To set up an XQL/PDOM based service, add the JAR file from this
distribution to the CLASSPATH of your Servlet-enabled HTTP server.
The class you need to register with the server is
de.gmd.ipsi.pdom.Servlet. You also need to register a single
property called properties, pointing to a file containing the
configuration. Example configuration files for Apache/Jserv1.0b3 are included in the
samples/ directory of this distribution. The following listing shows a typical
configuration.
# Example XQL/PDOM Servlet configuration file
##
# file - set alias and path for PDOM files served.
#
# Value: Absolute paths of PDOM files we will serve
# Note: The last component after the "pdom.file" prefix will
# be used as an alias for that document in URL encoded
# queries.
# Note: On Windows9x/NT, backslashes in path names must be doubled
##
pdom.file.william = d:\\webserver\\pdom\\william.pdom
pdom.file.darkness = d:\\webserver\\pdom\\darkness.pdom
pdom.file.ot = d:\\webserver\\pdom\\ot.pdom
pdom.file.nt = d:\\webserver\\pdom\\nt.pdom
pdom.file.bom = d:\\webserver\\pdom\\bom.pdom
pdom.file.quran = d:\\webserver\\pdom\\quran.pdom
##
# Options for XML output
##
# output.indent - Formatting of XML output.
# Values: true => prettyprint XML output
# false => print XML as in PDOM file
# Default: false
pdom.output.indent = true
# output.encoding - Encoding of XML output.
# Determine the encoding of the XML documents returned by the Servlet
# Values: Any valid MIME encoding
# Default: UTF-8
pdom.output.encoding = ISO-8859-1
##
# Options for PDOM engine
##
# cache - Set cachesize for PDOM engine (in pages of 128 DOM nodes)
# The cache is shared by all open documents
# Values: Integers between 20 and MAXINT (given you have that much memory)
# Default: 100
pdom.cache = 555
# keepfilesopen - Determine when PDOM-files are opened and closed.
# Keeping all files open (true) results in faster responses, opening on
# demand only (false) saves memory.
# Default: true
pdom.keepfilesopen = true
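The doubled backslashes match the escaping rules of java.util.Properties. The sketch below assumes that the servlet reads its configuration with that class, which the syntax suggests but this document does not state. The class PdomServletConfigDemo and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch; assumes the file is read with java.util.Properties.
class PdomServletConfigDemo {
    static final String FILE_PREFIX = "pdom.file.";

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Maps each alias (the key part after "pdom.file.") to its path. */
    static Map<String, String> aliases(Properties props) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(FILE_PREFIX)) {
                result.put(key.substring(FILE_PREFIX.length()),
                           props.getProperty(key));
            }
        }
        return result;
    }

    /** Cache size in pages; falls back to the documented default of 100. */
    static int cachePages(Properties props) {
        return Integer.parseInt(props.getProperty("pdom.cache", "100").trim());
    }

    public static void main(String[] args) {
        // Java string escaping doubles each backslash once more;
        // the text seen by Properties is d:\\webserver\\pdom\\william.pdom
        String config =
              "pdom.file.william = d:\\\\webserver\\\\pdom\\\\william.pdom\n"
            + "pdom.file.nt = d:\\\\webserver\\\\pdom\\\\nt.pdom\n"
            + "pdom.cache = 555\n";
        Properties props = parse(config);
        System.out.println(aliases(props).get("william")); // prints d:\webserver\pdom\william.pdom
        System.out.println(cachePages(props));              // prints 555
    }
}
```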
Let's assume you mapped the servlet into the servlet zone alias foo, using
the above configuration. Three types of requests are possible:
- To get the servlet status, simply submit an empty query:
http://myhost.com/foo?
- To query a single document (e.g. william), add the alias name between the servlet name and the query part:
http://myhost.com/foo/william?//PLAY/TITLE
- To query multiple documents (e.g. william and nt), add the alias names as a list separated by the
pipe symbol ("|") before the query:
http://myhost.com/foo/william|nt?//PLAY/TITLE $union$ //bktlong
The output will be MIME-typed as text/xml with no stylesheet, so currently only MS-IE5 will give
you a nice collapsible tree view. Older browsers will most likely display garbage or ask you which external
viewer to spawn.
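Client programs that build such URLs have to deal with the spaces and special characters in XQL queries. The sketch below percent-encodes the query part. It assumes that the servlet, or the server in front of it, decodes percent escapes in the query string; the configuration comments mention URL encoded queries, but the decoding rules are not spelled out in this document. The class PdomQueryUrl is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; assumes percent escapes in the query are decoded by the server side.
class PdomQueryUrl {

    /** Builds servletUrl[/alias|alias...]?encodedQuery. */
    static String build(String servletUrl, String query, String... aliases) {
        StringBuilder url = new StringBuilder(servletUrl);
        if (aliases.length > 0) {
            // The pipe is kept literally, as in the example URLs of this document.
            url.append('/').append(String.join("|", aliases));
        }
        url.append('?').append(encode(query));
        return url.toString();
    }

    /** Percent-encodes the query; spaces become %20 instead of '+'. */
    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Status request: empty query, no alias
        System.out.println(build("http://myhost.com/foo", ""));
        // Single document
        System.out.println(build("http://myhost.com/foo", "//PLAY/TITLE", "william"));
        // Multiple documents, query containing spaces and '$'
        System.out.println(build("http://myhost.com/foo",
            "//PLAY/TITLE $union$ //bktlong", "william", "nt"));
    }
}
```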